DPH is a work-in-progress, and not everything is yet easy to realize. But you should still prepare yourself as a Haskell programmer, because this is really where Haskell can set off into uncharted areas of popularity and innovation.
As an exercise for a programming course, I developed a spell checker, that runs in parallel and was motivated to write this short howto on getting to grips with DPH. First of all, DPH comes as a builtin library with GHC 6.10.1+, so it's easy to get started. This howto is compatible with 6.10.1. But the DPH team has, as of March 2009, warned that significant fixes are added in the GHC trunk, and you should run the following command before installing (ie. recompiling GHC), so I advice you to do so for your own benefit.
$ ./darcs-all --dph get
Now, what you need to know is: DPH consists of two things, a compiler module that transforms your code for parallelism ('vectorization') and some special libraries that you should use in your code, as they are compatible with the vectorizer.
DPH can help you achieve two kinds of parallelism: Normal (unlifted) data parallelism and nested data parallelism. What the vectorizer works on is the latter, and for simple one-dimentional parallelism, you don't need to know much, so let's just get that done in a few lines:
module SpellCheckUnlifted where
import Types
import Trie
import Data.Array.Parallel.Unlifted
import GHC.PArr
type DocumentU = Array Word
spellCheck :: Dictionary -> Word -> Bool
spellCheck dic w = lookupTrie dic w
spellCheckU :: Dictionary -> Document -> [Word]
spellCheckU dic doc =
fromP ( filterP (\s -> not (spellCheck dic s)) (toP doc) )
As you can see, it's fairly simple: Import Data.Array.Parallel.Unlifted and GHC.PArr. Then use functions toP and fromP to transform a list into a parallel array and back. The filterP function will execute functions in parallel on your array.
Download all source code of spell checker example.
Nested parallel example
The strong feature of DPH is nested parallelism, and you should think your problem through before engaging in this. In my approach, I start out with a non-parallel problem, and I think this is probably the classic example; we model our problems non-parallel and then consider how to transform them. My example is not suitable for nested parallelism, as it obviously has too little complexity and too much overhead for parallelism, but nonetheless there's some interesting points to be gained:
The problem set (a list of words) is easily transformed
The sub-problems (looking up in a Trie data structure) is nested
The data types are non-trivial and need some convertion
{-# LANGUAGE PArr #-}
{-# OPTIONS -fvectorise #-}
module SpellCheckDPH where
import qualified Prelude
import Data.Array.Parallel.Prelude
import Data.Array.Parallel.Prelude.Word8
import qualified Data.Array.Parallel.Prelude.Int as I
type WordP = [:Word8:]
type DocumentP = [:WordP:]
type DictionaryP = [:TrieP:]
data TrieP = NodeP Word8 DictionaryP | LeafP Word8
isNode :: Word8 -> TrieP -> Bool
isNode w t = case t of
(NodeP w' dic) -> w == w'
_ -> False
isLeaf :: Word8 -> TrieP -> Bool
isLeaf w t = case t of
(LeafP w') -> w == w'
_ -> False
lookupTrieP :: DictionaryP -> I.Int -> WordP -> Bool
lookupTrieP dic i w =
let len = lengthP w
in if (I.<) ((I.+) i 1) len
then let node = filterP (isNode (w!:i)) dic
in if (I.==) (lengthP node) 0
then False
else case node!:0 of
(NodeP _ d) -> lookupTrieP d ((I.+) i 1) w
_ -> False
else (I.==) (lengthP (filterP (isLeaf (w!:i)) dic)) 1
spellCheckP :: DictionaryP -> DocumentP -> DocumentP
spellCheckP dic doc = filterP (\s -> not (lookupTrieP dic 0 s)) doc
Notice the first two lines. These tell GHC to add some syntax to handle the parallel arrays ([:a:]) and to let the vectorizer handle the code. Secondly, we import the DPH prelude and hide the normal Prelude. You should keep normal GHC library functions separate from DPH! spellCheckP does not convert a normal datatype into a parallel array, because we don't want anything that cannot be vectorized in this module.
lookupTrieP is not a pretty function, but it has to work with arrays and a limited number of utility functions in the DPH Prelude. And BEWARE! Not all functions in GHC.Parr have been implemented in the DPH Prelude, which is because they simply don't work with nested parallel arrays, yet, so you simly shouldn't import this module.
You'll also notice how I avoided polymorphism. Since DPH only supports Ints, Doubles and Word8 (unsigned 8-bit int), it isn't of any use. When you transform your datatypes into nested parallel types, you will loose generality. But you can still use data types with different constructors etc. And that's the power of DPH!
Now we only have to transform our nested data structure and problem into parallel arrays. This is done in a seperate module (SpellCheck.hs in the example).
c2w :: Char -> W.Word8
c2w c = fromIntegral (ord c)
w2c :: W.Word8 -> Char
w2c w = chr (fromIntegral w)
wArr2Str :: [W.Word8] -> String
wArr2Str wArr = map w2c wArr
t2P :: Trie Char -> TrieP
t2P (Leaf c) = LeafP (c2w c)
t2P (Node c d) = NodeP (c2w c) (dic2P d)
dic2P :: Dictionary -> DictionaryP
dic2P ds = toP $ map t2P ds
doc2P :: Document -> DocumentP
doc2P doc = let doc' = filter (\s -> not $ s=="") doc
in toP $ map (toP.(\s -> map c2w s)) doc'
Download all source code of spell checker example.
Compiling and testing
When you compile, you have to use '-fdph-par' for parallel libraries and '-fpar-seq' for a similar, but sequential library. Also, you have to add '-Odph' to instruct GHC to use the parallel compiler module, '-threaded' because we're using threads and '-fcpr-off' to turn off the CPR optimizer (a trade-off, but I don't know how severe).
$ ghc --make -fdph-par -Odph -threaded SpellCheck
Now, let's test it and see the results. To run with to cores, we add '+RTS -N2 -RTS', and to get some diagnostic output, we add '+RTS -N2 -sstderr -RTS'. My results are not very good, because I only have one core :)
$ ./SpellCheck +RTS -N2 -sstderr -RTS /usr/share/dict/words document.txt 1
1,024,145,348 bytes allocated in the heap
580,407,632 bytes copied during GC
33,537,816 bytes maximum residency (16 sample(s))
15,817,400 bytes maximum slop
97 MB total memory in use (1 MB lost due to fragmentation)
Generation 0: 1945 collections, 25 parallel, 3.92s, 4.47s elapsed
Generation 1: 16 collections, 13 parallel, 1.98s, 2.38s elapsed
Parallel GC work balance: 1.57 (79357576 / 50670026, ideal 2)
Task 0 (worker) : MUT time: 7.88s ( 2.43s elapsed)
GC time: 0.00s ( 0.00s elapsed)
Task 1 (worker) : MUT time: 7.88s ( 2.47s elapsed)
GC time: 0.00s ( 0.00s elapsed)
Task 2 (worker) : MUT time: 1.99s ( 2.47s elapsed)
GC time: 5.90s ( 6.85s elapsed)
Task 3 (worker) : MUT time: 7.88s ( 2.47s elapsed)
GC time: 0.00s ( 0.00s elapsed)
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.99s ( 2.47s elapsed)
GC time 5.90s ( 6.85s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 7.88s ( 9.32s elapsed)
%GC time 74.8% (73.5% elapsed)
Alloc rate 515,130,747 bytes per MUT second
Productivity 25.2% of total user, 21.3% of total elapsed