Friday, May 4, 2012

How to write hybrid CPU/GPU programs with Haskell


What’s better than programming a GPU with a high-level, Haskell-embedded DSL (domain-specific-language)? Well, perhaps writing portable CPU/GPU programs that utilize both pieces of silicon—with dynamic load-balancing between them—would fit the bill.

This is one of the heterogeneous programming scenarios supported by our new meta-par packages. A draft paper can be found here, which explains the mechanism for building parallel schedulers out of "mix-in" components. In this post, however, we will skip over that and take a look at CPU/GPU programming specifically.

This post assumes familiarity with the monad-par parallel programming library, which is described in this paper.

Getting Started

First, we install the just-released meta-par-accelerate package:
cabal install c2hs
cabal install meta-par-accelerate
If you have CUDA 4.1 installed, then just remove the "-f-cuda" (or remove the entire middle line).  And then we import the following module:
import Control.Monad.Par.Meta.AccSMP
This provides a scheduler for (Accelerate GPU-EDSL) + (monad-par multicore CPU) scheduling: It also reexports the ParAccelerate type class which provides the ability to launch GPU computations from within a Par computation.

Next, we also import Accelerate itself to so that we can express Acc computations that can run on the GPU:
import Data.Array.Accelerate
import Data.Array.Accelerate.CUDA as CUDA
(By the way, this blog post is an executable literate Haskell file that can be found here.)
Now we are ready to create a trivial Accelerate computation:
triv :: Acc (Scalar Float)
triv = let arr = generate (constant (Z :. (10::Int))) 
                          (\ i -> 3.3 )
       in fold (+) 0 arr
This creates a ten element one-dimensional array, and then sums up the elements. We could run this directly using CUDA, which would print out Array (Z) [33.0]. That’s just Accelerate’s fancy way of saying "33.0"—a scalar is a zero-dimensional array. Here’s how we invoke Accelerate/CUDA:
runDirect = print (CUDA.run triv)
If we are inside a Par computation, on the other hand, we simply use runACC or runAccWith:
runWPar1   = print (runPar (runAcc triv))
runWPar2   = print (runPar (runAccWith CUDA.run triv))
The former uses the default Accelerate implementation. The latter specifies which Accelerate implementation to use. After all, there should ultimately be several: OpenCL, CUDA, plus various CPU backends. (In the future, we plan to add the ability to change the default Accelerate backend either at the runPar site, or statically. Stay tuned for that. But for now just use runAccWith.)

The reader might protest that it is possible to use CUDA.run directly within a Par computation, so why go to the extra trouble? The advantage of using runAcc is that it informs the Par scheduler of what’s going on. The scheduler can therefore execute other work on the CPU core that would otherwise be waiting for the GPU. An application could achieve the same effect by creating a dedicated thread to talk to the GPU, but that wouldn’t jive well with a pure computation (forkIO), and it’s easier to let meta-par handle it anyway.

The second benefit of integrated scheduling is that the scheduler can automatically divide work between the CPU and GPU. Eventually, when there are full-featured, efficient CPU-backends for Accelerate, this will happen transparently. For now you need to use unsafeHybrid described in the next section. Finally, our soon-forthcoming CPU/GPU/Distributed schedulers can make more intelligent decisions if they know where all the calls to GPU computations occur.

Hybrid CPU/GPU workloads.

The meta-par and meta-par-accelerate packages, as currently released, include a generalized work-stealing infrastructure. The relevant point for our purposes here, is that the CPU and GPU can each steal work from one another. Work-stealing is by no means the most sophisticated CPU/GPU partitioning on the scene. Much literature has been written on the subject, and it can get quite sophisticated (for example, modeling memory transfer time). However, as on regular multicores, work-stealing provides an admirable combination of simplicity and efficacy. For example, if a given program runs much better on the CPU or the GPU, respectively, then that device will end up doing more of the work.

In the current release, we use unsafeHybridWith, documented here, to spawn a task with two separate implementations—one CPU and one GPU—leaving the scheduler to choose between them. Here’s a silly example:
hybrid :: Par (IVar Float)
hybrid = unsafeHybridWith CUDA.run (`indexArray` Z                          (return 33.0, triv)
runHybrid = print (runPar (hybrid >>= get))
The call to unsafeHybridWith is passed a task that consists of a separate CPU (return 33.0) and GPU (triv) component. We must guarantee that the two computations yield the same result (hence the "unsafe" moniker). The indexArray bit is a conversion function that converts the result from the GPU computation to a type that matches the result from the CPU alternative.

Typically, you would write a recursive algorithm like the following:
-- | Recursive mergesort that bottoms out to CPU or GPU:
parSort :: Vector Int -> Par (Vector Int)
parSort vec =
  if length vec <= threshold
  then unsafeHybridWith CUDA.run undefined
                        (cpuSort vec, gpuSort vec) >>= get
  else let n      = (length vec) `div` 2
           (l, r) = splitAt n vec
       in do lf <- spawn_ (parSort l)
             r' <-         parSort r
             l' <- get lf
             parMerge l' r'
parMerge :: Vector Int -> Vector Int -> Par (Vector Int)
cpuSort :: Vector Int -> Par (Vector Int)
gpuSort :: Vector Int -> Acc (Array DIM1 Int)
threshold :: Int
The user of this API still has to make some tough decisions. Setting the threshold determines the granularity at which work will be farmed out between the CPU and GPU. To achieve effective load balance, the problem must be subdivided into sufficiently small pieces to deal with the slower rate at which the CPU may complete work items. Further, in this particular example the GPU may not have enough memory to process the entire array, meaning that the problem must be subdivided at least until it fits in memory.

On the other hand, subdividing the problem further means more round trips through the GPU, more driver calls, synchronization. In this particular example the total volume of data transferred is the same (it’s linear), but for other algorithms that may not be the case.

Notes for Hackers

If you want to work with the github repositories, you need to have GHC 7.4 and the latest cabal-install (0.14.0). You can check everything out here:
git clone git://github.com/simonmar/monad-par.git --recursive
Try make mega-install-gpu if you already have CUDA installed on your machine. Explore the README’s here and here for more information.

Nightly Testing

The GPU-programming portions of meta-par require the latest and greatest GHC (7.4). Their nightly testing results are on a Jenkins instance here. The Jenkins setup tests Linux and Windows. Mac OS we test regularly on our laptops, but don’t have a Jenkins instance for yet.

The core meta-par package and the original monad-par package do not depend on any recent GHC additions. Nightly testing results for those packages, on GHC 6.12, 7.0, and 7.2, can be found here.

Because Hackage can’t always build documentation for packages with foreign dependencies, here are some documentation links for the relevant packages:

2 comments: