Friday, May 4, 2012

How to write hybrid CPU/GPU programs with Haskell


What’s better than programming a GPU with a high-level, Haskell-embedded DSL (domain-specific language)? Well, perhaps writing portable CPU/GPU programs that utilize both pieces of silicon—with dynamic load-balancing between them—would fit the bill.

This is one of the heterogeneous programming scenarios supported by our new meta-par packages. A draft paper can be found here, which explains the mechanism for building parallel schedulers out of "mix-in" components. In this post, however, we will skip over that and take a look at CPU/GPU programming specifically.

This post assumes familiarity with the monad-par parallel programming library, which is described in this paper.

Getting Started

First, we install the just-released meta-par-accelerate package:
cabal install c2hs
cabal install meta-par-accelerate
If you have CUDA 4.1 installed, then just remove the "-f-cuda" flag (or remove the entire middle line). Then we import the following module:
import Control.Monad.Par.Meta.AccSMP
This provides a scheduler that combines Accelerate GPU-EDSL execution with monad-par multicore CPU scheduling. It also re-exports the ParAccelerate type class, which provides the ability to launch GPU computations from within a Par computation.

Next, we also import Accelerate itself so that we can express Acc computations that can run on the GPU:
import Data.Array.Accelerate
import Data.Array.Accelerate.CUDA as CUDA
(By the way, this blog post is an executable literate Haskell file that can be found here.)
Now we are ready to create a trivial Accelerate computation:
triv :: Acc (Scalar Float)
triv = let arr = generate (constant (Z :. (10::Int))) 
                          (\ i -> 3.3 )
       in fold (+) 0 arr
This creates a ten element one-dimensional array, and then sums up the elements. We could run this directly using CUDA, which would print out Array (Z) [33.0]. That’s just Accelerate’s fancy way of saying "33.0"—a scalar is a zero-dimensional array. Here’s how we invoke Accelerate/CUDA:
runDirect = print (CUDA.run triv)
If we are inside a Par computation, on the other hand, we simply use runAcc or runAccWith:
runWPar1   = print (runPar (runAcc triv))
runWPar2   = print (runPar (runAccWith CUDA.run triv))
The former uses the default Accelerate implementation. The latter specifies which Accelerate implementation to use. After all, there should ultimately be several: OpenCL, CUDA, plus various CPU backends. (In the future, we plan to add the ability to change the default Accelerate backend either at the runPar site, or statically. Stay tuned for that. But for now just use runAccWith.)

The reader might protest that it is possible to use CUDA.run directly within a Par computation, so why go to the extra trouble? The advantage of using runAcc is that it informs the Par scheduler of what’s going on. The scheduler can therefore execute other work on the CPU core that would otherwise be waiting for the GPU. An application could achieve the same effect by creating a dedicated thread to talk to the GPU, but that wouldn’t jibe well with a pure computation (forkIO lives in IO), and it’s easier to let meta-par handle it anyway.
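To make that concrete, here is a small sketch (not taken from the package documentation) that overlaps CPU work with a GPU call. It reuses the names introduced above (runAccWith, CUDA.run, triv, indexArray) together with the spawn_ and get combinators used later in this post; cpuWork is a made-up stand-in for ordinary CPU-side Par work:
overlapped :: Par Float
overlapped = do
  -- Launch the GPU computation; the scheduler knows this core is free meanwhile.
  gpuRes <- spawn_ (fmap (`indexArray` Z) (runAccWith CUDA.run triv))
  -- Ordinary Par work keeps the CPU cores busy while the GPU churns.
  cpuRes <- cpuWork
  g <- get gpuRes
  return (g + cpuRes)
  where
    cpuWork :: Par Float
    cpuWork = return (sum [1.0 .. 1000.0])   -- stand-in for real CPU work
runOverlapped = print (runPar overlapped)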

The second benefit of integrated scheduling is that the scheduler can automatically divide work between the CPU and GPU. Eventually, when there are full-featured, efficient CPU backends for Accelerate, this will happen transparently. For now you need to use unsafeHybridWith, described in the next section. Finally, our soon-forthcoming CPU/GPU/Distributed schedulers can make more intelligent decisions if they know where all the calls to GPU computations occur.

Hybrid CPU/GPU Workloads

The meta-par and meta-par-accelerate packages, as currently released, include a generalized work-stealing infrastructure. The relevant point for our purposes here is that the CPU and GPU can each steal work from one another. Work-stealing is by no means the most sophisticated CPU/GPU partitioning strategy on the scene. Much literature has been written on the subject, and the published techniques can get quite elaborate (for example, modeling memory transfer time). However, as on regular multicores, work-stealing provides an admirable combination of simplicity and efficacy. For example, if a given program runs much better on the CPU or on the GPU, then that device will end up doing more of the work.

In the current release, we use unsafeHybridWith, documented here, to spawn a task with two separate implementations—one CPU and one GPU—leaving the scheduler to choose between them. Here’s a silly example:
hybrid :: Par (IVar Float)
hybrid = unsafeHybridWith CUDA.run (`indexArray` Z)
                          (return 33.0, triv)
runHybrid = print (runPar (hybrid >>= get))
The call to unsafeHybridWith is passed a task that consists of a separate CPU (return 33.0) and GPU (triv) component. We must guarantee that the two computations yield the same result (hence the "unsafe" moniker). The indexArray bit is a conversion function that converts the result from the GPU computation to a type that matches the result from the CPU alternative.

Typically, you would write a recursive algorithm like the following:
-- | Recursive mergesort that bottoms out to CPU or GPU:
parSort :: Vector Int -> Par (Vector Int)
parSort vec =
  if length vec <= threshold
  then unsafeHybridWith CUDA.run undefined  -- conversion function omitted here
                        (cpuSort vec, gpuSort vec) >>= get
  else let n      = (length vec) `div` 2
           (l, r) = splitAt n vec
       in do lf <- spawn_ (parSort l)
             r' <-         parSort r
             l' <- get lf
             parMerge l' r'
parMerge :: Vector Int -> Vector Int -> Par (Vector Int)
cpuSort :: Vector Int -> Par (Vector Int)
gpuSort :: Vector Int -> Acc (Array DIM1 Int)
threshold :: Int
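Those last four pieces are left abstract above. For concreteness, here is one way the CPU side might be filled in. This is only a sketch: it assumes Data.Vector for the Vector type and the vector-algorithms package for the in-place sort, uses qualified imports to avoid clashing with the Accelerate names imported earlier, repeats the signatures so the sketch stands alone, and omits gpuSort (a GPU sort would need a full Accelerate sorting-network kernel). The threshold is just a placeholder to be tuned by experiment:
import qualified Data.Vector                  as V
import qualified Data.Vector.Algorithms.Intro as Intro
import           Control.Monad.ST (runST)

threshold :: Int
threshold = 4096                  -- granularity knob; tune by experiment

-- Pure, sequential, in-place introsort on the CPU.
vsort :: V.Vector Int -> V.Vector Int
vsort v = runST (do m <- V.thaw v       -- copy into a mutable vector
                    Intro.sort m        -- sort it in place
                    V.freeze m)

cpuSort :: V.Vector Int -> Par (V.Vector Int)
cpuSort = return . vsort

-- Merge two sorted vectors (sequentially, via lists, for brevity).
parMerge :: V.Vector Int -> V.Vector Int -> Par (V.Vector Int)
parMerge l r = return (V.fromList (merge (V.toList l) (V.toList r)))
  where
    merge xs []         = xs
    merge [] ys         = ys
    merge (x:xs) (y:ys)
      | x <= y    = x : merge xs (y:ys)
      | otherwise = y : merge (x:xs) ys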
The user of this API still has to make some tough decisions. Setting the threshold determines the granularity at which work will be farmed out between the CPU and GPU. To achieve effective load balance, the problem must be subdivided into sufficiently small pieces to deal with the slower rate at which the CPU may complete work items. Further, in this particular example the GPU may not have enough memory to process the entire array, meaning that the problem must be subdivided at least until it fits in memory.

On the other hand, subdividing the problem further means more round trips to the GPU, more driver calls, and more synchronization. In this particular example the total volume of data transferred is the same (it’s linear), but for other algorithms that may not be the case.

Notes for Hackers

If you want to work with the github repositories, you need to have GHC 7.4 and the latest cabal-install (0.14.0). You can check everything out here:
git clone git://github.com/simonmar/monad-par.git --recursive
Try make mega-install-gpu if you already have CUDA installed on your machine. Explore the READMEs here and here for more information.

Nightly Testing

The GPU-programming portions of meta-par require the latest and greatest GHC (7.4). Their nightly testing results are on a Jenkins instance here. The Jenkins setup tests Linux and Windows. We test Mac OS regularly on our laptops, but don’t yet have a Jenkins instance for it.

The core meta-par package and the original monad-par package do not depend on any recent GHC additions. Nightly testing results for those packages, on GHC 6.12, 7.0, and 7.2, can be found here.

Because Hackage can’t always build documentation for packages with foreign dependencies, here are some documentation links for the relevant packages:

Thursday, April 26, 2012

Dropbox wiki gone -- Why we little people must Clone the Cloud

Clone the cloud.  It's better for everyone.

With github or with Google Drive this happens transparently.  The user has a full copy of what's on the server.  Github even does this for wiki and web page data, which is great, and some people create static HTML blogs using github and tools like Jekyll.


But unfortunately, most text that's poured into comments, forums, wikis, and blogs seems to have a dubious shelf life.  Case in point: Dropbox took down their wiki a while back, making this link in one of my previous posts dead.  Poof.  In this particular case I kept a copy in the text file I pasted it from, so I am republishing it below.

"Data liberation" doesn't quite fix the problem, either.  Some companies are better than others about providing optional downloads of your data.  Google, in particular, is very good.  But laborious opt-in downloads (that most people don't use) aren't good enough.  Always-on synchronization or mirroring is called for.

I'm optimistic.  With the mass movement towards synchronization-based approaches (SkyDrive, iCloud, Google Drive, etc.) and VCS hosting (github, bitbucket), I hope we are moving in the right direction.  And why not?  Everyone wins:
  • For the cloud service provider it means an extra backup copy that reduces the risk of angry customers if a software failure (or more likely, account security breach) destroys data.  It also provides local caching, which always makes good engineering sense.
  • For the user, you get to access your data on the plane and know that it will live on when the company in question decides to retire the cloud service you've been using.


Monday, March 26, 2012

Potential GSoC: Tweak memory-reuse analysis tools for GHC compatibility

Some program instrumentation and analysis tools are language agnostic. Pin and Valgrind use binary rewriting to instrument an x86 binary on the fly, and thus in theory could be used just as well for a Haskell binary as for one compiled from C. Indeed, if you download Pin from pintool.org, you can use the included open source tools to immediately begin analyzing properties of Haskell workloads -- for example, the total instruction mix during execution.

The problem is that aggregate data for an entire execution is rather coarse. It's not correlated temporally with phases of program execution, nor are specific measured phenomena related to anything in the Haskell source.

This could be improved. A simple example would be to measure memory-reuse distance (an architecture-independent characterization of locality) but to distinguish garbage collection from normal memory access. It would be quite nice to see a histogram of reuse-distances in which GC accesses appear as a separate layer (different color) from normal accesses.

How to go about this? Fortunately, the existing MICA pintool (v0.4) builds today and measures memory-reuse distances. In fact, it already produces per-phase measurements where phases are delimited by dynamic instruction counts (e.g. every 100M instructions). All that remains is to tweak that definition of phase to transition when GC switches on or off.

How to do that? Well, Pin has existing methods for targeted instrumentation of specific C functions. By targeting appropriate functions in the GHC RTS, this analysis tool could probably work without requiring any GHC modification at all.

A further-out goal would be to correlate events observed by the binary rewriting tool with those recorded by GHC's traceEvent.
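For reference, emitting such events from the Haskell side is already straightforward. Here is a minimal sketch (assuming Debug.Trace.traceEventIO from a reasonably recent base; compile with -eventlog and run with +RTS -l to produce the eventlog):
import Debug.Trace (traceEventIO)

-- Bracket an IO action with START/STOP markers in the GHC eventlog.
phase :: String -> IO a -> IO a
phase name act = do
  traceEventIO ("START " ++ name)
  r <- act
  traceEventIO ("STOP " ++ name)
  return r

main :: IO ()
main = phase "big-sum" (print (sum [1 .. 1000000 :: Int]))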

Finally, as it turns out, this would NOT be the first crossing of paths between GHC and binary rewriting. Julian Seward worked on GHC before developing Valgrind:


Wednesday, February 8, 2012

Potential GSoC: Haskell Lock-free Data Structure Implementations

The GHC Haskell compiler recently gained the capability to generate atomic compare-and-swap (CAS) assembly instructions. This opens up a new world of data-structure implementation possibilities.

Furthermore, it's an important time for concurrent data structures. Not only is the need great, but the design of concurrent data structures has been a very active area in recent years, as summarized well by Nir Shavit in this article.

Because Haskell objects containing pointers can't efficiently be stored outside the Haskell heap, it is necessary to reimplement these data structures for Haskell, rather than use the FFI to access external implementations. There are already a couple of data structures implemented in the following library (queues and deques):


But, this leaves many others, such as:
  • Concurrent Bags
  • Concurrent Hashtables
  • Concurrent Priority Queues
A good point of reference would be the libcds collection of concurrent data structures for C++ (or those that come with Java or .NET):

One of the things that makes implementing these data structures fun is that they have very short algorithmic descriptions but a high density of thought-provoking complexity.
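To give a flavor of that, here is a minimal sketch of a Treiber-style lock-free stack. For brevity it uses atomicModifyIORef, which GHC implements with a CAS retry loop; a real library implementation would use the new CAS primop on mutable variables directly, and would force pushed values to avoid building up thunks:
import Data.IORef

-- The whole algorithm is a compare-and-swap loop on the head pointer.
newtype Stack a = Stack (IORef [a])

newStack :: IO (Stack a)
newStack = fmap Stack (newIORef [])

push :: Stack a -> a -> IO ()
push (Stack ref) x = atomicModifyIORef ref (\xs -> (x : xs, ()))

pop :: Stack a -> IO (Maybe a)
pop (Stack ref) = atomicModifyIORef ref pop'
  where
    pop' []       = ([], Nothing)
    pop' (y : ys) = (ys, Just y)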

A good GSoC project would be to implement 2-3 data structures from the literature, and benchmark them against libcds implementations.

UPDATE: Here's a link to the Trac ticket.