Going parallel and larger-than-memory with graphs
02:30 PM - 03:25 PM on August 16, 2015, Room 701
- Audience level:
Dask is an open source, pure Python library that enables parallel, larger-than-memory computation in a novel way. We represent programs as directed acyclic graphs (DAGs) of function calls. These graphs are executed by dask's schedulers, each with different trade-offs (synchronous, threaded, multiprocessing, distributed-memory). Dask has modules geared towards data analysis, which provide a friendly interface for building graphs. One module, dask.array, mimics a subset of NumPy operations. With dask.array we can work with NumPy-like arrays that are larger than RAM, and parallelization comes for free by leveraging the underlying DAG.
We lay out the core dask components: directed acyclic graphs (DAGs), schedulers, and collections. We then demonstrate dask.array with benchmarks and comparisons to NumPy.
Tasks are Python tuples whose first item is a callable and whose remaining items (zero or more) are its arguments. An argument can itself be a task (Lisp users will recall S-expressions). Since a task can depend on other tasks, we can consider each task a vertex in a DAG.
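To make the tuple convention concrete, here is a toy graph and a minimal recursive evaluator. This is an illustration of the idea only, not dask's actual scheduler; the `get` and `_eval` helpers are hypothetical names.

```python
from operator import add, mul

# A dask-style graph: each key maps to a literal value or to a task
# tuple whose first element is a callable and the rest are arguments.
dsk = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),              # z = x + y
    "w": (mul, "z", (add, "x", 10)),   # arguments may themselves be tasks
}

def get(dsk, key):
    """Toy recursive evaluator -- not dask's real scheduler."""
    return _eval(dsk, dsk[key])

def _eval(dsk, val):
    if isinstance(val, tuple) and callable(val[0]):
        # A task: evaluate its arguments, then call the function.
        return val[0](*[_eval(dsk, arg) for arg in val[1:]])
    if isinstance(val, str) and val in dsk:
        # A reference to another key in the graph.
        return _eval(dsk, dsk[val])
    return val  # a plain literal

print(get(dsk, "w"))  # → 33
```

Because the graph is just a dict of tuples, it can be inspected, optimized, or handed to different executors before anything runs.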
To "execute" a DAG we want to visit every vertex, respecting dependencies. Inspecting a non-trivial graph shows there are several ways to traverse it: breadth-first, depth-first, in parallel, and so on. Dask therefore includes schedulers, which are functions for executing DAGs. The implemented schedulers are: synchronous, threaded, multiprocessing, and distributed-memory.
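As a sketch of what a threaded scheduler does (a simplification for illustration, not dask's implementation), one can repeatedly find the tasks whose dependencies are already computed and run that batch concurrently:

```python
from concurrent.futures import ThreadPoolExecutor
from operator import add

# Same tuple-task convention: a key maps to a literal or a task tuple.
dsk = {
    "a": 1,
    "b": 2,
    "s": (add, "a", "b"),
    "t": (add, "a", 10),
    "u": (add, "s", "t"),  # depends on two independent tasks
}

def dependencies(val):
    """Keys that a graph value depends on."""
    if isinstance(val, tuple) and callable(val[0]):
        return {d for arg in val[1:] for d in dependencies(arg)}
    return {val} if isinstance(val, str) else set()

def threaded_get(dsk, key, workers=4):
    """Toy threaded scheduler: run each batch of ready tasks in parallel."""
    done = {}
    remaining = dict(dsk)
    with ThreadPoolExecutor(workers) as pool:
        while remaining:
            # All tasks whose dependencies have already been computed.
            ready = [k for k, v in remaining.items()
                     if dependencies(v) <= done.keys()]

            def run(k):
                v = remaining[k]
                if isinstance(v, tuple) and callable(v[0]):
                    # Flat arguments only in this sketch.
                    args = [done[a] if isinstance(a, str) else a
                            for a in v[1:]]
                    return v[0](*args)
                return done[v] if isinstance(v, str) else v

            results = list(pool.map(run, ready))  # "s" and "t" run concurrently
            for k, result in zip(ready, results):
                done[k] = result
                del remaining[k]
    return done[key]

print(threaded_get(dsk, "u"))  # → 14
```

The synchronous scheduler would run the same batches one task at a time, which is slower but far easier to debug.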
However, writing a DAG of tasks by hand is not fun and quickly becomes unmanageable. So dask provides several collections, each offering a friendly interface for creating dask graphs. We have dask.array as an answer to NumPy, dask.bag for semi-structured data (like JSON blobs), and dask.dataframe for pandas DataFrames.
With dask.array we can do things NumPy cannot, like work with arrays that are larger than RAM and map functions over an array in parallel. An obvious API difference is that creating a dask.array requires a "chunks" argument: the size of the blocks you want the array broken into, where each block should fit in memory. We demonstrate some motivating examples along with benchmarks.
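The chunking idea can be illustrated with plain NumPy (a sketch of the concept, not dask's machinery): split the array into blocks, reduce each block independently, and combine the partial results. In dask.array the same computation is expressed lazily, roughly as `dask.array.from_array(x, chunks=100_000).sum().compute()`.

```python
import numpy as np

# Blockwise reduction: each chunk fits in memory and is reduced
# independently; the partial results are then combined. dask.array
# builds a graph of exactly this kind of per-block operation, so the
# blocks can be processed in parallel or streamed from disk.
x = np.arange(1_000_000, dtype="int64")
chunk = 100_000  # size of each block

partial_sums = [x[i:i + chunk].sum() for i in range(0, len(x), chunk)]
total = sum(partial_sums)

print(total)  # → 499999500000
assert total == x.sum()
```

Because only one block needs to be resident at a time, the same pattern scales to arrays that never fit in RAM at once.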
However, dask.array only implements a subset of NumPy and is limited to operations whose output shapes are predictable, so data-dependent operations like numpy.argwhere are not supported.
You can find dask on GitHub: https://github.com/ContinuumIO/dask