Fast parallel code generation for data analytics frameworks. Developed at Stanford University.
Grizzly is a subset of the Pandas data analytics library integrated with Weld. Grizzly uses lazy evaluation to accelerate Pandas workloads by optimizing across individual operators.
Grizzly currently supports Weld-optimized versions of several commonly used operators, including:
You can install Grizzly via PyPI:
$ pip install grizzly
The rest of this page walks through a simple example of how to set up and use Grizzly in an application.
To get the data for this tutorial, run:
$ wget https://raw.githubusercontent.com/jvns/pandas-cookbook/master/data/311-service-requests.csv
First, import the Pandas library and Grizzly:
$ python
>>> import pandas as pd
>>> import grizzly.grizzly as gr
Grizzly depends on native Pandas for file I/O, so to read from a file, call Pandas’ read_csv function. For the purposes of this tutorial, let’s read from a CSV file called 311-service-requests.csv:
>>> na_values = ['NO CLUE', 'N/A', '0']
>>> raw_requests = pd.read_csv('311-service-requests.csv', na_values=na_values, dtype={'Incident Zip': str})
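To see what the na_values and dtype arguments do before running against the full file, here is a small sketch that uses native Pandas on an in-memory CSV. The sample rows are made up for illustration and only stand in for 311-service-requests.csv; the snippet assumes Python 3.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for 311-service-requests.csv.
csv_text = """Incident Zip,Complaint Type
10001,Noise
00000,Heating
N/A,Noise
NO CLUE,Parking
0,Heating
"""

na_values = ['NO CLUE', 'N/A', '0']
sample = pd.read_csv(io.StringIO(csv_text), na_values=na_values,
                     dtype={'Incident Zip': str})

zips = sample['Incident Zip']
# Cells that exactly match an entry in na_values are read as missing.
missing_count = int(zips.isna().sum())
# dtype=str keeps the remaining zip codes as strings, so the leading
# zeros in "00000" survive and it is not treated as the number 0.
kept = zips.dropna().tolist()
```

Reading the column as strings is what leaves the "00000" entries in place for the masking step later in this tutorial.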
Grizzly exposes a DataFrameWeld object that serves as a wrapper around the native Pandas DataFrame object; all of DataFrameWeld’s exposed methods are lazily-evaluated (that is, execution is only forced when the evaluate() method is called). To create a DataFrameWeld object from the DataFrame we just read:
>>> requests = gr.DataFrameWeld(raw_requests)
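As a rough mental model of lazy evaluation, the toy class below records element-wise operations and only runs them, in a single pass over the data, when evaluate() is called. It is a plain-Python sketch: LazyColumn, map, and strip_zip are hypothetical names for illustration, and this is not how Grizzly or Weld is actually implemented.

```python
class LazyColumn:
    """Toy lazy wrapper (hypothetical; not Grizzly's implementation)."""

    def __init__(self, values, ops=()):
        self._values = list(values)
        self._ops = tuple(ops)  # element-wise functions recorded so far

    def map(self, fn):
        # Record the operation instead of running it.
        return LazyColumn(self._values, self._ops + (fn,))

    def evaluate(self):
        # Apply every recorded operation in one pass over the data.
        out = []
        for value in self._values:
            for fn in self._ops:
                value = fn(value)
            out.append(value)
        return out


calls = []  # tracks when strip_zip actually runs


def strip_zip(z):
    calls.append(z)
    return z.strip()


zips = LazyColumn([" 10001", "00000 ", "11201"])
pending = zips.map(strip_zip).map(lambda z: "nan" if z == "00000" else z)
calls_before_evaluate = list(calls)  # still empty: nothing has run yet

result = pending.evaluate()  # execution is forced here
```

Because the whole chain of operations is known before anything runs, a lazy system has the opportunity to optimize across operators rather than executing each one in isolation.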
We can then use standard Pandas operators on this DataFrameWeld object. requests has a column of zipcodes; some of these are “00000”. To convert them all to nan, we can first compute a predicate using the == operator (which returns a SeriesWeld object that wraps a native Pandas Series object), and then subsequently mask:
>>> zero_zips = requests['Incident Zip'] == '00000'
>>> requests['Incident Zip'][zero_zips] = "nan"
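For comparison, the same predicate-then-mask pattern can be tried eagerly in native Pandas on a small Series. The sample values are made up for illustration; like the Grizzly snippet above, this assigns the string "nan" rather than a floating-point NaN.

```python
import pandas as pd

# Hypothetical sample of zip codes, stored as strings.
zips = pd.Series(['10001', '00000', '11201', '00000'], name='Incident Zip')

# The comparison yields a boolean Series: True wherever the zip is "00000".
zero_zips = zips == '00000'

# Assigning through the boolean mask overwrites only the matching entries.
zips[zero_zips] = 'nan'

predicate = zero_zips.tolist()
masked = zips.tolist()
```

In native Pandas each of these statements executes immediately, whereas the Grizzly versions are only executed once evaluate() is called.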
To see all resulting unique zipcodes, we could do:
>>> result = requests['Incident Zip'].unique()
Note that unique returns a LazyOp object. To convert to a standard NumPy array (that is, to force execution), call:
>>> print(result.evaluate())
More examples of workloads that make use of Grizzly are in the examples/python/grizzly directory.