Stanford Infolab

Weld

Fast parallel code generation for data analytics frameworks.

Grizzly

Grizzly is a subset of the Pandas data analytics library integrated with Weld. Grizzly uses lazy evaluation to accelerate Pandas workloads by optimizing across individual operators.

Grizzly currently supports Weld-optimized versions of several commonly used operators.

You can install Grizzly via PyPI:

$ pip install grizzly

The rest of this page walks through a simple example of how to set up and use Grizzly in an application.

Tutorial Data Acquisition

To get the data for this tutorial, run:

$ wget https://raw.githubusercontent.com/jvns/pandas-cookbook/master/data/311-service-requests.csv

A Step-by-Step Walkthrough

First, import the Pandas library and Grizzly:

$ python
>>> import pandas as pd
>>> import grizzly.grizzly as gr

Grizzly depends on native Pandas for file I/O, so to read from a file, call Pandas’ read_csv function. For the purposes of this tutorial, let’s read from a CSV file called 311-service-requests.csv:

>>> na_values = ['NO CLUE', 'N/A', '0']
>>> raw_requests = pd.read_csv('311-service-requests.csv', na_values=na_values, dtype={'Incident Zip': str})
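To illustrate what the na_values and dtype arguments do, here is a minimal sketch that runs the same read_csv call on a small in-memory CSV. The sample rows and the name csv_text are made up for this illustration and are not taken from the real 311 dataset. The point it tries to show is that '0' and 'N/A' become missing values, while '00000' survives as a string because the column is read as str:

```python
import io

import pandas as pd

# Hypothetical sample data standing in for 311-service-requests.csv.
csv_text = (
    "Incident Zip,Complaint Type\n"
    "10001,Noise\n"
    "N/A,Heating\n"
    "0,NO CLUE\n"
    "00000,Noise\n"
)

na_values = ['NO CLUE', 'N/A', '0']
sample = pd.read_csv(
    io.StringIO(csv_text),
    na_values=na_values,           # these strings are parsed as missing values
    dtype={'Incident Zip': str},   # keep zip codes as strings, preserving leading zeros
)

# Count how many zip codes were turned into missing values.
missing_zips = int(sample['Incident Zip'].isna().sum())
```

This is why the walkthrough still has to deal with "00000" zip codes separately after loading.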

Grizzly exposes a DataFrameWeld object that wraps the native Pandas DataFrame object. All of DataFrameWeld’s exposed methods are lazily evaluated; that is, execution is only forced when the evaluate() method is called. To create a DataFrameWeld object from the DataFrame we just read:

>>> requests = gr.DataFrameWeld(raw_requests)
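The idea of lazy evaluation can be sketched with a toy wrapper in plain Python. This is only an illustration of the general technique, not Grizzly's implementation: the class LazySeries and its map method are hypothetical names invented here, and the real library hands the recorded computation to Weld for optimization rather than looping in Python.

```python
class LazySeries:
    """Toy stand-in for a lazily evaluated column (not a Grizzly class)."""

    def __init__(self, data, ops=()):
        self._data = data   # underlying values, never mutated
        self._ops = ops     # tuple of pending element-wise functions

    def map(self, fn):
        # Record the operation instead of running it right away.
        return LazySeries(self._data, self._ops + (fn,))

    def evaluate(self):
        # Apply every recorded operation in a single pass over the data.
        out = []
        for value in self._data:
            for fn in self._ops:
                value = fn(value)
            out.append(value)
        return out


zips = LazySeries(["10001", "00000", "11211"])

# Nothing is computed by these two calls; they only queue work.
cleaned = zips.map(lambda z: "nan" if z == "00000" else z).map(lambda z: z.upper())

# Execution is forced here, and both operations run in one loop.
result = cleaned.evaluate()
```

Because the operations are only recorded until evaluate() is called, the wrapper sees the whole pipeline at once, which is what makes optimizing across operators possible.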

We can then use standard Pandas operators on this DataFrameWeld object. requests has a column of zip codes, some of which are “00000”. To convert them all to nan, we first compute a predicate using the == operator (which returns a SeriesWeld object that wraps a native Pandas Series object), and then use that predicate as a mask:

>>> zero_zips = requests['Incident Zip'] == '00000'
>>> requests['Incident Zip'][zero_zips] = "nan"

To see all of the resulting unique zip codes, we could do:

>>> result = requests['Incident Zip'].unique()

Note that unique returns a LazyOp object. To convert to a standard NumPy array (that is, to force execution), call:

>>> print(result.evaluate())
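For comparison, the mask-and-unique steps above can be written in eager native Pandas as follows. This is a sketch on a small made-up frame (the zip codes here are hypothetical sample values), and it uses .loc for the masked assignment, which is the usual form in plain Pandas. Each line executes immediately, so there is no evaluate() step:

```python
import pandas as pd

# Hypothetical stand-in for the 'Incident Zip' column of the 311 data.
df = pd.DataFrame({"Incident Zip": ["10001", "00000", "11211", "00000"]})

# Boolean predicate: True where the zip code is the placeholder "00000".
zero_zips = df["Incident Zip"] == "00000"

# Masked assignment: overwrite only the rows selected by the predicate.
df.loc[zero_zips, "Incident Zip"] = "nan"

# unique() runs eagerly and returns the distinct values in order of appearance.
unique_zips = list(df["Incident Zip"].unique())
```

In the Grizzly version, the same three operations are recorded and executed together when evaluate() is called.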

More examples of workloads that make use of Grizzly are in the examples/python/grizzly directory.