Scikit-Learn-style API

This example demonstrates compatibility with scikit-learn’s basic fit API. As a running example, we’ll use the perennial NYC taxi cab dataset.

In [1]:

import os
import s3fs
import pandas as pd
import dask.array as da
import dask.dataframe as dd
from distributed import Client

from dask import persist
from dask_glm.estimators import LogisticRegression

In [2]:

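# download the January 2015 yellow-cab trips once, if not already cached locally
if not os.path.exists('trip.csv'):
    s3 = s3fs.S3FileSystem(anon=True)  # anonymous (unauthenticated) access
    s3.get("dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv", "trip.csv")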

In [3]:

client = Client()

In [4]:

ddf = dd.read_csv("trip.csv")

We can use the dask.dataframe API to explore the dataset, and notice that some of the values look suspicious (for example, a negative minimum fare and a maximum trip distance above 15 million):

In [5]:

ddf[['trip_distance', 'fare_amount']].describe().compute()

Out[5]:

	trip_distance	fare_amount
count	1.274899e+07	1.274899e+07
mean	1.345913e+01	1.190566e+01
std	9.844094e+03	1.030254e+01
min	0.000000e+00	-4.500000e+02
25%	1.000000e+00	6.500000e+00
50%	1.700000e+00	9.000000e+00
75%	3.100000e+00	1.350000e+01
max	1.542000e+07	4.008000e+03

Scikit-learn doesn’t yet support filtering observations inside a pipeline, so we’ll filter out the suspicious rows before anything else.

In [6]:

# these filter out less than 1% of the observations
ddf = ddf[(ddf.trip_distance < 20) &
          (ddf.fare_amount < 150)]
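
The filter above keeps a row only when both conditions hold. As a minimal plain-Python sketch of the same logic (the rows and the helper `keep` are made up for illustration; this is not the real data and does not use Dask), we can count how many rows such a mask removes. Note that the filter only sets upper bounds, so a negative fare like the minimum in the summary above still passes it.

```python
# Made-up rows for illustration; column names mirror the taxi dataset.
trips = [
    {"trip_distance": 1.7, "fare_amount": 9.0},
    {"trip_distance": 3.1, "fare_amount": 13.5},
    {"trip_distance": 15420000.0, "fare_amount": 52.0},  # implausible distance
    {"trip_distance": 2.0, "fare_amount": 4008.0},       # implausible fare
    {"trip_distance": 1.0, "fare_amount": -450.0},       # negative fare
]


def keep(trip, max_distance=20, max_fare=150):
    """Return True when the trip is below both upper bounds."""
    return trip["trip_distance"] < max_distance and trip["fare_amount"] < max_fare


kept = [t for t in trips if keep(t)]
removed_fraction = (len(trips) - len(kept)) / len(trips)

print(len(kept), removed_fraction)
# The negative-fare row survives, because no lower bound is applied.
print(any(t["fare_amount"] < 0 for t in kept))
```

On the real data, the same check amounts to comparing the row count before and after the filter.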

Now, we’ll split our DataFrame into a train and test set, and select our feature matrix and target column (whether the passenger tipped).

In [7]:

df_train, df_test = ddf.random_split([0.80, 0.20], random_state=2)

columns = ['VendorID', 'passenger_count', 'trip_distance', 'payment_type', 'fare_amount']

X_train, y_train = df_train[columns], df_train['tip_amount'] > 0
X_test, y_test = df_test[columns], df_test['tip_amount'] > 0

X_train, y_train, X_test, y_test = persist(
    X_train, y_train, X_test, y_test
)

With our training data in hand, we fit our logistic regression. Nothing here should be surprising to those familiar with scikit-learn.

In [8]:

%%time
# this is a *dask-glm* LogisticRegression, not scikit-learn
lm = LogisticRegression(fit_intercept=False)
lm.fit(X_train.values, y_train.values)

CPU times: user 35.9 s, sys: 8.69 s, total: 44.6 s
Wall time: 9min 2s
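
To make the fitted model concrete: with fit_intercept=False, a logistic regression scores each row by passing the dot product of its features and the coefficients through the logistic function, and labels the row positive when that probability exceeds a threshold. The sketch below is plain Python with made-up coefficients. It is not dask-glm's implementation, and the 0.5 threshold is an assumption following the usual convention.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(row, coef):
    """Probability of the positive class for one row, with no intercept term."""
    linear = sum(x * w for x, w in zip(row, coef))
    return sigmoid(linear)


def predict(row, coef, threshold=0.5):
    """Label the row positive when its probability exceeds the threshold."""
    return predict_proba(row, coef) > threshold


# Hypothetical coefficients and feature rows, for illustration only.
coef = [0.8, -0.5]
rows = [[1.0, 0.2], [0.1, 2.0]]

probs = [predict_proba(r, coef) for r in rows]
labels = [predict(r, coef) for r in rows]
print(probs, labels)
```

Without an intercept, a row of all zeros always gets probability 0.5, which is one reason the choice of fit_intercept matters when the features are not centered.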

Again following the lead of scikit-learn, we can measure the performance of the estimator on the training dataset:

In [9]:

lm.score(X_train.values, y_train.values).compute()

Out[9]:

0.90022477759757635

and on the test dataset:

In [10]:

lm.score(X_test.values, y_test.values).compute()

Out[10]:

0.90030262922441306
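
Assuming score follows the scikit-learn convention for classifiers and returns mean accuracy, the two numbers above are the fraction of trips whose tipped / not-tipped label was predicted correctly. A minimal plain-Python sketch of that metric, using made-up labels and a hypothetical helper named `accuracy`:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of labels")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: True means the passenger tipped.
y_true = [True, True, False, True, False]
y_pred = [True, False, False, True, True]

result = accuracy(y_true, y_pred)
print(result)  # 3 of the 5 labels match
```

The train and test scores here agree to about three decimal places, so the model performs essentially the same on held-out trips as on the trips it was fitted to.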