Scikit-Learn-style API

This example demonstrates compatibility with scikit-learn’s basic fit API. For demonstration, we’ll use the perennial NYC taxi cab dataset.

In [1]:
import os
import s3fs
import pandas as pd
import dask.array as da
import dask.dataframe as dd
from distributed import Client

from dask import persist
from dask_glm.estimators import LogisticRegression
In [2]:
if not os.path.exists('trip.csv'):
    s3 = s3fs.S3FileSystem(anon=True)
    s3.get("dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv", "trip.csv")
In [3]:
client = Client()
In [4]:
ddf = dd.read_csv("trip.csv")

We can use the dask.dataframe API to explore the dataset, and notice that some of the values look suspicious: the minimum fare is negative, and the maximum trip distance is over 15 million miles.

In [5]:
ddf[['trip_distance', 'fare_amount']].describe().compute()
       trip_distance   fare_amount
count   1.274899e+07  1.274899e+07
mean    1.345913e+01  1.190566e+01
std     9.844094e+03  1.030254e+01
min     0.000000e+00 -4.500000e+02
25%     1.000000e+00  6.500000e+00
50%     1.700000e+00  9.000000e+00
75%     3.100000e+00  1.350000e+01
max     1.542000e+07  4.008000e+03

Scikit-learn doesn’t currently support filtering observations inside a pipeline, so we’ll do this before anything else.

In [6]:
# these filter out less than 1% of the observations
ddf = ddf[(ddf.trip_distance < 20) &
          (ddf.fare_amount < 150)]
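As a sanity check on a filter like this, the fraction of rows retained can be read directly off the boolean mask. A minimal pandas sketch with toy data standing in for the taxi DataFrame:

```python
import pandas as pd

# toy data standing in for the taxi DataFrame
df = pd.DataFrame({
    "trip_distance": [1.0, 2.5, 30.0, 3.1],
    "fare_amount":   [6.5, 9.0, 52.0, 200.0],
})

mask = (df.trip_distance < 20) & (df.fare_amount < 150)
fraction_kept = mask.mean()  # share of rows surviving the filter
print(fraction_kept)  # 0.5
```

The same expression works unchanged on a dask DataFrame, except that `mask.mean()` returns a lazy result you would `.compute()`.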

Now, we’ll split our DataFrame into a train and test set, and select our feature matrix and target column (whether the passenger tipped).

In [7]:
df_train, df_test = ddf.random_split([0.80, 0.20], random_state=2)

columns = ['VendorID', 'passenger_count', 'trip_distance', 'payment_type', 'fare_amount']

X_train, y_train = df_train[columns], df_train['tip_amount'] > 0
X_test, y_test = df_test[columns], df_test['tip_amount'] > 0

X_train, y_train, X_test, y_test = persist(
    X_train, y_train, X_test, y_test
)
With our training data in hand, we fit our logistic regression. Nothing here should be surprising to those familiar with scikit-learn.

In [8]:
# this is a *dask-glm* LogisticRegression, not scikit-learn
lm = LogisticRegression(fit_intercept=False)
%time, y_train.values)
CPU times: user 35.9 s, sys: 8.69 s, total: 44.6 s
Wall time: 9min 2s
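Under the hood, a logistic regression without an intercept (we passed `fit_intercept=False`) models the probability of a tip as the logistic function of a linear combination of the features. A small numpy sketch with made-up coefficients, not dask-glm’s actual implementation:

```python
import numpy as np

def predict_proba(X, coef):
    """Logistic model without intercept: sigmoid(X @ coef)."""
    z = X @ coef
    return 1.0 / (1.0 + np.exp(-z))

# made-up feature matrix (3 observations, 2 features) and coefficients
X = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [2.0, 0.0]])
coef = np.array([0.3, -0.2])

proba = predict_proba(X, coef)
pred = proba > 0.5  # classify at the 0.5 threshold
```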

Again, following the lead of scikit-learn we can measure the performance of the estimator on the training dataset:

In [9]:
lm.score(X_train.values, y_train.values).compute()

and on the test dataset:

In [10]:
lm.score(X_test.values, y_test.values).compute()
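For binary classifiers, `score` conventionally reports mean accuracy: the fraction of predictions that match the true labels. A minimal numpy sketch of that computation (an illustration, not dask-glm’s code):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Mean accuracy: fraction of predictions matching the labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return (y_true == y_pred).mean()

y_true = np.array([True, False, True, True])
y_pred = np.array([True, False, False, True])
print(accuracy(y_true, y_pred))  # 0.75
```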