Scikit-Learn-style APIΒΆ
This example demonstrates compatibility with scikit-learn’s basic fit
API. For demonstration, we’ll use the perennial NYC taxi cab dataset.
In [1]:
import os
import s3fs
import pandas as pd
import dask.array as da
import dask.dataframe as dd
from distributed import Client
from dask import persist
from dask_glm.estimators import LogisticRegression
In [2]:
if not os.path.exists('trip.csv'):
    s3 = s3fs.S3FileSystem(anon=True)
    s3.get("dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv", "trip.csv")
In [3]:
client = Client()
In [4]:
ddf = dd.read_csv("trip.csv")
We can use the dask.dataframe
API to explore the dataset, and notice
that some of the values look suspicious:
In [5]:
ddf[['trip_distance', 'fare_amount']].describe().compute()
Out[5]:
| | trip_distance | fare_amount |
|---|---|---|
| count | 1.274899e+07 | 1.274899e+07 |
| mean | 1.345913e+01 | 1.190566e+01 |
| std | 9.844094e+03 | 1.030254e+01 |
| min | 0.000000e+00 | -4.500000e+02 |
| 25% | 1.000000e+00 | 6.500000e+00 |
| 50% | 1.700000e+00 | 9.000000e+00 |
| 75% | 3.100000e+00 | 1.350000e+01 |
| max | 1.542000e+07 | 4.008000e+03 |
Scikit-learn doesn’t currently support filtering observations inside a pipeline, so we’ll do this before anything else.
In [6]:
# these filter out less than 1% of the observations
ddf = ddf[(ddf.trip_distance < 20) &
(ddf.fare_amount < 150)]
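The “less than 1% of the observations” claim can be checked directly: the same boolean mask that filters the rows also gives the fraction removed when averaged. A minimal sketch on toy data standing in for the taxi columns (the real check would use the dask DataFrame above, where the mask syntax is identical):

```python
import pandas as pd

# Toy stand-in for the taxi data; the same boolean mask works
# unchanged on a dask DataFrame with these column names.
pdf = pd.DataFrame({
    "trip_distance": [1.0, 2.5, 30.0, 0.5],
    "fare_amount": [6.5, 9.0, 80.0, 200.0],
})

# True where a row survives the filter
mask = (pdf.trip_distance < 20) & (pdf.fare_amount < 150)

# Fraction of rows dropped = 1 - fraction kept
frac_removed = 1 - mask.mean()
print(frac_removed)  # 0.5: two of the four toy rows fail the filter
```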
Now, we’ll split our DataFrame into a train and test set, and select our feature matrix and target column (whether the passenger tipped).
In [7]:
df_train, df_test = ddf.random_split([0.80, 0.20], random_state=2)
columns = ['VendorID', 'passenger_count', 'trip_distance', 'payment_type', 'fare_amount']
X_train, y_train = df_train[columns], df_train['tip_amount'] > 0
X_test, y_test = df_test[columns], df_test['tip_amount'] > 0
X_train, y_train, X_test, y_test = persist(
X_train, y_train, X_test, y_test
)
With our training data in hand, we fit our logistic regression. Nothing
here should be surprising to those familiar with scikit-learn.
In [8]:
%%time
# this is a *dask-glm* LogisticRegression, not scikit-learn's
lm = LogisticRegression(fit_intercept=False)
lm.fit(X_train.values, y_train.values)
CPU times: user 35.9 s, sys: 8.69 s, total: 44.6 s
Wall time: 9min 2s
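The API parallel with scikit-learn can be seen by running the identical `fit` pattern with scikit-learn’s own estimator on a small synthetic problem (illustrative only; the data here is made up and in-memory, unlike the dask arrays above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, linearly separable toy problem
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(int)

# Same constructor arguments and fit call as the dask-glm estimator
clf = LogisticRegression(fit_intercept=False)
clf.fit(X, y)
print(clf.score(X, y))  # near-perfect accuracy on separable toy data
```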
Again, following the lead of scikit-learn, we can measure the performance of the estimator on the training dataset:
In [9]:
lm.score(X_train.values, y_train.values).compute()
Out[9]:
0.90022477759757635
and on the test dataset:
In [10]:
lm.score(X_test.values, y_test.values).compute()
Out[10]:
0.90030262922441306
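An accuracy near 0.90 is only meaningful relative to the base rate of tipping: if most riders tip, always predicting “tipped” already scores highly. A quick sanity check is to compare against the majority-class baseline (on the real data this would use something like `y_test.mean().compute()`; the labels below are a toy stand-in):

```python
import numpy as np

# Toy labels standing in for y_test; 1 = passenger tipped
y = np.array([1, 1, 1, 0])

# Majority-class baseline: always predict the more common label
p = y.mean()
baseline = max(p, 1 - p)
print(baseline)  # 0.75 for these toy labels
```

If the fitted model’s score only matches this baseline, it has learned little beyond the class balance.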