Cheatsheet: Scikit-learn
Python
Refer to the Jupyter notebook for rendered code.
Load the usual libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Data exploration
data.head()
data.describe()
data.shape
# check colnames
data.columns
Visualize data
sns.histplot(data.variable)
sns.scatterplot(x = data.var1, y = data.var2)
# pair plot for low dimension data
sns.pairplot(data, hue = 'categorical_var')
Feature engineering
# split the response from the explanatory variables
X = data.drop('response', axis = 1)
y = data['response']
Train test split
from sklearn.model_selection import train_test_split
Xtrain, Xtest, y_train, y_test = train_test_split(X, y, random_state = 0)
You can check the size of the split data using Xtrain.shape.
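A quick sanity check on the split, as a minimal sketch assuming the names from the train_test_split call above (by default, test_size holds out 25% of the rows):
print(Xtrain.shape, Xtest.shape)    # feature matrices for train / test
print(y_train.shape, y_test.shape)  # matching response vectors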
Cross validation
from sklearn.model_selection import cross_val_score
cvs = cross_val_score(model, Xtrain, y_train, cv = 10)
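A small follow-up, assuming cvs from the call above; the per-fold scores are usually summarised by their mean and spread:
cvs.mean()   # average score across the 10 folds
cvs.std()    # variability across folds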
Fit ML model and predictions
General workflow
from sklearn.MODEL import CLASSIFIER
YOUR_MODEL = CLASSIFIER(PARA)
YOUR_MODEL.fit(Xtrain, y_train)
# predict class
YOUR_MODEL.predict(Xtest)
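A concrete instance of the template above, as a minimal sketch; KNeighborsClassifier is used here only as an illustration and is not part of the original cheatsheet:
from sklearn.neighbors import KNeighborsClassifier

# MODEL -> neighbors, CLASSIFIER -> KNeighborsClassifier, PARA -> n_neighbors=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xtrain, y_train)
knn.predict(Xtest)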
Common parameters to tune
C: float, default is 1. Inverse of regularization strength. Must be a positive float. Smaller values specify stronger regularization.
random_state: integer
solver: depends on the classifier. Should check the docs.
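A minimal tuning sketch for C, assuming the train/test split above; GridSearchCV and the candidate values here are illustrative choices, not part of the original cheatsheet:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# search a few values of C with 5-fold cross validation (values are arbitrary examples)
grid = GridSearchCV(LogisticRegression(max_iter=10000),
                    param_grid={'C': [0.01, 0.1, 1, 10]},
                    cv=5)
grid.fit(Xtrain, y_train)
grid.best_params_   # best C found
grid.best_score_    # mean cross-validated score for that C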
Logistic regression
Scikit-learn doc: Logistic Regression
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(C=C, penalty="l2", solver="saga", max_iter=10000)
Logistic regression is usually used for two classes, but it is also common in multiclass problems, e.g. wrapped in OneVsRestClassifier (see the sketch below).
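A minimal multiclass sketch, assuming the same Xtrain / y_train as above; OneVsRestClassifier fits one binary logistic regression per class:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# one binary logistic regression per class, combined one-vs-rest
ovr = OneVsRestClassifier(LogisticRegression(max_iter=10000))
ovr.fit(Xtrain, y_train)
ovr.predict(Xtest)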
Decision tree and random forest
Both can handle more than two classes, and both have regressor variants for regression problems (a regression sketch follows the random forest example below).
For a single decision tree, predict_proba returns the class fractions of the leaf a sample lands in, so the probabilities are often 0 or 1 and not very informative; a random forest's averaged probabilities are more useful.
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier().fit(Xtrain, y_train)
# predict class
tree.predict(Xtest)
Random forest
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(max_depth=3, random_state=42)
forest.fit(Xtrain, y_train)
# predict class
forest.predict(Xtest)
# predict probability
forest.predict_proba(Xtest)
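A minimal regression sketch, assuming a numeric response split into Xtrain / y_train the same way as above; DecisionTreeRegressor and RandomForestRegressor are the regression counterparts:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# same fit/predict workflow, but for a numeric response
tree_reg = DecisionTreeRegressor(max_depth=3).fit(Xtrain, y_train)
forest_reg = RandomForestRegressor(random_state=42).fit(Xtrain, y_train)
tree_reg.predict(Xtest)
forest_reg.predict(Xtest)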
Naive Bayes
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(Xtrain, y_train)
# can also be written as model = GaussianNB().fit(X,y)
y_model = model.predict(Xtest)
Performance
Class accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_model)
# can also compare the results manually: number of correct predictions
np.sum(y_test == y_model)
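As a small follow-up (assuming the same y_test and y_model), taking the mean of the boolean comparison gives accuracy as a fraction and should match accuracy_score:
np.mean(y_test == y_model)   # fraction of correct predictions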
Probabilities
Carry out a row-wise summation and check that each row sums to 1.
# rf_prob is the output of forest.predict_proba(Xtest) above
rf_prob.sum(axis = 1)
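A compact way to automate that check, assuming rf_prob is the forest.predict_proba(Xtest) output from above:
np.allclose(rf_prob.sum(axis = 1), 1)   # True if every row sums to 1 (within tolerance)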