Cheatsheet: Scikit-learn

Python

Refer to the Jupyter notebook for rendered code.

Author

Chi Zhang

Published

February 26, 2025

Load the usual libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Data exploration

data.head()
data.describe()
data.shape
data.columns # check colnames

Visualize data

sns.histplot(data.variable)
sns.scatterplot(x = data.var1, y = data.var2)
# pair plot, practical only for low-dimensional data
sns.pairplot(data, hue = 'categorical_var')

Feature engineering

# split the response from the explanatory variables
X = data.drop('response', axis = 1)
y = data['response']
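
A minimal sketch of the same split on a toy table. The DataFrame and its column names (var1, var2, response) are made up for illustration and are not from the original data.

```python
import pandas as pd

# Hypothetical toy data; 'response' is the column to predict.
data = pd.DataFrame({
    "var1": [1.0, 2.0, 3.0, 4.0],
    "var2": [0.5, 0.1, 0.9, 0.3],
    "response": [0, 1, 0, 1],
})

# drop() returns a new DataFrame, so `data` itself is left unchanged
X = data.drop(columns="response")  # equivalent to data.drop('response', axis=1)
y = data["response"]

print(X.shape, y.shape)
print(list(X.columns))
```

X stays a two-dimensional DataFrame while y is a one-dimensional Series, which is the layout scikit-learn estimators expect.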

Train test split

from sklearn.model_selection import train_test_split
Xtrain, Xtest, y_train, y_test = train_test_split(X, y, random_state = 0)

Check the size of each split with Xtrain.shape and Xtest.shape. By default, 25% of the rows go to the test set.
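
A small sketch of the split sizes, assuming the iris data bundled with scikit-learn as a stand-in dataset. Passing test_size and stratify explicitly is an illustrative choice, not something the original call does.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in data: 150 rows, 4 features, 3 balanced classes
X, y = load_iris(return_X_y=True)

# test_size=0.25 matches the default; stratify keeps class proportions similar
Xtrain, Xtest, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(Xtrain.shape, Xtest.shape)
print(y_train.shape, y_test.shape)
```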

Cross validation

from sklearn.model_selection import cross_val_score
# model is any unfitted estimator, e.g. GaussianNB()
cvs = cross_val_score(model, Xtrain, y_train, cv = 10)
cvs.mean() # average score over the 10 folds
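
A runnable sketch of the call above, assuming the bundled iris data and GaussianNB as the estimator (both are stand-ins chosen for illustration). cross_val_score returns one score per fold, so the mean and standard deviation are the usual summary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
Xtrain, Xtest, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB()  # unfitted; cross_val_score fits a clone on each fold
cvs = cross_val_score(model, Xtrain, y_train, cv=10)

print(cvs.shape)              # one score per fold
print(cvs.mean(), cvs.std())  # average accuracy and its spread
```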

Fit ML model and predictions

General workflow

from sklearn.MODEL import CLASSIFIER
YOUR_MODEL = CLASSIFIER(PARA)
YOUR_MODEL.fit(Xtrain, y_train)

# predict class
YOUR_MODEL.predict(Xtest)

Common parameters to tune

  • C: positive float, default 1.0. Inverse of regularization strength, so smaller values mean stronger regularization. Used by e.g. LogisticRegression and SVC.
  • random_state: integer seed. Fix it to make results reproducible.
  • solver: the available options depend on the classifier. Check the docs.
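
A sketch of what C does in practice, assuming LogisticRegression on the bundled iris data (an illustrative choice). A small C penalizes the coefficients more heavily, so their overall size shrinks.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Small C = strong regularization, large C = weak regularization
strong = LogisticRegression(C=0.01, max_iter=10000).fit(X, y)
weak = LogisticRegression(C=100.0, max_iter=10000).fit(X, y)

norm_strong = np.linalg.norm(strong.coef_)
norm_weak = np.linalg.norm(weak.coef_)

# The strongly regularized model should have the smaller coefficient norm
print(norm_strong, norm_weak)
```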

Logistic regression

Scikit-learn doc: Logistic Regression

from sklearn.linear_model import LogisticRegression
C = 1.0 # inverse of regularization strength
LR = LogisticRegression(C=C, penalty="l2", solver="saga", max_iter=10000)

Usually used for two classes, but also common in multiclass problems, for example wrapped in OneVsRestClassifier.
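
A minimal sketch of the one-vs-rest setup, assuming the three-class iris data as a stand-in. OneVsRestClassifier fits one binary logistic regression per class.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes

ovr = OneVsRestClassifier(LogisticRegression(max_iter=10000))
ovr.fit(X, y)

print(len(ovr.estimators_))  # one fitted binary classifier per class
print(ovr.predict(X[:5]))    # predicted class labels for the first 5 rows
```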

Decision tree and random forest

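Both can handle more than two classes, and both have regression counterparts (DecisionTreeRegressor, RandomForestRegressor).

A decision tree also has predict_proba, but the probabilities are just the class fractions in each leaf, so a fully grown tree often returns only 0 or 1.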

from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier().fit(Xtrain, y_train)

# predict class
tree.predict(Xtest)

Random forest

from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(max_depth=3, random_state=42)
forest.fit(Xtrain, y_train)

# predict class
forest.predict(Xtest)

# predict probability
forest.predict_proba(Xtest)

Naive Bayes

from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(Xtrain, y_train)

# can also be written as model = GaussianNB().fit(Xtrain, y_train)
y_model = model.predict(Xtest)

Performance

Class accuracy

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_model)
# can also compute it manually: the fraction of matching labels
np.mean(y_test == y_model)
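
A sketch of the manual check, assuming the bundled iris data and GaussianNB as stand-ins. The count of matching labels divided by the number of test rows (equivalently, the mean of the boolean comparison) should reproduce accuracy_score.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
Xtrain, Xtest, y_train, y_test = train_test_split(X, y, random_state=0)
y_model = GaussianNB().fit(Xtrain, y_train).predict(Xtest)

acc = accuracy_score(y_test, y_model)

n_correct = np.sum(y_test == y_model)  # number of correct predictions
manual_acc = n_correct / len(y_test)   # fraction of correct predictions

print(acc, manual_acc)
```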

Probabilities

Sum the predicted class probabilities row-wise. Each row should add up to 1.

rf_prob = forest.predict_proba(Xtest)
rf_prob.sum(axis = 1)
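
A runnable sketch of this check, assuming rf_prob comes from the random forest's predict_proba and using the bundled iris data as a stand-in. Each row holds one probability per class, with columns ordered as in forest.classes_.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
Xtrain, Xtest, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(max_depth=3, random_state=42)
forest.fit(Xtrain, y_train)

rf_prob = forest.predict_proba(Xtest)  # shape: (n_test_rows, n_classes)

print(rf_prob.shape)
print(forest.classes_)                        # column order of rf_prob
print(np.allclose(rf_prob.sum(axis=1), 1.0))  # do all rows sum to 1?
```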

ROC curve