Cheatsheet: Pandas DataFrame

Python

Refer to the jupyter notebook for rendered code.

Author

Chi Zhang

Published

February 12, 2025

Data analysis workflow

Note

When working with data that need to be transformed, create a copy so that the original data is not affected.

planet_sub = planets.copy()[planets['Mass']>15]

Initial summary

df.head()
df.columns
df.shape
df.describe()
df.dtypes
df.isnull().sum() # check missing

Unique values, count frequency

df['col'].unique()
df['col'].value_counts() # same as R table()

Clean

# rename column
df = df.rename(columns = {'old_name': 'new_name'})

Convert data type, rounding

df['num_col'].round(2) # 2 digits
df['num_col'].astype(str) # change into string

Missing value handling

df = df.dropna() # drop na
df['col'] = df['col'].fillna(value) # impute with known value
df['col'] = df['col'].fillna(df['col'].mean()) # mean imputation

Subsetting, filtering

df_sub = df[['col1', 'col2']]
df_filter = df[df['col1'] < 25]

Might have to reset index afterwards, df_new.reset_index()

Transform

Create new variable based on another

# solution 1
planet_sub['pre2005'] = 'yes'
planet_sub.loc[planet_sub.year > 2005, "pre2005"] = "no"
# solution 2
planet_sub['pre2005'] = np.where(planet_sub['year']<2005, 'yes', 'no')

Sort

planet_sub.sort_values('Mass', ascending=False)
planet_sub.sort_values(by = ['orbital_period', 'Mass'], ascending=False)

Transform with lambda

df.transform(lambda x: x-x.mean())

Visualization (basics)

import seaborn as sns
import matplotlib.pyplot as plt

# histogram
sns.histplot(df['numvar'])
plt.show()

# boxplot
sns.boxplot(df['numvar'])

# barplot (categorical, raw data)
sns.countplot(df['cate_var'])

# scatter (2 var)
sns.scatterplot(x = 'var1', y = 'var2', hue = 'var3', data = df)

Details

Create a dataframe

There are multiple ways to do it:

  • pd.Series
  • dictionary
  • 2-d array
  • combining mixed types

Create a dataframe by combining Series

import pandas as pd
import numpy as np

# define population
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)

# define area
area_dict = {'California': 423967,
             'Texas': 695662,
             'New York': 141297,
             'Florida': 170312,
             'Illinois': 149995}
area = pd.Series(area_dict)

# now combine these two
states = pd.DataFrame({
    'population': population,
    'area': area
})
states
population area
California 38332521 423967
Texas 26448193 695662
New York 19651127 141297
Florida 19552860 170312
Illinois 12882135 149995

It is convenient to create a new column based on other columns,

states['density'] = states['population'] / states['area']

Create a df from dictionary

pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
a b c
0 1.0 2 NaN
1 NaN 3 4.0

Create a df from a 2-d array

pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])
foo bar
a 0.267536 0.947471
b 0.039290 0.090745
c 0.215675 0.245663

Mix types combination

This is equivalent to R cbind. Assuming that we have the following np.array: X (n by 2), y (n by 1), yhat (n by 1). We want to combine them together for easier visualization.

  • use np.column.stack(()) first, then convert into pandas dataframe; This is similar to creating a numeric matrix, then convert it to dataframe in R.
  • or, create a dataframe based on one of the variables first; then attach a second variable using df['new_var'] = var2. This is similar to df$new_var <- var2 in R.
# option 1
mat = np.column_stack((X, y, yhat)) # double bracket
df1 = pd.DataFrame(mat, columns = ['x1', 'x2', 'y', 'yhat'])

# option 2
df2 = pd.DataFrame(X, columns = ['x1', 'x2'])
df2['y'] = y
df2['yhat'] = yhat

Create new columns

After the above steps (in previous section), we want to also create an indicator based on two of the variables. Can use np.where function.

df['mismatch'] = np.where(df['y'] != df['yhat'], 1, 0)

Selection

  • rownames: index
  • colnames: column
  • select columns
states.index # rownames (index)
states.columns # colnames
states['area'] # select column by name

Subsetting with loc, iloc

# implicit index
states.iloc[:3, :2]

# explicit index
states.iloc[:'Illinois', :'pop']

Combine filtering

states.loc[data.density > 100, ['pop', 'density']]