Cheatsheet: NumPy, Pandas Series

Python

Refer to the jupyter notebook for rendered code.

Author

Chi Zhang

Published

January 22, 2025

Resources: Python Data Science Handbook by Jake VanderPlas

Note

Content for Pandas DataFrame are in a separate section as it is a very big topic. The data structures here are

Lists
Numpy Array (1D, higher-D)
Pandas Series

Generate data

List sequence

Create a list with range()

# create a list directly
[1, 2, 3]

# the following does not print out the lists
range(3)
range(0, 10, 2)

There are a few ways to print out elements from a list generated from range() * list comprehension ([function(i) for i in LIST]) * directly listing out (list(LIST)) * for loop, more tedious and requires a placeholder to be created first.

# directly listing out
list(range(0, 10, 2))

# list comprehension
[i for i in range(0, 10, 2)] # omit the function

# for loop
lst = []
for i in range(0, 10, 2):
    lst.append(i)

np.array, sequences with `np.arange()`

Distinguish list and array, the way they are created. range() is built-in, and produces a list. np.arange() is from numpy, and produces a numpy array. The arguments are the same: np.arange(start, end, step)

import numpy as np
np.arange(3)
np.arange(0, 10, 2)

Other useful functions to generate a sequence:

# np.linspace(start, end, nelements)
np.linspace(0, 1, 5) # 0, 0.25, 0.5, 0.75, 1

# repeat the same values
np.zeros(10)
np.ones(5)

Random number `np.random`

Uniform (between 0, 1): np.random.random((nrow, ncol))
Normal: np.random.normal(mu, sd, (nrow, ncol))
Random integer: np.random.randint(0, 10, (nrow, ncol))

Pandas Series

A pandas Series is a column variable, rather than a vector. It is similar to a dataframe with one column, hence they have row names (index).

Inherently Series is an array. Can change the index

data = pd.Series([0.25, 0.5, 0.75, 1.0])
# access value
data.values

# customize the index
data2 = pd.Series([0.25, 0.5, 0.75, 1.0],
                  index = ['a', 'b', 'c', 'd'])
data2['a']

Series is also a dictionary, so it can be created as such

my_dict = {'a': 100,
           'b': 200,
           'c': 300}
my_dict = pd.Series(my_dict)

Common attributes

That can be accessed by calling obj.size, obj.value etc Size, dimension

.value

.index

for dataframe, .column

Computation

Axis: The axis is quite convenient: axis = 0 conducts column-wise computations, and axis = 1 is row-wise. This needs to be distinguished with R where the first axis is row (apply(matrix, 1, function) does operation per row).

Selection

Generally counting starts from 0, access the index with square brackets.

indexing: generally refer to column
slicing: refer to row
concatenate and splitting

Numpy array

selection

1-D array

x = np.array([1,2,3,4,5])
x[0] # first
x[-1] # last
x[:4] # slicing
x[::2] # every other element
x[::-1] # reversing the array

Higher-D array

import numpy as np

rng = np.random.RandomState(42)
x = rng.randint(0, 10, (3, 4))
print(x)
print(x[0, 0]) # rowid, colid
print(x[0, :]) # entire row: slicing
print(x[0]) # entire row
print(x[:2, :3])

[[6 3 7 4]
 [6 9 2 6]
 [7 4 3 7]]
6
[6 3 7 4]
[6 3 7 4]
[[6 3 7]
 [6 9 2]]

Indexing

Always pay attention to where the index starts, typically 0.

x = np.arange(0, 10) # sequence from 1 to 10
ind = [3, 7, 2]
x[ind]

array([3, 7, 2])

concatenate and splitting

1-D arrays

x = np.array([1,2,3])
y = np.array([4,5,6])
np.concatenate([x, y]) # the square brackets remains

Spliting

x = np.arange(10)
x1, x2, x3 = np.split(x, [2, 4])
print(x1, x2, x3)

[0 1] [2 3] [4 5 6 7 8 9]

Higher-D arrays

grid = np.array([[1,2,3],
                [4,5,6]])

# concatenate, by default is axis=0         
np.concatenate([grid, grid])

# by column, result is 4 by 3
np.concatenate([grid, grid], axis = 0)
# by row, result is 2 by 6
np.concatenate([grid, grid], axis = 1)

# vstack, hstack
np.vstack([x, grid])

Pandas Series

Indexing

Pandas indices are customizable. It is useful to check them. data.index

If a data has explicit index, can also access the element with index names

data = pd.Series([0.25, 0.5, 0.75, 1.0])
# it is like a 1-d array
data[0]

data2 = pd.Series([0.25, 0.5, 0.75, 1.0],
                  index = ['a', 'b', 'c', 'd'])
data2['a']

Difference between loc and iloc

loc refers to the index user defined, typically starts from 1.
iloc refers to the implicit python index, starting from 0.

import pandas as pd

data = pd.Series(['a', 'b', 'c'], index = [1,3,5])

# loc
print(data.loc[1])
print(data.loc[1:3])

# iloc: implicit python style index
print(data.iloc[1]) # 2nd
print(data.iloc[1:3]) # 2,3rd

a
1    a
3    b
dtype: object
b
3    b
5    c
dtype: object

Missing value for array and pd.series

Both None and np.nan are null.

Detect null values with isnull(), notnull()

Generate data

List sequence

np.array, sequences with np.arange()

Random number np.random

Pandas Series

Common attributes

Computation

Selection

Numpy array

selection

Indexing

concatenate and splitting

Pandas Series

Indexing

Missing value for array and pd.series

np.array, sequences with `np.arange()`

Random number `np.random`