Cheatsheet: NumPy, Pandas Series

Python

Refer to the jupyter notebook for rendered code.

Author

Chi Zhang

Published

January 22, 2025

Resources: Python Data Science Handbook by Jake VanderPlas

Note

Content for Pandas DataFrame is in a separate section, as it is a very big topic. The data structures covered here are:

  • Lists
  • Numpy Array (1D, higher-D)
  • Pandas Series

Generate data

List sequence

Create a sequence with range()

# create a list directly
[1, 2, 3]

# range() returns a range object, not a list, so the elements are not printed
range(3)
range(0, 10, 2)

There are a few ways to print out the elements of a sequence generated by range():

  • listing them out directly (list(SEQUENCE))
  • list comprehension ([function(i) for i in SEQUENCE])
  • for loop, which is more tedious and requires an empty list to be created first

# directly listing out
list(range(0, 10, 2))

# list comprehension
[i for i in range(0, 10, 2)] # omit the function

# for loop
lst = []
for i in range(0, 10, 2):
    lst.append(i)

np.array, sequences with np.arange()

Distinguish lists from arrays by the way they are created. range() is built-in and produces a range object (convert it with list() to get a list). np.arange() is from NumPy and produces a NumPy array. The arguments follow the same pattern: np.arange(start, stop, step), where stop is excluded.

import numpy as np
np.arange(3)
np.arange(0, 10, 2)
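
The difference between the two can be checked directly. The following is a minimal sketch; the variable names are only illustrative.

```python
import numpy as np

r = range(0, 10, 2)       # built-in: a lazy range object
a = np.arange(0, 10, 2)   # NumPy: an ndarray

print(type(r).__name__)
print(type(a).__name__)

# a list built from the range holds the same values as the array
as_list = list(r)
print(as_list == a.tolist())

# arithmetic on an array is element-wise; a list has to be looped over
doubled_array = a * 2
doubled_list = [2 * i for i in as_list]
print(doubled_array, doubled_list)
```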

Other useful functions to generate a sequence:

# np.linspace(start, end, nelements)
np.linspace(0, 1, 5) # 0, 0.25, 0.5, 0.75, 1

# repeat the same values
np.zeros(10)
np.ones(5)

Random number np.random

  • Uniform (between 0, 1): np.random.random((nrow, ncol))
  • Normal: np.random.normal(mu, sd, (nrow, ncol))
  • Random integer: np.random.randint(0, 10, (nrow, ncol))
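
A short sketch of the three generators listed above. The seed and the shape (nrow, ncol) are arbitrary choices made here for illustration.

```python
import numpy as np

np.random.seed(0)   # arbitrary seed, only to make the draws repeatable
nrow, ncol = 2, 3   # illustrative shape

u = np.random.random((nrow, ncol))          # uniform on [0, 1)
z = np.random.normal(0, 1, (nrow, ncol))    # normal with mean 0 and sd 1
k = np.random.randint(0, 10, (nrow, ncol))  # integers from 0 to 9 (10 is excluded)

print(u.shape, z.shape, k.shape)
```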

Pandas Series

A pandas Series is a column variable rather than a plain vector. It is similar to a DataFrame with one column, hence it has row names (the index).

Underneath, a Series is a one-dimensional array with an index attached. The index can be customized.

import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])
# access the values as a NumPy array
data.values

# customize the index
data2 = pd.Series([0.25, 0.5, 0.75, 1.0],
                  index = ['a', 'b', 'c', 'd'])
data2['a']

A Series also behaves like a dictionary that maps index labels to values, so it can be created from one

my_dict = {'a': 100,
           'b': 200,
           'c': 300}
my_dict = pd.Series(my_dict)
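
The dictionary-like behaviour can be sketched as follows. The Series name prices is hypothetical and the values mirror the dictionary above.

```python
import pandas as pd

prices = pd.Series({'a': 100, 'b': 200, 'c': 300})  # hypothetical example data

print('a' in prices)         # membership is checked against the index
print(list(prices.keys()))   # keys() returns the index
print(prices['b'])           # look up a value by label

prices['d'] = 400            # assigning to a new label extends the Series
print(prices['a':'c'])       # unlike a dict, label slices work (end label included)
```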

Common attributes

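These can be accessed by calling obj.size, obj.values, etc.

  • .size, .shape, .ndim: size and dimensions
  • .values: the underlying data
  • .index: the index (row labels)
  • for a DataFrame, .columns: the column labels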
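
A minimal sketch of these attributes on a small Series. The DataFrame df is a hypothetical two-column example, included only to show .columns.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(s.size)      # number of elements
print(s.shape)     # tuple of dimensions
print(s.ndim)      # number of dimensions
print(s.values)    # underlying NumPy array
print(s.index)     # row labels

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})  # hypothetical two-column frame
print(df.columns)  # column labels
```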

Computation

Axis: The axis argument is quite convenient: axis = 0 computes along the rows and returns one result per column (column-wise), and axis = 1 computes along the columns and returns one result per row (row-wise). This differs from R, where the first margin refers to rows (apply(matrix, 1, function) applies the function to each row).
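
A small sketch of the axis argument, using a hypothetical 2 by 3 array m and the sum as the computation.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = m.sum(axis=0)  # collapses the rows: one sum per column (5, 7, 9)
row_sums = m.sum(axis=1)  # collapses the columns: one sum per row (6, 15)

print(col_sums)
print(row_sums)
```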

Selection

Counting generally starts from 0. Access elements by putting the index in square brackets.

  • indexing: generally refer to column
  • slicing: refer to row
  • concatenate and splitting

Numpy array

selection

1-D array

x = np.array([1,2,3,4,5])
x[0] # first
x[-1] # last
x[:4] # slicing
x[::2] # every other element
x[::-1] # reversing the array

Higher-D array

import numpy as np

rng = np.random.RandomState(42)
x = rng.randint(0, 10, (3, 4))
print(x)
print(x[0, 0]) # rowid, colid
print(x[0, :]) # entire row: slicing
print(x[0]) # entire row
print(x[:2, :3])
[[6 3 7 4]
 [6 9 2 6]
 [7 4 3 7]]
6
[6 3 7 4]
[6 3 7 4]
[[6 3 7]
 [6 9 2]]

Indexing

Always pay attention to where the index starts, typically 0.

x = np.arange(0, 10) # sequence from 0 to 9
ind = [3, 7, 2]
x[ind]
array([3, 7, 2])

concatenate and splitting

1-D arrays

x = np.array([1,2,3])
y = np.array([4,5,6])
np.concatenate([x, y]) # the arrays are passed together as a list, hence the square brackets

Splitting

x = np.arange(10)
x1, x2, x3 = np.split(x, [2, 4])
print(x1, x2, x3)
[0 1] [2 3] [4 5 6 7 8 9]

Higher-D arrays

grid = np.array([[1,2,3],
                [4,5,6]])

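# concatenate, by default along axis=0
np.concatenate([grid, grid])

# axis=0 stacks along rows (vertically), result is 4 by 3
np.concatenate([grid, grid], axis = 0)
# axis=1 stacks along columns (side by side), result is 2 by 6
np.concatenate([grid, grid], axis = 1)

# vstack, hstack
x = np.array([1, 2, 3]) # needs as many elements as grid has columns
np.vstack([x, grid])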
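
A minimal sketch of both stacking helpers. The arrays top and side are hypothetical and chosen so that their shapes are compatible with grid.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
top = np.array([7, 8, 9])       # hypothetical extra row, length equals the number of columns
side = np.array([[10], [20]])   # hypothetical extra column, shape (2, 1)

stacked_v = np.vstack([top, grid])    # adds a row: shape (3, 3)
stacked_h = np.hstack([grid, side])   # adds a column: shape (2, 4)

print(stacked_v.shape, stacked_h.shape)
```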

Pandas Series

Indexing

Pandas indices are customizable, so it is useful to check them with data.index.

If a Series has an explicit index, its elements can also be accessed by index label.

data = pd.Series([0.25, 0.5, 0.75, 1.0])
# it is like a 1-d array
data[0]

data2 = pd.Series([0.25, 0.5, 0.75, 1.0],
                  index = ['a', 'b', 'c', 'd'])
data2['a']

Difference between loc and iloc

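  • loc refers to the explicit index (labels) defined by the user, which can start from any value. Slices include the end label.
  • iloc refers to the implicit Python-style position, starting from 0. Slices exclude the end position.

import pandas as pd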

data = pd.Series(['a', 'b', 'c'], index = [1,3,5])

# loc
print(data.loc[1]) # label 1
print(data.loc[1:3]) # labels 1 to 3, end label included

# iloc: implicit python style index
print(data.iloc[1]) # 2nd element
print(data.iloc[1:3]) # 2nd and 3rd elements
a
1    a
3    b
dtype: object
b
3    b
5    c
dtype: object

Missing values for arrays and pd.Series

Both None and np.nan are treated as null by pandas.

Detect null values with isnull() and notnull().
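
A minimal sketch of null detection, assuming a small Series with mixed types. The np.isnan call at the end is an addition for plain NumPy float arrays, not something taken from the text above.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 'hello', None])  # mixed types, stored as objects

print(s.isnull())    # True where the value is NaN or None
print(s.notnull())   # the complement of isnull()

# a boolean mask keeps only the non-null values
print(s[s.notnull()])

# in a plain NumPy float array, NaN can be detected with np.isnan
arr = np.array([1.0, np.nan, 3.0])
print(np.isnan(arr))
```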