Pandas DataFrame is covered in a separate section because it is a very big topic. The data structures covered here are:
Lists
Numpy Array (1D, higher-D)
Pandas Series
Generate data
List sequence
Create a list with range()
# create a list directly
[1, 2, 3]
# the following do not print out the lists
range(3)
range(0, 10, 2)
There are a few ways to print out the elements generated by range():
* list comprehension ([function(i) for i in LIST])
* directly listing out (list(LIST))
* for loop, which is more tedious and requires a placeholder list to be created first.
# directly listing out
list(range(0, 10, 2))
# list comprehension
[i for i in range(0, 10, 2)]  # omit the function
# for loop
lst = []
for i in range(0, 10, 2):
    lst.append(i)
np.array, sequences with np.arange()
Distinguish between a list and an array by the way they are created. range() is built in and produces a lazy range object, which becomes a list only when passed to list(). np.arange() is from numpy and produces a numpy array. The arguments are the same: np.arange(start, end, step), where end is excluded.
import numpy as np
np.arange(3)
np.arange(0, 10, 2)
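To make the distinction concrete, the short sketch below checks the types involved. The variable names r and a are only illustrative.

```python
import numpy as np

r = range(0, 10, 2)        # lazy built-in range object, not a list
a = np.arange(0, 10, 2)    # numpy array, values computed immediately

print(type(r).__name__)    # range
print(type(a).__name__)    # ndarray

# Both cover the same values; the end point (10) is excluded in both cases
print(list(r))             # [0, 2, 4, 6, 8]
print(a.tolist())          # [0, 2, 4, 6, 8]

# Unlike range(), np.arange() also accepts non-integer steps
print(np.arange(0, 1, 0.25).tolist())   # [0.0, 0.25, 0.5, 0.75]
```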
Other useful functions to generate a sequence:
# np.linspace(start, end, nelements)
np.linspace(0, 1, 5)  # 0, 0.25, 0.5, 0.75, 1
# repeat the same values
np.zeros(10)
np.ones(5)
Attributes are accessed as obj.size, obj.values, etc. (no parentheses, since they are not methods).
Size, dimension: .size, .shape, .ndim
.values
.index
for dataframe, .columns
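The attributes listed above can be inspected as follows. This is a small sketch; the objects arr, s and df are made up for illustration.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(2, 3)
print(arr.size)    # 6, total number of elements
print(arr.shape)   # (2, 3), length along each dimension
print(arr.ndim)    # 2, number of dimensions

s = pd.Series([0.25, 0.5, 0.75], index=['a', 'b', 'c'])
print(s.values)          # underlying numpy array
print(list(s.index))     # ['a', 'b', 'c']

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
print(list(df.columns))  # ['x', 'y']
```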
Computation
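Axis: The axis argument is quite convenient: axis = 0 computes down the rows, giving one result per column (column-wise), and axis = 1 computes across the columns, giving one result per row (row-wise). This needs to be distinguished from R's apply(), where the margin names the dimension that is kept: apply(matrix, 1, function) performs the operation per row.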
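A small sketch of the axis convention, using a made-up 2 by 3 matrix m:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = m.sum(axis=0)   # collapse the rows: one sum per column
row_sums = m.sum(axis=1)   # collapse the columns: one sum per row

print(col_sums.tolist())   # [5, 7, 9]
print(row_sums.tolist())   # [6, 15]
```

In R terms, m.sum(axis=1) corresponds to apply(m, 1, sum) and m.sum(axis=0) to apply(m, 2, sum).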
Selection
Counting generally starts from 0. Elements are accessed with square brackets.
indexing (obj[name]): for a DataFrame, generally refers to a column
slicing (obj[start:end]): refers to rows
concatenate and splitting
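The difference between indexing and slicing is easiest to see on a small DataFrame. The sketch below uses a made-up df with the default integer index.

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30], 'y': [1, 2, 3]})

col = df['x']     # indexing with a name selects a column (a Series)
rows = df[0:2]    # slicing selects rows, here the first two

print(col.tolist())    # [10, 20, 30]
print(rows.shape)      # (2, 2)
```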
Numpy array
selection
1-D array
x = np.array([1, 2, 3, 4, 5])
x[0]      # first
x[-1]     # last
x[:4]     # slicing: first four elements
x[::2]    # every other element
x[::-1]   # reversing the array
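grid = np.array([[1, 2, 3], [4, 5, 6]])
# concatenate, default is axis=0
np.concatenate([grid, grid])
# along axis 0 (rows are stacked), result is 4 by 3
np.concatenate([grid, grid], axis=0)
# along axis 1 (columns are appended), result is 2 by 6
np.concatenate([grid, grid], axis=1)
# vstack, hstack: vstack needs matching column counts, so only the first three elements of x are used
np.vstack([x[:3], grid])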
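Splitting is the reverse of concatenation. As a minimal sketch, np.split, np.vsplit and np.hsplit take a list of split positions; the arrays are redefined here so the block runs on its own.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# split at positions 2 and 4: elements [0:2], [2:4], [4:]
first, middle, last = np.split(x, [2, 4])
print(first.tolist(), middle.tolist(), last.tolist())   # [1, 2] [3, 4] [5]

grid = np.array([[1, 2, 3], [4, 5, 6]])
upper, lower = np.vsplit(grid, [1])    # split rows after the first row
left, right = np.hsplit(grid, [2])     # split columns after the second column
print(upper.tolist(), lower.tolist())  # [[1, 2, 3]] [[4, 5, 6]]
print(left.tolist(), right.tolist())   # [[1, 2], [4, 5]] [[3], [6]]
```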
Pandas Series
Indexing
Pandas indices are customizable, so it is useful to check them with data.index.
If a Series has an explicit index, its elements can also be accessed by index label.
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])
# it is like a 1-d array
data[0]

# with an explicit index, elements can be accessed by label
data2 = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data2['a']
Difference between loc and iloc
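loc refers to the explicit, user-defined index labels. These can be strings or integers, and integer labels need not start from 0 (for example 1, 3, 5).
iloc refers to the implicit, position-based Python index, which always starts from 0.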
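The difference is clearest when the labels are integers that do not match the positions. The sketch below uses a made-up Series with labels 1, 3, 5.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

print(data.loc[1])     # a, the element whose label is 1
print(data.iloc[1])    # b, the element at position 1

# loc slices by label and includes the end label
print(data.loc[1:3].tolist())    # ['a', 'b']
# iloc slices by position and excludes the end position
print(data.iloc[1:3].tolist())   # ['b', 'c']
```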