NumPy is the foundation of the Python data science stack. Its N-dimensional arrays and vectorized operations run in compiled code and are typically 10-100x faster than equivalent pure Python loops.

Installation and Basics

  pip install numpy   # run in a shell, not in Python

  import numpy as np

  arr = np.array([1, 2, 3, 4, 5])
  print(arr.dtype)    # int64 on most platforms (int32 on Windows before NumPy 2.0)
  print(arr.shape)    # (5,)
  print(arr.ndim)     # 1
  

Creating Arrays

  np.zeros((3, 4))           # 3x4 array of zeros
  np.ones((2, 3))            # 2x3 array of ones
  np.full((2, 2), 7)         # fill with 7
  np.eye(3)                  # 3x3 identity matrix
  np.arange(0, 10, 2)        # [0, 2, 4, 6, 8]
  np.linspace(0, 1, 5)       # 5 evenly spaced values from 0 to 1, endpoints included
  np.random.rand(3, 3)       # uniform random [0, 1)
  np.random.randn(3, 3)      # standard normal
  np.random.randint(0, 100, size=(3, 3))   # random integers in [0, 100)
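
The np.random.rand / randn / randint functions above belong to NumPy's legacy random interface. NumPy's documentation recommends the Generator API for new code, so a rough equivalent is sketched below. The seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)      # seed value is arbitrary

uniform = rng.random((3, 3))              # uniform on [0, 1)
normal = rng.standard_normal((3, 3))      # mean 0, standard deviation 1
ints = rng.integers(0, 100, size=(3, 3))  # integers in [0, 100)

# A generator created with the same seed reproduces the same stream
again = np.random.default_rng(seed=42).random((3, 3))
print(np.array_equal(uniform, again))     # True
```

Passing a seed is optional; leaving it out gives fresh, unpredictable values on every run.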
  

Array Operations — Vectorized

  a = np.array([1, 2, 3, 4])
  b = np.array([10, 20, 30, 40])

  a + b          # [11, 22, 33, 44]
  a * 2          # [2, 4, 6, 8]
  a ** 2         # [1, 4, 9, 16]
  np.sqrt(a)     # element-wise sqrt
  np.sin(a)      # element-wise sin
  

No explicit loops are needed: each operation is applied to every element in compiled code rather than in the Python interpreter.

Multi-Dimensional Arrays

  matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
  print(matrix.shape)   # (3, 3)
  print(matrix[1, 2])   # 6
  print(matrix[:, 1])   # [2, 5, 8], the second column
  print(matrix[1, :])   # [4, 5, 6], the second row
  

Broadcasting

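When operand shapes differ, NumPy broadcasts them: shapes are compared from the trailing dimension, and a dimension of size 1 (or a missing leading dimension) is stretched to match the other operand: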

  matrix = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
  row = np.array([10, 20, 30])                # shape (3,)

  matrix + row  # broadcasts row to each row of matrix
  # [[11, 22, 33], [14, 25, 36]]
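
To illustrate how shapes are matched, the sketch below adds a column vector to the same matrix and then tries a pair of shapes that cannot be broadcast. The names col and bad, and their values, are made up for this example.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
col = np.array([[100], [200]])              # shape (2, 1), illustrative data

# (2, 3) + (2, 1): the size-1 axis of col is stretched across the 3 columns
shifted = matrix + col
print(shifted)   # rows become [101, 102, 103] and [204, 205, 206]

# Shapes are compared from the trailing dimension: 3 vs 2 cannot be matched
bad = np.array([10, 20])                    # shape (2,)
try:
    matrix + bad
except ValueError as exc:
    print("cannot broadcast:", exc)
```

Reshaping bad to (2, 1), for example with bad[:, np.newaxis], would make the addition valid again.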
  

Aggregations

  data = np.array([[1, 2, 3], [4, 5, 6]])

  data.sum()          # 21
  data.mean()         # 3.5
  data.max()          # 6
  data.sum(axis=0)    # [5, 7, 9]  column sums
  data.sum(axis=1)    # [6, 15]    row sums
  data.std()          # standard deviation
  np.median(data)     # median
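
The axis argument names the dimension that is collapsed, which is easy to get backwards. The following sketch uses the same data array and also shows keepdims=True, which keeps the reduced axis so the result can be combined with the original. The variable names are illustrative.

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6]])

# axis=0 collapses the rows, leaving one value per column
col_means = data.mean(axis=0)                 # shape (3,)

# keepdims=True keeps the reduced axis as size 1, so the result broadcasts back
row_means = data.mean(axis=1, keepdims=True)  # shape (2, 1)

centered = data - row_means                   # each row now has mean 0
print(col_means)   # column means: 2.5, 3.5, 4.5
print(centered)    # both rows become -1, 0, 1
```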
  

Boolean Indexing

  arr = np.array([1, 5, 3, 8, 2, 9, 4])
  arr[arr > 4]                 # [5, 8, 9]
  arr[(arr > 2) & (arr < 8)]   # [5, 3, 4]

  matrix = np.random.randint(0, 10, size=(4, 4))
  matrix[matrix % 2 == 0] = 0  # zero out even values
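
A comparison such as arr > 4 produces an ordinary boolean array, which can be stored, counted, and negated. The sketch below shows this, along with the error raised when Python's and is used in place of &. The name mask is an illustrative choice.

```python
import numpy as np

arr = np.array([1, 5, 3, 8, 2, 9, 4])

mask = arr > 4                     # boolean array, same shape as arr
print(mask.sum())                  # 3, since True counts as 1
print(arr[~mask])                  # the complement: values 1, 3, 2, 4

# Combine conditions with & and | inside parentheses;
# Python's `and` / `or` raise because an array has no single truth value
try:
    arr[(arr > 2) and (arr < 8)]
except ValueError as exc:
    print("ambiguous truth value:", exc)
```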
  

Linear Algebra

  A = np.array([[1, 2], [3, 4]])
  B = np.array([[5, 6], [7, 8]])

  np.dot(A, B)         # matrix multiplication
  A @ B                # same (Python 3.5+)
  np.linalg.inv(A)     # inverse
  np.linalg.det(A)     # determinant
  np.linalg.eig(A)     # eigenvalues and eigenvectors

  # Solve Ax = b
  b = np.array([1, 2])
  x = np.linalg.solve(A, b)
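
Results from np.linalg are floating-point, so it is usually safer to check them with a tolerance than to compare for exact equality. A minimal sketch, reusing A and b from above (the variable names for the eigen-decomposition are illustrative):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
b = np.array([1, 2])

x = np.linalg.solve(A, b)
# Check the residual with a tolerance instead of exact float equality
print(np.allclose(A @ x, b))     # True

# eig returns the eigenvalues and a matrix whose columns are the eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))   # True
```

np.linalg.solve is generally preferable to computing np.linalg.inv(A) @ b, since it avoids forming the inverse explicitly.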
  

Reshaping

  arr = np.arange(12)
  m = arr.reshape(3, 4)   # 3x4 matrix; arr itself stays 1D
  arr.reshape(2, 3, 2)    # 3D array
  m.flatten()             # back to 1D (returns a copy)
  m.T                     # transpose, shape (4, 3)
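
Two details of reshape are worth seeing in action: a dimension given as -1 is inferred from the array size, and the result shares memory with the original whenever possible. The sketch below checks both with np.shares_memory; the name m is illustrative.

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to infer that dimension from the total size (12 / 3 = 4)
m = arr.reshape(3, -1)
print(m.shape)                             # (3, 4)

# reshape and ravel return views when they can; flatten always copies
print(np.shares_memory(arr, m))            # True
print(np.shares_memory(m, m.ravel()))      # True
print(np.shares_memory(m, m.flatten()))    # False

# An incompatible shape raises instead of silently truncating
try:
    arr.reshape(5, 3)
except ValueError as exc:
    print("reshape failed:", exc)
```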
  

Performance: NumPy vs Python

  import time

  size = 1_000_000
  python_list = list(range(size))
  numpy_arr = np.arange(size)

  # perf_counter is a monotonic, high-resolution clock meant for timing
  start = time.perf_counter()
  result = [x ** 2 for x in python_list]
  print(f"Python list: {time.perf_counter() - start:.4f}s")

  start = time.perf_counter()
  result = numpy_arr ** 2
  print(f"NumPy array: {time.perf_counter() - start:.4f}s")
  # NumPy is typically 10-100x faster
  

NumPy is the building block for Pandas, Scikit-learn, Matplotlib, and virtually all scientific Python libraries.

Data Types (dtype)

NumPy arrays are homogeneous — all elements share one type:

  arr = np.array([1, 2, 3], dtype=np.float64)
  arr_int = np.array([1.7, 2.3, 3.9], dtype=np.int32)  # truncates to [1, 2, 3]

  # Memory-efficient types for large datasets
  big = np.zeros(1_000_000, dtype=np.float32)  # half the memory of float64
  

Common dtypes: int32, int64, float32, float64, bool, complex128.
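
The memory claim can be checked directly through the itemsize and nbytes attributes. The sketch below also shows two dtype behaviors that tend to surprise: float-to-int conversion truncates toward zero, and fixed-width integers wrap on overflow. All array names and values here are illustrative.

```python
import numpy as np

small = np.zeros(1_000_000, dtype=np.float32)
large = np.zeros(1_000_000, dtype=np.float64)
print(small.itemsize, small.nbytes)   # 4 4000000
print(large.itemsize, large.nbytes)   # 8 8000000

# astype returns a converted copy; float -> int drops the fractional part
prices = np.array([1.7, 2.3, -3.9])
as_int = prices.astype(np.int32)
print(as_int)                         # values 1, 2, -3

# Fixed-width integers wrap around on overflow rather than growing
counter = np.array([250], dtype=np.uint8)
wrapped = counter + np.uint8(10)      # 260 does not fit in uint8
print(wrapped)                        # value 4 (260 modulo 256)
```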

Advanced Indexing

  arr = np.arange(10)

  # Fancy indexing: select specific positions
  arr[[0, 3, 7]]           # [0, 3, 7]

  matrix = np.arange(12).reshape(3, 4)
  rows = [0, 2]
  cols = [1, 3]
  matrix[rows, cols]       # elements at (0,1) and (2,3)

  # np.where: conditional selection
  arr = np.array([1, 5, 3, 8, 2])
  np.where(arr > 4, arr, 0)  # keep values > 4, else 0 → [0, 5, 0, 8, 0]
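
Note that matrix[rows, cols] pairs the two index lists element by element; it does not select the full block of rows 0 and 2 by columns 1 and 3. A sketch of the difference, using np.ix_ to get the block (the names paired and block are illustrative):

```python
import numpy as np

matrix = np.arange(12).reshape(3, 4)
rows = [0, 2]
cols = [1, 3]

# Paired indexing: picks (0, 1) and (2, 3)
paired = matrix[rows, cols]            # values 1 and 11

# np.ix_ builds an open mesh: every row in rows crossed with every column in cols
block = matrix[np.ix_(rows, cols)]     # 2x2 sub-matrix [[1, 3], [9, 11]]

# Fancy indexing returns a copy, so writing to block leaves matrix untouched
block[0, 0] = -1
print(matrix[0, 1])                    # still 1
```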
  

Stacking and Splitting

  a = np.array([1, 2, 3])
  b = np.array([4, 5, 6])

  np.vstack([a, b])    # vertical stack → 2x3
  np.hstack([a, b])    # horizontal → [1, 2, 3, 4, 5, 6]
  np.concatenate([a, b])

  big = np.arange(12)
  np.split(big, 3)     # three equal arrays of length 4
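
For 2D inputs, the axis argument of np.concatenate decides whether rows or columns are appended, and np.split only works when the array divides evenly. The sketch below covers both cases; np.array_split is used for the uneven split. The arrays top and bottom are illustrative.

```python
import numpy as np

top = np.array([[1, 2, 3], [4, 5, 6]])      # shape (2, 3)
bottom = np.array([[7, 8, 9]])              # shape (1, 3)

# axis=0 appends rows; the column counts must match
tall = np.concatenate([top, bottom], axis=0)    # shape (3, 3)

# axis=1 appends columns; the row counts must match
wide = np.concatenate([top, top], axis=1)       # shape (2, 6)

# np.split requires an even division; np.array_split tolerates a remainder
parts = np.array_split(np.arange(10), 3)
print([p.tolist() for p in parts])   # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```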
  

Saving and Loading Arrays

  arr = np.random.rand(1000, 10)

  # Binary format: fast, preserves dtype
  np.save("data.npy", arr)
  loaded = np.load("data.npy")

  # Compressed
  np.savez_compressed("data.npz", train=arr[:800], test=arr[800:])

  # Text (human-readable, slower)
  np.savetxt("data.csv", arr, delimiter=",")
  text = np.loadtxt("data.csv", delimiter=",")
  

For production pipelines, prefer .npy/.npz over CSV for raw numeric arrays.
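
Reading an .npz archive back works a little differently from a single .npy file: np.load returns a dict-like object keyed by the names passed when saving. A minimal sketch, which writes to a temporary directory (an assumption made here so the example cleans up after itself):

```python
import os
import tempfile

import numpy as np

arr = np.random.rand(1000, 10)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.npz")
    np.savez_compressed(path, train=arr[:800], test=arr[800:])

    # np.load on an .npz returns a lazy, dict-like NpzFile; close it when done
    with np.load(path) as archive:
        names = sorted(archive.files)     # ['test', 'train']
        train = archive["train"]
        test = archive["test"]

print(names, train.shape, test.shape)
# The binary round trip reproduces the values exactly
print(np.array_equal(train, arr[:800]))   # True
```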

Universal Functions (ufuncs)

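Universal functions (ufuncs) apply element-wise to entire arrays:

  a = np.array([1, 2, 3])
  b = np.array([4, 5, 6])
  arr = np.array([-5, 0, 3, 20])

  np.add(a, b)         # same as a + b
  np.maximum(a, b)     # element-wise max
  np.clip(arr, 0, 10)  # values below 0 become 0, above 10 become 10
  np.log1p(a)          # log(1 + a), stays finite where a is 0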
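
Beyond being callable, binary ufuncs such as np.add and np.maximum carry methods like reduce, accumulate, and outer, and accept an out= argument. The sketch below is a brief tour with illustrative inputs.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# reduce folds the ufunc along an axis, much like a.sum() for np.add
total = np.add.reduce(a)                  # 6

# accumulate keeps the intermediate results (a running maximum here)
running_max = np.maximum.accumulate(np.array([3, 1, 4, 1, 5]))   # [3, 3, 4, 4, 5]

# outer applies the ufunc to every pair of elements
table = np.multiply.outer(a, b)           # shape (3, 3)

# out= writes into an existing array instead of allocating a new one
result = np.empty_like(a)
np.add(a, b, out=result)                  # result is now [5, 7, 9]
```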
  

Common Pitfalls

  1. Views vs copies: basic slicing returns a view, so modifying the slice also modifies the original (fancy and boolean indexing return copies)
  2. Wrong axis: on a 2D array, sum(axis=0) gives column sums while sum(axis=1) gives row sums
  3. Mixing lists and arrays: convert early with np.array(my_list)
  4. Looping over rows: prefer vectorized operations; Python loops defeat NumPy’s purpose
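
The first and last pitfalls are easiest to see in code. The sketch below shows a slice modifying its parent array, an explicit copy that does not, and a row loop next to its vectorized equivalent. All names are illustrative.

```python
import numpy as np

original = np.arange(6)

# Pitfall 1: a basic slice is a view onto the same memory
view = original[:3]
view[0] = 99
print(original[0])            # 99, the original changed too

# An explicit copy is independent
independent = original[:3].copy()
independent[1] = -1
print(original[1])            # still 1

# Pitfall 4: a per-row Python loop versus one vectorized call
matrix = np.arange(12).reshape(3, 4)
looped = np.array([row.sum() for row in matrix])
vectorized = matrix.sum(axis=1)
print(np.array_equal(looped, vectorized))   # True
```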

Next: Pandas builds on NumPy for labeled tabular data.