Performance Optimization
Profile and optimize Python code with cProfile, line_profiler, caching, algorithmic improvements, and C extensions for bottlenecks.
Python prioritizes developer productivity over raw speed. When performance matters, measure first, then optimize the right bottlenecks.
Rule #1: Profile Before Optimizing
Never guess where your code is slow. Use profilers:
cProfile — Function-Level Profiling
import cProfile
import pstats

def slow_function():
    total = 0
    for i in range(1_000_000):
        total += i ** 2
    return total

cProfile.run('slow_function()', 'profile.stats')
stats = pstats.Stats('profile.stats')
stats.sort_stats('cumulative').print_stats(10)
Run from CLI:
python -m cProfile -s cumulative your_script.py
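When only one call needs profiling, it can be convenient to keep the stats in memory rather than writing a file. The following is a minimal sketch of that idea; `profile_call` is a hypothetical helper name, not part of the standard library, and the loop size is reduced to keep the run short.

```python
import cProfile
import io
import pstats


def slow_function():
    total = 0
    for i in range(200_000):
        total += i ** 2
    return total


def profile_call(func, top=5):
    """Profile one call to func and return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func()
    finally:
        # Always stop profiling, even if func raises
        profiler.disable()

    # Send the report to a string buffer instead of stdout
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(slow_function)
print(report)
```

Returning the report as text makes it easy to log it or attach it to a bug report.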
timeit — Micro-Benchmarks
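import timeit

# Wrap both in sum() so the generator is actually consumed;
# a bare generator expression only measures creating the generator object.
time_list = timeit.timeit(
    "sum([x**2 for x in range(1000)])",
    number=10000
)
time_gen = timeit.timeit(
    "sum(x**2 for x in range(1000))",
    number=10000
)
print(f"List comp: {time_list:.4f}s, Generator: {time_gen:.4f}s")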
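A single timeit run can be skewed by whatever else the machine is doing. One common approach, sketched here, is to repeat the measurement and keep the fastest run. `best_time` is a hypothetical helper, and the repeat and number values are arbitrary choices.

```python
import timeit


def best_time(stmt, repeat=5, number=1000):
    """Return the fastest per-call time in seconds across several runs."""
    runs = timeit.repeat(stmt, repeat=repeat, number=number)
    # The minimum is the run least disturbed by other processes
    return min(runs) / number


list_time = best_time("sum([x**2 for x in range(1000)])")
gen_time = best_time("sum(x**2 for x in range(1000))")
print(f"list: {list_time * 1e6:.1f} µs, generator: {gen_time * 1e6:.1f} µs")
```

The absolute numbers depend on the machine and Python version, so compare them only against each other.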
Algorithmic Optimization
The biggest wins come from better algorithms, not faster loops:
# O(n²): slow for large inputs (each slice also copies the rest of the list)
def has_duplicate_slow(items):
    for i, a in enumerate(items):
        for b in items[i+1:]:
            if a == b:
                return True
    return False

# O(n) on average: use a set (items must be hashable)
def has_duplicate_fast(items):
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
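To see the difference in growth rather than take it on faith, the two approaches can be timed on inputs of increasing size. This is a rough sketch: the functions are repeated so the block runs on its own, `time_call` is a hypothetical helper, and the input sizes are arbitrary.

```python
import time


def has_duplicate_slow(items):
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if a == b:
                return True
    return False


def has_duplicate_fast(items):
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False


def time_call(func, items):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(items)
    return result, time.perf_counter() - start


for n in (500, 1000, 2000):
    data = list(range(n))  # no duplicates, so both functions scan everything
    _, slow = time_call(has_duplicate_slow, data)
    _, fast = time_call(has_duplicate_fast, data)
    print(f"n={n}: slow={slow:.4f}s fast={fast:.4f}s")
```

Doubling n should roughly quadruple the slow version's time while only about doubling the fast version's, though exact figures vary by machine.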
Built-in Optimizations
Use Built-in Functions and Libraries
In CPython, built-ins such as sum() are implemented in C and usually beat an equivalent Python-level loop:
# Slow
total = 0
for x in data:
    total += x

# Fast
total = sum(data)
NumPy and pandas run their core loops in compiled code, and itertools is implemented in C in CPython. Use them for numerical and iteration-heavy work.
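As a small illustration using only the standard library, the sketch below replaces two hand-written loops with itertools helpers. The sample data is made up for the example.

```python
from itertools import accumulate, chain

rows = [[1, 2, 3], [4, 5], [6]]

# Flatten nested lists without writing a nested loop
flat = list(chain.from_iterable(rows))

# Running totals without a manual accumulator variable
running = list(accumulate(flat))

print(flat)          # [1, 2, 3, 4, 5, 6]
print(running)       # [1, 3, 6, 10, 15, 21]
print(sum(flat), max(flat))  # 21 6
```

Both helpers return lazy iterators, so the list() calls are only there to display the results.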
List Comprehensions vs Loops
List comprehensions are generally faster than equivalent for loops because they avoid looking up and calling list.append on every iteration:
# Prefer
squares = [x**2 for x in range(1000)]

# Over
squares = []
for x in range(1000):
    squares.append(x**2)
Generators for Large Data
Generators yield one item at a time, so memory use stays roughly constant instead of growing with the size of a full list:
def read_large_file(path):
    with open(path) as f:
        for line in f:
            yield line.strip()
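The memory difference can be checked roughly with sys.getsizeof. This is only a sketch: `squares` is a hypothetical generator, and getsizeof reports the size of the container object itself, not the integers it refers to.

```python
import sys


def squares(n):
    """Yield i*i for i in range(n), one value at a time."""
    for i in range(n):
        yield i * i


as_list = [i * i for i in range(100_000)]
as_gen = squares(100_000)

list_size = sys.getsizeof(as_list)  # grows with the number of items
gen_size = sys.getsizeof(as_gen)    # small and independent of n

print(f"list: {list_size} bytes, generator: {gen_size} bytes")

# The generator still produces every value when consumed
total = sum(as_gen)
```

A generator can be consumed only once, so build a list when the data must be iterated repeatedly.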
Caching with functools.lru_cache
Memoize expensive pure functions:
from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

print(fibonacci(100))  # instant
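Functions wrapped with lru_cache expose cache_info() and cache_clear(), which help confirm that the cache is being hit. A short sketch, with the function repeated so the block runs on its own:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)


value = fibonacci(30)

# hits = calls answered from the cache, misses = calls actually computed
info = fibonacci.cache_info()
print(value, info)

# Drop all cached results, for example between independent test cases
fibonacci.cache_clear()
```

With maxsize=None the cache grows without bound, so a finite maxsize may be the safer choice for functions called with many distinct arguments.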
__slots__ for Memory
Reduce memory per instance when creating millions of objects:
class Point:
    __slots__ = ('x', 'y')

    def __init__(self, x, y):
        self.x = x
        self.y = y
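The saving comes from dropping the per-instance `__dict__`, which also means attributes not listed in `__slots__` cannot be added later. The sketch below shows both effects; `PointDict` and `PointSlots` are hypothetical names chosen for the comparison.

```python
class PointDict:
    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointSlots:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


regular = PointDict(1, 2)
slotted = PointSlots(1, 2)

print(hasattr(regular, "__dict__"))  # True: attributes live in a dict
print(hasattr(slotted, "__dict__"))  # False: attributes live in fixed slots

# Assigning an attribute that is not declared in __slots__ fails
slot_error = None
try:
    slotted.z = 3
except AttributeError as exc:
    slot_error = exc
print(type(slot_error).__name__)  # AttributeError
```

This trade-off suits small record-like classes, but it can get in the way of code that relies on dynamic attributes.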
When to Reach for C/Rust Extensions
If profiling shows a specific hot loop that can’t be vectorized, consider a compiled extension:
- Cython — compile Python-like code to C
- PyO3 / maturin — write Rust extensions
- Numba — JIT compile numerical functions
from numba import jit

@jit(nopython=True)
def fast_sum(arr):
    total = 0.0
    for x in arr:
        total += x
    return total
Optimization Checklist
- Measure with cProfile or py-spy
- Fix algorithms — O(n²) → O(n log n) beats micro-optimizations
- Use the right data structure — set/dict for lookups, deque for queues
- Leverage libraries — NumPy, pandas, orjson
- Cache repeated pure computations
- Parallelize CPU work with multiprocessing
- Only then consider C extensions
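The data-structure item on the checklist can be illustrated with a short sketch: a deque for first-in, first-out processing and a set for membership checks. The job names and user names are made up for the example.

```python
from collections import deque

# deque.popleft() is O(1); list.pop(0) shifts every remaining element
queue = deque()
for job in ("parse", "transform", "write"):
    queue.append(job)

processed = []
while queue:
    processed.append(queue.popleft())

print(processed)  # ['parse', 'transform', 'write']

# Set membership is O(1) on average; list membership scans the whole list
allowed_users = {"alice", "bob", "carol"}
requests = ["bob", "mallory", "alice"]
accepted = [user for user in requests if user in allowed_users]

print(accepted)  # ['bob', 'alice']
```

For a handful of items the difference is negligible, so the choice matters mainly once profiling shows these operations on a hot path.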
Premature optimization wastes time. Profile-driven optimization delivers real results.