Python & Algorithms for ML
Array & Matrix Manipulation
Why This Matters
ML data is fundamentally multi-dimensional. Every ML interview will test your ability to manipulate arrays and matrices efficiently. Master these patterns and you'll solve 60% of ML coding problems faster.
Core Matrix Operations
1. Transpose and Reshape
Transpose - Swap Rows and Columns:
import numpy as np
# Transpose swaps dimensions
X = np.array([[1, 2, 3],
[4, 5, 6]]) # Shape: (2, 3)
X_T = X.T # Shape: (3, 2)
# [[1, 4],
# [2, 5],
# [3, 6]]
# Common use: Matrix multiplication
# If X is (n_samples, n_features), X.T is (n_features, n_samples)
covariance = X.T @ X # (n_features, n_features)
Reshape - Change Dimensions:
# Flatten to 1D
matrix = np.array([[1, 2], [3, 4], [5, 6]]) # (3, 2)
flat = matrix.reshape(-1) # (6,) or matrix.flatten()
# [1, 2, 3, 4, 5, 6]
# Reshape to different dimensions
arr = np.arange(12) # (12,)
matrix_3x4 = arr.reshape(3, 4) # (3, 4)
matrix_2x6 = arr.reshape(2, 6) # (2, 6)
# Use -1 to infer dimension
matrix_auto = arr.reshape(3, -1) # (3, 4) - infers 4
Interview Question:
"You have a flattened 1D array of 784 elements representing a 28x28 image. Convert it back to image format."
def array_to_image(flat_arr):
"""
Convert 784-element array to 28x28 image
Time: O(1) - reshaping is a view, no copy
Space: O(1)
"""
return flat_arr.reshape(28, 28)
# Test
flat = np.arange(784)
image = array_to_image(flat)
print(image.shape) # (28, 28)
2. Aggregations and Statistics
Axis-Aware Operations:
X = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# axis=0: Down columns (column-wise)
col_sums = X.sum(axis=0) # [12, 15, 18]
col_means = X.mean(axis=0) # [4., 5., 6.]
# axis=1: Across rows (row-wise)
row_sums = X.sum(axis=1) # [6, 15, 24]
row_means = X.mean(axis=1) # [2., 5., 8.]
# No axis: Entire array
total = X.sum() # 45
Keep Dimensions for Broadcasting:
# Problem: Subtract row means from each row
X = np.array([[1, 2, 3], [4, 5, 6]])
# Wrong: This creates 1D array
row_means = X.mean(axis=1) # Shape: (2,)
# centered = X - row_means # Shape mismatch error!
# Correct: Keep dimensions
row_means = X.mean(axis=1, keepdims=True) # Shape: (2, 1)
centered = X - row_means # Broadcasting works!
# [[-1, 0, 1],
# [-1, 0, 1]]
Common Statistical Functions:
X = np.array([[1, 2, 3], [4, 5, 6]])
# Basic stats
X.min() # 1
X.max() # 6
X.mean() # 3.5
X.std() # Standard deviation
X.var() # Variance
# Percentiles
np.percentile(X, 50) # Median
np.percentile(X, [25, 75]) # Quartiles
# Along axis
X.argmax(axis=0) # Index of max in each column
X.argmin(axis=1) # Index of min in each row
3. Slicing and Indexing
Basic Slicing:
X = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]])
# Get first 2 rows
first_two = X[:2, :] # [[1,2,3,4], [5,6,7,8]]
# Get last 2 columns
last_two_cols = X[:, -2:] # [[3,4], [7,8], [11,12]]
# Get specific row and column range
sub = X[1:, 1:3] # [[6,7], [10,11]]
Boolean Indexing:
X = np.array([[1, -2, 3], [-4, 5, -6]])
# Select positive values
positive = X[X > 0] # [1, 3, 5]
# Replace negative values with 0
X[X < 0] = 0
# [[1, 0, 3], [0, 5, 0]]
# Row-wise filtering: Keep rows where first column > 0
condition = X[:, 0] > 0
filtered = X[condition]
Fancy Indexing:
X = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# Select specific rows and columns
rows = [0, 2]
cols = [1, 2]
sub = X[rows, :][:, cols] # [[2, 3], [8, 9]]
# Advanced: Select diagonal
indices = np.arange(3)
diagonal = X[indices, indices] # [1, 5, 9]
Interview Question:
"Given a 2D array of test scores (students × subjects), return students who scored above average in all subjects."
def high_performers(scores):
"""
scores: (n_students, n_subjects) array
Returns: indices of students who beat average in ALL subjects
Time: O(n * m) where n=students, m=subjects
Space: O(n * m) for boolean array
"""
# Calculate mean for each subject (column)
subject_means = scores.mean(axis=0, keepdims=True) # (1, n_subjects)
# Check if each student beats mean in each subject
above_avg = scores > subject_means # (n_students, n_subjects)
# Keep students who beat average in ALL subjects
all_above = above_avg.all(axis=1) # (n_students,)
return np.where(all_above)[0]
# Test
scores = np.array([[85, 90, 88], # Student 0: above avg in all
[70, 75, 72], # Student 1: not all above avg
[95, 92, 94]]) # Student 2: above avg in all
print(high_performers(scores)) # [0, 2]
4. Matrix Multiplication
Dot Product and Matrix Multiplication:
# Vectors: Dot product
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
dot = np.dot(a, b) # 1*4 + 2*5 + 3*6 = 32
# Matrices: Matrix multiplication
A = np.array([[1, 2], [3, 4]]) # (2, 2)
B = np.array([[5, 6], [7, 8]]) # (2, 2)
# Method 1: np.dot()
C = np.dot(A, B)
# Method 2: @ operator (Python 3.5+)
C = A @ B
# Result: [[19, 22], [43, 50]]
Broadcasting in Matrix Operations:
# Add bias to each sample
X = np.array([[1, 2], [3, 4], [5, 6]]) # (3, 2) - 3 samples, 2 features
bias = np.array([10, 20]) # (2,)
result = X + bias # Broadcasting: (3, 2) + (2,) = (3, 2)
# [[11, 22], [13, 24], [15, 26]]
Linear Regression with Matrices:
def fit_linear_regression(X, y):
"""
Fit linear regression using normal equation
X: (n_samples, n_features)
y: (n_samples,)
Returns: weights (n_features,)
Formula: w = (X^T X)^{-1} X^T y
"""
# Add intercept term
ones = np.ones((X.shape[0], 1))
X_with_intercept = np.hstack([ones, X])
# Normal equation
XTX = X_with_intercept.T @ X_with_intercept
XTy = X_with_intercept.T @ y
weights = np.linalg.inv(XTX) @ XTy
return weights
# Test
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])
weights = fit_linear_regression(X, y)
print(weights) # [~0, ~2] - intercept, slope
5. Concatenation and Stacking
Horizontal and Vertical Stacking:
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
# Vertical stack (add rows)
V = np.vstack([A, B]) # [[1,2], [3,4], [5,6], [7,8]] - Shape: (4, 2)
# Horizontal stack (add columns)
H = np.hstack([A, B]) # [[1,2,5,6], [3,4,7,8]] - Shape: (2, 4)
# General concatenation
cat_rows = np.concatenate([A, B], axis=0) # Same as vstack
cat_cols = np.concatenate([A, B], axis=1) # Same as hstack
Practical Use: Train/Test Split:
def create_batches(X, y, batch_size):
"""
Split data into batches
X: (n_samples, n_features)
y: (n_samples,)
Returns: list of (X_batch, y_batch) tuples
"""
n_samples = X.shape[0]
batches = []
for i in range(0, n_samples, batch_size):
end = min(i + batch_size, n_samples)
X_batch = X[i:end]
y_batch = y[i:end]
batches.append((X_batch, y_batch))
return batches
# Test
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
batches = create_batches(X, y, batch_size=3)
print(len(batches)) # 4 batches (3+3+3+1)
6. Common ML Preprocessing Patterns
Min-Max Normalization:
def min_max_normalize(X):
"""
Scale features to [0, 1] range
X: (n_samples, n_features)
Returns: normalized X
"""
min_val = X.min(axis=0, keepdims=True)
max_val = X.max(axis=0, keepdims=True)
return (X - min_val) / (max_val - min_val)
# Test
X = np.array([[1, 200], [2, 300], [3, 400]])
normalized = min_max_normalize(X)
# [[0., 0.], [0.5, 0.5], [1., 1.]]
One-Hot Encoding:
def one_hot_encode(labels, num_classes):
"""
Convert class labels to one-hot vectors
labels: (n_samples,) array of class indices
num_classes: total number of classes
Returns: (n_samples, num_classes) array
"""
n_samples = len(labels)
one_hot = np.zeros((n_samples, num_classes))
one_hot[np.arange(n_samples), labels] = 1
return one_hot
# Test
labels = np.array([0, 2, 1, 0])
one_hot = one_hot_encode(labels, num_classes=3)
# [[1, 0, 0],
# [0, 0, 1],
# [0, 1, 0],
# [1, 0, 0]]
Missing Value Imputation:
def impute_mean(X):
"""
Replace NaN values with column mean
X: (n_samples, n_features) array with NaN values
Returns: X with NaN replaced by column means
"""
# Create a copy to avoid modifying original
X_imputed = X.copy()
# For each column
for col in range(X.shape[1]):
# Get non-NaN values
col_data = X[:, col]
mask = ~np.isnan(col_data)
# Calculate mean of non-NaN values
col_mean = col_data[mask].mean()
# Replace NaN with mean
X_imputed[~mask, col] = col_mean
return X_imputed
# Test
X = np.array([[1, 2, np.nan],
[3, np.nan, 6],
[5, 4, 7]])
imputed = impute_mean(X)
# [[1, 2, 6.5],
# [3, 3, 6],
# [5, 4, 7]]
Interview Problem: Distance Matrix
Problem: Given N points in D-dimensional space, compute the pairwise Euclidean distance matrix.
def pairwise_distances(X):
"""
Compute pairwise Euclidean distances
X: (n_samples, n_features)
Returns: (n_samples, n_samples) distance matrix
Distance between i and j: sqrt(sum((X[i] - X[j])^2))
Efficient method using broadcasting:
||x - y||^2 = ||x||^2 + ||y||^2 - 2*x·y
"""
# Compute squared norms for each point
# X^2 summed across features: (n_samples, 1)
X_squared = (X ** 2).sum(axis=1, keepdims=True)
# Dot product matrix: (n_samples, n_samples)
XXT = X @ X.T
# Use formula: ||xi - xj||^2 = ||xi||^2 + ||xj||^2 - 2*xi·xj
distances_squared = X_squared + X_squared.T - 2 * XXT
# Take square root (use maximum to avoid negative values from numerical errors)
distances = np.sqrt(np.maximum(distances_squared, 0))
return distances
# Test
X = np.array([[0, 0], [3, 4], [1, 1]])
D = pairwise_distances(X)
# [[0., 5., 1.414],
# [5., 0., 3.606],
# [1.414, 3.606, 0.]]
# Verify: distance between [0,0] and [3,4] = sqrt(9+16) = 5 ✓
Performance Tips
1. Vectorize Instead of Loops:
# Slow: Loop
result = []
for i in range(len(arr)):
result.append(arr[i] * 2)
# Fast: Vectorized
result = arr * 2
2. Avoid Unnecessary Copies:
# Creates copy
arr_copy = arr.reshape(-1)
# Returns view (no copy)
arr_view = arr.ravel()
3. Use axis Parameters:
# Slow: Loop over rows
sums = [row.sum() for row in X]
# Fast: Vectorized
sums = X.sum(axis=1)
Key Takeaways
- Master axis parameter - Understand axis=0 (columns) vs axis=1 (rows)
- Use keepdims=True - Prevents dimension mismatch in broadcasting
- Boolean indexing is powerful - Filter data without loops
- Vectorize operations - 10-100x faster than loops
- Understand views vs copies - Avoid unnecessary memory allocation
What's Next?
In the next lesson, we'll cover common algorithm patterns (two pointers, sliding window, binary search) adapted for ML interview contexts.
:::