Array & Matrix Manipulation

Why This Matters

ML data is fundamentally multi-dimensional. Every ML interview will test your ability to manipulate arrays and matrices efficiently. Master these patterns and you'll solve 60% of ML coding problems faster.

Core Matrix Operations

1. Transpose and Reshape

Transpose - Swap Rows and Columns:

import numpy as np

# Transpose swaps dimensions
X = np.array([[1, 2, 3],
              [4, 5, 6]])  # Shape: (2, 3)

X_T = X.T                   # Shape: (3, 2)
# [[1, 4],
#  [2, 5],
#  [3, 6]]

# Common use: Matrix multiplication
# If X is (n_samples, n_features), X.T is (n_features, n_samples)
covariance = X.T @ X  # (n_features, n_features)

Reshape - Change Dimensions:

# Flatten to 1D
matrix = np.array([[1, 2], [3, 4], [5, 6]])  # (3, 2)
flat = matrix.reshape(-1)                    # (6,) or matrix.flatten()
# [1, 2, 3, 4, 5, 6]

# Reshape to different dimensions
arr = np.arange(12)              # (12,)
matrix_3x4 = arr.reshape(3, 4)   # (3, 4)
matrix_2x6 = arr.reshape(2, 6)   # (2, 6)

# Use -1 to infer dimension
matrix_auto = arr.reshape(3, -1)  # (3, 4) - infers 4

Interview Question:

"You have a flattened 1D array of 784 elements representing a 28x28 image. Convert it back to image format."

def array_to_image(flat_arr):
    """
    Convert 784-element array to 28x28 image

    Time: O(1) - reshaping is a view, no copy
    Space: O(1)
    """
    return flat_arr.reshape(28, 28)

# Test
flat = np.arange(784)
image = array_to_image(flat)
print(image.shape)  # (28, 28)

2. Aggregations and Statistics

Axis-Aware Operations:

X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# axis=0: Down columns (column-wise)
col_sums = X.sum(axis=0)     # [12, 15, 18]
col_means = X.mean(axis=0)   # [4., 5., 6.]

# axis=1: Across rows (row-wise)
row_sums = X.sum(axis=1)     # [6, 15, 24]
row_means = X.mean(axis=1)   # [2., 5., 8.]

# No axis: Entire array
total = X.sum()              # 45

Keep Dimensions for Broadcasting:

# Problem: Subtract row means from each row
X = np.array([[1, 2, 3], [4, 5, 6]])

# Wrong: This creates 1D array
row_means = X.mean(axis=1)          # Shape: (2,)
# centered = X - row_means           # Shape mismatch error!

# Correct: Keep dimensions
row_means = X.mean(axis=1, keepdims=True)  # Shape: (2, 1)
centered = X - row_means                    # Broadcasting works!
# [[-1,  0,  1],
#  [-1,  0,  1]]

Common Statistical Functions:

X = np.array([[1, 2, 3], [4, 5, 6]])

# Basic stats
X.min()            # 1
X.max()            # 6
X.mean()           # 3.5
X.std()            # Standard deviation
X.var()            # Variance

# Percentiles
np.percentile(X, 50)   # Median
np.percentile(X, [25, 75])  # Quartiles

# Along axis
X.argmax(axis=0)   # Index of max in each column
X.argmin(axis=1)   # Index of min in each row

3. Slicing and Indexing

Basic Slicing:

X = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

# Get first 2 rows
first_two = X[:2, :]          # [[1,2,3,4], [5,6,7,8]]

# Get last 2 columns
last_two_cols = X[:, -2:]     # [[3,4], [7,8], [11,12]]

# Get specific row and column range
sub = X[1:, 1:3]              # [[6,7], [10,11]]

Boolean Indexing:

X = np.array([[1, -2, 3], [-4, 5, -6]])

# Select positive values
positive = X[X > 0]           # [1, 3, 5]

# Replace negative values with 0
X[X < 0] = 0
# [[1, 0, 3], [0, 5, 0]]

# Row-wise filtering: Keep rows where first column > 0
condition = X[:, 0] > 0
filtered = X[condition]

Fancy Indexing:

X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Select specific rows and columns
rows = [0, 2]
cols = [1, 2]
sub = X[rows, :][:, cols]     # [[2, 3], [8, 9]]

# Advanced: Select diagonal
indices = np.arange(3)
diagonal = X[indices, indices]  # [1, 5, 9]

Interview Question:

"Given a 2D array of test scores (students × subjects), return students who scored above average in all subjects."

def high_performers(scores):
    """
    scores: (n_students, n_subjects) array
    Returns: indices of students who beat average in ALL subjects

    Time: O(n * m) where n=students, m=subjects
    Space: O(n * m) for boolean array
    """
    # Calculate mean for each subject (column)
    subject_means = scores.mean(axis=0, keepdims=True)  # (1, n_subjects)

    # Check if each student beats mean in each subject
    above_avg = scores > subject_means  # (n_students, n_subjects)

    # Keep students who beat average in ALL subjects
    all_above = above_avg.all(axis=1)   # (n_students,)

    return np.where(all_above)[0]

# Test
scores = np.array([[85, 90, 88],  # Student 0: above avg in all
                   [70, 75, 72],  # Student 1: not all above avg
                   [95, 92, 94]]) # Student 2: above avg in all
print(high_performers(scores))  # [0, 2]

4. Matrix Multiplication

Dot Product and Matrix Multiplication:

# Vectors: Dot product
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
dot = np.dot(a, b)      # 1*4 + 2*5 + 3*6 = 32

# Matrices: Matrix multiplication
A = np.array([[1, 2], [3, 4]])    # (2, 2)
B = np.array([[5, 6], [7, 8]])    # (2, 2)

# Method 1: np.dot()
C = np.dot(A, B)

# Method 2: @ operator (Python 3.5+)
C = A @ B

# Result: [[19, 22], [43, 50]]

Broadcasting in Matrix Operations:

# Add bias to each sample
X = np.array([[1, 2], [3, 4], [5, 6]])  # (3, 2) - 3 samples, 2 features
bias = np.array([10, 20])                # (2,)

result = X + bias  # Broadcasting: (3, 2) + (2,) = (3, 2)
# [[11, 22], [13, 24], [15, 26]]

Linear Regression with Matrices:

def fit_linear_regression(X, y):
    """
    Fit linear regression using normal equation

    X: (n_samples, n_features)
    y: (n_samples,)
    Returns: weights (n_features,)

    Formula: w = (X^T X)^{-1} X^T y
    """
    # Add intercept term
    ones = np.ones((X.shape[0], 1))
    X_with_intercept = np.hstack([ones, X])

    # Normal equation
    XTX = X_with_intercept.T @ X_with_intercept
    XTy = X_with_intercept.T @ y
    weights = np.linalg.inv(XTX) @ XTy

    return weights

# Test
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])
weights = fit_linear_regression(X, y)
print(weights)  # [~0, ~2] - intercept, slope

5. Concatenation and Stacking

Horizontal and Vertical Stacking:

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Vertical stack (add rows)
V = np.vstack([A, B])  # [[1,2], [3,4], [5,6], [7,8]] - Shape: (4, 2)

# Horizontal stack (add columns)
H = np.hstack([A, B])  # [[1,2,5,6], [3,4,7,8]] - Shape: (2, 4)

# General concatenation
cat_rows = np.concatenate([A, B], axis=0)  # Same as vstack
cat_cols = np.concatenate([A, B], axis=1)  # Same as hstack

Practical Use: Train/Test Split:

def create_batches(X, y, batch_size):
    """
    Split data into batches

    X: (n_samples, n_features)
    y: (n_samples,)
    Returns: list of (X_batch, y_batch) tuples
    """
    n_samples = X.shape[0]
    batches = []

    for i in range(0, n_samples, batch_size):
        end = min(i + batch_size, n_samples)
        X_batch = X[i:end]
        y_batch = y[i:end]
        batches.append((X_batch, y_batch))

    return batches

# Test
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
batches = create_batches(X, y, batch_size=3)
print(len(batches))  # 4 batches (3+3+3+1)

6. Common ML Preprocessing Patterns

Min-Max Normalization:

def min_max_normalize(X):
    """
    Scale features to [0, 1] range

    X: (n_samples, n_features)
    Returns: normalized X
    """
    min_val = X.min(axis=0, keepdims=True)
    max_val = X.max(axis=0, keepdims=True)
    return (X - min_val) / (max_val - min_val)

# Test
X = np.array([[1, 200], [2, 300], [3, 400]])
normalized = min_max_normalize(X)
# [[0., 0.], [0.5, 0.5], [1., 1.]]

One-Hot Encoding:

def one_hot_encode(labels, num_classes):
    """
    Convert class labels to one-hot vectors

    labels: (n_samples,) array of class indices
    num_classes: total number of classes
    Returns: (n_samples, num_classes) array
    """
    n_samples = len(labels)
    one_hot = np.zeros((n_samples, num_classes))
    one_hot[np.arange(n_samples), labels] = 1
    return one_hot

# Test
labels = np.array([0, 2, 1, 0])
one_hot = one_hot_encode(labels, num_classes=3)
# [[1, 0, 0],
#  [0, 0, 1],
#  [0, 1, 0],
#  [1, 0, 0]]

Missing Value Imputation:

def impute_mean(X):
    """
    Replace NaN values with column mean

    X: (n_samples, n_features) array with NaN values
    Returns: X with NaN replaced by column means
    """
    # Create a copy to avoid modifying original
    X_imputed = X.copy()

    # For each column
    for col in range(X.shape[1]):
        # Get non-NaN values
        col_data = X[:, col]
        mask = ~np.isnan(col_data)

        # Calculate mean of non-NaN values
        col_mean = col_data[mask].mean()

        # Replace NaN with mean
        X_imputed[~mask, col] = col_mean

    return X_imputed

# Test
X = np.array([[1, 2, np.nan],
              [3, np.nan, 6],
              [5, 4, 7]])
imputed = impute_mean(X)
# [[1, 2, 6.5],
#  [3, 3, 6],
#  [5, 4, 7]]

Interview Problem: Distance Matrix

Problem: Given N points in D-dimensional space, compute the pairwise Euclidean distance matrix.

def pairwise_distances(X):
    """
    Compute pairwise Euclidean distances

    X: (n_samples, n_features)
    Returns: (n_samples, n_samples) distance matrix

    Distance between i and j: sqrt(sum((X[i] - X[j])^2))

    Efficient method using broadcasting:
    ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x·y
    """
    # Compute squared norms for each point
    # X^2 summed across features: (n_samples, 1)
    X_squared = (X ** 2).sum(axis=1, keepdims=True)

    # Dot product matrix: (n_samples, n_samples)
    XXT = X @ X.T

    # Use formula: ||xi - xj||^2 = ||xi||^2 + ||xj||^2 - 2*xi·xj
    distances_squared = X_squared + X_squared.T - 2 * XXT

    # Take square root (use maximum to avoid negative values from numerical errors)
    distances = np.sqrt(np.maximum(distances_squared, 0))

    return distances

# Test
X = np.array([[0, 0], [3, 4], [1, 1]])
D = pairwise_distances(X)
# [[0., 5., 1.414],
#  [5., 0., 3.606],
#  [1.414, 3.606, 0.]]

# Verify: distance between [0,0] and [3,4] = sqrt(9+16) = 5 ✓

Performance Tips

1. Vectorize Instead of Loops:

# Slow: Loop
result = []
for i in range(len(arr)):
    result.append(arr[i] * 2)

# Fast: Vectorized
result = arr * 2

2. Avoid Unnecessary Copies:

# Creates copy
arr_copy = arr.reshape(-1)

# Returns view (no copy)
arr_view = arr.ravel()

3. Use axis Parameters:

# Slow: Loop over rows
sums = [row.sum() for row in X]

# Fast: Vectorized
sums = X.sum(axis=1)

Key Takeaways

Master axis parameter - Understand axis=0 (columns) vs axis=1 (rows)
Use keepdims=True - Prevents dimension mismatch in broadcasting
Boolean indexing is powerful - Filter data without loops
Vectorize operations - 10-100x faster than loops
Understand views vs copies - Avoid unnecessary memory allocation

What's Next?

In the next lesson, we'll cover common algorithm patterns (two pointers, sliding window, binary search) adapted for ML interview contexts.

:::