Machine Learning: A Hands-On Guide for 2026
April 16, 2026
TL;DR
This guide covers everything you need to start building machine learning models in Python as of 2026. You will set up your development environment, master data preprocessing, implement core algorithms (linear regression, logistic regression, decision trees, SVMs, and neural networks), evaluate models with cross-validation, and work through real-world case studies. The guide includes runnable code examples, visualizations, and a section on ethical considerations in AI development — because shipping a model that works is only half the job.
What You'll Learn
- How to set up a modern Python ML environment with scikit-learn, PyTorch, and TensorFlow
- Data preprocessing essentials: handling missing values, feature scaling, and encoding
- Core algorithms from linear regression to neural networks, with runnable code
- Model evaluation techniques including cross-validation and the bias-variance tradeoff
- Real-world case studies: fraud detection, image classification, and recommendation systems
- Ethical considerations: detecting bias, measuring fairness, and building responsibly
Introduction to Machine Learning
Machine learning (ML) is a subset of artificial intelligence (AI) that enables systems to automatically learn and improve from experience without being explicitly programmed. While AI encompasses the broader concept of machines performing tasks that typically require human intelligence, ML specifically focuses on developing algorithms that can learn from and make predictions or decisions based on data.
Deep learning, a specialized subset of machine learning, uses neural networks with multiple layers (hence "deep") to model complex patterns in large amounts of data. It has driven recent breakthroughs in areas like computer vision and natural language processing.
Types of Machine Learning
- Supervised Learning: The algorithm learns from labeled training data, making predictions based on input-output pairs. Common tasks include:
- Classification (predicting categories)
- Regression (predicting continuous values)
- Unsupervised Learning: The algorithm finds patterns in unlabeled data without explicit guidance. Typical applications include:
- Clustering (grouping similar data points)
- Dimensionality reduction (simplifying data while preserving structure)
- Reinforcement Learning: An agent learns to make decisions by performing actions and receiving rewards or penalties. This approach is used in:
- Game playing (e.g., AlphaGo)
- Robotics
- Autonomous vehicles
The Machine Learning Workflow
A typical ML project follows these steps:
- Problem definition and data collection
- Data preprocessing and exploration
- Model selection and training
- Model evaluation and optimization
- Deployment and monitoring
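Steps 1 through 4 can be compressed into a minimal sketch, using a built-in scikit-learn dataset as a stand-in for real problem definition and data collection (deployment and monitoring happen outside a notebook and are not shown):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# 1. Problem definition and data collection: binary classification, built-in data
X, y = load_breast_cancer(return_X_y=True)

# 2. Preprocessing: scaling is bundled into a pipeline so it is learned
#    from the training split only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Model selection and training
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# 4. Evaluation on held-out data
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```

Each step of this skeleton is expanded in the sections that follow.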
Setting Up Your Environment
Let's set up a robust Python environment for machine learning:
Option 1: Local Installation with Anaconda
# Download and install Anaconda from https://www.anaconda.com/download
# Create a new environment
conda create -n ml-env python=3.12
conda activate ml-env
# Install core packages
conda install numpy pandas matplotlib scikit-learn jupyter
conda install -c conda-forge xgboost lightgbm imbalanced-learn
# PyTorch and TensorFlow now recommend pip installs (official conda builds
# are discontinued or lag behind); run these inside the activated env
pip install torch torchvision torchaudio
pip install tensorflow
Option 2: Google Colab
Google Colab provides a free, cloud-based environment with GPU support:
- Go to https://colab.research.google.com/
- Create a new notebook
- Install required packages:
!pip install numpy pandas matplotlib scikit-learn xgboost lightgbm imbalanced-learn tensorflow torch torchvision
Essential Libraries
# Core data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Machine learning
from sklearn import datasets, model_selection, preprocessing, metrics
import xgboost as xgb
import lightgbm as lgb
import tensorflow as tf
import torch
import torch.nn as nn
Data Preprocessing: The Foundation of ML
Quality data preprocessing is crucial for successful machine learning. Let's explore essential techniques:
Handling Missing Values
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
# Create sample data with missing values
data = pd.DataFrame({
'age': [25, 30, np.nan, 35, 40, 45, np.nan, 50],
'income': [50000, np.nan, 75000, 80000, np.nan, 110000, 120000, 130000],
'department': ['HR', 'IT', 'IT', np.nan, 'Finance', 'Finance', 'HR', 'IT']
})
# 1. Simple imputation
# For numerical features
num_imputer = SimpleImputer(strategy='mean')
data[['age', 'income']] = num_imputer.fit_transform(data[['age', 'income']])
# For categorical features
cat_imputer = SimpleImputer(strategy='most_frequent')
data['department'] = cat_imputer.fit_transform(data[['department']]).ravel()
# 2. KNN imputation as a more sophisticated alternative
# Note: pick one strategy; `data` was already imputed above, so in real code
# apply the KNNImputer while the columns still contain NaNs
knn_imputer = KNNImputer(n_neighbors=2)
data_imputed = knn_imputer.fit_transform(data[['age', 'income']])
Feature Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Sample data
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
# Standardization (Z-score normalization)
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
# Min-Max scaling
minmax_scaler = MinMaxScaler()
X_minmax = minmax_scaler.fit_transform(X)
# Robust scaling (less sensitive to outliers)
robust_scaler = RobustScaler()
X_robust = robust_scaler.fit_transform(X)
Encoding Categorical Variables
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
# Sample data
categories = ['red', 'blue', 'green', 'red', 'blue']
# One-hot encoding
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoded = onehot_encoder.fit_transform(np.array(categories).reshape(-1, 1))
# Label encoding
label_encoder = LabelEncoder()
label_encoded = label_encoder.fit_transform(categories)
# Ordinal encoding for ordered categories
size_categories = ['S', 'M', 'L', 'XL', 'XXL']
ordinal_encoder = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL', 'XXL']])
ordinal_encoded = ordinal_encoder.fit_transform(np.array(size_categories).reshape(-1, 1))
Data Splitting
from sklearn.model_selection import train_test_split
# Load sample dataset
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
# Basic train-test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# For time-series data, use TimeSeriesSplit, which never trains on the future
# (shown on iris only to illustrate the API; real usage needs time-ordered data)
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
Core Machine Learning Algorithms
Linear Regression
Linear regression models the relationship between a dependent variable and one or more independent variables using a linear equation.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Generate sample data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Create and train the model
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Make predictions
X_new = np.array([[0], [2]])
y_pred = lin_reg.predict(X_new)
# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.5, label='Data')
plt.plot(X_new, y_pred, 'r-', linewidth=2, label='Linear Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Example')
plt.legend()
plt.grid(True)
plt.show()
# Model evaluation
y_pred_all = lin_reg.predict(X)
mse = mean_squared_error(y, y_pred_all)
r2 = r2_score(y, y_pred_all)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
print(f"Intercept: {lin_reg.intercept_[0]:.2f}")
print(f"Coefficient: {lin_reg.coef_[0][0]:.2f}")
Logistic Regression
Logistic regression is used for binary classification problems, modeling the probability that an instance belongs to a particular class.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
from sklearn.model_selection import train_test_split
# Generate sample data
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
n_informative=2, random_state=42, n_clusters_per_class=1)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train the model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
# Make predictions
y_pred = log_reg.predict(X_test)
y_pred_proba = log_reg.predict_proba(X_test)[:, 1]
# Plot decision boundary
def plot_decision_boundary(X, y, model):
h = 0.02 # Step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Logistic Regression Decision Boundary')
plt.figure(figsize=(10, 6))
plot_decision_boundary(X_test, y_test, log_reg)
plt.show()
# Model evaluation
print("Classification Report:")
print(classification_report(y_test, y_pred))
# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
Decision Trees and Random Forests
Decision trees are versatile algorithms that can perform both classification and regression tasks. Random forests combine multiple decision trees to improve performance and reduce overfitting.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
class_names = iris.target_names
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Decision Tree
dt_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_clf.fit(X_train, y_train)
# Visualize the decision tree
plt.figure(figsize=(20,10))
plot_tree(dt_clf, feature_names=feature_names, class_names=class_names,
filled=True, rounded=True, fontsize=10)
plt.title("Decision Tree Visualization")
plt.show()
# Random Forest
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
# Feature importance
importances = rf_clf.feature_importances_
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), [feature_names[i] for i in indices], rotation=45)
plt.tight_layout()
plt.show()
# Model evaluation
dt_pred = dt_clf.predict(X_test)
rf_pred = rf_clf.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, dt_pred))
print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))
# Compare performance with different number of trees
n_estimators = [1, 5, 10, 50, 100, 200, 500]
train_scores = []
test_scores = []
for n in n_estimators:
rf = RandomForestClassifier(n_estimators=n, random_state=42)
rf.fit(X_train, y_train)
train_scores.append(rf.score(X_train, y_train))
test_scores.append(rf.score(X_test, y_test))
plt.figure(figsize=(10, 6))
plt.plot(n_estimators, train_scores, label='Train Score')
plt.plot(n_estimators, test_scores, label='Test Score')
plt.xlabel('Number of Trees')
plt.ylabel('Accuracy')
plt.title('Random Forest Performance vs. Number of Trees')
plt.legend()
plt.grid(True)
plt.xscale('log')
plt.show()
Support Vector Machines (SVMs)
SVMs are powerful algorithms for both classification and regression tasks, particularly effective in high-dimensional spaces.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.inspection import DecisionBoundaryDisplay
# Generate sample data
X, y = datasets.make_classification(n_samples=300, n_features=2, n_redundant=0,
n_informative=2, random_state=42, n_clusters_per_class=1)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train different SVM kernels
kernels = ['linear', 'rbf', 'poly']
svm_models = {}
for kernel in kernels:
svm = SVC(kernel=kernel, random_state=42, probability=True)
svm.fit(X_train_scaled, y_train)
svm_models[kernel] = svm
print(f"{kernel.upper()} Kernel - Train Accuracy: {svm.score(X_train_scaled, y_train):.3f}, "
f"Test Accuracy: {svm.score(X_test_scaled, y_test):.3f}")
# Plot decision boundaries using sklearn's DecisionBoundaryDisplay
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, (kernel, svm) in zip(axes, svm_models.items()):
DecisionBoundaryDisplay.from_estimator(
svm, X_train_scaled, ax=ax, response_method="predict", alpha=0.4
)
ax.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train, edgecolors='k', s=20)
ax.set_title(f'SVM with {kernel.upper()} Kernel')
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
plt.tight_layout()
plt.show()
# Hyperparameter tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': ['scale', 'auto', 0.1, 1, 10],
'kernel': ['rbf', 'poly', 'sigmoid']
}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score: {:.3f}".format(grid_search.best_score_))
print("Test set score: {:.3f}".format(grid_search.score(X_test_scaled, y_test)))
Neural Networks: A Gentle Introduction
Neural networks are computing systems inspired by biological neural networks, capable of learning complex patterns in data.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.datasets import mnist
import matplotlib.pyplot as plt
import numpy as np
# Load and preprocess MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Normalize pixel values
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0
# Reshape the data
X_train = X_train.reshape(-1, 28*28)
X_test = X_test.reshape(-1, 28*28)
# Convert labels to one-hot encoding
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)
# Build the neural network
model = Sequential([
    # Keras 3 style: declare the input shape with an InputLayer
    # instead of the deprecated input_shape= argument on the first Dense
    tf.keras.layers.InputLayer(shape=(784,)),
    Dense(512, activation='relu'),
    Dropout(0.2),
    Dense(256, activation='relu'),
    Dropout(0.2),
    Dense(128, activation='relu'),
    Dropout(0.2),
    Dense(10, activation='softmax')
])
# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001),
loss='categorical_crossentropy',
metrics=['accuracy'])
# Model summary
model.summary()
# Train the model
history = model.fit(X_train, y_train,
batch_size=128,
epochs=20,
validation_split=0.2,
verbose=1)
# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest accuracy: {test_accuracy:.4f}")
# Plot training history
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.tight_layout()
plt.show()
# Make predictions
predictions = model.predict(X_test[:5])
predicted_classes = np.argmax(predictions, axis=1)
# Display some predictions
plt.figure(figsize=(15, 3))
for i in range(5):
plt.subplot(1, 5, i+1)
plt.imshow(X_test[i].reshape(28, 28), cmap='gray')
plt.title(f"Pred: {predicted_classes[i]}")
plt.axis('off')
plt.show()
Model Evaluation and Selection
Proper model evaluation is crucial for assessing performance and making informed decisions about model selection.
Classification Metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, confusion_matrix,
classification_report, roc_curve, auc)
import seaborn as sns
import matplotlib.pyplot as plt
# For binary classification
def evaluate_classification(y_true, y_pred, y_pred_proba=None):
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1-Score:", f1_score(y_true, y_pred))
if y_pred_proba is not None:
print("ROC AUC:", roc_auc_score(y_true, y_pred_proba))
# Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
# ROC Curve
if y_pred_proba is not None:
fpr, tpr, _ = roc_curve(y_true, y_pred_proba)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2,
label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
# For multi-class classification
def evaluate_multiclass(y_true, y_pred, class_names):
print("Classification Report:")
print(classification_report(y_true, y_pred, target_names=class_names))
cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=class_names, yticklabels=class_names)
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.show()
Cross-Validation Techniques
from sklearn.model_selection import (cross_val_score, KFold, StratifiedKFold,
cross_validate, learning_curve)
import numpy as np
import matplotlib.pyplot as plt
def perform_cross_validation(model, X, y, cv=5):
# Simple cross-validation
cv_scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Mean CV Accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# Stratified K-Fold for imbalanced datasets
skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
stratified_scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print(f"\nStratified CV Scores: {stratified_scores}")
print(f"Mean Stratified CV Accuracy: {stratified_scores.mean():.3f} ± {stratified_scores.std():.3f}")
# Cross-validate with multiple metrics
scoring = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
cv_results = cross_validate(model, X, y, cv=cv, scoring=scoring)
print("\nDetailed Cross-Validation Results:")
for metric in scoring:
scores = cv_results[f'test_{metric}']
print(f"{metric}: {scores.mean():.3f} ± {scores.std():.3f}")
# Plot learning curves
train_sizes, train_scores, test_scores = learning_curve(
model, X, y, cv=cv, n_jobs=-1,
train_sizes=np.linspace(0.1, 1.0, 10),
scoring='accuracy'
)
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, color='blue', marker='o',
markersize=5, label='Training accuracy')
plt.fill_between(train_sizes, train_mean + train_std, train_mean - train_std,
alpha=0.15, color='blue')
plt.plot(train_sizes, test_mean, color='green', linestyle='--',
marker='s', markersize=5, label='Validation accuracy')
plt.fill_between(train_sizes, test_mean + test_std, test_mean - test_std,
alpha=0.15, color='green')
plt.grid()
plt.xlabel('Number of training examples')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.title('Learning Curves')
plt.show()
# Example usage
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
model = RandomForestClassifier(n_estimators=100, random_state=42)
perform_cross_validation(model, X, y)
Bias-Variance Tradeoff
Understanding the bias-variance tradeoff is essential for model selection and preventing overfitting or underfitting.
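For squared-error loss, the tradeoff comes from a standard identity: the expected prediction error at a point x decomposes into three terms (here f is the true function, f-hat the learned model, and sigma-squared the noise variance):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  \;+\; \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  \;+\; \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The simulation below estimates the first two terms empirically by refitting the model on noisy resamples of the training targets.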
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# Generate sample data
np.random.seed(42)
X = np.linspace(-3, 3, 100)
y = np.sin(X) + np.random.normal(0, 0.1, 100)
# Reshape for sklearn
X = X.reshape(-1, 1)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Function to fit polynomial regression
def fit_polynomial(degree):
polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
linear_regression = LinearRegression()
pipeline = Pipeline([
("polynomial_features", polynomial_features),
("linear_regression", linear_regression)
])
pipeline.fit(X_train, y_train)
return pipeline
# Calculate bias and variance
def calculate_bias_variance(model, X_test, y_test, n_iterations=100):
y_preds = np.zeros((n_iterations, len(X_test)))
for i in range(n_iterations):
# Add noise to training data
y_train_noisy = y_train + np.random.normal(0, 0.1, len(y_train))
model.fit(X_train, y_train_noisy)
y_preds[i] = model.predict(X_test)
# Calculate variance and bias
variance = np.var(y_preds, axis=0)
bias = (y_test - np.mean(y_preds, axis=0))**2
return np.mean(bias), np.mean(variance)
# Evaluate different polynomial degrees
degrees = range(1, 15)
biases = []
variances = []
train_errors = []
test_errors = []
for degree in degrees:
model = fit_polynomial(degree)
bias, variance = calculate_bias_variance(model, X_test, y_test)
biases.append(bias)
variances.append(variance)
# Training and test errors
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)
train_errors.append(mean_squared_error(y_train, train_pred))
test_errors.append(mean_squared_error(y_test, test_pred))
# Plot bias-variance tradeoff
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.plot(degrees, biases, label='Bias²')
plt.plot(degrees, variances, label='Variance')
plt.plot(degrees, np.array(biases) + np.array(variances), 'k--', label='Bias² + Variance')
plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylabel('Error')
plt.title('Bias-Variance Tradeoff')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(degrees, train_errors, 'b-', label='Training Error')
plt.plot(degrees, test_errors, 'r-', label='Test Error')
plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylabel('Mean Squared Error')
plt.title('Training vs Test Error')
plt.legend()
plt.tight_layout()
plt.show()
# Find optimal model complexity
optimal_degree = degrees[np.argmin(test_errors)]
print(f"Optimal polynomial degree: {optimal_degree}")
# Plot best fit
best_model = fit_polynomial(optimal_degree)
X_plot = np.linspace(-3, 3, 100).reshape(-1, 1)
y_plot = best_model.predict(X_plot)
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', alpha=0.5, label='Training Data')
plt.scatter(X_test, y_test, color='red', alpha=0.5, label='Test Data')
plt.plot(X_plot, y_plot, 'k-', linewidth=2, label=f'Polynomial (degree {optimal_degree})')
plt.plot(X_plot, np.sin(X_plot), 'g--', label='True Function')
plt.xlabel('X')
plt.ylabel('y')
plt.title(f'Best Fit Polynomial (Degree {optimal_degree})')
plt.legend()
plt.grid(True)
plt.show()
Real-World Case Studies
Fraud Detection: Using ML to Identify Fraudulent Transactions
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
# Load and preprocess the dataset
# Note: In practice, use your own dataset or download from a reliable source
# For this example, we'll create a synthetic dataset
np.random.seed(42)
n_samples = 10000
n_features = 20
# Generate synthetic data
X = np.random.randn(n_samples, n_features)
# Create synthetic fraud labels (highly imbalanced)
fraud = np.random.binomial(1, 0.01, n_samples)
# Add some patterns to the fraud cases
X[fraud == 1, 0] += 3 # First feature is higher for fraud
X[fraud == 1, 1] -= 2 # Second feature is lower for fraud
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, fraud, test_size=0.3, stratify=fraud, random_state=42
)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Handle class imbalance with SMOTE
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_scaled, y_train)
# Train Isolation Forest for anomaly detection
iso_forest = IsolationForest(contamination=0.01, random_state=42)
iso_forest.fit(X_train_scaled)
# Predict anomalies (1 for inliers, -1 for outliers)
y_pred_iso = iso_forest.predict(X_test_scaled)
# Convert to binary (0 for inliers, 1 for outliers)
y_pred_iso = (y_pred_iso == -1).astype(int)
# Train a classifier (XGBoost) on the balanced data
import xgboost as xgb
# Note: after SMOTE the classes are already balanced, so scale_pos_weight
# would equal 1 and do nothing; use it as an alternative to SMOTE when
# training on the original imbalanced data instead
xgb_model = xgb.XGBClassifier(random_state=42)
xgb_model.fit(X_train_balanced, y_train_balanced)
# Make predictions
y_pred_xgb = xgb_model.predict(X_test_scaled)
y_pred_proba_xgb = xgb_model.predict_proba(X_test_scaled)[:, 1]
# Evaluate models
def evaluate_fraud_detection(y_true, y_pred, model_name):
print(f"\n{model_name} Performance:")
print("Confusion Matrix:")
cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title(f'Confusion Matrix - {model_name}')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
print("\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=['Legitimate', 'Fraud']))
evaluate_fraud_detection(y_test, y_pred_iso, "Isolation Forest")
evaluate_fraud_detection(y_test, y_pred_xgb, "XGBoost")
# Plot feature importance
plt.figure(figsize=(12, 6))
xgb.plot_importance(xgb_model, max_num_features=10)
plt.title('Feature Importance - XGBoost')
plt.show()
# Threshold optimization
from sklearn.metrics import precision_recall_curve, average_precision_score
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_xgb)
average_precision = average_precision_score(y_test, y_pred_proba_xgb)
plt.figure(figsize=(10, 6))
plt.step(recall, precision, where='post', label=f'Precision-Recall (AP = {average_precision:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True)
plt.show()
# Find optimal threshold based on F1-score
# precision and recall have one more element than thresholds, so drop the
# final point before matching F1-scores back to thresholds
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-9)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]
print(f"Optimal threshold: {optimal_threshold:.3f}")
print(f"Optimal F1-score: {f1_scores[optimal_idx]:.3f}")
# Apply optimal threshold
y_pred_optimal = (y_pred_proba_xgb >= optimal_threshold).astype(int)
evaluate_fraud_detection(y_test, y_pred_optimal, "XGBoost (Optimal Threshold)")
Image Recognition: Building a Simple Image Classifier
import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
import matplotlib.pyplot as plt
import numpy as np
# Load and preprocess CIFAR-10 dataset
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
# Normalize pixel values
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0
# Convert class vectors to binary class matrices
num_classes = 10
y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)
# Data augmentation using tf.keras layers (ImageDataGenerator is deprecated)
data_augmentation = tf.keras.Sequential([
tf.keras.layers.RandomFlip("horizontal"),
tf.keras.layers.RandomRotation(0.05),
tf.keras.layers.RandomZoom(0.1),
tf.keras.layers.RandomTranslation(0.1, 0.1),
])
# Build the CNN model (augmentation is applied inside the model)
def create_cnn_model():
model = Sequential([
# InputLayer defines the expected shape for the Sequential model
tf.keras.layers.InputLayer(shape=(32, 32, 3)),
# Data augmentation (only active during training)
data_augmentation,
# First convolutional block
Conv2D(32, (3, 3), padding='same', activation='relu'),
BatchNormalization(),
Conv2D(32, (3, 3), padding='same', activation='relu'),
BatchNormalization(),
MaxPooling2D((2, 2)),
Dropout(0.25),
# Second convolutional block
Conv2D(64, (3, 3), padding='same', activation='relu'),
BatchNormalization(),
Conv2D(64, (3, 3), padding='same', activation='relu'),
BatchNormalization(),
MaxPooling2D((2, 2)),
Dropout(0.3),
# Third convolutional block
Conv2D(128, (3, 3), padding='same', activation='relu'),
BatchNormalization(),
Conv2D(128, (3, 3), padding='same', activation='relu'),
BatchNormalization(),
MaxPooling2D((2, 2)),
Dropout(0.4),
# Dense layers
Flatten(),
Dense(512, activation='relu'),
BatchNormalization(),
Dropout(0.5),
Dense(num_classes, activation='softmax')
])
return model
# Create and compile the model
model = create_cnn_model()
model.compile(optimizer=Adam(learning_rate=0.001),
loss='categorical_crossentropy',
metrics=['accuracy'])
# Callbacks
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=1e-6)
# Train the model (augmentation happens inside the model via the Sequential layers)
batch_size = 64
epochs = 50
history = model.fit(
X_train, y_train,
batch_size=batch_size,
epochs=epochs,
validation_data=(X_test, y_test),
callbacks=[early_stopping, reduce_lr],
verbose=1
)
# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest accuracy: {test_accuracy:.4f}")
# Plot training history
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.tight_layout()
plt.show()
# Make predictions
predictions = model.predict(X_test[:10])
predicted_classes = np.argmax(predictions, axis=1)

# Class names for CIFAR-10
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# Display some predictions
plt.figure(figsize=(15, 5))
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(X_test[i])
    true_label = class_names[np.argmax(y_test[i])]
    pred_label = class_names[predicted_classes[i]]
    plt.title(f"True: {true_label}\nPred: {pred_label}")
    plt.axis('off')
plt.tight_layout()
plt.show()
# Confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns

y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_true = np.argmax(y_test, axis=1)

cm = confusion_matrix(y_true, y_pred_classes)
plt.figure(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
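One more diagnostic worth running on the same confusion matrix: per-class recall is just the diagonal divided by each row sum, and it often reveals that a respectable overall accuracy hides weak classes (cats and dogs are the usual CIFAR-10 offenders). A minimal sketch with a small stand-in matrix; in the code above, `cm` comes from `confusion_matrix(y_true, y_pred_classes)`:

```python
import numpy as np

# Per-class recall from a confusion matrix: diagonal / row sums.
# `cm` here is a small 3-class stand-in for illustration.
cm = np.array([[50,  5,  5],
               [10, 40, 10],
               [ 0,  5, 55]])
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)  # one recall value per class
```

With the real 10-class matrix, pairing these values with `class_names` makes it easy to see which categories need more data or augmentation.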
Recommendation Systems: Implementing Collaborative Filtering
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.sparse.linalg import svds
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Generate synthetic user-item interaction data
np.random.seed(42)
n_users = 1000
n_items = 200
n_ratings = 10000

# Sample random (user, item, rating) triples
user_ids = np.random.randint(0, n_users, n_ratings)
item_ids = np.random.randint(0, n_items, n_ratings)
ratings = np.random.randint(1, 6, n_ratings)  # Ratings from 1 to 5

# Create DataFrame and drop duplicate user-item pairs
ratings_df = pd.DataFrame({
    'user_id': user_ids,
    'item_id': item_ids,
    'rating': ratings
})
ratings_df = ratings_df.drop_duplicates(['user_id', 'item_id'])

# Create the full user-item matrix (0 = unrated)
user_item_matrix = ratings_df.pivot(
    index='user_id',
    columns='item_id',
    values='rating'
).fillna(0).values
# Split into train and test sets
train_data, test_data = train_test_split(ratings_df, test_size=0.2, random_state=42)

# Build dense train and test matrices
train_matrix = np.zeros((n_users, n_items))
for _, row in train_data.iterrows():
    train_matrix[row['user_id'], row['item_id']] = row['rating']

test_matrix = np.zeros((n_users, n_items))
for _, row in test_data.iterrows():
    test_matrix[row['user_id'], row['item_id']] = row['rating']

# Global average over observed ratings only
global_average = np.mean(train_matrix[train_matrix != 0])

# User and item biases: each user's/item's mean deviation from the global average
user_bias = np.zeros(n_users)
item_bias = np.zeros(n_items)

for i in range(n_users):
    user_ratings = train_matrix[i, :]
    if np.sum(user_ratings != 0) > 0:
        user_bias[i] = np.mean(user_ratings[user_ratings != 0]) - global_average

for j in range(n_items):
    item_ratings = train_matrix[:, j]
    if np.sum(item_ratings != 0) > 0:
        item_bias[j] = np.mean(item_ratings[item_ratings != 0]) - global_average

# Center the training matrix by removing the baseline (global + user + item bias)
centered_train_matrix = train_matrix.copy()
for i in range(n_users):
    for j in range(n_items):
        if train_matrix[i, j] != 0:
            centered_train_matrix[i, j] -= global_average + user_bias[i] + item_bias[j]
# Perform truncated SVD on the centered matrix
k = 20  # Number of latent factors
U, sigma, Vt = svds(centered_train_matrix, k=k)
sigma = np.diag(sigma)  # Convert singular values to a diagonal matrix

# Reconstruct ratings: low-rank approximation plus the baseline terms
predicted_ratings = (U @ sigma @ Vt
                     + global_average
                     + user_bias[:, np.newaxis]
                     + item_bias[np.newaxis, :])
predicted_ratings = np.clip(predicted_ratings, 1, 5)  # Clip to the valid rating range

# RMSE over observed test entries only
def calculate_rmse(actual, predicted, mask):
    return np.sqrt(mean_squared_error(actual[mask], predicted[mask]))

test_mask = test_matrix != 0
rmse = calculate_rmse(test_matrix, predicted_ratings, test_mask)
print(f"Test RMSE: {rmse:.4f}")

# Top-N recommendations: highest predicted ratings among items the user hasn't rated
def get_top_n_recommendations(user_id, n=5):
    user_ratings = predicted_ratings[user_id, :]
    unrated_items = np.where(train_matrix[user_id, :] == 0)[0]
    predicted_unrated = user_ratings[unrated_items]
    top_n_indices = np.argsort(predicted_unrated)[::-1][:n]
    top_n_items = unrated_items[top_n_indices]
    top_n_ratings = predicted_unrated[top_n_indices]
    return list(zip(top_n_items, top_n_ratings))

# Example: Get top 5 recommendations for user 0
user_id = 0
recommendations = get_top_n_recommendations(user_id, n=5)
print(f"\nTop 5 recommendations for user {user_id}:")
for item_id, predicted_rating in recommendations:
    print(f"Item {item_id}: Predicted rating = {predicted_rating:.2f}")
# Visualize latent factors
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(U[:, 0], U[:, 1], alpha=0.5)
plt.title('User Latent Factors')
plt.xlabel('Latent Factor 1')
plt.ylabel('Latent Factor 2')
plt.subplot(1, 2, 2)
plt.scatter(Vt[0, :], Vt[1, :], alpha=0.5)
plt.title('Item Latent Factors')
plt.xlabel('Latent Factor 1')
plt.ylabel('Latent Factor 2')
plt.tight_layout()
plt.show()
# Evaluate different numbers of latent factors
k_values = [5, 10, 20, 30, 50]
rmse_scores = []

for k in k_values:
    U, sigma, Vt = svds(centered_train_matrix, k=k)
    sigma = np.diag(sigma)
    predicted_ratings = (U @ sigma @ Vt
                         + global_average
                         + user_bias[:, np.newaxis]
                         + item_bias[np.newaxis, :])
    predicted_ratings = np.clip(predicted_ratings, 1, 5)
    rmse = calculate_rmse(test_matrix, predicted_ratings, test_mask)
    rmse_scores.append(rmse)
    print(f"k = {k}: RMSE = {rmse:.4f}")
# Plot RMSE vs number of latent factors
plt.figure(figsize=(10, 6))
plt.plot(k_values, rmse_scores, 'bo-')
plt.xlabel('Number of Latent Factors (k)')
plt.ylabel('RMSE')
plt.title('RMSE vs Number of Latent Factors')
plt.grid(True)
plt.show()
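SVD is only one way to do collaborative filtering. A neighborhood-based alternative scores items for a user by item-item cosine similarity over the rating columns. The sketch below is a minimal illustration, not a library API; the `item_item_scores` helper is our own, and it uses the same 0-means-unrated convention as the matrices above:

```python
import numpy as np

def item_item_scores(train_matrix, user_id, eps=1e-9):
    """Neighborhood-based CF: score every item for one user using
    item-item cosine similarity over rating columns (0 = unrated)."""
    R = np.asarray(train_matrix, dtype=float)
    norms = np.linalg.norm(R, axis=0) + eps
    sim = (R.T @ R) / np.outer(norms, norms)  # item-item cosine similarity
    np.fill_diagonal(sim, 0.0)                # ignore self-similarity
    user = R[user_id]
    # Similarity-weighted average of the user's observed ratings
    return sim @ user / (np.abs(sim) @ (user > 0) + eps)

# Toy matrix (users x items): user 0 likes items 0 and 1, hasn't seen item 2
R = np.array([[5, 4, 0],
              [4, 5, 1],
              [1, 0, 5]], dtype=float)
scores = item_item_scores(R, user_id=0)
print(scores.round(2))
```

Neighborhood methods are easier to explain ("recommended because you liked X") but scale worse than the factorization approach above, since the similarity matrix is quadratic in the number of items.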
Ethical Considerations in Machine Learning
As machine learning systems become increasingly integrated into decision-making processes, it's crucial to address the ethical implications of these technologies.
Bias in Datasets and Algorithms
Machine learning models can perpetuate or even amplify existing biases present in training data. Common sources of bias include:
- Sampling Bias: When the training data doesn't represent the target population
- Measurement Bias: When the way data is collected introduces systematic errors
- Historical Bias: When historical inequalities are reflected in the data
# Example: Detecting bias in a dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulate a biased hiring dataset
np.random.seed(42)
n_samples = 1000

# Generate synthetic candidate features
data = pd.DataFrame({
    'age': np.random.normal(35, 10, n_samples).astype(int),
    'gender': np.random.choice(['M', 'F'], n_samples, p=[0.7, 0.3]),
    'experience': np.random.normal(10, 5, n_samples).clip(0, 30),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'],
                                  n_samples, p=[0.2, 0.5, 0.2, 0.1]),
    'hired': np.zeros(n_samples)
})

# Introduce bias in hiring decisions
for i in range(n_samples):
    hire_prob = 0.3  # Base probability
    if data.loc[i, 'gender'] == 'F':
        hire_prob *= 0.7  # Gender bias
    if data.loc[i, 'age'] > 40:
        hire_prob *= 0.6  # Age bias
    if data.loc[i, 'education'] in ['Master', 'PhD']:
        hire_prob *= 1.3  # Education boost
    data.loc[i, 'hired'] = np.random.binomial(1, hire_prob)
# Analyze bias
print("Hiring Rates by Gender:")
print(data.groupby('gender')['hired'].mean())
print("\nHiring Rates by Age Group:")
data['age_group'] = pd.cut(data['age'], bins=[0, 30, 40, 50, 100])
print(data.groupby('age_group')['hired'].mean())
# Visualize bias
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
data.groupby('gender')['hired'].mean().plot(kind='bar')
plt.title('Hiring Rate by Gender')
plt.ylabel('Hiring Rate')
plt.subplot(1, 2, 2)
data.groupby('education')['hired'].mean().sort_values().plot(kind='bar')
plt.title('Hiring Rate by Education Level')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Fairness and Accountability
Ensuring fairness in ML models requires careful consideration of different fairness metrics and trade-offs:
- Demographic Parity: Equal selection rates across groups
- Equal Opportunity: Equal true positive rates across groups
- Equalized Odds: Equal true positive and false positive rates across groups
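These definitions translate directly into code. The sketch below computes the demographic parity gap (difference in selection rates) and the equal opportunity gap (difference in true positive rates) across groups; the `fairness_gaps` helper is our own illustration, not a library API, and libraries such as Fairlearn provide production-grade equivalents:

```python
import numpy as np

def fairness_gaps(y_true, y_pred, sensitive):
    """Return (demographic parity difference, equal opportunity difference)
    for a binary classifier, computed as max minus min across groups."""
    y_true, y_pred, sensitive = map(np.asarray, (y_true, y_pred, sensitive))
    sel_rates, tprs = [], []
    for g in np.unique(sensitive):
        mask = sensitive == g
        sel_rates.append(y_pred[mask].mean())       # P(pred=1 | group)
        pos = mask & (y_true == 1)
        tprs.append(y_pred[pos].mean() if pos.any() else np.nan)
    return (max(sel_rates) - min(sel_rates),
            float(np.nanmax(tprs) - np.nanmin(tprs)))

# Toy example: group B is selected far less often than group A
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 0, 0])
group = np.array(['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'])
dp_gap, eo_gap = fairness_gaps(y_true, y_pred, group)
print(f"Demographic parity difference: {dp_gap:.2f}")
print(f"Equal opportunity difference:  {eo_gap:.2f}")
```

A gap of 0 means the groups are treated identically under that metric; note that the two metrics can disagree, which is why choosing the right one for your domain matters.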
# Example: Evaluating model fairness
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Train a model on the biased data
X = data[['age', 'experience']].copy()
y = data['hired']

# Add gender as a feature (to demonstrate how the model picks up the bias)
X['is_female'] = (data['gender'] == 'F').astype(int)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
# Evaluate fairness: overall and per-group TPR, FPR, and selection rate
def evaluate_fairness(y_true, y_pred, sensitive_feature, feature_name):
    # Overall rates, as a baseline for the per-group numbers
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)  # True Positive Rate
    fpr = fp / (fp + tn)  # False Positive Rate
    print(f"Overall TPR: {tpr:.3f}, Overall FPR: {fpr:.3f}")

    # Calculate metrics by group
    groups = np.unique(sensitive_feature)
    group_metrics = {}
    for group in groups:
        mask = (sensitive_feature == group)
        group_tn, group_fp, group_fn, group_tp = confusion_matrix(
            y_true[mask], y_pred[mask]
        ).ravel()
        group_tpr = group_tp / (group_tp + group_fn)
        group_fpr = group_fp / (group_fp + group_tn)
        group_metrics[group] = {
            'tpr': group_tpr,
            'fpr': group_fpr,
            'selection_rate': (group_tp + group_fp) / len(y_true[mask])
        }

    # Print fairness metrics
    print(f"\nFairness Metrics for {feature_name}:")
    for group, metrics in group_metrics.items():
        print(f"\nGroup {group}:")
        print(f"  True Positive Rate: {metrics['tpr']:.3f}")
        print(f"  False Positive Rate: {metrics['fpr']:.3f}")
        print(f"  Selection Rate: {metrics['selection_rate']:.3f}")

    # Disparate impact ratio: min selection rate over max selection rate
    selection_rates = [m['selection_rate'] for m in group_metrics.values()]
    disparate_impact = min(selection_rates) / max(selection_rates)
    print(f"\nDisparate Impact Ratio: {disparate_impact:.3f}")
    print("(Values below 0.8 may indicate adverse impact)")
    return group_metrics

# Evaluate fairness across gender
gender_test = X_test['is_female'].values
evaluate_fairness(y_test.values, y_pred, gender_test, "Gender")
Practical Steps Toward Responsible ML
Building fair and accountable ML systems is not just about metrics — it requires deliberate process decisions throughout the project lifecycle:
- Audit your training data before modeling. Check for demographic imbalances, label quality, and historical patterns that could encode discrimination.
- Choose fairness constraints that match your domain. Demographic parity works for some contexts; equal opportunity works for others. There is no universal "fair" threshold.
- Document model decisions with model cards or datasheets. Record what the model was trained on, what it optimizes for, who it was tested against, and where it should not be deployed.
- Monitor after deployment. Fairness metrics can shift as the population changes. Set up automated monitoring to catch drift in selection rates across groups.
Libraries like Fairlearn and AI Fairness 360 provide additional tooling for measuring and mitigating bias in production systems.
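The monitoring step can be automated with very little code. The sketch below computes per-period selection rates by group and flags periods where the disparate impact ratio drops below the four-fifths threshold; the `flag_selection_drift` helper and its column names are our own convention, not a standard schema:

```python
import pandas as pd

def flag_selection_drift(df, threshold=0.8):
    """Monitoring sketch: selection rate per (period, group), flagging
    periods whose min/max rate ratio falls below `threshold`.
    Expects columns ['period', 'group', 'selected'] (our convention)."""
    rates = df.groupby(['period', 'group'])['selected'].mean().unstack('group')
    ratio = rates.min(axis=1) / rates.max(axis=1)  # disparate impact per period
    return pd.DataFrame({'impact_ratio': ratio, 'flagged': ratio < threshold})

# Toy decision log: parity in week 1, drift toward group A in week 2
log = pd.DataFrame({
    'period':   ['w1'] * 4 + ['w2'] * 4,
    'group':    ['A', 'A', 'B', 'B'] * 2,
    'selected': [1, 0, 1, 0,  1, 1, 1, 0],
})
report = flag_selection_drift(log)
print(report)
```

In production you would run this on the model's decision log per deployment window and alert on flagged periods, rather than waiting for a periodic audit to surface the drift.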
Where to Go from Here
This guide covered the foundational workflow — preprocessing, algorithms, evaluation, and ethics. If you want to go deeper into specific areas, here are some next steps worth exploring:
- Deep learning architectures: convolutional networks for images, recurrent networks and transformers for sequences. Our guide on deep learning fundamentals walks through the core concepts with code.
- Building neural networks from scratch: if you want to understand what happens inside model.fit(), the neural networks from scratch guide implements forward and backward passes manually.
- End-to-end project structure: preprocessing and modeling are only part of the pipeline. The data science project guide covers packaging, deployment, and monitoring.
- Local AI and RAG systems: if you are interested in running models on your own hardware, the local AI guide covers Ollama, retrieval-augmented generation, and agent workflows.
Machine learning moves fast — the libraries and best practices in this guide reflect the state of things in early 2026. The fundamentals (linear algebra, probability, optimization, evaluation methodology) change much more slowly than the frameworks. Invest in understanding the math, and the tooling transitions will feel natural.