Machine Learning: A Hands-On Guide for 2026
April 16, 2026
TL;DR
This guide covers everything you need to start building machine learning models in Python as of 2026. You will set up your development environment, master data preprocessing, implement core algorithms (linear regression, logistic regression, decision trees, SVMs, and neural networks), evaluate models with cross-validation, and work through real-world case studies. The guide includes runnable code examples, visualizations, and a section on ethical considerations in AI development — because shipping a model that works is only half the job.
What You'll Learn
- How to set up a modern Python ML environment with scikit-learn, PyTorch, and TensorFlow
- Data preprocessing essentials: handling missing values, feature scaling, and encoding
- Core algorithms from linear regression to neural networks, with runnable code
- Model evaluation techniques including cross-validation and the bias-variance tradeoff
- Real-world case studies: fraud detection, image classification, and recommendation systems
- Ethical considerations: detecting bias, measuring fairness, and building responsibly
Introduction to Machine Learning
Machine learning (ML) is a subset of artificial intelligence (AI) that enables systems to automatically learn and improve from experience without being explicitly programmed. While AI encompasses the broader concept of machines performing tasks that typically require human intelligence, ML specifically focuses on developing algorithms that can learn from and make predictions or decisions based on data.
Deep learning, a specialized subset of machine learning, uses neural networks with multiple layers (hence "deep") to model complex patterns in large amounts of data. It has driven recent breakthroughs in areas like computer vision and natural language processing.
Types of Machine Learning
- Supervised Learning: The algorithm learns from labeled training data, making predictions based on input-output pairs. Common tasks include:
- Classification (predicting categories)
- Regression (predicting continuous values)
- Unsupervised Learning: The algorithm finds patterns in unlabeled data without explicit guidance. Typical applications include:
- Clustering (grouping similar data points)
- Dimensionality reduction (simplifying data while preserving structure)
- Reinforcement Learning: An agent learns to make decisions by performing actions and receiving rewards or penalties. This approach is used in:
- Game playing (e.g., AlphaGo)
- Robotics
- Autonomous vehicles
The Machine Learning Workflow
A typical ML project follows these steps:
- Problem definition and data collection
- Data preprocessing and exploration
- Model selection and training
- Model evaluation and optimization
- Deployment and monitoring
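Steps 1 through 4 can be compressed into a minimal sketch, using a built-in scikit-learn dataset as a stand-in for real problem definition and data collection (deployment and monitoring happen outside a notebook and are not shown):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# 1. Problem definition and data collection: binary classification, built-in data
X, y = load_breast_cancer(return_X_y=True)

# 2. Preprocessing: scaling is bundled into a pipeline so it is learned
#    from the training split only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Model selection and training
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# 4. Evaluation on held-out data
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```

Each step of this skeleton is expanded in the sections that follow.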
Setting Up Your Environment
Let's set up a robust Python environment for machine learning:
Option 1: Local Installation with Anaconda
# Download and install Anaconda from https://www.anaconda.com/download
# Create a new environment
conda create -n ml-env python=3.12
conda activate ml-env
# Install core packages
conda install numpy pandas matplotlib scikit-learn jupyter
conda install -c conda-forge xgboost lightgbm imbalanced-learn
# PyTorch and TensorFlow now recommend pip installs (official conda builds
# are discontinued or lag behind); run these inside the activated env
pip install torch torchvision torchaudio
pip install tensorflow
Option 2: Google Colab
Google Colab provides a free, cloud-based environment with GPU support:
- Go to https://colab.research.google.com/
- Create a new notebook
- Install required packages:
!pip install numpy pandas matplotlib scikit-learn xgboost lightgbm imbalanced-learn tensorflow torch torchvision
Essential Libraries
# Core data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Machine learning
from sklearn import datasets, model_selection, preprocessing, metrics
import xgboost as xgb
import lightgbm as lgb
import tensorflow as tf
import torch
import torch.nn as nn
Data Preprocessing: The Foundation of ML
Quality data preprocessing is crucial for successful machine learning. Let's explore essential techniques:
Handling Missing Values
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
# Create sample data with missing values
data = pd.DataFrame({
'age': [25, 30, np.nan, 35, 40, 45, np.nan, 50],
'income': [50000, np.nan, 75000, 80000, np.nan, 110000, 120000, 130000],
'department': ['HR', 'IT', 'IT', np.nan, 'Finance', 'Finance', 'HR', 'IT']
})
# 1. Simple imputation
# For numerical features
num_imputer = SimpleImputer(strategy='mean')
data[['age', 'income']] = num_imputer.fit_transform(data[['age', 'income']])
# For categorical features
cat_imputer = SimpleImputer(strategy='most_frequent')
data['department'] = cat_imputer.fit_transform(data[['department']]).ravel()
# 2. KNN imputation as a more sophisticated alternative
# Note: pick one strategy; `data` was already imputed above, so in real code
# apply the KNNImputer while the columns still contain NaNs
knn_imputer = KNNImputer(n_neighbors=2)
data_imputed = knn_imputer.fit_transform(data[['age', 'income']])
Feature Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Sample data
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
# Standardization (Z-score normalization)
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
# Min-Max scaling
minmax_scaler = MinMaxScaler()
X_minmax = minmax_scaler.fit_transform(X)
# Robust scaling (less sensitive to outliers)
robust_scaler = RobustScaler()
X_robust = robust_scaler.fit_transform(X)
Encoding Categorical Variables
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
# Sample data
categories = ['red', 'blue', 'green', 'red', 'blue']
# One-hot encoding
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoded = onehot_encoder.fit_transform(np.array(categories).reshape(-1, 1))
# Label encoding
label_encoder = LabelEncoder()
label_encoded = label_encoder.fit_transform(categories)
# Ordinal encoding for ordered categories
size_categories = ['S', 'M', 'L', 'XL', 'XXL']
ordinal_encoder = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL', 'XXL']])
ordinal_encoded = ordinal_encoder.fit_transform(np.array(size_categories).reshape(-1, 1))
Data Splitting
from sklearn.model_selection import train_test_split
# Load sample dataset
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
# Basic train-test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# For time-series data, use TimeSeriesSplit, which never trains on the future
# (shown on iris only to illustrate the API; real usage needs time-ordered data)
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
Core Machine Learning Algorithms
Linear Regression
Linear regression models the relationship between a dependent variable and one or more independent variables using a linear equation.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Generate sample data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Create and train the model
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Make predictions
X_new = np.array([[0], [2]])
y_pred = lin_reg.predict(X_new)
# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.5, label='Data')
plt.plot(X_new, y_pred, 'r-', linewidth=2, label='Linear Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Example')
plt.legend()
plt.grid(True)
plt.show()
# Model evaluation
y_pred_all = lin_reg.predict(X)
mse = mean_squared_error(y, y_pred_all)
r2 = r2_score(y, y_pred_all)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
print(f"Intercept: {lin_reg.intercept_[0]:.2f}")
print(f"Coefficient: {lin_reg.coef_[0][0]:.2f}")
Logistic Regression
Logistic regression is used for binary classification problems, modeling the probability that an instance belongs to a particular class.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
from sklearn.model_selection import train_test_split
# Generate sample data
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
n_informative=2, random_state=42, n_clusters_per_class=1)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train the model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
# Make predictions
y_pred = log_reg.predict(X_test)
y_pred_proba = log_reg.predict_proba(X_test)[:, 1]
# Plot decision boundary
def plot_decision_boundary(X, y, model):
h = 0.02 # Step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Logistic Regression Decision Boundary')
plt.figure(figsize=(10, 6))
plot_decision_boundary(X_test, y_test, log_reg)
plt.show()
# Model evaluation
print("Classification Report:")
print(classification_report(y_test, y_pred))
# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
Decision Trees and Random Forests
Decision trees are versatile algorithms that can perform both classification and regression tasks. Random forests combine multiple decision trees to improve performance and reduce overfitting.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
class_names = iris.target_names
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Decision Tree
dt_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_clf.fit(X_train, y_train)
# Visualize the decision tree
plt.figure(figsize=(20,10))
plot_tree(dt_clf, feature_names=feature_names, class_names=class_names,
filled=True, rounded=True, fontsize=10)
plt.title("Decision Tree Visualization")
plt.show()
# Random Forest
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
# Feature importance
importances = rf_clf.feature_importances_
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), [feature_names[i] for i in indices], rotation=45)
plt.tight_layout()
plt.show()
# Model evaluation
dt_pred = dt_clf.predict(X_test)
rf_pred = rf_clf.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, dt_pred))
print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))
# Compare performance with different number of trees
n_estimators = [1, 5, 10, 50, 100, 200, 500]
train_scores = []
test_scores = []
for n in n_estimators:
rf = RandomForestClassifier(n_estimators=n, random_state=42)
rf.fit(X_train, y_train)
train_scores.append(rf.score(X_train, y_train))
test_scores.append(rf.score(X_test, y_test))
plt.figure(figsize=(10, 6))
plt.plot(n_estimators, train_scores, label='Train Score')
plt.plot(n_estimators, test_scores, label='Test Score')
plt.xlabel('Number of Trees')
plt.ylabel('Accuracy')
plt.title('Random Forest Performance vs. Number of Trees')
plt.legend()
plt.grid(True)
plt.xscale('log')
plt.show()
Support Vector Machines (SVMs)
SVMs are powerful algorithms for both classification and regression tasks, particularly effective in high-dimensional spaces.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.inspection import DecisionBoundaryDisplay
# Generate sample data
X, y = datasets.make_classification(n_samples=300, n_features=2, n_redundant=0,
n_informative=2, random_state=42, n_clusters_per_class=1)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train different SVM kernels
kernels = ['linear', 'rbf', 'poly']
svm_models = {}
for kernel in kernels:
svm = SVC(kernel=kernel, random_state=42, probability=True)
svm.fit(X_train_scaled, y_train)
svm_models[kernel] = svm
print(f"{kernel.upper()} Kernel - Train Accuracy: {svm.score(X_train_scaled, y_train):.3f}, "
f"Test Accuracy: {svm.score(X_test_scaled, y_test):.3f}")
# Plot decision boundaries using sklearn's DecisionBoundaryDisplay
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, (kernel, svm) in zip(axes, svm_models.items()):
DecisionBoundaryDisplay.from_estimator(
svm, X_train_scaled, ax=ax, response_method="predict", alpha=0.4
)
ax.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train, edgecolors='k', s=20)
ax.set_title(f'SVM with {kernel.upper()} Kernel')
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
plt.tight_layout()
plt.show()
# Hyperparameter tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': ['scale', 'auto', 0.1, 1, 10],
'kernel': ['rbf', 'poly', 'sigmoid']
}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score: {:.3f}".format(grid_search.best_score_))
print("Test set score: {:.3f}".format(grid_search.score(X_test_scaled, y_test)))
Neural Networks: A Gentle Introduction
Neural networks are computing systems inspired by biological neural networks, capable of learning complex patterns in data.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.datasets import mnist
import matplotlib.pyplot as plt
import numpy as np
# Load and preprocess MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Normalize pixel values
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0
# Reshape the data
X_train = X_train.reshape(-1, 28*28)
X_test = X_test.reshape(-1, 28*28)
# Convert labels to one-hot encoding
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)
# Build the neural network
model = Sequential([
    # Keras 3 style: declare the input shape with an InputLayer
    # instead of the deprecated input_shape= argument on the first Dense
    tf.keras.layers.InputLayer(shape=(784,)),
    Dense(512, activation='relu'),
    Dropout(0.2),
    Dense(256, activation='relu'),
    Dropout(0.2),
    Dense(128, activation='relu'),
    Dropout(0.2),
    Dense(10, activation='softmax')
])
# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001),
loss='categorical_crossentropy',
metrics=['accuracy'])
# Model summary
model.summary()
# Train the model
history = model.fit(X_train, y_train,
batch_size=128,
epochs=20,
validation_split=0.2,
verbose=1)
# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest accuracy: {test_accuracy:.4f}")
# Plot training history
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.tight_layout()
plt.show()
# Make predictions
predictions = model.predict(X_test[:5])
predicted_classes = np.argmax(predictions, axis=1)
# Display some predictions
plt.figure(figsize=(15, 3))
for i in range(5):
plt.subplot(1, 5, i+1)
plt.imshow(X_test[i].reshape(28, 28), cmap='gray')
plt.title(f"Pred: {predicted_classes[i]}")
plt.axis('off')
plt.show()
Model Evaluation and Selection
Proper model evaluation is crucial for assessing performance and making informed decisions about model selection.
Classification Metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, confusion_matrix,
classification_report, roc_curve, auc)
import seaborn as sns
import matplotlib.pyplot as plt
# For binary classification
def evaluate_classification(y_true, y_pred, y_pred_proba=None):
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1-Score:", f1_score(y_true, y_pred))
if y_pred_proba is not None:
print("ROC AUC:", roc_auc_score(y_true, y_pred_proba))
# Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
# ROC Curve
if y_pred_proba is not None:
fpr, tpr, _ = roc_curve(y_true, y_pred_proba)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2,
label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
# For multi-class classification
def evaluate_multiclass(y_true, y_pred, class_names):
print("Classification Report:")
print(classification_report(y_true, y_pred, target_names=class_names))
cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=class_names, yticklabels=class_names)
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.show()
Cross-Validation Techniques
from sklearn.model_selection import (cross_val_score, KFold, StratifiedKFold,
cross_validate, learning_curve)
import numpy as np
import matplotlib.pyplot as plt
def perform_cross_validation(model, X, y, cv=5):
# Simple cross-validation
cv_scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Mean CV Accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# Stratified K-Fold for imbalanced datasets
skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
stratified_scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print(f"\nStratified CV Scores: {stratified_scores}")
print(f"Mean Stratified CV Accuracy: {stratified_scores.mean():.3f} ± {stratified_scores.std():.3f}")
# Cross-validate with multiple metrics
scoring = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
cv_results = cross_validate(model, X, y, cv=cv, scoring=scoring)
print("\nDetailed Cross-Validation Results:")
for metric in scoring:
scores = cv_results[f'test_{metric}']
print(f"{metric}: {scores.mean():.3f} ± {scores.std():.3f}")
# Plot learning curves
train_sizes, train_scores, test_scores = learning_curve(
model, X, y, cv=cv, n_jobs=-1,
train_sizes=np.linspace(0.1, 1.0, 10),
scoring='accuracy'
)
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, color='blue', marker='o',
markersize=5, label='Training accuracy')
plt.fill_between(train_sizes, train_mean + train_std, train_mean - train_std,
alpha=0.15, color='blue')
plt.plot(train_sizes, test_mean, color='green', linestyle='--',
marker='s', markersize=5, label='Validation accuracy')
plt.fill_between(train_sizes, test_mean + test_std, test_mean - test_std,
alpha=0.15, color='green')
plt.grid()
plt.xlabel('Number of training examples')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.title('Learning Curves')
plt.show()
# Example usage
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
model = RandomForestClassifier(n_estimators=100, random_state=42)
perform_cross_validation(model, X, y)
Bias-Variance Tradeoff
Understanding the bias-variance tradeoff is essential for model selection and preventing overfitting or underfitting.
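For squared-error loss, the tradeoff comes from a standard identity: the expected prediction error at a point x decomposes into three terms (here f is the true function, f-hat the learned model, and sigma-squared the noise variance):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  \;+\; \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  \;+\; \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The simulation below estimates the first two terms empirically by refitting the model on noisy resamples of the training targets.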
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# Generate sample data
np.random.seed(42)
X = np.linspace(-3, 3, 100)
y = np.sin(X) + np.random.normal(0, 0.1, 100)
# Reshape for sklearn
X = X.reshape(-1, 1)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Function to fit polynomial regression
def fit_polynomial(degree):
polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
linear_regression = LinearRegression()
pipeline = Pipeline([
("polynomial_features", polynomial_features),
("linear_regression", linear_regression)
])
pipeline.fit(X_train, y_train)
return pipeline
# Calculate bias and variance
def calculate_bias_variance(model, X_test, y_test, n_iterations=100):
y_preds = np.zeros((n_iterations, len(X_test)))
for i in range(n_iterations):
# Add noise to training data
y_train_noisy = y_train + np.random.normal(0, 0.1, len(y_train))
model.fit(X_train, y_train_noisy)
y_preds[i] = model.predict(X_test)
# Calculate variance and bias
variance = np.var(y_preds, axis=0)
bias = (y_test - np.mean(y_preds, axis=0))**2
return np.mean(bias), np.mean(variance)
# Evaluate different polynomial degrees
degrees = range(1, 15)
biases = []
variances = []
train_errors = []
test_errors = []
for degree in degrees:
model = fit_polynomial(degree)
bias, variance = calculate_bias_variance(model, X_test, y_test)
biases.append(bias)
variances.append(variance)
# Training and test errors
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)
train_errors.append(mean_squared_error(y_train, train_pred))
test_errors.append(mean_squared_error(y_test, test_pred))
# Plot bias-variance tradeoff
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.plot(degrees, biases, label='Bias²')
plt.plot(degrees, variances, label='Variance')
plt.plot(degrees, np.array(biases) + np.array(variances), 'k--', label='Bias² + Variance')
plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylabel('Error')
plt.title('Bias-Variance Tradeoff')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(degrees, train_errors, 'b-', label='Training Error')
plt.plot(degrees, test_errors, 'r-', label='Test Error')
plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylabel('Mean Squared Error')
plt.title('Training vs Test Error')
plt.legend()
plt.tight_layout()
plt.show()
# Find optimal model complexity
optimal_degree = degrees[np.argmin(test_errors)]
print(f"Optimal polynomial degree: {optimal_degree}")
# Plot best fit
best_model = fit_polynomial(optimal_degree)
X_plot = np.linspace(-3, 3, 100).reshape(-1, 1)
y_plot = best_model.predict(X_plot)
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', alpha=0.5, label='Training Data')
plt.scatter(X_test, y_test, color='red', alpha=0.5, label='Test Data')
plt.plot(X_plot, y_plot, 'k-', linewidth=2, label=f'Polynomial (degree {optimal_degree})')
plt.plot(X_plot, np.sin(X_plot), 'g--', label='True Function')
plt.xlabel('X')
plt.ylabel('y')
plt.title(f'Best Fit Polynomial (Degree {optimal_degree})')
plt.legend()
plt.grid(True)
plt.show()
Real-World Case Studies
Fraud Detection: Using ML to Identify Fraudulent Transactions
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
# Load and preprocess the dataset
# Note: In practice, use your own dataset or download from a reliable source
# For this example, we'll create a synthetic dataset
np.random.seed(42)
n_samples = 10000
n_features = 20
# Generate synthetic data
X = np.random.randn(n_samples, n_features)
# Create synthetic fraud labels (highly imbalanced)
fraud = np.random.binomial(1, 0.01, n_samples)
# Add some patterns to the fraud cases
X[fraud == 1, 0] += 3 # First feature is higher for fraud
X[fraud == 1, 1] -= 2 # Second feature is lower for fraud
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, fraud, test_size=0.3, stratify=fraud, random_state=42
)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Handle class imbalance with SMOTE
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_scaled, y_train)
# Train Isolation Forest for anomaly detection
iso_forest = IsolationForest(contamination=0.01, random_state=42)
iso_forest.fit(X_train_scaled)
# Predict anomalies (1 for inliers, -1 for outliers)
y_pred_iso = iso_forest.predict(X_test_scaled)
# Convert to binary (0 for inliers, 1 for outliers)
y_pred_iso = (y_pred_iso == -1).astype(int)
# Train a classifier (XGBoost) on the balanced data
import xgboost as xgb
# Note: after SMOTE the classes are already balanced, so scale_pos_weight
# would equal 1 and do nothing; use it as an alternative to SMOTE when
# training on the original imbalanced data instead
xgb_model = xgb.XGBClassifier(random_state=42)
xgb_model.fit(X_train_balanced, y_train_balanced)
# Make predictions
y_pred_xgb = xgb_model.predict(X_test_scaled)
y_pred_proba_xgb = xgb_model.predict_proba(X_test_scaled)[:, 1]
# Evaluate models
def evaluate_fraud_detection(y_true, y_pred, model_name):
print(f"\n{model_name} Performance:")
print("Confusion Matrix:")
cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title(f'Confusion Matrix - {model_name}')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
print("\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=['Legitimate', 'Fraud']))
evaluate_fraud_detection(y_test, y_pred_iso, "Isolation Forest")
evaluate_fraud_detection(y_test, y_pred_xgb, "XGBoost")
# Plot feature importance
plt.figure(figsize=(12, 6))
xgb.plot_importance(xgb_model, max_num_features=10)
plt.title('Feature Importance - XGBoost')
plt.show()
# Threshold optimization
from sklearn.metrics import precision_recall_curve, average_precision_score
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_xgb)
average_precision = average_precision_score(y_test, y_pred_proba_xgb)
plt.figure(figsize=(10, 6))
plt.step(recall, precision, where='post', label=f'Precision-Recall (AP = {average_precision:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True)
plt.show()
# Find optimal threshold based on F1-score
# precision and recall have one more element than thresholds, so drop the
# final point before matching F1-scores back to thresholds
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-9)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]
print(f"Optimal threshold: {optimal_threshold:.3f}")
print(f"Optimal F1-score: {f1_scores[optimal_idx]:.3f}")
# Apply optimal threshold
y_pred_optimal = (y_pred_proba_xgb >= optimal_threshold).astype(int)
evaluate_fraud_detection(y_test, y_pred_optimal, "XGBoost (Optimal Threshold)")
Image Recognition: Building a Simple Image Classifier
import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
import matplotlib.pyplot as plt
import numpy as np
# Load and preprocess CIFAR-10 dataset
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
# Normalize pixel values
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0
# Convert class vectors to binary class matrices
num_classes = 10
y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)
# Data augmentation using tf.keras layers (ImageDataGenerator is deprecated)
data_augmentation = tf.keras.Sequential([
tf.keras.layers.RandomFlip("horizontal"),
tf.keras.layers.RandomRotation(0.05),
tf.keras.layers.RandomZoom(0.1),
tf.keras.layers.RandomTranslation(0.1, 0.1),
])
# Build the CNN model (augmentation is applied inside the model)
def create_cnn_model():
model = Sequential([
# InputLayer defines the expected shape for the Sequential model
tf.keras.layers.InputLayer(shape=(32, 32, 3)),
# Data augmentation (only active during training)
data_augmentation,
# First convolutional block
Conv2D(32, (3, 3), padding='same', activation='relu'),
BatchNormalization(),
Conv2D(32, (3, 3), padding='same', activation='relu'),
BatchNormalization(),
MaxPooling2D((2, 2)),
Dropout(0.25),
# Second convolutional block
Conv2D(64, (3, 3), padding='same', activation='relu'),
BatchNormalization(),
Conv2D(64, (3, 3), padding='same', activation='relu'),
BatchNormalization(),
MaxPooling2D((2, 2)),
Dropout(0.3),
# Third convolutional block
Conv2D(128, (3, 3), padding='same', activation='relu'),
BatchNormalization(),
Conv2D(128, (3, 3), padding='same', activation='relu'),
BatchNormalization(),
MaxPooling2D((2, 2)),
Dropout(0.4),
# Dense layers
Flatten(),
Dense(512, activation='relu'),
BatchNormalization(),
Dropout(0.5),
Dense(num_classes, activation='softmax')
])
return model
# Create and compile the model
model = create_cnn_model()
model.compile(optimizer=Adam(learning_rate=0.001),
loss='categorical_crossentropy',
metrics=['accuracy'])
# Callbacks
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=1e-6)
# Train the model (augmentation happens inside the model via the Sequential layers)
batch_size = 64
epochs = 50
history = model.fit(
X_train, y_train,
batch_size=batch_size,
epochs=epochs,
validation_data=(X_test, y_test),
callbacks=[early_stopping, reduce_lr],
verbose=1
)
# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest accuracy: {test_accuracy:.4f}")
# Plot training history
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.tight_layout()
plt.show()
# Make predictions
predictions = model.predict(X_test[:10])
predicted_classes = np.argmax(predictions, axis=1)

# Class names for CIFAR-10
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# Display some predictions
plt.figure(figsize=(15, 5))
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(X_test[i])
    true_label = class_names[np.argmax(y_test[i])]
    pred_label = class_names[predicted_classes[i]]
    plt.title(f"True: {true_label}\nPred: {pred_label}")
    plt.axis('off')
plt.tight_layout()
plt.show()
# Confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns

y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_true = np.argmax(y_test, axis=1)

cm = confusion_matrix(y_true, y_pred_classes)
plt.figure(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
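One more diagnostic worth running on the same confusion matrix: per-class recall is just the diagonal divided by each row sum, and it often reveals that a respectable overall accuracy hides weak classes (cats and dogs are the usual CIFAR-10 offenders). A minimal sketch with a small stand-in matrix; in the code above, `cm` comes from `confusion_matrix(y_true, y_pred_classes)`:

```python
import numpy as np

# Per-class recall from a confusion matrix: diagonal / row sums.
# `cm` here is a small 3-class stand-in for illustration.
cm = np.array([[50,  5,  5],
               [10, 40, 10],
               [ 0,  5, 55]])
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)  # one recall value per class
```

With the real 10-class matrix, pairing these values with `class_names` makes it easy to see which categories need more data or augmentation.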
Recommendation Systems: Implementing Collaborative Filtering
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.sparse.linalg import svds
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Generate synthetic user-item interaction data
np.random.seed(42)
n_users = 1000
n_items = 200
n_ratings = 10000

# Sample random (user, item, rating) triples
user_ids = np.random.randint(0, n_users, n_ratings)
item_ids = np.random.randint(0, n_items, n_ratings)
ratings = np.random.randint(1, 6, n_ratings)  # Ratings from 1 to 5

# Create DataFrame and drop duplicate user-item pairs
ratings_df = pd.DataFrame({
    'user_id': user_ids,
    'item_id': item_ids,
    'rating': ratings
})
ratings_df = ratings_df.drop_duplicates(['user_id', 'item_id'])

# Create the full user-item matrix (0 = unrated)
user_item_matrix = ratings_df.pivot(
    index='user_id',
    columns='item_id',
    values='rating'
).fillna(0).values
# Split into train and test sets
train_data, test_data = train_test_split(ratings_df, test_size=0.2, random_state=42)

# Build dense train and test matrices
train_matrix = np.zeros((n_users, n_items))
for _, row in train_data.iterrows():
    train_matrix[row['user_id'], row['item_id']] = row['rating']

test_matrix = np.zeros((n_users, n_items))
for _, row in test_data.iterrows():
    test_matrix[row['user_id'], row['item_id']] = row['rating']

# Global average over observed ratings only
global_average = np.mean(train_matrix[train_matrix != 0])

# User and item biases: each user's/item's mean deviation from the global average
user_bias = np.zeros(n_users)
item_bias = np.zeros(n_items)

for i in range(n_users):
    user_ratings = train_matrix[i, :]
    if np.sum(user_ratings != 0) > 0:
        user_bias[i] = np.mean(user_ratings[user_ratings != 0]) - global_average

for j in range(n_items):
    item_ratings = train_matrix[:, j]
    if np.sum(item_ratings != 0) > 0:
        item_bias[j] = np.mean(item_ratings[item_ratings != 0]) - global_average

# Center the training matrix by removing the baseline (global + user + item bias)
centered_train_matrix = train_matrix.copy()
for i in range(n_users):
    for j in range(n_items):
        if train_matrix[i, j] != 0:
            centered_train_matrix[i, j] -= global_average + user_bias[i] + item_bias[j]
# Perform truncated SVD on the centered matrix
k = 20  # Number of latent factors
U, sigma, Vt = svds(centered_train_matrix, k=k)
sigma = np.diag(sigma)  # Convert singular values to a diagonal matrix

# Reconstruct ratings: low-rank approximation plus the baseline terms
predicted_ratings = (U @ sigma @ Vt
                     + global_average
                     + user_bias[:, np.newaxis]
                     + item_bias[np.newaxis, :])
predicted_ratings = np.clip(predicted_ratings, 1, 5)  # Clip to the valid rating range

# RMSE over observed test entries only
def calculate_rmse(actual, predicted, mask):
    return np.sqrt(mean_squared_error(actual[mask], predicted[mask]))

test_mask = test_matrix != 0
rmse = calculate_rmse(test_matrix, predicted_ratings, test_mask)
print(f"Test RMSE: {rmse:.4f}")

# Top-N recommendations: highest predicted ratings among items the user hasn't rated
def get_top_n_recommendations(user_id, n=5):
    user_ratings = predicted_ratings[user_id, :]
    unrated_items = np.where(train_matrix[user_id, :] == 0)[0]
    predicted_unrated = user_ratings[unrated_items]
    top_n_indices = np.argsort(predicted_unrated)[::-1][:n]
    top_n_items = unrated_items[top_n_indices]
    top_n_ratings = predicted_unrated[top_n_indices]
    return list(zip(top_n_items, top_n_ratings))

# Example: Get top 5 recommendations for user 0
user_id = 0
recommendations = get_top_n_recommendations(user_id, n=5)
print(f"\nTop 5 recommendations for user {user_id}:")
for item_id, predicted_rating in recommendations:
    print(f"Item {item_id}: Predicted rating = {predicted_rating:.2f}")
# Visualize latent factors
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(U[:, 0], U[:, 1], alpha=0.5)
plt.title('User Latent Factors')
plt.xlabel('Latent Factor 1')
plt.ylabel('Latent Factor 2')
plt.subplot(1, 2, 2)
plt.scatter(Vt[0, :], Vt[1, :], alpha=0.5)
plt.title('Item Latent Factors')
plt.xlabel('Latent Factor 1')
plt.ylabel('Latent Factor 2')
plt.tight_layout()
plt.show()
# Evaluate different numbers of latent factors
k_values = [5, 10, 20, 30, 50]
rmse_scores = []

for k in k_values:
    U, sigma, Vt = svds(centered_train_matrix, k=k)
    sigma = np.diag(sigma)
    predicted_ratings = (U @ sigma @ Vt
                         + global_average
                         + user_bias[:, np.newaxis]
                         + item_bias[np.newaxis, :])
    predicted_ratings = np.clip(predicted_ratings, 1, 5)
    rmse = calculate_rmse(test_matrix, predicted_ratings, test_mask)
    rmse_scores.append(rmse)
    print(f"k = {k}: RMSE = {rmse:.4f}")
# Plot RMSE vs number of latent factors
plt.figure(figsize=(10, 6))
plt.plot(k_values, rmse_scores, 'bo-')
plt.xlabel('Number of Latent Factors (k)')
plt.ylabel('RMSE')
plt.title('RMSE vs Number of Latent Factors')
plt.grid(True)
plt.show()
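SVD is only one way to do collaborative filtering. A neighborhood-based alternative scores items for a user by item-item cosine similarity over the rating columns. The sketch below is a minimal illustration, not a library API; the `item_item_scores` helper is our own, and it uses the same 0-means-unrated convention as the matrices above:

```python
import numpy as np

def item_item_scores(train_matrix, user_id, eps=1e-9):
    """Neighborhood-based CF: score every item for one user using
    item-item cosine similarity over rating columns (0 = unrated)."""
    R = np.asarray(train_matrix, dtype=float)
    norms = np.linalg.norm(R, axis=0) + eps
    sim = (R.T @ R) / np.outer(norms, norms)  # item-item cosine similarity
    np.fill_diagonal(sim, 0.0)                # ignore self-similarity
    user = R[user_id]
    # Similarity-weighted average of the user's observed ratings
    return sim @ user / (np.abs(sim) @ (user > 0) + eps)

# Toy matrix (users x items): user 0 likes items 0 and 1, hasn't seen item 2
R = np.array([[5, 4, 0],
              [4, 5, 1],
              [1, 0, 5]], dtype=float)
scores = item_item_scores(R, user_id=0)
print(scores.round(2))
```

Neighborhood methods are easier to explain ("recommended because you liked X") but scale worse than the factorization approach above, since the similarity matrix is quadratic in the number of items.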
Ethical Considerations in Machine Learning
As machine learning systems become increasingly integrated into decision-making processes, it's crucial to address the ethical implications of these technologies.
Bias in Datasets and Algorithms
Machine learning models can perpetuate or even amplify existing biases present in training data. Common sources of bias include:
- Sampling Bias: When the training data doesn't represent the target population
- Measurement Bias: When the way data is collected introduces systematic errors
- Historical Bias: When historical inequalities are reflected in the data
# Example: Detecting bias in a dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulate a biased hiring dataset
np.random.seed(42)
n_samples = 1000

# Generate synthetic candidate features
data = pd.DataFrame({
    'age': np.random.normal(35, 10, n_samples).astype(int),
    'gender': np.random.choice(['M', 'F'], n_samples, p=[0.7, 0.3]),
    'experience': np.random.normal(10, 5, n_samples).clip(0, 30),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'],
                                  n_samples, p=[0.2, 0.5, 0.2, 0.1]),
    'hired': np.zeros(n_samples)
})

# Introduce bias in hiring decisions
for i in range(n_samples):
    hire_prob = 0.3  # Base probability
    if data.loc[i, 'gender'] == 'F':
        hire_prob *= 0.7  # Gender bias
    if data.loc[i, 'age'] > 40:
        hire_prob *= 0.6  # Age bias
    if data.loc[i, 'education'] in ['Master', 'PhD']:
        hire_prob *= 1.3  # Education boost
    data.loc[i, 'hired'] = np.random.binomial(1, hire_prob)
# Analyze bias
print("Hiring Rates by Gender:")
print(data.groupby('gender')['hired'].mean())
print("\nHiring Rates by Age Group:")
data['age_group'] = pd.cut(data['age'], bins=[0, 30, 40, 50, 100])
print(data.groupby('age_group')['hired'].mean())
# Visualize bias
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
data.groupby('gender')['hired'].mean().plot(kind='bar')
plt.title('Hiring Rate by Gender')
plt.ylabel('Hiring Rate')
plt.subplot(1, 2, 2)
data.groupby('education')['hired'].mean().sort_values().plot(kind='bar')
plt.title('Hiring Rate by Education Level')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Fairness and Accountability
Ensuring fairness in ML models requires careful consideration of different fairness metrics and trade-offs:
- Demographic Parity: Equal selection rates across groups
- Equal Opportunity: Equal true positive rates across groups
- Equalized Odds: Equal true positive and false positive rates across groups
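These definitions translate directly into code. The sketch below computes the demographic parity gap (difference in selection rates) and the equal opportunity gap (difference in true positive rates) across groups; the `fairness_gaps` helper is our own illustration, not a library API, and libraries such as Fairlearn provide production-grade equivalents:

```python
import numpy as np

def fairness_gaps(y_true, y_pred, sensitive):
    """Return (demographic parity difference, equal opportunity difference)
    for a binary classifier, computed as max minus min across groups."""
    y_true, y_pred, sensitive = map(np.asarray, (y_true, y_pred, sensitive))
    sel_rates, tprs = [], []
    for g in np.unique(sensitive):
        mask = sensitive == g
        sel_rates.append(y_pred[mask].mean())       # P(pred=1 | group)
        pos = mask & (y_true == 1)
        tprs.append(y_pred[pos].mean() if pos.any() else np.nan)
    return (max(sel_rates) - min(sel_rates),
            float(np.nanmax(tprs) - np.nanmin(tprs)))

# Toy example: group B is selected far less often than group A
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 0, 0])
group = np.array(['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'])
dp_gap, eo_gap = fairness_gaps(y_true, y_pred, group)
print(f"Demographic parity difference: {dp_gap:.2f}")
print(f"Equal opportunity difference:  {eo_gap:.2f}")
```

A gap of 0 means the groups are treated identically under that metric; note that the two metrics can disagree, which is why choosing the right one for your domain matters.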
# Example: Evaluating model fairness
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Train a model on the biased data
X = data[['age', 'experience']].copy()
y = data['hired']

# Add gender as a feature (to demonstrate how the model picks up the bias)
X['is_female'] = (data['gender'] == 'F').astype(int)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
# Evaluate fairness: overall and per-group TPR, FPR, and selection rate
def evaluate_fairness(y_true, y_pred, sensitive_feature, feature_name):
    # Overall rates, as a baseline for the per-group numbers
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)  # True Positive Rate
    fpr = fp / (fp + tn)  # False Positive Rate
    print(f"Overall TPR: {tpr:.3f}, Overall FPR: {fpr:.3f}")

    # Calculate metrics by group
    groups = np.unique(sensitive_feature)
    group_metrics = {}
    for group in groups:
        mask = (sensitive_feature == group)
        group_tn, group_fp, group_fn, group_tp = confusion_matrix(
            y_true[mask], y_pred[mask]
        ).ravel()
        group_tpr = group_tp / (group_tp + group_fn)
        group_fpr = group_fp / (group_fp + group_tn)
        group_metrics[group] = {
            'tpr': group_tpr,
            'fpr': group_fpr,
            'selection_rate': (group_tp + group_fp) / len(y_true[mask])
        }

    # Print fairness metrics
    print(f"\nFairness Metrics for {feature_name}:")
    for group, metrics in group_metrics.items():
        print(f"\nGroup {group}:")
        print(f"  True Positive Rate: {metrics['tpr']:.3f}")
        print(f"  False Positive Rate: {metrics['fpr']:.3f}")
        print(f"  Selection Rate: {metrics['selection_rate']:.3f}")

    # Disparate impact ratio: min selection rate over max selection rate
    selection_rates = [m['selection_rate'] for m in group_metrics.values()]
    disparate_impact = min(selection_rates) / max(selection_rates)
    print(f"\nDisparate Impact Ratio: {disparate_impact:.3f}")
    print("(Values below 0.8 may indicate adverse impact)")
    return group_metrics

# Evaluate fairness across gender
gender_test = X_test['is_female'].values
evaluate_fairness(y_test.values, y_pred, gender_test, "Gender")
Practical Steps Toward Responsible ML
Building fair and accountable ML systems is not just about metrics — it requires deliberate process decisions throughout the project lifecycle:
- Audit your training data before modeling. Check for demographic imbalances, label quality, and historical patterns that could encode discrimination.
- Choose fairness constraints that match your domain. Demographic parity works for some contexts; equal opportunity works for others. There is no universal "fair" threshold.
- Document model decisions with model cards or datasheets. Record what the model was trained on, what it optimizes for, who it was tested against, and where it should not be deployed.
- Monitor after deployment. Fairness metrics can shift as the population changes. Set up automated monitoring to catch drift in selection rates across groups.
Libraries like Fairlearn and AI Fairness 360 provide additional tooling for measuring and mitigating bias in production systems.
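The monitoring step can be automated with very little code. The sketch below computes per-period selection rates by group and flags periods where the disparate impact ratio drops below the four-fifths threshold; the `flag_selection_drift` helper and its column names are our own convention, not a standard schema:

```python
import pandas as pd

def flag_selection_drift(df, threshold=0.8):
    """Monitoring sketch: selection rate per (period, group), flagging
    periods whose min/max rate ratio falls below `threshold`.
    Expects columns ['period', 'group', 'selected'] (our convention)."""
    rates = df.groupby(['period', 'group'])['selected'].mean().unstack('group')
    ratio = rates.min(axis=1) / rates.max(axis=1)  # disparate impact per period
    return pd.DataFrame({'impact_ratio': ratio, 'flagged': ratio < threshold})

# Toy decision log: parity in week 1, drift toward group A in week 2
log = pd.DataFrame({
    'period':   ['w1'] * 4 + ['w2'] * 4,
    'group':    ['A', 'A', 'B', 'B'] * 2,
    'selected': [1, 0, 1, 0,  1, 1, 1, 0],
})
report = flag_selection_drift(log)
print(report)
```

In production you would run this on the model's decision log per deployment window and alert on flagged periods, rather than waiting for a periodic audit to surface the drift.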
Where to Go from Here
This guide covered the foundational workflow — preprocessing, algorithms, evaluation, and ethics. If you want to go deeper into specific areas, here are some next steps worth exploring:
- Deep learning architectures: convolutional networks for images, recurrent networks and transformers for sequences. Our guide on deep learning fundamentals walks through the core concepts with code.
- Building neural networks from scratch: if you want to understand what happens inside model.fit(), the neural networks from scratch guide implements forward and backward passes manually.
- End-to-end project structure: preprocessing and modeling are only part of the pipeline. The data science project guide covers packaging, deployment, and monitoring.
- Local AI and RAG systems: if you are interested in running models on your own hardware, the local AI guide covers Ollama, retrieval-augmented generation, and agent workflows.
Machine learning moves fast — the libraries and best practices in this guide reflect the state of things in early 2026. The fundamentals (linear algebra, probability, optimization, evaluation methodology) change much more slowly than the frameworks. Invest in understanding the math, and the tooling transitions will feel natural.