Data Validation & Testing

Pandera for DataFrame Validation

Pandera vs Great Expectations

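Great Expectations is a comprehensive framework for production data pipelines. Pandera is a lighter, code-first library aimed at validation during development and testing. The table below compares the two: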

Aspect             Pandera                     Great Expectations
Use case           Development & unit testing  Production pipelines
Setup              Minimal, just pip install   Project initialization
Integration        Native pytest support       Checkpoint-based
Schema definition  Python classes/decorators   YAML/JSON expectations
Learning curve     Low                         Medium

Choose Pandera when:

  • You want schema validation in unit tests
  • You prefer Python-native definitions
  • You need lightweight validation during development
  • You're already using pytest

Installation

pip install pandera

# For use in tests, install pytest alongside it (Pandera has no "pytest" extra)
pip install pandera pytest

Defining Schemas

Pandera uses Python classes to define schemas:

# schemas.py
import pandera as pa
from pandera.typing import DataFrame, Series
import pandas as pd

class TrainingDataSchema(pa.DataFrameModel):
    """Schema for ML training data."""

    user_id: Series[int] = pa.Field(unique=True, ge=0)
    age: Series[int] = pa.Field(ge=0, le=120)
    income: Series[float] = pa.Field(ge=0)
    category: Series[str] = pa.Field(isin=["A", "B", "C", "D"])
    label: Series[int] = pa.Field(isin=[0, 1])

    class Config:
        coerce = True  # Automatically coerce types
        strict = True  # No extra columns allowed

# Usage
# Note: the return annotation uses pandera.typing.DataFrame, not pd.DataFrame,
# because pd.DataFrame cannot be subscripted with a schema.
@pa.check_types
def load_training_data(path: str) -> DataFrame[TrainingDataSchema]:
    """Load and validate training data."""
    df = pd.read_csv(path)
    return df  # Validated (and coerced) by check_types on return

Schema with Custom Checks

Use @pa.check for logic that applies to a single column and @pa.dataframe_check for logic that spans several columns:

import pandera as pa
from pandera.typing import Series
import pandas as pd

class FeatureSchema(pa.DataFrameModel):
    """Schema for ML features with custom checks."""

    feature_1: Series[float] = pa.Field(ge=-1, le=1)
    feature_2: Series[float] = pa.Field(ge=-1, le=1)
    feature_3: Series[float] = pa.Field(nullable=True)
    timestamp: Series[pd.Timestamp]

    @pa.check("feature_1")
    def check_normalized(cls, series: Series[float]) -> bool:
        """Verify feature is approximately normalized."""
        return series.std() < 2.0

    @pa.dataframe_check
    def check_no_duplicate_timestamps(cls, df: pd.DataFrame) -> bool:
        """Ensure no duplicate timestamps."""
        return df["timestamp"].is_unique

    @pa.dataframe_check
    def check_feature_correlation(cls, df: pd.DataFrame) -> bool:
        """Ensure features aren't perfectly correlated."""
        corr = df[["feature_1", "feature_2"]].corr().iloc[0, 1]
        return abs(corr) < 0.99

Pytest Integration

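Pandera needs no plugin to work with pytest. Call validate() inside a test, then assert on the returned DataFrame or on the exception it raises: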

# tests/test_data_validation.py
import pytest
import pandera as pa
import pandas as pd

from schemas import TrainingDataSchema

class TestTrainingData:
    """Test training data validation."""

    def test_valid_data(self):
        """Test that valid data passes validation."""
        df = pd.DataFrame({
            "user_id": [1, 2, 3],
            "age": [25, 30, 35],
            "income": [50000.0, 60000.0, 70000.0],
            "category": ["A", "B", "C"],
            "label": [0, 1, 0]
        })

        validated = TrainingDataSchema.validate(df)
        assert len(validated) == 3

    def test_invalid_age(self):
        """Test that invalid age fails validation."""
        df = pd.DataFrame({
            "user_id": [1],
            "age": [150],  # Invalid: > 120
            "income": [50000.0],
            "category": ["A"],
            "label": [0]
        })

        with pytest.raises(pa.errors.SchemaError):
            TrainingDataSchema.validate(df)

    def test_invalid_category(self):
        """Test that invalid category fails validation."""
        df = pd.DataFrame({
            "user_id": [1],
            "age": [25],
            "income": [50000.0],
            "category": ["X"],  # Invalid: not in ["A", "B", "C", "D"]
            "label": [0]
        })

        with pytest.raises(pa.errors.SchemaError):
            TrainingDataSchema.validate(df)

# Parameterized testing with schemas
@pytest.mark.parametrize("schema,data,should_pass", [
    (TrainingDataSchema, {"user_id": [1], "age": [25], "income": [50000.0], "category": ["A"], "label": [0]}, True),
    (TrainingDataSchema, {"user_id": [1], "age": [-5], "income": [50000.0], "category": ["A"], "label": [0]}, False),
])
def test_schema_validation(schema, data, should_pass):
    df = pd.DataFrame(data)
    if should_pass:
        schema.validate(df)
    else:
        with pytest.raises(pa.errors.SchemaError):
            schema.validate(df)

Sharing Test Data with pytest Fixtures

# tests/conftest.py
import pytest
import pandas as pd

@pytest.fixture
def sample_training_data():
    """Fixture providing sample training data."""
    return pd.DataFrame({
        "user_id": range(100),
        "age": [25 + i % 50 for i in range(100)],
        "income": [50000.0 + i * 1000 for i in range(100)],
        "category": ["A", "B", "C", "D"] * 25,
        "label": [0, 1] * 50
    })

# tests/test_with_fixture.py
from schemas import TrainingDataSchema

def test_training_data_schema(sample_training_data):
    """Test training data against schema."""
    validated = TrainingDataSchema.validate(sample_training_data)
    assert len(validated) == 100

CI/CD Integration

# .github/workflows/test-data.yml
name: Data Validation Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install pandera pandas pytest

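      - name: Run schema tests
        # "python -m pytest" adds the working directory to sys.path, so
        # "from schemas import ..." resolves when schemas.py is at the repo root
        run: python -m pytest tests/test_data_validation.py -v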

      - name: Validate production data
        run: |
          python -c "
          import pandas as pd
          from schemas import TrainingDataSchema

          df = pd.read_csv('data/training.csv')
          TrainingDataSchema.validate(df)
          print('Data validation passed!')
          "

Lazy Validation for Better Error Messages

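By default, validation stops at the first failure and raises SchemaError. With lazy=True, Pandera collects every failure and raises a single SchemaErrors exception:

import pandas as pd
import pandera as pa

from schemas import TrainingDataSchema

df = pd.read_csv("data/training.csv")

try:
    # Collect all validation errors instead of stopping at the first one
    TrainingDataSchema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print("Validation errors found:")
    # failure_cases is a DataFrame with one row per failure
    print(err.failure_cases)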

Key Takeaways

Feature             Usage
Schema classes      Define with pa.DataFrameModel
Field validation    Use pa.Field() with constraints
Custom checks       Use @pa.check and @pa.dataframe_check
pytest integration  Direct validation in test functions
Lazy validation     Get all errors with lazy=True
Type coercion       Enable with Config.coerce = True
