Data Validation & Testing
Pandera for DataFrame Validation
5 min read
Pandera vs Great Expectations
While Great Expectations is a comprehensive framework for production data pipelines, Pandera is a lighter-weight library aimed at development- and test-time validation. How they compare:
| Aspect | Pandera | Great Expectations |
|---|---|---|
| Use case | Development & unit testing | Production pipelines |
| Setup | Minimal, just pip install | Project initialization |
| Integration | Native pytest support | Checkpoint-based |
| Schema definition | Python classes/decorators | YAML/JSON expectations |
| Learning curve | Low | Medium |
Choose Pandera when:
- You want schema validation in unit tests
- You prefer Python-native definitions
- You need lightweight validation during development
- You're already using pytest
Installation
pip install pandera
# No separate plugin is needed for pytest: schemas are plain Python objects you call directly from tests
Defining Schemas
Pandera uses Python classes to define schemas:
# schemas.py
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series


class TrainingDataSchema(pa.DataFrameModel):
    """Schema for ML training data."""

    user_id: Series[int] = pa.Field(unique=True, ge=0)
    age: Series[int] = pa.Field(ge=0, le=120)
    income: Series[float] = pa.Field(ge=0)
    category: Series[str] = pa.Field(isin=["A", "B", "C", "D"])
    label: Series[int] = pa.Field(isin=[0, 1])

    class Config:
        coerce = True  # Automatically coerce types
        strict = True  # No extra columns allowed


# Usage
@pa.check_types
def load_training_data(path: str) -> DataFrame[TrainingDataSchema]:
    """Load and validate training data."""
    df = pd.read_csv(path)
    return df  # Automatically validated on return
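Because the Config sets coerce = True, validate() casts compatible values (for example, integer-like strings) before applying the field constraints, and strict = True rejects any unexpected column. A minimal sketch of both behaviours, with illustrative data, assuming the schemas.py module above:

import pandas as pd
import pandera as pa

from schemas import TrainingDataSchema

# Ages arrive as strings; coerce=True casts them to int64 before the range check.
raw = pd.DataFrame({
    "user_id": [1, 2],
    "age": ["25", "30"],
    "income": [50000.0, 60000.0],
    "category": ["A", "B"],
    "label": [0, 1],
})
validated = TrainingDataSchema.validate(raw)
print(validated["age"].dtype)  # int64 after coercion

# strict=True: an unexpected column is rejected outright.
try:
    TrainingDataSchema.validate(raw.assign(unexpected="x"))
except pa.errors.SchemaError as err:
    print(f"Rejected extra column: {err}")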
Schema with Custom Checks
Add custom validation logic:
import pandas as pd
import pandera as pa
from pandera.typing import Series


class FeatureSchema(pa.DataFrameModel):
    """Schema for ML features with custom checks."""

    feature_1: Series[float] = pa.Field(ge=-1, le=1)
    feature_2: Series[float] = pa.Field(ge=-1, le=1)
    feature_3: Series[float] = pa.Field(nullable=True)
    timestamp: Series[pd.Timestamp]

    @pa.check("feature_1")
    def check_normalized(cls, series: Series[float]) -> bool:
        """Verify feature is approximately normalized."""
        return series.std() < 2.0

    @pa.dataframe_check
    def check_no_duplicate_timestamps(cls, df: pd.DataFrame) -> bool:
        """Ensure no duplicate timestamps."""
        return df["timestamp"].is_unique

    @pa.dataframe_check
    def check_feature_correlation(cls, df: pd.DataFrame) -> bool:
        """Ensure features aren't perfectly correlated."""
        corr = df[["feature_1", "feature_2"]].corr().iloc[0, 1]
        return abs(corr) < 0.99
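Column checks and dataframe-level checks run in the same validate() call, so a duplicate timestamp is reported just like an out-of-range value. A small sketch of a failing case, with illustrative data, assuming the FeatureSchema above:

import pandas as pd
import pandera as pa

from schemas import FeatureSchema

# Two rows share a timestamp, so check_no_duplicate_timestamps fails.
df = pd.DataFrame({
    "feature_1": [0.1, -0.2, 0.3],
    "feature_2": [0.3, 0.4, 0.1],
    "feature_3": [None, 0.5, 0.2],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02"]),
})
try:
    FeatureSchema.validate(df)
except pa.errors.SchemaError as err:
    print(f"Dataframe-level check failed: {err}")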
Pytest Integration
Pandera integrates seamlessly with pytest:
# tests/test_data_validation.py
import pytest
import pandera as pa
from pandera.typing import DataFrame
import pandas as pd

from schemas import TrainingDataSchema, FeatureSchema


class TestTrainingData:
    """Test training data validation."""

    def test_valid_data(self):
        """Test that valid data passes validation."""
        df = pd.DataFrame({
            "user_id": [1, 2, 3],
            "age": [25, 30, 35],
            "income": [50000.0, 60000.0, 70000.0],
            "category": ["A", "B", "C"],
            "label": [0, 1, 0]
        })
        validated = TrainingDataSchema.validate(df)
        assert len(validated) == 3

    def test_invalid_age(self):
        """Test that invalid age fails validation."""
        df = pd.DataFrame({
            "user_id": [1],
            "age": [150],  # Invalid: > 120
            "income": [50000.0],
            "category": ["A"],
            "label": [0]
        })
        with pytest.raises(pa.errors.SchemaError):
            TrainingDataSchema.validate(df)

    def test_invalid_category(self):
        """Test that invalid category fails validation."""
        df = pd.DataFrame({
            "user_id": [1],
            "age": [25],
            "income": [50000.0],
            "category": ["X"],  # Invalid: not in ["A", "B", "C", "D"]
            "label": [0]
        })
        with pytest.raises(pa.errors.SchemaError):
            TrainingDataSchema.validate(df)


# Parameterized testing with schemas
@pytest.mark.parametrize("schema,data,should_pass", [
    (TrainingDataSchema, {"user_id": [1], "age": [25], "income": [50000.0], "category": ["A"], "label": [0]}, True),
    (TrainingDataSchema, {"user_id": [1], "age": [-5], "income": [50000.0], "category": ["A"], "label": [0]}, False),
])
def test_schema_validation(schema, data, should_pass):
    df = pd.DataFrame(data)
    if should_pass:
        schema.validate(df)
    else:
        with pytest.raises(pa.errors.SchemaError):
            schema.validate(df)
Using Pytest Fixtures
No plugin is required here: ordinary pytest fixtures can hand sample DataFrames to your tests, and the schema is invoked directly inside them.
# tests/conftest.py
import pytest
import pandas as pd


@pytest.fixture
def sample_training_data():
    """Fixture providing sample training data."""
    return pd.DataFrame({
        "user_id": range(100),
        "age": [25 + i % 50 for i in range(100)],
        "income": [50000.0 + i * 1000 for i in range(100)],
        "category": ["A", "B", "C", "D"] * 25,
        "label": [0, 1] * 50
    })
# tests/test_with_fixtures.py
from schemas import TrainingDataSchema


def test_training_data_schema(sample_training_data):
    """Test training data against schema."""
    validated = TrainingDataSchema.validate(sample_training_data)
    assert len(validated) == 100
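The fixture itself can also be guarded by the schema via @pa.check_types, so every test that consumes it receives pre-validated data. A sketch under that assumption (the validated_training_data fixture name and its data are illustrative; check_types must wrap the function before pytest.fixture does):

# tests/conftest.py (sketch)
import pandas as pd
import pandera as pa
import pytest
from pandera.typing import DataFrame

from schemas import TrainingDataSchema


@pytest.fixture
@pa.check_types
def validated_training_data() -> DataFrame[TrainingDataSchema]:
    """Sample data that is checked against the schema before any test sees it."""
    return pd.DataFrame({
        "user_id": range(10),
        "age": [25 + i for i in range(10)],
        "income": [50000.0 + i * 1000 for i in range(10)],
        "category": ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"],
        "label": [0, 1] * 5,
    })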
CI/CD Integration
# .github/workflows/test-data.yml
name: Data Validation Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install pandera pandas pytest
      - name: Run schema tests
        run: pytest tests/test_data_validation.py -v
      - name: Validate production data
        run: |
          python -c "
          import pandas as pd
          from schemas import TrainingDataSchema
          df = pd.read_csv('data/training.csv')
          TrainingDataSchema.validate(df)
          print('Data validation passed!')
          "
Lazy Validation for Better Error Messages
import pandera as pa

from schemas import TrainingDataSchema

# Collect all validation errors at once instead of stopping at the first failure.
# df is the DataFrame being checked (e.g. loaded from CSV as above).
try:
    TrainingDataSchema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print("Validation errors found:")
    for error in err.failure_cases.itertuples():
        print(f"  - {error}")
Key Takeaways
| Feature | Usage |
|---|---|
| Schema classes | Define with pa.DataFrameModel |
| Field validation | Use pa.Field() with constraints |
| Custom checks | Use @pa.check and @pa.dataframe_check |
| pytest integration | Direct validation in test functions |
| Lazy validation | Get all errors with lazy=True |
| Type coercion | Enable with Config.coerce = True |