Data Validation & Testing
Great Expectations for Data Quality
What is Great Expectations?
Great Expectations (GX) is an open-source Python library for validating, documenting, and profiling data. Originally developed by Superconductive, it is widely used for data quality checks in production ML pipelines.
Key concepts:
- Expectations: Assertions about your data (e.g., "column age should be between 0 and 120")
- Expectation Suites: Collections of expectations for a dataset
- Checkpoints: Runnable validation configurations
- Data Docs: Auto-generated documentation of your data quality
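To make these terms concrete, here is a minimal plain-Python sketch of the idea behind expectations and suites. It does not use the GX API: the names `Expectation`, `values_between`, and `run_suite` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical stand-in for illustration; this is not a GX class.
@dataclass
class Expectation:
    name: str
    column: str
    check: Callable[[float], bool]  # True when a single value is acceptable


def values_between(column: str, low: float, high: float) -> Expectation:
    """Build an expectation that every value in `column` lies in [low, high]."""
    return Expectation(
        name=f"{column} between {low} and {high}",
        column=column,
        check=lambda value: low <= value <= high,
    )


def run_suite(rows: list[dict], suite: list[Expectation]) -> dict[str, bool]:
    """Evaluate every expectation in the suite and report pass/fail per expectation."""
    return {
        exp.name: all(exp.check(row[exp.column]) for row in rows)
        for exp in suite
    }


rows = [{"age": 34}, {"age": 51}, {"age": 130}]
suite = [values_between("age", 0, 120)]  # a "suite" is just a collection
report = run_suite(rows, suite)
print(report)  # {'age between 0 and 120': False}
```

The real library adds what this sketch leaves out: persistence of suites, per-row failure details, and rendered Data Docs.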
Installation and Setup
# Install Great Expectations
pip install great-expectations
# Initialize a file-backed GX project from Python
# (GX 1.x has no CLI; `great_expectations init` exists only in pre-1.0 releases)
python -c "import great_expectations as gx; gx.get_context(mode='file')"
This creates a gx/ directory (named great_expectations/ in older releases) with:
gx/
├── great_expectations.yml   # Main configuration
├── expectations/            # Expectation suites
├── checkpoints/             # Checkpoint configurations
├── plugins/                 # Custom expectations
└── uncommitted/             # Local files (gitignored)
Creating Expectations
Define expectations for your data:
# scripts/create_expectations.py
import great_expectations as gx
# Create (or load) the Data Context
context = gx.get_context()
# Register a pandas data source and a DataFrame asset
datasource = context.data_sources.add_pandas("training_data")
data_asset = datasource.add_dataframe_asset("features")
# No batch request is built here: the DataFrame itself is supplied
# when the validation runs.
# Create expectation suite
suite = context.suites.add(
    gx.ExpectationSuite(name="training_data_suite")
)
# Add expectations
suite.add_expectation(
    gx.expectations.ExpectColumnToExist(column="user_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="age",
        min_value=0,
        max_value=120
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(
        column="email"
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(
        column="user_id"
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToMatchRegex(
        column="email",
        regex=r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
    )
)
# Save the suite
suite.save()
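Before committing a regex expectation, it can help to try the pattern on a few sample values. The sketch below uses only Python's standard `re` module with the same email pattern as the suite; the helper name `matches_email_pattern` and the sample addresses are made up for illustration.

```python
import re

# Same pattern as the ExpectColumnValuesToMatchRegex expectation on "email"
EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")


def matches_email_pattern(value: str) -> bool:
    """Return True when the value satisfies the email pattern."""
    return EMAIL_PATTERN.match(value) is not None


samples = [
    "alice@example.com",                 # plain address
    "first.last+tag@sub.example.co.uk",  # dots, plus tag, nested domain
    "bob@localhost",                     # no dot after the @
    "no-at-sign.com",                    # missing @ entirely
]
for sample in samples:
    print(f"{sample!r}: {matches_email_pattern(sample)}")
```

The first two samples match and the last two do not. The pattern is deliberately loose (it accepts consecutive dots in the domain, for example), so treat it as a sanity check rather than full address validation.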
Common Expectations for ML
# Feature validation expectations
import great_expectations as gx
expectations = [
    # Numerical features (bounds assume a standardized, clipped feature)
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="feature_1", min_value=-1.0, max_value=1.0
    ),
    gx.expectations.ExpectColumnMeanToBeBetween(
        column="feature_1", min_value=-0.1, max_value=0.1
    ),
    gx.expectations.ExpectColumnStdevToBeBetween(
        column="feature_1", min_value=0.8, max_value=1.2
    ),
    # Categorical features
    gx.expectations.ExpectColumnValuesToBeInSet(
        column="category", value_set=["A", "B", "C", "D"]
    ),
    # Ratio of distinct values to rows: with 4 categories these bounds pass
    # only for roughly 400 to 40,000 rows, so tune them to your dataset size
    gx.expectations.ExpectColumnProportionOfUniqueValuesToBeBetween(
        column="category", min_value=0.0001, max_value=0.01
    ),
    # Target variable
    gx.expectations.ExpectColumnValuesToBeInSet(
        column="label", value_set=[0, 1]
    ),
    # Data completeness: at least 99% of values must be non-null
    gx.expectations.ExpectColumnValuesToNotBeNull(
        column="required_feature", mostly=0.99
    ),
    # Table-level
    gx.expectations.ExpectTableRowCountToBeBetween(
        min_value=10000, max_value=1000000
    ),
]
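The proportion-of-unique-values bounds interact with the table size, which is easy to overlook. The arithmetic below assumes the metric is the number of distinct values divided by the number of rows, and uses the four categories and the bounds from the list above; the helper names are hypothetical.

```python
def proportion_unique(n_distinct: int, n_rows: int) -> float:
    """Distinct values divided by row count (assumed definition of the metric)."""
    return n_distinct / n_rows


def within(value: float, low: float, high: float) -> bool:
    """Inclusive range check, mirroring min_value/max_value."""
    return low <= value <= high


N_CATEGORIES = 4                 # "A", "B", "C", "D"
MIN_PROP, MAX_PROP = 0.0001, 0.01

for n_rows in (10_000, 100_000, 1_000_000):
    proportion = proportion_unique(N_CATEGORIES, n_rows)
    verdict = "pass" if within(proportion, MIN_PROP, MAX_PROP) else "fail"
    print(f"{n_rows:>9} rows -> proportion {proportion:.6f} -> {verdict}")
```

Only the 10,000-row case passes. At 100,000 and 1,000,000 rows the proportion falls below `min_value`, even though both sizes satisfy the row-count expectation, so the two expectations need to be tuned together.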
Creating Checkpoints
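A checkpoint bundles one or more validation definitions (a batch of data paired with an expectation suite) with the actions to run afterwards. In GX 1.x, checkpoints are created from Python rather than written by hand as YAML:
# scripts/create_checkpoint.py
import great_expectations as gx
context = gx.get_context()
# Look up the asset and suite by the names used when they were created
data_asset = context.data_sources.get("training_data").get_asset("features")
suite = context.suites.get("training_data_suite")
# The whole DataFrame is supplied when the checkpoint runs
batch_definition = data_asset.add_batch_definition_whole_dataframe("full_dataframe")
# A validation definition pairs a batch of data with a suite
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(
        name="training_validation",
        data=batch_definition,
        suite=suite,
    )
)
# Validation results are stored automatically; the action rebuilds Data Docs
checkpoint = context.checkpoints.add(
    gx.Checkpoint(
        name="training_checkpoint",
        validation_definitions=[validation_definition],
        actions=[gx.checkpoint.UpdateDataDocsAction(name="update_data_docs")],
    )
)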
Running Validations in CI/CD
# scripts/validate_data.py
import sys
import great_expectations as gx
import pandas as pd


def validate_training_data(data_path: str) -> bool:
    """Run Great Expectations validation on training data."""
    # Load data
    df = pd.read_parquet(data_path)
    # Get GX context
    context = gx.get_context()
    # Run the checkpoint against the DataFrame
    # (GX 1.x API; pre-1.0 releases used context.run_checkpoint instead)
    checkpoint = context.checkpoints.get("training_checkpoint")
    result = checkpoint.run(batch_parameters={"dataframe": df})
    # Check results
    if not result.success:
        print("Data validation FAILED!")
        for validation_result in result.run_results.values():
            for expectation_result in validation_result.results:
                if not expectation_result.success:
                    print(f"  FAILED: {expectation_result.expectation_config}")
        return False
    print("Data validation PASSED!")
    return True


if __name__ == "__main__":
    data_path = sys.argv[1] if len(sys.argv) > 1 else "data/training.parquet"
    success = validate_training_data(data_path)
    # A non-zero exit status makes the CI step fail
    sys.exit(0 if success else 1)
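The last line of the script is what connects validation to the pipeline: CI runners such as GitHub Actions mark a step as failed when its command exits with a non-zero status. The sketch below shows that mapping in isolation. It is not part of the validation script, and `run_validation_step` is a hypothetical helper that catches `SystemExit` so the example keeps running.

```python
import sys
from typing import Callable


def run_validation_step(validate: Callable[[], bool]) -> int:
    """Call `validate()` and return the exit status the script would report."""
    try:
        sys.exit(0 if validate() else 1)
    except SystemExit as exc:
        # sys.exit raises SystemExit; its `code` attribute is the exit status
        return exc.code


passing_status = run_validation_step(lambda: True)
failing_status = run_validation_step(lambda: False)
print(passing_status)  # 0 -> the CI step succeeds
print(failing_status)  # 1 -> the CI step fails and later steps are skipped by default
```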
GitHub Actions Integration
# .github/workflows/validate-data.yml
name: Data Validation
on:
  push:
    paths:
      - 'data/**'
  pull_request:
    paths:
      - 'data/**'
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install great-expectations pandas pyarrow
      - name: Run data validation
        run: python scripts/validate_data.py data/training.parquet
      - name: Upload Data Docs
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: data-docs
          # GX 1.x projects live in gx/; older projects use great_expectations/
          path: gx/uncommitted/data_docs/
Great Expectations GitHub Action
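GX also publishes a GitHub Action, great-expectations/great_expectations_action, that runs checkpoints without a custom script. It was written for the legacy (pre-1.0) checkpoint format, so check its README for supported GX versions and input names before adopting it:
# .github/workflows/gx-validation.yml
name: GX Data Validation
on: [push, pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Great Expectations
        # Confirm the version tag and input names against the action's README
        uses: great-expectations/great_expectations_action@v1
        with:
          checkpoint_name: training_checkpoint
      - name: Upload validation results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: gx-results
          path: great_expectations/uncommitted/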
Key Takeaways
| Feature | Description |
|---|---|
| Expectations | Declarative data assertions |
| Checkpoints | Runnable validation configurations |
| Data Docs | Auto-generated documentation |
| CI Integration | GitHub Action or Python script |
| `mostly` parameter | Tolerates a fraction of failing rows (e.g., `mostly=0.99` passes when at least 99% of values are non-null) |
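The `mostly` threshold is easiest to understand as a comparison between the fraction of passing rows and the configured value. The following plain-Python sketch illustrates that rule for a not-null check. `passes_with_mostly` is a hypothetical helper, not a GX function, and treating an empty column as a pass is an assumption of this sketch.

```python
from typing import Callable, Sequence


def passes_with_mostly(
    values: Sequence[object],
    is_valid: Callable[[object], bool],
    mostly: float = 1.0,
) -> bool:
    """Return True when at least `mostly` of the values satisfy `is_valid`."""
    if not values:
        return True  # assumption of this sketch: nothing to check counts as a pass
    valid_fraction = sum(1 for value in values if is_valid(value)) / len(values)
    return valid_fraction >= mostly


def not_null(value: object) -> bool:
    return value is not None


mostly_complete = [1.0] * 995 + [None] * 5   # 99.5% non-null
too_sparse = [1.0] * 980 + [None] * 20       # 98.0% non-null

print(passes_with_mostly(mostly_complete, not_null, mostly=0.99))  # True
print(passes_with_mostly(too_sparse, not_null, mostly=0.99))       # False
print(passes_with_mostly(mostly_complete, not_null))               # False (strict default)
```

A threshold slightly below 1.0 keeps a pipeline from failing on a handful of bad rows while still catching a systematic drop in completeness.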