Data Validation & Testing

Great Expectations for Data Quality



What is Great Expectations?

Great Expectations (GX) is an open-source Python library for data validation, documentation, and profiling. It is one of the most widely used tools for data quality checks in production ML pipelines. It is developed by Superconductive, the company behind the project, and used by companies such as GitHub and by many ML teams.

Key concepts:

  • Expectations: Assertions about your data (e.g., "column age should be between 0 and 120")
  • Expectation Suites: Collections of expectations for a dataset
  • Checkpoints: Runnable validation configurations
  • Data Docs: Auto-generated documentation of your data quality
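To make these terms concrete, here is a minimal plain-Python sketch of the same ideas. It does not use GX or mirror its API; the names `Expectation` and `run_suite` and the sample rows are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable

Row = dict[str, Any]


@dataclass
class Expectation:
    """A named assertion that every row must satisfy (hypothetical stand-in)."""
    name: str
    check: Callable[[Row], bool]


def run_suite(suite: list[Expectation], rows: list[Row]) -> dict[str, bool]:
    """Play the role of a checkpoint: run each expectation over all rows."""
    return {exp.name: all(exp.check(row) for row in rows) for exp in suite}


# An "expectation suite": a collection of expectations for one dataset
suite = [
    Expectation("age between 0 and 120", lambda r: 0 <= r["age"] <= 120),
    Expectation("email not null", lambda r: r["email"] is not None),
]

rows = [
    {"age": 34, "email": "a@example.com"},
    {"age": 150, "email": "b@example.com"},  # violates the age expectation
]

results = run_suite(suite, rows)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```

GX adds what this sketch leaves out: persistence of suites, per-row failure details, and rendered Data Docs.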

Installation and Setup

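# Install Great Expectations (the examples below use the GX 1.x API)
pip install "great-expectations>=1.0"

# Initialize a new GX project (a File Data Context) in the current directory
python -c "import great_expectations as gx; gx.get_context(mode='file')"

GX 1.x no longer ships the great_expectations init CLI command. The call above creates a gx/ directory with roughly this layout:

gx/
├── great_expectations.yml    # Main configuration
├── expectations/             # Expectation suites
├── checkpoints/              # Checkpoint configurations
├── validation_definitions/   # Bindings between suites and data
├── plugins/                  # Custom expectations
└── uncommitted/              # Local files (gitignored)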

Creating Expectations

Define expectations for your data:

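# scripts/create_expectations.py
import great_expectations as gx
import pandas as pd

# Load the data to validate (example path; point this at your own data)
your_dataframe = pd.read_parquet("data/training.parquet")

# Create (or load) the file-backed context so the suite persists on disk
context = gx.get_context(mode="file")

# Connect to your data
datasource = context.data_sources.add_pandas("training_data")
data_asset = datasource.add_dataframe_asset("features")

# Describe how to fetch data: the whole DataFrame as a single batch
batch_definition = data_asset.add_batch_definition_whole_dataframe("authoring_batch")
# The batch can be used to try expectations interactively via batch.validate(...)
batch = batch_definition.get_batch(
    batch_parameters={"dataframe": your_dataframe}
)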

# Create expectation suite
suite = context.suites.add(
    gx.ExpectationSuite(name="training_data_suite")
)

# Add expectations
suite.add_expectation(
    gx.expectations.ExpectColumnToExist(column="user_id")
)

suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="age",
        min_value=0,
        max_value=120
    )
)

suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(
        column="email"
    )
)

suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(
        column="user_id"
    )
)

suite.add_expectation(
    gx.expectations.ExpectColumnValuesToMatchRegex(
        column="email",
        regex=r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
    )
)

# Save the suite
suite.save()
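Before committing a pattern to a suite, it can help to try it on a few sample values with the standard library. The sketch below reuses the email regex from the script above. The sample addresses are made up, and GX's own matching may differ in details such as how nulls are treated.

```python
import re

# Same pattern as in the ExpectColumnValuesToMatchRegex expectation
EMAIL_REGEX = re.compile(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")

# Made-up sample values covering typical and malformed addresses
samples = [
    "user@example.com",
    "first.last+tag@mail.example.org",
    "no-at-sign.example.com",
    "user@localhost",
]

matches = {value: bool(EMAIL_REGEX.search(value)) for value in samples}
for value, ok in matches.items():
    print(f"{value!r}: {'match' if ok else 'no match'}")
```

The pattern requires a dot after the first domain label, so an address like user@localhost is rejected. Whether that is desirable depends on the data.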

Common Expectations for ML

# Feature validation expectations (bounds are illustrative; tune them to your data)
import great_expectations as gx

expectations = [
    # Numerical features
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="feature_1", min_value=-1.0, max_value=1.0
    ),
    gx.expectations.ExpectColumnMeanToBeBetween(
        column="feature_1", min_value=-0.1, max_value=0.1
    ),
    gx.expectations.ExpectColumnStdevToBeBetween(
        column="feature_1", min_value=0.8, max_value=1.2
    ),

    # Categorical features
    gx.expectations.ExpectColumnValuesToBeInSet(
        column="category", value_set=["A", "B", "C", "D"]
    ),
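    # Proportion = distinct values / rows, so the bounds depend on table size:
    # at most 4 categories over 10,000 to 1,000,000 rows gives 0.000001 to 0.0004
    gx.expectations.ExpectColumnProportionOfUniqueValuesToBeBetween(
        column="category", min_value=0.000001, max_value=0.0004
    ),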

    # Target variable
    gx.expectations.ExpectColumnValuesToBeInSet(
        column="label", value_set=[0, 1]
    ),

    # Data completeness
    gx.expectations.ExpectColumnValuesToNotBeNull(
        column="required_feature", mostly=0.99
    ),

    # Table-level
    gx.expectations.ExpectTableRowCountToBeBetween(
        min_value=10000, max_value=1000000
    ),
]
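Two parameters in the list above are easy to misread: mostly and the unique-value proportion. The sketch below computes both by hand on toy data using only the standard library. The helper names are hypothetical and this is not how GX implements them internally; it only illustrates the arithmetic.

```python
from typing import Any, Optional, Sequence


def fraction_not_null(values: Sequence[Optional[Any]]) -> float:
    """Share of values that are not None."""
    return sum(v is not None for v in values) / len(values)


def passes_mostly(values: Sequence[Optional[Any]], mostly: float) -> bool:
    """A not-null check passes when at least `mostly` of the rows are non-null."""
    return fraction_not_null(values) >= mostly


def unique_proportion(values: Sequence[Any]) -> float:
    """Distinct values divided by the number of rows (toy data has no nulls)."""
    return len(set(values)) / len(values)


# 199 populated rows and 1 null: 99.5% non-null
column = ["x"] * 199 + [None]
print(passes_mostly(column, mostly=0.99))  # 0.995 >= 0.99
print(passes_mostly(column, mostly=1.0))   # a single null fails a strict check

# Four categories over 10,000 rows: 4 / 10,000
categories = ["A", "B", "C", "D"] * 2500
print(unique_proportion(categories))
```

Because the proportion shrinks as the table grows while the category count stays fixed, bounds on it should be chosen with the expected row count in mind.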

Creating Checkpoints

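Checkpoints define how to run validations. In GX 1.x they are built in Python rather than hand-written YAML: a Validation Definition binds a suite to a batch definition, and the Checkpoint runs it and triggers actions.

# scripts/create_checkpoint.py
import great_expectations as gx

context = gx.get_context(mode="file")

# Look up the suite and data asset created earlier
suite = context.suites.get("training_data_suite")
data_asset = context.data_sources.get("training_data").get_asset("features")

# One batch per run: the DataFrame passed in at validation time
batch_definition = data_asset.add_batch_definition_whole_dataframe("ci_validation")

# Bind the suite to the data
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(
        name="training_validation",
        data=batch_definition,
        suite=suite,
    )
)

# Checkpoint that runs the validation and rebuilds Data Docs afterwards
checkpoint = context.checkpoints.add(
    gx.Checkpoint(
        name="training_checkpoint",
        validation_definitions=[validation_definition],
        actions=[gx.checkpoint.UpdateDataDocsAction(name="update_data_docs")],
    )
)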

Running Validations in CI/CD

# scripts/validate_data.py
import great_expectations as gx
import pandas as pd
import sys

def validate_training_data(data_path: str) -> bool:
    """Run Great Expectations validation on training data."""
    # Load data
    df = pd.read_parquet(data_path)

    # Get the file-backed GX context
    context = gx.get_context(mode="file")

    # Run the checkpoint, passing the DataFrame as the batch to validate
    checkpoint = context.checkpoints.get("training_checkpoint")
    result = checkpoint.run(batch_parameters={"dataframe": df})

    # Check results
    if not result.success:
        print("Data validation FAILED!")
        for validation_result in result.run_results.values():
            for expectation_result in validation_result.results:
                if not expectation_result.success:
                    print(f"  FAILED: {expectation_result.expectation_config}")
        return False

    print("Data validation PASSED!")
    return True

if __name__ == "__main__":
    data_path = sys.argv[1] if len(sys.argv) > 1 else "data/training.parquet"
    success = validate_training_data(data_path)
    sys.exit(0 if success else 1)
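The script's exit status is what makes it usable as a CI gate: CI runners, GitHub Actions included, mark a step as failed when its command exits with a non-zero status. Below is a small stdlib-only sketch of that contract. The `run_gate` helper is hypothetical and simply spawns a child process that exits the way validate_data.py would.

```python
import subprocess
import sys


def run_gate(validation_passed: bool) -> int:
    """Spawn a child Python process that exits like validate_data.py would."""
    child_code = f"import sys; sys.exit(0 if {validation_passed} else 1)"
    completed = subprocess.run([sys.executable, "-c", child_code])
    return completed.returncode


print(run_gate(True))   # status a CI runner treats as success
print(run_gate(False))  # non-zero status: the CI step fails
```

Returning a boolean from the validation function and converting it to an exit code only in the `__main__` block keeps the function easy to reuse from tests or notebooks.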

GitHub Actions Integration

# .github/workflows/validate-data.yml
name: Data Validation
on:
  push:
    paths:
      - 'data/**'
  pull_request:
    paths:
      - 'data/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install "great-expectations>=1.0" pandas pyarrow

      - name: Run data validation
        run: python scripts/validate_data.py data/training.parquet

      - name: Upload Data Docs
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: data-docs
          # GX 1.x keeps project files under gx/ (older releases used great_expectations/)
          path: gx/uncommitted/data_docs/

Great Expectations GitHub Action

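GX also publishes a GitHub Action, great-expectations/great_expectations_action. It predates GX 1.x and runs checkpoints through the legacy tooling, so confirm in its README which GX versions, release tags, and input names it supports before adopting it:

# .github/workflows/gx-validation.yml
name: GX Data Validation
on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Great Expectations
        # Ref and input name as recalled from the action's README; verify before use
        uses: great-expectations/great_expectations_action@main
        with:
          CHECKPOINTS: "training_checkpoint"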

      - name: Upload validation results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: gx-results
          path: great_expectations/uncommitted/

Key Takeaways

Feature            Description
Expectations       Declarative data assertions
Checkpoints        Runnable validation configurations
Data Docs          Auto-generated documentation
CI integration     GitHub Action or Python script
mostly parameter   Allow partial failures (e.g., 99% not null)
