Great Expectations (GX) is an open-source Python library for data testing, documentation, and profiling. It helps data teams define "expectations" about their data and validate those expectations automatically in pipelines.
Core Concepts
1. Expectations: Assertions about data (e.g., "column A should not be null")
2. Expectation Suites: Collections of expectations for a dataset
3. Data Sources: Connections to your data (Pandas, Spark, SQL)
4. Checkpoints: Reusable validation configurations that run Expectation Suites against data
5. Data Docs: Auto-generated documentation of expectations and results
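To make the relationship between these concepts concrete, here is a minimal plain-Python sketch of how an expectation, a suite, and a checkpoint-style run fit together. It is a conceptual model only, not the GX API: `Expectation`, `ExpectationSuite`, `run_checkpoint`, and `column_not_null` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Row = dict[str, Any]


@dataclass
class Expectation:
    """One named assertion about a dataset (hypothetical, not a GX class)."""
    name: str
    check: Callable[[list[Row]], bool]


@dataclass
class ExpectationSuite:
    """A named collection of expectations for one dataset."""
    name: str
    expectations: list[Expectation] = field(default_factory=list)


def run_checkpoint(suite: ExpectationSuite, rows: list[Row]) -> dict[str, Any]:
    """Evaluate every expectation in the suite and summarize the outcome."""
    results = {e.name: e.check(rows) for e in suite.expectations}
    return {"suite": suite.name, "success": all(results.values()), "results": results}


def column_not_null(column: str) -> Expectation:
    """Build a 'column should not be null' expectation."""
    return Expectation(
        name=f"{column}_not_null",
        check=lambda rows: all(row.get(column) is not None for row in rows),
    )


suite = ExpectationSuite("users", [column_not_null("user_id")])
good_rows = [{"user_id": 1}, {"user_id": 2}]
bad_rows = [{"user_id": 1}, {"user_id": None}]

print(run_checkpoint(suite, good_rows)["success"])  # True
print(run_checkpoint(suite, bad_rows)["success"])   # False
```

The per-expectation `results` mapping plays the role that Data Docs play in GX: a record of what was checked and whether it passed.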
Example Expectations
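```python
import great_expectations as gx
import pandas as pd

# Sample data for illustration; replace with your own DataFrame
df = pd.DataFrame({"user_id": [1, 2], "status": ["active", "inactive"], "order_total": [120.0, 80.0]})

# Get a Data Context and read the DataFrame through a pandas Data Source
# (pre-1.0 fluent API; GX 1.x changed these calls)
context = gx.get_context()
validator = context.sources.add_pandas("my_data").read_dataframe(df)

# Define Expectations; each call validates the data and returns a result
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_in_set("status", ["active", "inactive"])
validator.expect_column_mean_to_be_between("order_total", 50, 200)
```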
Why Teams Use Great Expectations
- Catch Issues Early: Validate data before it reaches downstream consumers
- Documentation: Auto-generate data quality docs
- Collaboration: Share expectations across teams
- Integration: Works with Airflow, dbt, Spark, and more
- Open Source: Free to use, with commercial support available
GX Cloud
GX Cloud, the hosted SaaS version of Great Expectations, adds:
- Hosted expectation management
- Collaboration features
- Alerting and notifications
- Metrics and dashboards
Integration with Data Tools
- Airflow: GX operators for pipeline validation
- dbt: Run GX validations after dbt builds its models
- Spark: Validate large-scale data
- Prefect/Dagster: Native integrations
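Whatever the orchestrator, these integrations tend to follow the same pattern: run a build step, validate its output, and fail the task if validation does not pass so downstream tasks never see bad data. The sketch below shows that pattern in plain Python. It does not use any orchestrator's or GX's real API; `run_with_validation`, `ValidationFailed`, `build_models`, and `validate_models` are hypothetical names for illustration.

```python
from typing import Callable


class ValidationFailed(RuntimeError):
    """Raised so the surrounding orchestrator can mark the task as failed."""


def run_with_validation(
    build_step: Callable[[], None],
    validate_step: Callable[[], bool],
) -> None:
    """Run a build step, then a validation step; raise if validation fails."""
    build_step()
    if not validate_step():
        raise ValidationFailed("data validation failed; stopping downstream tasks")


# Hypothetical stand-ins: in practice build_models might shell out to
# `dbt run`, and validate_models might run a GX Checkpoint and return
# whether it succeeded.
events: list[str] = []


def build_models() -> None:
    events.append("build")


def validate_models() -> bool:
    events.append("validate")
    return True


run_with_validation(build_models, validate_models)
print(events)  # ['build', 'validate']
```

Raising an exception, rather than returning a status flag, is a deliberate choice here: most orchestrators treat an uncaught exception as a task failure and skip dependent tasks.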