Data quality testing validates that your data meets expectations, whether you're checking for freshness, completeness, duplicates, or anomalies. With Spry, quality checks are written as executable SQL cells embedded in Markdown, so your tests are also your documentation. Instead of maintaining separate test suites and documentation, Spry unifies them: each quality check explains what it validates and why it matters, and can be run at any time to prove it is still current.
Basic Quality Check Workflow
# Customer Data Freshness Check
Validate that customer records have been updated within the last 24 hours.
```sql
SELECT
  COUNT(*) as total_customers,
  MAX(updated_at) as last_update,
  -- JULIANDAY differences are in days; multiply by 24 to report hours
  (JULIANDAY('now') - JULIANDAY(MAX(updated_at))) * 24 as hours_since_update
FROM customers;
```
**Expected**: hours_since_update < 24
**Alert**: If check fails, investigate ETL pipeline delays.

Quality testing that's simple, transparent, and version-controlled
- Quality checks are readable, maintainable, and self-documenting. Every test explains what it validates and why.
- All quality rules are tracked in version control. Review changes via pull requests and maintain historical audit trails.
- Every test run is auditable and reproducible. No hidden state, no black boxes—just transparent SQL validation.
From basic checks to sophisticated anomaly detection
- Validate that data is up-to-date by checking timestamps and comparing against expected update windows.
- Detect missing values, null fields, and incomplete records that could compromise data integrity.
- Identify and report duplicate records using custom uniqueness rules and composite key validation.
- Ensure columns, data types, and table structures match expected schemas and data contracts.
- Check that numeric values, dates, and measurements fall within acceptable business ranges (a combined range and distribution sketch follows this list).
- Validate foreign key relationships and ensure referential consistency across related tables.
- Implement custom SQL logic for domain-specific rules and complex business constraints.
- Monitor table sizes, deltas, and growth patterns to detect pipeline issues or data loss.
- Check value distributions, cardinality, and patterns to identify data skew or anomalies.
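For example, a combined range and distribution check on the orders table used throughout this page might look like the sketch below; the 0 and 100000 bounds, the 30-day window, and the task name are illustrative assumptions, not values from any real contract.

```sql --task=amount_range_check
-- Range check: order totals should fall inside plausible business bounds.
-- Distribution check: cardinality and averages over the last 30 days.
-- The 0 / 100000 bounds are placeholders; tune them to your business.
SELECT
  COUNT(*) as total_orders,
  SUM(CASE WHEN total_amount < 0 THEN 1 ELSE 0 END) as below_range,
  SUM(CASE WHEN total_amount > 100000 THEN 1 ELSE 0 END) as above_range,
  COUNT(DISTINCT customer_id) as distinct_customers,
  MIN(total_amount) as min_amount,
  AVG(total_amount) as avg_amount,
  MAX(total_amount) as max_amount
FROM orders
WHERE order_date >= DATE('now', '-30 days');
```
**Expected**: below_range = 0 and above_range = 0; review distinct_customers and avg_amount against recent history for unexpected shifts.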
Validate that your orders table has been updated today. Critical for real-time dashboards and operational monitoring.
# Daily Order Freshness Check
Validate that orders table has been updated today.
```sql
SELECT
  MAX(created_at) as last_order,
  JULIANDAY('now') - JULIANDAY(MAX(created_at)) as days_old
FROM orders;
```
**Expected**: days_old < 1
**Action**: Alert data engineering team if check fails

Ensure critical fields like email addresses are present. Track completeness percentage over time.
# Customer Email Completeness
Check that at least 95% of customers have email addresses.
```sql --task=check_email_completeness
SELECT
  COUNT(*) as total_customers,
  COUNT(email) as with_email,
  COUNT(*) - COUNT(email) as missing_email,
  ROUND(100.0 * COUNT(email) / COUNT(*), 2) as completeness_pct
FROM customers;
```
**Threshold**: completeness_pct >= 95
**Alert**: Investigate data collection process if below threshold

Identify duplicate records that violate uniqueness constraints. Prevent data corruption and inaccurate reporting.
# Detect Duplicate Orders
Find orders with duplicate order_ids (should be unique).
```sql --task=find_duplicates
SELECT
  order_id,
  COUNT(*) as duplicate_count,
  STRING_AGG(customer_id, ', ') as affected_customers
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```
**Expected**: Zero rows returned
**Action**: Investigate ETL process if duplicates found
**Severity**: HIGH - duplicates cause revenue miscalculation
Use Spry's task orchestration to run complex quality checks with dependencies.
Execute checks in the right order,
skip downstream checks if prerequisites fail, and parallelize independent validations.
# Complete Data Quality Check Suite
## 1. Schema Validation (runs first)
```sql --task=validate_schema
SELECT
  table_name,
  column_name,
  data_type,
  is_nullable
FROM information_schema.columns
WHERE table_name = 'orders'
ORDER BY ordinal_position;
```
## 2. Row Count Check (depends on schema)
```sql --task=count_check --dep=validate_schema
SELECT
  COUNT(*) as total_orders,
  COUNT(DISTINCT customer_id) as unique_customers,
  MIN(order_date) as earliest_order,
  MAX(order_date) as latest_order
FROM orders;
```
## 3. Referential Integrity (depends on row count)
```sql --task=ref_integrity --dep=count_check
SELECT
  o.order_id,
  o.customer_id,
  'Missing customer' as issue
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL
LIMIT 10;
```
## 4. Business Rule Validation (parallel with ref integrity)
```sql --task=business_rules --dep=count_check
SELECT
  order_id,
  total_amount,
  'Negative amount' as issue
FROM orders
WHERE total_amount < 0
UNION ALL
SELECT
  order_id,
  total_amount,
  'Suspiciously high amount' as issue
FROM orders
WHERE total_amount > 100000;
```
**Run all checks**: `spry run --all quality-suite.md`
**Run specific check**: `spry run quality-suite.md count_check`
**Parallel execution**: Independent checks run simultaneously for speed
A complete quality monitoring suite for an e-commerce platform, including health checks and anomaly detection.
Run this daily to catch issues before they impact customers.
# E-commerce Data Quality Dashboard
## Orders Health Check
```sql --task=orders_health
SELECT
  'orders' as table_name,
  COUNT(*) as total_rows,
  COUNT(DISTINCT customer_id) as unique_customers,
  MIN(order_date) as earliest_order,
  MAX(order_date) as latest_order,
  SUM(total_amount) as total_revenue,
  AVG(total_amount) as avg_order_value
FROM orders
WHERE order_date >= DATE('now', '-7 days');
```
## Revenue Anomaly Detection
```sql --task=revenue_check --dep=orders_health
WITH daily_revenue AS (
  SELECT
    DATE(order_date) as day,
    SUM(total_amount) as revenue,
    COUNT(*) as order_count
  FROM orders
  WHERE order_date >= DATE('now', '-30 days')
  GROUP BY DATE(order_date)
),
stats AS (
  SELECT
    AVG(revenue) as avg_revenue,
    STDDEV(revenue) as stddev_revenue,
    AVG(revenue) + 2 * STDDEV(revenue) as upper_bound,
    AVG(revenue) - 2 * STDDEV(revenue) as lower_bound
  FROM daily_revenue
)
SELECT
  d.day,
  d.revenue,
  d.order_count,
  ROUND(s.avg_revenue, 2) as avg_revenue,
  ROUND(s.upper_bound, 2) as upper_threshold,
  ROUND(s.lower_bound, 2) as lower_threshold,
  CASE
    WHEN d.revenue > s.upper_bound THEN 'HIGH_ANOMALY'
    WHEN d.revenue < s.lower_bound THEN 'LOW_ANOMALY'
    ELSE 'NORMAL'
  END as status,
  CASE
    WHEN d.revenue > s.upper_bound THEN
      ROUND(((d.revenue - s.avg_revenue) / s.avg_revenue) * 100, 1)
    WHEN d.revenue < s.lower_bound THEN
      ROUND(((s.avg_revenue - d.revenue) / s.avg_revenue) * 100, 1)
    ELSE 0
  END as deviation_pct
FROM daily_revenue d
CROSS JOIN stats s
WHERE d.day >= DATE('now', '-7 days')
ORDER BY d.day DESC;
```
## Inventory Consistency Check
```sql --task=inventory_check --dep=orders_health
SELECT
  p.product_id,
  p.product_name,
  p.stock_quantity,
  COUNT(oi.order_item_id) as recent_orders,
  SUM(oi.quantity) as units_sold
FROM products p
-- Keep the date filter in the join condition so products with no recent
-- orders (for example, negative stock with zero sales) are not dropped.
LEFT JOIN order_items oi
  ON p.product_id = oi.product_id
  AND oi.order_date >= DATE('now', '-7 days')
GROUP BY p.product_id, p.product_name, p.stock_quantity
HAVING p.stock_quantity < 0
  OR (SUM(oi.quantity) > p.stock_quantity * 2);
```
**Schedule**: Run daily at 6 AM via cron (a sample crontab entry is sketched below)
**Output**: `spry run --all ecommerce-quality.md > daily-quality-report.json`
**Alert**: Send notification if any check finds anomalies
**Dashboard**: Feed results to Evidence.dev for visualization
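A crontab entry for that daily 6 AM run might look like the sketch below; the working directory is an illustrative assumption, and the command is the one shown above.

```
# Illustrative crontab entry: run the suite daily at 06:00 and capture the report
0 6 * * * cd /opt/data-quality && spry run --all ecommerce-quality.md > daily-quality-report.json
```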
Treat quality standards like application code. Version control your checks in Git, review changes via pull requests, and maintain a complete audit trail of quality evolution. This "data contracts" approach ensures alignment between data producers and consumers.
# Customer Table Schema Contract
This contract defines the expected schema for the customers table.
Any changes to this schema must be reviewed by the data governance team.
## Expected Schema
```sql --task=validate_customer_schema
WITH expected_schema AS (
  SELECT 'id' as column_name, 'INTEGER' as expected_type, 'NO' as nullable
  UNION ALL SELECT 'email', 'VARCHAR', 'NO'
  UNION ALL SELECT 'first_name', 'VARCHAR', 'NO'
  UNION ALL SELECT 'last_name', 'VARCHAR', 'NO'
  UNION ALL SELECT 'created_at', 'TIMESTAMP', 'NO'
  UNION ALL SELECT 'updated_at', 'TIMESTAMP', 'YES'
),
actual_schema AS (
  SELECT
    column_name,
    data_type as actual_type,
    is_nullable as nullable
  FROM information_schema.columns
  WHERE table_name = 'customers'
)
SELECT
  COALESCE(e.column_name, a.column_name) as column_name,
  e.expected_type,
  a.actual_type,
  e.nullable as expected_nullable,
  a.nullable as actual_nullable,
  CASE
    WHEN e.column_name IS NULL THEN 'UNEXPECTED_COLUMN'
    WHEN a.column_name IS NULL THEN 'MISSING_COLUMN'
    WHEN e.expected_type != a.actual_type THEN 'TYPE_MISMATCH'
    WHEN e.nullable != a.nullable THEN 'NULLABLE_MISMATCH'
    ELSE 'OK'
  END as status
FROM expected_schema e
FULL OUTER JOIN actual_schema a ON e.column_name = a.column_name
WHERE e.column_name IS NULL
  OR a.column_name IS NULL
  OR e.expected_type != a.actual_type
  OR e.nullable != a.nullable;
```
**Expected**: Zero rows returned (all columns match contract)
**Breaking Change**: Any mismatch requires data governance approval
**Version**: 2.1.0 (updated 2024-01-15)

From development to production, ensure data integrity everywhere
- Run quality checks in CI/CD pipelines before deploying data transformations. Fail builds if quality thresholds aren't met.
- Schedule quality checks to run hourly or daily. Detect data drift, pipeline failures, and anomalies in real time.
- Verify that data migrations preserve integrity. Compare source and target systems for completeness and accuracy (a parity-check sketch follows this list).
- Document quality validation for regulatory compliance. Maintain auditable records of all quality checks.
- Track data quality KPIs and service level agreements. Report on quality metrics to stakeholders.
- Monitor data warehouse performance and integrity. Detect table bloat, index issues, and query problems.
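As a minimal sketch of such a migration parity check, the query below compares row counts and a coarse email-length checksum between a legacy table and its migrated copy. The `legacy.customers` name is an assumption about how the source system is attached; adapt both table names and the checksum column to your environment.

```sql --task=migration_parity
-- Illustrative migration parity check: row counts plus a coarse checksum.
-- Assumes the pre-migration data is reachable as legacy.customers and the
-- migrated copy as customers; both names are placeholders.
SELECT
  (SELECT COUNT(*) FROM legacy.customers) as source_rows,
  (SELECT COUNT(*) FROM customers) as target_rows,
  (SELECT COUNT(*) FROM legacy.customers)
    - (SELECT COUNT(*) FROM customers) as row_diff,
  (SELECT SUM(LENGTH(email)) FROM legacy.customers) as source_email_bytes,
  (SELECT SUM(LENGTH(email)) FROM customers) as target_email_bytes;
```
**Expected**: row_diff = 0 and matching byte totals; any difference warrants a column-by-column comparison.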
- Use the full power of SQL for any validation logic. No proprietary DSL to learn, no limitations on complexity. Write any query you need—joins, window functions, CTEs, aggregations—all supported.
- All quality rules versioned in Git. Pull requests for quality standard changes. Historical view of quality evolution. Treat quality checks like application code with proper review and approval workflows.
- No agents to install, no servers to maintain, no cloud platform required. Just Markdown files and the Deno runtime. Run quality checks anywhere: local development, CI/CD pipelines, or scheduled jobs.
- Every quality check produces structured output. Feed results to dashboards, alerts, or audit logs. Integration with Evidence.dev for visualization. Complete transparency into what was checked and when.
- Start with basic SQL assertions. Scale to sophisticated anomaly detection using statistical methods. Custom logic for any domain—finance, healthcare, e-commerce, or scientific data. No limits on complexity.
Run quality checks automatically in GitHub Actions.
Fail builds if quality standards aren't met.
Prevent bad data from reaching production.
```yaml
name: Data Quality Checks

on:
  push:
    branches: [main, develop]
  pull_request:
    paths:
      - 'data/**'
      - 'transforms/**'

jobs:
  quality-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install Deno
        uses: denoland/setup-deno@v1
        with:
          deno-version: v1.x

      - name: Install Spry
        run: deno install -A jsr:@programmablemd/spry

      - name: Run Quality Suite
        # Capture results so the violation check below has a report file to inspect
        run: spry run --all quality/data-quality-suite.md > quality-results.json
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}

      - name: Check for Quality Violations
        run: |
          if grep -q "FAIL" quality-results.json; then
            echo "Quality checks failed!"
            exit 1
          fi

      - name: Upload Quality Report
        uses: actions/upload-artifact@v3
        with:
          name: quality-report
          path: quality-results.json
```