Data quality testing validates that your data meets expectations, whether you're checking for freshness, completeness, duplicates, or anomalies. With Spry, quality checks are written as executable SQL cells embedded in Markdown, so your tests are also your documentation. Instead of maintaining separate test suites and documentation, Spry unifies them: each quality check explains what it validates and why it matters, and can be run at any time to prove it is still current.
Basic Quality Check Workflow
# Customer Data Freshness Check
Validate that customer records have been updated within the last 24 hours.
```sql
SELECT
  COUNT(*) as total_customers,
  MAX(updated_at) as last_update,
  -- JULIANDAY differences are in days; multiply by 24 to report hours
  (JULIANDAY('now') - JULIANDAY(MAX(updated_at))) * 24 as hours_since_update
FROM customers;
```
**Expected**: hours_since_update < 24
**Alert**: If check fails, investigate ETL pipeline delays.

Quality testing that's simple, transparent, and version-controlled
- Quality checks are readable, maintainable, and self-documenting. Every test explains what it validates and why.
- All quality rules are tracked in version control. Review changes via pull requests and maintain historical audit trails.
- Every test run is auditable and reproducible. No hidden state, no black boxes—just transparent SQL validation.
From basic checks to sophisticated anomaly detection
- Validate that data is up-to-date by checking timestamps and comparing against expected update windows.
- Detect missing values, null fields, and incomplete records that could compromise data integrity.
- Identify and report duplicate records using custom uniqueness rules and composite key validation.
- Ensure columns, data types, and table structures match expected schemas and data contracts.
- Check that numeric values, dates, and measurements fall within acceptable business ranges (a combined range and distribution sketch follows this list).
- Validate foreign key relationships and ensure referential consistency across related tables.
- Implement custom SQL logic for domain-specific rules and complex business constraints.
- Monitor table sizes, deltas, and growth patterns to detect pipeline issues or data loss.
- Check value distributions, cardinality, and patterns to identify data skew or anomalies.
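For example, a combined range and distribution check on the orders table used throughout this page might look like the sketch below; the 0 and 100000 bounds, the 30-day window, and the task name are illustrative assumptions, not values from any real contract.

```sql --task=amount_range_check
-- Range check: order totals should fall inside plausible business bounds.
-- Distribution check: cardinality and averages over the last 30 days.
-- The 0 / 100000 bounds are placeholders; tune them to your business.
SELECT
  COUNT(*) as total_orders,
  SUM(CASE WHEN total_amount < 0 THEN 1 ELSE 0 END) as below_range,
  SUM(CASE WHEN total_amount > 100000 THEN 1 ELSE 0 END) as above_range,
  COUNT(DISTINCT customer_id) as distinct_customers,
  MIN(total_amount) as min_amount,
  AVG(total_amount) as avg_amount,
  MAX(total_amount) as max_amount
FROM orders
WHERE order_date >= DATE('now', '-30 days');
```
**Expected**: below_range = 0 and above_range = 0; review distinct_customers and avg_amount against recent history for unexpected shifts.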
Validate that your orders table has been updated today. Critical for real-time dashboards and operational monitoring.
# Daily Order Freshness Check
Validate that orders table has been updated today.
```sql
SELECT
  MAX(created_at) as last_order,
  JULIANDAY('now') - JULIANDAY(MAX(created_at)) as days_old
FROM orders;
```
**Expected**: days_old < 1
**Action**: Alert data engineering team if check fails

Ensure critical fields like email addresses are present. Track completeness percentage over time.
# Customer Email Completeness
Check that at least 95% of customers have email addresses.
```sql --task=check_email_completeness
SELECT
  COUNT(*) as total_customers,
  COUNT(email) as with_email,
  COUNT(*) - COUNT(email) as missing_email,
  ROUND(100.0 * COUNT(email) / COUNT(*), 2) as completeness_pct
FROM customers;
```
**Threshold**: completeness_pct >= 95
**Alert**: Investigate data collection process if below threshold

Identify duplicate records that violate uniqueness constraints. Prevent data corruption and inaccurate reporting.
# Detect Duplicate Orders
Find orders with duplicate order_ids (should be unique).
```sql --task=find_duplicates
SELECT
  order_id,
  COUNT(*) as duplicate_count,
  STRING_AGG(customer_id, ', ') as affected_customers
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```
**Expected**: Zero rows returned
**Action**: Investigate ETL process if duplicates found
**Severity**: HIGH - duplicates cause revenue miscalculation
Use Spry's task orchestration to run complex quality checks with dependencies.
Execute checks in the right order,
skip downstream checks if prerequisites fail, and parallelize independent validations.
# Complete Data Quality Check Suite
## 1. Schema Validation (runs first)
```sql --task=validate_schema
SELECT
  table_name,
  column_name,
  data_type,
  is_nullable
FROM information_schema.columns
WHERE table_name = 'orders'
ORDER BY ordinal_position;
```
## 2. Row Count Check (depends on schema)
```sql --task=count_check --dep=validate_schema
SELECT
  COUNT(*) as total_orders,
  COUNT(DISTINCT customer_id) as unique_customers,
  MIN(order_date) as earliest_order,
  MAX(order_date) as latest_order
FROM orders;
```
## 3. Referential Integrity (depends on row count)
```sql --task=ref_integrity --dep=count_check
SELECT
  o.order_id,
  o.customer_id,
  'Missing customer' as issue
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL
LIMIT 10;
```
## 4. Business Rule Validation (parallel with ref integrity)
```sql --task=business_rules --dep=count_check
SELECT
  order_id,
  total_amount,
  'Negative amount' as issue
FROM orders
WHERE total_amount < 0
UNION ALL
SELECT
  order_id,
  total_amount,
  'Suspiciously high amount' as issue
FROM orders
WHERE total_amount > 100000;
```
**Run all checks**: `spry run --all quality-suite.md`
**Run specific check**: `spry run quality-suite.md count_check`
**Parallel execution**: Independent checks run simultaneously for speed
A complete quality monitoring suite for an e-commerce platform, including health checks and anomaly detection.
Run this daily to catch issues before they impact customers.
# E-commerce Data Quality Dashboard
## Orders Health Check
```sql --task=orders_health
SELECT
  'orders' as table_name,
  COUNT(*) as total_rows,
  COUNT(DISTINCT customer_id) as unique_customers,
  MIN(order_date) as earliest_order,
  MAX(order_date) as latest_order,
  SUM(total_amount) as total_revenue,
  AVG(total_amount) as avg_order_value
FROM orders
WHERE order_date >= DATE('now', '-7 days');
```
## Revenue Anomaly Detection
```sql --task=revenue_check --dep=orders_health
WITH daily_revenue AS (
  SELECT
    DATE(order_date) as day,
    SUM(total_amount) as revenue,
    COUNT(*) as order_count
  FROM orders
  WHERE order_date >= DATE('now', '-30 days')
  GROUP BY DATE(order_date)
),
stats AS (
  SELECT
    AVG(revenue) as avg_revenue,
    STDDEV(revenue) as stddev_revenue,
    AVG(revenue) + 2 * STDDEV(revenue) as upper_bound,
    AVG(revenue) - 2 * STDDEV(revenue) as lower_bound
  FROM daily_revenue
)
SELECT
  d.day,
  d.revenue,
  d.order_count,
  ROUND(s.avg_revenue, 2) as avg_revenue,
  ROUND(s.upper_bound, 2) as upper_threshold,
  ROUND(s.lower_bound, 2) as lower_threshold,
  CASE
    WHEN d.revenue > s.upper_bound THEN 'HIGH_ANOMALY'
    WHEN d.revenue < s.lower_bound THEN 'LOW_ANOMALY'
    ELSE 'NORMAL'
  END as status,
  CASE
    WHEN d.revenue > s.upper_bound THEN
      ROUND(((d.revenue - s.avg_revenue) / s.avg_revenue) * 100, 1)
    WHEN d.revenue < s.lower_bound THEN
      ROUND(((s.avg_revenue - d.revenue) / s.avg_revenue) * 100, 1)
    ELSE 0
  END as deviation_pct
FROM daily_revenue d
CROSS JOIN stats s
WHERE d.day >= DATE('now', '-7 days')
ORDER BY d.day DESC;
```
## Inventory Consistency Check
```sql --task=inventory_check --dep=orders_health
SELECT
  p.product_id,
  p.product_name,
  p.stock_quantity,
  COUNT(oi.order_item_id) as recent_orders,
  SUM(oi.quantity) as units_sold
FROM products p
-- Keep the date filter in the join condition so products with no recent
-- orders (for example, negative stock with zero sales) are not dropped.
LEFT JOIN order_items oi
  ON p.product_id = oi.product_id
  AND oi.order_date >= DATE('now', '-7 days')
GROUP BY p.product_id, p.product_name, p.stock_quantity
HAVING p.stock_quantity < 0
  OR (SUM(oi.quantity) > p.stock_quantity * 2);
```
**Schedule**: Run daily at 6 AM via cron (a sample crontab entry is sketched below)
**Output**: `spry run --all ecommerce-quality.md > daily-quality-report.json`
**Alert**: Send notification if any check finds anomalies
**Dashboard**: Feed results to Evidence.dev for visualization
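A crontab entry for that daily 6 AM run might look like the sketch below; the working directory is an illustrative assumption, and the command is the one shown above.

```
# Illustrative crontab entry: run the suite daily at 06:00 and capture the report
0 6 * * * cd /opt/data-quality && spry run --all ecommerce-quality.md > daily-quality-report.json
```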
Treat quality standards like application code. Version control your checks in Git, review changes via pull requests, and maintain a complete audit trail of quality evolution. This "data contracts" approach ensures alignment between data producers and consumers.
# Customer Table Schema Contract
This contract defines the expected schema for the customers table.
Any changes to this schema must be reviewed by the data governance team.
## Expected Schema
```sql --task=validate_customer_schema
WITH expected_schema AS (
  SELECT 'id' as column_name, 'INTEGER' as expected_type, 'NO' as nullable
  UNION ALL SELECT 'email', 'VARCHAR', 'NO'
  UNION ALL SELECT 'first_name', 'VARCHAR', 'NO'
  UNION ALL SELECT 'last_name', 'VARCHAR', 'NO'
  UNION ALL SELECT 'created_at', 'TIMESTAMP', 'NO'
  UNION ALL SELECT 'updated_at', 'TIMESTAMP', 'YES'
),
actual_schema AS (
  SELECT
    column_name,
    data_type as actual_type,
    is_nullable as nullable
  FROM information_schema.columns
  WHERE table_name = 'customers'
)
SELECT
  COALESCE(e.column_name, a.column_name) as column_name,
  e.expected_type,
  a.actual_type,
  e.nullable as expected_nullable,
  a.nullable as actual_nullable,
  CASE
    WHEN e.column_name IS NULL THEN 'UNEXPECTED_COLUMN'
    WHEN a.column_name IS NULL THEN 'MISSING_COLUMN'
    WHEN e.expected_type != a.actual_type THEN 'TYPE_MISMATCH'
    WHEN e.nullable != a.nullable THEN 'NULLABLE_MISMATCH'
    ELSE 'OK'
  END as status
FROM expected_schema e
FULL OUTER JOIN actual_schema a ON e.column_name = a.column_name
WHERE e.column_name IS NULL
  OR a.column_name IS NULL
  OR e.expected_type != a.actual_type
  OR e.nullable != a.nullable;
```
**Expected**: Zero rows returned (all columns match contract)
**Breaking Change**: Any mismatch requires data governance approval
**Version**: 2.1.0 (updated 2024-01-15)

From development to production, ensure data integrity everywhere
- Run quality checks in CI/CD pipelines before deploying data transformations. Fail builds if quality thresholds aren't met.
- Schedule quality checks to run hourly or daily. Detect data drift, pipeline failures, and anomalies in real time.
- Verify that data migrations preserve integrity. Compare source and target systems for completeness and accuracy (a parity-check sketch follows this list).
- Document quality validation for regulatory compliance. Maintain auditable records of all quality checks.
- Track data quality KPIs and service level agreements. Report on quality metrics to stakeholders.
- Monitor data warehouse performance and integrity. Detect table bloat, index issues, and query problems.
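As a minimal sketch of such a migration parity check, the query below compares row counts and a coarse email-length checksum between a legacy table and its migrated copy. The `legacy.customers` name is an assumption about how the source system is attached; adapt both table names and the checksum column to your environment.

```sql --task=migration_parity
-- Illustrative migration parity check: row counts plus a coarse checksum.
-- Assumes the pre-migration data is reachable as legacy.customers and the
-- migrated copy as customers; both names are placeholders.
SELECT
  (SELECT COUNT(*) FROM legacy.customers) as source_rows,
  (SELECT COUNT(*) FROM customers) as target_rows,
  (SELECT COUNT(*) FROM legacy.customers)
    - (SELECT COUNT(*) FROM customers) as row_diff,
  (SELECT SUM(LENGTH(email)) FROM legacy.customers) as source_email_bytes,
  (SELECT SUM(LENGTH(email)) FROM customers) as target_email_bytes;
```
**Expected**: row_diff = 0 and matching byte totals; any difference warrants a column-by-column comparison.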
- Use the full power of SQL for any validation logic. No proprietary DSL to learn, no limitations on complexity. Write any query you need—joins, window functions, CTEs, aggregations—all supported.
- All quality rules versioned in Git. Pull requests for quality standard changes. Historical view of quality evolution. Treat quality checks like application code with proper review and approval workflows.
- No agents to install, no servers to maintain, no cloud platform required. Just Markdown files and the Deno runtime. Run quality checks anywhere: local development, CI/CD pipelines, or scheduled jobs.
- Every quality check produces structured output. Feed results to dashboards, alerts, or audit logs. Integration with Evidence.dev for visualization. Complete transparency into what was checked and when.
- Start with basic SQL assertions. Scale to sophisticated anomaly detection using statistical methods. Custom logic for any domain—finance, healthcare, e-commerce, or scientific data. No limits on complexity.
Run quality checks automatically in GitHub Actions.
Fail builds if quality standards aren't met.
Prevent bad data from reaching production.
```yaml
name: Data Quality Checks

on:
  push:
    branches: [main, develop]
  pull_request:
    paths:
      - 'data/**'
      - 'transforms/**'

jobs:
  quality-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install Deno
        uses: denoland/setup-deno@v1
        with:
          deno-version: v1.x

      - name: Install Spry
        run: deno install -A jsr:@programmablemd/spry

      - name: Run Quality Suite
        # Capture results so the violation check below has a report file to inspect
        run: spry run --all quality/data-quality-suite.md > quality-results.json
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}

      - name: Check for Quality Violations
        run: |
          if grep -q "FAIL" quality-results.json; then
            echo "Quality checks failed!"
            exit 1
          fi

      - name: Upload Quality Report
        uses: actions/upload-artifact@v3
        with:
          name: quality-report
          path: quality-results.json
```