Database Benchmarking

Realistic data for performance testing

The Benchmarking Challenge

Database benchmarks need realistic data distributions to produce meaningful results. Uniformly random data doesn't cut it.

Why Realistic Distributions Matter

❌ Uniform Random Data

Every value equally likely. Unrealistic query patterns. Poor index selectivity testing.

SELECT * WHERE user_id = 42
→ Returns 1 row (0.0001% of a 1M-row table)

✅ Power-Law Distribution

Top 1% of users generate 50% of activity. Realistic skew. Real-world query patterns.

SELECT * WHERE user_id = 42
→ Returns 50,000 rows (5% of a 1M-row table)
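
The contrast above can be reproduced with a small simulation. This is a minimal sketch, not aphelion's implementation: it draws user_id values once uniformly and once with Zipf-style weights (the user at rank k gets weight 1/k), then counts how many rows the hottest key matches. The function name, the exponent of 1, and the small table sizes (chosen to keep the run fast) are assumptions for illustration.

```python
import random
from collections import Counter
from itertools import accumulate

def hottest_key_rows(n_users, n_rows, skewed, seed=42):
    """Return how many of n_rows belong to the most frequent user_id."""
    rng = random.Random(seed)
    user_ids = range(1, n_users + 1)
    if skewed:
        # Zipf-style weights: rank k gets weight 1/k (assumed exponent of 1).
        cum_weights = list(accumulate(1.0 / k for k in user_ids))
        rows = rng.choices(user_ids, cum_weights=cum_weights, k=n_rows)
    else:
        # Uniform: every user_id is equally likely.
        rows = rng.choices(user_ids, k=n_rows)
    return Counter(rows).most_common(1)[0][1]

uniform_hits = hottest_key_rows(10_000, 200_000, skewed=False)
skewed_hits = hottest_key_rows(10_000, 200_000, skewed=True)
print(f"uniform: {uniform_hits} rows, skewed: {skewed_hits} rows")
```

With these settings the uniform table averages 20 rows per key, while the skewed table's hottest key should match roughly 10% of all rows. That difference decides whether a query planner favors an index lookup or a scan.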

Supported Benchmarks

📊 TPC-H

  • 8 tables, 22 queries
  • Scale factors: 1GB - 1TB
  • Realistic skew in the orders and lineitem tables
  • Deterministic for reproducibility

🎬 IMDB JOB

  • 21 tables, 113 queries
  • Real-world join complexity
  • Power-law distributions
  • Temporal constraints

🔧 Custom

  • Your schema, your rules
  • Configurable distributions
  • Weighted choice generators
  • Multi-tenant skew

Real Example: IMDB JOB Benchmark

aphelion generate examples/imdb-job/schema.json \
         --rows 1000000 \
         --distribution power-law \
         --seed 42

Generated Data:

  • 1M movies (80% released 2000-2024)
  • 5M cast_info (top 100 actors in 40% of movies)
  • 2M movie_info (realistic genre/rating distributions)
  • Power-law: Top 1% of movies have 50% of cast members
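
The --seed flag in the command above suggests that a fixed seed reproduces the same data on every run. The sketch below illustrates that idea only and assumes nothing about aphelion's internals: a generator that owns a private, seeded random number generator returns identical rows for the same seed. The function generate_movies, its row shape, and the 1950-1999 range for older titles are hypothetical.

```python
import random

def generate_movies(n, seed):
    """Generate n (movie_id, release_year) rows from a private, seeded RNG."""
    rng = random.Random(seed)  # private RNG, unaffected by global random state
    rows = []
    for movie_id in range(1, n + 1):
        # About 80% of release years fall in 2000-2024; the older range is an assumption.
        if rng.random() < 0.8:
            year = rng.randint(2000, 2024)
        else:
            year = rng.randint(1950, 1999)
        rows.append((movie_id, year))
    return rows

# Same seed, same rows: benchmark runs can be compared across machines.
assert generate_movies(1000, seed=42) == generate_movies(1000, seed=42)
```

Keeping the generator's randomness in its own random.Random instance, rather than the module-level functions, means unrelated code that calls random cannot change the generated data.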

Distribution Types

Power-Law (Zipfian)

80/20 rule: 20% of items account for 80% of activity

Use for: User activity, product popularity, page views
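
To see where an 80/20 split comes from, the sketch below computes the share of total activity held by the top fraction of items when the item at rank k has weight 1/k^s. It is an illustration under stated assumptions, not a description of how aphelion parameterizes its power-law generator; the function name and the exponent parameter are hypothetical.

```python
def top_share(n_items, top_fraction, exponent):
    """Share of total activity held by the top `top_fraction` of items
    when the item at rank k has weight 1 / k**exponent."""
    weights = [1.0 / k ** exponent for k in range(1, n_items + 1)]
    top_n = max(1, round(n_items * top_fraction))
    return sum(weights[:top_n]) / sum(weights)

for s in (0.0, 0.5, 1.0, 1.5):
    print(f"exponent {s}: top 20% hold {top_share(1000, 0.2, s):.0%} of activity")
```

With 1,000 items, an exponent of 1 puts the top 20% close to 80% of activity, and an exponent of 0 degenerates to the uniform case where the top 20% hold exactly 20%. Raising the exponent concentrates activity further.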

Normal (Gaussian)

Bell curve: Most values cluster around the mean

Use for: Response times, ages, measurements
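
A minimal sketch of a bounded normal column, assuming an age field with a mean of 35 and a standard deviation of 12. The function name, parameters, and bounds are hypothetical and chosen for illustration. Out-of-range draws are rejected rather than clipped, since clipping would pile values up at the bounds.

```python
import random

def sample_ages(n, mean=35.0, stddev=12.0, low=18, high=90, seed=42):
    """Draw n integer ages from a normal distribution truncated to [low, high]."""
    rng = random.Random(seed)
    ages = []
    while len(ages) < n:
        value = rng.gauss(mean, stddev)
        # Reject and redraw instead of clipping to keep the bell shape inside the bounds.
        if low <= value <= high:
            ages.append(round(value))
    return ages

ages = sample_ages(10_000)
print(min(ages), max(ages), sum(ages) / len(ages))
```

Truncating at the lower bound shifts the observed mean slightly above the configured one, which is worth remembering when comparing generated data against the input parameters.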

Weighted Choice

Custom probabilities: 60% A, 25% B, 15% C

Use for: Status codes, categories, regions
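
The 60/25/15 split above can be sketched with the standard library's random.choices, which accepts relative weights. The helper name weighted_column and the dict-based configuration are assumptions for illustration, not aphelion's configuration format.

```python
import random
from collections import Counter

def weighted_column(choices, n, seed=42):
    """Draw n values; `choices` maps each value to its relative weight."""
    rng = random.Random(seed)
    values = list(choices)
    weights = list(choices.values())
    return rng.choices(values, weights=weights, k=n)

column = weighted_column({"A": 60, "B": 25, "C": 15}, 100_000)
counts = Counter(column)
for value in ("A", "B", "C"):
    print(value, f"{counts[value] / len(column):.1%}")
```

The weights are relative, so they do not need to sum to 100; the observed frequencies converge on the configured proportions as the row count grows.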

Temporal

Time-based constraints: orders after signup

Use for: Event sequences, audit trails
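
One way to enforce "orders after signup" is to draw the parent timestamp first and then draw each child timestamp from the interval that follows it. The sketch below assumes a one-year window and uniform spacing; the function name and the window dates are hypothetical.

```python
import random
from datetime import datetime, timedelta

def generate_signup_and_orders(n_orders, seed=42):
    """Return (signup, orders) where every order is strictly after signup."""
    rng = random.Random(seed)
    window_start = datetime(2024, 1, 1)
    window_end = datetime(2025, 1, 1)
    window_seconds = int((window_end - window_start).total_seconds())

    # Draw the parent event first; it always lands before window_end.
    signup = window_start + timedelta(seconds=rng.randrange(window_seconds))

    # Draw each child event from the time remaining after the parent.
    remaining = int((window_end - signup).total_seconds())
    orders = sorted(
        signup + timedelta(seconds=rng.randrange(1, remaining + 1))
        for _ in range(n_orders)
    )
    return signup, orders

signup, orders = generate_signup_and_orders(5)
print("signup:", signup)
print("first order:", orders[0])
```

Generating the dependent column from the parent value, rather than generating both independently and filtering, guarantees the constraint holds for every row without discarding data.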
