Database Benchmarking
Realistic data for performance testing
The Benchmarking Challenge
Database benchmarks produce meaningful results only when the data follows realistic distributions. Uniformly random data hides the skew that drives index selectivity and join cardinality in production.
Why Realistic Distributions Matter
❌ Uniform Random Data
Every value is equally likely, so every lookup is equally selective. Query patterns are unrealistic and index selectivity goes largely untested.
→ Example lookup returns 1 row out of 1,000,000 (0.0001%)
✅ Power-Law Distribution
Top 1% of users generate 50% of activity. Realistic skew. Real-world query patterns.
→ Example lookup returns 50,000 rows out of 1,000,000 (5%)
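The contrast between the two cards can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not the tool's own generator: the row and key counts, the helper name `top_value_share`, and the choice of a Zipf exponent of 1 are all assumptions made for illustration.

```python
import random
from collections import Counter

def top_value_share(values):
    """Fraction of rows matched by an equality predicate on the most common value."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)

rng = random.Random(42)           # fixed seed so the run is repeatable
n_rows, n_keys = 100_000, 10_000  # illustrative sizes, not taken from the page

# Uniform: every key is equally likely.
uniform = [rng.randrange(n_keys) for _ in range(n_rows)]

# Power-law: the key with rank r gets weight 1/r (Zipf with exponent 1, an assumption).
weights = [1.0 / rank for rank in range(1, n_keys + 1)]
skewed = rng.choices(range(n_keys), weights=weights, k=n_rows)

uniform_share = top_value_share(uniform)
skewed_share = top_value_share(skewed)
print(f"uniform hottest key:   {uniform_share:.4%} of rows")
print(f"power-law hottest key: {skewed_share:.4%} of rows")
```

Under the uniform column, the hottest key matches only a tiny fraction of rows, so an index lookup and a full scan are never a close call for the planner. Under the skewed column, the same predicate can match a sizeable share of the table, which is the case a benchmark needs to cover.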
Supported Benchmarks
📊 TPC-H
- 8 tables, 22 queries
- Scale factors: 1 GB to 1 TB
- Realistic skew in orders/lineitem
- Deterministic for reproducibility
🎬 IMDB JOB
- 21 tables, 113 queries
- Real-world join complexity
- Power-law distributions
- Temporal constraints
🔧 Custom
- Your schema, your rules
- Configurable distributions
- Weighted choice generators
- Multi-tenant skew
Real Example: IMDB JOB Benchmark
aphelion generate examples/imdb-job/schema.json \
  --rows 1000000 \
  --distribution power-law \
  --seed 42
Generated Data:
- 1M movies (80% released 2000-2024)
- 5M cast_info (top 100 actors in 40% of movies)
- 2M movie_info (realistic genre/rating distributions)
- Power-law: top 1% of movies have 50% of cast members
Distribution Types
Power-Law (Zipfian)
80/20 rule: 20% of items account for 80% of activity
Use for: User activity, product popularity, page views
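One common way to draw Zipfian values is an inverse-CDF lookup over precomputed cumulative weights. The sketch below assumes an exponent of 1 and 1,000 items; `ZipfSampler` is a hypothetical name for illustration and is not part of any tool's API.

```python
import bisect
import itertools
import random

class ZipfSampler:
    """Draws ranks 1..n with P(rank) proportional to 1 / rank**s (hypothetical helper)."""

    def __init__(self, n, s=1.0, seed=None):
        self._n = n
        self._rng = random.Random(seed)
        weights = [1.0 / (rank ** s) for rank in range(1, n + 1)]
        self._cdf = list(itertools.accumulate(weights))  # running totals of the weights
        self._total = self._cdf[-1]

    def sample(self):
        # Inverse-CDF lookup: find the first rank whose cumulative weight exceeds u.
        u = self._rng.random() * self._total
        index = bisect.bisect_right(self._cdf, u)
        return min(index, self._n - 1) + 1  # min() guards against float rounding at the top end

sampler = ZipfSampler(n=1000, s=1.0, seed=42)
draws = [sampler.sample() for _ in range(50_000)]

# Share of all draws that landed on the top 20% of items (ranks 1..200).
top_20_share = sum(1 for rank in draws if rank <= 200) / len(draws)
print(f"top 20% of items received {top_20_share:.1%} of draws")
```

With these particular parameters the top 20% of items receive a share close to the 80/20 split described above. A larger exponent `s` concentrates activity further, and a smaller one flattens it.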
Normal (Gaussian)
Bell curve: Most values cluster around the mean
Use for: Response times, ages, measurements
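A bounded Gaussian column can be sketched with `random.gauss` plus rejection of out-of-range draws. The function name `normal_values`, the latency parameters, and the bounds are assumptions chosen for illustration. Rejection is used here instead of clamping because clamping would pile values up at the bounds.

```python
import random
import statistics

def normal_values(n, mean, stddev, lo, hi, seed=None):
    """Return n Gaussian draws, discarding any that fall outside [lo, hi]."""
    rng = random.Random(seed)
    values = []
    while len(values) < n:
        v = rng.gauss(mean, stddev)
        if lo <= v <= hi:  # rejection keeps the bell shape inside the bounds
            values.append(v)
    return values

# Illustrative response times in milliseconds: mean 120, standard deviation 25.
latencies_ms = normal_values(10_000, mean=120.0, stddev=25.0, lo=0.0, hi=500.0, seed=42)

sample_mean = statistics.fmean(latencies_ms)
within_one_sd = sum(1 for v in latencies_ms if 95.0 <= v <= 145.0) / len(latencies_ms)
print(f"mean = {sample_mean:.1f} ms, within one stddev = {within_one_sd:.1%}")
```

For a normal distribution about 68% of values fall within one standard deviation of the mean, which gives a quick sanity check on generated data.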
Weighted Choice
Custom probabilities: 60% A, 25% B, 15% C
Use for: Status codes, categories, regions
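The 60/25/15 split above maps directly onto `random.choices` from the Python standard library. The labels and sample size in this sketch are placeholders, and it is not meant to reflect how any particular generator implements weighted choice.

```python
import random
from collections import Counter

# Placeholder labels and weights mirroring the 60% / 25% / 15% example.
CHOICES = ["A", "B", "C"]
WEIGHTS = [60, 25, 15]

rng = random.Random(42)  # fixed seed for repeatable output
draws = rng.choices(CHOICES, weights=WEIGHTS, k=20_000)

# Observed share of each label; these should sit close to the configured weights.
shares = {value: count / len(draws) for value, count in Counter(draws).items()}
for value in CHOICES:
    print(f"{value}: {shares[value]:.1%}")
```

Weights are relative, so `[60, 25, 15]` and `[0.60, 0.25, 0.15]` behave the same.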
Temporal
Time-based constraints: for example, every order timestamp falls after the user's signup
Use for: Event sequences, audit trails
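A temporal constraint such as "orders after signup" can be enforced by drawing the dependent timestamp as an offset from the parent one. The sketch below assumes a one-year window and up to three orders per user; `generate_user_orders` and the tuple layout are hypothetical.

```python
import random
from datetime import datetime, timedelta

WINDOW_START = datetime(2023, 1, 1)  # illustrative window, not taken from the page
WINDOW_END = datetime(2024, 1, 1)

def generate_user_orders(n_users, seed=None):
    """Return (user_id, signup_at, ordered_at) rows where ordered_at >= signup_at."""
    rng = random.Random(seed)
    window_seconds = int((WINDOW_END - WINDOW_START).total_seconds())
    rows = []
    for user_id in range(1, n_users + 1):
        # Signup lands anywhere in the window, strictly before WINDOW_END.
        signup_at = WINDOW_START + timedelta(seconds=rng.randrange(window_seconds))
        remaining = int((WINDOW_END - signup_at).total_seconds())
        for _ in range(rng.randint(0, 3)):
            # Each order is a non-negative offset from signup, so it cannot precede it.
            ordered_at = signup_at + timedelta(seconds=rng.randrange(remaining + 1))
            rows.append((user_id, signup_at, ordered_at))
    return rows

orders = generate_user_orders(1_000, seed=42)
print(f"{len(orders)} orders generated, all on or after the matching signup")
```

Generating the child timestamp relative to the parent guarantees the constraint by construction, which avoids a generate-then-filter pass that would distort row counts.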