Benchmarks
qualink ships with a real-world benchmark suite built on the NYC Yellow Taxi Trip dataset, a widely used open dataset for data engineering benchmarks.
Results at a Glance
| Metric | Value |
|---|---|
| Total records | 41.94 M |
| Data size | 654.3 MB (3 Parquet files) |
| Wall-clock time | 1.455 s |
| Checks | 12 |
| Constraints | 92 |
| Passed | 91 |
| Failed | 1 |
| Pass rate | 98.9% |
| Engine time | 1440 ms |
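
The figures in the table imply a throughput that follows from simple arithmetic. The sketch below recomputes the pass rate and derives records per second and MB per second from the values above. The variable names are illustrative only and are not part of qualink.

```python
# Values copied from the results table above.
total_records = 41.94e6   # rows across the 3 Parquet files
data_size_mb = 654.3      # total on-disk size in MB
wall_clock_s = 1.455      # end-to-end wall-clock time in seconds
constraints = 92
passed = 91

pass_rate = passed / constraints               # fraction of constraints that passed
records_per_s = total_records / wall_clock_s   # rows validated per second
mb_per_s = data_size_mb / wall_clock_s         # on-disk Parquet MB processed per second

print(f"Pass rate : {pass_rate:.1%}")
print(f"Throughput: {records_per_s / 1e6:.1f} M records/s")
print(f"Scan rate : {mb_per_s:.0f} MB/s")
```

With these inputs the run works out to roughly 28.8 million records per second. The MB/s figure is relative to the on-disk Parquet size, not the decoded in-memory size.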
Full Output
========================================================================
qualink Benchmark — NYC Taxi Trips
========================================================================
Parquet files : 3
Total size : 654.3 MB
Data dir : benchmarks/data
YAML config : benchmarks/nyc_taxi_validation.yaml
• data-200901.parquet (211.9 MB)
• data-201206.parquet (231.1 MB)
• data-201501.parquet (211.3 MB)
========================================================================
⏱ Running benchmark with 'human' formatter …
Check 'Uniqueness' completed with status=Warning (passed=0, failed=1)
Verification PASSED: NYC Taxi Trips – qualink Benchmark Suite
Checks 12
Constraints 92
Passed 91
Failed 1
Skipped 0
Pass rate 98.9%
Execution time 1440 ms
Status Check Message
-------- ---------- ---------------------------------------------
[FAIL] Uniqueness Uniqueness of (id) is 0.0000, expected >= 1.0
Issues:
Level Check Constraint Column Message Description Extra
------- ---------- -------------- -------- --------------------------------------------- ------------------------- -------
WARNING Uniqueness Uniqueness(id) id Uniqueness of (id) is 0.0000, expected >= 1.0 Uniqueness of (id) >= 1.0 -
========================================================================
Status : ✅ PASSED
Total records : 41.94M
Wall-clock : 1.455s
Checks : 12
Constraints : 92
Passed : 91
Failed : 1
Pass rate : 98.9%
Engine time : 0.02m
========================================================================
What's Validated
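The benchmark YAML suite (benchmarks/nyc_taxi_validation.yaml) runs 12 check groups containing 92 constraints in total. The Level column shows the severity assigned to each group. In the run above, the single failing constraint belongs to the WARNING-level Uniqueness group, and the verification is still reported as PASSED.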
| # | Check Group | Level | What it validates |
|---|---|---|---|
| 1 | Schema & Structure | ERROR | All 25 columns exist, column count = 25, table non-empty |
| 2 | Completeness – Critical | ERROR | Zero nulls in ID, timestamps, distance, fares |
| 3 | Completeness – Secondary | WARNING | Completeness of at least 90–99% on location and categorical fields |
| 4 | Uniqueness | WARNING | Trip id is globally unique |
| 5 | Fare & Amount Ranges | WARNING | Min/max/mean bounds on all monetary fields |
| 6 | Trip Distance & Passengers | WARNING | Distance 0–500 mi, passengers 0–9, mean checks |
| 7 | Statistical Checks | INFO | Sum, stddev, median, 90th/95th percentile quantiles |
| 8 | Geo Coordinates | WARNING | ≥90% completeness on lat/lon fields |
| 9 | Categorical Cardinality | INFO | Approx distinct counts for vendor, payment, rate codes |
| 10 | String Lengths | INFO | Vendor ID and payment type are short codes |
| 11 | Business Rules | WARNING | Dropoff > pickup, total ≥ fare, positive passengers |
| 12 | Correlation | INFO | distance↔fare >0.3, fare↔total >0.7 |
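
To make two of the metrics in the table concrete, here is a minimal sketch of completeness and uniqueness in plain Python. It assumes uniqueness means the fraction of rows whose value occurs exactly once, which is a common definition in data-quality tools; qualink's actual implementation and definition may differ. The function names and the toy data are hypothetical.

```python
from collections import Counter

def completeness(values):
    """Fraction of entries that are not None (1.0 means no nulls)."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)

def uniqueness(values):
    """Fraction of rows whose value occurs exactly once (assumed definition)."""
    if not values:
        return 0.0
    counts = Counter(values)
    return sum(1 for v in values if counts[v] == 1) / len(values)

# Hypothetical toy columns, not taken from the taxi dataset.
ids = [0, 1, 2] * 3               # every id appears three times
fares = [12.5, None, 8.0, 30.0]   # one missing value out of four

print(uniqueness(ids))
print(completeness(fares))
```

Under this definition a uniqueness of 0.0, as reported for `id` in the output above, would mean that no value occurs exactly once in the combined data.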
Run It Yourself
# 1. Download data (parquet files from public S3)
./benchmarks/download_data.sh 3
# 2. Run the benchmark
uv run python benchmarks/run_benchmark.py
# Other output formats
uv run python benchmarks/run_benchmark.py --format markdown
uv run python benchmarks/run_benchmark.py --format json
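
To cross-check the reported wall-clock time from outside the runner, a command can be timed with `time.perf_counter`. This is a minimal sketch: `time_command` is a hypothetical helper, not part of qualink, and the demo uses a stand-in command so that it runs without the benchmark data.

```python
import subprocess
import sys
import time

def time_command(cmd):
    """Run cmd (a list of arguments) and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    result = subprocess.run(cmd, check=False)
    elapsed = time.perf_counter() - start
    return result.returncode, elapsed

# Stand-in command; swap in
# ["uv", "run", "python", "benchmarks/run_benchmark.py"] to time the benchmark.
code, seconds = time_command([sys.executable, "-c", "print('stand-in run')"])
print(f"exit code {code}, wall-clock {seconds:.3f}s")
```

A measurement taken this way includes interpreter startup and imports, so it can be expected to come out higher than the wall-clock figure the runner prints.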
Data Files
The download script fetches Parquet files from a public S3 bucket. Each file contains ~14 million taxi trip records with 25 columns.
benchmarks/
├── README.md # detailed benchmark documentation
├── download_data.sh # fetches parquet files from S3
├── nyc_taxi_validation.yaml # comprehensive YAML validation suite
├── run_benchmark.py # Python benchmark runner with timing
└── data/ # ← created by download_data.sh (git-ignored)
    ├── data-200901.parquet
    ├── data-201206.parquet
    └── data-201501.parquet
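
After downloading, it can be useful to confirm that the expected files are in place before running the benchmark. The sketch below lists the Parquet files in a directory with their sizes, similar to the header of the benchmark output. `parquet_inventory` and `print_inventory` are hypothetical helpers, and the 1 MB = 1,000,000 bytes convention is an assumption; the runner may compute sizes differently.

```python
from pathlib import Path

def parquet_inventory(data_dir):
    """Return a sorted list of (file_name, size_in_mb) for *.parquet files in data_dir."""
    root = Path(data_dir)
    if not root.is_dir():
        return []  # nothing downloaded yet
    # Assumption: 1 MB = 1_000_000 bytes; the runner may use 1024 * 1024 instead.
    return [(f.name, f.stat().st_size / 1_000_000) for f in sorted(root.glob("*.parquet"))]

def print_inventory(data_dir="benchmarks/data"):
    """Print file count, total size, and per-file sizes for data_dir."""
    inventory = parquet_inventory(data_dir)
    total = sum(size for _, size in inventory)
    print(f"Parquet files : {len(inventory)}")
    print(f"Total size    : {total:.1f} MB")
    for name, size in inventory:
        print(f"  • {name} ({size:.1f} MB)")

print_inventory()
```

If the directory is missing or empty, the sketch reports zero files instead of raising an error.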
See the full dataset schema and configuration in benchmarks/README.md.