Benchmarks

qualink ships with a real-world benchmark suite built on the NYC Yellow Taxi Trip dataset, a widely used open dataset for data engineering benchmarks.

Results at a Glance

Metric	Value
Total records	41.94 M
Data size	654.3 MB (3 Parquet files)
Wall-clock time	1.455 s
Checks	12
Constraints	92
Passed	91
Failed	1
Pass rate	98.9%
Engine time	1440 ms
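
The pass rate and an approximate throughput can be derived from the figures in the table. The snippet below is a small arithmetic sketch, not part of the benchmark runner. The variable names are illustrative, and the MB/s figure is based on the on-disk Parquet size, not the decoded data volume.

```python
# Figures copied from the results table above.
total_records = 41.94e6   # 41.94 M records
data_size_mb = 654.3      # total on-disk size of the 3 Parquet files
wall_clock_s = 1.455      # wall-clock time in seconds
constraints = 92
passed = 91

pass_rate = passed / constraints * 100
records_per_s = total_records / wall_clock_s
mb_per_s = data_size_mb / wall_clock_s

print(f"Pass rate : {pass_rate:.1f}%")                        # 98.9%
print(f"Throughput: {records_per_s / 1e6:.1f} M records/s")   # 28.8 M records/s
print(f"Scan rate : {mb_per_s:.0f} MB/s")                     # 450 MB/s
```

These numbers describe a single run on unspecified hardware, so treat them as an order-of-magnitude indication only.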

Full Output

========================================================================
  qualink Benchmark — NYC Taxi Trips
========================================================================
  Parquet files : 3
  Total size    : 654.3 MB
  Data dir      : benchmarks/data
  YAML config   : benchmarks/nyc_taxi_validation.yaml

    • data-200901.parquet  (211.9 MB)
    • data-201206.parquet  (231.1 MB)
    • data-201501.parquet  (211.3 MB)
========================================================================

⏱  Running benchmark with 'human' formatter …

Check 'Uniqueness' completed with status=Warning (passed=0, failed=1)
Verification PASSED: NYC Taxi Trips – qualink Benchmark Suite

Checks          12
Constraints     92
Passed          91
Failed          1
Skipped         0
Pass rate       98.9%
Execution time  1440 ms

Status    Check       Message
--------  ----------  ---------------------------------------------
[FAIL]    Uniqueness  Uniqueness of (id) is 0.0000, expected >= 1.0

Issues:
Level    Check       Constraint      Column    Message                                        Description                Extra
-------  ----------  --------------  --------  ---------------------------------------------  -------------------------  -------
WARNING  Uniqueness  Uniqueness(id)  id        Uniqueness of (id) is 0.0000, expected >= 1.0  Uniqueness of (id) >= 1.0  -

========================================================================
  Status         : ✅ PASSED
  Total records  : 41.94M
  Wall-clock     : 1.455s
  Checks         : 12
  Constraints    : 92
  Passed         : 91
  Failed         : 1
  Pass rate      : 98.9%
  Engine time    : 0.02m
========================================================================

The run is reported as PASSED even though one constraint failed. The failing Uniqueness check is WARNING level, so in this run its failure is listed under Issues without failing the verification.

What's Validated

The benchmark YAML suite runs 12 check groups with 92 constraint rules:

#	Check Group	Level	What it validates
1	Schema & Structure	ERROR	All 25 columns exist, column count = 25, table non-empty
2	Completeness – Critical	ERROR	Zero nulls in ID, timestamps, distance, fares
3	Completeness – Secondary	WARNING	≥90–99% completeness on location and categorical fields
4	Uniqueness	WARNING	Trip `id` is globally unique
5	Fare & Amount Ranges	WARNING	Min/max/mean bounds on all monetary fields
6	Trip Distance & Passengers	WARNING	Distance 0–500 mi, passengers 0–9, mean checks
7	Statistical Checks	INFO	Sum, stddev, median, 90th/95th percentile quantiles
8	Geo Coordinates	WARNING	≥90% completeness on lat/lon fields
9	Categorical Cardinality	INFO	Approx distinct counts for vendor, payment, rate codes
10	String Lengths	INFO	Vendor ID and payment type are short codes
11	Business Rules	WARNING	Dropoff > pickup, total ≥ fare, positive passengers
12	Correlation	INFO	distance↔fare >0.3, fare↔total >0.7
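
To make the one failing rule (check group 4) easier to interpret, here is a minimal sketch of a uniqueness ratio. It assumes one common definition: the fraction of rows whose value occurs exactly once. This page does not state qualink's exact definition, and the `uniqueness` helper below is hypothetical, not the library's API.

```python
from collections import Counter

def uniqueness(values):
    """Fraction of rows whose value occurs exactly once (assumed definition)."""
    values = list(values)
    if not values:
        return 1.0  # convention for this sketch: an empty column has no duplicates
    counts = Counter(values)
    rows_with_unique_value = sum(1 for c in counts.values() if c == 1)
    return rows_with_unique_value / len(values)

print(uniqueness([1, 2, 3, 4]))        # 1.0, every id appears once
print(uniqueness([1, 2, 3, 1, 2, 3]))  # 0.0, every id appears twice
```

Under this definition, a score near 0.0 means almost no `id` value is unique across the combined files. The second call shows one way that can happen, but it is only an illustration: the output above does not say why the real data scores 0.0000.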

Run It Yourself

# 1. Download data (Parquet files from a public S3 bucket)
./benchmarks/download_data.sh 3

# 2. Run the benchmark
uv run python benchmarks/run_benchmark.py

# Other output formats
uv run python benchmarks/run_benchmark.py --format markdown
uv run python benchmarks/run_benchmark.py --format json
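
If you want to keep the Markdown and JSON reports, the commands above can be wrapped in a short script. This is a sketch under two assumptions that the source does not confirm: that the runner prints the formatted report to stdout, and that it exits with status 0 on a passing run. The output directory and file names are hypothetical.

```python
import subprocess
from pathlib import Path

RUNNER = "benchmarks/run_benchmark.py"
# Only the two formats shown in the commands above; file names are hypothetical.
OUTPUTS = {"markdown": "benchmark_report.md", "json": "benchmark_report.json"}

def build_command(fmt):
    """Return the argv list for one benchmark run in the given format."""
    return ["uv", "run", "python", RUNNER, "--format", fmt]

def save_reports(out_dir="reports"):
    """Run the benchmark once per format and save whatever it prints."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for fmt, name in OUTPUTS.items():
        # Assumption: the report goes to stdout; check=True raises on a non-zero exit.
        result = subprocess.run(
            build_command(fmt), capture_output=True, text=True, check=True
        )
        (out / name).write_text(result.stdout, encoding="utf-8")
```

Call `save_reports()` from the repository root after downloading the data. Each format triggers a full benchmark run, so timings may differ slightly between the two reports.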

Data Files

The download script fetches Parquet files from a public S3 bucket. Each file contains ~14 million taxi trip records with 25 columns.

benchmarks/
├── README.md                  # detailed benchmark documentation
├── download_data.sh           # fetches parquet files from S3
├── nyc_taxi_validation.yaml   # comprehensive YAML validation suite
├── run_benchmark.py           # Python benchmark runner with timing
└── data/                      # ← created by download_data.sh (git-ignored)
    ├── data-200901.parquet
    ├── data-201206.parquet
    └── data-201501.parquet
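
After downloading, you can check the `data/` directory against the file list printed in the benchmark header. The helper below is a hypothetical sketch using only the standard library. It reports sizes in decimal megabytes, which is an assumption: the runner may use a different unit, so totals can differ by a few percent.

```python
from pathlib import Path

def parquet_summary(data_dir="benchmarks/data"):
    """Return (file name, size in decimal MB) pairs for Parquet files in data_dir."""
    files = sorted(Path(data_dir).glob("*.parquet"))
    return [(f.name, f.stat().st_size / 1_000_000) for f in files]

def print_summary(data_dir="benchmarks/data"):
    """Print one line per Parquet file, followed by a total."""
    rows = parquet_summary(data_dir)
    if not rows:
        print(f"No Parquet files in {data_dir}; run ./benchmarks/download_data.sh first.")
        return
    for name, size_mb in rows:
        print(f"{name}  ({size_mb:.1f} MB)")
    print(f"Total: {len(rows)} files, {sum(size for _, size in rows):.1f} MB")
```

Running `print_summary()` from the repository root should list the same three files shown in the tree above.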

See the full dataset schema and configuration in benchmarks/README.md.