YAML Configuration

qualink supports a fully declarative YAML format so you can define your entire validation suite without writing Python code.

Basic Structure

suite:
  name: "My Validation Suite"

data_sources:
  - name: users_source
    format: csv
    path: "data/users.csv"
    table_name: users

checks:
  - name: "Check Name"
    level: error
    description: "Optional description"
    rules:
      - constraint_type: column_name_or_config

outputs:
  - path: "reports/results.json"
    format: json
  - uri: "s3://my-bucket/qualink/results.md"
    format: markdown

Data Sources

Single Source

data_sources:
  - name: users_source
    format: csv
    path: "data/users.csv"
    table_name: users

Multiple Sources (for cross-table checks)

data_sources:
  - name: orders_source
    format: csv
    path: "data/orders.csv"
    table_name: orders
  - name: users_source
    format: csv
    path: "data/users.csv"
    table_name: users

Supported file formats (the format field): csv, parquet, and json.

ADBC Sources

For database-backed sources, use a named connection with a URI and define either table or query on the data source. qualink reads the result through ADBC, registers it as a DataFusion table, and runs the normal checks against that registered table.

connections:
  sqlite_local:
    uri: sqlite:///tmp/users.db

data_sources:
  - name: users_source
    connection: sqlite_local
    table: users
    table_name: users

The same source defined with a query instead of a table:

connections:
  sqlite_local:
    uri: sqlite:///tmp/users.db

data_sources:
  - name: users_source
    connection: sqlite_local
    query: |
      SELECT user_id, email, age
      FROM users
    table_name: users
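
The table-or-query rule above can be checked before a config is ever run. The following is a minimal sketch that works on plain dictionaries shaped like the YAML above; `validate_adbc_source` is a hypothetical helper, not part of qualink, and it assumes that "either table or query" means exactly one of the two.

```python
def validate_adbc_source(source: dict, connections: dict) -> list[str]:
    """Return a list of problems; an empty list means the source looks consistent."""
    problems = []

    # The source must point at a connection declared under `connections`.
    connection = source.get("connection")
    if connection not in connections:
        problems.append(f"unknown connection: {connection!r}")

    # Assumption: exactly one of `table` or `query` may be set.
    if ("table" in source) == ("query" in source):
        problems.append("define exactly one of 'table' or 'query'")

    # `table_name` is the name the checks refer to.
    if not source.get("table_name"):
        problems.append("missing 'table_name'")

    return problems


connections = {"sqlite_local": {"uri": "sqlite:///tmp/users.db"}}
good = {
    "name": "users_source",
    "connection": "sqlite_local",
    "table": "users",
    "table_name": "users",
}
bad = dict(good, query="SELECT user_id FROM users")  # sets both table and query

print(validate_adbc_source(good, connections))
print(validate_adbc_source(bad, connections))
```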

Secret-backed Connections

Connection values can be resolved inline from supported secret systems instead of storing them directly in YAML. The examples in this section use the following inline sources: env, aws_ssm, aws_secretsmanager, and gcp_secret_manager.

The general shape is:

connections:
  warehouse:
    uri:
      from: aws_ssm
      key: /qualink/prod/postgres/uri
      region: us-east-1

Environment variable example:

connections:
  sqlite_local:
    uri:
      from: env
      key: QUALINK_SQLITE_URI

AWS Secrets Manager with JSON field extraction:

connections:
  snowflake_prod:
    uri:
      from: aws_secretsmanager
      key: qualink/prod/snowflake
      field: uri
      region: eu-west-1

GCP Secret Manager example:

connections:
  bigquery_prod:
    uri:
      from: gcp_secret_manager
      key: qualink-bigquery-uri
      project_id: my-project

Optional secret-backed values can be omitted by setting required: false:

connections:
  lake:
    endpoint:
      from: env
      key: AWS_ENDPOINT_URL
      required: false

Inline secret refs are only resolved inside connections. They work for ADBC uri values and for object-store connection options such as endpoint, service_account_path, or region.
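
To make the behavior of an inline ref concrete, here is a minimal sketch of how a `from: env` reference could be resolved. It illustrates the idea only and is not qualink's implementation; `resolve_env_ref` is a hypothetical name, and the sketch assumes that `required` defaults to true when it is not set.

```python
import os


def resolve_env_ref(ref, environ=None):
    """Resolve a {'from': 'env', 'key': ...} mapping; plain values pass through."""
    environ = os.environ if environ is None else environ

    # Plain YAML values (for example region: us-east-1) need no resolution.
    if not isinstance(ref, dict):
        return ref

    if ref.get("from") != "env":
        raise ValueError(f"this sketch only handles 'env', got {ref.get('from')!r}")

    value = environ.get(ref["key"])
    if value is None:
        # Assumption: a missing value is an error unless required is false.
        if ref.get("required", True):
            raise KeyError(f"missing environment variable {ref['key']}")
        return None  # optional value is simply left out
    return value


fake_env = {"QUALINK_SQLITE_URI": "sqlite:///tmp/users.db"}
print(resolve_env_ref({"from": "env", "key": "QUALINK_SQLITE_URI"}, fake_env))
print(resolve_env_ref({"from": "env", "key": "AWS_ENDPOINT_URL", "required": False}, fake_env))
```

Passing a dictionary instead of the real environment keeps the sketch easy to test.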

Object Store Sources

qualink supports reading data directly from object stores using DataFusion-native adapters. The object store provider is inferred from the URI scheme in path.

Amazon S3

data_sources:
  - name: users_source
    format: parquet
    path: s3://my-data-bucket/data/users.parquet
    table_name: users
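
Inferring the provider from the URI scheme can be pictured with the standard library alone. This is a rough sketch of the idea, not qualink's own logic; `infer_scheme` is a hypothetical helper, and treating a path without a scheme as local is an assumption.

```python
from urllib.parse import urlparse


def infer_scheme(path: str) -> str:
    """Return the URI scheme of a data source path, or 'local' when there is none."""
    # Assumption: a path with no scheme refers to the local filesystem.
    return urlparse(path).scheme or "local"


print(infer_scheme("s3://my-data-bucket/data/users.parquet"))
print(infer_scheme("data/users.csv"))
```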

If you are not using an attached IAM role, supply credentials through the standard AWS provider chain before running, for example with environment variables:

export AWS_DEFAULT_REGION=us-east-1
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=wJalr...

Environment Variable Reference

Variable                          Description
AWS_DEFAULT_REGION / AWS_REGION   AWS region for the bucket
AWS_ACCESS_KEY_ID                 AWS access key
AWS_SECRET_ACCESS_KEY             AWS secret key
AWS_SESSION_TOKEN                 Temporary session token (optional)
AWS_ENDPOINT_URL                  Custom endpoint for MinIO, R2, etc.
AWS_ALLOW_HTTP                    Set to true to allow plain HTTP endpoints

Object Store YAML Configuration Reference

Field                      Description                                                              Required
data_sources[].path        Full object-store URI such as s3://bucket/key                            Yes
data_sources[].format      csv, parquet, json (auto-detected if omitted)                            No
data_sources[].table_name  DataFusion table name                                                    Yes
data_sources[].connection  Optional named connection for extra settings such as region or endpoint  No

Security: Prefer inline secret refs for sensitive connection values and keep only non-secret settings such as region or endpoint as plain YAML values.

Multiple S3 Sources

data_sources:
  - name: orders_source
    path: s3://data-lake/orders/2024/
    format: parquet
    table_name: orders
  - name: users_source
    path: s3://data-lake/users.csv
    format: csv
    table_name: users

You can also mix local and S3 sources:

data_sources:
  - name: orders_source
    path: s3://data-lake/production/orders.parquet
    format: parquet
    table_name: orders
  - name: users_source
    format: csv
    path: local/users.csv
    table_name: users

Result Outputs

Use output for a single destination or outputs for multiple destinations. Each entry currently uses the filesystem writer and can target a local path or a supported filesystem URI.

Single destination:

output:
  path: reports/results.json
  format: json
  show_passed: true

Multiple destinations:

outputs:
  - path: reports/results.json
    format: json
    show_passed: true
  - uri: s3://my-bucket/qualink/results.md
    format: markdown
  - uri: abfss://container@account.dfs.core.windows.net/qualink/results.json
    format: json

Output Fields

Field                                         Description                          Required
output.path / outputs[].path                  Local filesystem destination         No
output.uri / outputs[].uri                    Remote filesystem URI                No
output.destination / outputs[].destination    Generic destination alias            No
output.format / outputs[].format              human, json, or markdown             No
output.show_passed / outputs[].show_passed    Include passed constraints           No
output.show_metrics / outputs[].show_metrics  Include aggregate metrics            No
output.show_issues / outputs[].show_issues    Include issues section               No
output.colorize / outputs[].colorize          Enable ANSI colors for human output  No

At least one of path, uri, or destination is required for every output entry.

Supported remote URI schemes currently include s3://, gs://, gcs://, az://, abfs://, and abfss://.

Assertion Syntax

Inline Bound Keys (recommended)

Use shorthand keys directly in the rule config:

rules:
  - has_min:
      column: age
      gte: 0          # greater_than_or_equal
  - has_max:
      column: age
      lte: 120         # less_than_or_equal
  - has_mean:
      column: score
      between: [30, 100]  # between
  - has_size:
      gt: 0            # greater_than
  - has_sum:
      column: amount
      eq: 10000        # equal_to
  - has_max:
      column: price
      lt: 1000         # less_than

Key      Assertion
gt       greater_than
gte      greater_than_or_equal
lt       less_than
lte      less_than_or_equal
eq       equal_to
between  between(lower, upper)
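
The key-to-assertion mapping above can be read as a lookup of comparison functions. The sketch below shows one way to evaluate a computed metric against a rule config; `check_bounds` is a hypothetical helper rather than qualink's API, and treating between as inclusive on both ends is an assumption.

```python
import operator

# Shorthand key -> comparison, mirroring the mapping listed above.
BOUND_KEYS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "eq": operator.eq,
}


def check_bounds(metric: float, rule: dict) -> bool:
    """Return True when the metric satisfies every bound key present in the rule."""
    for key, compare in BOUND_KEYS.items():
        if key in rule and not compare(metric, rule[key]):
            return False
    if "between" in rule:
        lower, upper = rule["between"]
        # Assumption: both ends of the range are inclusive.
        if not (lower <= metric <= upper):
            return False
    return True


print(check_bounds(42, {"column": "age", "gte": 0}))
print(check_bounds(130, {"column": "age", "lte": 120}))
print(check_bounds(55.5, {"column": "score", "between": [30, 100]}))
```

Keys that are not bounds, such as column, are ignored by the lookup, so the rule config can be passed in unchanged.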

Structured Assertion

rules:
  - has_size:
      assertion:
        operator: equal_to
        value: 100

Shorthand String

rules:
  - has_size:
      assertion: "> 0"
  - has_completeness:
      column: name
      assertion: ">= 0.95"
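
A shorthand string carries the same information as the structured form: an operator and a value. As a rough sketch, it could be expanded like this. `parse_shorthand` is a hypothetical helper, and only `>` and `>=` appear in this guide; the other symbols in the table are assumptions.

```python
import re

# Symbol -> operator name. Only ">" and ">=" are shown in this guide;
# the remaining symbols are assumed for illustration.
SYMBOL_TO_OPERATOR = {
    ">": "greater_than",
    ">=": "greater_than_or_equal",
    "<": "less_than",
    "<=": "less_than_or_equal",
    "==": "equal_to",
}

# Two-character symbols come first so ">=" is not read as ">".
_SHORTHAND = re.compile(r"^\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_shorthand(text: str) -> dict:
    """Expand a string such as '>= 0.95' into the structured assertion form."""
    match = _SHORTHAND.match(text)
    if match is None:
        raise ValueError(f"unrecognized assertion: {text!r}")
    symbol, number = match.groups()
    return {"operator": SYMBOL_TO_OPERATOR[symbol], "value": float(number)}


print(parse_shorthand("> 0"))
print(parse_shorthand(">= 0.95"))
```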

Rule Types

Simple Column Rules

When a rule takes only a column name, use the scalar shorthand:

rules:
  - is_complete: user_id
  - is_unique: email
  - has_column: name

Column List Rules

rules:
  - is_unique: [first_name, last_name]    # Composite uniqueness
  - has_distinctness:
      columns: [status]
      gte: 0.3
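
The scalar and list shorthands can be thought of as abbreviations of the mapping form. The sketch below normalizes each rule entry into a rule type plus a config mapping. `normalize_rule` is a hypothetical helper, and expanding a scalar to `column` and a list to `columns` is an assumption based on the key names used elsewhere in this guide.

```python
def normalize_rule(rule: dict) -> tuple[str, dict]:
    """Turn one entry of `rules` into (rule_type, config)."""
    if len(rule) != 1:
        raise ValueError("each rule entry should have exactly one rule type")
    ((rule_type, config),) = rule.items()

    if isinstance(config, str):
        # Scalar shorthand, e.g. is_complete: user_id
        return rule_type, {"column": config}
    if isinstance(config, list):
        # List shorthand, e.g. is_unique: [first_name, last_name]
        return rule_type, {"columns": config}
    # Already in mapping form; copy so the input is not modified.
    return rule_type, dict(config)


print(normalize_rule({"is_complete": "user_id"}))
print(normalize_rule({"is_unique": ["first_name", "last_name"]}))
print(normalize_rule({"has_distinctness": {"columns": ["status"], "gte": 0.3}}))
```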

All Supported Rule Types

checks:
  - name: "Complete Example"
    level: warning
    rules:
      # Structure
      - has_size:
          gt: 0
      - has_column_count:
          eq: 10
      - has_column: user_id

      # Completeness
      - is_complete: id
      - has_completeness:
          column: email
          gte: 0.95

      # Uniqueness
      - is_unique: id
      - is_primary_key: id
      - has_uniqueness:
          columns: [id]
          gte: 1.0
      - has_distinctness:
          columns: [status]
          gte: 0.3
      - has_unique_value_ratio:
          columns: [tier]
          gte: 0.5

      # Statistics
      - has_min:
          column: age
          gte: 0
      - has_max:
          column: age
          lte: 120
      - has_mean:
          column: age
          between: [20, 60]
      - has_sum:
          column: quantity
          eq: 1000
      - has_stddev:
          column: score
          lte: 30

      # String Lengths
      - has_min_length:
          column: name
          gte: 2
      - has_max_length:
          column: name
          lte: 100

      # Approximate
      - has_approx_count_distinct:
          column: id
          gte: 100
      - has_approx_quantile:
          column: age
          quantile: 0.5
          between: [25, 45]

      # Patterns & Formats
      - has_pattern:
          column: email
          pattern: "@"
      - contains_email: email
      - contains_url: website
      - contains_credit_card: cc_number
      - contains_ssn: ssn

      # Business Rules
      - satisfies:
          predicate: "age >= 18"
          name: "adults_only"
          gte: 1.0
      - custom_sql:
          expression: "price > 0"

      # Correlation
      - has_correlation:
          column_a: height
          column_b: weight
          gte: 0.5

      # Cross-Table
      - referential_integrity:
          child_table: orders
          child_column: user_id
          parent_table: users
          parent_column: id
          eq: 1.0
      - row_count_match:
          table_a: staging
          table_b: production
          eq: 1.0
      - schema_match:
          table_a: staging
          table_b: production
          eq: 1.0
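
Several rules above compare a ratio against a bound, for example referential_integrity with eq: 1.0. As an illustration of what such a ratio could mean, the sketch below computes the share of child keys that exist in the parent column. This is a guess at the semantics, not qualink's definition: ignoring null child values and returning 1.0 for an empty child column are both assumptions, and `referential_integrity_ratio` is a hypothetical name.

```python
def referential_integrity_ratio(child_values, parent_values) -> float:
    """Fraction of non-null child keys that also appear in the parent column."""
    parent = set(parent_values)
    # Assumption: null child keys are skipped rather than counted as violations.
    children = [value for value in child_values if value is not None]
    if not children:
        return 1.0  # Assumption: nothing to violate, so treat as fully consistent.
    matched = sum(1 for value in children if value in parent)
    return matched / len(children)


orders_user_ids = [1, 2, 2, 3]
users_ids = [1, 2, 3]
print(referential_integrity_ratio(orders_user_ids, users_ids))
print(referential_integrity_ratio([1, 4], users_ids))
```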

Running YAML Configs

Using qualinkctl (recommended)

The simplest way to run a YAML config is with the qualinkctl CLI:

qualinkctl checks.yaml
qualinkctl checks.yaml -f json
qualinkctl checks.yaml -f markdown -o report.md
qualinkctl s3://my-bucket/qualink/checks.yaml -f json

See the CLI guide for all options and CI/CD integration examples.

One-liner

import asyncio

from qualink.config import run_yaml

# run_yaml is awaitable; asyncio.run drives it from synchronous code
result = asyncio.run(run_yaml("checks.yaml"))

The config source can be a local file path, a filesystem URI such as s3://my-bucket/qualink/checks.yaml or file:///tmp/checks.yaml, or an inline YAML string.

With Custom Context

import asyncio

from datafusion import SessionContext
from qualink.config import build_suite_from_yaml

# Register the table yourself, then pass the context to qualink
ctx = SessionContext()
ctx.register_parquet("users", "users.parquet")

builder = build_suite_from_yaml("checks.yaml", ctx=ctx)
result = asyncio.run(builder.run())

With Formatter

import asyncio

from qualink.config import run_yaml
from qualink.formatters import HumanFormatter

result = asyncio.run(run_yaml("checks.yaml"))
print(HumanFormatter().format(result))

Suite Options

suite:
  name: "My Suite"
  run_parallel: true    # Run checks concurrently
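
To show what running checks concurrently means in practice, here is a small standalone sketch using asyncio. It does not use qualink and makes no claim about how qualink schedules checks internally; `run_check` and `run_all` are hypothetical stand-ins, and the sleep simulates a check that waits on I/O.

```python
import asyncio


async def run_check(name: str, delay: float) -> str:
    """Stand-in for a check; the sleep simulates waiting on a query."""
    await asyncio.sleep(delay)
    return name


async def run_all(checks, run_parallel: bool) -> list[str]:
    if run_parallel:
        # All checks are started together; results keep the input order.
        return list(await asyncio.gather(*(run_check(n, d) for n, d in checks)))
    # Sequential: each check finishes before the next one starts.
    return [await run_check(n, d) for n, d in checks]


checks = [("completeness", 0.02), ("uniqueness", 0.01)]
print(asyncio.run(run_all(checks, run_parallel=True)))
print(asyncio.run(run_all(checks, run_parallel=False)))
```

With concurrent execution the total wait is close to the slowest check rather than the sum of all checks, which is the main reason to enable an option like run_parallel for I/O-heavy suites.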