YAML Configuration

qualink supports a fully declarative YAML format so you can define your entire validation suite without writing Python code.

Basic Structure

suite:
  name: "My Validation Suite"

data_sources:
  - name: users_source
    format: csv
    path: "data/users.csv"
    table_name: users

checks:
  - name: "Check Name"
    level: error
    description: "Optional description"
    rules:
      - constraint_type: column_name_or_config

outputs:
  - path: "reports/results.json"
    format: json
  - uri: "s3://my-bucket/qualink/results.md"
    format: markdown

Data Sources

Single Source

data_sources:
  - name: users_source
    format: csv
    path: "data/users.csv"
    table_name: users

Multiple Sources (for cross-table checks)

data_sources:
  - name: orders_source
    format: csv
    path: "data/orders.csv"
    table_name: orders
  - name: users_source
    format: csv
    path: "data/users.csv"
    table_name: users

Supported source types: csv, parquet, json.

ADBC Sources

For database-backed sources, use a named connection with a URI and define either table or query on the datasource. Qualink reads the result through ADBC, registers it as a DataFusion table, and runs the normal checks on that registered table.

connections:
  sqlite_local:
    uri: sqlite:///tmp/users.db

data_sources:
  - name: users_source
    connection: sqlite_local
    table: users
    table_name: users

connections:
  sqlite_local:
    uri: sqlite:///tmp/users.db

data_sources:
  - name: users_source
    connection: sqlite_local
    query: |
      SELECT user_id, email, age
      FROM users
    table_name: users

Secret-backed Connections

Connection values can be resolved inline from supported secret systems instead of storing them directly in YAML. The supported inline sources are:

env
aws_ssm
aws_secretsmanager
gcp_secret_manager

The general shape is:

connections:
  warehouse:
    uri:
      from: aws_ssm
      key: /qualink/prod/postgres/uri
      region: us-east-1

Environment variable example:

connections:
  sqlite_local:
    uri:
      from: env
      key: QUALINK_SQLITE_URI

AWS Secrets Manager with JSON field extraction:

connections:
  snowflake_prod:
    uri:
      from: aws_secretsmanager
      key: qualink/prod/snowflake
      field: uri
      region: eu-west-1

GCP Secret Manager example:

connections:
  bigquery_prod:
    uri:
      from: gcp_secret_manager
      key: qualink-bigquery-uri
      project_id: my-project

Optional secret-backed values can be omitted by setting required: false:

connections:
  lake:
    endpoint:
      from: env
      key: AWS_ENDPOINT_URL
      required: false

Inline secret refs are only resolved inside connections. They work for ADBC uri values and for object-store connection options such as endpoint, service_account_path, or region.

Object Store Sources

qualink supports reading data directly from object stores using DataFusion-native adapters. The object store provider is inferred from the URI scheme in path.

Amazon S3

data_sources:
  - name: users_source
    format: parquet
    path: s3://my-data-bucket/data/users.parquet
    table_name: users

Set credentials via the standard AWS provider chain before running if you are not using an attached role:

export AWS_DEFAULT_REGION=us-east-1
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=wJalr...

Environment Variable Reference

Variable	Description
`AWS_DEFAULT_REGION` / `AWS_REGION`	AWS region for the bucket
`AWS_ACCESS_KEY_ID`	AWS access key
`AWS_SECRET_ACCESS_KEY`	AWS secret key
`AWS_SESSION_TOKEN`	Temporary session token (optional)
`AWS_ENDPOINT_URL`	Custom endpoint for MinIO, R2, etc.
`AWS_ALLOW_HTTP`	Set to `true` to allow plain HTTP endpoints

Object Store YAML Configuration Reference

Field	Description	Required
`data_sources[].path`	Full object-store URI such as `s3://bucket/key`	Yes
`data_sources[].format`	`csv`, `parquet`, `json` (auto-detected if omitted)	No
`data_sources[].table_name`	DataFusion table name	Yes
`data_sources[].connection`	Optional named connection for extra settings such as region or endpoint	No

Security: Prefer inline secret refs for sensitive connection values and keep only non-secret settings such as region or endpoint as plain YAML values.

Multiple S3 Sources

data_sources:
  - name: orders_source
    path: s3://data-lake/orders/2024/
    format: parquet
    table_name: orders
  - name: users_source
    path: s3://data-lake/users.csv
    format: csv
    table_name: users

You can also mix local and S3 sources:

data_sources:
  - name: orders_source
    path: s3://data-lake/production/orders.parquet
    format: parquet
    table_name: orders
  - name: users_source
    format: csv
    path: local/users.csv
    table_name: users

Assertion Syntax

Result Outputs

Use output for a single destination or outputs for multiple destinations. Each entry currently uses the filesystem writer and can target a local path or a supported filesystem URI.

output:
  path: reports/results.json
  format: json
  show_passed: true

outputs:
  - path: reports/results.json
    format: json
    show_passed: true
  - uri: s3://my-bucket/qualink/results.md
    format: markdown
  - uri: abfss://container@account.dfs.core.windows.net/qualink/results.json
    format: json

Output Fields

Field	Description	Required
`output.path` / `outputs[].path`	Local filesystem destination	No
`output.uri` / `outputs[].uri`	Remote filesystem URI	No
`output.destination` / `outputs[].destination`	Generic destination alias	No
`output.format` / `outputs[].format`	`human`, `json`, or `markdown`	No
`output.show_passed` / `outputs[].show_passed`	Include passed constraints	No
`output.show_metrics` / `outputs[].show_metrics`	Include aggregate metrics	No
`output.show_issues` / `outputs[].show_issues`	Include issues section	No
`output.colorize` / `outputs[].colorize`	Enable ANSI colors for human output	No

At least one of path, uri, or destination is required for every output entry.

Supported remote URI schemes currently include s3://, gs://, gcs://, az://, abfs://, and abfss://.

Inline Bound Keys (recommended)

Use shorthand keys directly in the rule config:

rules:
  - has_min:
      column: age
      gte: 0          # greater_than_or_equal
  - has_max:
      column: age
      lte: 120         # less_than_or_equal
  - has_mean:
      column: score
      between: [30, 100]  # between
  - has_size:
      gt: 0            # greater_than
  - has_sum:
      column: amount
      eq: 10000        # equal_to
  - has_max:
      column: price
      lt: 1000         # less_than

Key	Assertion
`gt`	`greater_than`
`gte`	`greater_than_or_equal`
`lt`	`less_than`
`lte`	`less_than_or_equal`
`eq`	`equal_to`
`between`	`between(lower, upper)`

Structured Assertion

rules:
  - has_size:
      assertion:
        operator: equal_to
        value: 100

Shorthand String

rules:
  - has_size:
      assertion: "> 0"
  - has_completeness:
      column: name
      assertion: ">= 0.95"

Rule Types

Simple Column Rules

When a rule takes only a column name, use the scalar shorthand:

rules:
  - is_complete: user_id
  - is_unique: email
  - has_column: name

Column List Rules

rules:
  - is_unique: [first_name, last_name]    # Composite uniqueness
  - has_distinctness:
      columns: [status]
      gte: 0.3

All Supported Rule Types

checks:
  - name: "Complete Example"
    level: warning
    rules:
      # Structure
      - has_size:
          gt: 0
      - has_column_count:
          eq: 10
      - has_column: user_id

      # Completeness
      - is_complete: id
      - has_completeness:
          column: email
          gte: 0.95

      # Uniqueness
      - is_unique: id
      - is_primary_key: id
      - has_uniqueness:
          columns: [id]
          gte: 1.0
      - has_distinctness:
          columns: [status]
          gte: 0.3
      - has_unique_value_ratio:
          columns: [tier]
          gte: 0.5

      # Statistics
      - has_min:
          column: age
          gte: 0
      - has_max:
          column: age
          lte: 120
      - has_mean:
          column: age
          between: [20, 60]
      - has_sum:
          column: quantity
          eq: 1000
      - has_stddev:
          column: score
          lte: 30

      # String Lengths
      - has_min_length:
          column: name
          gte: 2
      - has_max_length:
          column: name
          lte: 100

      # Approximate
      - has_approx_count_distinct:
          column: id
          gte: 100
      - has_approx_quantile:
          column: age
          quantile: 0.5
          between: [25, 45]

      # Patterns & Formats
      - has_pattern:
          column: email
          pattern: "@"
      - contains_email: email
      - contains_url: website
      - contains_credit_card: cc_number
      - contains_ssn: ssn

      # Business Rules
      - satisfies:
          predicate: "age >= 18"
          name: "adults_only"
          gte: 1.0
      - custom_sql:
          expression: "price > 0"

      # Correlation
      - has_correlation:
          column_a: height
          column_b: weight
          gte: 0.5

      # Cross-Table
      - referential_integrity:
          child_table: orders
          child_column: user_id
          parent_table: users
          parent_column: id
          eq: 1.0
      - row_count_match:
          table_a: staging
          table_b: production
          eq: 1.0
      - schema_match:
          table_a: staging
          table_b: production
          eq: 1.0

Running YAML Configs

Using qualinkctl (recommended)

The simplest way to run a YAML config is with the qualinkctl CLI:

qualinkctl checks.yaml
qualinkctl checks.yaml -f json
qualinkctl checks.yaml -f markdown -o report.md
qualinkctl s3://my-bucket/qualink/checks.yaml -f json

See the CLI guide for all options and CI/CD integration examples.

One-liner

from qualink.config import run_yaml

result = await run_yaml("checks.yaml")

The config source can be a local file path, a filesystem URI such as s3://my-bucket/qualink/checks.yaml or file:///tmp/checks.yaml, or an inline YAML string.

With Custom Context

from datafusion import SessionContext
from qualink.config import build_suite_from_yaml

ctx = SessionContext()
ctx.register_parquet("users", "users.parquet")

builder = build_suite_from_yaml("checks.yaml", ctx=ctx)
result = await builder.run()

With Formatter

from qualink.config import run_yaml
from qualink.formatters import HumanFormatter

result = await run_yaml("checks.yaml")
print(HumanFormatter().format(result))

Suite Options

suite:
  name: "My Suite"
  run_parallel: true    # Run checks concurrently