Column-aware values with --engine spec

The default heuristic engine covers ~80% of columns correctly via pattern matching — it knows email means an email address and phone means a phone number. For domain-specific columns (risk_tier, account_lifecycle_state, coverage_band) the heuristic falls back to generic strings.

Problem: you need semantically correct values for columns that don’t match any built-in pattern, without writing custom generators by hand.

How —engine spec works

When you run with --engine spec, DBSprout sends column names and types (never row values) to an LLM once per schema. The LLM returns a DataSpec — a structured JSON document describing the generator, value distribution, and constraints for each column. That spec is cached under .dbsprout/cache/ keyed by the schema hash; subsequent runs read from cache without calling the LLM again.

Prerequisites

Python 3.10+ with dbsprout installed
An LLM provider configured — the embedded Qwen2.5-1.5B model (zero config, offline) or a cloud provider via OPENAI_API_KEY / ANTHROPIC_API_KEY
A schema initialised with dbsprout init

Steps

1. Initialise the schema

Point DBSprout at your database or schema file:

dbsprout init --db postgresql://localhost/myapp
# or a schema file
dbsprout init --file schema.sql

2. Generate with the spec engine

dbsprout generate --engine spec --rows 500

On the first run you will see a one-time message:

Generating DataSpec via LLM (embedded model)…
Cached → .dbsprout/cache/spec_a3f9c12.json

All subsequent runs skip the LLM call entirely.

3. Verify the cached spec

Inspect what the LLM produced:

cat .dbsprout/cache/spec_*.json | python3 -m json.tool | head -60

Each column entry looks like:

{
  "column": "coverage_band",
  "generator": "choice",
  "values": ["bronze", "silver", "gold", "platinum"],
  "distribution": "uniform"
}

4. Tune a column (optional)

If the spec chose wrong values for a column, edit the cached JSON directly and re-run — DBSprout will use your edited spec without calling the LLM again.

5. Force a spec refresh after a schema change

When you add or rename columns, invalidate the cache:

dbsprout generate --engine spec --refresh-spec

This regenerates the spec for the changed columns only.

Result

Unusual column names now produce domain-correct values — risk_tier generates ["low", "medium", "high", "critical"] instead of random strings, coverage_band generates insurance tier names, account_lifecycle_state generates ["trial", "active", "churned", "paused"]. The LLM call happens once; every subsequent run is as fast as the heuristic engine.

For more on provider configuration and privacy tiers, see LLM configuration.

Column-aware values with --engine spec

How —engine spec works

Prerequisites

Steps

Result

Keep everything on a local LLM

Keep real data private