The default heuristic engine covers ~80% of columns correctly via pattern matching — it knows email means an email address and phone means a phone number. For domain-specific columns (risk_tier, account_lifecycle_state, coverage_band) the heuristic falls back to generic strings.
Problem: you need semantically correct values for columns that don’t match any built-in pattern, without writing custom generators by hand.
How —engine spec works
When you run with --engine spec, DBSprout sends column names and types (never row values) to an LLM once per schema. The LLM returns a DataSpec — a structured JSON document describing the generator, value distribution, and constraints for each column. That spec is cached under .dbsprout/cache/ keyed by the schema hash; subsequent runs read from cache without calling the LLM again.
Prerequisites
- Python 3.10+ with
dbsproutinstalled - An LLM provider configured — the embedded Qwen2.5-1.5B model (zero config, offline) or a cloud provider via
OPENAI_API_KEY/ANTHROPIC_API_KEY - A schema initialised with
dbsprout init
Steps
1. Initialise the schema
Point DBSprout at your database or schema file:
dbsprout init --db postgresql://localhost/myapp
# or a schema file
dbsprout init --file schema.sql
2. Generate with the spec engine
dbsprout generate --engine spec --rows 500
On the first run you will see a one-time message:
Generating DataSpec via LLM (embedded model)…
Cached → .dbsprout/cache/spec_a3f9c12.json
All subsequent runs skip the LLM call entirely.
3. Verify the cached spec
Inspect what the LLM produced:
cat .dbsprout/cache/spec_*.json | python3 -m json.tool | head -60
Each column entry looks like:
{
"column": "coverage_band",
"generator": "choice",
"values": ["bronze", "silver", "gold", "platinum"],
"distribution": "uniform"
}
4. Tune a column (optional)
If the spec chose wrong values for a column, edit the cached JSON directly and re-run — DBSprout will use your edited spec without calling the LLM again.
5. Force a spec refresh after a schema change
When you add or rename columns, invalidate the cache:
dbsprout generate --engine spec --refresh-spec
This regenerates the spec for the changed columns only.
Result
Unusual column names now produce domain-correct values — risk_tier generates ["low", "medium", "high", "critical"] instead of random strings, coverage_band generates insurance tier names, account_lifecycle_state generates ["trial", "active", "churned", "paused"]. The LLM call happens once; every subsequent run is as fast as the heuristic engine.
For more on provider configuration and privacy tiers, see LLM configuration.