Data Generation

Generation Engines

DBSprout ships with multiple generation engines for different use cases:

The heuristic engine uses regex and fuzzy matching on column names and types to select appropriate data generators. No AI model is required.

dbsprout generate --engine heuristic
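To illustrate the idea behind name-based matching, here is a minimal sketch that tries regex rules first, then a fuzzy match, then a fallback keyed on the column type. The rule table, vocabulary, generator labels, and the `pick_generator` function are all hypothetical; they are not DBSprout's actual rules or API.

```python
import difflib
import re

# Hypothetical rule table: (regex on the column name, generator label).
REGEX_RULES = [
    (re.compile(r"(^|_)e?mail($|_)"), "email"),
    (re.compile(r"(^|_)(phone|tel)($|_)"), "phone"),
    (re.compile(r"_at$|(^|_)date($|_)"), "timestamp"),
]

# Hypothetical vocabulary used by the fuzzy fallback.
KNOWN_NAMES = {"first_name": "first_name", "last_name": "last_name", "city": "city"}

# Hypothetical defaults used when no name-based rule matches.
TYPE_DEFAULTS = {"integer": "random_int", "varchar": "random_text", "boolean": "random_bool"}


def pick_generator(column_name: str, sql_type: str) -> str:
    """Return a generator label for a column (illustrative only)."""
    name = column_name.lower()
    # 1. Exact pattern rules on the column name.
    for pattern, generator in REGEX_RULES:
        if pattern.search(name):
            return generator
    # 2. Fuzzy match against known names (tolerates typos and missing underscores).
    close = difflib.get_close_matches(name, list(KNOWN_NAMES), n=1, cutoff=0.8)
    if close:
        return KNOWN_NAMES[close[0]]
    # 3. Fall back to the column type.
    return TYPE_DEFAULTS.get(sql_type.lower(), "random_text")
```

With these rules, `user_email` maps to `email` by regex, `firstname` maps to `first_name` by fuzzy match, and an unrecognized `integer` column falls back to `random_int`.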

The spec engine uses an AI model to analyze your schema once and produce a DataSpec, a per-column generation plan that is cached and reused.

dbsprout generate --engine spec

Speed: First run includes spec generation (<30s embedded, <5s cloud); subsequent runs use the cached DataSpec
Quality: High semantic accuracy
Dependencies: dbsprout[llm] for embedded, dbsprout[cloud] for cloud providers
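The "analyze once, then reuse" behavior can be pictured as a cache keyed on the schema. The sketch below assumes the cache key is a SHA-256 fingerprint of the schema and that specs are stored as JSON files. Both are assumptions for illustration, and `schema_fingerprint`, `load_or_build_spec`, and the `build_spec` callback (a stand-in for the model call) are hypothetical names.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def schema_fingerprint(schema: dict) -> str:
    """Stable hash of the schema: same schema, same key."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def load_or_build_spec(
    schema: dict,
    cache_dir: Path,
    build_spec: Callable[[dict], dict],
) -> dict:
    """Return the cached spec if present; otherwise build it once and cache it."""
    cache_file = cache_dir / f"{schema_fingerprint(schema)}.json"
    if cache_file.exists():
        # Cache hit: no model call needed.
        return json.loads(cache_file.read_text(encoding="utf-8"))
    # Cache miss: run the (slow) spec builder, then persist the result.
    spec = build_spec(schema)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(spec), encoding="utf-8")
    return spec
```

Under this scheme a schema change yields a new fingerprint, so a stale spec is never reused for a different schema.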

The vectorized engine uses NumPy for bulk numeric generation. It is best suited to tables with many numeric columns.

dbsprout generate --engine vectorized
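As a rough picture of what bulk numeric generation with NumPy looks like, the sketch below fills whole columns with one array call each instead of looping over rows. The column names, value ranges, and the `generate_numeric_columns` function are made up for illustration and do not reflect DBSprout's internals.

```python
import numpy as np


def generate_numeric_columns(n_rows: int, seed: int) -> dict[str, np.ndarray]:
    """Generate every numeric column in one vectorized call per column."""
    rng = np.random.default_rng(seed)
    return {
        # Integers in the half-open range [1, 100).
        "quantity": rng.integers(1, 100, size=n_rows),
        # Uniform prices rounded to two decimal places.
        "price": np.round(rng.uniform(0.5, 500.0, size=n_rows), 2),
        # Normally distributed ratings clipped to [1.0, 5.0].
        "rating": rng.normal(loc=3.5, scale=1.0, size=n_rows).clip(1.0, 5.0),
    }
```

Each call produces all `n_rows` values at once, which is why this style pays off most when a table is dominated by numeric columns.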

DBSprout automatically handles foreign key relationships:

Builds a directed dependency graph from FK constraints
Separates self-referencing FKs for special handling
Performs topological sort to determine insertion order
If cycles exist: detects strongly connected components (SCCs) via Tarjan’s algorithm, finds nullable FKs within them, and defers them
Two-pass insertion: first pass with NULLs for deferred FKs, second pass updates with real values

This ensures 100% FK integrity on every run.
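The steps above can be sketched in Python as follows. This is a simplified illustration, not DBSprout's implementation: the `FK` tuple, `tarjan_sccs`, and `plan_insertion` are hypothetical names, and the sketch defers every nullable FK inside a cycle rather than choosing a minimal set. It requires Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter
from typing import NamedTuple


class FK(NamedTuple):
    child: str     # table that holds the FK column
    parent: str    # table being referenced
    column: str
    nullable: bool


def tarjan_sccs(graph: dict[str, set[str]]) -> list[set[str]]:
    """Tarjan's algorithm: return the strongly connected components."""
    index: dict[str, int] = {}
    low: dict[str, int] = {}
    stack: list[str] = []
    on_stack: set[str] = set()
    sccs: list[set[str]] = []

    def visit(node: str) -> None:
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for nxt in graph[node]:
            if nxt not in index:
                visit(nxt)
                low[node] = min(low[node], low[nxt])
            elif nxt in on_stack:
                low[node] = min(low[node], index[nxt])
        if low[node] == index[node]:  # node is the root of a component
            component: set[str] = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            sccs.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return sccs


def plan_insertion(tables: list[str], fks: list[FK]):
    """Return (insertion order, deferred FKs, self-referencing FKs)."""
    # Self-references are set aside; how they are filled is out of scope here.
    self_refs = [fk for fk in fks if fk.child == fk.parent]
    edges = [fk for fk in fks if fk.child != fk.parent]

    graph: dict[str, set[str]] = {t: set() for t in tables}
    for fk in edges:
        graph[fk.child].add(fk.parent)

    # Inside each multi-table cycle, defer the nullable FKs.
    deferred: list[FK] = []
    for component in tarjan_sccs(graph):
        if len(component) > 1:
            deferred += [
                fk for fk in edges
                if fk.nullable and fk.child in component and fk.parent in component
            ]

    # Rebuild the graph without deferred edges; parents must come first.
    kept: dict[str, set[str]] = {t: set() for t in tables}
    for fk in edges:
        if fk not in deferred:
            kept[fk.child].add(fk.parent)
    # Raises graphlib.CycleError if a cycle has no nullable FK to break it.
    order = list(TopologicalSorter(kept).static_order())
    return order, deferred, self_refs
```

In a two-pass insert, the first pass would write rows in `order` with the deferred columns set to NULL, and the second pass would update those columns once the referenced rows exist.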

# Global row count
dbsprout generate --rows 5000

# Per-table via config file (dbsprout.toml)

[generation]
default_rows = 1000

[generation.tables.users]
rows = 500

[generation.tables.orders]
rows = 10000

# Same seed = identical output
dbsprout generate --seed 42

Every cell value is derived from a hash-based per-cell seed, making output fully reproducible across runs and machines.
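One way to picture a hash-based per-cell seed is shown below. It is a sketch under assumptions: the seed is derived with SHA-256 from the global seed, table, column, and row index, and the names `cell_seed` and `cell_value` are hypothetical. The actual derivation DBSprout uses may differ.

```python
import hashlib
import random


def cell_seed(global_seed: int, table: str, column: str, row: int) -> int:
    """Derive a 64-bit seed from the cell's coordinates.

    SHA-256 is used instead of Python's built-in hash(), which is salted
    per process for strings and therefore not stable across runs.
    """
    key = f"{global_seed}:{table}:{column}:{row}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def cell_value(global_seed: int, table: str, column: str, row: int) -> int:
    """Generate one cell from its own private random stream."""
    rng = random.Random(cell_seed(global_seed, table, column, row))
    return rng.randint(0, 999_999)
```

Because each cell has its own seed, its value does not depend on generation order, so tables or rows could be produced in any order, or in parallel, without changing the output.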