
Data Generation

DBSprout ships with multiple generation engines for different use cases:

The heuristic engine uses regex and fuzzy matching on column names and types to select an appropriate data generator for each column. No AI model is required.

dbsprout generate --engine heuristic
  • Speed: 100K+ rows/sec
  • Quality: ~80% semantic accuracy
  • Dependencies: None (core install)
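
As a rough illustration of name-based selection, the sketch below tries regex patterns first, then a fuzzy match on the column name, then a default for the column type. The pattern table, the `KNOWN_NAMES` vocabulary, the generator labels, and `pick_generator` itself are hypothetical names chosen for this example and are not DBSprout's actual rules or API.

```python
import re
from difflib import get_close_matches

# Hypothetical pattern table: regex on the column name -> generator label.
PATTERNS = [
    (re.compile(r"(^|_)e?mail($|_)"), "email"),
    (re.compile(r"(^|_)(phone|mobile|tel)($|_)"), "phone"),
    (re.compile(r"_at$|(^|_)date($|_)"), "timestamp"),
    (re.compile(r"(^|_)(first|last|full)?_?name($|_)"), "person_name"),
]

# Hypothetical vocabulary for the fuzzy fallback (catches misspellings).
KNOWN_NAMES = {"email": "email", "phone": "phone", "address": "address", "city": "city"}

# Last resort: choose a generator from the SQL type alone.
TYPE_DEFAULTS = {"integer": "random_int", "varchar": "random_text", "boolean": "random_bool"}


def pick_generator(column: str, sql_type: str) -> str:
    """Return a generator label for a column: regex, then fuzzy, then type."""
    name = column.lower()
    for pattern, generator in PATTERNS:
        if pattern.search(name):
            return generator
    # Fuzzy match against known names; cutoff=0.8 is an arbitrary choice here.
    close = get_close_matches(name, list(KNOWN_NAMES), n=1, cutoff=0.8)
    if close:
        return KNOWN_NAMES[close[0]]
    return TYPE_DEFAULTS.get(sql_type.lower(), "random_text")
```

With these assumed tables, `pick_generator("user_email", "varchar")` resolves through the regex step, a misspelled `adress` column resolves through the fuzzy step, and an unrecognized `qty` column falls back to its type default.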

The spec engine uses an AI model to analyze your schema once and produce a DataSpec: a per-column generation plan that is cached and reused on later runs.

dbsprout generate --engine spec
  • Speed: First run includes spec generation (<30s embedded, <5s cloud); subsequent runs use the cached spec
  • Quality: High semantic accuracy
  • Dependencies: dbsprout[llm] for embedded, dbsprout[cloud] for cloud providers
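
The analyze-once, reuse-from-cache behavior can be pictured as a cache keyed by a fingerprint of the schema. The sketch below is only an assumption about how such a cache might work: `schema_fingerprint`, `load_or_build_spec`, the JSON file layout, and the `build_spec` callable (standing in for the model call) are all hypothetical and do not describe DBSprout's real cache format.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable

Schema = dict[str, dict[str, str]]  # table -> column -> SQL type (assumed shape)


def schema_fingerprint(schema: Schema) -> str:
    """Hash a canonical JSON form so the same schema always maps to the same key."""
    canonical = json.dumps(schema, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def load_or_build_spec(
    schema: Schema,
    build_spec: Callable[[Schema], dict],
    cache_dir: Path,
) -> dict:
    """Return the cached spec for this schema, building it only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{schema_fingerprint(schema)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    spec = build_spec(schema)  # the slow step (model call) in this sketch
    path.write_text(json.dumps(spec), encoding="utf-8")
    return spec
```

Under this design, any change to the schema changes the fingerprint, so a stale spec is never reused for a modified schema.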

The vectorized engine uses NumPy for bulk numeric generation. It is best suited to tables with many numeric columns.

dbsprout generate --engine vectorized

DBSprout automatically handles foreign key relationships:

  1. Builds a directed dependency graph from FK constraints
  2. Separates self-referencing FKs for special handling
  3. Performs topological sort to determine insertion order
  4. If cycles exist: detects strongly connected components (SCCs) via Tarjan’s algorithm, finds nullable FKs, and defers them
  5. Two-pass insertion: first pass with NULLs for deferred FKs, second pass updates with real values

This ensures 100% FK integrity on every run.
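
The steps above can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not DBSprout's implementation: the `ForeignKey` dataclass and `plan_insertion` are hypothetical names, and the sketch defers one nullable FK per detected cycle and then re-checks. With edges pointing from child to parent, Tarjan's algorithm emits components with parents first, so the emission order doubles as the insertion order once every component is a single table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForeignKey:
    child: str      # table that holds the FK column
    parent: str     # table being referenced
    nullable: bool


def tarjan_sccs(graph: dict[str, set[str]]) -> list[list[str]]:
    """Tarjan's algorithm (recursive; fine for small schemas)."""
    index: dict[str, int] = {}
    low: dict[str, int] = {}
    stack: list[str] = []
    on_stack: set[str] = set()
    sccs: list[list[str]] = []

    def visit(node: str) -> None:
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for nxt in sorted(graph[node]):
            if nxt not in index:
                visit(nxt)
                low[node] = min(low[node], low[nxt])
            elif nxt in on_stack:
                low[node] = min(low[node], index[nxt])
        if low[node] == index[node]:  # node is the root of a component
            scc = []
            while True:
                top = stack.pop()
                on_stack.discard(top)
                scc.append(top)
                if top == node:
                    break
            sccs.append(scc)

    for node in sorted(graph):
        if node not in index:
            visit(node)
    return sccs


def plan_insertion(tables: list[str], fks: list[ForeignKey]):
    """Return (insertion order, deferred FKs, self-referencing FKs)."""
    self_refs = [fk for fk in fks if fk.child == fk.parent]
    active = [fk for fk in fks if fk.child != fk.parent]
    deferred: list[ForeignKey] = []
    while True:
        graph: dict[str, set[str]] = {t: set() for t in tables}
        for fk in active:
            graph[fk.child].add(fk.parent)
        sccs = tarjan_sccs(graph)
        cycles = [set(scc) for scc in sccs if len(scc) > 1]
        if not cycles:
            return [scc[0] for scc in sccs], deferred, self_refs
        members = cycles[0]
        candidates = [
            fk for fk in active
            if fk.nullable and fk.child in members and fk.parent in members
        ]
        if not candidates:
            raise ValueError(f"cycle without a nullable FK: {sorted(members)}")
        # Defer this FK: insert NULL in pass one, UPDATE it in pass two.
        deferred.append(candidates[0])
        active.remove(candidates[0])
```

A caller would insert tables in the returned order, write NULL for each deferred and self-referencing FK column on the first pass, and then fill in the real values with updates on the second pass.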

# Global row count
dbsprout generate --rows 5000
# Per-table via config file (dbsprout.toml)
[generation]
default_rows = 1000
[generation.tables.users]
rows = 500
[generation.tables.orders]
rows = 10000

# Same seed = identical output
dbsprout generate --seed 42

Every cell value is derived from a hash-based per-cell seed, making output fully reproducible across runs and machines.
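
One way to picture a hash-based per-cell seed is to hash the global seed together with the cell's coordinates and use the digest to seed a small random generator. The sketch below rests on that assumption: `cell_seed`, `cell_value`, the key layout, and the choice of SHA-256 are illustrative and not taken from DBSprout's source.

```python
import hashlib
import random


def cell_seed(global_seed: int, table: str, column: str, row: int) -> int:
    """Derive a 64-bit seed from the global seed and the cell's coordinates."""
    key = f"{global_seed}:{table}:{column}:{row}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big")


def cell_value(global_seed: int, table: str, column: str, row: int,
               low: int = 0, high: int = 1000) -> int:
    """Generate one integer cell; depends only on the seed and coordinates."""
    rng = random.Random(cell_seed(global_seed, table, column, row))
    return rng.randint(low, high)
```

Because each value depends only on the seed and the cell's position, cells can be generated in any order, or in parallel, without changing the result.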