Demo Data Generator

Generates a demo SQLite database for librocco with realistic book inventory data.

Architecture

This pipeline creates demo data by:

Using book metadata from data/books.csv (checked in, fetched from Google Books API)
Generating warehouse data - 8 warehouses representing used/new books across years 2022-2025
Generating notes - ~15,000 inventory notes (inbound purchases and outbound sales)
Generating transactions - Book transactions with realistic statistical distributions
Loading into SQLite - Final database with schema pulled from main librocco repo

Statistical Modeling

The data generation uses several statistical techniques for realism:

GEM distribution (Griffiths, Engen, McCloskey): Creates Zipfian-like popularity distribution for the book catalog - a small number of books are very popular, most are rarely sold
Geometric distribution: Models the number of books per note (mean ~8 for inbound, ~3 for outbound)
Exponential distribution: Spreads timestamps across Jan-Aug 2025, weighted toward recent dates
Skorokhod reflection: Mathematical technique to ensure stock never goes negative - automatically inserts reconciliation notes when needed

Schema Sync

The database schema is downloaded directly from the main librocco repository at build time:

https://raw.githubusercontent.com/librocco/librocco/main/apps/sync-server/schemas/init

The crsql_as_crr and crsql_finalize calls are filtered out since plain SQLite doesn't have the CRSQLite extension.

This ensures the demo database always matches the current application schema.

Prerequisites

uv - Python package manager (scripts use inline deps)
curl - For downloading schema
sqlite3 - CLI tool for database creation

Usage

Build the demo database

make

Output: data/demo_db.sqlite3

Refresh book catalog (optional)

The book catalog (data/books.csv) is checked into git. To regenerate it from Google Books API:

./fetch_book_data.py  # Fetches ~600+ books (rate-limited without API key)

Optionally set GOOGLE_BOOKS_API_KEY for higher rate limits.

Clean generated files

make clean

Individual targets

make data/warehouses.csv    # Generate warehouse data only
make data/notes.csv         # Generate notes and transactions
make data/demo_db.sqlite3   # Build final database

Pipeline Overview

GitHub (schema) ──────────────────────────────────────┐
                                                      │
                                                      ▼
┌─────────────────────────┐    ┌────────────────────────────────┐
│ generate_warehouse_data │    │         books.csv              │
│          .py            │    │       (checked in)             │
└───────────┬─────────────┘    └───────────────┬────────────────┘
            │                                  │
            ▼                                  │
    ┌───────────────┐                          │
    │warehouses.csv │                          │
    └───────┬───────┘                          │
            │                                  │
            ▼                                  │
┌───────────────────────┐                      │
│  generate_note_data   │                      │
│         .py           │                      │
└───────────┬───────────┘                      │
            │                                  │
            ▼                                  │
   ┌─────────────────┐                         │
   │notes_prelim.csv │                         │
   └────────┬────────┘                         │
            │                                  │
            └──────────────┬───────────────────┘
                           │
                           ▼
           ┌───────────────────────────┐
           │generate_book_transactions │
           │           .py             │
           └─────────────┬─────────────┘
                         │
            ┌────────────┴────────────┐
            ▼                         ▼
   ┌─────────────────┐     ┌────────────────────┐
   │    notes.csv    │     │book_transactions.csv│
   └────────┬────────┘     └──────────┬─────────┘
            │                         │
            └────────────┬────────────┘
                         │
                         ▼
              ┌─────────────────┐
              │   load_db.py    │◄──── schema.sql (from GitHub)
              └────────┬────────┘
                       │
                       ▼
              ┌─────────────────┐
              │demo_db.sqlite3  │
              └─────────────────┘

Data Details

Warehouses (8 total)

ID	Name	Discount
1	Used books (2022)	20%
2	New books (2022)	0%
3	Used books (2023)	15%
4	New books (2023)	0%
5	Used books (2024)	10%
6	New books (2024)	0%
7	Used books (2025)	5%
8	New books (2025)	0%

Notes (~15,000)

~1/3 inbound (purchases), ~2/3 outbound (sales)
Each warehouse starts with one large inbound note
Timestamps: Jan 1 - Aug 31, 2025
Commitment probability increases with age

Transactions

~160 ISBNs selected from full catalog using GEM distribution
Multinomial sampling distributes books across notes
Reconciliation notes automatically inserted to prevent negative stock

GitHub Actions

The workflow in .github/workflows/on-push.yml builds the database on every push to verify the pipeline works.

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
.github/workflows		.github/workflows
data		data
.envrc		.envrc
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
fetch_book_data.py		fetch_book_data.py
generate_book_transactions.py		generate_book_transactions.py
generate_note_data.py		generate_note_data.py
generate_warehouse_data.py		generate_warehouse_data.py
load_db.py		load_db.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Demo Data Generator

Architecture

Statistical Modeling

Schema Sync

Prerequisites

Usage

Build the demo database

Refresh book catalog (optional)

Clean generated files

Individual targets

Pipeline Overview

Data Details

Warehouses (8 total)

Notes (~15,000)

Transactions

GitHub Actions

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Demo Data Generator

Architecture

Statistical Modeling

Schema Sync

Prerequisites

Usage

Build the demo database

Refresh book catalog (optional)

Clean generated files

Individual targets

Pipeline Overview

Data Details

Warehouses (8 total)

Notes (~15,000)

Transactions

GitHub Actions

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages