The Core Algorithm
1. Start from Root Tables
You define one or more root tables, each with an optional WHERE clause. The rows that match become the seed rows for the traversal.
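As a rough illustration of seed selection, the sketch below runs one query per root table against an in-memory SQLite database. The `ROOTS` structure, the `select_seeds` helper, and the sample rows are all hypothetical; this is not Basecut's configuration format or API.

```python
import sqlite3

# Hypothetical root definitions (NOT Basecut's real config format).
ROOTS = [
    {"table": "users", "where": "email = 'alice@example.com'"},
    {"table": "products", "where": None},  # no WHERE clause: every row is a seed
]

def select_seeds(conn, roots):
    """Return {table: [row, ...]} holding the seed rows of each root table."""
    seeds = {}
    for root in roots:
        sql = f"SELECT * FROM {root['table']}"
        if root["where"]:
            sql += f" WHERE {root['where']}"
        seeds[root["table"]] = conn.execute(sql).fetchall()
    return seeds

# Toy database so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO users VALUES (1, 'alice@example.com'), (2, 'bob@example.com');
    INSERT INTO products VALUES (10, 'widget'), (11, 'gadget');
""")
seeds = select_seeds(conn, ROOTS)
```

A root without a WHERE clause seeds the whole table, which is why the limits described later matter.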
2. Follow Relationships Recursively
From each seed row, Basecut traverses foreign keys in both directions:
- Upstream (Parents): Follow foreign keys to referenced tables
  - orders.user_id → users.id
  - line_items.product_id → products.id
- Downstream (Children): Find rows in other tables that reference the current rows
  - users.id ← orders.user_id
  - orders.id ← shipments.order_id
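The two directions can be sketched as a breadth-first walk over a foreign key list. This is a minimal illustration on toy in-memory data, not Basecut's implementation: the `TABLES` contents, the `FOREIGN_KEYS` layout, the `order_id` column on `line_items`, and the `traverse` function are assumptions made for the example.

```python
from collections import deque

# Toy data; table and column names follow the examples above.
TABLES = {
    "users": [{"id": 1}, {"id": 2}],
    "orders": [{"id": 10, "user_id": 1}, {"id": 11, "user_id": 2}],
    "line_items": [{"id": 100, "order_id": 10, "product_id": 7}],
    "products": [{"id": 7}, {"id": 8}],
    "shipments": [{"id": 500, "order_id": 10}],
}
# (child_table, child_column, parent_table, parent_column)
FOREIGN_KEYS = [
    ("orders", "user_id", "users", "id"),
    ("line_items", "order_id", "orders", "id"),
    ("line_items", "product_id", "products", "id"),
    ("shipments", "order_id", "orders", "id"),
]

def traverse(seeds):
    """Collect every (table, id) reachable from the seeds in either direction."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        table, row_id = queue.popleft()
        row = next(r for r in TABLES[table] if r["id"] == row_id)
        found = []
        for child, col, parent, pcol in FOREIGN_KEYS:
            if child == table:   # upstream: the parent this row points to
                found += [(parent, p["id"]) for p in TABLES[parent] if p[pcol] == row[col]]
            if parent == table:  # downstream: children pointing at this row
                found += [(child, c["id"]) for c in TABLES[child] if c[col] == row[pcol]]
        for key in found:
            if key not in seen:
                seen.add(key)
                queue.append(key)
    return seen

snapshot = traverse([("users", 1)])
```

Starting from user 1, the walk reaches order 10, its line item and shipment, and product 7, while user 2, order 11, and product 8 stay out. An unbounded walk like this one can fan out through shared parents, so a real extraction would pair it with limits.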
3. Ensure Referential Integrity
As Basecut discovers rows, it tracks dependencies:
- Before including a row, ensure all its foreign key targets are included
- If a parent row is missing, add it (even if it requires traversing beyond depth limits)
- Result: zero broken foreign key references
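One way to picture the integrity step is as a closure over parent references: keep adding missing foreign key targets until nothing dangles. The sketch below assumes toy data and hypothetical helper names (`close_over_parents`, `broken_references`); it illustrates the idea and is not Basecut's code.

```python
# Toy data keyed by primary key; names are illustrative.
TABLES = {
    "users": {1: {"id": 1}},
    "orders": {10: {"id": 10, "user_id": 1}},
    "products": {7: {"id": 7}},
    "line_items": {100: {"id": 100, "order_id": 10, "product_id": 7}},
}
# child_table -> [(fk_column, parent_table)]; parents are referenced by "id"
PARENTS = {
    "orders": [("user_id", "users")],
    "line_items": [("order_id", "orders"), ("product_id", "products")],
}

def close_over_parents(selected):
    """Add every missing foreign key target until no reference dangles."""
    closed = set(selected)
    pending = list(selected)
    while pending:
        table, row_id = pending.pop()
        row = TABLES[table][row_id]
        for column, parent in PARENTS.get(table, []):
            key = (parent, row[column])
            if row[column] is not None and key not in closed:
                closed.add(key)
                pending.append(key)  # the parent may have parents of its own
    return closed

def broken_references(selected):
    """List (table, id, column) for each reference whose target is absent."""
    broken = []
    for table, row_id in selected:
        row = TABLES[table][row_id]
        for column, parent in PARENTS.get(table, []):
            if row[column] is not None and (parent, row[column]) not in selected:
                broken.append((table, row_id, column))
    return broken

snapshot = close_over_parents({("line_items", 100)})
```

A lone line item has two dangling references. After closure the set also holds its order, its product, and the order's user, and `broken_references(snapshot)` is empty.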
4. Apply Limits and Filters
Limits and filters act as safeguards against runaway extraction, such as a cap on traversal depth.
Example Walkthrough
Given a schema with users, orders, line_items, products, and shipments tables, an extraction seeded on a single user proceeds as follows:
- Seed: Find the users row for alice@example.com → 1 row
- Downstream L1: Find all orders where user_id = alice.id → 5 rows
- Downstream L2: Find all line_items for those 5 orders → 23 rows
- Upstream L1 (from line_items): Find all products referenced → 8 rows
- Downstream L3: Find all shipments for those 5 orders → 5 rows
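The walkthrough can be replayed on synthetic data. The rows below are invented so that the counts match the steps above. The `order_id` column on `line_items` and the second user are assumptions, included only to show that unrelated rows are left out.

```python
# Synthetic rows chosen to reproduce the counts in the walkthrough.
users = [{"id": 1, "email": "alice@example.com"}, {"id": 2, "email": "bob@example.com"}]
orders = [{"id": n, "user_id": 1} for n in range(1, 6)] + [{"id": 6, "user_id": 2}]
line_items = [{"id": i, "order_id": 1 + i % 5, "product_id": 1 + i % 8} for i in range(23)]
line_items.append({"id": 23, "order_id": 6, "product_id": 9})  # belongs to bob's order
products = [{"id": n} for n in range(1, 11)]
shipments = [{"id": n, "order_id": n} for n in range(1, 7)]

# Seed: the users row for alice@example.com
seed = [u for u in users if u["email"] == "alice@example.com"]
seed_ids = {u["id"] for u in seed}

# Downstream L1: orders that reference the seed user
picked_orders = [o for o in orders if o["user_id"] in seed_ids]
order_ids = {o["id"] for o in picked_orders}

# Downstream L2: line_items that reference those orders
picked_items = [li for li in line_items if li["order_id"] in order_ids]

# Upstream L1: products referenced by those line_items
product_ids = {li["product_id"] for li in picked_items}
picked_products = [p for p in products if p["id"] in product_ids]

# Downstream L3: shipments that reference those orders
picked_shipments = [s for s in shipments if s["order_id"] in order_ids]

counts = {
    "users": len(seed),
    "orders": len(picked_orders),
    "line_items": len(picked_items),
    "products": len(picked_products),
    "shipments": len(picked_shipments),
}
print(counts)
# → {'users': 1, 'orders': 5, 'line_items': 23, 'products': 8, 'shipments': 5}
```

In this synthetic setup the snapshot holds 42 rows in total, and bob's order, line item, and shipment are never touched.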
Why This Approach Works
Referential Integrity: You can restore snapshots to empty databases without constraint violations. No manual dependency sorting is needed.
Reproducibility: The same configuration and the same source data yield the same snapshot shape. If you use random sampling, set sampling.seed for stable row selection.
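The effect of a fixed seed can be shown with Python's standard `random` module. The `sample_rows` helper is hypothetical, and its `seed` argument only stands in for a setting like sampling.seed. How Basecut samples internally is not described here.

```python
import random

def sample_rows(row_ids, k, seed):
    """Pick k row ids; the same seed and the same ids give the same pick."""
    rng = random.Random(seed)   # private generator, independent of global state
    ordered = sorted(row_ids)   # sort first so input order cannot change the result
    return rng.sample(ordered, k)

first = sample_rows(range(100), 5, seed=42)
second = sample_rows(reversed(range(100)), 5, seed=42)
```

Both calls return the same five ids, even though the second receives the ids in reverse order. Sorting before sampling is the design choice that makes the result depend only on the data and the seed.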
Composability: Combine multiple root tables to extract overlapping data graphs. Basecut deduplicates automatically.
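Deduplication across overlapping graphs can be pictured as merging rows by their (table, primary key) identity. The sketch below is an assumption about how such a merge might look; `merge_graphs` and the sample graphs are invented for illustration.

```python
def merge_graphs(*graphs):
    """Merge row graphs keyed by (table, primary_key), keeping each row once."""
    merged = {}
    for graph in graphs:
        for key, row in graph.items():
            merged.setdefault(key, row)  # first occurrence wins; repeats are dropped
    return merged

# Two extractions that both reached product 7.
from_alice = {
    ("users", 1): {"id": 1},
    ("orders", 10): {"id": 10, "user_id": 1},
    ("products", 7): {"id": 7},
}
from_bob = {
    ("users", 2): {"id": 2},
    ("orders", 11): {"id": 11, "user_id": 2},
    ("products", 7): {"id": 7},
}
combined = merge_graphs(from_alice, from_bob)
```

The combined graph holds five rows, not six, because the shared product appears only once.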
Safety: Limits prevent accidentally extracting your entire production database. Anonymization runs inline during extraction.