Pipeline (fc-data)¶
fc-data is the command-line entry point for the FormulaCode data pipeline. It discovers performance-improving commits from GitHub, classifies them with LLM agents, resolves dependencies, synthesizes Docker build contexts, and publishes verified images.
Quick reference¶
# Run all 7 stages for a date range
fc-data --start-date 2026-02-01 --end-date 2026-03-01
# Resume from where you left off (skips completed stages)
fc-data --start-date 2026-02-01 --end-date 2026-03-01 --resume
# Run only specific stages
fc-data --start-date 2026-02-01 --end-date 2026-03-01 --stage 3
fc-data --start-date 2026-02-01 --end-date 2026-03-01 --stage 5 --stage 6
# Preview what would run without executing
fc-data --start-date 2026-02-01 --end-date 2026-03-01 --dry-run
CLI flags¶
| Flag | Description | Default |
|---|---|---|
| `--start-date` | Start of date range to scan (YYYY-MM-DD) | required |
| `--end-date` | End of date range to scan (YYYY-MM-DD) | required |
| `--resume` | Skip already-completed stages, resume from next pending | false |
| `--stage N` | Run only stage N (1–7); repeat for multiple (e.g. `--stage 1 --stage 2`) | all stages |
| `--dry-run` | Log what each stage would do without executing | false |
| `--n-concurrent N` | Max concurrent items per runner stage | auto |
| `--tasks-per-repo N` | Cap tasks per repo for stages 5–6 (useful for large repos) | unlimited |
| `--agent AGENT` | Agent for stage 6 synthesis: `claude`, `codex`, `gemini`, `qwen`, or `none` | auto-detect |
| `--force` | Re-process already-completed tasks in stages 5–6 | false |
| `--offline-source PATH` | Import PR data from a Parquet file instead of scraping GitHub (stages 1–2) | — |
| `--min-stars N` | Minimum GitHub stars for repo discovery in stage 1 | 500 |
Pipeline stages¶
The pipeline runs 7 stages in order. Each stage is backed by an async runner that tracks progress in Supabase and isolates per-item failures (a single failure never aborts the run).
Stage 1: Scrape Repos¶
Discovers Python repositories that have ASV (Airspeed Velocity) benchmarks. The discovery strategy combines multiple sources:
- GitHub code search — Searches for repositories containing `asv.conf.json`, filtered by language (Python) and minimum star count (`--min-stars`, default 500).
- Offline import — If `--offline-source` is provided, imports repository names from a Parquet file.
- Existing repos — Re-processes repositories already in the database to refresh metadata (description, stars, topics).
For each repository found, the runner fetches and stores metadata including description, primary language, star count, topics, and creation date.
# Discover repos with at least 1000 stars
fc-data --start-date 2026-02-01 --end-date 2026-03-01 --stage 1 --min-stars 1000
# Include repos from an offline dataset
fc-data --start-date 2026-02-01 --end-date 2026-03-01 --stage 1 --offline-source data.parquet
Writes to: repositories table
Runner: ScrapeReposRunner
Stage 2: Scrape Commits¶
For every repository in the repositories table, scrapes all merged pull requests within the --start-date to --end-date range. For each PR, the runner fetches:
- Metadata — title, body, state, labels, merge commit SHA, base/head SHAs, timestamps
- Diff — the full unified diff of the PR
- File changes — per-file additions, deletions, and change counts
- Timeline events — comments, cross-references, review events (used later for link scraping)
If --offline-source is provided, also bulk-imports PR records from the Parquet file (useful for seeding the database with historical data).
Writes to: pull_requests table (one row per PR)
Runner: ScrapeCommitsRunner
Stage 3: Classify PRs¶
Runs two LLM agents in sequence on each unclassified PR:
- `PerfClassifier` — Reads the PR title, body, diff, and file change summary. Outputs a binary YES/NO decision on whether this PR improves performance.
- `ClassifyJudge` — For PRs classified as YES, determines the optimization type (one of 14 categories such as "algorithmic", "caching", "parallelism", "memory") and difficulty level (easy/medium/hard).
Only PRs that pass a symbolic pre-filter (is_performance_commit_symbolic = True, set during scraping based on keywords in the PR title/body/labels) are sent to the LLM — this keeps costs down by filtering out obviously irrelevant PRs first.
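A keyword-based pre-filter of this kind can be sketched as follows. This is a minimal illustration, not the actual implementation — the keyword list is assumed, and only the `is_performance_commit_symbolic` flag name comes from the pipeline itself:

```python
import re

# Assumed keyword list -- the real pre-filter's vocabulary may differ.
PERF_KEYWORDS = re.compile(
    r"\b(speedup|speed\s*up|faster|performance|perf|optimi[sz]\w*)\b",
    re.IGNORECASE,
)

def is_performance_commit_symbolic(title: str, body: str, labels: list[str]) -> bool:
    """Cheap keyword screen applied during scraping; only PRs that pass
    this check are sent to the (more expensive) LLM classifiers."""
    haystack = " ".join([title, body, *labels])
    return bool(PERF_KEYWORDS.search(haystack))

print(is_performance_commit_symbolic("Speed up CSV parser", "", []))  # True
print(is_performance_commit_symbolic("Fix typo in docs", "", ["docs"]))  # False
```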
# Classify all unclassified PRs
fc-data --start-date 2026-02-01 --end-date 2026-03-01 --stage 3
# Re-classify all PRs (including already-classified ones)
fc-data --start-date 2026-02-01 --end-date 2026-03-01 --stage 3 --force
Updates: pull_requests table (sets is_performance_commit, classification)
Runner: ClassifyPRsRunner
Requires: LLM backend (DSPY_* variables in tokens.env). See Configuration.
Stage 4: Resolve Packages¶
For each performance-classified PR, checks out the repository at the merge commit SHA and resolves a complete, pinned Python dependency set. The process:
- Parse metadata — Reads `pyproject.toml`, `setup.py`, or `setup.cfg` to extract declared dependencies.
- Resolve with uv — Runs `uv pip compile` to produce a fully pinned requirements file, trying multiple Python versions if needed.
- Validate installability — Confirms the resolved set can actually be installed (sets `can_install = True/False`).
The resolved dependency set (env_payload) and the Python version used are stored in the packages table. These are consumed by stage 6 to build docker_build_env.sh — the shell script that installs dependencies inside the Docker image.
Writes to: packages table (env_payload, python_version, can_install)
Runner: ResolvePackagesRunner
Note
Only PRs with can_install = True proceed to later stages. If resolution fails for a commit, check the runner_failures table for the error.
Stage 5: Render Problems¶
Builds a rich, deconstructed problem context for each PR. This context is what the LLM agent sees during synthesis (stage 6) and what ends up in the final FormulaCode dataset. The process:
- Scrape linked issues — Follows `#123`, `owner/repo#456`, and GitHub URL references in the PR body via BFS (up to depth 2). Fetches the full issue body and comments for each linked issue.
- Extract problem observations — Uses `ProblemExtractor` (an LLM agent) to separate the problem description from the solution details in the PR body. This prevents information leakage: the benchmark evaluator should see the problem, not how the author solved it.
- Persist components — Stores the linked issues JSON, extracted observations, repo description, and raw PR fields in the `candidate_prs` table, allowing the problem statement to be re-rendered later without re-scraping.
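The reference-extraction part of link scraping can be sketched with two regexes. This is illustrative only — the actual scraper also follows full GitHub URLs and recurses via BFS:

```python
import re

CROSS_REPO = re.compile(r"(?P<owner>[\w.-]+)/(?P<repo>[\w.-]+)#(?P<num>\d+)")  # owner/repo#456
BARE_REF = re.compile(r"(?<![\w/])#(?P<num>\d+)")                              # bare #123

def extract_issue_refs(body: str, default_repo: str) -> list[tuple[str, int]]:
    """Collect (repo, issue-number) pairs referenced in a PR body."""
    refs = [(f"{m['owner']}/{m['repo']}", int(m["num"])) for m in CROSS_REPO.finditer(body)]
    # Strip cross-repo refs first so their "#N" part is not double-counted.
    remainder = CROSS_REPO.sub("", body)
    refs += [(default_repo, int(m["num"])) for m in BARE_REF.finditer(remainder)]
    return refs

print(extract_issue_refs("Fixes #123, see numpy/numpy#456", "pandas-dev/pandas"))
```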
# Render problem contexts
fc-data --start-date 2026-02-01 --end-date 2026-03-01 --stage 5
# Limit to 3 PRs per repo (useful for testing)
fc-data --start-date 2026-02-01 --end-date 2026-03-01 --stage 5 --tasks-per-repo 3
# Re-render already-processed PRs
fc-data --start-date 2026-02-01 --end-date 2026-03-01 --stage 5 --force
Writes to: candidate_prs table
Runner: RenderProblemsRunner
Requires: LLM backend (for ProblemExtractor) and GitHub tokens (for issue scraping).
Stage 6: Synthesize Images¶
The most complex stage. For each PR with resolved packages and a rendered problem context, generates a working Docker build context — the set of shell scripts (docker_build_pkg.sh, docker_build_run.sh, etc.) that build the repository, install dependencies, check out the right commit, and run ASV benchmarks.
The synthesizer is a state machine that tries the cheapest option first:
- Check cache — Look up Supabase for an existing, verified build script for this exact PR
- Find similar — Query for working scripts from the same repository (different PRs)
- Try similar — Run each similar script against the verifier chain (smoke test + profile test)
- LLM generate — Invoke a coding agent (Claude Code, Codex, or Gemini) in a sandboxed workspace to write the build scripts from scratch, iterating until verification passes or attempts are exhausted
Each attempt (success or failure) is logged to the error_logs table with the agent output, failure stage, return code, and stderr — making it possible to diagnose and retry later.
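The cheapest-first fallback above amounts to an ordered strategy chain. A minimal sketch, with lambdas standing in for the real cache lookup, similarity search, and LLM generation (which all involve Supabase and Docker):

```python
from typing import Callable, Optional

def synthesize(strategies: list[Callable[[], Optional[str]]]) -> Optional[str]:
    """Try each strategy in cost order; return the first script that verifies."""
    for attempt in strategies:
        script = attempt()
        if script is not None:
            return script
    return None

result = synthesize([
    lambda: None,           # check cache: no verified script for this PR
    lambda: "build.sh v2",  # try similar: a sibling PR's script passes verification
    lambda: "generated",    # llm generate: never reached
])
print(result)  # build.sh v2
```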
# Use the default auto-detected agent
fc-data --start-date 2026-02-01 --end-date 2026-03-01 --stage 6
# Force a specific agent
fc-data --start-date 2026-02-01 --end-date 2026-03-01 --stage 6 --agent claude
# Skip LLM generation entirely — only use cached/similar scripts
fc-data --start-date 2026-02-01 --end-date 2026-03-01 --stage 6 --agent none
# Limit concurrency and tasks per repo (controls cost)
fc-data --start-date 2026-02-01 --end-date 2026-03-01 --stage 6 --n-concurrent 2 --tasks-per-repo 5
# Re-synthesize already-completed PRs
fc-data --start-date 2026-02-01 --end-date 2026-03-01 --stage 6 --force
Writes to: candidate_containers table (on success), error_logs table (every attempt)
Runner: SynthesizeImagesRunner
Requires: Docker daemon running, resolved packages (stage 4), rendered problems (stage 5). LLM agent CLI (claude, codex, gemini, or qwen) must be on $PATH unless --agent none.
Warning
Synthesis can be expensive — each LLM attempt may consume significant tokens and each Docker build takes minutes. Use --n-concurrent and --tasks-per-repo to control cost. Start with --agent none to exhaust cached/similar scripts before using LLM generation.
Stage 7: Publish¶
The final stage. Builds three-tier Docker images (base → repo → PR) from synthesized contexts, verifies them, and pushes to DockerHub. Then exports all verified PRs as a versioned Parquet dataset to HuggingFace.
The publish pipeline:
- Query — Fetches all performance-classified PRs with successful synthesis (`container_name IS NOT NULL`) that haven't been published yet.
- Build images — Constructs the Docker image hierarchy: base image, per-repo image, per-PR image.
- Verify — Runs the verifier chain (smoke import test + ASV profile test) on each PR image.
- Push to DockerHub — Publishes verified images under the `formulacode/` namespace.
- Push to HuggingFace — Exports FormulaCode records as versioned Parquet (e.g., `formulacode@2026-03`).
- Mark published — Sets the `published_at` timestamp on each published PR row.
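A base → repo → PR tag hierarchy might look like the sketch below. The exact tag scheme here is an assumption for illustration; only the `formulacode/` namespace comes from the pipeline:

```python
def image_tags(repo: str, pr_number: int, version: str = "latest") -> dict[str, str]:
    """Hypothetical tag layout for the three-tier image hierarchy."""
    slug = repo.replace("/", "-").lower()  # "numpy/numpy" -> "numpy-numpy"
    return {
        "base": f"formulacode/base:{version}",
        "repo": f"formulacode/{slug}:{version}",
        "pr": f"formulacode/{slug}:pr-{pr_number}",
    }

print(image_tags("numpy/numpy", 456))
```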
Reads from: pull_requests, packages, candidate_containers
Requires: DOCKERHUB_USERNAME, DOCKERHUB_TOKEN, HF_TOKEN_PATH in tokens.env. Docker daemon running.
Typical workflow¶
A monthly update usually looks like this:
# 1. Run the full pipeline
fc-data --start-date 2026-03-01 --end-date 2026-04-01
# 2. If it gets interrupted, resume where it left off
fc-data --start-date 2026-03-01 --end-date 2026-04-01 --resume
# 3. After fixing a synthesis issue, re-run just stages 6-7
fc-data --start-date 2026-03-01 --end-date 2026-04-01 --stage 6 --stage 7 --force
Monitoring progress¶
Each runner writes live counters to the runner_progress Supabase table. You can monitor progress via Supabase Studio (default http://127.0.0.1:54323) or query directly:
from datasmith.utils.db import get_client
sb = get_client()
rows = sb.table("runner_progress").select("*").execute()
for row in rows.data:
    print(f"{row['runner_name']}: {row['completed']}/{row['total']} ({row['failed']} failed)")
Per-item failures are logged to runner_failures with full error messages and tracebacks, so you can diagnose and re-run specific stages without restarting from scratch.
Architecture¶
All runners extend BaseRunner, which provides:
- Concurrency control — `asyncio.Semaphore` limits parallelism (set via `--n-concurrent`).
- Progress tracking — Upserts to `runner_progress` every 10 items or 30 seconds.
- Error isolation — Per-item failures logged to `runner_failures`; the runner never aborts.
- Graceful shutdown — CTRL+C terminates agent subprocesses cleanly; press twice to force-kill.
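The concurrency-plus-isolation pattern can be sketched as follows. This is a minimal illustration of the idea, not the actual `BaseRunner` code:

```python
import asyncio

async def run_all(items, process, n_concurrent: int = 4):
    """Process items under a semaphore; a failing item is recorded,
    never re-raised, so one bad item cannot abort the run."""
    sem = asyncio.Semaphore(n_concurrent)
    completed, failures = [], []

    async def worker(item):
        async with sem:
            try:
                completed.append(await process(item))
            except Exception as exc:  # error isolation: record, don't raise
                failures.append((item, str(exc)))

    await asyncio.gather(*(worker(i) for i in items))
    return completed, failures

async def demo(item):
    if item == "bad":
        raise ValueError("boom")
    return item.upper()

done, failed = asyncio.run(run_all(["a", "bad", "b"], demo))
print(done, failed)
```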
Database tables¶
| Table | Written by | Purpose |
|---|---|---|
| `repositories` | Stage 1 | Tracked GitHub repos (language, stars, topics) |
| `pull_requests` | Stages 2–3 | PR metadata, classification, diffs, rendered problems |
| `packages` | Stage 4 | Pinned `env_payload` and `python_version` per commit |
| `candidate_prs` | Stage 5 | Deconstructed PR context for re-rendering |
| `candidate_containers` | Stage 6 | Successful agent-generated build scripts per SHA |
| `error_logs` | Stage 6 | Per-attempt synthesis results (agent output, failure details) |
| `runner_progress` | All stages | Live progress counters (total/completed/failed) |
| `runner_failures` | All stages | Per-item failure details (error message, traceback) |
| `hook_cache` | Various | Memoization cache for `@supabase_cached` functions |