Home

fc-data is a Python package for automatically curating and managing FormulaCode tasks. Once installed, it is designed to run as a monthly cron job that updates the FormulaCode dataset with new commits and repositories.

FormulaCode is a continually updated benchmark for evaluating the holistic ability of LLM agents to optimize codebases. It consists of two parts: a pipeline that constructs performance-optimization tasks, and an execution harness that connects a language model to our terminal sandbox.

How it works

graph LR
    A --->|scrape| B
    A2 <-->|sync| B
    B -->|publish| C
    B -->|publish| D

    A[GitHub]
    A2[Supabase]
    B["fc-data<br/>(This repository)"]
    C[DockerHub]
    D[HuggingFace]

Get started

Most interaction with fc-data is through a single command:

fc-data --start-date 2026-03-01 --end-date 2026-04-01

This runs all seven pipeline stages: repo discovery, PR scraping, LLM classification, dependency resolution, problem rendering, Docker synthesis, and publishing. See the Pipeline guide for the full CLI reference.
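For a monthly scheduled run, the date window has to be computed rather than typed by hand. The following is a minimal sketch of one way to do that, not part of fc-data itself: the helper names `previous_month_window` and `build_command` are hypothetical, and it assumes that `--end-date` is the first day of the following month, as in the example command above. Only the `--start-date` and `--end-date` flags shown above are used.

```python
from datetime import date
import shlex


def previous_month_window(today: date) -> tuple[str, str]:
    """Return (start, end) ISO dates spanning the calendar month before `today`.

    Assumption: the end date is the first day of the current month,
    mirroring the 2026-03-01 / 2026-04-01 example.
    """
    end = today.replace(day=1)  # first day of the current month
    if end.month == 1:
        # January wraps back to December of the previous year.
        start = end.replace(year=end.year - 1, month=12)
    else:
        start = end.replace(month=end.month - 1)
    return start.isoformat(), end.isoformat()


def build_command(today: date) -> list[str]:
    """Build the fc-data argument list for the month before `today` (hypothetical helper)."""
    start, end = previous_month_window(today)
    return ["fc-data", "--start-date", start, "--end-date", end]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_command(date.today())))
```

Called with any date in April 2026, `build_command` yields the same arguments as the command above. A scheduler entry could pass the resulting list to `subprocess.run` once the window convention has been confirmed against the Pipeline guide.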

Key features

Single-command pipeline — fc-data runs all stages with --resume, --stage, and --dry-run support
GitHub scraping — Async httpx client with automatic token rotation across multiple GH_TOKENS
LLM classification — DSPy-based agents classify PRs by performance category and difficulty
Docker synthesis — Automatically generate Docker build contexts using coding agents (Claude, Codex, Gemini)
Scalable runners — Async runners with concurrency control, Supabase progress tracking, and per-item error isolation
Dataset publishing — Versioned Parquet datasets on HuggingFace, Docker images on DockerHub
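The scraping and runner features above combine three ideas: rotating across several tokens, capping concurrency, and keeping one failed item from aborting the rest. The sketch below illustrates those ideas with the standard library only. It is not fc-data's implementation: `parse_tokens`, `run_isolated`, and `demo_worker` are hypothetical names, the comma-separated format for `GH_TOKENS` is an assumption, and the worker simulates a request instead of calling GitHub through httpx.

```python
import asyncio
from itertools import cycle
from typing import Awaitable, Callable, Optional


def parse_tokens(raw: str) -> list[str]:
    """Split a token string into a list (assumes a comma-separated format)."""
    return [t.strip() for t in raw.split(",") if t.strip()]


async def run_isolated(
    items: list[str],
    worker: Callable[[str, str], Awaitable[str]],
    tokens: list[str],
    max_concurrency: int = 4,
) -> list[tuple[str, Optional[str], Optional[Exception]]]:
    """Run `worker(item, token)` for each item and return (item, result, error) triples."""
    if not tokens:
        raise ValueError("at least one token is required")
    rotation = cycle(tokens)  # round-robin token rotation
    sem = asyncio.Semaphore(max_concurrency)  # caps the number of in-flight workers

    async def guarded(item: str, token: str):
        async with sem:
            try:
                return item, await worker(item, token), None
            except Exception as exc:
                # Per-item isolation: record the failure and let other items continue.
                return item, None, exc

    # Tokens are assigned in item order; gather preserves that order in its results.
    return await asyncio.gather(*(guarded(i, next(rotation)) for i in items))


async def demo_worker(item: str, token: str) -> str:
    """Stand-in for a real request; fails on one item to show error isolation."""
    await asyncio.sleep(0)
    if item == "bad/repo":
        raise RuntimeError("simulated failure")
    return f"{item} via {token}"


if __name__ == "__main__":
    demo_tokens = parse_tokens("tok_a, tok_b")
    for row in asyncio.run(run_isolated(["r1", "bad/repo", "r3"], demo_worker, demo_tokens, 2)):
        print(row)
```

Returning the exception alongside each item, rather than raising it, is what lets a caller log the failure (for example to a progress table) and retry only the items that failed.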

Quick links

Installation — Set up your development environment
Pipeline guide (fc-data) — The primary entrypoint — full CLI reference and stage descriptions
Quickstart — Python API examples
Configuration — tokens.env and environment variables