
Quickstart

This guide walks through common fc-data operations with code examples.

Working with pull requests

Every task starts with a PR. You can construct one directly or fetch it from the database:

from datasmith.github import PR, GitHubClient
from datasmith.utils import TokenPool

# Construct a bare PR (only these fields set; useful when you already have the data)
pr = PR(repository="astropy/astropy", issue_number=16222)

# PRs are frozen Pydantic v2 models: immutable after creation
pr.merge_commit_sha   # the merge-commit SHA
pr.base_sha           # SHA of the base-branch commit
pr.cache_key          # "astropy/astropy:16222", used for Supabase caching

# Or fetch a fully-hydrated PR (tries Supabase first, then GitHub API)
pr = await PR.fetch("astropy/astropy", 16222)
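The frozen-model behavior can be sketched with a plain frozen dataclass — a toy stand-in, not the real Pydantic PR class:

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class MiniPR:
    """Toy stand-in for the real PR model (illustration only)."""
    repository: str
    issue_number: int

    @property
    def cache_key(self) -> str:
        # Mirrors the "owner/repo:number" scheme shown above
        return f"{self.repository}:{self.issue_number}"

pr = MiniPR(repository="astropy/astropy", issue_number=16222)
print(pr.cache_key)       # astropy/astropy:16222

try:
    pr.issue_number = 1   # frozen: any mutation raises
except FrozenInstanceError:
    print("immutable")
```

Freezing the model means a PR can be safely used as a cache key and shared across concurrent tasks without defensive copying.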

Fetching data from GitHub

Use the async client to fetch live data:

pool = TokenPool()   # reads GH_TOKENS env var, rotates tokens on rate-limit
gh = GitHubClient(pool)

pr = await gh.get_pr("pandas-dev", "pandas", 16222)
diff = await gh.get_diff("pandas-dev", "pandas", 16222)
events = await gh.get_timeline("pandas-dev", "pandas", 16222)
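The token-rotation idea behind TokenPool can be sketched with a round-robin cycle over a comma-separated env var — a hypothetical MiniTokenPool, not the real class:

```python
import itertools
import os

class MiniTokenPool:
    """Toy round-robin token pool (illustration; not the real TokenPool)."""
    def __init__(self, env_var: str = "GH_TOKENS") -> None:
        tokens = [t for t in os.environ.get(env_var, "").split(",") if t]
        self._cycle = itertools.cycle(tokens or ["<no-token>"])
        self.current = next(self._cycle)

    def rotate(self) -> str:
        # Called when the current token hits a rate limit
        self.current = next(self._cycle)
        return self.current

os.environ["GH_TOKENS"] = "tok_a,tok_b"
pool = MiniTokenPool()
print(pool.current)   # tok_a
print(pool.rotate())  # tok_b
print(pool.rotate())  # tok_a (wraps around)
```

Cycling through several tokens multiplies the effective GitHub API rate limit for long-running ingestion jobs.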

Rendering problem statements

Turn a PR into a problem statement for LLM evaluation:

from datasmith.github import render_problem_statement, scrape_links

# Render with anonymization (@alice -> @user_1, emails stripped)
statement = render_problem_statement(pr, anonymize=True)

# Scrape linked issues via BFS for richer context
issues = await scrape_links(pr, gh.get_issue_expanded, depth=2, only_issues=True, limit=6)
statement = render_problem_statement(
    pr, issues=issues, repo_description="pandas is a data analysis library"
)
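The anonymization pass can be sketched with two regex substitutions — a hypothetical stand-in for what render_problem_statement does when anonymize=True, not its actual implementation:

```python
import re

def anonymize_mentions(text: str) -> str:
    """Map each distinct @mention to @user_N and strip email addresses."""
    # Strip emails first so the "@domain" part isn't mistaken for a mention
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "[email]", text)

    mapping: dict[str, str] = {}

    def repl(match: re.Match) -> str:
        handle = match.group(0)
        if handle not in mapping:
            mapping[handle] = f"@user_{len(mapping) + 1}"
        return mapping[handle]

    return re.sub(r"@[A-Za-z0-9-]+", repl, text)

print(anonymize_mentions("cc @alice and @bob; @alice wrote to a@b.com"))
# cc @user_1 and @user_2; @user_1 wrote to [email]
```

Mapping each handle to a stable pseudonym (rather than deleting it) preserves who-said-what structure in the conversation while removing identifying information.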

Custom hooks

Register custom operations that are automatically cached in Supabase:

from datasmith.github import HookRegistry
from dspy import ChainOfThought

summarizer = ChainOfThought("document -> summary")

def summarize(pr):
    doc = render_problem_statement(pr, anonymize=True)
    return summarizer(doc).summary

HookRegistry.register("summarize", summarize)   # auto-wrapped with @supabase_cached

HookRegistry.call("summarize", pr)   # first call: hits LLM
HookRegistry.call("summarize", pr)   # second call: reads from cache
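The register-then-cache pattern can be sketched with an in-memory dict standing in for the Supabase cache; in the real API the hook receives a PR object, while here a plain key string stands in:

```python
class MiniRegistry:
    """Toy hook registry with an in-memory cache (stand-in for @supabase_cached)."""
    _hooks: dict = {}
    _cache: dict = {}

    @classmethod
    def register(cls, name, fn):
        cls._hooks[name] = fn

    @classmethod
    def call(cls, name, key):
        # Cache on (hook name, item key); the real code keys on pr.cache_key
        if (name, key) not in cls._cache:
            cls._cache[(name, key)] = cls._hooks[name](key)
        return cls._cache[(name, key)]

calls = []
def expensive(key):
    calls.append(key)   # record each real invocation
    return f"summary of {key}"

MiniRegistry.register("summarize", expensive)
MiniRegistry.call("summarize", "astropy/astropy:16222")   # runs expensive
MiniRegistry.call("summarize", "astropy/astropy:16222")   # served from cache
print(len(calls))   # 1
```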

Classification at scale

Run LLM-based classification across many PRs concurrently:

from datasmith.runners import ClassifyPRsRunner
from datasmith.agents import PerfClassifier, ClassifyJudge

runner = ClassifyPRsRunner(PerfClassifier(), ClassifyJudge(), n_concurrent=64)
await runner.run(pr_items)
# Progress is tracked in the Supabase runner_progress table
# Per-item failures are logged to runner_failures; the runner never aborts
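The bounded-concurrency, never-abort pattern can be sketched with asyncio — a minimal sketch, not the real ClassifyPRsRunner:

```python
import asyncio

async def run_all(items, worker, n_concurrent=4):
    """Run worker over items with bounded concurrency; collect failures instead of raising."""
    sem = asyncio.Semaphore(n_concurrent)
    results, failures = {}, {}

    async def guarded(item):
        async with sem:
            try:
                results[item] = await worker(item)
            except Exception as exc:   # per-item failure: record it, keep going
                failures[item] = repr(exc)

    await asyncio.gather(*(guarded(i) for i in items))
    return results, failures

async def classify(item):
    if item == "bad":
        raise ValueError("cannot classify")
    return f"perf:{item}"

results, failures = asyncio.run(run_all(["a", "bad", "b"], classify))
print(sorted(results))   # ['a', 'b']
print(list(failures))    # ['bad']
```

Isolating failures per item is what lets a large classification run survive flaky LLM calls and resume from recorded progress.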

Building Docker images

Build reproducible environments using the three-tier image hierarchy:

from datasmith.docker import ImageManager

mgr = ImageManager()
mgr.build_base_image()                               # formulacode/base:latest
mgr.build_repo_image("pandas-dev", "pandas")          # formulacode/pandas-dev-pandas:latest
mgr.build_pr_image("pandas-dev", "pandas", 16222)     # formulacode/pandas-dev-pandas:16222
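The three-tier tag scheme visible in the comments above can be captured in a small helper — hypothetical functions, with the naming convention inferred from those comments:

```python
def base_tag() -> str:
    # Tier 1: shared base image
    return "formulacode/base:latest"

def repo_tag(org: str, repo: str) -> str:
    # Tier 2: repo-level image; org and repo joined with a hyphen
    return f"formulacode/{org}-{repo}:latest"

def pr_tag(org: str, repo: str, pr_number: int) -> str:
    # Tier 3: PR-level image reuses the repo image name, tagged by PR number
    return f"formulacode/{org}-{repo}:{pr_number}"

print(pr_tag("pandas-dev", "pandas", 16222))   # formulacode/pandas-dev-pandas:16222
```

Layering images this way means the expensive base and repo setup is built once and each PR image only adds the checkout for its own commit.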

Next steps