Publishing¶
fc-data publishes verified tasks to DockerHub (Docker images) and HuggingFace (Parquet datasets).
Publishing workflow¶
```python
from datasmith.publish import records_from_supabase, HuggingFacePublisher

# Query all verified, unpublished perf PRs from the last month
records = records_from_supabase(start_date="2026-02-01", end_date="2026-03-01")

# Publish to HuggingFace as a versioned Parquet dataset
hf = HuggingFacePublisher()
hf.publish(records, version="formulacode@2026-03")
```
Querying the database directly¶
For more control, query Supabase directly:
```python
from datasmith.utils.db import get_client

sb = get_client()
response = (
    sb.table("pull_requests")
    .select("*")
    .eq("is_performance_commit", True)
    .not_.is_("container_name", "null")
    .execute()
)
rows = response.data  # list of row dicts
```
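Rows come back as plain dicts, so further filtering can also be done client-side. A minimal sketch, assuming the `published_at` column described in the pipeline integration section (the sample rows here are hypothetical):

```python
# Post-filter query results client-side: keep only rows that have not
# yet been stamped with a published_at timestamp.
rows = [
    {"id": 1, "container_name": "img-a", "published_at": None},
    {"id": 2, "container_name": "img-b", "published_at": "2026-02-14T09:00:00Z"},
]

unpublished = [r for r in rows if r["published_at"] is None]
print([r["id"] for r in unpublished])  # → [1]
```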
Dataset versioning¶
- Versions follow an `@YYYY-MM` format (e.g., `formulacode@2026-03`)
- The dataset is updated monthly
- Each publish run creates a new version tag on HuggingFace
- Publishing is append-only; prior versions are never overwritten
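The monthly tag can be derived mechanically from a release date. A small sketch; `version_tag` is a hypothetical helper, not part of datasmith:

```python
from datetime import date

def version_tag(name: str, d: date) -> str:
    """Build a name@YYYY-MM version tag for a monthly dataset release."""
    return f"{name}@{d:%Y-%m}"

print(version_tag("formulacode", date(2026, 3, 1)))  # → formulacode@2026-03
```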
DockerHub publishing¶
`DockerHubPublisher` handles:
- Lazy authentication with DockerHub credentials
- Retry with exponential backoff via `@with_backoff`
- Version tagging with an `@YYYY-MM` suffix
- Remote tag listing for delta publishes (only new or changed images are pushed)
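The exact signature of `@with_backoff` is not shown here; a minimal sketch of the pattern (the parameter names and defaults are assumptions):

```python
import functools
import time

def with_backoff(retries: int = 3, base_delay: float = 1.0):
    """Retry a flaky call with exponential backoff (base, 2x, 4x, ...)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == retries - 1:
                        raise  # out of retries: surface the error
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator

# Illustration: a push that fails twice, then succeeds on the third try.
@with_backoff(retries=3, base_delay=0.01)
def flaky_push():
    flaky_push.calls += 1
    if flaky_push.calls < 3:
        raise ConnectionError("transient network error")
    return "pushed"

flaky_push.calls = 0
print(flaky_push())  # → pushed
```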
Pipeline integration¶
Publishing is the final stage of the fc-data pipeline (stage 7). The `publish_pipeline()` function orchestrates:
- Query the DB for unpublished, verified PRs
- Push Docker images to DockerHub
- Upload the Parquet dataset to HuggingFace
- Mark records with a `published_at` timestamp
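The four steps above can be sketched as a single orchestration function. This is an illustration only; the real `publish_pipeline()` lives in `datasmith.publish`, and the collaborator names here are hypothetical stand-ins:

```python
from datetime import datetime, timezone

def publish_pipeline(query_db, push_images, upload_dataset, mark_published):
    records = query_db()            # 1. unpublished, verified PRs
    push_images(records)            # 2. Docker images -> DockerHub
    upload_dataset(records)         # 3. Parquet dataset -> HuggingFace
    now = datetime.now(timezone.utc)
    mark_published(records, now)    # 4. stamp published_at
    return len(records)

# Stub collaborators that just record what was called:
log = []
n = publish_pipeline(
    query_db=lambda: [{"id": 1}, {"id": 2}],
    push_images=lambda recs: log.append(("docker", len(recs))),
    upload_dataset=lambda recs: log.append(("hf", len(recs))),
    mark_published=lambda recs, ts: log.append(("stamped", len(recs))),
)
print(n, log)  # → 2 [('docker', 2), ('hf', 2), ('stamped', 2)]
```

Passing the collaborators in as arguments keeps each stage independently testable, which matters for a stage that talks to three external services.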