Technical Documentation Overview

Snorkel AI's data development pipeline has three technical components: the data types it supports, a data quality pipeline, and benchmarks and evals.

Data Types Supported

Expert demonstrations & reasoning – Human solution traces, reasoning traces, SME Q&A rationales, tool-use demos, workflow decision demos
Preference labels & rankings – Patch/draft/report quality ranking, trajectory QA, risk/safety/style calibration, helpful/harmless ranking
Rubrics & verifiable outcomes – Unit tests, deterministic graders, citation correctness checks, numerical consistency scoring for math/science tasks
Long-horizon task environments – Standard and custom environments including repo + CLI tools, browser/GUI harness, multi-step stateful workflows, and simulated environments
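
To make the preference-ranking data type concrete, here is a minimal sketch of how a ranked list of candidate outputs could be expanded into pairwise preference records. The names `PreferencePair` and `ranking_to_pairs` are hypothetical and are not taken from any Snorkel API. The sketch assumes the ranking is strict and ordered best first.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PreferencePair:
    """One pairwise preference derived from a ranking (hypothetical schema)."""
    task_id: str
    chosen: str
    rejected: str


def ranking_to_pairs(task_id: str, ranked_outputs: list[str]) -> list[PreferencePair]:
    """Expand a best-first ranking into all implied (chosen, rejected) pairs.

    combinations() preserves input order, so the first element of each
    pair is always ranked above the second.
    """
    return [
        PreferencePair(task_id, better, worse)
        for better, worse in combinations(ranked_outputs, 2)
    ]


if __name__ == "__main__":
    pairs = ranking_to_pairs("patch-001", ["patch_a", "patch_b", "patch_c"])
    for pair in pairs:
        print(pair.chosen, ">", pair.rejected)
```

A ranking of n outputs yields n(n-1)/2 pairs, so long rankings grow quickly. A real pipeline might sample pairs or keep only adjacent ones.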

Data Quality Pipeline

Task specification – Tasks are scoped to actual model failure modes, with target distributions, acceptance criteria, and verifier definitions. Each spec is treated as a research artifact.
Calibrated expert review – Reviewers trained against gold sets authored by Snorkel researchers, scored for agreement and bias, re-calibrated per task
Programmatic checks – Fine-tuned evaluator models co-designed with domain experts, distilled into programmatic graders
Adjudication & provenance – Full author → multi-reviewer → final-adjudicator pipeline with complete audit trails
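
As an illustration of the calibrated expert review step, the sketch below scores a reviewer's labels against a gold set using Cohen's kappa, a standard chance-corrected agreement statistic. This is an assumption about one reasonable way to measure agreement, not a description of Snorkel's actual metric. The function names and the 0.6 recalibration threshold are hypothetical.

```python
from collections import Counter


def cohen_kappa(reviewer: list[str], gold: list[str]) -> float:
    """Chance-corrected agreement between reviewer labels and gold labels."""
    if len(reviewer) != len(gold) or not gold:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(gold)
    # Observed agreement: fraction of items where the labels match.
    observed = sum(r == g for r, g in zip(reviewer, gold)) / n
    # Expected agreement if both labeled independently at their own base rates.
    reviewer_counts = Counter(reviewer)
    gold_counts = Counter(gold)
    expected = sum(
        (reviewer_counts[label] / n) * (gold_counts[label] / n)
        for label in set(reviewer) | set(gold)
    )
    if expected == 1.0:
        # Both sides used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def needs_recalibration(reviewer: list[str], gold: list[str], threshold: float = 0.6) -> bool:
    """Flag a reviewer whose agreement falls below a (hypothetical) threshold."""
    return cohen_kappa(reviewer, gold) < threshold


if __name__ == "__main__":
    gold_labels = ["pass", "fail", "fail", "fail"]
    reviewer_labels = ["pass", "pass", "fail", "fail"]
    print(cohen_kappa(reviewer_labels, gold_labels))
    print(needs_recalibration(reviewer_labels, gold_labels))
```

Kappa only captures agreement. Measuring bias, for example a reviewer who systematically scores higher than the gold set, would need a separate statistic such as the mean signed difference on numeric scores.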

Benchmarks & Evals

Eval harnesses are built alongside the datasets. Each harness combines task-specific rubrics, deterministic graders, and runnable environments, so that scores are reproducible across model versions. Published benchmarks include Terminal-Bench 2.0, SlopCodeBench, RIFT, Agents' Last Exam, and Harvey BigLaw Bench.
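
The idea of a deterministic grader producing reproducible scores can be sketched as follows. This is a toy harness under stated assumptions: the task format (`prompt` and `expected` keys), the exact-match grader, and the function names are all hypothetical and do not reflect the harness of any benchmark named above.

```python
from typing import Callable


def exact_match_grader(output: str, expected: str) -> float:
    """Deterministic grader: 1.0 on an exact match after trimming whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def evaluate(model: Callable[[str], str], tasks: list[dict[str, str]]) -> float:
    """Run a model over every task and return the mean grader score."""
    if not tasks:
        raise ValueError("tasks must not be empty")
    scores = [
        exact_match_grader(model(task["prompt"]), task["expected"])
        for task in tasks
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy task set; a real benchmark would load these from a versioned file.
    tasks = [
        {"prompt": "2 + 2", "expected": "4"},
        {"prompt": "3 * 3", "expected": "9"},
    ]

    def toy_model(prompt: str) -> str:
        # Stand-in for a model call: only knows one answer.
        return "4" if prompt == "2 + 2" else "unknown"

    print(evaluate(toy_model, tasks))
```

Because the grader has no randomness and no model in the loop, rerunning it on the same outputs always gives the same score. Any remaining variance then comes from the model being evaluated.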