Snorkel AI
Expert data development and specialized agents for frontier AI models
Enterprise·Technical
Key strengths
- Research-grade, expert-curated training datasets for frontier AI
- Rigorous calibration pipelines with full label provenance and audit trails
- Custom benchmark and evaluation harness development
- Specialized AI agent development grounded in domain-specific data
- Deep academic roots: founded out of Stanford AI Lab, with peer-reviewed research
Enterprise pricing
Redwood City, USA
Founded 2019
Technical Documentation Overview
Snorkel AI's data development pipeline is built around three technical components: the data types it supports, a data quality pipeline, and benchmarks and evals.
Data Types Supported
- Expert demonstrations & reasoning – Human solution traces, reasoning traces, SME Q&A rationales, tool-use demos, workflow decision demos
- Preference labels & rankings – Patch/draft/report quality ranking, trajectory QA, risk/safety/style calibration, helpful/harmless ranking
- Rubrics & verifiable outcomes – Unit tests, deterministic graders, citation correctness checks, numerical consistency scoring for math/science tasks
- Long-horizon task environments – Standard and custom environments including repo + CLI tools, browser/GUI harness, multi-step stateful workflows, and simulated environments
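To make "deterministic graders" and "numerical consistency scoring" concrete, here is a minimal sketch of what such a check might look like for a math task. It is not Snorkel AI's implementation: the function name `grade_numeric_answer`, the "last number in the response" convention, and the tolerance are all assumptions made for illustration.

```python
import math
import re

# Hypothetical grader: the name, the regex, and the tolerance are illustrative
# assumptions, not part of any Snorkel AI API.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def grade_numeric_answer(response: str, expected: float, rel_tol: float = 1e-6) -> bool:
    """Return True if the last number found in `response` matches `expected`.

    The check is deterministic: the same response always gets the same verdict.
    """
    matches = _NUMBER.findall(response)
    if not matches:
        return False  # no numeric answer to grade
    return math.isclose(float(matches[-1]), expected, rel_tol=rel_tol)
```

For example, `grade_numeric_answer("The total is 3.5 + 4 = 7.5", 7.5)` passes, while a response with no number in it fails. A production grader would likely need a stricter answer format than "last number wins".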
Data Quality Pipeline
- Task specification – Tasks scoped to actual model failure modes with target distributions, acceptance criteria, and verifier definitions — each spec is a research artifact
- Calibrated expert review – Reviewers trained against gold sets authored by Snorkel researchers, scored for agreement and bias, re-calibrated per task
- Programmatic checks – Fine-tuned evaluator models co-designed with domain experts, distilled into programmatic graders
- Adjudication & provenance – Full author → multi-reviewer → final-adjudicator pipeline with complete audit trails
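The listing does not say how reviewer agreement is scored. One common choice is Cohen's kappa between a reviewer's labels and the gold set, which corrects raw agreement for agreement expected by chance. The sketch below assumes that metric; the function names and the 0.7 recalibration threshold are hypothetical.

```python
from collections import Counter


def cohens_kappa(reviewer: list[str], gold: list[str]) -> float:
    """Cohen's kappa between one reviewer's labels and the gold labels."""
    if len(reviewer) != len(gold) or not gold:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(gold)
    # Fraction of items where the reviewer matches the gold label.
    observed = sum(r == g for r, g in zip(reviewer, gold)) / n
    # Agreement expected by chance, from each side's label frequencies.
    r_counts, g_counts = Counter(reviewer), Counter(gold)
    expected = sum(r_counts[label] * g_counts[label] for label in g_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both sides used a single identical label throughout
    return (observed - expected) / (1 - expected)


def needs_recalibration(kappa: float, threshold: float = 0.7) -> bool:
    """Hypothetical policy: flag reviewers whose agreement falls below a threshold."""
    return kappa < threshold
```

With reviewer labels `["a", "a", "b", "b"]` against gold `["a", "b", "b", "b"]`, observed agreement is 0.75 and chance agreement is 0.5, giving a kappa of 0.5, which this illustrative policy would flag for recalibration.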
Benchmarks & Evals
Eval harnesses are built alongside datasets, featuring task-specific rubrics, deterministic graders, and runnable environments producing reproducible scores across model versions. Published benchmarks include Terminal-Bench 2.0, SlopCodeBench, RIFT, Agents' Last Exam, and Harvey BigLaw Bench.
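As a rough illustration of how a harness can yield reproducible scores across model versions, the sketch below runs a fixed task set through deterministic graders in a stable order. Everything here is hypothetical: `EvalTask`, `run_eval`, `pass_rate`, and the stub models are stand-ins, not the interface of any benchmark named above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalTask:
    task_id: str
    prompt: str
    grader: Callable[[str], bool]  # deterministic: same output, same verdict


def run_eval(model: Callable[[str], str], tasks: list[EvalTask]) -> dict[str, bool]:
    """Grade every task in a fixed (sorted) order so runs are comparable."""
    ordered = sorted(tasks, key=lambda t: t.task_id)
    return {t.task_id: t.grader(model(t.prompt)) for t in ordered}


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks passed."""
    if not results:
        raise ValueError("no results to score")
    return sum(results.values()) / len(results)


# Illustrative task set and two stub "model versions".
TASKS = [
    EvalTask("add", "What is 2 + 2?", lambda out: out.strip() == "4"),
    EvalTask("upper", "Uppercase 'ok'.", lambda out: out.strip() == "OK"),
]


def model_v1(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "ok"


def model_v2(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "OK"
```

Calling `pass_rate(run_eval(model_v1, TASKS))` and `pass_rate(run_eval(model_v2, TASKS))` scores both versions on the same tasks with the same graders, so any difference reflects the models rather than the harness. Real harnesses also have to pin environments and sampling settings, which this sketch leaves out.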
