Snorkel AI

Expert data development and specialized agents for frontier AI models

Enterprise·Technical

Key strengths

  • Research-grade, expert-curated training datasets for frontier AI
  • Rigorous calibration pipelines with full label provenance and audit trails
  • Custom benchmark and evaluation harness development
  • Specialized AI agent development grounded in domain-specific data
  • Deep academic roots: founded out of Stanford AI Lab, with peer-reviewed research

Enterprise pricing
Redwood City, USA
Founded 2019

Technical Documentation Overview

Snorkel AI's data development pipeline has three technical parts: the data types it supports, a data quality pipeline, and benchmark and eval harnesses.

Data Types Supported

  • Expert demonstrations & reasoning – Human solution traces, reasoning traces, SME Q&A rationales, tool-use demos, workflow decision demos
  • Preference labels & rankings – Patch/draft/report quality ranking, trajectory QA, risk/safety/style calibration, helpful/harmless ranking
  • Rubrics & verifiable outcomes – Unit tests, deterministic graders, citation correctness checks, numerical consistency scoring for math/science tasks
  • Long-horizon task environments – Standard and custom environments including repo + CLI tools, browser/GUI harness, multi-step stateful workflows, and simulated environments
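
To make "deterministic graders" and "numerical consistency scoring" concrete, here is a minimal sketch of a grader that checks the final number in a model answer against a reference value. It is an illustration only, not Snorkel AI's implementation: the function names, the "last number in the text" convention, and the tolerance defaults are all assumptions.

```python
import math
import re

# Hypothetical convention: the final number in the answer text is the result.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def extract_final_number(answer: str):
    """Return the last number found in the answer text, or None if there is none."""
    matches = _NUMBER.findall(answer)
    return float(matches[-1]) if matches else None


def grade_numeric(answer: str, expected: float,
                  rel_tol: float = 1e-6, abs_tol: float = 0.0) -> bool:
    """Deterministic pass/fail: same inputs always give the same verdict."""
    value = extract_final_number(answer)
    if value is None:
        return False
    return math.isclose(value, expected, rel_tol=rel_tol, abs_tol=abs_tol)


if __name__ == "__main__":
    print(grade_numeric("The total is 42.0 joules", 42.0))
    print(grade_numeric("no number here", 1.0))
```

Because the grader has no randomness and no model in the loop, it can be re-run on any model version and will give the same verdict for the same answer.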

Data Quality Pipeline

  • Task specification – Tasks scoped to actual model failure modes with target distributions, acceptance criteria, and verifier definitions — each spec is a research artifact
  • Calibrated expert review – Reviewers trained against gold sets authored by Snorkel researchers, scored for agreement and bias, re-calibrated per task
  • Programmatic checks – Fine-tuned evaluator models co-designed with domain experts, distilled into programmatic graders
  • Adjudication & provenance – Full author → multi-reviewer → final-adjudicator pipeline with complete audit trails
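
As an illustration of scoring a reviewer "for agreement" against a gold set, the sketch below computes Cohen's kappa between a reviewer's labels and gold labels. The source does not say which agreement statistic Snorkel AI uses, so kappa is an assumed choice, and the 0.8 re-calibration threshold and function names are hypothetical.

```python
from collections import Counter


def cohens_kappa(reviewer, gold):
    """Chance-corrected agreement between reviewer labels and gold labels."""
    if len(reviewer) != len(gold) or not gold:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(gold)
    # Fraction of items where the reviewer matches the gold label.
    observed = sum(r == g for r, g in zip(reviewer, gold)) / n
    # Agreement expected by chance, from each side's label frequencies.
    r_counts, g_counts = Counter(reviewer), Counter(gold)
    expected = sum(r_counts[label] * g_counts[label] for label in g_counts) / (n * n)
    if expected == 1.0:
        # Both sides used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def needs_recalibration(kappa: float, threshold: float = 0.8) -> bool:
    """Hypothetical policy: flag reviewers whose kappa falls below the threshold."""
    return kappa < threshold


if __name__ == "__main__":
    gold = ["pass", "pass", "fail", "pass"]
    reviewer = ["pass", "pass", "fail", "fail"]
    k = cohens_kappa(reviewer, gold)
    print(k, needs_recalibration(k))
```

Kappa is used here instead of raw accuracy because it discounts agreement that would occur by chance when one label dominates the gold set.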

Benchmarks & Evals

Eval harnesses are built alongside datasets, featuring task-specific rubrics, deterministic graders, and runnable environments producing reproducible scores across model versions. Published benchmarks include Terminal-Bench 2.0, SlopCodeBench, RIFT, Agents' Last Exam, and Harvey BigLaw Bench.
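
A minimal sketch of the harness idea described above: each task pairs a prompt with a deterministic grader, so re-running the same model version reproduces the same score. The `Task` type, `run_eval` function, and toy models are hypothetical and are not taken from any of the named benchmarks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Task:
    task_id: str
    prompt: str
    grader: Callable[[str], bool]  # deterministic: output text -> pass/fail


def run_eval(model: Callable[[str], str],
             tasks: List[Task]) -> Tuple[float, Dict[str, bool]]:
    """Run every task once and return (pass rate, per-task results)."""
    if not tasks:
        raise ValueError("at least one task is required")
    results = {t.task_id: t.grader(model(t.prompt)) for t in tasks}
    score = sum(results.values()) / len(tasks)
    return score, results


# Toy task set; real tasks would launch a runnable environment instead.
TASKS = [
    Task("add", "What is 2 + 2?", lambda out: out.strip() == "4"),
    Task("upper", "Uppercase 'ok'", lambda out: out.strip() == "OK"),
]


def model_v1(prompt: str) -> str:
    """Stand-in for an older model version that only handles addition."""
    return "4" if "2 + 2" in prompt else "ok"


def model_v2(prompt: str) -> str:
    """Stand-in for a newer model version that handles both tasks."""
    return "4" if "2 + 2" in prompt else "OK"


if __name__ == "__main__":
    for name, model in [("v1", model_v1), ("v2", model_v2)]:
        score, _ = run_eval(model, TASKS)
        print(name, score)
```

Keeping per-task results alongside the aggregate score makes it possible to see which tasks changed between model versions, not only how the total moved.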