DVC (Data Version Control)
Manage data the way code is managed: Git-like version control for AI/ML and data science.
Free tier available · Technical · API available · Open source
Key strengths
- Git-like versioning for datasets and ML models
- Open source with a large, active community
- Integrates with existing Git workflows
- Supports petabyte-scale data lakes and object stores via lakeFS
- Works with major cloud storage providers and local filesystems
Free tier + paid plans
San Francisco, USA
Founded 2017
Self-hostable
- ML pipeline orchestration: Define multi-stage DAG pipelines in `dvc.yaml` with caching, so that only the stages affected by a data or code change are re-run during retraining.
- Remote artifact management: Version and store large model checkpoints and datasets on S3, GCS, or Azure without bloating Git repos.
- CI/CD for ML: Integrate `dvc repro` and `dvc metrics diff` into GitHub Actions or GitLab CI to automatically validate model performance on every pull request.
- Experiment branching: Use `dvc exp branch` to promote successful experiments to Git branches, keeping experiment history clean and auditable.
- Data lake versioning at scale (lakeFS): Apply Git semantics (branch, merge, revert) directly to petabyte-scale object stores for data engineering teams managing complex ETL and AI data pipelines.
- Programmatic data access: Use `dvc.api.open()` or `dvc.api.read()` in Python scripts to fetch versioned datasets from remote storage with a single line of code.
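
As a rough illustration of the programmatic access mentioned in the list, the sketch below wraps `dvc.api.open()` in a small helper that reads a tracked CSV file at a given Git revision. The helper name `load_csv_rows`, the `opener` parameter, and the file path, repository URL, and tag in the usage comment are all hypothetical and not part of DVC; only `dvc.api.open()` and its `path`, `repo`, `rev`, and `mode` arguments come from the library.

```python
import csv
from typing import Callable, Optional


def load_csv_rows(
    path: str,
    repo: Optional[str] = None,
    rev: Optional[str] = None,
    opener: Optional[Callable] = None,
) -> list:
    """Return the rows of a DVC-tracked CSV file as a list of dicts.

    `opener` is a hypothetical hook (not part of DVC) that lets the
    helper be exercised without DVC installed; it defaults to dvc.api.open.
    """
    if opener is None:
        # Imported lazily so the module loads even where DVC is absent.
        import dvc.api

        opener = dvc.api.open
    # dvc.api.open returns a context manager yielding a file-like object.
    with opener(path, repo=repo, rev=rev, mode="r") as handle:
        return list(csv.DictReader(handle))


# Hypothetical usage (path, repository URL, and tag are placeholders):
# rows = load_csv_rows(
#     "data/train.csv",
#     repo="https://github.com/example/project",
#     rev="v1.0",
# )
```

Pinning `rev` to a tag or commit is what makes the read reproducible: the same call should return the same bytes regardless of what the default branch currently tracks.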
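
The pipeline caching described in the list can be pictured as a comparison between the current hashes of a stage's dependencies and the hashes recorded after its last run. The following is a simplified, standalone sketch of that idea, not DVC's implementation: the function names, the JSON lock-file layout, and the use of SHA-256 are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_hash(path: Path) -> str:
    """Content hash of a file (SHA-256 chosen here for illustration)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def current_hashes(deps) -> dict:
    """Map each dependency path to the hash of its current content."""
    return {str(p): file_hash(Path(p)) for p in deps}


def stage_is_stale(deps, lock_file: Path) -> bool:
    """True when no hashes are recorded or any dependency has changed."""
    if not lock_file.exists():
        return True
    recorded = json.loads(lock_file.read_text())
    return recorded != current_hashes(deps)


def run_stage(deps, lock_file: Path, command) -> bool:
    """Run `command` only if the stage is stale; return True if it ran."""
    if not stage_is_stale(deps, lock_file):
        return False  # Dependencies unchanged: skip the stage.
    command()
    # Record the hashes that this run was based on.
    lock_file.write_text(json.dumps(current_hashes(deps)))
    return True
```

Calling `run_stage` twice with unchanged inputs runs the command only once; editing any dependency file makes the next call run it again.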
