1. Getting Started with DVC
Data science projects often face a fundamental tension between code and data. While Git handles source code versioning elegantly, it struggles with large datasets and machine learning models that exceed repository size limits. DVC (Data Version Control) bridges this gap by extending Git's versioning capabilities to data files, creating a unified history where code, configuration, and data evolve together.
DVC operates as a layer on top of Git, adding version control for files that are too large for Git to handle well. Rather than storing data directly in Git, DVC creates lightweight metafiles that act as pointers to the actual data content, which lives in a separate cache. These metafiles (primarily .dvc files and dvc.yaml) are tracked by Git, allowing you to version data indirectly while keeping repositories lightweight. After you check out a specific commit with Git, running dvc checkout synchronizes the corresponding data versions to your workspace, so any project state can be reproduced in full.
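As a rough illustration of the pointer mechanism, the sketch below wraps the two commands typically involved in tracking a file. The function name track_with_dvc is hypothetical, and the sketch assumes that dvc and git are on the PATH and that it runs inside an initialized DVC project.

```shell
# Hypothetical helper: track one data file with DVC, then stage the
# resulting pointer files with Git. Assumes an initialized DVC project.
track_with_dvc() {
    data_file=$1
    # dvc add stores the content in the cache and writes <file>.dvc,
    # plus a .gitignore entry next to the data file.
    dvc add "$data_file" || return 1
    # Git only ever sees the small pointer file and the ignore rule.
    git add "$data_file.dvc" "$(dirname "$data_file")/.gitignore"
}
```

Calling track_with_dvc data/raw.csv (a placeholder path) and then committing records the pointer in Git history, while the data itself stays in the DVC cache.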
The tool serves three primary functions in machine learning workflows. First, it manages data versioning through a cache-based storage system that deduplicates files and optimizes storage. Second, it automates ML pipelines by defining reproducible workflows in YAML format, creating a build system similar to Make but designed for data science. Third, it tracks experiments by capturing parameters, metrics, and artifacts, storing them as Git commits under custom references so they don't clutter your branch history. DVC achieves this without requiring additional databases or services: it needs only Git and your existing storage infrastructure.
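The pipeline and experiment functions can be sketched with a few commands. The helper below is hypothetical: the stage name train, the file names, and the training command are placeholders, and it assumes a DVC project inside a Git repository that already has at least one commit.

```shell
# Hypothetical helper: register a one-stage pipeline and run it as an
# experiment. Stage name, file names, and command are placeholders.
define_and_run_pipeline() {
    # Record a stage in dvc.yaml: its dependencies, outputs, and command.
    dvc stage add -n train \
        -d train.py -d data/raw.csv \
        -o model.pkl \
        python train.py || return 1
    # Run the pipeline as an experiment.
    dvc exp run || return 1
    # Tabulate parameters and metrics of the recorded experiments.
    dvc exp show
}
```

To execute the pipeline without recording an experiment, dvc repro can be used in place of dvc exp run.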
Unlike Git LFS and similar large-file storage solutions, DVC doesn't mandate specific server infrastructure. It works with most common storage backends: Amazon S3, Google Cloud Storage, Azure Blob Storage, SSH servers, or even local network drives. This flexibility allows teams to leverage existing storage investments while maintaining the collaboration workflows Git enables. DVC also optimizes file operations using reflinks or hardlinks where the file system supports them, avoiding unnecessary data copying when switching between versions.
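Connecting a project to such storage usually takes two commands. The sketch below wraps them in a hypothetical helper named configure_remote; the remote name and URL passed to it are placeholders, and credentials for the chosen backend are assumed to be configured already.

```shell
# Hypothetical helper: register a default remote and upload cached data.
configure_remote() {
    remote_name=$1
    remote_url=$2
    # -d marks this remote as the project default in .dvc/config.
    dvc remote add -d "$remote_name" "$remote_url" || return 1
    # Upload cached data to the default remote.
    dvc push
}
```

For example, configure_remote storage s3://example-bucket/dvcstore would point the project at a (made-up) S3 bucket. The same call accepts other supported URL schemes or a path on a mounted network drive. Committing the updated .dvc/config afterwards shares the remote with collaborators.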
Cross-Platform Installation
Installing DVC requires minimal setup, though preparation depends on your operating system and preferred installation method. Python 3.9 or higher is necessary for the latest DVC version, and while Git is optional, DVC's versioning features require a Git repository to function fully.
The most universal installation method uses pip. We strongly recommend creating a virtual environment first to isolate dependencies:
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install dvc
For specific cloud storage support, install optional dependencies using bracket notation. For Amazon S3 support, use pip install "dvc[s3]", which installs the boto3 library alongside DVC. Other storage options include [azure], [gdrive], [gs] for Google Cloud Storage, [oss], [ssh], and [hdfs]. Use [all] to include every supported remote type.
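Several extras can be combined in one requirement by separating them with commas inside the brackets, which is standard pip syntax. The helper below, with the hypothetical name dvc_requirement, is a small sketch that builds such a requirement string from a list of remote types.

```shell
# Hypothetical helper: build a pip requirement such as dvc[s3,gs]
# from the remote types given as arguments.
dvc_requirement() {
    if [ "$#" -eq 0 ]; then
        printf 'dvc\n'
        return
    fi
    # Join all arguments with commas; IFS is changed only in the subshell.
    extras=$(IFS=,; printf '%s' "$*")
    printf 'dvc[%s]\n' "$extras"
}
```

It could be used as pip install "$(dvc_requirement s3 gs)", which is equivalent to pip install "dvc[s3,gs]".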
Conda users can install DVC from the conda-forge channel. Mamba accelerates this process significantly:
conda install -c conda-forge mamba
mamba install -c conda-forge dvc
System-level installations provide cleaner integration on Linux distributions. On Debian or Ubuntu, add the official DVC repository and install via apt:
sudo apt install wget gpg
sudo mkdir -p /etc/apt/keyrings
wget -qO - https://dvc.org/deb/iterative.asc | sudo gpg --dearmor -o /etc/apt/keyrings/packages.iterative.gpg
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/packages.iterative.gpg] https://dvc.org/deb/ stable main" | sudo tee /etc/apt/sources.list.d/dvc.list
sudo apt update
sudo apt install dvc
Fedora and CentOS users can similarly configure the DVC yum repository:
sudo wget https://dvc.org/rpm/dvc.repo -O /etc/yum.repos.d/dvc.repo
sudo rpm --import https://dvc.org/rpm/iterative.asc
sudo yum update
sudo yum install dvc
Snap provides another Linux option with automatic updates:
snap install --classic dvc
Windows users have multiple pathways. The winget package manager offers the simplest approach:
winget install --id Iterative.DVC
Alternatively, Chocolatey and Scoop provide community-maintained packages:
choco install dvc
scoop install dvc
A standalone Windows installer is also available, which configures symbolic link permissions by default to optimize DVC's file operations. After installation, verify the setup by running dvc version, which displays the installed version along with platform information and supported remote storage types.
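On Unix-like shells, a quick pre-flight check can confirm that the required tools are reachable before running dvc version. The function check_tool below is a hypothetical helper, not part of DVC.

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s: found at %s\n' "$1" "$(command -v "$1")"
    else
        printf '%s: not found on PATH\n' "$1" >&2
        return 1
    fi
}
```

A typical use would be check_tool git && check_tool dvc && dvc version, which prints the version details only when both tools are installed.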
Project Initialization
Initializing a DVC project establishes the infrastructure for data versioning within your workspace. DVC expects to run within a Git repository for full functionality, though it can operate independently with limited capabilities.
Begin by creating a project directory and initializing Git:
$ mkdir example-project
$ cd example-project
$ git init
Then initialize DVC:
$ dvc init
This command creates a .dvc/ directory containing configuration files, the local cache location, and internal utilities. Specifically, it generates .dvc/config for project settings and .dvc/.gitignore to exclude cache contents from Git tracking, along with a .dvcignore file in the project root. The initialization automatically stages these files with Git, preparing them for your initial commit:
$ git status
Changes to be committed:
new file: .dvc/.gitignore
new file: .dvc/config
new file: .dvcignore
$ git commit -m "Initialize DVC"
The .dvc/ directory includes several important components. The config file stores project-specific settings such as default remotes and cache locations. The cache subdirectory (excluded from Git) stores the actual data content, addressed by a hash of the content rather than by filename. A tmp directory handles transient operations, while .gitignore prevents accidental versioning of cached data.
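To make content addressing concrete, the sketch below derives a storage path from a file's MD5 hash, using the first two hex characters as a directory name. It only illustrates the idea and is not DVC's implementation: the function name cache_path_for is hypothetical, the exact layout inside .dvc/cache varies between DVC versions, and the md5sum command is assumed to be available (it ships with GNU coreutils).

```shell
# Illustrative only: mimics content addressing, not DVC's actual code.
# Prints "<first two hash chars>/<remaining hash chars>" for a file.
cache_path_for() {
    # Reading from stdin makes md5sum print "<hash>  -"; keep the hash.
    hash=$(md5sum < "$1" | cut -d' ' -f1)
    prefix=$(printf '%s' "$hash" | cut -c1-2)
    rest=$(printf '%s' "$hash" | cut -c3-)
    printf '%s/%s\n' "$prefix" "$rest"
}
```

Two files with identical content map to the same path regardless of their names, which is why a content-addressed cache stores duplicates only once.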
DVC supports two alternative initialization modes for specific workflows. The --subdir flag initializes DVC in a subdirectory of a Git repository, useful for monorepos containing multiple projects. In this mode, DVC searches upward for the Git root but treats the current directory as the project root for commands like dvc repro or dvc pull.
The --no-scm flag initializes DVC without Git integration. This mode suits deployment automation or version control systems other than Git, though it disables features requiring Git history like dvc diff or automatic .gitignore updates. When using this mode, DVC sets core.no_scm to true in the configuration, maintaining separation even if Git is later initialized in the directory.
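The three initialization modes differ only in the flag passed to dvc init. The sketch below gathers them in one hypothetical helper, init_dvc, whose mode names (git, subdir, no-scm) are chosen here for illustration.

```shell
# Hypothetical helper: run dvc init in one of the modes described above.
init_dvc() {
    mode=${1:-git}
    case "$mode" in
        git)    dvc init ;;            # default: at the root of a Git repository
        subdir) dvc init --subdir ;;   # subdirectory of a Git repository (monorepo)
        no-scm) dvc init --no-scm ;;   # no Git integration
        *)      printf 'unknown mode: %s\n' "$mode" >&2; return 2 ;;
    esac
}
```

In a monorepo, for instance, one would change into the project subdirectory first and then call init_dvc subdir.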
Shell Autocomplete Setup
Command-line efficiency improves significantly with tab completion, which DVC supports across major shells. Completion is automatically enabled when installing via Homebrew on macOS, or through deb/rpm repositories and Snap on Linux. Other installation methods require manual configuration.
To check your current shell, run echo $0. For Bash users on macOS, first ensure completion support is installed:
$ brew install bash-completion
Then add DVC completions to the system directory:
$ dvc completion -s bash | sudo tee "$(brew --prefix)"/etc/bash_completion.d/dvc
On Debian or Ubuntu systems, install the completion infrastructure:
$ sudo apt install --reinstall bash-completion
$ dvc completion -s bash | sudo tee /etc/bash_completion.d/dvc
Zsh users should place the completion script in a directory included in $fpath, typically with the filename _dvc:
$ dvc completion -s zsh | sudo tee /usr/local/share/zsh/site-functions/_dvc
After installation, open a new terminal session to activate completions. Once active, pressing <tab> after a partial command suggests available subcommands, options, and arguments. For example, dvc r<tab> lists commands starting with 'r', while dvc add --<tab> displays available flags such as --no-commit or --to-remote.
For an enhanced Zsh experience, add the following configuration to .zshrc to enable case-insensitive matching and color hints:
zstyle ':completion:*' matcher-list 'm:{a-zA-Z}={A-Za-z}' 'r:|[._-]=* r:|=*' 'l:|=* r:|=*'
zstyle ':completion:*' list-colors ''
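The per-shell installation steps in this section can be condensed into a small lookup. The helper below is a sketch with the hypothetical name completion_target; it returns the Linux paths used in this section, which may differ on your system.

```shell
# Hypothetical helper: map a shell name to the completion file location
# used in this section. Adjust the paths for your system as needed.
completion_target() {
    case "$1" in
        bash) printf '%s\n' /etc/bash_completion.d/dvc ;;
        zsh)  printf '%s\n' /usr/local/share/zsh/site-functions/_dvc ;;
        *)    printf 'no known completion path for: %s\n' "$1" >&2; return 1 ;;
    esac
}
```

With it, installing completions for Zsh becomes dvc completion -s zsh | sudo tee "$(completion_target zsh)" >/dev/null, and the same pattern applies to Bash.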
IDE Plugin Integration
While DVC operates primarily as a command-line tool, Visual Studio Code users can access DVC functionality through a dedicated extension. The DVC Extension for VS Code integrates data versioning, experiment tracking, and pipeline visualization directly into the IDE interface.
Install the extension from the VS Code marketplace by searching for "DVC" in the Extensions panel. Once installed, the extension provides several integrated views:
The Experiments view displays tracked experiments in a table format, allowing comparison of metrics and parameters across different runs. You can sort, filter, and select specific experiments to apply to your workspace or promote to Git branches. The Plots view visualizes metrics and custom plots generated during experiments, updating in real time as new data arrives.
The extension also adds DVC commands to the Command Palette (Ctrl+Shift+P), enabling operations like pushing and pulling data, checking out specific versions, and running pipelines without leaving the editor. For repository navigation, the Source Control panel integrates DVC file status, showing which data files are modified, staged, or out of sync with the cache.
When working with DVCLive for experiment tracking, the extension captures real-time metrics and parameters, displaying training progress alongside your code. This integration proves particularly valuable for iterating on machine learning models, as it eliminates context switching between terminal commands and code editing.
To complete the setup, ensure the DVC CLI is installed and available in your system PATH. The extension detects DVC projects automatically when opening a workspace containing a .dvc/ directory or dvc.yaml file, enabling the full suite of IDE features immediately.
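Before opening a folder in VS Code, the detection criteria described above can be checked from a terminal. The function is_dvc_workspace is a hypothetical helper that mirrors those criteria; it is not the extension's actual logic.

```shell
# Hypothetical check: succeeds if the directory contains a .dvc/
# directory or a dvc.yaml file. Defaults to the current directory.
is_dvc_workspace() {
    dir=${1:-.}
    [ -d "$dir/.dvc" ] || [ -f "$dir/dvc.yaml" ]
}
```

For example, is_dvc_workspace . && command -v dvc >/dev/null confirms both that the folder looks like a DVC project and that the CLI is on the PATH.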