How to Get Started with Data Version Control (DVC) | Hacker Noon


What is Data Version Control?

One of the wonders of software development is the invention of Git. With Git, you can manage different versions of your code base. The benefit of this is that you can introduce and test changes in the code with the assurance that if things go wrong you can always revert to the previous working version.

Another benefit of Git is ease of collaboration. A project can be organized around a central repository. Each developer or subteam working on a particular feature can push changes into that repository through a specific branch. On top of this, platforms such as GitHub and GitLab let the project repositories be managed remotely.

Data scientists and engineers have the same needs for their data. They need to have a way to manage different versions of data and collaborate. Git, technically speaking, can do the job. However, it’s not ideal for several reasons:

  • Pushing and pulling massive amounts of data can be a bottleneck.
  • Reviewing changes can be cumbersome (again because of the sheer quantity of data).
  • Every copy of the repository, local or remote, carries the full data history and clogs up disk space.

This is where Data Version Control (DVC) comes in. Simply put, DVC is a data-focused companion to Git. Its commands and workflows are deliberately modeled on Git's, so it will feel familiar if you already use Git.

Whereas a Git repository keeps the full content of every version, DVC keeps only information (metadata) about each version of the data. The actual data can be hosted remotely on a data storage platform.
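To make the idea concrete, here is a minimal conceptual sketch in plain shell. It is not DVC's actual file format or implementation; the sample data, the file names, and the `.meta` suffix are made up for illustration, and `md5sum` assumes a Linux system with GNU coreutils.

```shell
# Conceptual illustration only: this mimics the idea of a pointer file,
# not DVC's real file format or internals.
workdir="$(mktemp -d)"
printf '{"names": ["Ada", "Grace"]}\n' > "$workdir/names.json"   # hypothetical data file

# Fingerprint the data: the hash changes whenever the content changes.
hash="$(md5sum "$workdir/names.json" | cut -d ' ' -f 1)"
size="$(wc -c < "$workdir/names.json" | tr -d ' ')"

# Write a tiny metadata file. Something like this is what gets versioned,
# while the data itself can live in external storage.
printf 'path: names.json\nmd5: %s\nsize: %s\n' "$hash" "$size" > "$workdir/names.json.meta"
cat "$workdir/names.json.meta"
```

Only the small metadata file would need to go into version control; the data file could sit anywhere and be matched back to a version by its hash.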

What follows is an overview of how to start using DVC for your data science/engineering projects. By no means is this intended to be a comprehensive introduction/manual. But I hope this is enough to help you hit the ground running with DVC.

Getting Started With Data Version Control


DVC works on Windows, Mac, and Linux. The official documentation page provides more detailed instructions on installation. For our purposes here, though, we’ll be demonstrating installing it on Linux.

Installing DVC is simple. One common route is Python's package manager: open a terminal and run `pip install dvc`.
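As a quick check that the installation worked, here is a small sketch. It assumes a pip-based install (DVC is published on PyPI as `dvc`); the variable name `dvc_status` is just for illustration, and the official documentation lists other installers if pip is not an option.

```shell
# Check whether the dvc command is on the PATH and report its version.
# Assumption: DVC was installed with something like "python3 -m pip install dvc".
if command -v dvc >/dev/null 2>&1; then
  dvc_status="installed ($(dvc --version))"
else
  dvc_status="not installed (try: python3 -m pip install dvc)"
fi
echo "dvc: $dvc_status"
```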

Initialize DVC

DVC is used alongside a Git-tracked project, so before you start, make sure your project has already been initialized with Git.

Inside such a Git-tracked repository, initialize DVC by running `dvc init`.
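A minimal sketch of this step, run in a throwaway directory so nothing real is touched. The temporary path is hypothetical; in practice you would run `dvc init` at the root of your own repository. The guard simply skips the DVC step on a machine where DVC is not installed.

```shell
# Hypothetical throwaway project directory; use your own repository in practice.
repo="$(mktemp -d)"
cd "$repo" || exit 1

# DVC needs an existing Git repository, so create one first.
git init --quiet

# Initialize DVC inside it (skipped if dvc is not installed).
if command -v dvc >/dev/null 2>&1; then
  dvc init --quiet
  ls -A .dvc          # shows what DVC created, including its config file
else
  echo "dvc not installed; skipping 'dvc init'"
fi
```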

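You will see that a few files are created. Checking in with Git by running `git status`, we should find them listed as new files: typically `.dvc/config` and `.dvc/.gitignore`, plus `.dvcignore` in recent DVC versions.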

We now need to commit these files to Git, for example with `git commit -m "Initialize DVC"`.
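Put together, checking and committing the new files might look like the sketch below. The temporary repository, the commit message, and the `Demo User` identity are placeholders chosen for illustration; in a real project you would rely on your own Git configuration.

```shell
repo="$(mktemp -d)"            # hypothetical throwaway repository
cd "$repo" || exit 1
git init --quiet

if command -v dvc >/dev/null 2>&1; then
  dvc init --quiet
  git add .dvc                 # harmless if dvc init already staged these files
  git status --short           # lists the new DVC files before committing

  # The identity flags are placeholders so the commit works on an unconfigured machine.
  git -c user.name="Demo User" -c user.email="demo@example.com" \
      commit --quiet -m "Initialize DVC"
  git log --oneline
else
  echo "dvc not installed; nothing to commit"
fi
```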

At this point, we have successfully installed and initialized DVC. We can now use it to track our data and changes to it.

Some basic commands to be familiar with

In this section, we will just be covering the two most basic tasks you must be familiar with in using DVC – tracking data and accessing/reading data.

You can do much more with DVC than this, and the official documentation is the best guide to all the other features.

Start Tracking Data

Every piece of data tracked by DVC will have its information stored in a .dvc file. This file has information specific to the data stored but not the data itself. Git tracks this .dvc file (i.e., new versions of the data create new versions of this same file).

You can run the `dvc add` command to start tracking a specific data file or a whole directory. For instance, to track a data file called `names.json` inside the `data` directory of your repository, run `dvc add data/names.json`.

This creates a `data/names.json.dvc` file.

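You can now commit this new file to the Git repository, for example with `git add data/names.json.dvc` followed by `git commit`. DVC also writes a `data/.gitignore` entry so that Git ignores the data file itself.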
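Here is a sketch of the whole tracking step from start to finish, assuming a throwaway repository and a tiny stand-in for `names.json`. The sample content, commit message, and `Demo User` identity are invented for illustration.

```shell
repo="$(mktemp -d)"            # hypothetical throwaway repository
cd "$repo" || exit 1
git init --quiet
mkdir data
printf '["Ada", "Grace"]\n' > data/names.json    # stand-in for real data

if command -v dvc >/dev/null 2>&1; then
  dvc init --quiet
  dvc add data/names.json      # writes data/names.json.dvc and data/.gitignore
  cat data/names.json.dvc      # small metadata file describing the data

  # Version the pointer file (not the data) in Git.
  # The commit also picks up the files staged by dvc init in this fresh repository.
  git add data/names.json.dvc data/.gitignore
  git -c user.name="Demo User" -c user.email="demo@example.com" \
      commit --quiet -m "Track data/names.json with DVC"
else
  echo "dvc not installed; skipping 'dvc add'"
fi
```

After this, Git knows about `data/names.json.dvc` but not about `data/names.json`, which is exactly the split described above.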

Remote Repository for Data

While GitHub alone is enough for smaller projects, larger projects may require a remote repository built for data versioning. A tool like DAGsHub fills that role. Think of DAGsHub as a GitHub for data science. The beautiful thing about DAGsHub is that it works out of the box: you don't need much configuration before you can start pushing your data to it.

First, of course, you need to go to DAGsHub and create an account.

Next, just like in GitHub, you need to create a new repository.

All the images in this section have been sourced from DAGsHub’s documentation.

It will then open up a dialog to input information about this repository, such as name and description.

If you have an existing GitHub repository, you may connect it to your DAGsHub repository. You can do this by clicking on Create in the DAGsHub navigation.

In the option that follows, just fill in the necessary details about your GitHub repository, and your two repos are connected.

Pushing your Data to the Cloud

All we need to do now is make sure we have a place to store the actual data. DVC and DAGsHub store the versioning information and metadata, but the data itself can be stored somewhere else.

For this example, we will be using Google Drive to make it simple. First, you have to make sure you have created a folder in Google Drive.

Next, register that folder as a DVC remote from your terminal, using `dvc remote add` with a `gdrive://` URL that contains the folder's ID.

This command adds information about the new remote storage to the `.dvc/config` file. You must therefore commit this file to Git, for example with `git commit .dvc/config -m "Configure remote storage"`.
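A sketch of configuring the remote and committing the result. The remote name `storage`, the placeholder folder ID, the commit message, and the `Demo User` identity are all assumptions made for illustration; the real ID is the last part of your Google Drive folder's URL. Depending on your DVC version, pushing to Google Drive later may also require installing extra Google Drive support, so check the official documentation.

```shell
repo="$(mktemp -d)"                     # hypothetical throwaway repository
cd "$repo" || exit 1
git init --quiet

# Placeholder: replace with the ID from your Drive folder's URL.
folder_id="REPLACE_WITH_FOLDER_ID"
remote_url="gdrive://$folder_id"

if command -v dvc >/dev/null 2>&1; then
  dvc init --quiet
  # "storage" is an arbitrary remote name; -d makes it the default remote.
  dvc remote add -d storage "$remote_url"
  cat .dvc/config                       # the remote is recorded here

  git add .dvc/config
  git -c user.name="Demo User" -c user.email="demo@example.com" \
      commit --quiet -m "Configure DVC remote storage"
else
  echo "dvc not installed; would run: dvc remote add -d storage $remote_url"
fi
```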

Then, to send our data to the cloud storage, simply run `dvc push`.

Later on, if you want to pull the data from the storage, run `dvc pull`.
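To try the push and pull round trip without any cloud credentials, here is a sketch that swaps Google Drive for a local directory acting as the remote. This is a stand-in, not the Google Drive setup described above; the remote name `storage`, the temporary paths, and the sample data are invented for illustration. The `dvc push` and `dvc pull` commands themselves are the same whichever remote is configured.

```shell
repo="$(mktemp -d)"                 # hypothetical throwaway repository
store="$(mktemp -d)"                # local directory standing in for cloud storage
cd "$repo" || exit 1
git init --quiet
mkdir data
printf '["Ada", "Grace"]\n' > data/names.json

if command -v dvc >/dev/null 2>&1; then
  dvc init --quiet
  dvc add data/names.json
  dvc remote add -d storage "$store"

  dvc push                          # upload the tracked data to the remote

  # Simulate a fresh checkout: drop the local copy and the local cache.
  rm -rf data/names.json .dvc/cache
  dvc pull                          # download the data again from the remote
  cat data/names.json
else
  echo "dvc not installed; skipping push/pull demo"
fi
```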

DAGsHub automatically detects your data's remote location. Once your remote storage and DAGsHub are connected, the data can be accessed simply through links.

There you have it. These are just some of the tasks you can perform with DVC. Please note that it’s still used in conjunction with Git.

Next Steps

DVC is such a marvelous tool for data scientists and engineers. It’s, therefore, essential to master it like you’d master Git or other development tools. So the best step forward is to explore DVC’s official documentation to get a grip on the different commands and features.

