What is Data Version Control?
One of the wonders of software development is the invention of Git. With Git, you can manage different versions of your code base. The benefit of this is that you can introduce and test changes in the code with the assurance that if things go wrong you can always revert to the previous working version.
Another benefit of Git is ease of collaboration. A project can be organized around a central repository. Each developer or subteam working on a particular feature can push changes into that repository through a specific branch. On top of this, platforms such as GitHub and GitLab let teams host and manage project repositories remotely.
Data scientists and engineers have the same needs for their data. They need to have a way to manage different versions of data and collaborate. Git, technically speaking, can do the job. However, it’s not ideal for several reasons:
- Pushing and pulling massive amounts of data can be a bottleneck.
- Reviewing changes can be cumbersome, again because of the sheer quantity of data.
- Every local and remote copy of the repository holds the full data history, which clogs up disk space.
This is where Data Version Control (DVC) comes in. Simply put, DVC is a data-focused counterpart to Git. In fact, its commands and workflows closely mirror Git’s.
A Git repository keeps the full contents of every version. DVC, by contrast, keeps only information (metadata) about each version of the data in the repository. The actual data can be hosted remotely on a data storage platform.
What follows is an overview of how to start using DVC for your data science/engineering projects. By no means is this intended to be a comprehensive introduction/manual. But I hope this is enough to help you hit the ground running with DVC.
Getting Started With Data Version Control
DVC works on Windows, macOS, and Linux. The official documentation provides more detailed installation instructions. Here, we’ll demonstrate the installation on Linux.
Installing DVC is simple. Open a terminal and run the command below:
DVC is designed to be used alongside a Git-tracked project. So before using DVC, make sure your project has already been initialized with Git.
Inside such a Git-tracked repository, initialize DVC by running:
DVC creates a handful of internal files (the exact set depends on your DVC version). Checking in with Git, we should see them listed as new files:
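If the project is not under Git yet, a minimal setup could look like the following. The directory name `my-project` is hypothetical:

```shell
# Create a project directory (hypothetical name) and put it under Git.
mkdir -p my-project
cd my-project
git init
```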
We now need to commit these files to Git:
At this point, we have successfully installed and initialized DVC. We can now use it to track our data and changes to it.
Some Basic Commands to Be Familiar With
In this section, we will cover the two most basic tasks in DVC: tracking data and accessing/reading data.
While you can do so much more with DVC, the official documentation can better help you navigate all the other features.
Start Tracking Data
Every piece of data tracked by DVC has its information stored in a .dvc file. This file holds metadata about the data (such as a hash of its contents), not the data itself. Git tracks this .dvc file (i.e., new versions of the data create new versions of this same file).
You can run the `dvc add` command to start tracking a specific data file or a whole directory. For instance, to track a data file called `names.json` in your repository’s `data` directory, run the following:
This creates a `data/names.json.dvc` file.
You can now commit this new file into the Git repository:
Remote Repository for Data
While using GitHub alone is fine for smaller projects, larger projects may require a remote repository for data versioning. For that, we need a tool like DAGsHub. Think of DAGsHub as a GitHub for data science. The beautiful thing about DAGsHub is that it works right out of the box. You don’t need much configuration before you can start pushing your data into it.
First, of course, you need to go to DAGsHub and create an account.
Next, just like in GitHub, you need to create a new repository:
All the images in this section have been sourced from DAGsHub’s documentation.
It will then open up a dialog to input information about this repository, such as name and description.
If you have an existing GitHub repository, you may connect it to your DAGsHub repository. You can do this by clicking on Create in the DAGsHub navigation:
In the dialog that follows, just fill in the necessary details about your GitHub repository, and your two repos are connected.
Pushing your Data to the Cloud
All we need to do now is make sure we have a place to store the actual data. DVC and DAGsHub store the versioning information and metadata, but the actual data can live somewhere else.
For this example, we will be using Google Drive to make it simple. First, you have to make sure you have created a folder in Google Drive.
Next, run the following command in your terminal:
This command adds information about the new remote storage to the `.dvc/config` file, so you should then commit this file to Git:
Then to send our data to the cloud storage, simply run:
And later on, if you want to pull the data from the storage, you may run:
DAGsHub automatically detects your data’s remote location. Once your remote storage and DAGsHub are connected, they can simply be accessed through links.
There you have it. These are just some of the tasks you can perform with DVC. Please note that it’s still used in conjunction with Git.
DVC is such a marvelous tool for data scientists and engineers. It’s, therefore, essential to master it like you’d master Git or other development tools. So the best step forward is to explore DVC’s official documentation to get a grip on the different commands and features.