DVC has proved to be very good at managing ML project datasets and workflow. It works hand-in-hand with Git, and can show you the state of the datasets corresponding to any Git commit. Simply by checking out a commit, DVC can rearrange the data files to exactly match what was present at the time of that commit.
The speed is rather magical, considering that potentially many gigabytes of data are being rearranged nearly instantaneously. So I was wondering: How does DVC pull off this trick?
It turns out that DVC’s secret to rearranging data and model files as quickly as Git is to link files rather than copy them. Git, of course, copies files into place when it checks out a commit, but Git typically deals with relatively small text files, as opposed to the large binary blobs used in ML projects. Linking a file, as DVC does, is incredibly fast, making it possible to rearrange any number of files in the blink of an eye. Because nothing is copied, it also saves disk space.
The data used is two “stub-articles” dumps of the Wikipedia website retrieved from two different days. Each is about 38 GB of XML, giving us enough data to be similar to an ML project. We then set up a Git/DVC workspace where one can switch between these two files by checking out different Git commits.
$ ls -hl wikidatawiki-20190401-stub-articles.xml
-rw-r--r-- 1 david staff 35G Jul 20 21:35 wikidatawiki-20190401-stub-articles.xml
$ time cp wikidatawiki-20190401-stub-articles.xml wikidatawiki-stub-articles.xml
real 14m16.918s
...
As a baseline measure we’ll note that copying these files into the workspace took about 15 minutes apiece. Obviously an ML researcher would not have a pleasant life if it took 15 minutes to switch between commits in the repository.
Instead, we’ll explore another technique that DVC and some other tools use: linking. There are two types of links that all modern OSs support: hard links and symbolic links. A new type of link, the reflink (copy-on-write), is starting to be available in newer releases of Mac OS X and Linux (where it requires a filesystem driver that supports it). We’ll use each of them in turn and see how well they work.
DVC defaults to using reflinks and, if they are not available, falls back to file copying. It avoids using symlinks and hardlinks because of the risk of accidental cache or repository corruption. We’ll see all this in the coming sections.
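The reflink-first, copy-as-fallback policy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DVC’s actual code: the helper name `link_or_copy` is hypothetical, the reflink attempt uses the Linux `FICLONE` ioctl (the numeric fallback value is assumed to match common architectures such as x86-64 and ARM64), and on any other platform or filesystem the sketch simply copies.

```python
import os
import shutil
import tempfile

try:
    import fcntl  # POSIX only; absent on Windows
except ImportError:
    fcntl = None

# Linux FICLONE ioctl request number. Python 3.12+ exposes fcntl.FICLONE;
# the literal is an assumed fallback for x86-64 and ARM64.
FICLONE = getattr(fcntl, "FICLONE", 0x40049409)


def link_or_copy(src, dst):
    """Hypothetical helper: reflink src to dst if possible, else copy it."""
    if fcntl is not None:
        try:
            with open(src, "rb") as s, open(dst, "wb") as d:
                # Ask the filesystem to share src's data blocks with dst.
                fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return "reflink"
        except OSError:
            pass  # this platform or filesystem cannot reflink
    shutil.copyfile(src, dst)  # plain byte-for-byte copy
    return "copy"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cached = os.path.join(tmp, "cache_blob")
        working = os.path.join(tmp, "working_file")
        with open(cached, "wb") as f:
            f.write(b"dataset contents")
        print("method used:", link_or_copy(cached, working))
```

Whichever branch is taken, the workspace file ends up independent of the cached file, which is the property that makes this default a safe one.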
Versioned datasets using File Copying
The basic (or naive) strategy of copying files into place when checking out a Git tag is the equivalent of these commands:
$ rm data/wikidatawiki-stub-articles.xml
$ cp .dvc/cache/40/58c95964df74395df6e9e8e1aa6056 data/wikidatawiki-stub-articles.xml
This will run on any filesystem, but it will take a long time to copy the file, and it will consume twice the disk space.
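The cache path in the commands above, a two-character directory followed by a 30-character file name, has the shape of a 32-digit hex checksum split after its first two digits. A minimal Python sketch of such a content-addressed cache with copy-based checkout follows. It assumes an MD5 checksum, and every helper name here is hypothetical rather than part of DVC’s API.

```python
import hashlib
import os
import shutil


def file_md5(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large files never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_dir, checksum):
    """Assumed layout: <cache>/<first 2 hex digits>/<remaining 30 digits>."""
    return os.path.join(cache_dir, checksum[:2], checksum[2:])


def add_to_cache(cache_dir, path):
    """Store a file in the cache under its checksum and return the checksum."""
    checksum = file_md5(path)
    target = cache_path(cache_dir, checksum)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    if not os.path.exists(target):
        shutil.copyfile(path, target)
    return checksum


def checkout_by_copy(cache_dir, checksum, workspace_path):
    """Equivalent of the rm + cp pair: replace the workspace file by a copy."""
    if os.path.lexists(workspace_path):
        os.remove(workspace_path)
    shutil.copyfile(cache_path(cache_dir, checksum), workspace_path)
```

After `checkout_by_copy` the data exists twice on disk, once in the cache and once in the workspace, which is where the doubled disk usage comes from.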
In practice this is how it works in DVC:
Oh boy, that sure took a long time. The Git portion of this is very fast, but DVC took a very long time. This is as expected, since we told DVC to perform a file copy, and we already knew the cp command took about 15 minutes to copy the file.
As for disk space, obviously there are now two copies of the data file. There is the copy in the DVC cache directory, and the other that was copied into the workspace.
Hard links are a byproduct of the Unix model for file systems. What we think of as the filename is really just an entry in a directory file. The directory file entry contains the file name, and the “inode number” which is simply an index into the inode table. Inode table entries are data structures containing file attributes, and a pointer to the actual data. A hard link is simply two directory entries with the same inode number. In effect, it is the exact same file appearing at two locations in the filesystem. Hard links can only be made within a given mounted volume.
A symbolic link is a special file where the attributes contains a pathname specifying the target of the link. Because it contains a pathname, symbolic links can point to any file in the filesystem, even across mounted volumes or across network file systems.
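The difference between the two link types is easy to observe from Python’s standard library on a Unix-like system. The sketch below (the function and file names are illustrative only) creates one hard link and one symbolic link to the same file and reports what the filesystem says about them.

```python
import os
import tempfile


def describe_links(workdir):
    """Create a hard link and a symlink to one file and inspect them."""
    original = os.path.join(workdir, "original.dat")
    hard = os.path.join(workdir, "hard.dat")
    soft = os.path.join(workdir, "soft.dat")
    with open(original, "wb") as f:
        f.write(b"dataset contents")

    os.link(original, hard)     # second directory entry, same inode
    os.symlink(original, soft)  # separate special file holding a pathname

    return {
        # Both names resolve to the same inode number.
        "same_inode": os.stat(original).st_ino == os.stat(hard).st_ino,
        # The inode now counts two directory entries.
        "link_count": os.stat(original).st_nlink,
        # lstat inspects the symlink itself rather than its target.
        "symlink_has_own_inode": os.lstat(soft).st_ino != os.stat(original).st_ino,
        # The symlink stores the pathname it points to.
        "symlink_target": os.readlink(soft),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for key, value in describe_links(tmp).items():
            print(key, "=", value)
```

The hard link is indistinguishable from the original name, while the symbolic link is a distinct small file whose only content is a path.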
The equivalent commands in this case are:
$ rm data/wikidatawiki-stub-articles.xml
$ ln .dvc/cache/40/58c95964df74395df6e9e8e1aa6056 data/wikidatawiki-stub-articles.xml
Then to perform the hard link scenario:
The timing is similar for both cases.
Two seconds (or less) is sure a lot faster than the 16 minutes or so it took to copy the files. It happens so fast we use the word “instantaneous”. File links are that much faster than copying files around. This is a big win.
As for disk space consumption, consider this:
$ ls -l data/
total 8
lrwxr-xr-x 1 david staff 70 Jul 21 18:43 wikidatawiki-stub-articles.xml -> /Users/david/dvc/linktest/.dvc/cache/2c/82d0130fb32a17d58e2b5a884cd3ce
The link takes up a negligible amount of disk space. But there is a wrinkle to consider.
Versioned Datasets using Reflinks
Hard links and symbolic links have been in the Unix/Linux ecosystem for a long time. I first used symbolic links in 1984 on 4.2BSD, and hard links date back even further. Both hard links and symbolic links can be used to do what DVC does, namely quickly rearranging data files in a working directory. But surely in the last 35+ years there has been an advancement or two in file systems?
Indeed there has: the copy-on-write file cloning features in Mac OS X and Linux are examples.
Copy-on-write links, a.k.a. reflinks, offer a way to quickly link a file into the workspace while avoiding any risk of polluting the cache. The hard link and symbolic link approaches are big wins because of their speed, but they leave the cache exposed: editing the workspace file edits the cached file as well. With reflinks, the copy-on-write behavior means that if someone were to modify the data file, the copy in the cache would not be polluted. That means we’d have the same performance advantage as traditional links, with the added advantage of data safety.
Maybe, like me, you don’t know what a reflink is. A reflink duplicates a file on disk such that the “copy” is a “clone”, similar to a hard link. Unlike a hard link, where two directory entries refer to the same inode entry, with reflinks there are two inode entries, and it is the data blocks that are shared. It happens as quickly as a hard link, but there is an important difference. Any write to the cloned file causes new data blocks to be allocated to hold that data. The cloned file appears changed, and the original file is unmodified. The clone is perfectly suitable for the case of duplicating a dataset, allowing modifications to the dataset without polluting the original dataset.
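The cache-pollution risk can be reproduced with a short Python experiment. The sketch below edits a workspace file in place and then checks whether the cached file still holds its original bytes. The helper names are hypothetical, and because the standard library has no portable reflink call, a plain copy stands in for the copy-on-write case: after a write, a reflinked file leaves the cache untouched just as an independent copy does.

```python
import os
import shutil
import tempfile


def modify_in_place(path):
    """Overwrite the first bytes of a file without replacing the file."""
    with open(path, "r+b") as f:
        f.write(b"EDITED")


def cache_survives_edit(make_link):
    """Link cache -> workspace with make_link, edit the workspace file,
    and report whether the cached bytes are still intact."""
    with tempfile.TemporaryDirectory() as tmp:
        cached = os.path.join(tmp, "cache_blob")
        working = os.path.join(tmp, "working_file")
        with open(cached, "wb") as f:
            f.write(b"original data")
        make_link(cached, working)
        modify_in_place(working)
        with open(cached, "rb") as f:
            return f.read() == b"original data"


if __name__ == "__main__":
    print("hard link keeps cache intact:", cache_survives_edit(os.link))
    print("symlink keeps cache intact:  ", cache_survives_edit(os.symlink))
    print("copy keeps cache intact:     ", cache_survives_edit(shutil.copyfile))
```

With a hard link or a symbolic link the in-place edit lands in the cached file, while with an independent copy the cache survives.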
Like with hard links, reflinks only work within a given mounted volume.
Reflinks are readily available on Mac OS X and, with a little work, are available on Linux. The feature is supported only on certain file systems:
- Linux: BTRFS, XFS, OCFS2
- Mac OS X: APFS
APFS is supported out of the box on macOS, and Apple strongly suggests we use it. For Linux, XFS is the easiest to set up, as shown in this tutorial.
For APFS the equivalent commands are:
$ rm data/wikidatawiki-stub-articles.xml
$ cp -c .dvc/cache/40/58c95964df74395df6e9e8e1aa6056 data/wikidatawiki-stub-articles.xml
With the -c option, the macOS cp command uses the clonefile system call. The clonefile function sets up a reflink clone of the named file. On Linux the cp command uses the --reflink option to the same end.
Then to run the test:
The performance is, as expected, similar to the hard links and symbolic links strategies. What we learn is that reflinks are about as fast as hard links and symlinks, and disk space consumption is again negligible.
The cool thing about this kind of link is that, even though the files share their data, you can edit the workspace file without modifying the file in the cache. Only the changed data is copied under the hood.
On Linux the same scenario runs with similar performance.
We’ve learned something about how to efficiently manage a large dataset, as is typical in machine learning projects. If we need to revisit any development stage in such projects, we’ll want a system for efficiently rearranging large datasets to match each stage.
We’ve seen it is possible to keep a list of files that were present at any Git commit. With that list we can link or copy those files into the working directory. That is exactly how DVC manages data files in a project. Using links, rather than file copying, lets us quickly and efficiently switch between revisions of the project.
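The idea of a per-commit file list can be sketched in Python as a manifest that maps workspace paths to cache checksums. Everything here is illustrative: the manifest format, the helper names, and the placeholder checksums are assumptions, and hard links are used only because they are in the standard library (a reflink or a copy could be substituted at the marked line).

```python
import os
import tempfile


def cache_path(cache_dir, checksum):
    """Assumed layout: first two hex digits as directory, then the rest."""
    return os.path.join(cache_dir, checksum[:2], checksum[2:])


def store(cache_dir, checksum, data):
    """Demo helper: place raw bytes in the cache under a given checksum."""
    path = cache_path(cache_dir, checksum)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)


def checkout(manifest, cache_dir, workspace):
    """Make the workspace match a manifest of {relative path: checksum}."""
    for rel_path, checksum in manifest.items():
        target = os.path.join(workspace, rel_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        if os.path.lexists(target):
            os.remove(target)  # removes the workspace name, not the cache file
        os.link(cache_path(cache_dir, checksum), target)  # swap in reflink/copy here


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = os.path.join(tmp, "cache")
        workspace = os.path.join(tmp, "workspace")
        rev_a = {"data/stub-articles.xml": "aa" + "0" * 30}  # placeholder checksum
        rev_b = {"data/stub-articles.xml": "bb" + "1" * 30}  # placeholder checksum
        store(cache, rev_a["data/stub-articles.xml"], b"dump from day one")
        store(cache, rev_b["data/stub-articles.xml"], b"dump from day two")
        for manifest in (rev_a, rev_b):
            checkout(manifest, cache, workspace)
            with open(os.path.join(workspace, "data/stub-articles.xml"), "rb") as f:
                print(f.read())
```

Switching revisions costs one unlink and one link per file regardless of file size, which is why the checkout feels instantaneous.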
Reflinks are an interesting new feature for file systems, and they are perfect for this scenario. Reflinks are as fast to create as traditional hard links and symbolic links, letting us quickly duplicate a file, or a whole directory structure, while consuming negligible extra space. And, since reflinks keep modifications in the linked file, they give us many more possibilities than traditional links. In this article we examined using reflinks in machine learning projects, but they are used in other sorts of applications. For example, some database systems utilize them to manage data on disk more efficiently. Now that you’ve learned about reflinks, how will you go about using them?