What is Data Version Control?

One of the wonders of software development is the invention of Git. With Git, you can manage different versions of your code base. The benefit of this is that you can introduce and test changes in the code with the assurance that if things go wrong you can always revert to the previous working version.

Another benefit of Git is ease of collaboration. A project can be organized around a central repository, and each developer or subteam working on a particular feature can push changes to that repository through a dedicated branch. Hosting platforms such as GitHub and GitLab add to this by letting teams manage project repositories remotely.

Data scientists and engineers have the same needs for their data. They need to have a way to manage different versions of data and collaborate. Git, technically speaking, can do the job. However, it’s not ideal for several reasons:

  • Pushing and pulling massive amounts of data can be a bottleneck.
  • Reviewing changes can be cumbersome, again because of the sheer volume of data.
  • Every local and remote copy of the repository stores the data along with its history, which eats up disk space.

This is where Data Version Control (DVC) comes in. Simply put, DVC is a data-focused counterpart to Git. In fact, its features and workflows closely mirror Git's.

Git keeps everything about each version inside the repository, whereas DVC keeps only information (metadata) about each version of the data. The actual data can be hosted remotely on a data storage platform.
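
To make the metadata idea concrete, here is a minimal Python sketch of the general technique: the data file is copied into a separate storage location under its checksum, and only a small pointer file describing it would be kept under version control. This is an illustration, not DVC's actual implementation or file format. The function names (track, restore), the ".meta.json" suffix, the choice of SHA-256, and the local "storage" directory standing in for a remote platform are all assumptions made for the example.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never loaded whole into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, storage_dir: Path) -> Path:
    """Copy the data into storage and write a small pointer file next to it.

    Hypothetical helper: the pointer file is the only thing that would be
    committed to Git; the storage directory stands in for remote storage.
    """
    checksum = file_sha256(data_file)
    storage_dir.mkdir(parents=True, exist_ok=True)
    stored = storage_dir / checksum
    if not stored.exists():  # identical content is stored only once
        shutil.copy2(data_file, stored)
    meta = {
        "path": data_file.name,
        "sha256": checksum,
        "size": data_file.stat().st_size,
    }
    pointer = data_file.with_name(data_file.name + ".meta.json")
    pointer.write_text(json.dumps(meta, indent=2))
    return pointer


def restore(pointer: Path, storage_dir: Path) -> Path:
    """Recreate the data file from storage using only the pointer file."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(storage_dir / meta["sha256"], target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "train.csv"
        data.write_text("id,label\n1,cat\n2,dog\n")

        pointer = track(data, root / "storage")
        print(pointer.read_text())  # small metadata, regardless of data size

        data.unlink()  # simulate a fresh checkout without the data
        restore(pointer, root / "storage")
        print(data.read_text())
```

The pointer file stays a few lines long however large the data grows, which is why committing it to Git avoids the bottlenecks listed above.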

