When working on a software project, it is standard practice to put the code under version control from day one, and the benefits are well known in the software community: every change to the code is tracked in a repository. If a mistake is made, developers can travel back in time and compare earlier versions of the code to fix the problem with minimal disruption to the rest of the team. Code is the most precious asset of a software project, and for that reason it must be protected at all costs!

Well, in Data Science projects the data can also be considered the crown jewels, so why don’t we, as Data Scientists, treat it as the most precious thing on earth and put it under version control?

For those familiar with Git, you might be thinking, _“Git cannot handle large files and directories… at least not with the same performance as it handles small code files. So how can I version control my data in the same way we version control code?”_ Well, this is now possible, and it’s almost as easy as typing git clone (plus one extra DVC command, dvc pull) to see the data files and ML model files saved in the workspace. All this magic can be achieved with DVC.

Quick start with DVC


Data Version Control (DVC)

First things first: we have to get DVC installed on our machines. It’s pretty straightforward, and you can do it by following these steps.
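Before going further, it can help to confirm that both tools are actually on the PATH. The snippet below is only a small sketch: the `check_tools` function is a hypothetical helper written for this post, and installing from PyPI with pip is just one of several installation routes.

```shell
# Hypothetical helper (not part of DVC): report whether git and dvc are available.
check_tools() {
  for tool in git dvc; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      # One common option for DVC is: pip install dvc
      echo "$tool: missing"
    fi
  done
}

check_tools
```

If dvc is reported as missing, install it first and run the check again.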

As I’ve already mentioned, data version control tools such as DVC make it possible to build large projects while keeping their pipelines reproducible. With DVC it’s very simple to add datasets to a Git repository, and by simple I mean as easy as typing the line below:

dvc add path/to/dataset
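To see what that command produces, here is a minimal sketch that runs it in a throwaway directory. It assumes git and dvc are already installed; the file `data/data.csv`, the commit identity and the commit message are all made up for illustration.

```shell
# Sketch only: assumes git and dvc are installed; every name here is hypothetical.
demo="$(mktemp -d)"                 # scratch directory, so nothing real is touched
cd "$demo"
git init -q                         # DVC works on top of a Git repository
dvc init -q                         # creates the .dvc/ directory with DVC's config

mkdir data
printf 'id,value\n1,42\n' > data/data.csv   # tiny stand-in for a real dataset

dvc add data/data.csv               # writes data/data.csv.dvc and updates data/.gitignore

cat data/data.csv.dvc               # small YAML pointer file holding the hash and path

# Git tracks only the pointer file and the .gitignore entry, not the data itself
git add data/data.csv.dvc data/.gitignore
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Track dataset with DVC"
git ls-files                        # data/data.csv itself does not appear in this list
```

The point of the exercise is the last line: the repository stays small because only the pointer file is committed, while the data lives in DVC's cache.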

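Regardless of the size of the dataset, it is now tracked: DVC stores the data in its own cache and writes a small path/to/dataset.dvc pointer file, and it is this file that gets committed to Git. Assuming that we also want to push the dataset to the cloud, and that a remote storage has already been configured with dvc remote add, that is possible with the command below: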

dvc push path/to/dataset.dvc

Out of the box, DVC supports many cloud storage services, such as Amazon S3, Google Cloud Storage, Azure Blob Storage and Google Drive. And since the dataset was pushed to the cloud through the version control system, if I clone the project onto another machine I’m able to download the data, or any other artifact, using the following command:

dvc pull
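The whole round trip can be tried locally before touching a real cloud account. The sketch below assumes git and dvc are installed, and uses a plain local folder as a stand-in for the cloud storage; the remote name `storage`, the file `data.csv` and the directory names are hypothetical. With a real backend, the URL passed to `dvc remote add` would point at the bucket instead of a local path.

```shell
# Sketch only: assumes git and dvc are installed; every path and name is hypothetical.
work="$(mktemp -d)"
mkdir "$work/storage"                        # local folder standing in for a cloud bucket

# --- "machine A": create the project, track a dataset, push it ---
mkdir "$work/project" && cd "$work/project"
git init -q
dvc init -q
printf 'id,value\n1,42\n' > data.csv         # tiny stand-in for a real dataset
dvc add data.csv
dvc remote add -d storage "$work/storage"    # -d marks it as the default remote
git add data.csv.dvc .gitignore .dvc/config
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Track data.csv with DVC"
dvc push                                     # uploads the cached data to the remote

# --- "machine B": clone the code, then fetch the data ---
git clone -q "$work/project" "$work/clone"
cd "$work/clone"
ls data.csv 2>/dev/null || echo "data.csv is not in Git, only its pointer file is"
dvc pull                                     # downloads data.csv from the remote
cat data.csv
```

Note that git clone on its own brings only the code and the pointer file; the data shows up in the workspace after dvc pull.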

Well, now that you know how to get started with DVC, I suggest you go and explore the tool, or similar ones, further. Version control tools should be your best friend as a Data Scientist, as they allow you not only to version datasets but also to create reproducible pipelines, while keeping all development traceable.

If this hasn’t convinced you yet, next I’ll tell you why **_you must start version controlling_** your data!


4 Reasons Why Data Scientists Should Version Data