How To Use Git: A Reference Guide

How To Use Git: A Reference Guide

Git is a distributed version-control system for tracking changes in source code during software development. It is designed for coordinating work among programmers, but it can be used to track changes in any set of files.


Teams of developers and open-source software maintainers typically manage their projects through Git, a distributed version control system that supports collaboration.

This cheat sheet-style guide provides a quick reference to commands that are useful for working and collaborating in a Git repository. To install and configure Git, be sure to read “How To Contribute to Open Source: Getting Started with Git.”

How to Use This Guide:

  • This guide is in cheat sheet format with self-contained command-line snippets.
  • Jump to any section that is relevant to the task you are trying to complete.
  • When you see <span class="highlight">highlighted text</span> in this guide’s commands, keep in mind that this text should refer to the commits and files in your own repository.
Set Up and Initialization

Check your Git version with the following command, which will also confirm that Git is installed.

git --version

You can initialize your current working directory as a Git repository with <span class="highlight">highlighted text</span>.

git --version

To copy an existing Git repository hosted remotely, you’ll use <span class="highlight">highlighted text</span> with the repo’s URL or server location (in the latter case you will use <span class="highlight">highlighted text</span>).

git --version

Show your current Git directory’s remote repository.

git --version

For a more verbose output, use the <span class="highlight">highlighted text</span> flag.

git --version

Add the Git upstream, which can be a URL or can be hosted on a server (in the latter case, connect with <span class="highlight">highlighted text</span>).

git --version


When you’ve modified a file and have marked it to go in your next commit, it is considered to be a staged file.

Check the status of your Git repository, including files added that are not staged, and files that are staged.

git --version

To stage modified files, use the <span class="highlight">highlighted text</span> command, which you can run multiple times before a commit. If you make subsequent changes that you want included in the next commit, you must run <span class="highlight">highlighted text</span> again.

You can specify the specific file with <span class="highlight">highlighted text</span>.

git --version

With <span class="highlight">highlighted text</span> you can add all files in the current directory including files that begin with a <span class="highlight">highlighted text</span>.

git --version

You can remove a file from staging while retaining changes within your working directory with <span class="highlight">highlighted text</span>.

git --version


Once you have staged your updates, you are ready to commit them, which will record changes you have made to the repository.

To commit staged files, you’ll run the <span class="highlight">highlighted text</span> command with your meaningful commit message so that you can track commits.

git --version

You can condense staging all tracked files with committing them in one step.

git --version

If you need to modify your commit message, you can do so with the <span class="highlight">highlighted text</span> flag.

git --version


A branch in Git is a movable pointer to one of the commits in the repository, it allows you to isolate work and manage feature development and integrations. You can learn more about branches by reading the Git documentation.

List all current branches with the <span class="highlight">highlighted text</span> command. An asterisk (<span class="highlight">highlighted text</span>) will appear next to your currently active branch.

git --version

Create a new branch. You will remain on your currently active branch until you switch to the new one.

git --version

Switch to any existing branch and check it out into your current working directory.

git --version

You can consolidate the creation and checkout of a new branch by using the <span class="highlight">highlighted text</span> flag.

git --version

Rename your branch name.

git --version

Merge the specified branch’s history into the one you’re currently working in.

git --version

Abort the merge, in case there are conflicts.

git --version

You can also select a particular commit to merge with <span class="highlight">highlighted text</span> with the string that references the specific commit.

git --version

When you have merged a branch and no longer need the branch, you can delete it.

git --version

If you have not merged a branch to master, but are sure you want to delete it, you can force delete a branch.

git --version

Collaborate and Update

To download changes from another repository, such as the remote upstream, you’ll use <span class="highlight">highlighted text</span>.

git --version

Merge the fetched commits.

git --version

Push or transmit your local branch commits to the remote repository branch.

git --version

Fetch and merge any commits from the tracking remote branch.

git --version


Display the commit history for the currently active branch.

git --version

Show the commits that changed a particular file. This follows the file regardless of file renaming.

git --version

Show the commits that are on one branch and not on the other. This will show commits on <span class="highlight">highlighted text</span> that are not on <span class="highlight">highlighted text</span>.

git --version

Look at reference logs (<span class="highlight">highlighted text</span>) to see when the tips of branches and other references were last updated within the repository.

git --version

Show any object in Git via its commit string or hash in a more human-readable format.

git --version

Show Changes

The <span class="highlight">highlighted text</span> command shows changes between commits, branches, and more. You can read more fully about it through the Git documentation.

Compare modified files that are on the staging area.

git --version

Display the diff of what is in <span class="highlight">highlighted text</span> but is not in <span class="highlight">highlighted text</span>.

git --version

Show the diff between two specific commits.

git --version


Sometimes you’ll find that you made changes to some code, but before you finish you have to begin working on something else. You’re not quite ready to commit the changes you have made so far, but you don’t want to lose your work. The <span class="highlight">highlighted text</span> command will allow you to save your local modifications and revert back to the working directory that is in line with the most recent <span class="highlight">highlighted text</span> commit.

Stash your current work.

git --version

See what you currently have stashed.

git --version

Your stashes will be named <span class="highlight">highlighted text</span>, <span class="highlight">highlighted text</span>, and so on.

Show information about a particular stash.

git --version

To bring the files in a current stash out of the stash while still retaining the stash, use <span class="highlight">highlighted text</span>.

git --version

If you want to bring files out of a stash, and no longer need the stash, use <span class="highlight">highlighted text</span>.

git --version

If you no longer need the files saved in a particular stash, you can <span class="highlight">highlighted text</span> the stash.

git --version

If you have multiple stashes saved and no longer need to use any of them, you can use <span class="highlight">highlighted text</span> to remove them.

git --version

Ignoring Files

If you want to keep files in your local Git directory, but do not want to commit them to the project, you can add these files to your <span class="highlight">highlighted text</span> file so that they do not cause conflicts.

Use a text editor such as nano to add files to the <span class="highlight">highlighted text</span> file.

git --version

To see examples of <span class="highlight">highlighted text</span> files, you can look at GitHub’s <span class="highlight">highlighted text</span> template repo.


A rebase allows us to move branches around by changing the commit that they are based on. With rebasing, you can squash or reword commits.

You can start a rebase by either calling the number of commits you have made that you want to rebase (<span class="highlight">highlighted text</span> in the case below).

git --version

Alternatively, you can rebase based on a particular commit string or hash.

git --version

Once you have squashed or reworded commits, you can complete the rebase of your branch on top of the latest version of the project’s upstream code.

git --version

To learn more about rebasing and updating, you can read How To Rebase and Update a Pull Request, which is also applicable to any type of commit.


Sometimes, including after a rebase, you need to reset your working tree. You can reset to a particular commit, and delete all changes, with the following command.

git --version

To force push your last known non-conflicting commit to the origin repository, you’ll need to use <span class="highlight">highlighted text</span>.

Warning: Force pushing to master is often frowned upon unless there is a really important reason for doing it. Use sparingly when working on your own repositories, and work to avoid this when you’re collaborating.

git --version

To remove local untracked files and subdirectories from the Git directory for a clean working branch, you can use <span class="highlight">highlighted text</span>.

git --version

If you need to modify your local repository so that it looks like the current upstream master (that is, there are too many conflicts), you can perform a hard reset.

Note: Performing this command will make your local repository look exactly like the upstream. Any commits you have made but that were not pulled into the upstream will be destroyed.

git --version


This guide covers some of the more common Git commands you may use when managing repositories and collaborating on software.

An Introduction to Git Merge and Git Rebase: What They Do and When to Use Them

An Introduction to Git Merge and Git Rebase: What They Do and When to Use Them

As a Developer, many of us have to choose between Merge and Rebase. With all the references we get from the internet, everyone believes “Don’t use Rebase, it could cause serious problems.” Here I will explain what merge and rebase are, why you should (and shouldn’t) use them, and how to do so.

As a Developer, many of us have to choose between Merge and Rebase. With all the references we get from the internet, everyone believes “Don’t use Rebase, it could cause serious problems.” Here I will explain what merge and rebase are, why you should (and shouldn’t) use them, and how to do so.

Git Merge and Git Rebase serve the same purpose. They are designed to integrate changes from multiple branches into one. Although the final goal is the same, those two methods achieve it in different ways, and it's helpful to know the difference as you become a better software developer.

This question has split the Git community. Some believe you should always rebase and others that you should always merge. Each side has some convincing benefits.

Git Merge

Merging is a common practice for developers using version control systems. Whether branches are created for testing, bug fixes, or other reasons, merging commits changes to another location. To be more specific, merging takes the contents of a source branch and integrates them with a target branch. In this process, only the target branch is changed. The source branch history remains the same.

Merge Master -> Feature branch


  • Simple and familiar
  • Preserves complete history and chronological order
  • Maintains the context of the branch


  • Commit history can become polluted by lots of merge commits
  • Debugging using git bisect can become harder

How to do it

Merge the master branch into the feature branch using the checkout and merge commands.

$ git checkout feature
$ git merge master
$ git merge master feature

This will create a new “Merge commit” in the feature branch that holds the history of both branches.

Git Rebase

Rebase is another way to integrate changes from one branch to another. Rebase compresses all the changes into a single “patch.” Then it integrates the patch onto the target branch.

Unlike merging, rebasing flattens the history because it transfers the completed work from one branch to another. In the process, unwanted history is eliminated.

Rebases are how changes should pass from the top of the hierarchy downwards, and merges are how they flow back upwards

Rebase feature branch into master


  • Streamlines a potentially complex history
  • Manipulating a single commit is easy (e.g. reverting them)
  • Avoids merge commit “noise” in busy repos with busy branches
  • Cleans intermediate commits by making them a single commit, which can be helpful for DevOps teams


  • Squashing the feature down to a handful of commits can hide the context
  • Rebasing public repositories can be dangerous when working as a team
  • It’s more work: Using rebase to keep your feature branch updated always
  • Rebasing with remote branches requires you to force push. The biggest problem people face is they force push but haven’t set git push default. This results in updates to all branches having the same name, both locally and remotely, and that is dreadful to deal with.
If you rebase incorrectly and unintentionally rewrite the history, it can lead to serious issues, so make sure you know what you are doing!

How to do it

Rebase the feature branch onto the master branch using the following commands.

$ git checkout feature
$ git rebase master

This moves the entire feature branch on top of the master branch. It does this by re-writing the project history by creating brand new commits for each commit in the original (feature) branch.

Interactive Rebasing

This allows altering the commits as they are moved to the new branch. This is more powerful than automated rebase, as it offers complete control over the branch’s commit history. Typically this is used to clean up a messy history before merging a feature branch into master.

$ git checkout feature
$ git rebase -i master

This will open the editor by listing all the commits that are about to be moved.

pick 22d6d7c Commit message#1
pick 44e8a9b Commit message#2
pick 79f1d2h Commit message#3

This defines exactly what the branch will look like after the rebase is performed. By re-ordering the entities, you can make the history look like whatever you want. For example, you can use commands like fixupsquashedit etc, in place of pick.

Which one to use

So what’s best? What do the experts recommend?

It’s hard to generalize and decide on one or the other, since every team is different. But we have to start somewhere.

Teams need to consider several questions when setting their Git rebase vs. merge policies. Because as it turns out, one workflow strategy is not better than the other. It is dependent on your team.

Consider the level of rebasing and Git competence across your organization. Determine the degree to which you value the simplicity of rebasing as compared to the traceability and history of merging.

Finally, decisions on merging and rebasing should be considered in the context of a clear branching strategy (Refer this article to understand more about branching strategy). A successful branching strategy is designed around the organization of your teams.

What do I recommend?

As the team grows, it will become hard to manage or trace development changes with an always merge policy. To have a clean and understandable commit history, using Rebase is reasonable and effective.

By considering the following circumstances and guidelines, you can get best out of Rebase:

  • You’re developing locally: If you have not shared your work with anyone else. At this point, you should prefer rebasing over merging to keep your history tidy. If you’ve got your personal fork of the repository and that is not shared with other developers, you’re safe to rebase even after you’ve pushed to your branch.
  • Your code is ready for review: You created a pull request. Others are reviewing your work and are potentially fetching it into their fork for local review. At this point, you should not rebase your work. You should create ‘rework’ commits and update your feature branch. This helps with traceability in the pull request and prevents the accidental history breakage.
  • The review is done and ready to be integrated into the target branch.Congratulations! You’re about to delete your feature branch. Given that other developers won’t be fetch-merging in these changes from this point on, this is your chance to sanitize your history. At this point, you can rewrite history and fold the original commits and those pesky ‘pr rework’ and ‘merge’ commits into a small set of focused commits. Creating an explicit merge for these commits is optional, but has value. It records when the feature graduated to master.


I hope this explanation has given some insights on Git merge and Git rebase.Merge vs rebase strategy is always debatable. But perhaps this article will help dispel your doubts and allow you to adopt an approach that works for your team.

Originally published by Vali Shah at

Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis

Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis

<strong>Originally published by </strong><a href="" target="_blank">David Herron</a><strong> </strong><em>at&nbsp;</em><a href="" target="_blank"></a>

Some claim the machine learning field is in a crisis due to software tooling that’s insufficient to ensure repeatable processes. The crisis is about difficulty in reproducing results such as machine learning models. The crisis could be solved with better software tools for machine learning practitioners.

The reproducibility issue is so important that the annual NeurIPS conference plans to make this a major topic of discussion at NeurIPS 2019. The “Call for Papers” announcement has more information

The so-called crisis is because of the difficulty in replicating the work of co-workers or fellow scientists, threatening their ability to build on each other’s work or to share it with clients or to deploy production services. Since machine learning, and other forms of artificial intelligence software, are so widely used across both academic and corporate research, replicability or reproducibility is a critical problem.

We might think this can be solved with typical software engineering tools, since machine learning development is similar to regular software engineering. In both cases we generate some sort of compiled software asset for execution on computer hardware hoping to get accurate results. Why can’t we tap into the rich tradition of software tools, and best practices for software quality, to build repeatable processes for machine learning teams?

Unfortunately traditional software engineering tools do not fit well with the needs of machine learning researchers.

A key issue is the training data. Often this is a large amount of data, such as images, videos, or texts, that are fed into machine learning tools to train an ML model. Often the training data is not under any kind of source control mechanism, if only because systems like Git do not deal well with large data files, and source control management systems designed to generate delta’s for text files do not deal well with changes to large binary files. Any experienced software engineer will tell you that a team without source control will be in a state of barely managed chaos. Changes won’t always be recorded and team members might forget what was done.

At the end of the day that means a model trained against the training data cannot be replicated because the training data set will have changed in unknown-able ways. If there is no software system to remember the state of the data set on any given day, then what mechanism is there to remember what happened when?

Git-LFS is your solution, right?

The first response might be to simply use Git-LFS (Git Large File Storage) because it, as the name implies, deals with large files while building on Git. The pitch is that Git-LFS “replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like or GitHub Enterprise.” One can just imagine a harried machine learning team saying “sounds great, let’s go for it”. It handles multi-gigabyte files, speeds up checkout from remote repositories, and uses the same comfortable workflow. That sure ticks a lot of boxes, doesn’t it?

Not so fast, didn’t your manager instruct you to evaluate carefully before jumping in with both feet? Another life lesson to recall is to look both ways before crossing the street.

The first thing your evaluation should turn up is that Git-LFS requires an LFS server, and that server is not available through every Git hosting service. The big three (Github, Gitlab and Atlassian) all support Git-LFS, but maybe you have a DIY bone in your body. Instead of using a 3rd party Git hosting service, you might prefer to host your own Git service. Gogs, for example, is a competent Git service you can easily run on your own hardware, but it does not have built-in support for Git-LFS.

Depending on your data needs this next could be a killer: Git LFS lets you store files up to 2 GB in size. That is a Github limitation rather than Git-LFS limitation, however all Git-LFS implementations seem to come with various limitations. Gitlab and Atlassian both have their own lists of Git-LFS limitations. Consider this 2GB limit from Github: One of the use-cases in the Git-LFS pitch is storing video files, but isn’t it common for videos to be way beyond 2GB in size? Therefore GIt-LFS on Github is probably unsuitable for machine learning datasets.

It’s not just the 2GB file size limit, but Github places such a tight limit on the free tier of Git-LFS use that one must purchase a data plan covering both data and bandwidth usage.

An issue related to bandwidth is that when using a hosted Git-LFS solution, your training data is stored in a remote server and must be downloaded over the Internet. The time to download training data is a serious user experience problem.

Another issue is the ease of placing data files on a cloud storage system (AWS, GCP, etc) as is often required when to run cloud-based AI software. This is not supported, since the main Git-LFS offerings from the big 3 Git services store your LFS files on their server. There is a DIY Git-LFS server that does store files on AWS S3 at But setting up a custom Git-LFS server of course requires additional work. And, what if you need the files to be on GCP instead of AWS infrastructure? Is there a Git-LFS server which stores data on the cloud storage platform of your choice? Is there a Git-LFS server that utilizes a simple SSH server? In other words, GIt-LFS limits your choices of where the data is stored.

Does using Git-LFS solve the so-called Machine Learning Reproducibility Crisis?

With Git-LFS your team has better control over the data, because it is now version controlled. Does that mean the problem is solved?

Earlier we said the “key issue is the training data”, but that was a lie. Sort of. Yes keeping the data under version control is a big improvement. But is the lack of version control of the data files the entire problem? No.

What determines the results of training a model or other activities? The determining factors include the following, and perhaps more:

  • Training data — the image database or whatever data source is used in training the model
  • The scripts used in training the model
  • The libraries used by the training scripts
  • The scripts used in processing data
  • The libraries or other tools used in processing data
  • The operating system and CPU/GPU hardware
  • Production system code
  • Libraries used by production system code

Obviously the result of training a model depends on a variety of conditions. Since there are so many variables to this, it is hard to be precise, but the general problem is a lack of what’s now called Configuration Management. Software engineers have come to recognize the importance of being able to specify the precise system configuration used in deploying systems.

Solutions to machine learning reproducibility

Humans are an inventive lot, and there are many possible solutions to this “crisis”.

Environments like R Studio or Jupyter Notebook offer a kind of interactive Markdown document which can be configured to execute data science or machine learning workflows. This is useful for documenting machine learning work, and specifying which scripts and libraries are used. But these systems do not offer a solution to managing data sets.

Likewise, Makefiles and similar workflow scripting tools offer a method to repeatedly execute a series of commands. The executed commands are determined through file-system time stamps. These tools offer no solution for data management.

At the other end of the scale are companies like Domino Data Labs or C3 IoT offering hosted platforms for data science and machine learning. Both package together an offering built upon a wide swath of data science tools. In some cases, like C3 IoT, users are coding in a proprietary language and storing their data in a proprietary data store. It can be enticing to use a one-stop-shopping service, but will it offer the needed flexibility?

In the rest of this article we’ll discuss DVC. It was designed to closely match Git functionality, to leverage the familiarity most of us have with Git, but with features making it work well for both workflow and data management in the machine learning context.

DVC ( takes on and solves a larger slice of the machine learning reproducibility problem than does Git-LFS or several other potential solutions. It does this by managing the code (scripts and programs), alongside large data files, in a hybrid between DVC and a source code management (SCM) system like Git. In addition DVC manages the workflow required for processing files used in machine learning experiments. The data files and commands-to-execute are described in DVC files which we’ll learn about in the following sections. Finally, with DVC it is easy to store data on many storage systems from the local disk, to an SSH server, or to cloud systems (S3, GCP, etc). Data managed by DVC can be easily shared with others using this storage system.

Image courtesy

DVC uses a similar command structure to Git. As we see here, just like git push and git pull are used for sharing code and configuration with collaborators, dvc push and dvc pull is used for sharing data. All this is covered in more detail in the coming sections, or if you want to skip right to learning about DVC see the tutorial at

DVC remembers precisely which files were used at what point of time

At the core of DVC is a data store (the DVC cache) optimized for storing and versioning large files. The team chooses which files to store in the SCM (like Git) and which to store in DVC. Files managed by DVC are stored such that DVC can maintain multiple versions of each file, and to use file-system links to quickly change which version of each file is being used.

Conceptually the SCM (like Git) and DVC both have repositories holding multiple versions of each file. One can check out “version N” and the corresponding files will appear in the working directory, then later check out “version N+1” and the files will change around to match.

Image courtesy

On the DVC side, this is handled in the DVC cache. Files stored in the cache are indexed by a checksum (MD5 hash) of the content. As the individual files managed by DVC change, their checksum will of course change, and corresponding cache entries are created. The cache holds all instances of each file.

For efficiency, DVC uses several linking methods (depending on file system support) to insert files into the workspace without copying. This way DVC can quickly update the working directory when requested.

DVC uses what are called “DVC files” to describe both the data files and the workflow steps. Each workspace will have multiple DVC files, with each describing one or more data files with the corresponding checksum, and each describing a command to execute in the workflow.

cmd: python src/ data/data.xml
- md5: b4801c88a83f3bf5024c19a942993a48
  path: src/
- md5: a304afb96060aad90176268345e10355
  path: data/data.xml
md5: c3a73109be6c186b9d72e714bcedaddb
- cache: true
  md5: 6836f797f3924fb46fcfd6b9f6aa6416.dir
  metric: false
  path: data/prepared
wdir: .

This example DVC file comes from the DVC Getting Started example ( and shows the initial step of a workflow. We’ll talk more about workflows in the next section. For now, note that this command has two dependencies, src/ and data/data.xml, and an output data directory named data/prepared. Everything has an MD5 hash, and as these files change the MD5 hash will change and a new instance of changed data files are stored in the DVC cache.

DVC files are checked into the SCM managed (Git) repository. As commits are made to the SCM repository each DVC file is updated (if appropriate) with new checksums of each file. Therefore with DVC one can recreate exactly the data set present for each commit, and the team can exactly recreate each development step of the project.

DVC files are roughly similar to the “pointer” files used in Git-LFS.

The DVC team recommends using different SCM tags or branches for each experiment. Therefore accessing the data files, and code, and configuration, appropriate to that experiment is as simple as switching branches. The SCM will update the code and configuration files, and DVC will update the data files, automatically.

This means there is no more scratching your head trying to remember which data files were used for what experiment. DVC tracks all that for you.

DVC remembers the exact sequence of commands used at what point of time

The DVC files remember not only the files used in a particular execution stage, but the command that is executed in that stage.

Reproducing a machine learning result requires not only using the precise same data files, but the same processing steps and the same code/configuration. Consider a typical step in creating a model, of preparing sample data to use in later steps. You might have a Python script,, to perform that split, and you might have input data in an XML file named data/data.xml.

$ dvc run -d data/data.xml -d code/ \
            -o data/prepared \
            python code/

This is how we use DVC to record that processing step. The DVC “run” command creates a DVC file based on the command-line options.

The -d option defines dependencies, and in this case we see an input file in XML format, and a Python script. The -o option records output files, in this case there is an output data directory listed. Finally, the executed command is a Python script. Hence, we have input data, code and configuration, and output data, all dutifully recorded in the resulting DVC file, which corresponds to the DVC file shown in the previous section.

If is changed from one commit to the next, the SCM will automatically track the change. Likewise any change to data.xml results in a new instance in the DVC cache, which DVC will automatically track. The resulting data directory will also be tracked by DVC if they change.

A DVC file can also simply refer to a file, like so:

md5: 99775a801a1553aae41358eafc2759a9
- cache: true
  md5: ce68b98d82545628782c66192c96f2d2
  metric: false
  path: data/
  persist: false
wdir: ..

This results from the “dvc add <em>file</em>” command, which is used when you simply have a data file, and it is not the result of another command. For example in this is shown, which results in the immediately preceeding DVC file:

$ wget -P data
$ dvc add data/

The file is then the data source for a sequence of steps shown in the tutorial that derive information from this data.

Take a step back and recognize these are individual steps in a larger workflow, or what DVC calls a pipeline. With “dvc add” and “dvc run” you can string together several Stages, each being created with a “dvc run” command, and each being described by a DVC file. For a complete working example, see and

This means that each working directory will have several DVC files, one for each stage in the pipeline used in that project. DVC scans the DVC files to build up a Directed Acyclic Graph (DAG) of the commands required to reproduce the output(s) of the pipeline. Each stage is like a mini-Makefile in that DVC executes the command only if the dependencies have changed. It is also different because DVC does not consider only the file-system timestamps, like Make does, but whether the file content has changed, as determined by the checksum in the DVC file versus the current state of the file.

Bottom line is that this means there is no more scratching your head trying to remember which version of what script was used for each experiment. DVC tracks all of that for you.

Image courtesy

DVC makes it easy to share data and code between team members

A machine learning researcher is probably working with colleagues, and needs to share data and code and configuration. Or the researcher may need to deploy data to remote systems, for example to run software on a cloud computing system (AWS, GCP, etc), which often means uploading data to the corresponding cloud storage service (S3, GCP, etc).

The code and configuration side of a DVC workspace is stored in the SCM (like Git). Using normal SCM commands (like “git clone”) one can easily share it with colleagues. But how about sharing the data with colleagues?

DVC has the concept of remote storage. A DVC workspace can push data to, or pull data from, remote storage. The remote storage pool can exist on any of the cloud storage platforms (S3, GCP, etc) as well as an SSH server.

Therefore to share code, configuration and data with a colleague, you first define a remote storage pool. The configuration file holding remote storage definitions is tracked by the SCM. You next push the SCM repository to a shared server, which carries with it the DVC configuration file. When your colleague clones the repository, they can immediately pull the data from the remote cache.

This means your colleagues no longer have to scratch their head wondering how to run your code. They can easily replicate the exact steps, and the exact data, used to produce the results.

Image courtesy


The key to repeatable results is using good practices, to keep proper versioning of not only their data but the code and configuration files, and to automate processing steps. Successful projects sometimes requires collaboration with colleagues, which is made easier through cloud storage systems. Some jobs require AI software running on cloud computing platforms, requiring data files to be stored on cloud storage platforms.

With DVC a machine learning research team can ensure their data, configuration and code are in sync with each other. It is an easy-to-use system which efficiently manages shared data repositories alongside an SCM system (like Git) to store the configuration and code.