Should data scientists use Git and GitHub?

Given the popularity and success of tools like Git and GitHub, it's worth considering whether you should invest in version control tools for your team's next Data Science project. In fact, many vendors strongly promote using version control systems: Microsoft Azure's Data Science Process documentation calls Git "the collaborative code development framework for data science projects".

But like any investment for your team, choosing to organize your Data Science projects around a version control system has its trade-offs. Personally, I wouldn't recommend that every project use version control, even though I always use Git for my own projects: if you have to ask whether you need it, you probably don't!

To explain that a bit more, it's helpful to have a practical understanding of the different types of version control, and how they're typically used for Data Science projects. With that context, it will be much easier to assess the trade-offs of using version control for your team, and I think you'll see why I recommend that most projects not use Git or other version control systems, at least initially.

What does version control do?

There are three types of version control:

  1. Source code version control is the most commonly used, and generally what people mean when they say "version control." It helps manage the code - SQL, Python, Scala - in your projects.
  2. Data source version control is less common, but growing in popularity among Data Scientists. It tracks whether model inputs change, so that you can tell whether any of your project's models differ as a result of being trained on a different set of data inputs.
  3. Application version control is common, but less relevant for Data Science projects. If you've ever downloaded a file and seen instructions to "check the hash of the download to ensure its security," that's what application version control does: it makes sure that your copy of the files is identical to other people's, so that you don't use compromised software.
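The hash check behind the second and third items is simple enough to sketch in a few lines of Python. This is only an illustration, assuming SHA-256 as the hash; the function names `file_sha256` and `inputs_changed` are made up for this example and are not part of any version control tool.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inputs_changed(path, recorded_digest):
    """Return True if the file no longer matches a digest recorded earlier."""
    return file_sha256(path) != recorded_digest
```

Record the digest of a training file when you fit a model, and a later call to `inputs_changed` tells you whether that file has been modified since. Comparing a download's digest against the one its publisher lists is the same operation.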

Source version control is all about how files change. Tools like Git track every change to a file, and let you annotate each change with a message that explains the reason for it. Changes are called "commits," in reference to the fact that each change is recorded in the project's history for good.
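To make the idea of a tracked, annotated change concrete, here is a rough sketch using Python's standard `difflib` module. It is not how Git stores commits internally; the `describe_change` helper, the `model.py` filename, and the dictionary layout are all invented for illustration.

```python
import difflib


def describe_change(old_text, new_text, filename="model.py"):
    """Return the line-level difference between two versions of a file."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
        lineterm="",
    )
    return list(diff)


# A hypothetical "commit": the recorded change plus the message explaining it.
old = "learning_rate = 0.1\nepochs = 10\n"
new = "learning_rate = 0.01\nepochs = 10\n"
commit = {
    "message": "Lower learning rate to stabilize training",
    "diff": describe_change(old, new),
}
```

The `diff` list contains one line marked `-` for the old learning rate and one marked `+` for the new one, which is roughly what you see when you review a change in a version control tool.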

The series of changes is logged in a set of files usually called a repo, short for repository. You can upload those files to a remote server, making them shareable with anyone who has access to the server. Servers hosted by companies like GitHub, GitLab, Bitbucket, and others make it easy to upload your repos and share them with the whole internet.

Those hosting services work much like Dropbox or Box, but with additional features more relevant to developers, such as browsing different versions of the files at the same time.

What are the benefits of using Version Control tools like Git?

Version control has become a critical part of every software organization for a reason: it significantly reduces the risk that a change to an application will cause anything to stop working. It also ensures that two different developers working on the same application can work at the same time.

Those are the same benefits that Data Scientists will get if they use version control for the code that builds their projects. By using a more traditional software development workflow, you begin to treat your models more as an application and less as a script, which leads to higher quality.

Additionally, version control with Git allows you to integrate more easily with automation tools like GitHub Actions, GitLab CI/CD, Azure Pipelines, and others. These automated build tools can help you identify regressions or issues with your models or data pipelines right when they happen, and pinpoint exactly the change that led to the issue. That can help save time and increase quality.
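As a sketch of the kind of check such an automation tool might run after every change, the function below compares a model's current metrics against a stored baseline. Everything here is an assumption made for illustration: the function name, the metric names, the tolerance, and the premise that higher values are better.

```python
def check_for_regression(baseline, current, tolerance=0.01):
    """Return the names of metrics that fell more than `tolerance` below baseline.

    Assumes higher is better for every metric. A metric missing from
    `current` is also reported, since it can no longer be verified.
    """
    regressions = []
    for name, baseline_value in baseline.items():
        current_value = current.get(name)
        if current_value is None or current_value < baseline_value - tolerance:
            regressions.append(name)
    return regressions


# Hypothetical metrics from the last accepted model and from the latest change.
baseline = {"accuracy": 0.91, "recall": 0.80}
current = {"accuracy": 0.92, "recall": 0.70}
failed = check_for_regression(baseline, current)
```

Here `failed` contains only `"recall"`. Because the check runs on every commit, a failure points directly at the change that introduced the drop.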

What are the downsides of using Version Control tools like Git?

Although there are significant benefits from using version control tools like Git, they come with a high overhead cost, especially if you're using version control for the first time.

The overhead comes from needing to ensure that every change goes through the "commit" process, which most often means using the command line and terminal. Since the terminal is so unfamiliar to most analysts (and even data scientists), you don't just need to learn Git, you also need to learn the terminal! This is not quick, and having your efficiency suffer while struggling to remember what command to write is a huge turn-off.

Additionally, if you have two team members working on similar parts of the project at the same time, eventually they'll have to reconcile their changes using Git's "merge" feature. How to perform a successful merge is a mystery even to many developers, and a failed merge can be a complete show-stopper if you don't have an experienced developer to help get you through.
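The core idea of a merge is easier to see in a toy version. The sketch below performs a line-by-line three-way merge under a strong simplifying assumption: all three versions have the same number of lines, so line N in one always corresponds to line N in the others. Real tools like Git first align the versions, so this is an illustration of the concept, not Git's algorithm, and `three_way_merge` is a name made up for this example.

```python
def three_way_merge(base, ours, theirs):
    """Merge two edited copies of `base`, one line at a time.

    Returns (merged_lines, conflict_indexes). Conflicting lines are left
    as None in the merged result because a person has to resolve them.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles versions with equal line counts")

    merged, conflicts = [], []
    for index, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:
            merged.append(o)  # both sides agree, or neither changed the line
        elif o == b:
            merged.append(t)  # only their side changed the line
        elif t == b:
            merged.append(o)  # only our side changed the line
        else:
            merged.append(None)  # both sides changed it differently
            conflicts.append(index)
    return merged, conflicts
```

When two people change different lines, the merge succeeds automatically. When they change the same line in different ways, the tool cannot choose for them, and that is the "merge conflict" that tends to stall a team without someone experienced on hand.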

Using version control for your project also won't necessarily solve some of the most frequent and frustrating questions that come up. For example, if you look at the outputs of a model on your coworker's computer, you won't necessarily know exactly which version of their work produced that output. Unless your process for updating your outputs is fully automated, and you always run that process from a fresh "checkout" of your project, you'll inevitably run into changes in the output that can't be accounted for.
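One partial mitigation is to stamp every output with the commit it was produced from. The sketch below assumes the script runs inside a Git repository with the `git` command available; the helper names and the record layout are hypothetical. Note that it does not detect uncommitted edits, which is exactly why the fresh checkout matters.

```python
import subprocess
from datetime import datetime, timezone


def current_commit():
    """Return the checked-out commit hash, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or this directory is not a repository.
        return None
    return result.stdout.strip()


def build_output_record(results, commit=None):
    """Bundle model results with the commit and time they were produced."""
    return {
        "commit": commit if commit is not None else current_commit(),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }
```

Saving this record next to the model's outputs at least lets you answer "which version of the code made this?", provided nobody ran the code with local changes.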

Perhaps the biggest downside to using version control is that it doesn't necessarily play well with tools like Jupyter Notebooks or WYSIWYG builders like Tableau. In fact, you'd basically have to stop using those if you want the benefits of version control (although there are tools like Jupytext that try to solve this problem).
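Part of the friction with notebooks is that a Jupyter notebook file is JSON that stores cell outputs and execution counts alongside the code, so simply re-running a notebook changes the file even when no code changed. A minimal sketch of the usual workaround, clearing that noise before committing, might look like the following. It assumes the notebook has already been loaded into a dictionary in the version 4 notebook format, and `strip_outputs` is a name invented for this example.

```python
import copy


def strip_outputs(notebook):
    """Return a copy of a notebook dict with outputs and run counts cleared."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned
```

After stripping, two runs of the same code produce identical files, so the recorded changes reflect only edits to the code and text. The trade-off is that the saved notebook no longer shows its results.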

Conclusion

Version control is an optimization for teams that have stringent requirements for their project's quality. The significant investment in time is usually not worth it unless you have a very low tolerance for errors, for example when your team maintains mission-critical models used in real-time systems, like automated fraud detection.

If you're starting a project with less stringent requirements, like building new internal reporting, I wouldn't recommend beginning with Git. It's unlikely you'll run into the kinds of problems that version control excels at solving, and if you do, it's usually easy to add version control after the project has started.