Managing software dependencies for Data Science projects

Virtual environments are a must when developing software projects. They allow you to create self-contained, isolated Python installations that prevent your projects from clashing with each other and let other people reproduce your setup.

However, using virtual environments is just the first step towards developing reproducible data projects. This post discusses another important subject: dependency management, which relates to properly documenting virtual environments to ease reproducibility and make them reliable when deployed to production.

For a more in-depth description of Python environments, see this other post.

How dependency managers work

tl; dr; Package managers use heuristics to install package versions that are compatible with each other. A large set of dependencies and version constraints might lead to failures when resolving the environment.

When you install a package, your package manager (pip, conda) has to install dependencies for the requested package and dependencies of those dependencies, until all requirements are satisfied.

Packages usually have version constraints. For example, at this time of writing scikit-learn requires numpy 1.13.3 or higher. When packages share dependencies, the package manager has to find versions that satisfy all constraints (this process is known as dependency resolution), doing so is computationally expensive, current package managers use heuristics to find a solution in a reasonable amount of time.

With large sets of dependencies, it can happen that the solver is unable to find a solution. Some package managers will throw an error, but others might just print a warning message. To avoid these issues, it is important to be conscious about project dependencies.

#machine-learning #data-science #python

How dependency managers work

towardsdatascience.com

Managing software dependencies for Data Science projects