If you’re into data science, then you’re probably familiar with this workflow: you start a project by firing up a Jupyter notebook, then begin writing your Python code, running complex analyses, or even training a model. As the notebook grows with functions, classes, plots, and logs, you end up with an enormous blob of monolithic code sitting in one place. If you’re lucky, things go well. Good for you!

However, Jupyter notebooks hide some serious pitfalls that can turn your coding into a living hell. Let’s see how this happens, then discuss coding best practices to prevent it.

The problems with Jupyter Notebook


Quite often, things don’t go the way you intend when you try to take your Jupyter prototyping to the next level. Here are some situations I encountered while using this tool that should sound familiar to you:

  • With all the objects (functions or classes) defined and instantiated in one place, maintainability becomes really hard: even if you want to make a small change to a function, you have to locate it somewhere in the notebook, fix it, and rerun the code all over again. You don’t want that, believe me. Wouldn’t it be simpler to have your logic and processing functions separated into external scripts?
  • Because of their interactivity and instant feedback, Jupyter notebooks push data scientists to declare variables in the global namespace instead of using functions. This is considered bad practice in Python development because it **limits effective code reuse**. It also harms reproducibility because your notebook turns into a large state machine holding all your variables. In this configuration, you’ll have to remember which result is cached and which is not, and you’ll also have to expect other users to follow your cell execution order.
  • The way notebooks are formatted behind the scenes (JSON objects) makes **code versioning difficult**. This is why I rarely see data scientists using Git to commit different versions of a notebook or merge branches for specific features. Consequently, team collaboration becomes inefficient and clunky: team members start exchanging code snippets and notebooks via e-mail or Slack, rolling back to a previous version of the code is a nightmare, and the file organization starts to get messy. Here’s what I commonly see in projects after two or three weeks of using a Jupyter notebook without proper versioning:
      • analysis.ipynb
      • analysis_COPY(1).ipynb
      • analysis_COPY(2).ipynb
      • analysis_FINAL.ipynb
      • analysis_FINAL_2.ipynb
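To make the first point concrete, here is a minimal sketch of moving logic out of the notebook into a plain module that the notebook merely imports. The module name `processing.py` and its `normalize` helper are hypothetical; the file is written inline here only so the sketch runs on its own:

```python
from pathlib import Path
import importlib

# Hypothetical layout: the notebook only orchestrates; the reusable logic
# lives in processing.py, which you edit in a proper code editor.
Path("processing.py").write_text(
    '"""Reusable processing helpers, maintained outside the notebook."""\n'
    "def normalize(values):\n"
    '    """Scale a list of numbers so they sum to 1."""\n'
    "    total = sum(values)\n"
    "    return [v / total for v in values]\n"
)

import processing

print(processing.normalize([1, 1, 2]))  # [0.25, 0.25, 0.5]

# After fixing something in processing.py, reload the module instead of
# rerunning the whole notebook from the top:
importlib.reload(processing)
```

In a real notebook you would simply keep `processing.py` next to the `.ipynb` file and re-import (or reload) it after each change.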
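The second point — functions over global state — can be sketched as follows. The helper names (`clean`, `summarize`) and the sample rows are hypothetical, chosen only to contrast the two styles:

```python
# Global-state style typical of notebooks: each cell mutates shared names,
# so the final result depends on the order in which cells were executed.
#   rows = load_data()      # cell 1
#   rows = clean(rows)      # cell 2 -- rerunning cell 3 before this breaks things
#   result = summarize(rows)  # cell 3

# Function style: explicit inputs and outputs make the pipeline reproducible
# regardless of execution order.
def clean(rows):
    """Drop records that contain missing values."""
    return [r for r in rows if None not in r]

def summarize(rows):
    """Return the mean of the first field across all records."""
    return sum(r[0] for r in rows) / len(rows)

raw = [(1.0, "a"), (None, "b"), (3.0, "c")]
result = summarize(clean(raw))
print(result)  # 2.0
```

Because each function takes its inputs explicitly, the whole computation can be rerun, tested, or moved into a script without remembering any hidden notebook state.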


Software Engineering Tips and Best Practices for Data Science