Down with technical debt! Clean Python for data scientists.
Data science teams tend to pull in two competing directions. On one side there are the data engineers, who value highly reliable, robust code that carries low technical debt. On the other there are the data scientists, who value the rapid prototyping of ideas and algorithms in proof-of-concept settings.
While more mature data science functions enjoy a fruitful working partnership between the two sides, have sophisticated CI/CD pipelines in place, and have a well-defined segregation of responsibilities, early-stage teams are often dominated by a high ratio of inexperienced data scientists. As a result, code quality suffers, and technical debt accumulates rapidly in the form of glue code, pipeline jungles, dead experimental codepaths, and configuration debt.
Recently I wrote a brain dump on why data scientists’ code tends to suffer from mediocrity. In this post I’m hoping to shed light on some of the ways that fledgling data scientists can write cleaner Python code and better structure small-scale projects, with the important side effect of reducing the amount of technical debt you inadvertently place on yourself and your team.
Neither exhaustive in scope nor rigorous in depth, the below is intended to act as a series of shallow introductions to ways you can institute data science projects in a more thoughtful way. Some points will be obvious, some will be less obvious.
Here’s a quick overview of what to expect: (1) Style Guidelines, (2) Documentation, (3) Type Checking, (4) Project folder structures, (5) Code version control, (6) Model version control, (7) Environments, (8) Jupyter Notebooks, (9) Unit testing, (10) Logging.
Python Style Guidelines: PEP 8 and Linting
Readability Counts. So much so that there’s an entire PEP devoted to it: PEP8, which provides coding conventions for writing clean Python code.
Conforming to PEP8 standards is considered the bare minimum of what constitutes Pythonic code. It signals that you’re aware of the most basic conventions expected of a Python developer, it shows that you’re able to collaborate more easily with other developers, and most importantly of all, it makes your code more readable, foolishly consistent, and easier to digest for you.
It would be a waste of everyone’s time for me to copy and reformat the PEP8 style guide here. So, at your own pleasure, browse around pep8.org, look at the examples and breathe in what it means to write clean code on the micro level (as opposed to clean code on the macro or system level).
Examples given in PEP8 include setting standards for naming conventions, indentations, imports, and line length.
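To make those categories concrete, here is a small sketch of a module laid out along PEP8 lines. The names (SampleSummary, rounded_standard_error, DEFAULT_PRECISION) are made up for illustration and are not taken from the style guide itself.

```python
"""Small module laid out along PEP8 lines (illustrative only)."""

# Imports sit at the top of the file, one module per line.
import math
import statistics

# Constants use UPPER_CASE_WITH_UNDERSCORES.
DEFAULT_PRECISION = 2


class SampleSummary:
    """Class names use CapWords."""

    def __init__(self, values):
        self.values = list(values)

    def mean(self):
        return statistics.mean(self.values)


def rounded_standard_error(values, precision=DEFAULT_PRECISION):
    """Functions and variables use lowercase_with_underscores."""
    # Indentation is four spaces, and a long expression is wrapped
    # inside parentheses, breaking before the binary operator.
    standard_error = (
        statistics.stdev(values)
        / math.sqrt(len(values))
    )
    return round(standard_error, precision)
```

None of this changes what the code does; it only changes how quickly the next reader can follow it.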
As an aside, PEP8 is one of the many reasons why you should be using a fully fledged IDE like PyCharm (in my opinion the superior Python IDE) to write your code instead of a simple text editor like Sublime. Heavyweight IDEs for Python will typically conform to the PEP8 style guide, warning when you’re violating its principles and providing automatic reformatting of your codebase.
There are four — though actually many others exist — command line tools that perform static analysis of your source code to keep it clean and consistent:
Side note 1. A linter will not tell you whether you’ve named your variables well. Naming things well is a skill that newer developers often laugh off, but it is one worth mastering.
Side note 2. Before installing these packages, it’s best to be in a virtual environment. More on that later.
Documenting your project: PEP257 and Sphinx
While PEP8 outlines coding conventions for Python, PEP257 standardises the high level structure, semantics, and conventions of docstrings: what they should contain, and how to say it in a clear way. As with PEP8, these are not hard rules, but they are guidelines which you’d be wise to follow.
If you violate these conventions, the worst you’ll get is some dirty looks.
So, what then is a docstring? A docstring is a string literal that occurs as the first statement in a module, function, class, or method definition. Such a docstring becomes the __doc__ special attribute of that object. As with PEP8, I won’t copy and reformat the whole PEP; you should browse through it in your own time. Here are two examples of docstrings for functions.
1. Example one-line docstring for a function add:
    def add(a, b):
        """Sum two numbers."""
        return a + b
2. Example multi-line docstring for a function complex:
    def complex(real=0.0, imag=0.0):
        """Form a complex number.

        Keyword arguments:
        real -- the real part (default 0.0)
        imag -- the imaginary part (default 0.0)
        """
        if imag == 0.0 and real == 0.0:
            return complex_zero
        ...
After all your well-formatted docstrings are in place, you’re going to want to convert them into prettified project documentation. The de facto Python documentation generator is called Sphinx, and it can generate output such as HTML, PDF, Unix man pages, and more.
Here’s a good tutorial on getting started with Sphinx. In short, after initialising Sphinx in some directory (typically your docs directory) and setting up its configuration, you will work on your reStructuredText (*.rst) files, which, after calling make, will be converted to your preferred output documentation type.
As a fun aside, you can create references to other parts of your Python program directly from your docstrings, which appear as links in the output documentation.
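As a rough sketch of such a reference, the docstring below points at another function using the :func: role from Sphinx's Python domain. The mean function is made up for illustration, and the link only appears if both functions are included in the generated documentation (for instance through the autodoc extension).

```python
def add(a, b):
    """Sum two numbers."""
    return a + b


def mean(values):
    """Return the arithmetic mean of ``values``.

    Totals are accumulated with :func:`add`, which Sphinx can render as
    a link to that function's documentation.
    """
    total = 0
    for value in values:
        total = add(total, value)
    return total / len(values)
```

At runtime the role is just text inside the __doc__ attribute; it is Sphinx that turns it into a hyperlink when the documentation is built.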
To illustrate output typical of the Sphinx documentation generator, here is an incomplete but bulging list of Python projects with Sphinx-generated documentation. Examples include matplotlib, NetworkX, Flask, and pandas.
Type Checking: PEP484, PEP526, and mypy
Because Python is a dynamically typed language, you don’t have static type checking by default. This is good in that it offers flexibility and fast paced development, but it’s bad because you won’t catch simple errors before running the system (at compile time), instead catching them at runtime. Typically, dynamically typed languages like Python require more unit testing than statically typed languages do. This is tedious. In languages like Scala, or Java, or C#, type declarations occur in-code, and the compiler checks whether or not variables are legally passed around by way of type.
Ultimately, static typing acts as a safety net that in certain situations you would be wise to use. The good news is that Python actually offers something called type hints.
Here is an example of function annotation type hinting. This implies that name should be of type str, and the function should return a str as well.
    def greeting(name: str) -> str:
        return 'Hello ' + name
Here is an example of variable annotation type hinting. This implies that variable i should be an integer. Note that the below syntax is for Python 3.6 and above.
    i: int = 5
… but the bad news is that type hints are ignored by the Python interpreter and have no runtime effect.
But then, why bother with type hinting if it has no runtime effect? Two reasons. Firstly, it makes your code much clearer as to what it’s trying to do and how it flows. And secondly, you can use a static type checker with Python; it just doesn’t run by default. The main one is mypy.
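A minimal sketch of what "no runtime effect" means in practice. The function double is a made-up example: the interpreter happily accepts an argument that contradicts the hint, and a static checker is what would catch it.

```python
def double(value: int) -> int:
    """Return twice the given integer."""
    return value * 2


# The interpreter does not enforce the hint: a str is accepted, and the
# multiplication repeats it instead of raising an error.
surprising = double("ab")

# The hints are stored as plain metadata on the function object.
hints = double.__annotations__
```

Running a checker such as mypy over a file containing this code (for example mypy example.py, where the filename is hypothetical) should report the double("ab") call as an incompatible argument type, without the code ever being executed.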
Side note 1: Because of the optional nature of type hints, you’re free to put them everywhere in your code, in some places, or nowhere. Just try to keep it consistent as to where they’re placed.
Side note 2: as already mentioned, PyCharm is useful for enforcing PEP8 standards. But it’s also useful in the case of type hinting. It automatically checks your type hints and tells you when you’re violating expectations. By default, in PyCharm these are set to warnings, but you can also set them to errors.
Project folder structures: cookiecutter
Just as a cluttered desk is a sign of a cluttered mind, so too is a cluttered folder structure.
Many a project would benefit from a well thought out directory structure from the outset. An unfortunately common approach to starting projects is to create a base project directory, and shove everything directly under it — from data to notebooks to generated models to output — without considering the emergent ill effects as your simple playground project becomes an increasingly complex parameterised pipeline.
Eventually, you’ll end up with a form of technical debt which must later on be paid off in the form of time and effort. And the real tragedy? That all this can be avoided upfront, simply by gaining the foresight to properly structure your project from the outset.
Part of the reason we struggle with this is that it’s tedious to create project folder structures. We want to dive head first into exploring the data and building machine learning models. But a relatively minor upfront investment can save a lot of effort later.
A cleaner folder structure will encourage best practices, simplify separation of concerns, and make learning (or relearning) old code far more enjoyable.
As luck — otherwise known as the hard work of open source developers — would have it, an oven baked, ready-to-go solution for creating the folder structure we want already exists: cookiecutter.
Creating a clean folder structure common to many data science projects is just a command away.
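To give a feel for what a scaffolded project might contain, here is a standard-library sketch that creates a small directory tree. The directory names are an assumption chosen for illustration; this is not cookiecutter's own code, and real templates differ in detail.

```python
from pathlib import Path
import tempfile

# Hypothetical layout for illustration; real templates differ in detail.
PROJECT_DIRECTORIES = [
    "data/raw",
    "data/processed",
    "notebooks",
    "models",
    "src",
    "tests",
    "docs",
]


def scaffold_project(root):
    """Create the directories listed above under root and return their paths."""
    root = Path(root)
    created = []
    for relative in PROJECT_DIRECTORIES:
        directory = root / relative
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created


# Build the tree in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as scratch:
    created = scaffold_project(Path(scratch) / "my_project")
    layout = sorted(path.relative_to(scratch).as_posix() for path in created)
```

The point is the separation: raw data, processed data, notebooks, models, source code, and tests each get a home from day one, rather than piling up under the project root.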
Note that cookiecutter is very powerful, and actually does much more than just generating clean project folder structures. For more info, check out the terrific cookiecutter documentation, which is as much a data science project philosophy as anything else.
Code Version Control: git
I won’t dwell on this point, as it should be a given. The modern world of software development has moved far away from the murky Wild West of the pre 2000s. Every man and his dog should be using some kind of version control for their projects. Simple collaboration, efficient versioning, wind-backs, and code backup. Enough said.
Using git is just the first step. Using it well is another thing altogether.
Committing code: “commit early and commit often” is sound advice. Avoid committing huge chunks of code in favour of committing small, isolated functionalities. When you commit, write descriptive commit messages that accurately log your change.
Multi-user best practices: this is largely context driven. The way I work with my team is to have a master branch that is never directly pushed to, a develop branch where the running version of the code is working but never directly worked on, and then feature branches where individual members of the team will code features which are later merged to the develop branch. When the develop branch is ready for release, it is merged to the master branch. This encapsulated way of working makes it simple for multiple developers to work on specific features without disturbing the main codebase, which makes merge conflicts less likely. For more info, check out this link.
Model and Data Version Control: dvc
Models and data are not the same as code, and they should never be pushed into a code repository. They have unique lifecycle management requirements with different operational constraints. However, the same philosophy that applies to code versioning should apply to that of data and model versioning.
A great open source tool called dvc, built on top of git, is what you might be looking for here. It’s essentially a data pipeline building tool with easy options for data versioning, and in part helps to tackle the crisis of reproducibility in data science. It’s able to effectively and efficiently push your data and models to your server, whether local, AWS S3, GCS, Azure, SSH, HDFS, or HTTP.
dvc is centred around 3 main ideas:
Side note: an alternative to version-controlling entire datasets is to store the metadata necessary to recreate those datasets, along with references to the model created off the back of that metadata.
Build from the environment up: virtualenv
If it’s not in your common practice playbook to compartmentalise your environments, you’ve probably spent an afternoon or two balancing system wide library versions. Maybe you were working on one project, then moved onto another, updated numpy and poof!, your first project has a dependency break.
Imagine another situation where your project is pulled from git by another team member, who’s using a different version of some library than you are. They write some code which relies on a new function that your version doesn’t have and push back into the master branch (word to the wise: don’t ever push directly to the master branch), which you pull. Your code breaks. Great.
Avoid this by using virtual environments. For simple Python projects, use virtualenv. If you have complex environment needs, use something like docker.
Here’s a simple virtualenv workflow:
1. mkvirtualenv when creating a new project
2. pip install the packages that your analysis needs
3. pip freeze > requirements.txt to pin the exact package versions used to recreate the analysis
4. pip freeze > requirements.txt again whenever the installed packages change, and commit the changes to version control.
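The pinned file from the workflow above can also be checked from Python. The sketch below is an assumption-laden illustration, not part of virtualenv or pip: it only understands simple name==version lines, and the pins shown are hypothetical.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for simple 'name==version' lines."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blanks and anything that is not an exact pin
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(package):
    """Look up the installed version, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {package: (pinned, installed)} wherever the two differ."""
    mismatches = {}
    for package, pinned in pins.items():
        installed = lookup(package)
        if installed != pinned:
            mismatches[package] = (pinned, installed)
    return mismatches


# Hypothetical pins, in the shape pip freeze writes them.
example_requirements = """
numpy==1.16.4
pandas==0.24.2  # hypothetical pin for illustration
"""
pins = parse_pins(example_requirements)
```

Calling find_mismatches(pins) inside an environment lists every package whose installed version has drifted from the pinned one, which is exactly the situation described in the two scenarios above.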
Jupyter Notebooks
Jupyter Notebooks are widespread in data science. They are built around the literate programming paradigm and act as a powerful medium that blends rapid prototyping and ease of development with the capability to produce slick presentations, with intermediate code segments interleaved with output and explanatory text. Beautiful stuff.
But for all the advantages Notebooks bring to the table, they also cause a lot of suffering. How many Untitled7.ipynb files do you have sitting around your computer? Perhaps the biggest setback of Notebooks is the way they so poorly mesh with version control.
The reason is that they belong to a class of editors known as What You See is What You Get, whereby the editing software allows the user to view something very similar to the end result. The implication of this is that the document is abstracting away metadata, and in the case of Notebooks, it’s embedding your code by encasing it inside a large JSON data structure, with binary data such as images kept as base64-encoded blobs.
Some clarification here though. Git can handle Notebooks in that you can push them to your repository. Git cannot handle them well when it comes to comparing different versions of a Notebook, and it struggles to provide you with reliable analysis of the code written in them as well.
If you dig around for checked-in Notebooks within your company, or even just on public GitHub, you’re likely to find database credentials, sensitive data, ‘don’t run this cell’ code blobs, and a whole host of other bad practices. To get around much of this pain, you could just purge the output each time you’re ready to check in a Notebook. But that’s manual, means you’ll have to rerun the code to produce your output each time, and if there are multiple users pulling and pushing the same Notebook, even purged Notebook metadata will change.
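To see why output ends up in version control, it helps to look at the JSON itself. The sketch below builds a tiny hand-written notebook in the nbformat 4 layout and clears its outputs using only the standard library. It illustrates the manual purge described above and is an assumption about how one might script it, not how any particular tool does it.

```python
import copy
import json

# A hand-written, minimal notebook in the nbformat 4 JSON layout.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Exploration"],
        },
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}


def strip_outputs(nb):
    """Return a copy of the notebook with code-cell outputs cleared."""
    cleaned = copy.deepcopy(nb)
    for cell in cleaned["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


cleaned = strip_outputs(notebook)
# This string is what would be written back to the .ipynb file.
serialised = json.dumps(cleaned, indent=1, sort_keys=True)
```

Even after stripping, the code still lives inside JSON rather than a plain script, which is why diffs stay awkward.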
By way of tooling, there are options. One of the most popular is jupytext. Here is a great tutorial by the author on how to use it. All you have to do is install jupytext, which will provide a neat in-Notebook dropdown for downloading markdown versions of your code with output omitted, then explicitly ignore all .ipynb files in
Unit testing
Unit testing your code can be a useful way to ensure that your isolated code blocks are working as expected. It allows you to automate your testing processes, catches bugs early, makes the process more agile, and ultimately helps you design better systems. The most commonly used testing framework in Python is the ‘batteries included’, built-in and standard module called unittest, which provides a rich set of tools for constructing and running tests. Another testing tool is pytest.
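Here is a minimal sketch of what a unittest test case can look like. The function under test, normalise, is a made-up helper for illustration. Each test method checks one small behaviour and has a descriptive name.

```python
import unittest


def normalise(values):
    """Scale a list of numbers so that they sum to 1 (hypothetical helper)."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalise values that sum to zero")
    return [value / total for value in values]


class TestNormalise(unittest.TestCase):
    def test_output_sums_to_one(self):
        self.assertAlmostEqual(sum(normalise([1, 2, 3])), 1.0)

    def test_preserves_relative_proportions(self):
        self.assertEqual(normalise([1, 3]), [0.25, 0.75])

    def test_zero_total_raises_value_error(self):
        with self.assertRaises(ValueError):
            normalise([0, 0])


def run_tests():
    """Load and run the test case, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNormalise)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project the tests would live in their own file, and running python -m unittest from the project root would discover and run them without the run_tests helper.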
There are many tutorials on how to carry out unit testing, but here is a rundown of the key tips. Just as exception handling should wrap as little functionality as possible, each unit test should focus on one tiny bit of functionality and prove that it works. Each unit test should be fully independent and able to run alone. Your testing suite should be run before and after developing new features. In fact, it’s a good idea to implement a hook that automatically runs all tests before code is pushed to a shared repo. This type of testing is often part of a CI/CD pipeline, and one of many example services for this is Travis CI. Try to make your tests fast! Use long and descriptive names for each testing function. The tests should sit in a separate directory to your source code; see the folder structuring section for more info.
Logging: PEP282
Logging is a crucial part of any system that moves beyond POC territory. It’s a means of tracking what happens during program execution, and persisting that information to disk. It aids greatly in debugging, leaving a trail of breadcrumbs that helps you track down your bug.
It often serves one of two purposes. There’s diagnostic logging, which records events related to the application’s operation, and audit logging, which records basic runtime analytics for MI reporting. Each log message also has a level: debug, info, warning, error, or critical.
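The levels can be seen in action with Python's standard library logging module. In this sketch the logger name and the messages are hypothetical, and an in-memory stream stands in for a log file so the example is self-contained; swapping the handler for logging.FileHandler would persist the same lines to disk.

```python
import io
import logging

log_stream = io.StringIO()  # stands in for a log file on disk
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

logger = logging.getLogger("pipeline.example")  # hypothetical logger name
logger.setLevel(logging.INFO)  # messages below INFO are dropped
logger.addHandler(handler)
logger.propagate = False  # keep the example's output out of the root logger

logger.debug("row-level detail")  # below the INFO threshold, not recorded
logger.info("loaded %d rows", 1000)
logger.warning("column %s has missing values", "age")
logger.error("model file not found")

recorded_lines = log_stream.getvalue().splitlines()
```

Setting the threshold per logger means verbose debug messages can stay in the code and be switched on only when you are chasing a bug.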
logging is the built-in, standard library Python module for logging.
Wrap up
If you’ve stuck with this until the end, kudos. You deserve a hearty pat on the back. Here’s hoping at least some of the above information was useful. Now get out there and code.