Beginner's Guide to Jupyter Notebooks for Data Science

This tutorial explains how to install, run, and use Jupyter Notebooks for data science, including tips, best practices, and examples.

The Jupyter Notebook is a web application in which you can create and share documents that contain live code, equations, visualizations, and narrative text, which makes it one of the ideal tools for building the data science skills you need.

What Is A Jupyter Notebook?

Here, "notebook" or "notebook document" denotes a document that contains both code and rich text elements, such as figures, links, and equations. Because of this mix of code and text, these documents are the ideal place to bring together an analysis, its description, and its results, and they can also be executed to perform the data analysis in real time.

The Jupyter Notebook App produces these documents.

We'll talk about this in a bit.

For now, you should know that "Jupyter" is a loose acronym meaning Julia, Python, and R. These programming languages were the first target languages of the Jupyter application, but nowadays, the notebook technology also supports many other languages.

And there you have it: the Jupyter Notebook.

As you just saw, the main components of the whole environment are, on the one hand, the notebooks themselves and the application. On the other hand, you also have a notebook kernel and a notebook dashboard.

Let's look at these components in more detail.

What Is The Jupyter Notebook App?

As a server-client application, the Jupyter Notebook App allows you to edit and run your notebooks via a web browser. The application can be executed on a PC without Internet access, or it can be installed on a remote server, where you can access it through the Internet.

Its two main components are the kernels and a dashboard.

A kernel is a program that runs and introspects the user’s code. The Jupyter Notebook App has a kernel for Python code, but there are also kernels available for other programming languages.

The dashboard of the application not only shows you the notebook documents that you have made and can reopen, but it can also be used to manage the kernels: you can see which ones are running and shut them down if necessary.

The History of IPython and Jupyter Notebooks

To fully understand what the Jupyter Notebook is and what functionality it has to offer, you need to know how it originated.

Let's back up briefly to the late 1980s, when Guido van Rossum began working on Python at the National Research Institute for Mathematics and Computer Science in the Netherlands.

Wait, maybe that's too far.

Let's jump ahead to late 2001, a little over a decade later, when Fernando Pérez started developing IPython.

In 2005, Robert Kern and Fernando Pérez attempted to build a notebook system. Unfortunately, the prototype never became fully usable.

Fast forward two years: the IPython team had kept working, and in 2007, they made another attempt at implementing a notebook-type system. By October 2010, there was a prototype of a web notebook, and in the summer of 2011, this prototype was incorporated and released with IPython 0.12 on December 21, 2011. In subsequent years, the team received awards, such as the Award for the Advancement of Free Software for Fernando Pérez on March 23, 2013, and the Jolt Productivity Award, as well as funding from the Alfred P. Sloan Foundation, among others.

Lastly, in 2014, Project Jupyter started as a spin-off project from IPython. IPython is now the name of the Python backend, which is also known as the kernel. Recently, the next generation of Jupyter Notebooks has been introduced to the community. It's called JupyterLab.

After all this, you might wonder where this idea of notebooks originated or how it occurred to the creators.

A brief look into the history of these notebooks shows that Fernando Pérez and Robert Kern were working on a notebook at the same time as the Sage notebook was a work in progress. Since the layout of the Sage notebook was based on the layout of Google notebooks, you can also conclude that Google had a notebook feature around that time.

As for the idea of the notebook itself, both Fernando Pérez and William Stein, one of the creators of the Sage notebook, have confirmed that they were avid users of Mathematica notebooks and Maple worksheets. The Mathematica notebook front end, or GUI, was created in 1988 by Theodore Gray.

The concept of a notebook, which contains ordinary text and calculation and/or graphics, was definitely not new.

Also, the developers were in close contact with one another, and this, together with other failed attempts at GUIs for IPython and the rise of "AJAX" web applications, which didn't require users to refresh the whole page every time they did something, were two other motivations for William Stein's team to start developing the Sage notebooks.

If you want to know more details, check out the personal accounts of Fernando Pérez and William Stein about the history of their notebooks. Alternatively, you can read more on the history and evolution from IPython to Jupyter notebooks here.

How To Install Jupyter Notebook

Running Jupyter Notebooks With The Anaconda Python Distribution

One of the requirements here is Python, either Python 3.3 or greater, or Python 2.7. The general recommendation is that you use the Anaconda distribution to install both Python and the notebook application.

The advantage of Anaconda is that you have access to over 720 packages that can easily be installed with Anaconda's conda, a package, dependency, and environment manager. You can follow the instructions for the installation of Anaconda here for Mac or Windows.

Is something not clear? You can always read up on the Jupyter installation instructions here.

Running Jupyter Notebook The Pythonic Way: Pip

If you don't want to install Anaconda, you just have to make sure that you have the latest version of pip. If you have installed Python, you will typically already have it.

What you do need to do is upgrade pip:

# On Windows
python -m pip install -U pip setuptools
# On OS X or Linux
pip install -U pip setuptools

Once you have pip, you can just run

# Python2
pip install jupyter
# Python 3
pip3 install jupyter

If you need more information about installing packages in Python, you can go to this page.

Running Jupyter Notebooks in Docker Containers

Docker is an excellent platform to run software in containers. These containers are self-contained and isolated processes.

This sounds a bit like a virtual machine, right?

Not really. Go here to read an explanation on why they are different, complete with a fantastic house metaphor.

You can easily get started with Docker by installing the Docker Toolbox: it contains all the tools you need to get your containers up and running. Follow the installation instructions, select the "Docker QuickStart Terminal", and choose to install the Kitematic Visual Management tool as well if you don't have it or any other virtualization platform installed.

The installation through the Docker Quickstart Terminal can take some time, but then you're good to go. Use the command docker run to run Docker "images". You can consider these images as pre-packaged bundles of software that can be automatically downloaded from the Docker Hub when you run them.

Tip: browse the Docker Image Library for thousands of the most popular software tools. You will also find other notebooks that you can run in your Docker container, such as the Data Science Notebook, the R Notebook, and many more.

To run the official Jupyter Notebook image in your Docker container, enter the following command in your Docker Quickstart Terminal:

docker run --rm -it -p 8888:8888 -v "$(pwd):/notebooks" jupyter/notebook 

Tip: if you want to download other images, such as the Data Science Notebook mentioned above, you just have to replace the "jupyter/notebook" bit with the repository name you find in the Docker Image Library, such as "jupyter/datascience-notebook".

The newest Jupyter HTML Notebook image will be downloaded and started, after which you can open the application. Continue reading to see how you can do that!

How To Use Jupyter Notebooks

Now that you know what you'll be working with and you have installed it, it's time to get started for real!

Getting Started With Jupyter Notebooks

Run the following command to open up the application:

jupyter notebook

Then you'll see the application open in your web browser at the following address: http://localhost:8888.

The "Files" tab is where all your files are kept, the "Running" tab keeps track of all your processes and the third tab, "Clusters", is provided by IPython parallel, IPython's parallel computing framework. It allows you to control many individual engines, which are an extended version of the IPython kernel.

You probably want to start by making a new notebook.

You can easily do this by clicking on the "New" button in the "Files" tab. You see that you have the option to make a regular text file, a folder, and a terminal. Lastly, you will also see the option to make a Python 3 notebook.

Note that this last option will depend on the version of Python that you have installed. Also, if the application shows python [conda root] and python [default] as kernel names instead of Python 3, you can try executing conda remove _nb_ext_conf or read up on the following Github issue and make the necessary adjustments.

Let's start with the regular text file. When it opens up, you see that it looks like any other text editor. You can toggle the line numbers and/or the header, you can indicate the programming language you're writing in, and you can do a find and replace. Furthermore, you can save, rename or download the file, or make a new file.

You can also make folders to keep all your documents organized together. Just press the "Folder" option that appears when you press the "New" button in your initial menu, and a new folder will appear in your overview. It's best to rename the folder right away, as it will initially appear as 'Untitled Folder'.

Thirdly, the terminal is there to support browser-based interactive terminal sessions. It works just like your terminal or cmd application: type python into the terminal, press ENTER, and you're good to go.

Tip: if you ever need a pure IPython terminal, you can type 'ipython' in your Terminal or Cmd. This can come in handy when, for example, you want clearer error messages than the ones that appear in the terminal while you're running the notebook application.

If you want to start on your notebook, go back to the main menu and click the "Python 3" option in the "Notebook" category.

You will immediately see the notebook name, a menu bar, a toolbar and an empty code cell.

You can immediately start with importing the necessary libraries for your code. This is one of the best practices that we will discuss in more detail later on.

Afterwards, you can add, remove or edit the cells according to your needs. And don't forget to insert explanatory text or titles and subtitles to clarify your code! That's what makes a notebook a notebook in the end.

Tip: if you want to insert LaTeX in your Markdown cells, you just have to put your LaTeX math between dollar signs ($ ... $), just like this:

$c = \sqrt{a^2 + b^2}$

You can also choose to display your LaTeX output:

from IPython.display import display, Math, Latex
display(Math(r'\sqrt{a^2 + b^2}')) 

Are you not sure what a whole notebook looks like? Hop over to the last section to discover the best ones out there!

Toggling Between Python 2 and 3 in Jupyter Notebooks

Up until now, working with notebooks has been quite straightforward.

But what if you don't just want to use Python 3 or 2? What if you want to change between the two?

Luckily, the kernels can solve this problem for you! You can easily create a new conda environment to use different notebook kernels:

# Python 2.7
conda create -n py27 python=2.7 ipykernel
# Python 3.5
conda create -n py35 python=3.5 ipykernel

Restart the application, and the two kernels should be available to you. Very important: don't forget to (de)activate the environment you (don't) need with the following commands:

source activate py27
source deactivate

If you need more information, check out this page.

You can also manually register your kernels, for example:

conda create -n py27 python=2.7
source activate py27
conda install notebook ipykernel
ipython kernel install --user

To configure the Python 3.5 environment, you can use the same commands but replace py27 with py35 and the version number with 3.5.

Alternatively, if you're working with Python 3 and you want to set up a Python 2 kernel, you can also do this:

python2 -m pip install ipykernel
python2 -m ipykernel install --user

Running R in Your Jupyter Notebook

As the explanation of the kernels in the first section already suggested, you can also run other languages besides Python in your notebook!

If you want to use R with Jupyter Notebooks but without running it inside a Docker container, you can run the following command to install the R essentials in your current environment:

conda install -c r r-essentials

These "essentials" include the packages [dplyr](https://www.rdocumentation.org/packages/dplyr/versions/0.5.0), [shiny](https://www.rdocumentation.org/packages/shiny/versions/0.14.2), [ggplot2](https://www.rdocumentation.org/packages/ggplot2/versions/2.2.0), [tidyr](https://www.rdocumentation.org/packages/tidyr/versions/0.6.0), [caret](https://www.rdocumentation.org/packages/caret/versions/6.0-73), and [nnet](https://www.rdocumentation.org/packages/nnet/versions/7.3-12). If you don't want to install the essentials in your current environment, you can use the following command to create a new environment just for the R essentials:

conda create -n my-r-env -c r r-essentials 

Open up the notebook application with the usual command to start working with R.

If you now want to install additional R packages to elaborate your data science project, you can either build a Conda R package by running, for example:

conda skeleton cran ldavis
conda build r-ldavis/

Or you can install the package from inside R via install.packages() or devtools::install_github() (for packages on GitHub). You just have to make sure to add the new package to the correct R library used by Jupyter:

install.packages("ldavis", "/home/user/anaconda3/lib/R/library")

Note that you can also install the IRKernel, a kernel for R, to work with R in your notebook. You can follow the installation instructions here.

Note that you also have kernels to run languages such as Julia, SAS, ... in your notebook. Go here for a complete list of the kernels that are available. This list also contains links to the respective pages that have installation instructions to get you started.

Tip: if you're still unsure of how you would be working with these different kernels or if you want to experiment with different kernels yourself, go to this page, where you can try out kernels such as Apache Toree (Scala), Ruby, Julia, ...

Making Your Jupyter Notebook Magical

If you want to get the most out of your notebooks with the IPython kernel, you should consider learning about the so-called "magic commands". Also, consider adding even more interactivity to your notebook so that it becomes an interactive dashboard for others.

The Notebook's Built-In Commands

There are some predefined ‘magic functions’ that will make your work a lot more interactive.

To see which magic commands you have available in your interpreter, you can simply run the following:

%lsmagic

Tip: the regular Python help() function also still works and you can use the magic command %quickref to show a quick reference sheet for IPython.

And you'll see a whole bunch of them appear. You'll probably recognize some of the magic commands, such as %save, %clear or %debug, but others will be less straightforward.

If you're looking for more information on the magic commands or on functions, you can always use the ?, just like this:

# Retrieving documentation on the alias_magic command
?%alias_magic

# Retrieving information on the range() function
?range

Note that if you want to run a single-line expression with a magic command, you prefix it with %. For multi-line expressions, use %%. The following example illustrates the difference between the two:

%time x = range(100)
%%timeit x = range(100)
    max(x)

Stated differently, the magic commands are either line-oriented or cell-oriented. In the first case, the commands are prefixed with the % character and they work as follows: they get as an argument the rest of the line. When you want to pass not only the line but also the lines that follow, you need cell-oriented magic: then, the commands need to be prefixed with %%.

Besides the %time and %timeit magics, there are many more handy magic commands out there; you can discover all of them with %lsmagic.

You can also use magics to mix languages in your notebook with the IPython kernel without setting up extra kernels: there is an R magic to run R code, an SQL magic for RDBMS (Relational Database Management System) access, and a Cython magic for interactive work with Cython, ... But there is so much more!

To make use of these magics, you first have to install the necessary packages:

pip install ipython-sql
pip install cython
pip install rpy2

Tip: if you want to install packages, you can also execute these commands as shell commands from inside your notebook by placing a ! in front of the commands, just like this:

# Check, manage and install packages
!pip list
!pip install ipython-sql

# Check the files in your working directory
!ls

Only then, after a successful install, can you load in the magics and start using them:

%load_ext sql
%load_ext cython
%load_ext rpy2.ipython

Let's demonstrate how the magics exactly work with a small example:

# Hide warnings if there are any
import warnings
warnings.filterwarnings('ignore')

# Load in the r magic
%load_ext rpy2.ipython

# We need ggplot2
%R require(ggplot2)

# Load in the pandas library
import pandas as pd

# Make a pandas DataFrame
df = pd.DataFrame({'Alphabet': ['a', 'b', 'c', 'd','e', 'f', 'g', 'h','i'],
                   'A': [4, 3, 5, 2, 1, 7, 7, 5, 9],
                   'B': [0, 4, 3, 6, 7, 10, 11, 9, 13],
                   'C': [1, 2, 3, 1, 2, 3, 1, 2, 3]})

%%R -i df
# -i df takes the Python variable df and assigns it to an R variable of the same name;
# note that the %%R cell magic must be the first line of its own cell

# Plot the DataFrame df
ggplot(data=df) + geom_point(aes(x=A, y=B, color=C))

This is just a first taste and not nearly everything you can do with the R magics, though. You can also push variables from Python to R and pull them back to Python. Read up on the documentation (with easily accessible examples!) here.

Interactive Notebooks As Dashboards: Widgets

The magic commands already do a lot to make your workflow with notebooks agreeable, but you can also take additional steps to make your notebook an interactive place for others by adding widgets to it!

To add widgets to your notebook, you need to import widgets from ipywidgets:

from ipywidgets import widgets

That's enough to get started! You might want to think now about what type of widget you want to add. The basic types of widgets are text inputs, buttons, and other input-based widgets.

See an example of a text input widget below:
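
For instance, a minimal text widget that greets the user whenever its value changes could look something like this (the widget label and callback name are just illustrative):

from ipywidgets import widgets
from IPython.display import display

# Create a simple text input widget and show it in the notebook
text = widgets.Text(description='Name:')
display(text)

# This callback runs every time the widget's value changes
def on_value_change(change):
    print('Hello, {}!'.format(change['new']))

text.observe(on_value_change, names='value')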

For a full walkthrough, see the wonderful tutorial on building interactive dashboards in Jupyter, which you can find on this page.

Share Your Jupyter Notebooks

In practice, you might want to share your notebooks with colleagues or friends to show them what you have been up to or as a data science portfolio for future employers. However, the notebook documents are JSON documents that contain text, source code, rich media output, and metadata. Each segment of the document is stored in a cell.

Ideally, you don't want to go around and share JSON files.

That's why you want to find and use other ways to share your notebook documents with others.

When you create a notebook, you will see a "File" button in the menu bar. When you click it, Jupyter gives you the option to download your notebook as HTML, PDF, Markdown, reStructuredText, a Python script, or a notebook file.

You can use the nbconvert command to convert your notebook document to another static format, such as HTML, PDF, LaTeX, Markdown, reStructuredText, ... But don't forget to install nbconvert first if you don't have it yet!

Then, you can enter something like the following command to convert your notebooks:

jupyter nbconvert --to html Untitled4.ipynb

With nbconvert, you can also execute an entire notebook non-interactively, saving it in place or exporting it to a variety of other formats. This makes notebooks a powerful tool for ETL and for reporting. For reporting, you just schedule a run of the notebook every so many days, weeks or months; for an ETL pipeline, you can make use of the magic commands in your notebook in combination with some type of scheduling.

Besides these options, you could also consider the following:

  • You can create, list and load GitHub Gists from your notebook documents. You can find more information here. Gists are a way to share your work because you can share single files, parts of files, or full applications.
  • With jupyterhub, you can spawn, manage, and proxy multiple instances of the single-user Jupyter notebook server. In other words, it's a platform for hosting notebooks on a server with multiple users. That makes it the ideal resource to provide notebooks to a class of students, a corporate data science group, or a scientific research group.
  • Make use of binder and tmpnb to get temporary environments to reproduce your notebook execution.
  • You can use nbviewer to render notebooks as static web pages.
  • To turn your notebooks into slideshows, you can turn to nbpresent and RISE.
  • jupyter_dashboards will come in handy if you want to display notebooks as interactive dashboards.
  • Create a blog from your notebook with the Pelican plugin.

Jupyter Notebooks in Practice

This all is very interesting when you're working alone on a data science project. But most times, you're not alone. You might have some friends look at your code, or you'll need your colleagues to contribute to your notebook.

How should you actually use these notebooks in practice when you're working in a team?

The following tips will help you to effectively and efficiently use notebooks on your data science project.

Tips To Effectively and Efficiently Use Your Jupyter Notebooks

Using these notebooks doesn't mean that you don't need to follow the coding practices that you would usually apply.

You probably already know the drill, but these principles include the following:

  • Try to provide comments and documentation to your code. They might be a great help to others!
  • Also consider a consistent naming scheme, code grouping, limiting your line length, ...
  • Don't be afraid to refactor when or if necessary.

In addition to these general best practices for programming, you could also consider the following tips to make your notebooks the best source for other users to learn:

  • Don't forget to name your notebook documents!
  • Try to keep the cells of your notebook simple: don't exceed the width of your cell and make sure that you don't put too many related functions in one cell.
  • If possible, import your packages in the first code cell of your notebook, and
  • Display the graphics inline. The magic command %matplotlib inline will definitely come in handy here. Don't forget to end your plotting line with a semicolon to suppress the text output and just get the plot itself back (see the small sketch after this list).
  • Sometimes, your notebook can become quite code-heavy, or maybe you just want to have a cleaner report. In those cases, you could consider hiding some of this code. You can already hide some of the code by using magic commands such as %run to execute a whole Python script as if it was in a notebook cell. However, this might not help you to the extent that you expect. In such cases, you can always check out this tutorial on optional code visibility or consider toggling your notebook's code cells.
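
As a small sketch of those last two tips (the numbers plotted here are made up purely for illustration):

%matplotlib inline
import matplotlib.pyplot as plt

# The trailing semicolon suppresses the printed return value, so only the plot shows
plt.plot([1, 2, 3, 4], [10, 20, 25, 30]);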

Jupyter Notebooks for Data Science Teams: Best Practices

Jonathan Whitmore wrote in his article some practices for using notebooks for data science and specifically addresses the fact that working with the notebook on data science problems in a team can prove to be quite a challenge.

That is why Jonathan suggests some best practices:

  • Use two types of notebooks for a data science project, namely, a lab notebook and a deliverable notebook. The difference between the two (besides the obvious that you can infer from the names that are given to the notebooks) is the fact that individuals control the lab notebook, while the deliverable notebook is controlled by the whole data science team,
  • Use some type of version control (Git, GitHub, ...). Don't forget to also commit the HTML file if your version control system lacks rendering capabilities, and
  • Use explicit rules on the naming of your documents.

Thanks for reading


Further reading

Complete Python Bootcamp: Go from zero to hero in Python 3

NumPy Tutorial for Beginners

A Beginner’s Tutorial to Jupyter Notebooks

How To Set Up Jupyter Notebook with Python 3 on Ubuntu 18.04

Jupyter Notebook: Forget CSV, Fetch Data With Python

Step-by-Step Guide to Creating R and Python Libraries (in JupyterLab)

Set Your Jupyter Notebook up Right with this Extension

Data Science with Python explained

An overview of using Python for data science including Numpy, Scipy, pandas, Scikit-Learn, XGBoost, TensorFlow and Keras.

So you’ve heard of data science and you’ve heard of Python.

You want to explore both but have no idea where to start — data science is pretty complicated, after all.

Don’t worry — Python is one of the easiest programming languages to learn. And thanks to the hard work of thousands of open source contributors, you can do data science, too.

If you look at the contents of this article, you may think there’s a lot to master, but this article has been designed to gently increase the difficulty as we go along.

One article obviously can’t teach you everything you need to know about data science with python, but once you’ve followed along you’ll know exactly where to look to take the next steps in your data science journey.

Table of contents:

  • Why Python?
  • Installing Python
  • Using Python for Data Science
  • Numeric computation in Python
  • Statistical analysis in Python
  • Data manipulation in Python
  • Working with databases in Python
  • Data engineering in Python
  • Big data engineering in Python
  • Further statistics in Python
  • Machine learning in Python
  • Deep learning in Python
  • Data science APIs in Python
  • Applications in Python
  • Summary
Why Python?

Python, as a language, has a lot of features that make it an excellent choice for data science projects.

It’s easy to learn, simple to install (in fact, if you use a Mac you probably already have it installed), and it has a lot of extensions that make it great for doing data science.

Just because Python is easy to learn doesn't mean it's a toy programming language — huge companies like Google use Python for their data science projects, too. They even contribute packages back to the community, so you can use the same tools in your projects!

You can use Python to do way more than just data science — you can write helpful scripts, build APIs, build websites, and much much more. Learning it for data science means you can easily pick up all these other things as well.

Things to note

There are a few important things to note about Python.

Right now, there are two versions of Python that are in common use. They are versions 2 and 3.

Most tutorials, and the rest of this article, will assume that you’re using the latest version of Python 3. It’s just good to be aware that sometimes you can come across books or articles that use Python 2.

The difference between the versions isn’t huge, but sometimes copying and pasting version 2 code when you’re running version 3 won’t work — you’ll have to do some light editing.

The second important thing to note is that Python really cares about whitespace (that’s spaces and return characters). If you put whitespace in the wrong place, your programme will very likely throw an error.

There are tools out there to help you avoid doing this, but with practice you’ll get the hang of it.

If you’ve come from programming in other languages, Python might feel like a bit of a relief: there’s no need to manage memory and the community is very supportive.

If Python is your first programming language you’ve made an excellent choice. I really hope you enjoy your time using it to build awesome things.

Installing Python

The best way to install Python for data science is to use the Anaconda distribution (you’ll notice a fair amount of snake-related words in the community).

It has everything you need to get started using Python for data science including a lot of the packages that we’ll be covering in the article.

If you click on Products -> Distribution and scroll down, you’ll see installers available for Mac, Windows and Linux.

Even if you have Python available on your Mac already, you should consider installing the Anaconda distribution as it makes installing other packages easier.

If you prefer to do things yourself, you can go to the official Python website and download an installer there.

Package Managers

Packages are pieces of Python code that aren’t a part of the language but are really helpful for doing certain tasks. We’ll be talking a lot about packages throughout this article so it’s important that we’re set up to use them.

Because the packages are just pieces of Python code, we could copy and paste the code and put it somewhere the Python interpreter (the thing that runs your code) can find it.

But that’s a hassle — it means that you’ll have to copy and paste stuff every time you start a new project or if the package gets updated.

To sidestep all of that, we’ll instead use a package manager.

If you chose to use the Anaconda distribution, congratulations — you already have a package manager installed. If you didn’t, I’d recommend installing pip.

No matter which one you choose, you’ll be able to use commands at the terminal (or command prompt) to install and update packages easily.

Using Python for Data Science

Now that you’ve got Python installed, you’re ready to start doing data science.

But how do you start?

Because Python caters to so many different requirements (web developers, data analysts, data scientists) there are lots of different ways to work with the language.

Python is an interpreted language which means that you don’t have to compile your code into an executable file, you can just pass text documents containing code to the interpreter!

Let’s take a quick look at the different ways you can interact with the Python interpreter.

In the terminal

If you open up the terminal (or command prompt) and type the word ‘python’, you’ll start a shell session. You can type any valid Python commands in there and they’d work just like you’d expect.

This can be a good way to quickly debug something but working in a terminal is difficult over the course of even a small project.

Using a text editor

If you write a series of Python commands in a text file and save it with a .py extension, you can navigate to the file using the terminal and, by typing python YOUR_FILE_NAME.py, can run the programme.

This is essentially the same as typing the commands one-by-one into the terminal; it's just much easier to fix mistakes and change what your program does.

In an IDE

An IDE is a professional-grade piece of software that helps you manage software projects.

One of the benefits of an IDE is that you can use debugging features which tell you where you’ve made a mistake before you try to run your programme.

Some IDEs come with project templates (for specific tasks) that you can use to set your project out according to best practices.

Jupyter Notebooks

None of these ways are the best for doing data science with python — that particular honour belongs to Jupyter notebooks.

Jupyter notebooks give you the capability to run your code one ‘block’ at a time, meaning that you can see the output before you decide what to do next — that’s really crucial in data science projects where we often need to see charts before taking the next step.

If you’re using Anaconda, you’ll already have Jupyter lab installed. To start it you’ll just need to type ‘jupyter lab’ into the terminal.

If you’re using pip, you’ll have to install Jupyter lab with the command ‘python pip install jupyter’.

Numeric Computation in Python

It probably won’t surprise you to learn that data science is mostly about numbers.

The NumPy package includes lots of helpful functions for performing the kind of mathematical operations you’ll need to do data science work.

It comes installed as part of the Anaconda distribution, and installing it with pip is just as easy as installing Jupyter notebooks (‘pip install numpy’).

The most common mathematical operations we’ll need to do in data science are things like matrix multiplication, computing the dot product of vectors, changing the data types of arrays and creating the arrays in the first place!

Here’s how you can make a list into a NumPy array:

Here’s how you can do array multiplication and calculate dot products in NumPy:

And here’s how you can do matrix multiplication in NumPy:

Statistics in Python

With mathematics out of the way, we must move forward to statistics.

The Scipy package contains a module (a subsection of a package’s code) specifically for statistics.

You can import it (make its functions available in your programme) into your notebook using the command ‘from scipy import stats’.

This package contains everything you’ll need to calculate statistical measurements on your data, perform statistical tests, calculate correlations, summarise your data and investigate various probability distributions.

Here’s how to quickly access summary statistics (minimum, maximum, mean, variance, skew, and kurtosis) of an array using Scipy:

Data Manipulation with Python

Data scientists have to spend an unfortunate amount of time cleaning and wrangling data. Luckily, the Pandas package helps us do this with code rather than by hand.

The most common tasks that I use Pandas for are reading data from CSV files and databases.

It also has a powerful syntax for combining different datasets together (datasets are called DataFrames in Pandas) and performing data manipulation.

You can see the first few rows of a DataFrame using the .head method:
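
For instance, assuming a hypothetical CSV file called data.csv:

import pandas as pd

df = pd.read_csv('data.csv')  # 'data.csv' is a placeholder file name
print(df.head())              # shows the first five rows by default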

You can select just one column using square brackets:
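
A tiny sketch, with made-up column names:

import pandas as pd

df = pd.DataFrame({'name': ['Ada', 'Bob'], 'age': [36, 42]})
print(df['age'])  # selecting one column returns a pandas Series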

And you can create new columns by combining others:
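
For example, again with invented columns:

import pandas as pd

df = pd.DataFrame({'total_price': [20.0, 9.0], 'quantity': [4, 3]})
df['price_per_unit'] = df['total_price'] / df['quantity']  # combine two columns into a new one
print(df)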

Working with Databases in Python

In order to use the pandas read_sql method, you’ll have to establish a connection to a database.

The most bulletproof method of connecting to a database is by using the SQLAlchemy package for Python.

Because SQL is a language of its own and connecting to a database depends on which database you’re using, I’ll leave you to read the documentation if you’re interested in learning more.
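
Still, the general pattern looks something like this (the connection string and table name are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///my_database.db')    # placeholder SQLite database
df = pd.read_sql('SELECT * FROM customers', engine)   # placeholder table name
print(df.head())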

Data Engineering in Python

Sometimes we’d prefer to do some calculations on our data before they arrive in our projects as a Pandas DataFrame.

If you’re working with databases or scraping data from the web (and storing it somewhere), this process of moving data and transforming it is called ETL (Extract, transform, load).

You extract the data from one place, do some transformations to it (summarise the data by adding it up, finding the mean, changing data types, and so on) and then load it to a place where you can access it.

There’s a really cool tool called Airflow which is very good at helping you manage ETL workflows. Even better, it’s written in Python.

It was developed by Airbnb when they had to move incredible amounts of data around; you can find out more about it here.

Big Data Engineering in Python

Sometimes ETL processes can be really slow. If you have billions of rows of data (or if they’re a strange data type like text), you can recruit lots of different computers to work on the transformation separately and pull everything back together at the last second.

This architecture pattern is called MapReduce and it was made popular by Hadoop.

Nowadays, lots of people use Spark to do this kind of data transformation / retrieval work and there’s a Python interface to Spark called (surprise, surprise) PySpark.

Both the MapReduce architecture and Spark are very complex tools, so I’m not going to go into detail here. Just know that they exist and that if you find yourself dealing with a very slow ETL process, PySpark might help. Here’s a link to the official site.

Further Statistics in Python

We already know that we can run statistical tests, calculate descriptive statistics, p-values, and things like skew and kurtosis using the stats module from Scipy, but what else can Python do with statistics?

One particular package that I think you should know about is the lifelines package.

Using the lifelines package, you can calculate a variety of functions from a subfield of statistics called survival analysis.

Survival analysis has a lot of applications. I’ve used it to predict churn (when a customer will cancel a subscription) and when a retail store might be burglarised.

These are totally different to the applications the creators of the package imagined it would be used for (survival analysis is traditionally a medical statistics tool). But that just shows how many different ways there are to frame data science problems!

The documentation for the package is really good, check it out here.

Machine Learning in Python

Now this is a major topic — machine learning is taking the world by storm and is a crucial part of a data scientist’s work.

Simply put, machine learning is a set of techniques that allows a computer to map input data to output data. There are a few instances where this isn’t the case but they’re in the minority and it’s generally helpful to think of ML this way.

There are two really good machine learning packages for Python, let’s talk about them both.

Scikit-Learn

Most of the time you spend doing machine learning in Python will be spent using the Scikit-Learn package (sometimes abbreviated sklearn).

This package implements a whole heap of machine learning algorithms and exposes them all through a consistent syntax. This makes it really easy for data scientists to take full advantage of every algorithm.

The general framework for using Scikit-Learn goes something like this –

You split your dataset into train and test datasets:
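
A minimal sketch, using one of Scikit-Learn's small built-in datasets as a stand-in for your own data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # example feature matrix X and labels y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)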

Then you instantiate and train a model:
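
Continuing the sketch above, with logistic regression as just one possible choice of model:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)  # any other Scikit-Learn estimator works the same way
model.fit(X_train, y_train)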

And then you use the metrics module to test how well your model works:
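
And, still continuing the same sketch:

from sklearn import metrics

predictions = model.predict(X_test)
print(metrics.accuracy_score(y_test, predictions))  # fraction of correct predictions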

XGBoost

The second package that is commonly used for machine learning in Python is XGBoost.

Where Scikit-Learn implements a whole range of algorithms, XGBoost only implements a single one — gradient boosted decision trees.

This package (and algorithm) has become very popular recently due to its success at Kaggle competitions (online data science competitions that anyone can participate in).

Training the model works in much the same way as a Scikit-Learn algorithm.
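
A rough sketch, reusing the train/test split from the Scikit-Learn example above:

from xgboost import XGBClassifier

model = XGBClassifier()  # gradient boosted decision trees with a Scikit-Learn style API
model.fit(X_train, y_train)
predictions = model.predict(X_test)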

Deep Learning in Python

The machine learning algorithms available in Scikit-Learn are sufficient for nearly any problem. That being said, sometimes you need to use the most advanced thing available.

Deep neural networks have skyrocketed in popularity due to the fact that systems using them have outperformed nearly every other class of algorithm.

There’s a problem though — it’s very hard to say what a neural net is doing and why it’s making the decisions that it is. Because of this, their use in finance, medicine, the law and related professions isn’t widely endorsed.

The two major classes of neural network are convolutional neural networks (which are used to classify images and complete a host of other tasks in computer vision) and recurrent neural nets (which are used to understand and generate text).

Exploring how neural nets work is outside the scope of this article, but just know that the packages you'll need to look for if you want to do this kind of work are TensorFlow (a Google contribution!) and Keras.

Keras is essentially a wrapper for TensorFlow that makes it easier to work with.

Data Science APIs in Python

Once you’ve trained a model, you’d like to be able to access predictions from it in other software. The way you do this is by creating an API.

An API allows your model to receive data one row at a time from an external source and return a prediction.

Because Python is a general purpose programming language that can also be used to create web services, it’s easy to use Python to serve your model via API.

If you need to build an API, you should look into the pickle module and Flask. Pickle allows you to save trained models on your hard drive so that you can use them later. And Flask is the simplest way to create web services.
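
A minimal sketch of that idea (the file name model.pkl and the /predict route are placeholders):

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load a model that was saved earlier with pickle.dump(model, open('model.pkl', 'wb'))
model = pickle.load(open('model.pkl', 'rb'))

@app.route('/predict', methods=['POST'])
def predict():
    # Expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()['features']
    prediction = model.predict(features)
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run()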

Web Applications in Python

Finally, if you’d like to build a full-featured web application around your data science project, you should use the Django framework.

Django is immensely popular in the web development community and was used to build the first version of Instagram and Pinterest (among many others).

Summary

And with that we’ve concluded our whirlwind tour of data science with Python.

We’ve covered everything you’d need to learn to become a full-fledged data scientist. If it still seems intimidating, you should know that nobody knows all of this stuff and that even the best of us still Google the basics from time to time.

Learn Data Science | How to Learn Data Science for Free

In this post, I have described a learning path and free online courses and tutorials that will enable you to learn data science for free.

The average cost of obtaining a masters degree at traditional bricks and mortar institutions will set you back anywhere between $30,000 and $120,000. Even online data science degree programs don’t come cheap costing a minimum of $9,000. So what do you do if you want to learn data science but can’t afford to pay this?

I trained into a career as a data scientist without taking any formal education in the subject. In this article, I am going to share with you my own personal curriculum for learning data science if you can’t or don’t want to pay thousands of dollars for more formal study.

The curriculum will consist of 3 main parts: technical skills, theory and practical experience. I will include links to free resources for every element of the learning path and will also be including some links to additional 'low cost' options. So if you want to spend a little money to accelerate your learning you can add these resources to the curriculum. I will include the estimated costs for each of these.

Technical skills

The first part of the curriculum will focus on technical skills. I recommend learning these first so that you can take a practical-first approach rather than, say, learning the mathematical theory first. Python is by far the most widely used programming language for data science. In the Kaggle Machine Learning and Data Science survey carried out in 2018, 83% of respondents said that they used Python on a daily basis. I would, therefore, recommend focusing on this language but also spending a little time on other languages such as R.

Python Fundamentals

Before you can start to use Python for data science you need a basic grasp of the fundamentals behind the language. So you will want to take a Python introductory course. There are lots of free ones out there, but I like the Codecademy ones best as they include hands-on in-browser coding throughout.

I would suggest taking the introductory course to learn Python. This covers basic syntax, functions, control flow, loops, modules and classes.

Data analysis with python

Next, you will want to get a good understanding of using Python for data analysis. There are a number of good resources for this.

To start with I suggest taking at least the free parts of the data analyst learning path on dataquest.io. Dataquest offers complete learning paths for data analyst, data scientist and data engineer. Quite a lot of the content, particularly on the data analyst path is available for free. If you do have some money to put towards learning then I strongly suggest putting it towards paying for a few months of the premium subscription. I took this course and it provided a fantastic grounding in the fundamentals of data science. It took me 6 months to complete the data scientist path. The price varies from $24.50 to $49 per month depending on whether you pay annually or not. It is better value to purchase the annual subscription if you can afford it.

The Dataquest platform

Python for machine learning

If you have chosen to pay for the full data science course on Dataquest then you will have a good grasp of the fundamentals of machine learning with Python. If not then there are plenty of other free resources. I would focus to start with on scikit-learn which is by far the most commonly used Python library for machine learning.

When I was learning, I was lucky enough to attend a two-day workshop run by Andreas Mueller, one of the core developers of scikit-learn. He has, however, published all the material from this course, and others, on this Github repo. These consist of slides, course notes and notebooks that you can work through. I would definitely recommend working through this material.

Then I would suggest taking some of the tutorials in the scikit-learn documentation. After that, I would suggest building some practical machine learning applications and learning the theory behind how the models work — which I will cover a bit later on.

SQL

SQL is a vital skill to learn if you want to become a data scientist as one of the fundamental processes in data modelling is extracting data in the first place. This will more often than not involve running SQL queries against a database. Again if you haven’t opted to take the full Dataquest course then here are a few free resources to learn this skill.

Codecademy has a free introduction to SQL course. Again this is very practical, with in-browser coding all the way through. If you also want to learn about cloud-based database querying then Google Cloud BigQuery is very accessible. There is a free tier so you can try queries for free, an extensive range of public datasets to try, and very good documentation.

Codecademy SQL course

R

To be a well-rounded data scientist it is a good idea to diversify a little from just Python. I would, therefore, suggest also taking an introductory course in R. Codecademy has an introductory course on their free plan. It is probably worth noting here that, similar to Dataquest, Codecademy also offers a complete data science learning plan as part of their pro account (this costs from $15.99 to $31.99 per month depending on how many months you pay for up front). I personally found the Dataquest course to be much more comprehensive, but this may work out a little cheaper if you are looking to follow a learning path on a single platform.

Software engineering

It is a good idea to get a grasp of software engineering skills and best practices. This will help your code to be more readable and extensible both for yourself and others. Additionally, when you start to put models into production you will need to be able to write good quality well-tested code and work with tools like version control.

There are two great free resources for this. Python like you mean it covers things like the PEP8 style guide, documentation and also covers object-oriented programming really well.

The scikit-learn contribution guidelines, although written to facilitate contributions to the library, actually cover the best practices really well. This covers topics such as Github, unit testing and debugging and is all written in the context of a data science application.

Deep learning

For a comprehensive introduction to deep learning, I don’t think that you can get any better than the totally free and totally ad-free fast.ai. This course includes an introduction to machine learning, practical deep learning, computational linear algebra and a code-first introduction to natural language processing. All their courses have a practical first approach and I highly recommend them.

Fast.ai platform

Theory

Whilst you are learning the technical elements of the curriculum you will encounter some of the theory behind the code you are implementing. I recommend that you learn the theoretical elements alongside the practical. The way that I do this is to first learn the code needed to implement a technique (let's take KMeans as an example), and once I have something working, I will then look deeper into concepts such as inertia. Again, the scikit-learn documentation contains all the mathematical concepts behind the algorithms.

In this section, I will introduce the key foundational elements of theory that you should learn alongside the more practical elements.

Khan Academy covers almost all of the concepts I have listed below for free. You can tailor the subjects you would like to study when you sign up, and you then have a nice tailored curriculum for this part of the learning path. Checking all of the relevant boxes when you sign up will give you an overview of most of the elements listed below.

Maths

Calculus

Calculus is defined by Wikipedia as “the mathematical study of continuous change.” In other words calculus can find patterns between functions, for example, in the case of derivatives, it can help you to understand how a function changes over time.

Many machine learning algorithms utilise calculus to optimise the performance of models. If you have studied even a little machine learning you will probably have heard of Gradient descent. This functions by iteratively adjusting the parameter values of a model to find the optimum values to minimise the cost function. Gradient descent is a good example of how calculus is used in machine learning.
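
As a rough sketch of that idea, the basic gradient descent update can be written as $\theta := \theta - \alpha \nabla_{\theta} J(\theta)$, where $\theta$ stands for the model parameters, $\alpha$ for the learning rate, and $J$ for the cost function: at each step, the parameters move a small distance against the gradient of the cost.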

What you need to know:

Derivatives

  • Geometric definition
  • Calculating the derivative of a function
  • Nonlinear functions

Chain rule

  • Composite functions
  • Composite function derivatives
  • Multiple functions

Gradients

  • Partial derivatives
  • Directional derivatives
  • Integrals

Linear Algebra

Many popular machine learning methods, including XGBoost, use matrices to store inputs and process data. Matrices, alongside vector spaces and linear equations, form the mathematical branch known as linear algebra. In order to understand how many machine learning methods work, it is essential to get a good understanding of this field.

What you need to learn:

Vectors and spaces

  • Vectors
  • Linear combinations
  • Linear dependence and independence
  • Vector dot and cross products

Matrix transformations

  • Functions and linear transformations
  • Matrix multiplication
  • Inverse functions
  • Transpose of a matrix

Statistics

Here is a list of the key concepts you need to know:

Descriptive/Summary statistics

  • How to summarise a sample of data
  • Different types of distributions
  • Skewness, kurtosis, central tendency (e.g. mean, median, mode)
  • Measures of dependence, and relationships between variables such as correlation and covariance

Experiment design

  • Hypothesis testing
  • Sampling
  • Significance tests
  • Randomness
  • Probability
  • Confidence intervals and two-sample inference

Machine learning

  • Inference about slope
  • Linear and non-linear regression
  • Classification

Practical experience

The third section of the curriculum is all about practice. In order to truly master the concepts above you will need to use the skills in some projects that ideally closely resemble a real-world application. By doing this you will encounter problems to work through such as missing and erroneous data and develop a deep level of expertise in the subject. In this last section, I will list some good places you can get this practical experience from for free.

“With deliberate practice, however, the goal is not just to reach your potential but to build it, to make things possible that were not possible before. This requires challenging homeostasis — getting out of your comfort zone — and forcing your brain or your body to adapt.”, Anders Ericsson, Peak: Secrets from the New Science of Expertise

Kaggle, et al

Machine learning competitions are a good place to get practice with building machine learning models. They give access to a wide range of data sets, each with a specific problem to solve, and have a leaderboard. The leaderboard is a good way to benchmark how good you actually are at developing models and where you may need to improve further.

In addition to Kaggle, there are other platforms for machine learning competitions including Analytics Vidhya and DrivenData.

Driven data competitions page

UCI Machine Learning Repository

The UCI machine learning repository is a large source of publicly available data sets. You can use these data sets to put together your own data projects; these could include data analysis and machine learning models, and you could even try building a deployed model with a web front end. It is a good idea to store your projects somewhere public, such as Github, as this can create a portfolio showcasing your skills to use for future job applications.


UCI repository

Contributions to open source

One other option to consider is contributing to open source projects. There are many Python libraries that rely on the community to maintain them and there are often hackathons held at meetups and conferences where even beginners can join in. Attending one of these events would certainly give you some practical experience and an environment where you can learn from others whilst giving something back at the same time. Numfocus is a good example of a project like this.

In this post, I have described a learning path and free online courses and tutorials that will enable you to learn data science for free. Showcasing what you are able to do in the form of a portfolio is a great tool for future job applications in lieu of formal qualifications and certificates. I really believe that education should be accessible to everyone and, certainly, for data science at least, the internet provides that opportunity. In addition to the resources listed here, I have previously published a recommended reading list for learning data science available here. These are also all freely available online and are a great way to complement the more practical resources covered above.

Thanks for reading!

Python For Data Analysis | Build a Data Analysis Library from Scratch | Learn Python in 2019

Immerse yourself in a long, comprehensive project that teaches advanced Python concepts to build an entire library

You’ll learn

  • How to build a Python library similar to pandas
  • How to complete a large, comprehensive project
  • Test-driven development with pytest
  • Environment creation
  • Advanced Python topics such as special methods and property decorators
  • A fully-functioning library that you can use for data analysis