The Pandas Library for Python

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming ...

  1. Introduction to Pandas
  2. About the Data
  3. Setup
  4. Loading Data
  5. Basic Operations
  6. The Dtype
  7. Cleansing and Transforming Data
  8. Performing Basic Analysis
  • Calculations
  • Booleans
  • Grouping
  • Plotting

Introduction to Pandas

So, what is Pandas — practically speaking? In short, it’s the major data analysis library for Python. For scientists, students, and professional developers alike, Pandas represents a central reason for any learning or interaction with Python, as opposed to a statistics-specific language like R, or a proprietary academic package like SPSS or Matlab. (Fun fact — Pandas is named after the term Panel Data, and was originally created for the analysis of financial data tables). I like to think that the final “s” stands for Series or Statistics.

Although there are plenty of ways to explore numerical data with Python out of the box, they tend to be slow and require a lot of boilerplate. It may sound hard to believe, but Pandas is often recommended as the next stop for Excel users who are ready to take their data analysis to the next level. Nearly any problem that can be solved with a spreadsheet program can be solved in Pandas — without all the graphical cruft.

More importantly, because problems can be solved in Pandas via Python, solutions are already automated, or could be run as a service in the cloud. Further, Pandas makes heavy use of Numpy, relying on its low-level calls to produce linear math results orders of magnitude more quickly than they would be handled by Python alone. These are just a few of the reasons Pandas is recommended as one of the first libraries to learn for all Pythonistas, and remains absolutely critical to Data Scientists.

About the Data

In this post, we’re going to be using a fascinating data set to demonstrate a useful slice of the Pandas library. This data set is particularly interesting as it’s part of a real world example, and we can all imagine people lined up at an airport (a place where things do occasionally go wrong). When looking at the data, I imagine people sitting in those uncomfortable airport seats having just found out that their luggage is missing — not just temporarily, but it’s nowhere to be found in the system! Or, better yet, imagine that a hardworking TSA employee accidentally broke a precious family heirloom.

So it’s time to fill out another form, of course. Now, getting data from forms is an interesting process as far as data gathering is concerned, as we have a set of data that happens at specific times. This actually means we can interpret the entries as a Time Series. Also, because people are submitting the information, we can learn things about a group of people, too.

Back to our example: let’s say we work for the TSA and we’ve been tasked with getting some insights about when these accidents are most likely to happen, and make some recommendations for improving the service.

Pandas, luckily, is a one-stop shop for exploring and analyzing this data set. Feel free to download the Excel file into your project folder to get started, or run the curl command below. Yes, Pandas can read .xls or .xlsx files with a single call to pd.read_excel()! In fact, it’s often helpful for beginners experienced with .csv or Excel files to think about how they would solve a problem in Excel, and then experience how much easier it can be in Pandas.

So, without further ado, open your terminal, a text editor, or your favorite IDE, and take a look for yourself with the guidance below.

Example data:

Take, for example, some claims made against the TSA during the screening of persons or a passenger’s property due to an injury, loss, or damage. The claims data includes claim number, incident date, claim type, claim amount, status, and disposition.

Directory: TSA Claims Data

Our Data Download: claims-2014.xls

Setup

To start off, let’s create a clean directory. You can put this wherever you’d like, or create a project folder in an IDE. Use your install method of choice to get Pandas: Pip is probably the easiest.

$ mkdir -p ~/Desktop/pandas-tutorial/data && cd ~/Desktop/pandas-tutorial

Install pandas along with xlrd for loading Excel-formatted files, matplotlib for plotting graphs, and Numpy for high-level mathematical functions.

$ pip3 install matplotlib numpy pandas xlrd

Optional: download the example data with curl:

$ curl -O https://www.dhs.gov/sites/default/files/publications/claims-2014.xls

Launch Python:

$ python3
Python 3.7.1 (default, Nov  6 2018, 18:46:03)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

Import packages:

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> import pandas as pd

Loading Data

Loading data with Pandas is easy. Pandas can accurately read data from almost any common format including JSON, CSV, and SQL. Data is loaded into Pandas’ “flagship” data structure, the DataFrame.

That’s a term you’ll want to remember. You’ll be hearing a lot about DataFrames. If that term seems confusing — think about a table in a database, or a sheet in Excel. The main point is that there is more than one column: each row or entry has multiple fields which are consistent from one row to the next.
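
If it helps to see that in miniature, here is a tiny, hand-built DataFrame (the column names and values below are made up purely for illustration): every row has one value for each column, just like rows in a database table.

>>> pd.DataFrame({'Airport Code': ['HPN', 'MCO'], 'Close Amount': [0.0, 50.0]})
  Airport Code  Close Amount
0          HPN           0.0
1          MCO          50.0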

You can load the example data straight from the web:

>>> df = pd.read_excel(io='https://www.dhs.gov/sites/default/files/publications/claims-2014.xls', index_col='Claim Number')

Less coolly, data can be loaded from a file:

$ curl -O https://www.dhs.gov/sites/default/files/publications/claims-2014.xls

>>> df = pd.read_excel(io='claims-2014.xls', index_col='Claim Number')

Basic Operations

Print information about a DataFrame including the index dtype and column dtypes, non-null values, and memory usage. DataFrame.info() is one of the more useful and versatile methods attached to DataFrames (there are nearly 150!).

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8855 entries, 2013081805991 to 2015012220083
Data columns (total 10 columns):
Date Received    8855 non-null datetime64[ns]
Incident Date    8855 non-null datetime64[ns]
Airport Code     8855 non-null object
Airport Name     8855 non-null object
Airline Name     8855 non-null object
Claim Type       8855 non-null object
Claim Site       8855 non-null object
Item Category    8855 non-null object
Close Amount     8855 non-null object
Disposition      8855 non-null object
dtypes: datetime64[ns](2), object(8)
memory usage: 761.0+ KB

View the first n rows:

>>> df.head(n=3)  # see also df.tail()
    Claim Number Date Received       Incident Date Airport Code       ...              Claim Site                   Item Category Close Amount      Disposition
0  2013081805991    2014-01-13 2012-12-21 00:00:00          HPN       ...         Checked Baggage  Audio/Video; Jewelry & Watches            0             Deny
1  2014080215586    2014-07-17 2014-06-30 18:38:00          MCO       ...         Checked Baggage                               -            0             Deny
2  2014010710583    2014-01-07 2013-12-27 22:00:00          SJU       ...         Checked Baggage                    Food & Drink           50  Approve in Full
[3 rows x 11 columns]

List all the columns in the DataFrame:

>>> df.columns

Return a single column (important — also referred to as a Series):

>>> df['Claim Type'].head()
0    Personal Injury
1    Property Damage
2    Property Damage
3    Property Damage
4    Property Damage
Name: Claim Type, dtype: object

Hopefully, you’re starting to get an idea of what claims-2014.xls’s data is all about.

The Dtype

Data types are a fundamental concept that you’ll want to have a solid grasp of in order to avoid frustration later. Pandas adopts the nomenclature of Numpy, referring to a column’s data type as its dtype. Pandas also attempts to infer dtypes upon DataFrame construction (i.e. initialization).

To take advantage of the performance boosts intrinsic to Numpy, we need to become familiar with these types, and learn about how they roughly translate to native Python types.

Look again at df.info() and note the dtype assigned to each column of our DataFrame:

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8855 entries, 2013081805991 to 2015012220083
Data columns (total 10 columns):
Date Received    8855 non-null datetime64[ns]
Incident Date    8855 non-null datetime64[ns]
Airport Code     8855 non-null object
Airport Name     8855 non-null object
Airline Name     8855 non-null object
Claim Type       8855 non-null object
Claim Site       8855 non-null object
Item Category    8855 non-null object
Close Amount     8855 non-null object
Disposition      8855 non-null object
dtypes: datetime64[ns](2), object(8)
memory usage: 761.1+ KB

dtypes are analogous to the text/number format settings typical of most spreadsheet applications, and Pandas uses dtypes to determine which kind(s) of operations may be performed on the data in a specific column. For example, mathematical operations can only be performed on numeric data types such as int64 or float64. Columns containing valid dates and/or time values are assigned the datetime dtype, and text and/or binary data is assigned the catchall object dtype.

In short, Pandas attempts to infer dtypes upon DataFrame construction. However, like many data analysis applications, the process isn’t always perfect.

It’s important to note that Pandas dtype inference errs on the side of caution: if a Series appears to contain more than one type of data, it’s assigned a catch-all dtype of ‘object’. This behavior is less flexible than a typical spreadsheet application and is intended to ensure dtypes are not inferred incorrectly but also requires the analyst to ensure the data is “clean” after it’s loaded.
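
A quick illustration of that caution, using throwaway Series (the values are arbitrary):

>>> pd.Series([1, 2, 3]).dtype
dtype('int64')
>>> pd.Series([1, 2, '-']).dtype  # mixed types fall back to object
dtype('O')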

Cleansing and Transforming Data

Data is almost always dirty: it almost always contains some datum with atypical formatting; some artifact unique to its medium of origin. Therefore, cleansing data is crucial to ensuring analysis derived therefrom is sound. The work of cleansing with Pandas primarily involves identifying and re-casting incorrectly inferred dtypes.

>>> df.dtypes
Date Received    datetime64[ns]
Incident Date    datetime64[ns]
Airport Code             object
Airport Name             object
Airline Name             object
Claim Type               object
Claim Site               object
Item Category            object
Close Amount             object
Disposition              object
dtype: object

Looking again at our DataFrame’s dtypes, we can see that Pandas correctly inferred Date Received and Incident Date as datetime64 dtypes, so datetime attributes of those columns are accessible during operations. For example, to see which hours of the day certain types of incidents occur, we can group and summarize our data by the hour element of a datetime64 column.

>>> grp = df.groupby(by=df['Incident Date'].dt.hour)
>>> grp['Item Category'].describe()
              count unique                   top freq
Incident Date
0              3421    146  Baggage/Cases/Purses  489
1                 6      5                 Other    2
2                11      9                     -    2
3                 5      5     Jewelry & Watches    1
4                49     18  Baggage/Cases/Purses    6
5               257     39                     -   33
6               357     54                     -   43
7               343     43              Clothing   41
8               299     47                     -   35
9               305     41                     -   31
10              349     45                 Other   43
11              343     41                     -   45
12              363     51                 Other   41
13              359     55                     -   45
14              386     60  Baggage/Cases/Purses   49
15              376     51                 Other   41
16              351     43  Personal Electronics   35
17              307     52                 Other   34
18              289     43  Baggage/Cases/Purses   37
19              241     46  Baggage/Cases/Purses   26
20              163     31  Baggage/Cases/Purses   23
21              104     32  Baggage/Cases/Purses   20
22              106     33  Baggage/Cases/Purses   19
23               65     25  Baggage/Cases/Purses   14

This works out quite nicely. However, note that Close Amount was loaded as an ‘object’. Words like “Amount” are a good indicator that a column should contain numeric values.

Let’s take a look at the values in Close Amount.

>>> df['Close Amount'].head()
0     0
1     0
2    50
3     0
4     0
Name: Close Amount, dtype: object

Those look like numeric values to me, so let’s take a look at the other end:

>>> df['Close Amount'].tail()
8850      0
8851    800
8852      0
8853    256
8854      -
Name: Close Amount, dtype: object

There’s the culprit: index # 8854 is a string value.

If Pandas can’t objectively determine that all of the values contained in a DataFrame column are the same numeric or date/time dtype, it defaults to an object.

Luckily, I know from experience that Excel’s “Accounting” number format typically formats 0.00 as a dash, -.

So how do we fix this? Pandas provides a general method, DataFrame.apply, which can be used to apply any single-argument function to each value of one or more of its columns.

In this case, we’ll use it to convert the dash to the value it represents in Excel, 0.0, and to re-cast the column from its initial object dtype to its correct dtype, float64.

First, we’ll define a new function to perform the conversion:

>>> def dash_to_zero(x):
...     if '-' in str(x):
...         return float()  # 0.0
...     else:
...         return x  # otherwise, return the input value as-is

Then, we’ll apply the function to each value of Close Amount:

>>> df['Close Amount'] = df['Close Amount'].apply(dash_to_zero)
>>> df['Close Amount'].dtype
dtype('float64')

These two steps can also be combined into a single-line operation using Python’s lambda:

>>> df['Close Amount'] = df['Close Amount'].apply(lambda x: 0. if '-' in str(x) else x)
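
As an aside, another common approach (not part of the original walkthrough) is pd.to_numeric, which coerces anything unparseable to NaN so it can then be filled with zero:

>>> df['Close Amount'] = pd.to_numeric(df['Close Amount'], errors='coerce').fillna(0.0)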

Performing Basic Analysis

Once you’re confident that your dataset is “clean,” you’re ready for some data analysis! Aggregation is the process of getting summary data that may be more useful than the finely grained values we are given to start with.

Calculations

>>> df.sum()
Close Amount    538739.51
dtype: float64

>>> df.min()
Date Received              2014-01-01 00:00:00
Incident Date              2011-08-24 08:30:00
Airport Code                                 -
Airport Name      Albert J Ellis, Jacksonville
Airline Name                                 -
Claim Type                                   -
Claim Site                                   -
Item Category                                -
Close Amount                                 0
Disposition                                  -

>>> df.max()
Date Received                       2014-12-31 00:00:00
Incident Date                       2014-12-31 00:00:00
Airport Code                                        ZZZ
Airport Name                 Yuma International Airport
Airline Name                                 XL Airways
Claim Type                              Property Damage
Claim Site                                        Other
Item Category    Travel Accessories; Travel Accessories
Close Amount                                    25483.4
Disposition                                      Settle
dtype: object

Booleans

Find all of the rows where ‘Close Amount’ is greater than zero. This is helpful because we’d like to see some patterns where the amount is actually positive, and to show how conditional operators work.

>>> df[df['Close Amount'] > 0].describe()
       Close Amount
count   2360.000000
mean     228.279453
std      743.720179
min        1.250000
25%       44.470000
50%      100.000000
75%      240.942500
max    25483.440000

Grouping

In this example, we’ll walk through how to group by a single column’s values.

The Groupby object is an intermediate step that allows us to aggregate over rows which share something in common — in this case, the Disposition value. This is useful because we get a bird’s-eye view of different categories of data. Ultimately, we use describe() to see several aggregates at once.

>>> grp = df.groupby(by='Disposition')
>>> grp.describe()
                Close Amount
                       count        mean          std   min       25%      50%       75%       max
Disposition
-                     3737.0    0.000000     0.000000  0.00    0.0000    0.000    0.0000      0.00
Approve in Full       1668.0  158.812116   314.532028  1.25   32.9625   79.675  159.3375   6183.36
Deny                  2758.0    0.000000     0.000000  0.00    0.0000    0.000    0.0000      0.00
Settle                 692.0  395.723844  1268.818458  6.00  100.0000  225.000  425.6100  25483.44

Group by multiple columns:

>>> grp = df.groupby(by=['Disposition', 'Claim Site'])
>>> grp.describe()
                                Close Amount
                                       count         mean          std     min       25%       50%        75%       max
Disposition     Claim Site
-               -                       34.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
                Bus Station              2.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
                Checked Baggage       2759.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
                Checkpoint             903.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
                Motor Vehicle           28.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
                Other                   11.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
Approve in Full Checked Baggage       1162.0   113.868072   192.166683    1.25   25.6600    60.075   125.9825   2200.00
                Checkpoint             493.0   236.643367   404.707047    8.95   60.0000   124.000   250.1400   6183.36
                Motor Vehicle            9.0  1591.428889  1459.368190  493.80  630.0000   930.180  1755.9800   5158.05
                Other                    4.0   398.967500   358.710134   61.11  207.2775   317.385   509.0750    899.99
Deny            -                        4.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
                Checked Baggage       2333.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
                Checkpoint             407.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
                Motor Vehicle            1.0     0.000000          NaN    0.00    0.0000     0.000     0.0000      0.00
                Other                   13.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
Settle          Checked Baggage        432.0   286.271968   339.487254    7.25   77.0700   179.995   361.5700   2500.00
                Checkpoint             254.0   487.173031  1620.156849    6.00  166.9250   281.000   496.3925  25483.44
                Motor Vehicle            6.0  4404.910000  7680.169379  244.00  841.8125  1581.780  2215.5025  20000.00

Plotting

While aggregating groups of data is one of the best ways to get insights, visualizing data lets patterns jump out from the page, and is straightforward for those who aren’t as familiar with aggregate values. Properly formatted visualizations are critical to communicating meaning in the data, and it’s nice to see that Pandas has some of these functions out of the box:

>>> df.plot(x='Incident Date', y='Close Amount')
>>> plt.show()

Incident Date by Close Amount
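
If the raw series looks too noisy, one option (a sketch, not part of the original walkthrough) is to aggregate before plotting, for example the total Close Amount per month:

>>> monthly = df.set_index('Incident Date')['Close Amount'].resample('M').sum()
>>> monthly.plot()
>>> plt.show()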

Exporting Transformed Data

Finally, we may need to commit our original data, or the aggregates as a DataFrame, to a file format different from the one we started with; Pandas does not limit you to writing back out to the same format.

The most common flat file to write to from Pandas will be the .csv. From the visualization, it looks like the cost of TSA claims, while occasionally very high due to some outliers, is improving over the course of 2014. We should probably recommend comparing staffing and procedural changes to continue in that direction, and explore in more detail why we have more incidents at certain times of day.

Like loading data, Pandas offers a number of methods for writing your data to file in various formats. Writing back to an Excel file is slightly more involved than the others, so let’s write to an even more portable format: CSV. To write your transformed dataset to a new CSV file:

>>> df.to_csv(path_or_buf='claims-2014.v1.csv')

Final Notes

Here we’ve seen a workflow that is both interesting and powerful. We’ve taken a round-trip all the way from a government Excel file, into Python, through some fairly powerful data visualization, and back to a .csv file which could be more universally accessed — all through the power of Pandas. Further, we’ve covered the three central objects in Pandas — DataFrames, Series, and dtypes. Best of all, we have a deeper understanding of an interesting, real-world data set.

These are the core concepts to understand when working with Pandas, and now you can ask intelligent questions (of yourself, or of Google) about these different objects. This TSA data use case has shown us exactly what Pandas is good for: the exploration, analysis, and aggregation of data to draw conclusions.

The analysis and exploration of data is important in practically any field, but it is especially useful to Data Scientists and AI professionals who may need to crunch and clean data in very specific, finely-grained ways, like getting moving averages on stock ticks. Additionally, certain tasks may need to be automated, and this could prove difficult or expensive in sprawling applications like Excel, or Google Sheets, which may not offer all the functionality of Pandas with the full power of Python.

Just imagine telling a business administrator that they may never have to run that broken spreadsheet macro ever again! Once analysis is automated, it can be deployed as a service or applied to hundreds of thousands of records streaming from a database. Alternatively, Pandas could be used to make critical decisions after establishing statistical associations between patterns, as indeed it is every day.

Next, be sure to check out Python’s extensive database libraries (e.g. SQLAlchemy) and API clients (like the Google Sheets/Slides Python clients or the Airtable API) to put your results in front of domain experts. The possibilities are endless, and are only enhanced by Python’s mature libraries and active community.

Thanks for reading ❤

If you liked this post, share it with all of your programming buddies!

Python data manipulation from Pandas Library

The Pandas library is the most popular Python data manipulation library. It provides an easy way to manipulate data through its data-frame API, inspired by R’s data-frames.

Understanding The Pandas Library

One of the keys to getting a good understanding of Pandas is to understand that Pandas is mostly a wrapper around a series of other Python libraries, the main ones being Numpy, SQLAlchemy, Matplotlib and openpyxl.

The core internal model of the data-frame is a series of Numpy arrays, and Pandas functions such as the now deprecated “as_matrix” return results in that internal representation.

Pandas leverages other libraries to get data in and out of data-frames: SQLAlchemy, for instance, is used through the read_sql and to_sql functions, and openpyxl and XlsxWriter are used for the read_excel and to_excel functions.

Matplotlib and Seaborn are instead used to provide an easy interface to plot information available within a data frame, using commands such as df.plot().

Numpy’s Pandas — Efficient Pandas

One complaint that you often hear is that Python is slow or that it is difficult to handle large amounts of data. More often than not, this is due to poor efficiency of the code being written. It is true that native Python code tends to be slower than compiled code, but libraries like Pandas effectively provide an interface from Python code to compiled code. Knowing how to properly interface with them lets us get the best out of Pandas/Python.

APPLY & VECTORIZED OPERATIONS

Pandas, like its underlying library Numpy, performs vectorized operations more efficiently than loops. These efficiencies come from vectorized operations being performed through compiled C code rather than native Python code, and from their ability to operate on entire datasets at once.

The apply interface allows us to gain some of that efficiency by using CPython interfaces to do the looping:

df.apply(lambda x: x['col_a'] * x['col_b'], axis=1)

But most of the performance gain comes from using vectorized operations themselves, either directly in Pandas or by calling its internal Numpy arrays directly.

The difference in performance can be drastic: processing with a vectorized operation took about 3.53 ms, while looping with apply to do an addition took 27.8 s. Additional efficiencies can be obtained by directly invoking Numpy’s arrays and API.
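
For instance, a minimal comparison sketch (the column names and data here are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_a': np.random.rand(1_000_000),
                   'col_b': np.random.rand(1_000_000)})

# row-by-row looping through apply (slow)
looped = df.apply(lambda x: x['col_a'] + x['col_b'], axis=1)

# vectorized: operate on whole columns at once (fast)
vectorized = df['col_a'] + df['col_b']

# or drop down to the underlying Numpy arrays directly
raw_numpy = df['col_a'].values + df['col_b'].values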

Swifter: swifter is a Python library that makes it easy to vectorize different types of operations on dataframes; its API is fairly similar to that of the apply function.

EFFICIENT DATA STORING THROUGH DTYPES

When loading a data-frame into memory, be it through read_csv, read_excel or some other data-frame read function, Pandas performs type inference, which might prove to be inefficient. These APIs allow you to specify the type of each column explicitly, which allows for more efficient storage of data in memory.

df.astype({'testColumn': str, 'testCountCol': float})

Dtypes are native objects from Numpy, which allow you to define the exact type and number of bits used to store certain information.

Numpy’s dtype np.dtype('int32') would, for instance, represent a 32-bit integer. Pandas defaults integers to 64 bits; we could save half the space by using 32 bits:
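
A minimal sketch of the comparison (the column names are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2]})              # integers default to int64
df_32 = df.astype({'a': np.int32, 'b': np.int32})    # explicit 32-bit integers

print(df.memory_usage(index=False))
print(df_32.memory_usage(index=False))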

memory_usage() shows the number of bytes used by each of the columns; since there is only one entry (row) per column, the size of each int64 column is 8 bytes and each int32 column is 4 bytes.

Pandas also introduces the categorical dtype, which allows for efficient memory utilization of frequently occurring values. In one example, converting a posting_date field to a categorical value produced a 28x decrease in memory utilization for that field, and the overall size of the data-frame dropped by more than 3x just from changing that one data type.

Not only does using the right dtypes allow you to handle larger datasets in memory, it also makes some computations more efficient: in the same example, using the categorical type brought a 3x speed improvement for a groupby / sum operation.
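
A sketch of the categorical conversion (the column name echoes the posting_date example above; the data is made up):

import pandas as pd

df = pd.DataFrame({'posting_date': ['2019-01-01', '2019-01-02'] * 50_000})

print(df.memory_usage(deep=True))                     # stored as object (strings)
df['posting_date'] = df['posting_date'].astype('category')
print(df.memory_usage(deep=True))                     # far smaller as a categorical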

Within Pandas, you can define the dtypes during the data load (read_*) or as a type conversion (astype).

CyberPandas: CyberPandas is one of several library extensions that enable a richer variety of datatypes, supporting IPv4 and IPv6 data types and storing them efficiently.

HANDLING LARGE DATASETS WITH CHUNKS

Pandas allows for loading data into a data-frame by chunks; it is therefore possible to process data-frames as iterators and to handle data-frames larger than the available memory.

The combination of defining a chunksize when reading a data source and the get_chunk method allows Pandas to process data as an iterator; with a chunksize of 2, for instance, the data frame is read two rows at a time. These chunks can then be iterated through, as in the loop below:
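
First, a minimal sketch of creating such a chunked reader (the file name is illustrative, and do_something is a placeholder for whatever per-row processing you need):

import pandas as pd

def do_something(row):
    # placeholder per-row transformation
    return row

df_iter = pd.read_csv('claims-2014.v1.csv', chunksize=2)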

i = 0
for chunk in df_iter:
    # do some processing on each chunk
    i += 1
    new_chunk = chunk.apply(lambda x: do_something(x), axis=1)
    new_chunk.to_csv("chunk_output_%i.csv" % i)

Each chunk’s output can then be fed to a CSV file, pickled, exported to a database, and so on.

Setting up operations by chunks also allows certain operations to be performed through multi-processing.

Dask, for instance, is a framework built on top of Pandas with multi-processing and distributed processing in mind. It makes use of collections of chunks of Pandas data-frames, both in memory and on disk.

SQLAlchemy’s Pandas — Database Pandas

Pandas is also built on top of SQLAlchemy to interface with databases, and as such it can download datasets from diverse SQL databases as well as push records to them. Using the SQLAlchemy interface rather than the Pandas API directly allows us to do certain operations not natively supported within Pandas, such as transactions or upserts:

SQL TRANSACTIONS

Pandas can also make use of SQL transactions, handling commits and rollbacks. Pedro Capelastegui notably explained in one of his blog posts how Pandas can take advantage of transactions through a SQLAlchemy context manager.

with engine.begin() as conn:
  df.to_sql(
    tableName,
    con=conn,
    ...
  )

The advantage of using a SQL transaction is that the transaction will roll back should the data load fail.

SQL EXTENSIONS

PandaSQL

Pandas has a few SQL extensions, such as pandasql, a library that allows you to perform SQL queries on top of data-frames. Through pandasql, data-frame objects can be queried directly as if they were database tables.
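
A minimal sketch of how that looks, assuming pandasql is installed (the frame and query are illustrative):

import pandas as pd
from pandasql import sqldf

df = pd.DataFrame({'disposition': ['Deny', 'Settle', 'Settle'],
                   'close_amount': [0.0, 225.0, 100.0]})

result = sqldf("SELECT disposition, SUM(close_amount) AS total FROM df GROUP BY disposition", locals())
print(result)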

SQL UPSERTs

Pandas doesn’t natively support upsert exports to SQL on databases supporting this function, but patches to Pandas exist to allow this feature.

MatplotLib/Seaborn — Visual Pandas

Matplotlib and Seaborn visualizations are already integrated into some of the dataframe APIs, such as through the .plot command. There is fairly comprehensive documentation of how the interface works on Pandas’ website.

Extensions: Different extensions exist, such as Bokeh and plotly, to provide interactive visualization within Jupyter notebooks, while it is also possible to extend matplotlib to handle 3D graphs.

Other Extensions

Quite a few other extensions for Pandas exist to handle non-core functionalities. One of them is tqdm, which provides a progress bar for certain operations; another is PrettyPandas, which allows you to format dataframes and add summary information.

tqdm

tqdm is a progress bar extension in Python that interacts with Pandas; it allows users to see the progress of map and apply operations on Pandas dataframes when using the relevant functions (progress_map and progress_apply):
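
A minimal sketch, assuming tqdm is installed (the dataframe and function are illustrative):

import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # registers progress_apply / progress_map on Pandas objects

df = pd.DataFrame({'value': range(100_000)})
df['doubled'] = df['value'].progress_apply(lambda x: x * 2)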

PrettyPandas

PrettyPandas is a library that provides an easy way to format data-frames and to add table summaries to them.

Data Science with Python explained

An overview of using Python for data science including Numpy, Scipy, pandas, Scikit-Learn, XGBoost, TensorFlow and Keras.

So you’ve heard of data science and you’ve heard of Python.

You want to explore both but have no idea where to start — data science is pretty complicated, after all.

Don’t worry — Python is one of the easiest programming languages to learn. And thanks to the hard work of thousands of open source contributors, you can do data science, too.

If you look at the contents of this article, you may think there’s a lot to master, but this article has been designed to gently increase the difficulty as we go along.

One article obviously can’t teach you everything you need to know about data science with python, but once you’ve followed along you’ll know exactly where to look to take the next steps in your data science journey.

Table of contents:

  • Why Python?
  • Installing Python
  • Using Python for Data Science
  • Numeric computation in Python
  • Statistical analysis in Python
  • Data manipulation in Python
  • Working with databases in Python
  • Data engineering in Python
  • Big data engineering in Python
  • Further statistics in Python
  • Machine learning in Python
  • Deep learning in Python
  • Data science APIs in Python
  • Applications in Python
  • Summary

Why Python?

Python, as a language, has a lot of features that make it an excellent choice for data science projects.

It’s easy to learn, simple to install (in fact, if you use a Mac you probably already have it installed), and it has a lot of extensions that make it great for doing data science.

Just because Python is easy to learn doesn’t mean it’s a toy programming language — huge companies like Google use Python for their data science projects, too. They even contribute packages back to the community, so you can use the same tools in your projects!

You can use Python to do way more than just data science — you can write helpful scripts, build APIs, build websites, and much much more. Learning it for data science means you can easily pick up all these other things as well.

Things to note

There are a few important things to note about Python.

Right now, there are two versions of Python that are in common use. They are versions 2 and 3.

Most tutorials, and the rest of this article, will assume that you’re using the latest version of Python 3. It’s just good to be aware that sometimes you can come across books or articles that use Python 2.

The difference between the versions isn’t huge, but sometimes copying and pasting version 2 code when you’re running version 3 won’t work — you’ll have to do some light editing.

The second important thing to note is that Python really cares about whitespace (that’s spaces and return characters). If you put whitespace in the wrong place, your programme will very likely throw an error.

There are tools out there to help you avoid doing this, but with practice you’ll get the hang of it.

If you’ve come from programming in other languages, Python might feel like a bit of a relief: there’s no need to manage memory and the community is very supportive.

If Python is your first programming language you’ve made an excellent choice. I really hope you enjoy your time using it to build awesome things.

Installing Python

The best way to install Python for data science is to use the Anaconda distribution (you’ll notice a fair amount of snake-related words in the community).

It has everything you need to get started using Python for data science including a lot of the packages that we’ll be covering in the article.

If you click on Products -> Distribution and scroll down, you’ll see installers available for Mac, Windows and Linux.

Even if you have Python available on your Mac already, you should consider installing the Anaconda distribution as it makes installing other packages easier.

If you prefer to do things yourself, you can go to the official Python website and download an installer there.

Package Managers

Packages are pieces of Python code that aren’t a part of the language but are really helpful for doing certain tasks. We’ll be talking a lot about packages throughout this article so it’s important that we’re set up to use them.

Because the packages are just pieces of Python code, we could copy and paste the code and put it somewhere the Python interpreter (the thing that runs your code) can find it.

But that’s a hassle — it means that you’ll have to copy and paste stuff every time you start a new project or if the package gets updated.

To sidestep all of that, we’ll instead use a package manager.

If you chose to use the Anaconda distribution, congratulations — you already have a package manager installed. If you didn’t, I’d recommend installing pip.

No matter which one you choose, you’ll be able to use commands at the terminal (or command prompt) to install and update packages easily.

Using Python for Data Science

Now that you’ve got Python installed, you’re ready to start doing data science.

But how do you start?

Because Python caters to so many different requirements (web developers, data analysts, data scientists) there are lots of different ways to work with the language.

Python is an interpreted language which means that you don’t have to compile your code into an executable file, you can just pass text documents containing code to the interpreter!

Let’s take a quick look at the different ways you can interact with the Python interpreter.

In the terminal

If you open up the terminal (or command prompt) and type the word ‘python’, you’ll start a shell session. You can type any valid Python commands in there and they’d work just like you’d expect.

This can be a good way to quickly debug something but working in a terminal is difficult over the course of even a small project.

Using a text editor

If you write a series of Python commands in a text file and save it with a .py extension, you can navigate to the file using the terminal and, by typing python YOUR_FILE_NAME.py, can run the programme.

This is essentially the same as typing the commands one-by-one into the terminal, it’s just much easier to fix mistakes and change what your program does.

In an IDE

An IDE is a professional-grade piece of software that helps you manage software projects.

One of the benefits of an IDE is that you can use debugging features which tell you where you’ve made a mistake before you try to run your programme.

Some IDEs come with project templates (for specific tasks) that you can use to set your project out according to best practices.

Jupyter Notebooks

None of these ways are the best for doing data science with python — that particular honour belongs to Jupyter notebooks.

Jupyter notebooks give you the capability to run your code one ‘block’ at a time, meaning that you can see the output before you decide what to do next — that’s really crucial in data science projects where we often need to see charts before taking the next step.

If you’re using Anaconda, you’ll already have Jupyter lab installed. To start it you’ll just need to type ‘jupyter lab’ into the terminal.

If you’re using pip, you’ll have to install JupyterLab with the command ‘pip install jupyterlab’.

Numeric Computation in Python

It probably won’t surprise you to learn that data science is mostly about numbers.

The NumPy package includes lots of helpful functions for performing the kind of mathematical operations you’ll need to do data science work.

It comes installed as part of the Anaconda distribution, and installing it with pip is just as easy as installing Jupyter notebooks (‘pip install numpy’).

The most common mathematical operations we’ll need to do in data science are things like matrix multiplication, computing the dot product of vectors, changing the data types of arrays and creating the arrays in the first place!

Here’s how you can make a list into a NumPy array:
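
A minimal sketch (the values are arbitrary):

import numpy as np

my_list = [1, 2, 3, 4]
my_array = np.array(my_list)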

Here’s how you can do array multiplication and calculate dot products in NumPy:
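
Again a minimal sketch with made-up values:

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

elementwise = a * b         # array([ 4, 10, 18])
dot_product = np.dot(a, b)  # 32 (equivalently a.dot(b) or a @ b)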

And here’s how you can do matrix multiplication in NumPy:
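
And a small matrix multiplication sketch:

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

C = A @ B  # equivalently np.matmul(A, B)
# [[19 22]
#  [43 50]]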

Statistics in Python

With mathematics out of the way, we must move forward to statistics.

The Scipy package contains a module (a subsection of a package’s code) specifically for statistics.

You can import it (make its functions available in your programme) into your notebook using the command ‘from scipy import stats’.

This package contains everything you’ll need to calculate statistical measurements on your data, perform statistical tests, calculate correlations, summarise your data and investigate various probability distributions.

Here’s how to quickly access summary statistics (minimum, maximum, mean, variance, skew, and kurtosis) of an array using Scipy:
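
A minimal sketch using randomly generated data:

import numpy as np
from scipy import stats

data = np.random.normal(loc=0, scale=1, size=1000)
summary = stats.describe(data)

print(summary.minmax, summary.mean, summary.variance,
      summary.skewness, summary.kurtosis)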

Data Manipulation with Python

Data scientists have to spend an unfortunate amount of time cleaning and wrangling data. Luckily, the Pandas package helps us do this with code rather than by hand.

The most common tasks that I use Pandas for are reading data from CSV files and databases.

It also has a powerful syntax for combining different datasets together (datasets are called DataFrames in Pandas) and performing data manipulation.

You can see the first few rows of a DataFrame using the .head method:
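
A minimal sketch (the data here is made up):

import pandas as pd

df = pd.DataFrame({'name': ['Ana', 'Ben', 'Cal'], 'age': [28, 34, 41]})
print(df.head())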

You can select just one column using square brackets:
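
Continuing with the made-up df from above:

ages = df['age']  # a single column is a Series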

And you can create new columns by combining others:
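
And again with the same made-up df:

df['age_in_months'] = df['age'] * 12
df['name_and_age'] = df['name'] + ' (' + df['age'].astype(str) + ')'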

Working with Databases in Python

In order to use the pandas read_sql method, you’ll have to establish a connection to a database.

The most bulletproof method of connecting to a database is by using the SQLAlchemy package for Python.

Because SQL is a language of its own and connecting to a database depends on which database you’re using, I’ll leave you to read the documentation if you’re interested in learning more.

Data Engineering in Python

Sometimes we’d prefer to do some calculations on our data before they arrive in our projects as a Pandas DataFrame.

If you’re working with databases or scraping data from the web (and storing it somewhere), this process of moving data and transforming it is called ETL (Extract, transform, load).

You extract the data from one place, do some transformations to it (summarise the data by adding it up, finding the mean, changing data types, and so on) and then load it to a place where you can access it.

There’s a really cool tool called Airflow which is very good at helping you manage ETL workflows. Even better, it’s written in Python.

It was developed by Airbnb when they had to move incredible amounts of data around, you can find out more about it here.

Big Data Engineering in Python

Sometimes ETL processes can be really slow. If you have billions of rows of data (or if they’re a strange data type like text), you can recruit lots of different computers to work on the transformation separately and pull everything back together at the last second.

This architecture pattern is called MapReduce and it was made popular by Hadoop.

Nowadays, lots of people use Spark to do this kind of data transformation / retrieval work and there’s a Python interface to Spark called (surprise, surprise) PySpark.

Both the MapReduce architecture and Spark are very complex tools, so I’m not going to go into detail here. Just know that they exist and that if you find yourself dealing with a very slow ETL process, PySpark might help. Here’s a link to the official site.

Further Statistics in Python

We already know that we can run statistical tests, calculate descriptive statistics, p-values, and things like skew and kurtosis using the stats module from Scipy, but what else can Python do with statistics?

One particular package that I think you should know about is the lifelines package.

Using the lifelines package, you can calculate a variety of functions from a subfield of statistics called survival analysis.

Survival analysis has a lot of applications. I’ve used it to predict churn (when a customer will cancel a subscription) and when a retail store might be burglarised.

These are totally different to the applications the creators of the package imagined it would be used for (survival analysis is traditionally a medical statistics tool). But that just shows how many different ways there are to frame data science problems!

The documentation for the package is really good, check it out here.

Machine Learning in Python

Now this is a major topic — machine learning is taking the world by storm and is a crucial part of a data scientist’s work.

Simply put, machine learning is a set of techniques that allows a computer to map input data to output data. There are a few instances where this isn’t the case but they’re in the minority and it’s generally helpful to think of ML this way.

There are two really good machine learning packages for Python, let’s talk about them both.

Scikit-Learn

Most of the time you spend doing machine learning in Python will be spent using the Scikit-Learn package (sometimes abbreviated sklearn).

This package implements a whole heap of machine learning algorithms and exposes them all through a consistent syntax. This makes it really easy for data scientists to take full advantage of every algorithm.

The general framework for using Scikit-Learn goes something like this –

You split your dataset into train and test datasets:
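
A minimal sketch, using a synthetic dataset so it runs on its own:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)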

Then you instantiate and train a model:
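
For example, continuing from the split above (logistic regression is just one of many possible choices):

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)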

And then you use the metrics module to test how well your model works:
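
Continuing the same sketch:

from sklearn import metrics

predictions = model.predict(X_test)
print(metrics.accuracy_score(y_test, predictions))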

XGBoost

The second package that is commonly used for machine learning in Python is XGBoost.

Where Scikit-Learn implements a whole range of algorithms, XGBoost implements only a single one — gradient boosted decision trees.

This package (and algorithm) has become very popular recently due to its success at Kaggle competitions (online data science competitions that anyone can participate in).

Training the model works in much the same way as a Scikit-Learn algorithm.
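
A minimal sketch, reusing the train/test split from the Scikit-Learn example above:

from xgboost import XGBClassifier

xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)
print(xgb_model.score(X_test, y_test))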

Deep Learning in Python

The machine learning algorithms available in Scikit-Learn are sufficient for nearly any problem. That being said, sometimes you need to use the most advanced thing available.

Deep neural networks have skyrocketed in popularity due to the fact that systems using them have outperformed nearly every other class of algorithm.

There’s a problem though — it’s very hard to say what a neural net is doing and why it’s making the decisions that it is. Because of this, their use in finance, medicine, the law and related professions isn’t widely endorsed.

The two major classes of neural network are convolutional neural networks (which are used to classify images and complete a host of other tasks in computer vision) and recurrent neural nets (which are used to understand and generate text).

Exploring how neural nets work is outside the scope of this article, but just know that the packages you’ll need to look for if you want to do this kind of work are TensorFlow (a Google contribution!) and Keras.

Keras is essentially a wrapper for TensorFlow that makes it easier to work with.

Data Science APIs in Python

Once you’ve trained a model, you’d like to be able to access predictions from it in other software. The way you do this is by creating an API.

An API allows your model to receive data one row at a time from an external source and return a prediction.

Because Python is a general purpose programming language that can also be used to create web services, it’s easy to use Python to serve your model via API.

If you need to build an API you should look into pickle and Flask. Pickle allows you to save trained models to your hard drive so that you can use them later. And Flask is the simplest way to create web services.
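
A minimal sketch of that idea, assuming you already have a trained model saved with pickle (the file name and endpoint are illustrative):

import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
model = pickle.load(open('model.pkl', 'rb'))  # a model previously saved with pickle.dump

@app.route('/predict', methods=['POST'])
def predict():
    features = request.get_json()['features']   # one row of input data
    prediction = model.predict([features])[0]
    return jsonify({'prediction': float(prediction)})

if __name__ == '__main__':
    app.run()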

Web Applications in Python

Finally, if you’d like to build a full-featured web application around your data science project, you should use the Django framework.

Django is immensely popular in the web development community and was used to build the first version of Instagram and Pinterest (among many others).

Summary

And with that we’ve concluded our whirlwind tour of data science with Python.

We’ve covered everything you’d need to learn to become a full-fledged data scientist. If it still seems intimidating, you should know that nobody knows all of this stuff and that even the best of us still Google the basics from time to time.

Learn Data Science | How to Learn Data Science for Free

In this post, I have described a learning path and free online courses and tutorials that will enable you to learn data science for free.

The average cost of obtaining a master’s degree at traditional bricks-and-mortar institutions will set you back anywhere between $30,000 and $120,000. Even online data science degree programs don’t come cheap, costing a minimum of $9,000. So what do you do if you want to learn data science but can’t afford to pay this?

I trained into a career as a data scientist without taking any formal education in the subject. In this article, I am going to share with you my own personal curriculum for learning data science if you can’t or don’t want to pay thousands of dollars for more formal study.

The curriculum will consist of 3 main parts, technical skills, theory and practical experience. I will include links to free resources for every element of the learning path and will also be including some links to additional ‘low cost’ options. So if you want to spend a little money to accelerate your learning you can add these resources to the curriculum. I will include the estimated costs for each of these.

Technical skills

The first part of the curriculum will focus on technical skills. I recommend learning these first so that you can take a practical-first approach rather than, say, learning the mathematical theory first. Python is by far the most widely used programming language for data science. In the Kaggle Machine Learning and Data Science survey carried out in 2018, 83% of respondents said that they used Python on a daily basis. I would, therefore, recommend focusing on this language but also spending a little time on other languages such as R.

Python Fundamentals

Before you can start to use Python for data science you need a basic grasp of the fundamentals behind the language. So you will want to take a Python introductory course. There are lots of free ones out there, but I like the Codecademy ones best as they include hands-on in-browser coding throughout.

I would suggest taking the introductory course to learn Python. This covers basic syntax, functions, control flow, loops, modules and classes.

Data analysis with python

Next, you will want to get a good understanding of using Python for data analysis. There are a number of good resources for this.

To start with I suggest taking at least the free parts of the data analyst learning path on dataquest.io. Dataquest offers complete learning paths for data analyst, data scientist and data engineer. Quite a lot of the content, particularly on the data analyst path is available for free. If you do have some money to put towards learning then I strongly suggest putting it towards paying for a few months of the premium subscription. I took this course and it provided a fantastic grounding in the fundamentals of data science. It took me 6 months to complete the data scientist path. The price varies from $24.50 to $49 per month depending on whether you pay annually or not. It is better value to purchase the annual subscription if you can afford it.

The Dataquest platform

Python for machine learning

If you have chosen to pay for the full data science course on Dataquest then you will have a good grasp of the fundamentals of machine learning with Python. If not, there are plenty of other free resources. To start with, I would focus on scikit-learn, which is by far the most commonly used Python library for machine learning.

When I was learning, I was lucky enough to attend a two-day workshop run by Andreas Mueller, one of the core developers of scikit-learn. He has, however, published all the material from this course, and others, on this Github repo. These consist of slides, course notes and notebooks that you can work through. I would definitely recommend working through this material.

Then I would suggest taking some of the tutorials in the scikit-learn documentation. After that, I would suggest building some practical machine learning applications and learning the theory behind how the models work — which I will cover a bit later on.

SQL

SQL is a vital skill to learn if you want to become a data scientist as one of the fundamental processes in data modelling is extracting data in the first place. This will more often than not involve running SQL queries against a database. Again if you haven’t opted to take the full Dataquest course then here are a few free resources to learn this skill.

Codecademy has a free Introduction to SQL course. Again this is very practical, with in-browser coding all the way through. If you also want to learn about cloud-based database querying then Google Cloud BigQuery is very accessible. There is a free tier so you can try queries for free, an extensive range of public datasets to try, and very good documentation.

Codecademy SQL course

R

To be a well-rounded data scientist it is a good idea to diversify a little from just Python. I would, therefore, suggest also taking an introductory course in R. Codecademy have an introductory course on their free plan. It is probably worth noting here that, similar to Dataquest, Codecademy also offers a complete data science learning plan as part of their pro account (this costs from $15.99 to $31.99 per month depending on how many months you pay for up front). I personally found the Dataquest course to be much more comprehensive, but this may work out a little cheaper if you are looking to follow a learning path on a single platform.

Software engineering

It is a good idea to get a grasp of software engineering skills and best practices. This will help your code to be more readable and extensible both for yourself and others. Additionally, when you start to put models into production you will need to be able to write good quality well-tested code and work with tools like version control.

There are two great free resources for this. Python like you mean it covers things like the PEP8 style guide, documentation and also covers object-oriented programming really well.

The scikit-learn contribution guidelines, although written to facilitate contributions to the library, actually cover the best practices really well. This covers topics such as Github, unit testing and debugging and is all written in the context of a data science application.

Deep learning

For a comprehensive introduction to deep learning, I don’t think that you can get any better than the totally free and totally ad-free fast.ai. This course includes an introduction to machine learning, practical deep learning, computational linear algebra and a code-first introduction to natural language processing. All their courses have a practical first approach and I highly recommend them.

Fast.ai platform

Theory

Whilst you are learning the technical elements of the curriculum you will encounter some of the theory behind the code you are implementing. I recommend that you learn the theoretical elements alongside the practical. The way that I do this is to first learn the code needed to implement a technique (let’s take KMeans as an example), and once I have something working, to look deeper into concepts such as inertia. Again, the scikit-learn documentation contains all the mathematical concepts behind the algorithms.

In this section, I will introduce the key foundational elements of theory that you should learn alongside the more practical elements.

Khan Academy covers almost all the concepts listed below for free. You can tailor the subjects you would like to study when you sign up, and you then have a nice tailored curriculum for this part of the learning path. Checking all of the relevant boxes will give you an overview of most of the elements listed below.

Maths

Calculus

Calculus is defined by Wikipedia as “the mathematical study of continuous change.” In other words calculus can find patterns between functions, for example, in the case of derivatives, it can help you to understand how a function changes over time.

Many machine learning algorithms utilise calculus to optimise the performance of models. If you have studied even a little machine learning you will probably have heard of gradient descent. It works by iteratively adjusting the parameter values of a model to find the values that minimise the cost function. Gradient descent is a good example of how calculus is used in machine learning.
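
As a tiny illustration of the idea, here is gradient descent minimising a simple one-parameter cost function rather than a real model:

# minimise the cost function f(w) = (w - 3) ** 2, whose minimum is at w = 3
def gradient(w):
    return 2 * (w - 3)  # derivative of (w - 3) ** 2

w = 0.0                 # starting guess
learning_rate = 0.1
for _ in range(100):
    w = w - learning_rate * gradient(w)

print(w)                # approaches 3.0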

What you need to know:

Derivatives

  • Geometric definition
  • Calculating the derivative of a function
  • Nonlinear functions

Chain rule

  • Composite functions
  • Composite function derivatives
  • Multiple functions

Gradients

  • Partial derivatives
  • Directional derivatives
  • Integrals

Linear Algebra

Many popular machine learning methods, including XGBoost, use matrices to store inputs and process data. Matrices, alongside vector spaces and linear equations, form the mathematical branch known as Linear Algebra. In order to understand how many machine learning methods work, it is essential to get a good understanding of this field.

What you need to learn:

Vectors and spaces

  • Vectors
  • Linear combinations
  • Linear dependence and independence
  • Vector dot and cross products

Matrix transformations

  • Functions and linear transformations
  • Matrix multiplication
  • Inverse functions
  • Transpose of a matrix

Statistics

Here is a list of the key concepts you need to know:

Descriptive/Summary statistics

  • How to summarise a sample of data
  • Different types of distributions
  • Skewness, kurtosis, central tendency (e.g. mean, median, mode)
  • Measures of dependence, and relationships between variables such as correlation and covariance

Experiment design

  • Hypothesis testing
  • Sampling
  • Significance tests
  • Randomness
  • Probability
  • Confidence intervals and two-sample inference

Machine learning

  • Inference about slope
  • Linear and non-linear regression
  • Classification

Practical experience

The third section of the curriculum is all about practice. In order to truly master the concepts above you will need to use the skills in some projects that ideally closely resemble a real-world application. By doing this you will encounter problems to work through such as missing and erroneous data and develop a deep level of expertise in the subject. In this last section, I will list some good places you can get this practical experience from for free.

“With deliberate practice, however, the goal is not just to reach your potential but to build it, to make things possible that were not possible before. This requires challenging homeostasis — getting out of your comfort zone — and forcing your brain or your body to adapt.”, Anders Ericsson, Peak: Secrets from the New Science of Expertise

Kaggle, et al

Machine learning competitions are a good place to get practice with building machine learning models. They give access to a wide range of data sets, each with a specific problem to solve, and have a leaderboard. The leaderboard is a good way to benchmark how good you actually are at developing models and where you may need to improve further.

In addition to Kaggle, there are other platforms for machine learning competitions including Analytics Vidhya and DrivenData.

Driven data competitions page

UCI Machine Learning Repository

The UCI machine learning repository is a large source of publicly available data sets. You can use these data sets to put together your own data projects; these could include data analysis and machine learning models, and you could even try building a deployed model with a web front end. It is a good idea to store your projects somewhere public, such as Github, as this can create a portfolio showcasing your skills to use for future job applications.


UCI repository

Contributions to open source

One other option to consider is contributing to open source projects. There are many Python libraries that rely on the community to maintain them and there are often hackathons held at meetups and conferences where even beginners can join in. Attending one of these events would certainly give you some practical experience and an environment where you can learn from others whilst giving something back at the same time. Numfocus is a good example of a project like this.

In this post, I have described a learning path and free online courses and tutorials that will enable you to learn data science for free. Showcasing what you are able to do in the form of a portfolio is a great tool for future job applications in lieu of formal qualifications and certificates. I really believe that education should be accessible to everyone and, certainly, for data science at least, the internet provides that opportunity. In addition to the resources listed here, I have previously published a recommended reading list for learning data science available here. These are also all freely available online and are a great way to complement the more practical resources covered above.

Thanks for reading!