
# How to Create Scatter Plot Correlation Matrix Visualization using Python Pandas DataFrame

A beginner's Python pandas tutorial on how to create a scatter plot correlation matrix visualization to understand the correlation among the columns (variables) of a pandas DataFrame.
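As a rough illustration of the idea, the sketch below draws a scatter plot matrix with `pandas.plotting.scatter_matrix` and prints the matching correlation table. The column names and the randomly generated values are assumptions made up for this example, not data from the tutorial.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical data: three numeric columns, two of them correlated by construction.
rng = np.random.default_rng(0)
age = rng.integers(20, 60, size=200)
salary = age * 1000 + rng.normal(0, 5000, size=200)
balance = rng.normal(1500, 400, size=200)
df = pd.DataFrame({"age": age, "salary": salary, "balance": balance})

# One scatter plot per pair of columns, with histograms on the diagonal.
axes = scatter_matrix(df, figsize=(6, 6), diagonal="hist")

# The numeric counterpart of the picture: pairwise Pearson correlations.
corr = df.corr()
print(corr.round(2))
```

In a notebook, the figure would normally be displayed inline instead of using the `Agg` backend.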


## Exploratory Data Analysis Tutorial | Basics of EDA with Python

Exploratory data analysis is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task and provides a better understanding of data set variables and the relationships between them. It can also help determine if the statistical techniques you are considering for data analysis are appropriate or not.

🔹 Topics Covered:
00:00:00 Basics of EDA with Python
01:40:10 Multiple Variate Analysis
02:30:26 Outlier Detection
03:44:48 Cricket World Cup Analysis using Exploratory Data Analysis

## What is Exploratory Data Analysis(EDA)?

If we want to explain EDA in simple terms, it means trying to understand the given data much better, so that we can make some sense out of it.

We can find a more formal definition in Wikipedia.

In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

EDA in Python uses data visualization to draw meaningful patterns and insights. It also involves the preparation of data sets for analysis by removing irregularities in the data.

Based on the results of EDA, companies also make business decisions, which can have repercussions later.

• If EDA is not done properly then it can hamper the further steps in the machine learning model building process.
• If done well, it may improve the efficacy of everything we do next.

EDA is carried out here in five steps:

1. Data Sourcing
2. Data Cleaning
3. Univariate analysis
4. Bivariate analysis
5. Multivariate analysis

## 1. Data Sourcing

Data Sourcing is the process of finding and loading the data into our system. Broadly there are two ways in which we can find data.

1. Private Data
2. Public Data

Private Data

As the name suggests, private data is provided by private organizations. There are some security and privacy concerns attached to it. This type of data is mainly used for an organization's internal analysis.

Public Data

This type of data is available to everyone. We can find it on government websites, public organizations' sites, etc. Anyone can access this data; we do not need any special permissions or approval.

We can get public data on the following sites.

The very first step of EDA is Data Sourcing, and we have seen how we can access data and load it into our system. The next step is cleaning the data.

## 2. Data Cleaning

After completing the Data Sourcing, the next step in the process of EDA is Data Cleaning. It is very important to get rid of the irregularities and clean the data after sourcing it into our system.

Irregularities in data come in different types.

• Missing Values
• Incorrect Format
• Anomalies/Outliers

To perform the data cleaning we are using a sample data set, which can be found here.

We are using Jupyter Notebook for analysis.

First, let’s import the necessary libraries and store the data in our system for analysis.

#import the useful libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Read the data set of "Marketing Analysis" in data.

# Printing the data
data

Now, the data set looks like this,

If we observe the above dataset, there are some discrepancies in the Column header for the first 2 rows. The correct data is from the index number 1. So, we have to fix the first two rows.

This is called Fixing the Rows and Columns. Let’s ignore the first two rows and load the data again.

#import the useful libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Read the file in data without first two rows as it is of no use.

#print the head of the data frame.

Now, the dataset looks like this, and it makes more sense.

Dataset after fixing the rows and columns
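The loading call itself is not shown above, so the following is only a sketch of one way to skip leading junk rows, using `pd.read_csv` with `skiprows`. The in-memory CSV text and its column values are hypothetical stand-ins for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a file whose first two lines are not part of the table.
raw = io.StringIO(
    "Marketing export,,\n"
    "generated for analysis,,\n"
    "customerid,age,salary\n"
    "1,58,100000\n"
    "2,44,60000\n"
)

# skiprows=2 drops the two junk lines, so the third line becomes the header.
data = pd.read_csv(raw, skiprows=2)
print(data.head())
```

With a real file, the `io.StringIO` object would be replaced by the file path.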

Following are the steps to be taken while Fixing Rows and Columns:

1. Delete Summary Rows and Columns in the Dataset.
2. Delete Header and Footer Rows on every page.
3. Delete Extra Rows like blank rows, page numbers, etc.
4. We can merge different columns if it makes for better understanding of the data.
5. Similarly, we can also split one column into multiple columns based on our requirements or understanding.
6. Add column names; it is very important for the dataset to have column names.

Now if we observe the above dataset, the customerid column is of no importance to our analysis, and the jobedu column holds both job and education information.

So, we’ll drop the customerid column, split the jobedu column into two new columns, job and education, and then drop the jobedu column as well.

# Drop the customer id as it is of no use.
data.drop('customerid', axis = 1, inplace = True)

# Extract job and education into new columns from the "jobedu" column.
data['job']= data["jobedu"].apply(lambda x: x.split(",")[0])
data['education']= data["jobedu"].apply(lambda x: x.split(",")[1])

# Drop the "jobedu" column from the dataframe.
data.drop('jobedu', axis = 1, inplace = True)

# Printing the Dataset
data

Now, the dataset looks like this,

Dropping Customerid and jobedu columns and adding job and education columns

Missing Values

If there are missing values in the dataset, we need to handle them before doing any statistical analysis.

There are mainly three types of missing values.

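1. MCAR (Missing completely at random): The fact that a value is missing does not depend on any feature, observed or not.
2. MAR (Missing at random): The missingness may depend on other observed features, but not on the missing value itself.
3. MNAR (Missing not at random): The missingness depends on the missing value itself or on something unobserved, so there is a specific reason why the value is missing.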

Let’s see which columns have missing values in the dataset.

# Checking the missing values
data.isnull().sum()

The output will be,

As we can see three columns contain missing values. Let’s see how to handle the missing values. We can handle missing values by dropping the missing records or by imputing the values.

Drop the missing Values

Let’s handle missing values in the age column.

# Dropping the records with age missing in data dataframe.
data = data[~data.age.isnull()].copy()

# Checking the missing values in the dataset.
data.isnull().sum()

Let’s check the missing values in the dataset now.

Let’s impute values to the missing values for the month column.

Since the month column is of an object type, let’s calculate the mode of that column and impute those values to the missing values.

# Find the mode of month in data
month_mode = data.month.mode()[0]

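# Fill the missing values with the mode value of month in data.
# Assigning the result back avoids a chained in-place fillna, which newer pandas versions warn about.
data['month'] = data['month'].fillna(month_mode)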

# Let's see the null values in the month column.
data.month.isnull().sum()

Now output is,

# Mode of month is
'may, 2017'
# Null values in month column after imputing with mode
0

Next, let’s handle the missing values in the Response column. Since Response is our target column, imputing values into it would affect our analysis. So, it is better to drop the records with a missing Response.

#drop the records with response missing in data.
data = data[~data.response.isnull()].copy()
# Calculate the missing values in each column of data frame
data.isnull().sum()

Let’s check whether the missing values in the dataset have been handled or not,

All the missing values have been handled

We can also leave the missing values as NaN; most pandas statistics (mean, sum, describe, and so on) skip NaN by default, so the missing entries do not distort the result.

Handling Outliers

We have seen how to fix missing values, now let’s see how to handle outliers in the dataset.

Outliers are the values that are far beyond the next nearest data points.

There are two types of outliers:

1. Univariate outliers: Univariate outliers are the data points whose values lie beyond the range of expected values based on one variable.
2. Multivariate outliers: While plotting data, some values of one variable may not lie beyond the expected range, but when you plot the data with some other variable, these values may lie far from the expected value.

So, after understanding the causes of these outliers, we can handle them by dropping those records or imputing with the values or leaving them as is, if it makes more sense.
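The text does not name a detection rule, so the sketch below swaps in one common choice, the 1.5 × IQR rule (Tukey's fences), to flag univariate outliers before deciding whether to drop or keep them. The balance values are hypothetical.

```python
import pandas as pd

# Hypothetical balances; the last value is far from the rest.
balance = pd.Series([1200, 1350, 1280, 1420, 1310, 1390, 25000], name="balance")

q1 = balance.quantile(0.25)
q3 = balance.quantile(0.75)
iqr = q3 - q1

# Tukey's fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (balance < lower) | (balance > upper)

outliers = balance[is_outlier]   # records to inspect
cleaned = balance[~is_outlier]   # records kept if we decide to drop the outliers
print(outliers)
```

Whether a flagged value is dropped, imputed, or kept still depends on understanding why it is extreme.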

Standardizing Values

To perform data analysis on a set of values, we have to make sure that all values in the same column are on the same scale. For example, if the data contains the top speeds of different companies’ cars, then the whole column should be either in meters/sec or in miles/sec.
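A minimal sketch of such a standardization, assuming a hypothetical table in which each row records its own unit in a `unit` column; every speed is converted to metres per second.

```python
import pandas as pd

# Hypothetical top speeds recorded in mixed units.
cars = pd.DataFrame({
    "model": ["A", "B", "C"],
    "top_speed": [200.0, 120.0, 60.0],
    "unit": ["km/h", "mph", "m/s"],
})

# Conversion factors from each unit to metres per second.
TO_MPS = {"km/h": 1000 / 3600, "mph": 1609.344 / 3600, "m/s": 1.0}

# Look up the factor for each row's unit and apply it.
cars["top_speed_mps"] = cars["top_speed"] * cars["unit"].map(TO_MPS)
print(cars[["model", "top_speed_mps"]].round(2))
```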

Now, that we are clear on how to source and clean the data, let’s see how we can analyze the data.

## 3. Univariate Analysis

If we analyze data over a single variable/column from a dataset, it is known as Univariate Analysis.

Categorical Unordered Univariate Analysis:

An unordered variable is a categorical variable that has no defined order. If we take our data as an example, the job column in the dataset is divided into many sub-categories like technician, blue-collar, services, management, etc. There is no weight or measure given to any value in the ‘job’ column.

Now, let’s analyze the job category by using plots. Since Job is a category, we will plot the bar plot.

# Let's calculate the percentage of each job status category.
data.job.value_counts(normalize=True)

#plot the bar graph of percentage job categories
data.job.value_counts(normalize=True).plot.barh()
plt.show()

The output looks like this,

From the above bar plot, we can infer that the data set contains more blue-collar workers than any other job category.

Categorical Ordered Univariate Analysis:

Ordered variables are those variables that have a natural rank of order. Some examples of categorical ordered variables from our dataset are:

• Month: Jan, Feb, March……
• Education: Primary, Secondary,……

Now, let’s analyze the Education variable from the dataset. Since we’ve already seen a bar plot, let’s see what a pie chart looks like.

#calculate the percentage of each education category.
data.education.value_counts(normalize=True)

#plot the pie chart of education categories
data.education.value_counts(normalize=True).plot.pie()
plt.show()

The output will be,

From the above analysis, we can infer that most people in the data set have a secondary education, followed by tertiary and then primary. A very small percentage have an unknown education level.

This is how we perform univariate analysis on categorical variables. If the column or variable is numerical, we analyze it by calculating its mean, median, standard deviation, etc. We can get those values by using the describe function.

data.salary.describe()

The output will be,

## 4. Bivariate Analysis

If we analyze data by taking two variables/columns into consideration from a dataset, it is known as Bivariate Analysis.

a) Numeric-Numeric Analysis:

Analyzing the two numeric variables from a dataset is known as numeric-numeric analysis. We can analyze it in three different ways.

• Scatter Plot
• Pair Plot
• Correlation Matrix

Scatter Plot

Let’s take three columns, ‘Balance’, ‘Age’, and ‘Salary’, from our dataset and see what we can infer from scatter plots of salary vs. balance and age vs. balance.

#plot the scatter plot of balance and salary variable in data
plt.scatter(data.salary,data.balance)
plt.show()

#plot the scatter plot of balance and age variable in data
data.plot.scatter(x="age",y="balance")
plt.show()

Now, the scatter plots look like this,

Pair Plot

Now, let’s plot Pair Plots for the three columns we used in plotting Scatter plots. We’ll use the seaborn library for plotting Pair Plots.

#plot the pair plot of salary, balance and age in data dataframe.
sns.pairplot(data = data, vars=['salary','balance','age'])
plt.show()

The Pair Plot looks like this,

Correlation Matrix

Since we cannot use more than two variables as x-axis and y-axis in Scatter and Pair Plots, it is difficult to see the relation between three numerical variables in a single graph. In those cases, we’ll use the correlation matrix.

# Creating a matrix using age, salary, balance as rows and columns
data[['age','salary','balance']].corr()

#plot the correlation matrix of salary, balance and age in data dataframe.
sns.heatmap(data[['age','salary','balance']].corr(), annot=True, cmap = 'Reds')
plt.show()

First, we created a matrix using age, salary, and balance. After that, we are plotting the heatmap using the seaborn library of the matrix.

b) Numeric - Categorical Analysis

Analyzing the one numeric variable and one categorical variable from a dataset is known as numeric-categorical analysis. We analyze them mainly using mean, median, and box plots.

Let’s take salary and response columns from our dataset.

First check for mean value using groupby

#groupby the response to find the mean of the salary with response no & yes separately.
data.groupby('response')['salary'].mean()

The output will be,

There is not much of a difference between the yes and no response based on the salary.

Let’s calculate the median,

#groupby the response to find the median of the salary with response no & yes separately.
data.groupby('response')['salary'].median()

The output will be,

Going by both mean and median, the response appears to be the same irrespective of the person’s salary. But is that really the case? Let’s plot the box plot for them and check the behavior.

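#plot the box plot of salary for yes & no responses.
#keyword arguments are used because newer seaborn versions no longer accept x and y positionally.
sns.boxplot(x='response', y='salary', data=data)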
plt.show()

The box plot looks like this,

As we can see, when we plot the Box Plot, it paints a very different picture compared to mean and median. The IQR for customers who gave a positive response is on the higher salary side.

This is how we analyze Numeric-Categorical variables, we use mean, median, and Box Plots to draw some sort of conclusions.
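The quantities a box plot draws can also be computed directly. The sketch below, on hypothetical salary figures with the same column names as above, computes the quartiles and the IQR per response group with `groupby(...).quantile(...)`.

```python
import pandas as pd

# Hypothetical data with the same column names as the tutorial.
df = pd.DataFrame({
    "response": ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    "salary":   [20000, 50000, 60000, 100000, 50000, 60000, 70000, 100000],
})

# Quartiles per group: the box edges and the middle line of a box plot.
summary = df.groupby("response")["salary"].quantile([0.25, 0.5, 0.75]).unstack()
summary["iqr"] = summary[0.75] - summary[0.25]
print(summary)
```

Comparing the quartile columns between groups shows where each box sits, which a single mean or median cannot.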

c) Categorical — Categorical Analysis

Since our target variable/column is the Response rate, we’ll see how the different categories like Education, Marital Status, etc., are associated with the Response column. Instead of ‘Yes’ and ‘No’, we will convert them into ‘1’ and ‘0’; the mean of that column then gives us the “Response Rate”.

#create response_rate of numerical data type where response "yes"= 1, "no"= 0
data['response_rate'] = np.where(data.response=='yes',1,0)
data.response_rate.value_counts()

The output looks like this,

Let’s see how the response rate varies for different categories in marital status.

#plot the bar graph of marital status with average value of response_rate
data.groupby('marital')['response_rate'].mean().plot.bar()
plt.show()

The graph looks like this,

By the above graph, we can infer that the positive response is more for Single status members in the data set. Similarly, we can plot the graphs for Loan vs Response rate, Housing Loans vs Response rate, etc.
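An alternative way to get the same per-category rates without creating a numeric column is a row-normalised cross-tabulation. This is only a sketch on hypothetical rows; `pd.crosstab` with `normalize="index"` is the assumption swapped in here, not the method used above.

```python
import pandas as pd

# Hypothetical rows with the same column names as the tutorial.
df = pd.DataFrame({
    "marital":  ["single", "single", "married", "married", "married", "divorced"],
    "response": ["yes", "no", "no", "no", "yes", "no"],
})

# Row-normalised counts: each row sums to 1, so the "yes" column is the response rate.
rates = pd.crosstab(df["marital"], df["response"], normalize="index")
print(rates)
```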

## 5. Multivariate Analysis

If we analyze data by taking more than two variables/columns into consideration from a dataset, it is known as Multivariate Analysis.

Let’s see how ‘Education’, ‘Marital’, and ‘Response_rate’ vary with each other.

First, we’ll create a pivot table with the three columns and after that, we’ll create a heatmap.

result = pd.pivot_table(data=data, index='education', columns='marital',values='response_rate')
print(result)

#create heat map of education vs marital vs response_rate
sns.heatmap(result, annot=True, cmap = 'RdYlGn', center=0.117)
plt.show()

The Pivot table and heatmap look like this,

Based on the heatmap, we can infer that married people with primary education are less likely to respond positively to the survey, and single people with tertiary education are most likely to respond positively.

Similarly, we can plot the graphs for Job vs marital vs response, Education vs poutcome vs response, etc.
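When reading such a heatmap, it may help to check how many records sit behind each cell, since a mean over very few rows is unreliable. The sketch below builds both the mean and the count pivot tables on hypothetical rows; the count table is an addition for illustration, not part of the analysis above.

```python
import pandas as pd

# Hypothetical rows with the same column names as the tutorial.
df = pd.DataFrame({
    "education": ["primary", "primary", "tertiary", "tertiary", "tertiary", "primary"],
    "marital":   ["married", "married", "single", "single", "married", "single"],
    "response_rate": [0, 0, 1, 0, 1, 1],
})

# Mean response rate per (education, marital) cell, as plotted in the heatmap.
rate = pd.pivot_table(df, index="education", columns="marital",
                      values="response_rate", aggfunc="mean")

# Number of rows behind each cell; small counts make the mean unreliable.
count = pd.pivot_table(df, index="education", columns="marital",
                       values="response_rate", aggfunc="count")
print(rate)
print(count)
```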

Conclusion

This is how we’ll do Exploratory Data Analysis. Exploratory Data Analysis (EDA) helps us to look beyond the data. The more we explore the data, the more the insights we draw from it. As a data analyst, almost 80% of our time will be spent understanding data and solving various business problems through EDA.

Thank you for reading and Happy Coding!!!


## Installation

Install via pip:

$ pip install pytumblr

Install from source:

$ git clone https://github.com/tumblr/pytumblr.git
$ cd pytumblr
$ python setup.py install

## Usage

### Create a client

A pytumblr.TumblrRestClient is the object you'll make all of your calls to the Tumblr API through. Creating one is this easy:

import pytumblr

client = pytumblr.TumblrRestClient(
'<consumer_key>',
'<consumer_secret>',
'<oauth_token>',
'<oauth_secret>',
)

client.info() # Grabs the current user information

A few easy ways to get your credentials are:

1. The built-in interactive_console.py tool (if you already have a consumer key & secret)
2. The Tumblr API console at https://api.tumblr.com/console
3. Get sample login code at https://api.tumblr.com/console/calls/user/info

### Supported Methods

#### User Methods

client.info() # get information about the authenticating user
client.dashboard() # get the dashboard for the authenticating user
client.likes() # get the likes for the authenticating user
client.following() # get the blogs followed by the authenticating user

client.like(id, reblogkey) # like a post
client.unlike(id, reblogkey) # unlike a post

#### Blog Methods

client.blog_info(blogName) # get information about a blog
client.posts(blogName, **params) # get posts for a blog
client.avatar(blogName) # get the avatar for a blog
client.blog_likes(blogName) # get the likes on a blog
client.followers(blogName) # get the followers of a blog
client.blog_following(blogName) # get the publicly exposed blogs that [blogName] follows
client.queue(blogName) # get the queue for a given blog
client.submission(blogName) # get the submissions for a given blog

#### Post Methods

Creating posts

PyTumblr lets you create all of the various post types that Tumblr supports. A few default options can be used with any post type.

The default supported options are described below.

• state - a string, the state of the post. Supported states are published, draft, queue, private
• tags - a list, a list of strings that you want tagged on the post. eg: ["testing", "magic", "1"]
• tweet - a string, the string of the customized tweet you want. eg: "Man I love my mega awesome post!"
• date - a string, the customized GMT that you want
• format - a string, the format that your post is in. Supported types are html or markdown
• slug - a string, the slug for the url of the post you want

We'll show examples of these default options throughout while showcasing all the specific post types.

Creating a photo post

Creating a photo post supports a bunch of different options plus the described default options:

• caption - a string, the user supplied caption
• link - a string, the "click-through" url for the photo
• source - a string, the url for the photo you want to use (use this or the data parameter)
• data - a list or string, a list of filepaths or a single file path for multipart file upload

#Creates a photo post using a source URL
client.create_photo(blogName, state="published", tags=["testing", "ok"],
source="<photo_url>")

#Creates a photo post using a local filepath
client.create_photo(blogName, state="queue", tags=["testing", "ok"],
tweet="Woah this is an incredible sweet post [URL]",
data="/Users/johnb/path/to/my/image.jpg")

#Creates a photoset post using several local filepaths
client.create_photo(blogName, state="draft", tags=["jb is cool"], format="markdown",
data=["/Users/johnb/path/to/my/image.jpg", "/Users/johnb/Pictures/kittens.jpg"],
caption="## Mega sweet kittens")

Creating a text post

Creating a text post supports the same options as default and just two other parameters:

• title - a string, the optional title for the post. Supports markdown or html
• body - a string, the body of the post. Supports markdown or html

#Creating a text post
client.create_text(blogName, state="published", slug="testing-text-posts", title="Testing", body="testing1 2 3 4")

Creating a quote post

Creating a quote post supports the same options as default and two other parameters:

• quote - a string, the full text of the quote. Supports markdown or html
• source - a string, the cited source. HTML supported

#Creating a quote post
client.create_quote(blogName, state="queue", quote="I am the Walrus", source="Ringo")

Creating a link post

• title - a string, the title of the post that you want. Supports HTML entities.
• url - a string, the url that you want to create a link post for.
• description - a string, the description of the link that you have

#Create a link post
client.create_link(blogName, title="I like to search things, you should too.", url="https://duckduckgo.com",
description="Search is pretty cool when a duck does it.")

Creating a chat post

Creating a chat post supports the same options as default and two other parameters:

• title - a string, the title of the chat post
• conversation - a string, the text of the conversation/chat, with dialogue labels (no html)

#Create a chat post
chat = """John: Testing can be fun!
Renee: Testing is tedious and so are you.
John: Aw.
"""
client.create_chat(blogName, title="Renee just doesn't understand.", conversation=chat, tags=["renee", "testing"])

Creating an audio post

Creating an audio post allows for all default options and has three other parameters. The only thing to keep in mind while dealing with audio posts is to make sure that you use either the external_url parameter or data. You cannot use both at the same time.

• caption - a string, the caption for your post
• external_url - a string, the url of the site that hosts the audio file
• data - a string, the filepath of the audio file you want to upload to Tumblr

#Creating an audio file
client.create_audio(blogName, caption="Rock out.", data="/Users/johnb/Music/my/new/sweet/album.mp3")

#lets use soundcloud!
client.create_audio(blogName, caption="Mega rock out.", external_url="https://soundcloud.com/skrillex/sets/recess")

Creating a video post

Creating a video post allows for all default options and has three other options. Like the other post types, it has some restrictions. You cannot use the embed and data parameters at the same time.

• caption - a string, the caption for your post
• embed - a string, the HTML embed code for the video
• data - a string, the path of the file you want to upload

#Creating a video post from embed code
client.create_video(blogName, caption="Jon Snow. Mega ridiculous sword.",
embed="<video_embed_code>")

#Creating a video post from local file
client.create_video(blogName, caption="testing", data="/Users/johnb/testing/ok/blah.mov")

Editing a post

Updating a post requires you to know what type of post you're updating. You can supply any of the options given above when updating the post.

client.edit_post(blogName, id=post_id, type="text", title="Updated")
client.edit_post(blogName, id=post_id, type="photo", data="/Users/johnb/mega/awesome.jpg")

Reblogging a Post

Reblogging a post just requires knowing the post id and the reblog key, which is supplied in the JSON of any post object.

client.reblog(blogName, id=125356, reblog_key="reblog_key")
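As a small sketch of how the two values might be pulled out of a post object, the helper below picks the `id` and `reblog_key` fields from a dict. The helper name `reblog_args` and the sample post are hypothetical, and the field names are assumed to match what the API returns for a post.

```python
def reblog_args(post):
    """Pick the two fields a reblog call needs out of a post dict."""
    return {"id": post["id"], "reblog_key": post["reblog_key"]}

# Hypothetical post object, shaped like one entry of a posts response.
post = {"id": 125356, "reblog_key": "abc123", "type": "text"}
args = reblog_args(post)
print(args)

# With a real client this could then be passed on as:
# client.reblog(blogName, **args)
```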

Deleting a post

Deleting just requires that you own the post and have the post id

client.delete_post(blogName, 123456) # Deletes your post :(

A note on tags: When passing tags as params, please pass them as a list (not a comma-separated string):

client.create_text(blogName, tags=['hello', 'world'], ...)
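If tags arrive as a comma-separated string from elsewhere in your code, a small helper can convert them before the call. `as_tag_list` is a hypothetical name used only for this sketch; it is not part of PyTumblr.

```python
def as_tag_list(tags):
    """Return tags as a list of non-empty, stripped strings."""
    if isinstance(tags, str):
        # Split a comma-separated string into its parts.
        tags = tags.split(",")
    return [t.strip() for t in tags if t.strip()]

print(as_tag_list("hello, world"))
print(as_tag_list(["hello", "world"]))
```

The result can then be passed as the `tags` argument of any of the create calls.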

Getting notes for a post

In order to get the notes for a post, you need to have the post id and the blog that it is on.

data = client.notes(blogName, id='123456')

The results include a timestamp you can use to make future calls.

#### Tagged Methods

# get posts with a given tag
client.tagged(tag, **params)

### Using the interactive console

This client comes with a nice interactive console to run you through the OAuth process, grab your tokens (and store them for future use).

You'll need pyyaml installed to run it, but then it's just:

$ python interactive_console.py

and away you go! Tokens are stored in ~/.tumblr and are also shared by other Tumblr API clients like the Ruby client.

### Running tests

The tests (and coverage reports) are run with nose, like this:

python setup.py test

Author: tumblr
Source Code: https://github.com/tumblr/pytumblr
License: Apache-2.0 license

## Matplotlib Cheat Sheet: Plotting in Python

This Matplotlib cheat sheet introduces you to the basics that you need to plot your data with Python and includes code samples.

Data visualization and storytelling with your data are essential skills that every data scientist needs to communicate insights gained from analyses effectively to any audience out there.

For most beginners, the first package that they use to get in touch with data visualization and storytelling is, naturally, Matplotlib: it is a Python 2D plotting library that enables users to make publication-quality figures. But, what might be even more convincing is the fact that other packages, such as Pandas, intend to build more plotting integration with Matplotlib as time goes on.

However, what might slow down beginners is the fact that this package is pretty extensive. There is so much that you can do with it and it might be hard to still keep a structure when you're learning how to work with Matplotlib.

DataCamp has created a Matplotlib cheat sheet for those who might already know how to use the package to their advantage to make beautiful plots in Python, but that still want to keep a one-page reference handy. Of course, for those who don't know how to work with Matplotlib, this might be the extra push to be convinced and to finally get started with data visualization in Python.

You'll see that this cheat sheet presents you with the six basic steps that you can go through to make beautiful plots.
With this handy reference, you'll familiarize yourself in no time with the basics of Matplotlib: you'll learn how you can prepare your data, create a new plot, use some basic plotting routines to your advantage, add customizations to your plots, and save, show and close the plots that you make.

What might have looked difficult before will definitely be more clear once you start using this cheat sheet! Use it in combination with the Matplotlib Gallery and the documentation.

Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.

## Prepare the Data

### 1D Data

>>> import numpy as np
>>> x = np.linspace(0, 10, 100)
>>> y = np.cos(x)
>>> z = np.sin(x)

### 2D Data or Images

>>> data = 2 * np.random.random((10, 10))
>>> data2 = 3 * np.random.random((10, 10))
>>> Y, X = np.mgrid[-3:3:100j, -3:3:100j]
>>> U = -1 - X**2 + Y
>>> V = 1 + X - Y**2
>>> from matplotlib.cbook import get_sample_data
>>> img = np.load(get_sample_data('axes_grid/bivariate_normal.npy'))

## Create Plot

>>> import matplotlib.pyplot as plt

### Figure

>>> fig = plt.figure()
>>> fig2 = plt.figure(figsize=plt.figaspect(2.0))

### Axes

>>> fig.add_axes()
>>> ax1 = fig.add_subplot(221) #row-col-num
>>> ax3 = fig.add_subplot(212)
>>> fig3, axes = plt.subplots(nrows=2,ncols=2)
>>> fig4, axes2 = plt.subplots(ncols=3)

## Save Plot

>>> plt.savefig('foo.png') #Save figures
>>> plt.savefig('foo.png', transparent=True) #Save transparent figures

## Show Plot

>>> plt.show()

## Plotting Routines

### 1D Data

>>> fig, ax = plt.subplots()
>>> lines = ax.plot(x,y) #Draw points with lines or markers connecting them
>>> ax.scatter(x,y) #Draw unconnected points, scaled or colored
>>> axes[0,0].bar([1,2,3],[3,4,5]) #Plot vertical rectangles (constant width)
>>> axes[1,0].barh([0.5,1,2.5],[0,1,2]) #Plot horizontal rectangles (constant height)
>>> axes[1,1].axhline(0.45) #Draw a horizontal line across axes
>>> axes[0,1].axvline(0.65) #Draw a vertical line across axes
>>> ax.fill(x,y,color='blue') #Draw filled polygons
>>> ax.fill_between(x,y,color='yellow') #Fill between y values and 0

### 2D Data

>>> fig, ax = plt.subplots()
>>> im = ax.imshow(img, cmap='gist_earth', interpolation='nearest', vmin=-2, vmax=2) #Colormapped or RGB arrays
>>> axes2[0].pcolor(data2) #Pseudocolor plot of 2D array
>>> axes2[0].pcolormesh(data) #Pseudocolor plot of 2D array
>>> CS = plt.contour(Y,X,U) #Plot contours
>>> axes2[2].contourf(data) #Plot filled contours
>>> ax.clabel(CS) #Label a contour plot

### Vector Fields

>>> axes[0,1].arrow(0,0,0.5,0.5) #Add an arrow to the axes
>>> axes[1,1].quiver(y,z) #Plot a 2D field of arrows
>>> axes[0,1].streamplot(X,Y,U,V) #Plot a 2D field of arrows

### Data Distributions

>>> ax1.hist(y) #Plot a histogram
>>> ax3.boxplot(y) #Make a box and whisker plot
>>> ax3.violinplot(z) #Make a violin plot

## Plot Anatomy & Workflow

### Plot Anatomy

(Figure: anatomy of a plot, with the x-axis and y-axis labeled.)

### Workflow

The basic steps to creating plots with matplotlib are:

1. Prepare Data
2. Create Plot
3. Plot
4. Customize Plot
5. Save Plot
6. Show Plot

>>> import matplotlib.pyplot as plt
>>> x = [1,2,3,4] #Step 1
>>> y = [10,20,25,30]
>>> fig = plt.figure() #Step 2
>>> ax = fig.add_subplot(111) #Step 3
>>> ax.plot(x, y, color='lightblue', linewidth=3) #Step 3, 4
>>> ax.scatter([2,4,6], [5,15,25], color='darkgreen', marker='^')
>>> ax.set_xlim(1, 6.5)
>>> plt.savefig('foo.png') #Step 5
>>> plt.show() #Step 6

## Close and Clear

>>> plt.cla() #Clear an axis
>>> plt.clf() #Clear the entire figure
>>> plt.close() #Close a window

## Customize Plot

### Colors, Color Bars & Color Maps

>>> plt.plot(x, x, x, x**2, x, x**3)
>>> ax.plot(x, y, alpha = 0.4)
>>> ax.plot(x, y, c='k')
>>> fig.colorbar(im, orientation='horizontal')
>>> im = ax.imshow(img, cmap='seismic')

### Markers

>>> fig, ax = plt.subplots()
>>> ax.scatter(x,y,marker=".")
>>> ax.plot(x,y,marker="o")

### Linestyles

>>> plt.plot(x,y,linewidth=4.0)
>>> plt.plot(x,y,ls='solid')
>>> plt.plot(x,y,ls='--')
>>> plt.plot(x,y,'--',x**2,y**2,'-.')
>>> plt.setp(lines,color='r',linewidth=4.0)

### Text & Annotations

>>> ax.text(1, -2.1, 'Example Graph', style='italic')
>>> ax.annotate("Sine", xy=(8, 0), xycoords='data', xytext=(10.5, 0), textcoords='data', arrowprops=dict(arrowstyle="->", connectionstyle="arc3"),)

### Mathtext

>>> plt.title(r'$\sigma_i=15$', fontsize=20)

### Limits, Legends and Layouts

Limits & Autoscaling

>>> ax.margins(x=0.0,y=0.1) #Add padding to a plot
>>> ax.axis('equal') #Set the aspect ratio of the plot to 1
>>> ax.set(xlim=[0,10.5],ylim=[-1.5,1.5]) #Set limits for x-and y-axis
>>> ax.set_xlim(0,10.5) #Set limits for x-axis

Legends

>>> ax.set(title='An Example Axes', ylabel='Y-Axis', xlabel='X-Axis') #Set a title and x-and y-axis labels
>>> ax.legend(loc='best') #No overlapping plot elements

Ticks

>>> ax.xaxis.set(ticks=range(1,5), ticklabels=[3,100, 12,"foo"]) #Manually set x-ticks
>>> ax.tick_params(axis='y', direction='inout', length=10) #Make y-ticks longer and go in and out

Subplot Spacing

>>> fig3.subplots_adjust(wspace=0.5, hspace=0.3, left=0.125, right=0.9, top=0.9, bottom=0.1) #Adjust the spacing between subplots
>>> fig.tight_layout() #Fit subplot(s) in to the figure area

Axis Spines

>>> ax1.spines['top'].set_visible(False) #Make the top axis line for a plot invisible
>>> ax1.spines['bottom'].set_position(('outward',10)) #Move the bottom axis line outward

Original article source at https://www.datacamp.com
## Pandas Bokeh: Bokeh Plotting Backend for Pandas and GeoPandas

Pandas-Bokeh provides a Bokeh plotting backend for Pandas, GeoPandas and Pyspark DataFrames, similar to the already existing Visualization feature of Pandas. Importing the library adds a complementary plotting method plot_bokeh() on DataFrames and Series.

With Pandas-Bokeh, creating stunning, interactive, HTML-based visualization is as easy as calling:

df.plot_bokeh()

Pandas-Bokeh also provides native support as a Pandas plotting backend for Pandas >= 0.25. When Pandas-Bokeh is installed, switching the default Pandas plotting backend to Bokeh can be done via:

pd.set_option('plotting.backend', 'pandas_bokeh')

More details about the new Pandas backend can be found below.

## Interactive Documentation

Please visit https://patrikhlobil.github.io/Pandas-Bokeh/ for an interactive version of the documentation below, where you can play with the dynamic Bokeh plots.

For more information have a look at the Examples below or at notebooks on the Github Repository of this project.

## Installation

You can install Pandas-Bokeh from PyPI via pip:

pip install pandas-bokeh

or conda:

conda install -c patrikhlobil pandas-bokeh

With the current release 0.5.5, Pandas-Bokeh officially supports Python 3.6 and newer. For more details, see Release Notes.

## How To Use

### Classical Use

The Pandas-Bokeh library should be imported after Pandas, GeoPandas and/or Pyspark. After the import, one should define the plotting output, which can be:

• pandas_bokeh.output_notebook(): Embeds the plots in the cell outputs of the notebook. Ideal when working in Jupyter Notebooks.
• pandas_bokeh.output_file(filename): Exports the plot to the provided filename as an HTML file.

For more details about the plotting outputs, see the reference here or the Bokeh documentation.
#### Notebook output (see also bokeh.io.output_notebook)

import pandas as pd
import pandas_bokeh
pandas_bokeh.output_notebook()

#### File output to "Interactive Plot.html" (see also bokeh.io.output_file)

import pandas as pd
import pandas_bokeh
pandas_bokeh.output_file("Interactive Plot.html")

### Pandas-Bokeh as native Pandas plotting backend

For pandas >= 0.25, a plotting backend switch is natively supported. It can be achieved by calling:

import pandas as pd
pd.set_option('plotting.backend', 'pandas_bokeh')

Now, the plotting API is accessible for a Pandas DataFrame via:

df.plot(...)

All additional functionalities of Pandas-Bokeh are then accessible at pd.plotting. So, setting the output to notebook is:

pd.plotting.output_notebook()

or calling the grid layout functionality:

pd.plotting.plot_grid(...)

Note: Backwards compatibility is kept since there will still be the df.plot_bokeh(...) methods for a DataFrame.

### Plot types

The plot types supported at the moment are the ones covered in the sections below: lineplot, pointplot, stepplot, scatterplot, barplot, histogram, areaplot, pieplot, mapplot and geoplots.

Also, check out the complementary chapter Outputs, Formatting & Layouts.

## Lineplot

### Basic Lineplot

This simple lineplot in Pandas-Bokeh already contains various interactive elements:

• a pannable and zoomable (zoom in plotarea and zoom on axis) plot
• by clicking on the legend elements, one can hide and show the individual lines
• a Hovertool for the plotted lines

Consider the following simple example:

import numpy as np

np.random.seed(42)
df = pd.DataFrame({"Google": np.random.randn(1000)+0.2,
"Apple": np.random.randn(1000)+0.17},
index=pd.date_range('1/1/2000', periods=1000))
df = df.cumsum()
df = df + 50
df.plot_bokeh(kind="line") #equivalent to df.plot_bokeh.line()

Note that, similar to the regular pandas.DataFrame.plot method, there are also additional accessors to directly access the different plotting types:

• df.plot_bokeh(kind="line", ...) → df.plot_bokeh.line(...)
• df.plot_bokeh(kind="bar", ...) → df.plot_bokeh.bar(...)
• df.plot_bokeh(kind="hist", ...) → df.plot_bokeh.hist(...)
• ...
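The equivalence between the kind keyword and the accessor methods follows the same callable-accessor pattern as the regular pandas plot interface. The sketch below is a minimal illustration of that pattern in plain Python. It is not the Pandas-Bokeh implementation; the class PlotAccessor and its return values are hypothetical.

```python
class PlotAccessor:
    """Hypothetical callable accessor: obj(kind="bar") behaves like obj.bar()."""

    _KINDS = ("line", "bar")

    def __init__(self, data):
        self._data = data

    def __call__(self, kind="line", **kwargs):
        # Dispatch on the kind keyword to the method of the same name.
        if kind not in self._KINDS:
            raise ValueError(f"Unknown plot kind: {kind!r}")
        return getattr(self, kind)(**kwargs)

    def line(self, **kwargs):
        # A real backend would build a figure; this sketch only describes it.
        return ("line", len(self._data), kwargs)

    def bar(self, **kwargs):
        return ("bar", len(self._data), kwargs)


accessor = PlotAccessor([1, 2, 3])
via_keyword = accessor(kind="bar", alpha=0.6)
via_method = accessor.bar(alpha=0.6)
print(via_keyword == via_method)
```

Both call styles end up in the same method, which is why the two spellings in the list above can be used interchangeably.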
#### Advanced Lineplot

There are various optional parameters to tune the plots, for example:

• kind: Which kind of plot should be produced. Currently supported are: "line", "point", "scatter", "bar" and "histogram". In the near future many more will be implemented, such as horizontal barplots, boxplots, pie-charts, etc.
• x: Name of the column to use for the horizontal x-axis. If the x parameter is not specified, the index is used for the x-values of the plot. Alternatively, an array of values can be passed that has the same number of elements as the DataFrame.
• y: Name of column or list of names of columns to use for the vertical y-axis.
• figsize: Choose width & height of the plot
• title: Sets title of the plot
• xlim/ylim: Set visible range of plot for x- and y-axis (also works for datetime x-axis)
• xlabel/ylabel: Set x- and y-labels
• logx/logy: Set log-scale on x-/y-axis
• xticks/yticks: Explicitly set the ticks on the axes
• color: Defines a single color for a plot.
• colormap: Can be used to specify multiple colors to plot. Can be either a list of colors or the name of a Bokeh color palette
• hovertool: If True a Hovertool is active, else if False no Hovertool is drawn.
• hovertool_string: If specified, this string will be used for the hovertool (@{column} will be replaced by the value of the column for the element the mouse hovers over, see also Bokeh documentation and here)
• toolbar_location: Specify the position of the toolbar location (None, "above", "below", "left" or "right"). Default: "right"
• zooming: Enables/Disables zooming. Default: True
• panning: Enables/Disables panning. Default: True
• fontsize_label/fontsize_ticks/fontsize_title/fontsize_legend: Set fontsize of labels, ticks, title or legend (int or string of form "15pt")
• rangetool: Enables a range tool scroller. Default: False
• **kwargs: Optional keyword arguments of bokeh.plotting.figure.line

Try them out to get a feeling for the effects.
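The rule for the x parameter can be made concrete with plain pandas. The helper resolve_x below is hypothetical and only mirrors the behaviour described in the list above (column name, explicit array, or the index as fallback); it is not taken from the Pandas-Bokeh source, and it assumes column names are strings.

```python
import numpy as np
import pandas as pd


def resolve_x(frame, x=None):
    """Return the x-values selected by the x parameter (hypothetical helper)."""
    if x is None:
        # No x given: fall back to the DataFrame index.
        return frame.index.to_numpy()
    if isinstance(x, str):
        # A string is interpreted as a column name.
        return frame[x].to_numpy()
    values = np.asarray(x)
    if len(values) != len(frame):
        raise ValueError("x must have as many elements as the DataFrame has rows")
    return values


df_demo = pd.DataFrame({"t": [10, 20, 30], "value": [1.0, 4.0, 9.0]})
print(resolve_x(df_demo))                   # values of the index
print(resolve_x(df_demo, "t"))              # values of column "t"
print(resolve_x(df_demo, [0.1, 0.2, 0.3]))  # explicit array of the same length
```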
Let us consider now:

df.plot_bokeh.line(
figsize=(800, 450),
y="Apple",
title="Apple vs Google",
xlabel="Date",
ylabel="Stock price [$]",
yticks=[0, 100, 200, 300, 400],
ylim=(0, 400),
toolbar_location=None,
colormap=["red", "blue"],
hovertool_string=r"""<img
height="42" alt="@imgs" width="42"
style="float: left; margin: 0px 15px 15px 0px;"
border="2"></img> Apple

<h4> Stock Price: </h4> @{Apple}""",
panning=False,
zooming=False)
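To make the @{column} placeholder of hovertool_string concrete, the following sketch substitutes column values into such a template for a single row. In the real plot this substitution is done by Bokeh in the browser; the function render_hover and the sample row are hypothetical and only illustrate the idea.

```python
import re


def render_hover(template, row):
    """Replace every @{name} placeholder with the matching value from row."""

    def substitute(match):
        name = match.group(1)
        # Placeholders without a matching column are left untouched here.
        return str(row.get(name, match.group(0)))

    return re.sub(r"@\{([^}]+)\}", substitute, template)


template = "<h4> Stock Price: </h4> @{Apple}"
row = {"Apple": 57.3, "Google": 61.8}  # made-up values for illustration
print(render_hover(template, row))
```

The curly braces matter when a column name contains spaces or special characters, because they mark exactly where the column name ends.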

#### Lineplot with data points

For lineplots, as for many other plot-kinds, there are some special keyword arguments that only work for this plotting type. For lineplots, these are:

• plot_data_points: Plot also the data points on the lines
• plot_data_points_size: Determines the size of the data points
• marker: Defines the point type (Default: "circle"). Possible values are: 'circle', 'square', 'triangle', 'asterisk', 'circle_x', 'square_x', 'inverted_triangle', 'x', 'circle_cross', 'square_cross', 'diamond', 'cross'
• **kwargs: Optional keyword arguments of bokeh.plotting.figure.line

Let us use this information to have another version of the same plot:

df.plot_bokeh.line(
figsize=(800, 450),
xlabel="Date",
ylabel="Stock price [$]",
yticks=[0, 100, 200, 300, 400],
ylim=(100, 200),
xlim=("2001-01-01", "2001-02-01"),
colormap=["red", "blue"],
plot_data_points=True,
plot_data_points_size=10,
marker="asterisk")

#### Lineplot with rangetool

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list('ABCD'))
df = df.cumsum()

df.plot_bokeh(rangetool=True)

## Pointplot

If you just wish to draw the data points for curves, the pointplot option is the right choice. It also accepts the kwargs of bokeh.plotting.figure.scatter like marker or size:

import numpy as np

x = np.arange(-3, 3, 0.1)
y2 = x**2
y3 = x**3
df = pd.DataFrame({"x": x, "Parabola": y2, "Cube": y3})
df.plot_bokeh.point(
x="x",
xticks=range(-3, 4),
size=5,
colormap=["#009933", "#ff3399"],
title="Pointplot (Parabola vs. Cube)",
marker="x")

## Stepplot

With a similar API as the line- & pointplots, one can generate a stepplot. Additional keyword arguments for this plot type are passed to bokeh.plotting.figure.step, e.g. mode (before, after, center), see the following example:

import numpy as np

x = np.arange(-3, 3, 1)
y2 = x**2
y3 = x**3
df = pd.DataFrame({"x": x, "Parabola": y2, "Cube": y3})
df.plot_bokeh.step(
x="x",
xticks=range(-1, 1),
colormap=["#009933", "#ff3399"],
title="Stepplot (Parabola vs. Cube)",
figsize=(800,300),
fontsize_title=30,
fontsize_label=25,
fontsize_ticks=15,
fontsize_legend=5,
)

df.plot_bokeh.step(
x="x",
xticks=range(-1, 1),
colormap=["#009933", "#ff3399"],
title="Stepplot (Parabola vs. Cube)",
mode="after",
figsize=(800,300)
)

Note that the step-plot API of Bokeh does not so far support a hovertool functionality.

## Scatterplot

A basic scatterplot can be created using the kind="scatter" option.
For scatterplots, the x and y parameters have to be specified and the following optional keyword arguments are allowed:

• category: Determines the category column to use for coloring the scatter points
• **kwargs: Optional keyword arguments of bokeh.plotting.figure.scatter

Note that the pandas.DataFrame.plot_bokeh() method returns per default a Bokeh figure, which can be embedded in dashboard layouts with other figures and Bokeh objects (for more details about (sub)plot layouts and embedding the resulting Bokeh plots as HTML click here).

In the example below, we use the built-in grid layout support of Pandas-Bokeh to display both the DataFrame (using a Bokeh DataTable) and the resulting scatterplot:

#Load Iris Dataset:
df = pd.read_csv(
r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/iris/iris.csv"
)
df = df.sample(frac=1)

#Create Bokeh-Table with DataFrame:
from bokeh.models.widgets import DataTable, TableColumn
from bokeh.models import ColumnDataSource

data_table = DataTable(
columns=[TableColumn(field=Ci, title=Ci) for Ci in df.columns],
source=ColumnDataSource(df),
height=300,
)

#Create Scatterplot:
p_scatter = df.plot_bokeh.scatter(
x="petal length (cm)",
y="sepal width (cm)",
category="species",
title="Iris DataSet Visualization",
show_figure=False,
)

#Combine Table and Scatterplot via grid layout:
pandas_bokeh.plot_grid([[data_table, p_scatter]], plot_width=400, plot_height=350)

One optional keyword parameter that can be passed to bokeh.plotting.figure.scatter is size.
Below, we use the sepal length of the Iris data as reference for the size:

#Change one value to clearly see the effect of the size keyword
df.loc[13, "sepal length (cm)"] = 15

#Make scatterplot:
p_scatter = df.plot_bokeh.scatter(
x="petal length (cm)",
y="sepal width (cm)",
category="species",
title="Iris DataSet Visualization with Size Keyword",
size="sepal length (cm)")

In this example you can see that the additional dimension sepal length cannot be used to clearly differentiate between the virginica and versicolor species.

## Barplot

The barplot API has no special keyword arguments, but accepts optional kwargs of bokeh.plotting.figure.vbar like alpha. By default it uses the index for the bar categories (however, columns can also be used as x-axis category using the x argument).

data = {
'fruits':
['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries'],
'2015': [2, 1, 4, 3, 2, 4],
'2016': [5, 3, 3, 2, 4, 6],
'2017': [3, 2, 4, 4, 5, 3]
}
df = pd.DataFrame(data).set_index("fruits")

p_bar = df.plot_bokeh.bar(
ylabel="Price per Unit [€]",
title="Fruit prices per Year",
alpha=0.6)

Using the stacked keyword argument you can also make stacked barplots:

p_stacked_bar = df.plot_bokeh.bar(
ylabel="Price per Unit [€]",
title="Fruit prices per Year",
stacked=True,
alpha=0.6)

Horizontal versions of the above barplot are also supported with the keyword kind="barh" or the accessor plot_bokeh.barh. You can still specify a column of the DataFrame as the bar category via the x argument if you do not wish to use the index.
#Reset index, such that "fruits" is now a column of the DataFrame:
df.reset_index(inplace=True)

#Create horizontal bar (via kind keyword):
p_hbar = df.plot_bokeh(
kind="barh",
x="fruits",
xlabel="Price per Unit [€]",
title="Fruit prices per Year",
alpha=0.6,
legend="bottom_right",
show_figure=False)

#Create stacked horizontal bar (via barh accessor):
p_stacked_hbar = df.plot_bokeh.barh(
x="fruits",
stacked=True,
xlabel="Price per Unit [€]",
title="Fruit prices per Year",
alpha=0.6,
legend="bottom_right",
show_figure=False)

#Plot all barplot examples in a grid:
pandas_bokeh.plot_grid([[p_bar, p_stacked_bar],
[p_hbar, p_stacked_hbar]],
plot_width=450)

## Histogram

For drawing histograms (kind="hist"), Pandas-Bokeh has a lot of customization features. Optional keyword arguments for histogram plots are:

• bins: Determines bins to use for the histogram. If bins is an int, it defines the number of equal-width bins in the given range (10, by default). If bins is a sequence, it defines the bin edges, including the rightmost edge, allowing for non-uniform bin widths. If bins is a string, it defines the method used to calculate the optimal bin width, as defined by histogram_bin_edges.
• histogram_type: Either "sidebyside", "topontop" or "stacked". Default: "topontop"
• stacked: Boolean that overrides the histogram_type as "stacked" if given. Default: False
• **kwargs: Optional keyword arguments of bokeh.plotting.figure.quad

Below are examples of the different histogram types:

import numpy as np

df_hist = pd.DataFrame({
'a': np.random.randn(1000) + 1,
'b': np.random.randn(1000),
'c': np.random.randn(1000) - 1
},
columns=['a', 'b', 'c'])

#Top-on-Top Histogram (Default):
df_hist.plot_bokeh.hist(
bins=np.linspace(-5, 5, 41),
vertical_xlabel=True,
hovertool=False,
title="Normal distributions (Top-on-Top)",
line_color="black")

#Side-by-Side Histogram (multiple bars share bin side-by-side) also accessible via
#kind="hist":
df_hist.plot_bokeh(
kind="hist",
bins=np.linspace(-5, 5, 41),
histogram_type="sidebyside",
vertical_xlabel=True,
hovertool=False,
title="Normal distributions (Side-by-Side)",
line_color="black")

#Stacked histogram:
df_hist.plot_bokeh.hist(
bins=np.linspace(-5, 5, 41),
histogram_type="stacked",
vertical_xlabel=True,
hovertool=False,
title="Normal distributions (Stacked)",
line_color="black")

Further, advanced keyword arguments for histograms are:

• weights: A column of the DataFrame that is used as weight for the histogram aggregation (see also numpy.histogram)
• normed: If True, histogram values are normed to 1 (sum of histogram values=1). It is also possible to pass an integer, e.g. normed=100 would result in a histogram with percentage y-axis (sum of histogram values=100). Default: False
• cumulative: If True, a cumulative histogram is shown. Default: False
• show_average: If True, the average of the histogram is also shown. Default: False

Their usage is shown in these examples:

p_hist = df_hist.plot_bokeh.hist(
y=["a", "b"],
bins=np.arange(-4, 6.5, 0.5),
normed=100,
vertical_xlabel=True,
ylabel="Share[%]",
title="Normal distributions (normed)",
show_average=True,
xlim=(-4, 6),
ylim=(0, 30),
show_figure=False)

p_hist_cum = df_hist.plot_bokeh.hist(
y=["a", "b"],
bins=np.arange(-4, 6.5, 0.5),
normed=100,
cumulative=True,
vertical_xlabel=True,
ylabel="Share[%]",
title="Normal distributions (normed & cumulative)",
show_figure=False)

pandas_bokeh.plot_grid([[p_hist, p_hist_cum]], plot_width=450, plot_height=300)

## Areaplot

Areaplots (kind="area") can be either drawn on top of each other or stacked. The important parameters are:

• stacked: If True, the areaplots are stacked. If False, plots are drawn on top of each other. Default: False
• **kwargs: Optional keyword arguments of bokeh.plotting.figure.patch

Let us consider the energy consumption split by source that can be downloaded as DataFrame via:

df_energy = pd.read_csv(r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/energy/energy.csv",
parse_dates=["Year"])
df_energy.head()

Creating the Areaplot can be achieved via:

df_energy.plot_bokeh.area(
x="Year",
stacked=True,
legend="top_left",
colormap=["brown", "orange", "black", "grey", "blue", "green"],
title="Worldwide energy consumption split by energy source",
ylabel="Million tonnes oil equivalent",
ylim=(0, 16000))

Note that the energy consumption of fossil energy is still increasing and renewable energy sources are still small in comparison 😢!!!
However, when we norm the plot using the normed keyword, there is a clear trend towards renewable energies in the last decade:

df_energy.plot_bokeh.area(
x="Year",
stacked=True,
normed=100,
legend="bottom_left",
colormap=["brown", "orange", "black", "grey", "blue", "green"],
title="Worldwide energy consumption split by energy source",
ylabel="Million tonnes oil equivalent")

## Pieplot

For Pieplots, let us consider a dataset showing the results of all Bundestag elections in Germany since 2002:

df_pie = pd.read_csv(r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/Bundestagswahl/Bundestagswahl.csv")
df_pie

We can create a Pieplot of the last election in 2017 by specifying the "Partei" (German for party) column as the x column and the "2017" column as the y column for values:

df_pie.plot_bokeh.pie(
x="Partei",
y="2017",
colormap=["blue", "red", "yellow", "green", "purple", "orange", "grey"],
title="Results of German Bundestag Election 2017",
)

When you pass several columns to the y parameter (not providing the y parameter assumes you plot all columns), multiple nested pieplots will be shown in one plot:

df_pie.plot_bokeh.pie(
x="Partei",
colormap=["blue", "red", "yellow", "green", "purple", "orange", "grey"],
title="Results of German Bundestag Elections [2002-2017]",
line_color="grey")

## Mapplot

The mapplot method of Pandas-Bokeh allows for plotting geographic points stored in a Pandas DataFrame on an interactive map. For more advanced Geoplots for line and polygon shapes have a look at the Geoplots examples for the GeoPandas API of Pandas-Bokeh.

For mapplots, only (latitude, longitude) pairs in geographic projection (WGS84) can be plotted on a map. The basic API has the following 2 base parameters:

• x: name of the longitude column of the DataFrame
• y: name of the latitude column of the DataFrame

The other optional keyword arguments are discussed in the section about the GeoPandas API, e.g. category for coloring the points.
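Background map tiles are commonly served in the Web Mercator projection, so WGS84 longitude/latitude pairs have to be converted at some point before they can be drawn on top of such tiles. The sketch below shows the standard spherical Mercator formula purely for illustration. The function name wgs84_to_web_mercator is an assumption, and no claim is made that Pandas-Bokeh performs the conversion with exactly this code.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, as used by Web Mercator


def wgs84_to_web_mercator(longitude, latitude):
    """Convert degrees of longitude/latitude to Web Mercator metres."""
    x = EARTH_RADIUS_M * math.radians(longitude)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(latitude) / 2))
    return x, y


# Example point (approximate coordinates of Berlin, used only for illustration).
x_berlin, y_berlin = wgs84_to_web_mercator(13.405, 52.52)
print(round(x_berlin), round(y_berlin))
```

This also explains why x is the longitude column and y the latitude column: longitude maps linearly to the horizontal axis, while latitude is stretched non-linearly along the vertical axis.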
Below is an example of plotting all cities with more than 1 million inhabitants:

df_mapplot = pd.read_csv(r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/populated%20places/populated_places.csv")
df_mapplot.head()

df_mapplot["size"] = df_mapplot["pop_max"] / 1000000
df_mapplot.plot_bokeh.map(
x="longitude",
y="latitude",
hovertool_string="""<h2> @{name} </h2>
<h3> Population: @{pop_max} </h3>""",
tile_provider="STAMEN_TERRAIN_RETINA",
size="size",
figsize=(900, 600),
title="World cities with more than 1.000.000 inhabitants")

## Geoplots

Pandas-Bokeh also allows for interactive plotting of maps using GeoPandas by providing a geopandas.GeoDataFrame.plot_bokeh() method. It allows plotting the following geodata on a map:

• Points/MultiPoints
• Lines/MultiLines
• Polygons/MultiPolygons

Note: It is not possible to mix the object types, i.e. a GeoDataFrame with Points and Lines is for example not allowed.

Let us start with a simple example using the "World Borders Dataset". Let us first import all necessary libraries and read the shapefile:

import geopandas as gpd
import pandas as pd
import pandas_bokeh
pandas_bokeh.output_notebook()

#Read in GeoJSON from URL:
df_states = gpd.read_file(r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/states/states.geojson")
df_states.head()

Plotting the data on a map is as simple as calling:

df_states.plot_bokeh(simplify_shapes=10000)

We also passed the optional parameter simplify_shapes (~meter) to improve plotting performance (for a reference see shapely.object.simplify). The above geolayer thus has an accuracy of about 10km.

Many keyword arguments like xlabel, ylabel, xlim, ylim, title, colormap, hovertool, zooming, panning, ... for customizing the plot are also available for the geoplotting API and can be used as in the examples shown above.
There are, however, also many other options especially for plotting geodata:

• geometry_column: Specify the column that stores the geometry-information (default: "geometry")
• hovertool_columns: Specify column names, for which values should be shown in hovertool
• hovertool_string: If specified, this string will be used for the hovertool (@{column} will be replaced by the value of the column for the element the mouse hovers over, see also Bokeh documentation)
• colormap_uselog: If set to True, the colormapper uses a log scale. Default: False
• colormap_range: Specify the value range of the colormapper via (min, max) tuple
• tile_provider: Define built-in tile provider for background maps. Possible values: None, 'CARTODBPOSITRON', 'CARTODBPOSITRON_RETINA', 'STAMEN_TERRAIN', 'STAMEN_TERRAIN_RETINA', 'STAMEN_TONER', 'STAMEN_TONER_BACKGROUND', 'STAMEN_TONER_LABELS'. Default: CARTODBPOSITRON_RETINA
• tile_provider_url: An arbitrary tile_provider_url of the form '/{Z}/{X}/{Y}*.png' can be passed to be used as background map.
• tile_attribution: String (also HTML accepted) for showing attribution for tile source in the lower right corner
• tile_alpha: Sets the alpha value of the background tile between [0, 1]. Default: 1

One of the most common uses of map plots is the choropleth map, where the color of each object is determined by a property of the object itself. There are 3 ways of drawing choropleth maps using Pandas-Bokeh, which are described below.

### Categories

This is the simplest way. Just provide the category keyword for the selection of the property column:

• category: Specifies the column of the GeoDataFrame that should be used to draw a choropleth map
• show_colorbar: Whether or not to show a colorbar for categorical plots. Default: True

Let us now draw the regions as a choropleth plot using the category keyword (at the moment, only numerical columns are supported for choropleth plots):

df_states.plot_bokeh(
figsize=(900, 600),
simplify_shapes=5000,
category="REGION",
show_colorbar=False,
colormap=["blue", "yellow", "green", "red"],
hovertool_columns=["STATE_NAME", "REGION"],
tile_provider="STAMEN_TERRAIN_RETINA")

When hovering over the states, the state-name and the region are shown as specified in the hovertool_columns argument.

### Dropdown

By passing a list of column names of the GeoDataFrame as the dropdown keyword argument, a dropdown menu is shown above the map. The user can use this dropdown menu to select the choropleth layer:

df_states["STATE_NAME_SMALL"] = df_states["STATE_NAME"].str.lower()

df_states.plot_bokeh(
figsize=(900, 600),
simplify_shapes=5000,
dropdown=["POPESTIMATE2010", "POPESTIMATE2017"],
colormap="Viridis",
hovertool_string="""
<img
src="https://www.states101.com/img/flags/gif/small/@STATE_NAME_SMALL.gif"
height="42" alt="@imgs" width="42"
style="float: left; margin: 0px 15px 15px 0px;"
border="2"></img>

<h2> @STATE_NAME </h2>
<h3> 2010: @POPESTIMATE2010 </h3>
<h3> 2017: @POPESTIMATE2017 </h3>""",
tile_provider_url=r"http://c.tile.stamen.com/watercolor/{Z}/{X}/{Y}.jpg",
tile_attribution='Map tiles by <a href="http://stamen.com">Stamen Design</a>, under <a href="http://creativecommons.org/licenses/by/3.0">CC BY 3.0</a>. Data by <a href="http://openstreetmap.org">OpenStreetMap</a>, under <a href="http://www.openstreetmap.org/copyright">ODbL</a>.'
)

Using hovertool_string, one can pass a string that can contain arbitrary HTML elements (including divs, images, ...) that is shown when hovering over the geographies (@{column} will be replaced by the value of the column for the element the mouse hovers over, see also Bokeh documentation).
Here, we also used an OSM tile server with watercolor style via tile_provider_url and added the attribution via tile_attribution.

### Sliders

Another option for interactive choropleth maps is the slider implementation of Pandas-Bokeh. The possible keyword arguments here are:

• slider: By passing a list of column names of the GeoDataFrame, a slider is shown that the user can use to select the choropleth layer.
• slider_range: Pass a range (or numpy.arange) object of numbers to relate the slider values to the slider columns. By passing range(0,10), the slider will have values [0, 1, 2, ..., 9]; when passing numpy.arange(3,5,0.5), the slider will have values [3, 3.5, 4, 4.5]. Default: range(0, len(slider))
• slider_name: Specifies the title of the slider. Default is an empty string.

This can be used to display the change in population relative to the year 2010:

#Calculate change of population relative to 2010:
for i in range(8):
    df_states["Delta_Population_201%d"%i] = ((df_states["POPESTIMATE201%d"%i] / df_states["POPESTIMATE2010"]) -1 ) * 100

#Specify slider columns:
slider_columns = ["Delta_Population_201%d"%i for i in range(8)]

#Specify slider-range (Maps "Delta_Population_2010" -> 2010,
# "Delta_Population_2011" -> 2011, ...):
slider_range = range(2010, 2018)

#Make slider plot:
df_states.plot_bokeh(
figsize=(900, 600),
simplify_shapes=5000,
slider=slider_columns,
slider_range=slider_range,
slider_name="Year",
colormap="Inferno",
hovertool_columns=["STATE_NAME"] + slider_columns,
title="Change of Population [%]")

### Plot multiple geolayers

If you wish to display multiple geolayers, you can pass the Bokeh figure of a Pandas-Bokeh plot via the figure keyword to the next plot_bokeh() call:

import geopandas as gpd
import pandas_bokeh
pandas_bokeh.output_notebook()

#Read in GeoJSONs from URL:
df_states = gpd.read_file(r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/states/states.geojson")
df_cities = gpd.read_file(
r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/populated%20places/ne_10m_populated_places_simple_bigcities.geojson"
)
df_cities["size"] = df_cities.pop_max / 400000

#Plot shapes of US states (pass figure options to this initial plot):
figure = df_states.plot_bokeh(
figsize=(800, 450),
simplify_shapes=10000,
show_figure=False,
xlim=[-170, -80],
ylim=[10, 70],
category="REGION",
colormap="Dark2",
legend="States",
show_colorbar=False,
)

#Plot cities as points on top of the US states layer by passing the figure:
df_cities.plot_bokeh(
figure=figure, # <== pass figure here!
category="pop_max",
colormap="Viridis",
colormap_uselog=True,
size="size",
hovertool_string="""<h1>@name</h1>
<h3>Population: @pop_max </h3>""",
marker="inverted_triangle",
legend="Cities",
)

### Point & Line plots:

Below, you can see an example that uses Pandas-Bokeh to plot point data on a map. The plot shows all cities with a population larger than 1.000.000. For point plots, you can select the marker as keyword argument (since it is passed to bokeh.plotting.figure.scatter). Here is an overview of all available marker types:

gdf = gpd.read_file(r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/populated%20places/ne_10m_populated_places_simple_bigcities.geojson")
gdf["size"] = gdf.pop_max / 400000

gdf.plot_bokeh(
category="pop_max",
colormap="Viridis",
colormap_uselog=True,
size="size",
hovertool_string="""<h1>@name</h1>
<h3>Population: @pop_max </h3>""",
xlim=[-15, 35],
ylim=[30,60],
marker="inverted_triangle");

In a similar way, GeoDataFrames with (multi)line shapes can also be drawn using Pandas-Bokeh.
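The effect of colormap_uselog=True in the point plot above can be pictured as normalising the values on a logarithmic instead of a linear scale before a color is looked up. The sketch below compares both normalisations for a few made-up population values. The function names and the numbers are illustrative assumptions, not the Pandas-Bokeh implementation.

```python
import math


def linear_position(value, low, high):
    """Position of value within [low, high] on a linear scale (0 to 1)."""
    return (value - low) / (high - low)


def log_position(value, low, high):
    """Position of value within [low, high] on a base-10 log scale (0 to 1)."""
    return (math.log10(value) - math.log10(low)) / (math.log10(high) - math.log10(low))


populations = [1_000_000, 3_000_000, 10_000_000, 30_000_000]  # made-up values
low, high = min(populations), max(populations)
for population in populations:
    print(population,
          round(linear_position(population, low, high), 3),
          round(log_position(population, low, high), 3))
```

On the linear scale the smaller cities cluster near the bottom of the colormap, whereas the logarithmic scale spreads them more evenly over the available colors.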
### Colorbar formatting:

If you want to display the numerical labels on your colorbar with an alternative to the scientific format, you can pass one of the Bokeh number string formats or an instance of one of the bokeh.models.formatters to the colorbar_tick_format argument in the geoplot.

An example of using the string format argument:

df_states = gpd.read_file(r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/states/states.geojson")

df_states["STATE_NAME_SMALL"] = df_states["STATE_NAME"].str.lower()

#pass in a string format to colorbar_tick_format to display the ticks as 10m rather than 1e7
df_states.plot_bokeh(
figsize=(900, 600),
category="POPESTIMATE2017",
simplify_shapes=5000,
colormap="Inferno",
colormap_uselog=True,
colorbar_tick_format="0.0a")

An example of using the Bokeh PrintfTickFormatter:

from bokeh.models.formatters import PrintfTickFormatter

df_states = gpd.read_file(r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/states/states.geojson")

df_states["STATE_NAME_SMALL"] = df_states["STATE_NAME"].str.lower()

for i in range(8):
    df_states["Delta_Population_201%d"%i] = ((df_states["POPESTIMATE201%d"%i] / df_states["POPESTIMATE2010"]) -1 ) * 100

#pass in a PrintfTickFormatter instance to colorbar_tick_format to display the ticks with 2 decimal places
df_states.plot_bokeh(
figsize=(900, 600),
category="Delta_Population_2017",
simplify_shapes=5000,
colormap="Inferno",
colorbar_tick_format=PrintfTickFormatter(format="%4.2f"))

## Outputs, Formatting & Layouts

### Output options

The pandas.DataFrame.plot_bokeh API has the following additional keyword arguments:

• show_figure: If True, the resulting figure is shown (either in the notebook or exported and shown as HTML file, see Basics). If False, None is returned. Default: True
• return_html: If True, the method call returns an HTML string that contains all Bokeh CSS&JS resources and the figure embedded in a div. This HTML representation of the plot can be used for embedding the plot in an HTML document. Default: False

If you have a Bokeh figure or layout, you can also use the pandas_bokeh.embedded_html function to generate an embeddable HTML representation of the plot. This can be included into any valid HTML (note that this is not possible directly with the HTML generated by the pandas_bokeh.output_file output option, because it includes an HTML header). Let us consider the following simple example:

#Import Pandas and Pandas-Bokeh (if you do not specify an output option, the standard is
#output_file):
import pandas as pd
import pandas_bokeh

#Create DataFrame to Plot:
import numpy as np
x = np.arange(-10, 10, 0.1)
sin = np.sin(x)
cos = np.cos(x)
tan = np.tan(x)
df = pd.DataFrame({"x": x, "sin(x)": sin, "cos(x)": cos, "tan(x)": tan})

#Make Bokeh plot from DataFrame using Pandas-Bokeh. Do not show the plot, but export
#it to an embeddable HTML string:
html_plot = df.plot_bokeh(
kind="line",
x="x",
y=["sin(x)", "cos(x)", "tan(x)"],
xticks=range(-20, 20),
title="Trigonometric functions",
show_figure=False,
return_html=True,
ylim=(-1.5, 1.5))

#Write some HTML and embed the HTML plot below it. For production use, please use
#Templates and the awesome Jinja library.
html = r"""
<script type="text/x-mathjax-config">
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\$','\$']]}});
</script>
<script type="text/javascript"
src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>

<h1> Trigonometric functions </h1>

<p> The basic trigonometric functions are:</p>

<p>$ sin(x) $</p>
<p>$ cos(x) $</p>
<p>$ tan(x) = \frac{sin(x)}{cos(x)}$</p>

<p>Below is a plot that shows them</p>

""" + html_plot

#Export the HTML string to an external HTML file and show it:
with open("test.html" , "w") as f:
    f.write(html)

import webbrowser
webbrowser.open("test.html")

This code will open up a web browser and show the following page.
As you can see, the interactive Bokeh plot is embedded nicely into the HTML layout. The return_html option is ideal for use in a templating engine like Jinja.

### Auto Scaling Plots

For single plots that have a number of x axis values or for larger monitors, you can auto scale the figure to the width of the entire Jupyter cell by setting the sizing_mode parameter.

df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot_bokeh(kind="bar", figsize=(500, 200), sizing_mode="scale_width")

The figsize parameter can be used to change the height and width as well as act as a scaling multiplier against the axis that is not being scaled.

### Number formats

To change the formats of numbers in the hovertool, use the number_format keyword argument. For documentation about the format to pass, have a look at the Bokeh documentation. Let us consider some examples for the number 3.141592653589793:

This number format will be applied to all numeric columns of the hovertool. If you want to make a very custom or complicated hovertool, you should probably use the hovertool_string keyword argument, see e.g. this example.

Below, we use the number_format parameter to specify the "Stock Price" format to 2 decimal digits and an additional $ sign.

import numpy as np
import pandas as pd
import pandas_bokeh

#Lineplot:
np.random.seed(42)
df = pd.DataFrame({
"Apple": np.random.randn(1000) + 0.17
},
index=pd.date_range('1/1/2000', periods=1000))
df = df.cumsum()
df = df + 50
df.plot_bokeh(
kind="line",
xlabel="Date",
ylabel="Stock price [$]",
yticks=[0, 100, 200, 300, 400],
ylim=(0, 400),
colormap=["red", "blue"],
number_format="1.00$")
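To make "2 decimal digits and an additional $ sign" concrete, here is a rough plain-Python analogue of that formatting. It uses Python's own format specification, not the Bokeh number-format syntax, so the two format strings are not interchangeable; the helper name `as_price` is hypothetical.

```python
def as_price(value: float) -> str:
    """Round to two decimal digits and append a dollar sign."""
    return f"{value:.2f} $"


print(as_price(3.141592653589793))  # 3.14 $
print(as_price(50))                 # 50.00 $
```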

#### Suppress scientific notation for axes

If you want to suppress the scientific notation for axes, you can use the disable_scientific_axes parameter, which accepts one of "x", "y", "xy":

df = pd.DataFrame({"Animal": ["Mouse", "Rabbit", "Dog", "Tiger", "Elephant", "Whale"],
"Weight [g]": [19, 3000, 40000, 200000, 6000000, 50000000]})
p_scientific = df.plot_bokeh(x="Animal", y="Weight [g]", show_figure=False)
p_non_scientific = df.plot_bokeh(x="Animal", y="Weight [g]", disable_scientific_axes="y", show_figure=False)
pandas_bokeh.plot_grid([[p_scientific, p_non_scientific]], plot_width=450)

### Dashboard Layouts

As shown in the Scatterplot Example, combining plots with other plots or HTML elements is straightforward in Pandas-Bokeh due to the layout capabilities of Bokeh. The easiest way to generate a dashboard layout is the pandas_bokeh.plot_grid method (which is an extension of bokeh.layouts.gridplot):

import pandas as pd
import numpy as np
import pandas_bokeh
pandas_bokeh.output_notebook()

#Barplot:
data = {
'fruits':
['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries'],
'2015': [2, 1, 4, 3, 2, 4],
'2016': [5, 3, 3, 2, 4, 6],
'2017': [3, 2, 4, 4, 5, 3]
}
df = pd.DataFrame(data).set_index("fruits")
p_bar = df.plot_bokeh(
kind="bar",
ylabel="Price per Unit [€]",
title="Fruit prices per Year",
show_figure=False)

#Lineplot:
np.random.seed(42)
df = pd.DataFrame({
"Apple": np.random.randn(1000) + 0.17
},
index=pd.date_range('1/1/2000', periods=1000))
df = df.cumsum()
df = df + 50
p_line = df.plot_bokeh(
kind="line",
xlabel="Date",
ylabel="Stock price [$]",
yticks=[0, 100, 200, 300, 400],
ylim=(0, 400),
colormap=["red", "blue"],
show_figure=False)

#Scatterplot:
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris["data"])
df.columns = iris["feature_names"]
df["species"] = iris["target"]
df["species"] = df["species"].map(dict(zip(range(3), iris["target_names"])))
p_scatter = df.plot_bokeh(
kind="scatter",
x="petal length (cm)",
y="sepal width (cm)",
category="species",
title="Iris DataSet Visualization",
show_figure=False)

#Histogram:
df_hist = pd.DataFrame({
'a': np.random.randn(1000) + 1,
'b': np.random.randn(1000),
'c': np.random.randn(1000) - 1
},
columns=['a', 'b', 'c'])

p_hist = df_hist.plot_bokeh(
kind="hist",
bins=np.arange(-6, 6.5, 0.5),
vertical_xlabel=True,
normed=100,
hovertool=False,
title="Normal distributions",
show_figure=False)

#Make Dashboard with Grid Layout:
pandas_bokeh.plot_grid([[p_line, p_bar],
[p_scatter, p_hist]], plot_width=450)
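The grid passed to plot_grid is a nested list with one inner list per row. When the figures are collected in a flat list, a small helper can build that nested structure. The sketch below is illustrative only: `to_grid` is a hypothetical helper, not part of Pandas-Bokeh, and plain strings stand in for the figures so that it runs without Bokeh installed.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def to_grid(items: Sequence[T], ncols: int) -> List[List[T]]:
    """Split a flat sequence into rows of at most `ncols` items each."""
    if ncols < 1:
        raise ValueError("ncols must be at least 1")
    return [list(items[i:i + ncols]) for i in range(0, len(items), ncols)]


# Strings stand in for the figures created above.
names = ["p_line", "p_bar", "p_scatter", "p_hist"]
grid = to_grid(names, ncols=2)
print(grid)  # [['p_line', 'p_bar'], ['p_scatter', 'p_hist']]
```

With the real figures, `pandas_bokeh.plot_grid(to_grid([p_line, p_bar, p_scatter, p_hist], ncols=2), plot_width=450)` should produce the same two-by-two arrangement as the call above.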

Using a combination of row and column elements (see also Bokeh Layouts) allows for a very flexible arrangement of elements. An alternative layout to the one above is:

p_line.plot_width = 900
p_hist.plot_width = 900

layout = pandas_bokeh.column(p_line,
pandas_bokeh.row(p_scatter, p_bar),
p_hist)

pandas_bokeh.show(layout)

## Release Notes

Release Notes can be found here.

## Contributing to Pandas-Bokeh

If you wish to contribute to the development of Pandas-Bokeh, you can follow the instructions in CONTRIBUTING.md.