There are many data visualization libraries in Python, yet Matplotlib is the most popular library out of all of them. Matplotlib’s popularity is due to its reliability and utility - it’s able to create both simple and complex plots with little code. You can also customize the plots in a variety of ways.
In this tutorial, we’ll cover how to plot Box Plots in Matplotlib.
Box plots are used to visualize summary statistics of a dataset, displaying attributes of the distribution such as the data’s range, median, quartiles, and outliers.
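As a minimal sketch of what such a plot looks like in code, the snippet below draws one box per group with `Axes.boxplot`. The three sample groups and their labels are hypothetical, generated only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample data: three groups with increasing spread
rng = np.random.default_rng(seed=0)
samples = [rng.normal(loc=0.0, scale=s, size=200) for s in (1.0, 2.0, 3.0)]

fig, ax = plt.subplots()
# One box per array: the box spans the quartiles, the inner line marks the median
result = ax.boxplot(samples)
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(["A", "B", "C"])
ax.set_ylabel("value")
```

Call `plt.show()` (or `fig.savefig(...)`) afterwards to view the figure. The dictionary returned by `boxplot` holds the drawn artists (boxes, medians, whiskers, caps, fliers) if you want to style them individually.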
#python #matplotlib #pandas #data visualization #data science
This Matplotlib cheat sheet introduces you to the basics that you need to plot your data with Python and includes code samples.
Data visualization and storytelling with your data are essential skills that every data scientist needs to communicate insights gained from analyses effectively to any audience out there.
For most beginners, the first package that they use to get in touch with data visualization and storytelling is, naturally, Matplotlib: it is a Python 2D plotting library that enables users to make publication-quality figures. But, what might be even more convincing is the fact that other packages, such as Pandas, intend to build more plotting integration with Matplotlib as time goes on.
However, what might slow down beginners is the fact that this package is pretty extensive. There is so much that you can do with it and it might be hard to still keep a structure when you're learning how to work with Matplotlib.
DataCamp has created a Matplotlib cheat sheet for those who already know how to use the package to make beautiful plots in Python but still want to keep a one-page reference handy. Of course, for those who don't know how to work with Matplotlib, this might be the extra push to be convinced and finally get started with data visualization in Python.
You'll see that this cheat sheet presents you with the six basic steps that you can go through to make beautiful plots.
With this handy reference, you'll familiarize yourself in no time with the basics of Matplotlib: you'll learn how you can prepare your data, create a new plot, use some basic plotting routines to your advantage, add customizations to your plots, and save, show and close the plots that you make.
Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.
>>> import numpy as np
>>> x = np.linspace(0, 10, 100)
>>> y = np.cos(x)
>>> z = np.sin(x)

>>> data = 2 * np.random.random((10, 10))
>>> data2 = 3 * np.random.random((10, 10))
>>> Y, X = np.mgrid[-3:3:100j, -3:3:100j]
>>> U = -1 - X**2 + Y
>>> V = 1 + X - Y**2
>>> from matplotlib.cbook import get_sample_data
>>> img = np.load(get_sample_data('axes_grid/bivariate_normal.npy'))

>>> import matplotlib.pyplot as plt

>>> fig = plt.figure()
>>> fig2 = plt.figure(figsize=plt.figaspect(2.0))

>>> fig.add_axes([0.1, 0.1, 0.8, 0.8]) #[left, bottom, width, height] as figure fractions
>>> ax1 = fig.add_subplot(221) #row-col-num
>>> ax3 = fig.add_subplot(212)
>>> fig3, axes = plt.subplots(nrows=2,ncols=2)
>>> fig4, axes2 = plt.subplots(ncols=3)

>>> plt.savefig('foo.png') #Save figures
>>> plt.savefig('foo.png', transparent=True) #Save transparent figures

>>> fig, ax = plt.subplots()
>>> lines = ax.plot(x,y) #Draw points with lines or markers connecting them
>>> ax.scatter(x,y) #Draw unconnected points, scaled or colored
>>> axes[0,0].bar([1,2,3],[3,4,5]) #Plot vertical rectangles (constant width)
>>> axes[1,0].barh([0.5,1,2.5],[0,1,2]) #Plot horizontal rectangles (constant height)
>>> axes[1,1].axhline(0.45) #Draw a horizontal line across axes
>>> axes[0,1].axvline(0.65) #Draw a vertical line across axes
>>> ax.fill(x,y,color='blue') #Draw filled polygons
>>> ax.fill_between(x,y,color='yellow') #Fill between y values and 0

>>> fig, ax = plt.subplots()
>>> im = ax.imshow(img, #Colormapped or RGB arrays
...                cmap='gist_earth', interpolation='nearest', vmin=-2, vmax=2)
>>> axes2[0].pcolor(data2) #Pseudocolor plot of 2D array
>>> axes2[0].pcolormesh(data) #Pseudocolor plot of 2D array
>>> CS = plt.contour(Y,X,U) #Plot contours
>>> axes2[2].contourf(data) #Plot filled contours
>>> ax.clabel(CS) #Label a contour plot

>>> axes[0,1].arrow(0,0,0.5,0.5) #Add an arrow to the axes
>>> axes[1,1].quiver(y,z) #Plot a 2D field of arrows
>>> axes[0,1].streamplot(X,Y,U,V) #Plot a 2D field of arrows

>>> ax1.hist(y) #Plot a histogram
>>> ax3.boxplot(y) #Make a box and whisker plot
>>> ax3.violinplot(z) #Make a violin plot
The basic steps to creating plots with matplotlib are:
1 Prepare Data
2 Create Plot
3 Plot
4 Customize Plot
5 Save Plot
6 Show Plot
>>> import matplotlib.pyplot as plt
>>> x = [1,2,3,4] #Step 1
>>> y = [10,20,25,30]
>>> fig = plt.figure() #Step 2
>>> ax = fig.add_subplot(111) #Step 3
>>> ax.plot(x, y, color='lightblue', linewidth=3) #Step 3, 4
>>> ax.scatter([2,4,6], [5,15,25], color='darkgreen', marker='^')
>>> ax.set_xlim(1, 6.5)
>>> plt.savefig('foo.png') #Step 5
>>> plt.show() #Step 6

>>> plt.cla() #Clear an axis
>>> plt.clf() #Clear the entire figure
>>> plt.close() #Close a window

>>> plt.plot(x, x, x, x**2, x, x**3)
>>> ax.plot(x, y, alpha=0.4)
>>> ax.plot(x, y, c='k')
>>> fig.colorbar(im, orientation='horizontal')
>>> im = ax.imshow(img, cmap='seismic')

>>> fig, ax = plt.subplots()
>>> ax.scatter(x,y,marker=".")
>>> ax.plot(x,y,marker="o")

>>> plt.plot(x,y,linewidth=4.0)
>>> plt.plot(x,y,ls='solid')
>>> plt.plot(x,y,ls='--')
>>> plt.plot(x,y,'--',x**2,y**2,'-.')
>>> plt.setp(lines,color='r',linewidth=4.0)

>>> ax.text(1, -2.1, 'Example Graph', style='italic')
>>> ax.annotate("Sine", xy=(8, 0), xycoords='data',
...             xytext=(10.5, 0), textcoords='data',
...             arrowprops=dict(arrowstyle="->", connectionstyle="arc3"),)

>>> plt.title(r'$\sigma_i=15$', fontsize=20)
Limits & Autoscaling
>>> ax.margins(x=0.0,y=0.1) #Add padding to a plot
>>> ax.axis('equal') #Set the aspect ratio of the plot to 1
>>> ax.set(xlim=[0,10.5],ylim=[-1.5,1.5]) #Set limits for x-and y-axis
>>> ax.set_xlim(0,10.5) #Set limits for x-axis

>>> ax.set(title='An Example Axes', #Set a title and x-and y-axis labels
...        ylabel='Y-Axis', xlabel='X-Axis')
>>> ax.legend(loc='best') #No overlapping plot elements

>>> ax.xaxis.set(ticks=range(1,5), #Manually set x-ticks
...              ticklabels=[3,100,12,"foo"])
>>> ax.tick_params(axis='y', #Make y-ticks longer and go in and out
...                direction='inout', length=10)

>>> fig3.subplots_adjust(wspace=0.5, #Adjust the spacing between subplots
...                      hspace=0.3, left=0.125, right=0.9, top=0.9, bottom=0.1)
>>> fig.tight_layout() #Fit subplot(s) in to the figure area

>>> ax1.spines['top'].set_visible(False) #Make the top axis line for a plot invisible
>>> ax1.spines['bottom'].set_position(('outward',10)) #Move the bottom axis line outward
Original article source at https://www.datacamp.com
#matplotlib #cheatsheet #python
Exploratory data analysis is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task and provides a better understanding of data set variables and the relationships between them. It can also help determine if the statistical techniques you are considering for data analysis are appropriate or not.
🔹 Topics Covered:
00:00:00 Basics of EDA with Python
01:40:10 Multiple Variate Analysis
02:30:26 Outlier Detection
03:44:48 Cricket World Cup Analysis using Exploratory Data Analysis
If we want to explain EDA in simple terms, it means trying to understand the given data much better, so that we can make some sense out of it.
We can find a more formal definition in Wikipedia.
In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
EDA in Python uses data visualization to draw meaningful patterns and insights. It also involves the preparation of data sets for analysis by removing irregularities in the data.
Based on the results of EDA, companies also make business decisions, which can have repercussions later.
In this article we’ll see about the following topics:
Data Sourcing is the process of finding and loading the data into our system. Broadly there are two ways in which we can find data.
As the name suggests, private data is given by private organizations. There are some security and privacy concerns attached to it. This type of data is mainly used for an organization’s internal analysis.
This type of data is available to everyone. We can find it on government websites, from public organizations, and so on. Anyone can access this data; we do not need any special permissions or approval.
We can get public data on the following sites.
The very first step of EDA is Data Sourcing. We have seen how we can access data and load it into our system. The next step is to clean the data.
After completing the Data Sourcing, the next step in the process of EDA is Data Cleaning. It is very important to get rid of the irregularities and clean the data after sourcing it into our system.
Irregularities in the data are of different types.
To perform the data cleaning we are using a sample data set, which can be found here.
We are using Jupyter Notebook for analysis.
First, let’s import the necessary libraries and store the data in our system for analysis.
# Import the useful libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Read the data set of "Marketing Analysis" in data.
data = pd.read_csv("marketing_analysis.csv")

# Printing the data
data
Now, the data set looks like this,
If we observe the above dataset, there are some discrepancies in the Column header for the first 2 rows. The correct data is from the index number 1. So, we have to fix the first two rows.
This is called Fixing the Rows and Columns. Let’s ignore the first two rows and load the data again.
# Import the useful libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Read the file in data without the first two rows, as they are of no use.
data = pd.read_csv("marketing_analysis.csv", skiprows=2)

# Print the head of the data frame.
data.head()
Now, the dataset looks like this, and it makes more sense.
Dataset after fixing the rows and columns
Following are the steps to be taken while Fixing Rows and Columns:
Now, if we observe the above dataset, the customerid column is of no importance to our analysis, and the jobedu column holds both the job and the education information.
So, we’ll drop the customerid column, split the jobedu column into two new columns, job and education, and then drop the jobedu column as well.
# Drop the customer id as it is of no use.
data.drop('customerid', axis=1, inplace=True)

# Extract job and education from the "jobedu" column.
# split(",") returns a list, so pick the first part for job and the second for education.
data['job'] = data["jobedu"].apply(lambda x: x.split(",")[0])
data['education'] = data["jobedu"].apply(lambda x: x.split(",")[1])

# Drop the "jobedu" column from the dataframe.
data.drop('jobedu', axis=1, inplace=True)

# Printing the dataset
data
Now, the dataset looks like this,
Dataset after dropping the customerid and jobedu columns and adding the job and education columns
If there are missing values in the Dataset before doing any statistical analysis, we need to handle those missing values.
There are mainly three types of missing values.
Let’s see which columns have missing values in the dataset.
# Checking the missing values
data.isnull().sum()
The output will be,
As we can see three columns contain missing values. Let’s see how to handle the missing values. We can handle missing values by dropping the missing records or by imputing the values.
Drop the missing Values
Let’s handle missing values in the age column.
# Dropping the records with age missing in data dataframe.
data = data[~data.age.isnull()].copy()

# Checking the missing values in the dataset.
data.isnull().sum()
Let’s check the missing values in the dataset now.
Let’s impute values to the missing values for the month column.
Since the month column is of an object type, let’s calculate the mode of that column and impute those values to the missing values.
# Find the mode of month in data (mode() returns a Series, so take its first entry).
month_mode = data.month.mode()[0]

# Fill the missing values with the mode value of month in data.
data['month'] = data['month'].fillna(month_mode)

# Let's see the null values in the month column.
data.month.isnull().sum()
Now output is,
# Mode of month is 'may, 2017'
# Null values in month column after imputing with mode
0
Next, let’s handle the missing values in the Response column. Since Response is our target column, imputing values into it would affect our analysis. So, it is better to drop the records with missing values in the Response column.
# Drop the records with response missing in data.
data = data[~data.response.isnull()].copy()

# Calculate the missing values in each column of the data frame.
data.isnull().sum()
Let’s check whether the missing values in the dataset have been handled or not,
All the missing values have been handled
We can also fill the missing values with ‘NaN’ so that they do not affect the outcome of any statistical analysis.
We have seen how to fix missing values, now let’s see how to handle outliers in the dataset.
Outliers are the values that are far beyond the next nearest data points.
There are two types of outliers:
So, after understanding the causes of these outliers, we can handle them by dropping those records or imputing with the values or leaving them as is, if it makes more sense.
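The text does not name a detection rule, so the sketch below assumes one common choice, the 1.5 × IQR rule (the same rule a box plot uses for its whiskers), and shows the two handling options mentioned above: dropping the records or capping the values. The `balance` values are hypothetical.

```python
import pandas as pd

# Hypothetical balances; the last value is deliberately extreme
balance = pd.Series([120.0, 150.0, 160.0, 175.0, 180.0, 200.0, 210.0, 5000.0])

# Quartiles and interquartile range
q1 = balance.quantile(0.25)
q3 = balance.quantile(0.75)
iqr = q3 - q1

# Assumed rule: anything beyond 1.5 * IQR from the quartiles is flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (balance < lower) | (balance > upper)

outliers = balance[is_outlier]                   # records flagged as outliers
cleaned = balance[~is_outlier]                   # option 1: drop them
capped = balance.clip(lower=lower, upper=upper)  # option 2: cap them at the fences
```

Whether to drop, cap, or keep a flagged value still depends on why it is extreme, as discussed above; the rule only points at candidates.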
To perform data analysis on a set of values, we have to make sure the values in the same column should be on the same scale. For example, if the data contains the values of the top speed of different companies’ cars, then the whole column should be either in meters/sec scale or miles/sec scale.
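A minimal sketch of this kind of standardization, assuming a hypothetical `cars` table in which speeds were recorded in mixed units (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical data: top speeds recorded in two different units
cars = pd.DataFrame({
    "model": ["A", "B", "C"],
    "top_speed": [200.0, 120.0, 250.0],
    "unit": ["km/h", "mph", "km/h"],
})

# Conversion factors to a single target unit (1 mile = 1.609344 km)
KMH_PER_UNIT = {"km/h": 1.0, "mph": 1.609344}

# Map each row's unit to its factor and convert the whole column to km/h
cars["top_speed_kmh"] = cars["top_speed"] * cars["unit"].map(KMH_PER_UNIT)
```

After the conversion, statistics such as the mean or plots of `top_speed_kmh` compare like with like.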
Now, that we are clear on how to source and clean the data, let’s see how we can analyze the data.
If we analyze data over a single variable/column from a dataset, it is known as Univariate Analysis.
Categorical Unordered Univariate Analysis:
An unordered variable is a categorical variable that has no defined order. If we take our data as an example, the job column in the dataset is divided into many sub-categories like technician, blue-collar, services, management, etc. There is no weight or measure given to any value in the ‘job’ column.
Now, let’s analyze the job category by using plots. Since Job is a category, we will plot the bar plot.
# Let's calculate the percentage of each job status category.
data.job.value_counts(normalize=True)

# Plot the bar graph of percentage job categories
data.job.value_counts(normalize=True).plot.barh()
plt.show()
The output looks like this,
From the above bar plot, we can infer that the data set contains more blue-collar workers than any other category.
Categorical Ordered Univariate Analysis:
Ordered variables are those variables that have a natural rank of order. Some examples of categorical ordered variables from our dataset are:
Now, let’s analyze the Education variable from the dataset. Since we’ve already seen a bar plot, let’s see what a pie chart looks like.
# Calculate the percentage of each education category.
data.education.value_counts(normalize=True)

# Plot the pie chart of education categories
data.education.value_counts(normalize=True).plot.pie()
plt.show()
The output will be,
From the above analysis, we can infer that most people in the data set have secondary education, followed by tertiary and then primary. A very small percentage have an unknown education level.
This is how we perform univariate analysis of a categorical variable. If the column or variable is numerical, we analyze it by calculating its mean, median, standard deviation, etc. We can get those values by using the describe function.
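The snippet for this step is not shown here; a minimal sketch of the call, with a small hypothetical DataFrame standing in for the numeric salary column of the marketing data, might look like this:

```python
import pandas as pd

# Hypothetical stand-in for the 'salary' column of the marketing data set
data = pd.DataFrame({"salary": [20000, 50000, 60000, 100000, 120000]})

# describe() returns count, mean, std, min, quartiles (25%, 50%, 75%) and max
summary = data["salary"].describe()
print(summary)
```

The `50%` entry is the median, so comparing it with `mean` gives a quick hint about skew.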
The output will be,
If we analyze data by taking two variables/columns into consideration from a dataset, it is known as Bivariate Analysis.
a) Numeric-Numeric Analysis:
Analyzing the two numeric variables from a dataset is known as numeric-numeric analysis. We can analyze it in three different ways.
Let’s take three columns, ‘Balance’, ‘Age’ and ‘Salary’, from our dataset and see what we can infer by drawing scatter plots of salary against balance and of age against balance.
# Plot the scatter plot of balance and salary variable in data
plt.scatter(data.salary, data.balance)
plt.show()

# Plot the scatter plot of balance and age variable in data
data.plot.scatter(x="age", y="balance")
plt.show()
Now, the scatter plots look like this,
Now, let’s plot Pair Plots for the three columns we used in plotting Scatter plots. We’ll use the seaborn library for plotting Pair Plots.
# Plot the pair plot of salary, balance and age in data dataframe.
sns.pairplot(data=data, vars=['salary', 'balance', 'age'])
plt.show()
The Pair Plot looks like this,
Since we cannot use more than two variables as x-axis and y-axis in Scatter and Pair Plots, it is difficult to see the relation between three numerical variables in a single graph. In those cases, we’ll use the correlation matrix.
# Creating a matrix using age, salary, balance as rows and columns
data[['age', 'salary', 'balance']].corr()

# Plot the correlation matrix of salary, balance and age in data dataframe.
sns.heatmap(data[['age', 'salary', 'balance']].corr(), annot=True, cmap='Reds')
plt.show()
First, we created a correlation matrix using age, salary, and balance. After that, we plot a heatmap of the matrix using the seaborn library.
b) Numeric - Categorical Analysis
Analyzing the one numeric variable and one categorical variable from a dataset is known as numeric-categorical analysis. We analyze them mainly using mean, median, and box plots.
Let’s take the salary and response columns from our dataset.
First, check the mean value using groupby.
# Group by the response to find the mean of the salary with response no & yes separately.
data.groupby('response')['salary'].mean()
The output will be,
There is not much of a difference between the yes and no response based on the salary.
Let’s calculate the median,
# Group by the response to find the median of the salary with response no & yes separately.
data.groupby('response')['salary'].median()
The output will be,
By both mean and median, we can say that the yes and no responses look the same irrespective of the person’s salary. But is that truly the case? Let’s plot the box plot for them and check the behavior.
# Plot the box plot of salary for yes & no responses.
# Recent seaborn versions require x and y to be passed as keyword arguments.
sns.boxplot(x=data.response, y=data.salary)
plt.show()
The box plot looks like this,
As we can see, when we plot the Box Plot, it paints a very different picture compared to mean and median. The IQR for customers who gave a positive response is on the higher salary side.
This is how we analyze Numeric-Categorical variables, we use mean, median, and Box Plots to draw some sort of conclusions.
c) Categorical — Categorical Analysis
Since our target variable/column is the Response rate, we’ll see how the different categories like Education, Marital Status, etc., are associated with the Response column. So instead of ‘Yes’ and ‘No’ we will convert them into ‘1’ and ‘0’, by doing that we’ll get the “Response Rate”.
# Create response_rate of numerical data type where response "yes" = 1, "no" = 0
data['response_rate'] = np.where(data.response == 'yes', 1, 0)
data.response_rate.value_counts()
The output looks like this,
Let’s see how the response rate varies for different categories in marital status.
# Plot the bar graph of marital status with average value of response_rate
data.groupby('marital')['response_rate'].mean().plot.bar()
plt.show()
The graph looks like this,
By the above graph, we can infer that the positive response is more for Single status members in the data set. Similarly, we can plot the graphs for Loan vs Response rate, Housing Loans vs Response rate, etc.
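The same pattern carries over to the other categorical columns. As a rough sketch, here is the Loan vs Response rate version on a tiny hypothetical table; the `loan` column name and its values are assumptions made for illustration, not taken from the data set shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the marketing data: loan status and campaign response
data = pd.DataFrame({
    "loan": ["yes", "no", "no", "yes", "no", "no"],
    "response": ["no", "yes", "no", "no", "yes", "no"],
})

# Encode the response as 1/0 so that its mean is the response rate
data["response_rate"] = np.where(data["response"] == "yes", 1, 0)

# Average response rate per loan category, drawn as a bar chart
rate_by_loan = data.groupby("loan")["response_rate"].mean()
ax = rate_by_loan.plot.bar()
ax.set_ylabel("response rate")
```

Swapping `"loan"` for another categorical column name gives the corresponding chart for that variable.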
If we analyze data by taking more than two variables/columns into consideration from a dataset, it is known as Multivariate Analysis.
Let’s see how ‘Education’, ‘Marital’, and ‘Response_rate’ vary with each other.
First, we’ll create a pivot table with the three columns and after that, we’ll create a heatmap.
result = pd.pivot_table(data=data, index='education', columns='marital', values='response_rate')
print(result)

# Create heat map of education vs marital vs response_rate
sns.heatmap(result, annot=True, cmap='RdYlGn', center=0.117)
plt.show()
The pivot table and heatmap look like this,
Based on the heatmap, we can infer that married people with primary education are less likely to respond positively to the survey, and single people with tertiary education are most likely to respond positively to the survey.
Similarly, we can plot the graphs for Job vs marital vs response, Education vs poutcome vs response, etc.
This is how we’ll do Exploratory Data Analysis. Exploratory Data Analysis (EDA) helps us to look beyond the data. The more we explore the data, the more the insights we draw from it. As a data analyst, almost 80% of our time will be spent understanding data and solving various business problems through EDA.
Thank you for reading and Happy Coding!!!
Matplotlib Tutorial - Bar Charts and reading in CSV Data - (Part 2)
Part 2 of our Matplotlib Tutorial Videos Series is out…
In this video, we will learn to create bar charts in Matplotlib and to put bar charts side by side. We will also read in a CSV file to create bar charts in Matplotlib.
#matplotlib #barcharts #pythonplotting #matplotlibtutorials #matplotlibvideos
#matplotlib #Matplotlib-tutorial-videos #matplotlib-plotting #matplotlib-bar-charts
Intuitively show statistics, error bars, or custom functions
The box plot is a quick and convenient way to get a feel for your data set: it can give you a snapshot of your range, median, quartiles, and outliers. Sure, it’s not as descriptive as a histogram or KDE for the distribution, but it’s fantastic for seeing how our distributions change over our variables.
While data scientists and most technical people are familiar and comfortable with box plots, they can be pretty foreign to people in non-engineering or statistics domains. And herein lies the problem: you need a way to show how your data trends and its distributions change in a format that anyone can understand.
I’m going to show you step-by-step how to make a line plot that conveys as much or as little information from the box plot that you’d like to share — means, medians, ranges, quartiles, standard deviation error bars, or any custom values.
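One way such a plot could be put together is sketched below: group the data, compute the median and quartiles per group, then draw the median as a line with the interquartile range as a shaded band. The `step` and `value` columns and the generated data are hypothetical, and this is only one possible construction, not necessarily the author's.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 50 measurements for each of 5 steps, drifting upward
rng = np.random.default_rng(seed=1)
steps = np.repeat(np.arange(5), 50)
df = pd.DataFrame({"step": steps, "value": rng.normal(loc=steps, scale=1.0)})

# Per-step summary statistics, the same numbers a box plot would encode
grouped = df.groupby("step")["value"]
stats = pd.DataFrame({
    "median": grouped.median(),
    "q1": grouped.quantile(0.25),
    "q3": grouped.quantile(0.75),
})

fig, ax = plt.subplots()
# Median as a line, interquartile range as a shaded band around it
ax.plot(stats.index, stats["median"], marker="o", label="median")
ax.fill_between(stats.index, stats["q1"], stats["q3"], alpha=0.3,
                label="interquartile range")
ax.set_xlabel("step")
ax.set_ylabel("value")
ax.legend()
```

Replacing the quartiles with `grouped.mean()` and `grouped.std()`, and `fill_between` with `ax.errorbar`, would give the mean with standard deviation error bars instead.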
#box-plot #pandas #line-plot #visualization #matplotlib
You have a great quality product in your hand, and now you are thinking of ways to present it to customers in the most appealing way. Many brands are afraid of spending money on custom boxes without knowing that it is the most cost-effective way, with various benefits. When designed with proper planning and creativity, custom packaging can pay off. You shouldn’t mind spending a little more to save money in the long run. You may disagree with us, but investment in customization and personalization can open new doors of success for your business.
From our kitchen cabinets to the big supermarkets, custom boxes are everywhere. We are surrounded by customization, even if it is about selling a small cosmetic product. When you deal in the soap market, you have to face a lot of competition. You can’t compete in the market with a strategy of using plain cardboard boxes. Gone are the days when you can use a simple packaging solution. Today is the age of customization, and you have to use custom Boxes for Soap. These are not only affordable but also look good on the shelves and provide an ultimate customer experience. Let’s take a deeper look at how custom packaging is a cost-effective solution with several benefits.
During this pandemic, the e-commerce industry has grown very fast, and every brand is selling its products online. One thing holding manufacturers back is the high shipping price. But to make the right decision, you need to understand how shipping prices work. It is not only about the weight of the box; consider dimensional weight as well. Even if you are shipping a small item like soap, the box size can increase the shipping cost. So, the first step towards price reduction is using a box that fits the product size and is also light in weight.
Product damage is one of the most common issues brands face. Damaged and broken products always result in returns, which ultimately means additional cost. No brand will ever want to face negative reviews and customer backlash. You can avoid it by using durable and sturdy boxes. When it comes to protection, corrugated and cardboard soap boxes can outclass every other solution. These two materials are quite affordable and readily available in the market. Once again, it is highly recommended to use the right box size to avoid damage and returns. You can also use other packaging materials for added protection.
When it comes to increasing your visibility and exposure in retail stores, there is no better option other than custom containers. These come in a variety of shapes and styles, which increase the customer’s interest in your product. Custom Pillow Boxes are the right choice to present your soap products on the shelves. You can customize the pillow packaging further with other customization options like window patching, lamination, and gold stamping. Unique solutions always make customers take a closer look at the product, and most probably they will end up buying it.
We have mentioned it before, but it bears repeating. A wrong size box will always cost you more. If you think you can save by choosing standard size boxes for all the products, you are wrong. It will cost you more in the form of damaged products and returns. Moreover, the bigger the box, the more you have to pay for shipping. So, always choose the right size, which suits the product dimensions, and don’t leave too much space in the containers. A bigger box will also force you to use more protective material.
If you still think that custom boxes are way out of your range, we still have many options for you. Take a simple corrugated or cardboard box in white or any plain color, print your logo on it, and you have a custom box in your hand. Getting a custom solution for your soap products has never been so easy. Today is the age of minimalism, and you can take advantage of this trend. Use simple customization for a natural and minimal look. Cardboard and corrugated board are the most affordable options when it comes to custom material.
Custom packaging offers several benefits. From product protection to the customer experience, you will get everything with a custom solution. The biggest benefits of providing a personalized experience are repeat business and positive reviews from the customers. When you put your heart and effort into the customization, customers will reward you with positive feedback. A good review from the customers will attract more customers to your business. Satisfied customers bring repeat business and higher brand recall.
When it comes to being sustainable, there is no better option than Kraft. Use Kraft packaging boxes to display your products on the shelves. It will attract more and more eco-conscious customers, which will ultimately result in boosted sales. It is not only a cheap option but also offers 100% recyclability. The presentation and display of your product have a great role to play in drawing attention. Using the same old-style packaging is not going to help you out. Think of something innovative and try using Kraft boxes for better results. Find a solution that is not only unique but also meets the customer’s needs.
When it comes to soap packaging, Kraft Boxes for Display are a perfect choice. These are not only cost-efficient but result in customer satisfaction, repeat business, and reduced shipping cost. Find a solution that meets your needs without breaking the budget.
#boxes for soap #boxes for pillow #boxes for display #soap boxes #pillow boxes #display boxes