1619467380

`pandas` and `plotly`.

- Install packages and tools
- Import the data
- Analyze the data
- Graphically present the data in US choropleth map

#colab #plotly #pandas #covid19 #google-colab

1667425440

Perl script converts PDF files to Gerber format

Pdf2Gerb generates Gerber 274X photoplotting and Excellon drill files from PDFs of a PCB. Up to three PDFs are used: the top copper layer, the bottom copper layer (for 2-sided PCBs), and an optional silk screen layer. The PDFs can be created directly from any PDF drawing software, or a PDF print driver can be used to capture the Print output if the drawing software does not directly support output to PDF.

The general workflow is as follows:

- Design the PCB using your favorite CAD or drawing software.
- Print the top and bottom copper and top silk screen layers to a PDF file.
- Run Pdf2Gerb on the PDFs to create Gerber and Excellon files.
- Use a Gerber viewer to double-check the output against the original PCB design.
- Make adjustments as needed.
- Submit the files to a PCB manufacturer.

Please note that Pdf2Gerb does NOT perform DRC (Design Rule Checks), as these will vary according to individual PCB manufacturer conventions and capabilities. Also note that Pdf2Gerb is not perfect, so the output files must always be checked before submitting them. As of version 1.6, Pdf2Gerb supports most PCB elements, such as round and square pads, round holes, traces, SMD pads, ground planes, no-fill areas, and panelization. However, because it interprets the graphical output of a Print function, there are limitations in what it can recognize (or there may be bugs).

See docs/Pdf2Gerb.pdf for install/setup, config, usage, and other info.

```
#Pdf2Gerb config settings:
#Put this file in same folder/directory as pdf2gerb.pl itself (global settings),
#or copy to another folder/directory with PDFs if you want PCB-specific settings.
#There is only one user of this file, so we don't need a custom package or namespace.
#NOTE: all constants defined in here will be added to main namespace.
#package pdf2gerb_cfg;
use strict; #trap undef vars (easier debug)
use warnings; #other useful info (easier debug)
##############################################################################################
#configurable settings:
#change values here instead of in main pdf2gerb.pl file
use constant WANT_COLORS => ($^O !~ m/Win/); #ANSI colors may not work on Windows; this must be set before the first DebugPrint() call
#just a little warning; set realistic expectations:
#DebugPrint("${\(CYAN)}Pdf2Gerb.pl ${\(VERSION)}, $^O O/S\n${\(YELLOW)}${\(BOLD)}${\(ITALIC)}This is EXPERIMENTAL software. \nGerber files MAY CONTAIN ERRORS. Please CHECK them before fabrication!${\(RESET)}", 0); #if WANT_DEBUG
use constant METRIC => FALSE; #set to TRUE for metric units (only affect final numbers in output files, not internal arithmetic)
use constant APERTURE_LIMIT => 0; #34; #max #apertures to use; generate warnings if too many apertures are used (0 to not check)
use constant DRILL_FMT => '2.4'; #'2.3'; #'2.4' is the default for PCB fab; change to '2.3' for CNC
use constant WANT_DEBUG => 0; #10; #level of debug wanted; higher == more, lower == less, 0 == none
use constant GERBER_DEBUG => 0; #level of debug to include in Gerber file; DON'T USE FOR FABRICATION
use constant WANT_STREAMS => FALSE; #TRUE; #save decompressed streams to files (for debug)
use constant WANT_ALLINPUT => FALSE; #TRUE; #save entire input stream (for debug ONLY)
#DebugPrint(sprintf("${\(CYAN)}DEBUG: stdout %d, gerber %d, want streams? %d, all input? %d, O/S: $^O, Perl: $]${\(RESET)}\n", WANT_DEBUG, GERBER_DEBUG, WANT_STREAMS, WANT_ALLINPUT), 1);
#DebugPrint(sprintf("max int = %d, min int = %d\n", MAXINT, MININT), 1);
#define standard trace and pad sizes to reduce scaling or PDF rendering errors:
#This avoids weird aperture settings and replaces them with more standardized values.
#(I'm not sure how photoplotters handle strange sizes).
#Fewer choices here gives more accurate mapping in the final Gerber files.
#units are in inches
use constant TOOL_SIZES => #add more as desired
(
#round or square pads (> 0) and drills (< 0):
.010, -.001, #tiny pads for SMD; dummy drill size (too small for practical use, but needed so StandardTool will use this entry)
.031, -.014, #used for vias
.041, -.020, #smallest non-filled plated hole
.051, -.025,
.056, -.029, #useful for IC pins
.070, -.033,
.075, -.040, #heavier leads
# .090, -.043, #NOTE: 600 dpi is not high enough resolution to reliably distinguish between .043" and .046", so choose 1 of the 2 here
.100, -.046,
.115, -.052,
.130, -.061,
.140, -.067,
.150, -.079,
.175, -.088,
.190, -.093,
.200, -.100,
.220, -.110,
.160, -.125, #useful for mounting holes
#some additional pad sizes without holes (repeat a previous hole size if you just want the pad size):
.090, -.040, #want a .090 pad option, but use dummy hole size
.065, -.040, #.065 x .065 rect pad
.035, -.040, #.035 x .065 rect pad
#traces:
.001, #too thin for real traces; use only for board outlines
.006, #minimum real trace width; mainly used for text
.008, #mainly used for mid-sized text, not traces
.010, #minimum recommended trace width for low-current signals
.012,
.015, #moderate low-voltage current
.020, #heavier trace for power, ground (even if a lighter one is adequate)
.025,
.030, #heavy-current traces; be careful with these ones!
.040,
.050,
.060,
.080,
.100,
.120,
);
#Areas larger than the values below will be filled with parallel lines:
#This cuts down on the number of aperture sizes used.
#Set to 0 to always use an aperture or drill, regardless of size.
use constant { MAX_APERTURE => max((TOOL_SIZES)) + .004, MAX_DRILL => -min((TOOL_SIZES)) + .004 }; #max aperture and drill sizes (plus a little tolerance)
#DebugPrint(sprintf("using %d standard tool sizes: %s, max aper %.3f, max drill %.3f\n", scalar((TOOL_SIZES)), join(", ", (TOOL_SIZES)), MAX_APERTURE, MAX_DRILL), 1);
#NOTE: Compare the PDF to the original CAD file to check the accuracy of the PDF rendering and parsing!
#for example, the CAD software I used generated the following circles for holes:
#CAD hole size: parsed PDF diameter: error:
# .014 .016 +.002
# .020 .02267 +.00267
# .025 .026 +.001
# .029 .03167 +.00267
# .033 .036 +.003
# .040 .04267 +.00267
#This was usually ~ .002" - .003" too big compared to the hole as displayed in the CAD software.
#To compensate for PDF rendering errors (either during CAD Print function or PDF parsing logic), adjust the values below as needed.
#units are pixels; for example, a value of 2.4 at 600 dpi = .004 inch, 2 at 600 dpi = .0033"
use constant
{
HOLE_ADJUST => -0.004 * 600, #-2.6, #holes seemed to be slightly oversized (by .002" - .004"), so shrink them a little
RNDPAD_ADJUST => -0.003 * 600, #-2, #-2.4, #round pads seemed to be slightly oversized, so shrink them a little
SQRPAD_ADJUST => +0.001 * 600, #+.5, #square pads are sometimes too small by .00067, so bump them up a little
RECTPAD_ADJUST => 0, #(pixels) rectangular pads seem to be okay? (not tested much)
TRACE_ADJUST => 0, #(pixels) traces seemed to be okay?
REDUCE_TOLERANCE => .001, #(inches) allow this much variation when reducing circles and rects
};
#Also, my CAD's Print function or the PDF print driver I used was a little off for circles, so define some additional adjustment values here:
#Values are added to X/Y coordinates; units are pixels; for example, a value of 1 at 600 dpi would be ~= .002 inch
use constant
{
CIRCLE_ADJUST_MINX => 0,
CIRCLE_ADJUST_MINY => -0.001 * 600, #-1, #circles were a little too high, so nudge them a little lower
CIRCLE_ADJUST_MAXX => +0.001 * 600, #+1, #circles were a little too far to the left, so nudge them a little to the right
CIRCLE_ADJUST_MAXY => 0,
SUBST_CIRCLE_CLIPRECT => FALSE, #generate circle and substitute for clip rects (to compensate for the way some CAD software draws circles)
WANT_CLIPRECT => TRUE, #FALSE, #AI doesn't need clip rect at all? should be on normally?
RECT_COMPLETION => FALSE, #TRUE, #fill in 4th side of rect when 3 sides found
};
#allow .012 clearance around pads for solder mask:
#This value effectively adjusts pad sizes in the TOOL_SIZES list above (only for solder mask layers).
use constant SOLDER_MARGIN => +.012; #units are inches
#line join/cap styles:
use constant
{
CAP_NONE => 0, #butt (none); line is exact length
CAP_ROUND => 1, #round cap/join; line overhangs by a semi-circle at either end
CAP_SQUARE => 2, #square cap/join; line overhangs by a half square on either end
CAP_OVERRIDE => FALSE, #cap style overrides drawing logic
};
#number of elements in each shape type:
use constant
{
RECT_SHAPELEN => 6, #x0, y0, x1, y1, count, "rect" (start, end corners)
LINE_SHAPELEN => 6, #x0, y0, x1, y1, count, "line" (line seg)
CURVE_SHAPELEN => 10, #xstart, ystart, x0, y0, x1, y1, xend, yend, count, "curve" (bezier 2 points)
CIRCLE_SHAPELEN => 5, #x, y, r, count, "circle" (center + radius)
};
#const my %SHAPELEN =
#Readonly my %SHAPELEN =>
our %SHAPELEN =
(
rect => RECT_SHAPELEN,
line => LINE_SHAPELEN,
curve => CURVE_SHAPELEN,
circle => CIRCLE_SHAPELEN,
);
#panelization:
#This will repeat the entire body the number of times indicated along the X or Y axes (files grow accordingly).
#Display elements that overhang PCB boundary can be squashed or left as-is (typically text or other silk screen markings).
#Set "overhangs" TRUE to allow overhangs, FALSE to truncate them.
#xpad and ypad allow margins to be added around outer edge of panelized PCB.
use constant PANELIZE => {'x' => 1, 'y' => 1, 'xpad' => 0, 'ypad' => 0, 'overhangs' => TRUE}; #number of times to repeat in X and Y directions
# Set this to 1 if you need TurboCAD support.
#$turboCAD = FALSE; #is this still needed as an option?
#CIRCAD pad generation uses an appropriate aperture, then moves it (stroke) "a little" - we use this to find pads and distinguish them from PCB holes.
use constant PAD_STROKE => 0.3; #0.0005 * 600; #units are pixels
#convert very short traces to pads or holes:
use constant TRACE_MINLEN => .001; #units are inches
#use constant ALWAYS_XY => TRUE; #FALSE; #force XY even if X or Y doesn't change; NOTE: needs to be TRUE for all pads to show in FlatCAM and ViewPlot
use constant REMOVE_POLARITY => FALSE; #TRUE; #set to remove subtractive (negative) polarity; NOTE: must be FALSE for ground planes
#PDF uses "points", each point = 1/72 inch
#combined with a PDF scale factor of .12, this gives 600 dpi resolution (72 / .12 = 600 dpi)
use constant INCHES_PER_POINT => 1/72; #0.0138888889; #multiply point-size by this to get inches
# The precision used when computing a bezier curve. Higher numbers are more precise but slower (and generate larger files).
#$bezierPrecision = 100;
use constant BEZIER_PRECISION => 36; #100; #use const; reduced for faster rendering (mainly used for silk screen and thermal pads)
# Ground planes and silk screen or larger copper rectangles or circles are filled line-by-line using this resolution.
use constant FILL_WIDTH => .01; #fill at most 0.01 inch at a time
# The max number of characters to read into memory
use constant MAX_BYTES => 10 * M; #bumped up to 10 MB, use const
use constant DUP_DRILL1 => TRUE; #FALSE; #kludge: ViewPlot doesn't load drill files that are too small so duplicate first tool
my $runtime = time(); #Time::HiRes::gettimeofday(); #measure my execution time
print STDERR "Loaded config settings from '${\(__FILE__)}'.\n";
1; #last value must be truthful to indicate successful load
#############################################################################################
#junk/experiment:
#use Package::Constants;
#use Exporter qw(import); #https://perldoc.perl.org/Exporter.html
#my $caller = "pdf2gerb::";
#sub cfg
#{
# my $proto = shift;
# my $class = ref($proto) || $proto;
# my $settings =
# {
# $WANT_DEBUG => 990, #10; #level of debug wanted; higher == more, lower == less, 0 == none
# };
# bless($settings, $class);
# return $settings;
#}
#use constant HELLO => "hi there2"; #"main::HELLO" => "hi there";
#use constant GOODBYE => 14; #"main::GOODBYE" => 12;
#print STDERR "read cfg file\n";
#our @EXPORT_OK = Package::Constants->list(__PACKAGE__); #https://www.perlmonks.org/?node_id=1072691; NOTE: "_OK" skips short/common names
#print STDERR scalar(@EXPORT_OK) . " consts exported:\n";
#foreach(@EXPORT_OK) { print STDERR "$_\n"; }
#my $val = main::thing("xyz");
#print STDERR "caller gave me $val\n";
#foreach my $arg (@ARGV) { print STDERR "arg $arg\n"; }
```
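The unit conversions mentioned in the config comments are easy to sanity-check. The following Python sketch is not part of Pdf2Gerb: the helper names `pixels_to_inches` and `inches_to_pixels` are made up for illustration, and it only assumes the 600 dpi resolution, the 72 points per inch, and the .12 scale factor quoted in the comments.

```python
# Sanity-check the unit conversions quoted in the config comments.
# The constants come from the comments; the helper names are hypothetical.
DPI = 600              # pixels per inch assumed by the comments
POINTS_PER_INCH = 72   # PDF points per inch
PDF_SCALE = 0.12       # PDF scale factor quoted in the comments


def pixels_to_inches(pixels, dpi=DPI):
    """Convert a pixel count to inches at the given resolution."""
    return pixels / dpi


def inches_to_pixels(inches, dpi=DPI):
    """Convert inches to pixels at the given resolution."""
    return inches * dpi


effective_dpi = POINTS_PER_INCH / PDF_SCALE
hole_adjust_px = inches_to_pixels(-0.004)  # same arithmetic as HOLE_ADJUST

print(round(effective_dpi))             # 600
print(round(pixels_to_inches(2.4), 4))  # 0.004
print(round(pixels_to_inches(2), 4))    # 0.0033
print(round(hole_adjust_px, 1))         # -2.4
```

Writing the adjustments as `inches * 600`, as the config file does, keeps the intended size in inches visible next to the pixel value.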

Author: swannman

Source Code: https://github.com/swannman/pdf2gerb

License: GPL-3.0 license

1652748716

Exploratory data analysis is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task and provides a better understanding of data set variables and the relationships between them. It can also help determine if the statistical techniques you are considering for data analysis are appropriate or not.

🔹 Topics Covered:

00:00:00 Basics of EDA with Python

01:40:10 Multiple Variate Analysis

02:30:26 Outlier Detection

03:44:48 Cricket World Cup Analysis using Exploratory Data Analysis

If we want to explain EDA in simple terms, it means trying to understand the given data much better, so that we can make some sense out of it.

We can find a more formal definition on **Wikipedia**:

In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

EDA in Python uses data visualization to draw meaningful patterns and insights. It also involves the preparation of data sets for analysis by removing irregularities in the data.

Based on the results of EDA, companies also make business decisions, which can have repercussions later.

- If EDA is not done properly then it can hamper the further steps in the machine learning model building process.
- If done well, it may improve the efficacy of everything we do next.

In this article, we’ll cover the following topics:

- Data Sourcing
- Data Cleaning
- Univariate analysis
- Bivariate analysis
- Multivariate analysis

Data Sourcing is the process of finding and loading the data into our system. Broadly there are two ways in which we can find data.

- Private Data
- Public Data

**Private Data**

As the name suggests, private data is provided by private organizations, and it comes with security and privacy concerns. This type of data is mainly used for an organization’s internal analysis.

**Public Data**

This type of data is available to everyone. We can find it on government websites, in public organizations’ portals, and so on. Anyone can access this data; we do not need any special permission or approval.

We can get public data on the following sites.

- https://data.gov
- https://data.gov.uk
- https://data.gov.in
- https://www.kaggle.com/
- https://archive.ics.uci.edu/ml/index.php
- https://github.com/awesomedata/awesome-public-datasets

The very first step of EDA is data sourcing, and we have seen where to find data and load it into our system.

The next step in the EDA process is **Data Cleaning**. After sourcing the data, it is very important to get rid of its irregularities.

Irregularities in data come in different types:

- Missing Values
- Incorrect Format
- Incorrect Headers
- Anomalies/Outliers

To perform the data cleaning we are using a sample data set, which can be found **here**.

We are using **Jupyter Notebook** for analysis.

First, let’s import the necessary libraries and store the data in our system for analysis.

```
#import the useful libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Read the data set of "Marketing Analysis" in data.
data= pd.read_csv("marketing_analysis.csv")
# Printing the data
data
```

Now, the data set looks like this,

If we look at the dataset above, the column headers are wrong because of the first two rows. The correct data starts from index number 1. So, we have to fix the first two rows.

This is called **Fixing the Rows and Columns**. Let’s skip the first two rows and load the data again.

```
#import the useful libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Read the file in data without first two rows as it is of no use.
data = pd.read_csv("marketing_analysis.csv",skiprows = 2)
#print the head of the data frame.
data.head()
```

Now, the dataset looks like this, and it makes more sense.

Dataset after fixing the rows and columns

Following are the steps to be taken while **Fixing Rows and Columns:**

- Delete Summary Rows and Columns in the Dataset.
- Delete Header and Footer Rows on every page.
- Delete Extra Rows like blank rows, page numbers, etc.
- Merge different columns if that makes the data easier to understand.
- Similarly, split one column into multiple columns based on our requirements or understanding.
- Add column names; it is very important for the dataset to have them.

Now if we look at the dataset above, the `customerid` column is of no importance to our analysis, and the `jobedu` column holds both the `job` and the `education` information.

So we’ll drop the `customerid` column, split the `jobedu` column into two new columns, `job` and `education`, and after that drop the `jobedu` column as well.

```
# Drop the customer id as it is of no use.
data.drop('customerid', axis = 1, inplace = True)
# Extract job and education into new columns from the "jobedu" column.
data['job']= data["jobedu"].apply(lambda x: x.split(",")[0])
data['education']= data["jobedu"].apply(lambda x: x.split(",")[1])
# Drop the "jobedu" column from the dataframe.
data.drop('jobedu', axis = 1, inplace = True)
# Printing the Dataset
data
```
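The `apply`/`lambda` approach above raises an error if `jobedu` contains a missing value, because `NaN` has no `split` method. As an alternative sketch (using a few made-up rows, not the article’s dataset), pandas’ vectorized `str.split` with `expand=True` does the same split and simply propagates missing rows:

```python
import pandas as pd

# Made-up rows that mimic the "job,education" layout of the jobedu column.
df = pd.DataFrame({"jobedu": ["management,tertiary", "technician,secondary", None]})

# expand=True returns one column per piece; a missing input row stays missing.
parts = df["jobedu"].str.split(",", n=1, expand=True)
df["job"] = parts[0]
df["education"] = parts[1]
df = df.drop(columns="jobedu")

print(df)
```

Here `n=1` limits the split to the first comma, so an education value that itself contained a comma would stay in one piece.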

Now, the dataset looks like this,

Dataset after dropping the `customerid` and `jobedu` columns and adding the `job` and `education` columns

**Missing Values**

If the dataset contains missing values, we need to handle them before doing any statistical analysis.

There are mainly three types of missing values.

- MCAR (Missing completely at random): The fact that these values are missing does not depend on any other features.
- MAR (Missing at random): The fact that these values are missing may depend on some other features.
- MNAR (Missing not at random): There is a specific reason why these values are missing.

Let’s see which columns have missing values in the dataset.

```
# Checking the missing values
data.isnull().sum()
```

The output will be,

As we can see, three columns contain missing values. Let’s see how to handle them. We can handle missing values by dropping the affected records or by imputing the values.

**Drop the missing Values**

Let’s handle missing values in the `age` column.

```
# Dropping the records with age missing in data dataframe.
data = data[~data.age.isnull()].copy()
# Checking the missing values in the dataset.
data.isnull().sum()
```

Let’s check the missing values in the dataset now.

Next, let’s impute the missing values in the `month` column.

Since the `month` column is of object type, let’s calculate the mode of that column and use it to fill the missing values.

```
# Find the mode of month in data
month_mode = data.month.mode()[0]
# Fill the missing values with mode value of month in data.
data['month'] = data['month'].fillna(month_mode)
# Let's see the null values in the month column.
data.month.isnull().sum()
```

Now the output is,

```
# Mode of month is
'may, 2017'
# Null values in month column after imputing with mode
0
```

Next, let’s handle the missing values in the **Response** column. Since Response is our target column, imputing values into it would affect our analysis. So, it is better to drop the records where Response is missing.

```
#drop the records with response missing in data.
data = data[~data.response.isnull()].copy()
# Calculate the missing values in each column of data frame
data.isnull().sum()
```

Let’s check whether the missing values in the dataset have been handled or not,

All the missing values have been handled

We can also leave the missing values as **‘NaN’**. pandas skips `NaN` by default in most statistical computations, so they won’t affect the outcome.
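A small sketch with made-up ages shows how pandas treats `NaN` in a calculation: the missing entry is left out of both the mean and the count.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing entry.
ages = pd.Series([25, np.nan, 35, 40])

print(ages.mean())        # mean of 25, 35 and 40 only
print(ages.count())       # 3 non-missing values
print(ages.isna().sum())  # 1 missing value
```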

**Handling Outliers**

We have seen how to fix missing values, now let’s see how to handle outliers in the dataset.

Outliers are the values that are far beyond the next nearest data points.

There are two types of outliers:

- **Univariate outliers:** Univariate outliers are the data points whose values lie beyond the range of expected values based on one variable.
- **Multivariate outliers:** While plotting data, some values of one variable may not lie beyond the expected range, but when you plot the data with some other variable, these values may lie far from the expected value.

So, after understanding the causes of these outliers, we can handle them by dropping those records, by imputing the values, or by leaving them as is, whichever makes more sense.
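The article does not prescribe a detection rule, so the sketch below uses the common 1.5 × IQR convention as an assumption, on made-up balance values. It flags univariate outliers and then shows two of the options mentioned above: dropping them or capping them.

```python
import pandas as pd

# Made-up balances with one obviously extreme value.
balance = pd.Series([120, 150, 180, 200, 210, 230, 250, 5000])

# 1.5 * IQR fences (a common convention, assumed here).
q1, q3 = balance.quantile(0.25), balance.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (balance < lower) | (balance > upper)
print(balance[is_outlier].tolist())  # [5000]

# Option 1: drop the outlying records.
dropped = balance[~is_outlier]
# Option 2: cap (impute) them at the fences.
capped = balance.clip(lower=lower, upper=upper)
```

Whether to drop, cap, or keep a flagged value still depends on why it is there, which is a judgment the rule itself cannot make.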

**Standardizing Values**

To analyze a set of values, we have to make sure that all the values in the same column are on the same scale. For example, if the data contains the top speeds of different companies’ cars, then the whole column should be either in meters/sec or in miles/sec.
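As a minimal sketch of standardizing a column, assume a made-up table in which top speeds were recorded in mixed units together with a `unit` column (both hypothetical). Mapping each unit to a conversion factor brings everything to meters per second:

```python
import pandas as pd

# Made-up top speeds recorded in mixed units.
cars = pd.DataFrame({
    "model": ["A", "B", "C"],
    "top_speed": [180.0, 120.0, 50.0],
    "unit": ["km/h", "mph", "m/s"],
})

# Conversion factors to meters per second.
TO_M_PER_S = {"km/h": 1000 / 3600, "mph": 1609.344 / 3600, "m/s": 1.0}

# Multiply each value by the factor that matches its unit.
cars["top_speed_mps"] = cars["top_speed"] * cars["unit"].map(TO_M_PER_S)
print(cars[["model", "top_speed_mps"]].round(2))
```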

Now that we are clear on how to source and clean the data, let’s see how we can analyze it.

If we analyze data over a single variable/column from a dataset, it is known as Univariate Analysis.

**Categorical Unordered Univariate Analysis:**

An unordered variable is a categorical variable that has no defined order. If we take our data as an example, the **job** column in the dataset is divided into many sub-categories like technician, blue-collar, services, management, etc. There is no weight or measure given to any value in the ‘**job**’ column.

Now, let’s analyze the job category by using plots. Since job is a categorical variable, we will use a bar plot.

```
# Let's calculate the percentage of each job status category.
data.job.value_counts(normalize=True)
#plot the bar graph of percentage job categories
data.job.value_counts(normalize=True).plot.barh()
plt.show()
```

The output looks like this,

From the above bar plot, we can infer that the dataset contains more blue-collar workers than any other category.

**Categorical Ordered Univariate Analysis:**

Ordered variables are variables that have a natural order. Some examples of categorical ordered variables from our dataset are:

- Month: Jan, Feb, March……
- Education: Primary, Secondary,……

Now, let’s analyze the Education variable from the dataset. Since we’ve already seen a bar plot, let’s see what a pie chart looks like.

```
#calculate the percentage of each education category.
data.education.value_counts(normalize=True)
#plot the pie chart of education categories
data.education.value_counts(normalize=True).plot.pie()
plt.show()
```

The output will be,

From the above analysis, we can infer that most people in the dataset have secondary education, followed by tertiary and then primary. Also, for a very small percentage of them the education level is unknown.
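The natural order of an ordered variable can be made explicit in pandas. This is a minimal sketch on made-up education values; the ordering of the levels (with `unknown` placed first) is an assumption for illustration.

```python
import pandas as pd

# Made-up education values.
education = pd.Series(["secondary", "primary", "tertiary", "secondary", "unknown"])

# Assumed ordering of the levels, lowest first.
levels = ["unknown", "primary", "secondary", "tertiary"]
education = education.astype(pd.CategoricalDtype(categories=levels, ordered=True))

# sort_index() now follows the declared order instead of alphabetical order.
shares = education.value_counts(normalize=True).sort_index()
print(shares)
print(education.max())  # highest level present in the data
```

With the order declared, comparisons such as `education >= "secondary"` also become meaningful.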

This is how we analyze univariate categorical variables. If the column or variable is numerical, we analyze it by calculating its mean, median, standard deviation, etc. We can get those values by using the `describe` function.

`data.salary.describe()`

The output will be,

If we analyze data by taking two variables/columns into consideration from a dataset, it is known as Bivariate Analysis.

**a) Numeric-Numeric Analysis:**

Analyzing two numeric variables from a dataset is known as numeric-numeric analysis. We can do it in three different ways.

- Scatter Plot
- Pair Plot
- Correlation Matrix

**Scatter Plot**

Let’s take three columns, `balance`, `age`, and `salary`, from our dataset and see what we can infer from scatter plots of `salary` vs. `balance` and `age` vs. `balance`.

```
#plot the scatter plot of balance and salary variable in data
plt.scatter(data.salary,data.balance)
plt.show()
#plot the scatter plot of balance and age variable in data
data.plot.scatter(x="age",y="balance")
plt.show()
```

Now, the scatter plots look like this,

**Pair Plot**

Now, let’s plot Pair Plots for the three columns we used in plotting Scatter plots. We’ll use the seaborn library for plotting Pair Plots.

```
#plot the pair plot of salary, balance and age in data dataframe.
sns.pairplot(data = data, vars=['salary','balance','age'])
plt.show()
```

The Pair Plot looks like this,

**Correlation Matrix**

Since scatter plots and pair plots show only two variables at a time, one on the x-axis and one on the y-axis, it is difficult to see the relationship between three numerical variables in a single graph. In those cases, we’ll use the correlation matrix.

```
# Creating a correlation matrix using age, salary, balance as rows and columns
data[['age','salary','balance']].corr()
#plot the correlation matrix of salary, balance and age in data dataframe.
sns.heatmap(data[['age','salary','balance']].corr(), annot=True, cmap = 'Reds')
plt.show()
```

First, we created a correlation matrix using age, salary, and balance. After that, we plotted it as a heatmap using the seaborn library.

**b) Numeric - Categorical Analysis**

Analyzing one numeric variable and one categorical variable from a dataset is known as numeric-categorical analysis. We analyze them mainly using mean, median, and box plots.

Let’s take the `salary` and `response` columns from our dataset.

First, check the mean value using `groupby`:

```
#groupby the response to find the mean of the salary with response no & yes separately.
data.groupby('response')['salary'].mean()
```

The output will be,

Based on the mean salary, there is not much of a difference between the yes and no responses.

Let’s calculate the median,

```
#groupby the response to find the median of the salary with response no & yes separately.
data.groupby('response')['salary'].median()
```

The output will be,

Judging by both the mean and the median, the response seems to stay the same irrespective of the person’s salary. But is that truly the case? Let’s draw a box plot and check.

```
#plot the box plot of salary for yes & no responses.
sns.boxplot(x=data.response, y=data.salary)
plt.show()
```

The box plot looks like this,

As we can see, the box plot paints a very different picture compared to the mean and median. The IQR for customers who gave a positive response is on the higher salary side.
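The edges of each box can also be read off numerically. The sketch below uses made-up salaries, not the article’s dataset, and computes the quartiles and the IQR per response group:

```python
import pandas as pd

# Made-up salaries and responses, only to show the calculation.
df = pd.DataFrame({
    "response": ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    "salary": [20000, 50000, 60000, 100000, 50000, 60000, 70000, 100000],
})

# 25th, 50th and 75th percentiles per group (the lines of each box).
quartiles = df.groupby("response")["salary"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```

Comparing the `0.25` and `0.75` columns across groups gives the same information as comparing the positions of the boxes in the plot.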

This is how we analyze numeric-categorical variables: we use the mean, the median, and box plots to draw conclusions.

**c) Categorical — Categorical Analysis**

Since our target variable/column is the response, we’ll see how the different categories like Education, Marital Status, etc., are associated with the Response column. So instead of ‘Yes’ and ‘No’ we will convert the values into ‘1’ and ‘0’. The mean of that column within a category is then the “Response Rate” of that category.

```
#create response_rate of numerical data type where response "yes"= 1, "no"= 0
data['response_rate'] = np.where(data.response=='yes',1,0)
data.response_rate.value_counts()
```

The output looks like this,

Let’s see how the response rate varies for different categories in marital status.

```
#plot the bar graph of marital status with average value of response_rate
data.groupby('marital')['response_rate'].mean().plot.bar()
plt.show()
```

The graph looks like this,

From the above graph, we can infer that positive responses are more common among single members of the dataset. Similarly, we can plot the graphs for Loan vs Response rate, Housing Loans vs Response rate, etc.
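The same calculation can be repeated for several categorical columns in a loop. This sketch runs on made-up records; the column names `loan` and `housing` are assumptions based on the comparisons mentioned above, not names confirmed by the dataset.

```python
import numpy as np
import pandas as pd

# Made-up records; "loan" and "housing" are assumed column names.
df = pd.DataFrame({
    "loan": ["no", "yes", "no", "no", "yes", "no"],
    "housing": ["yes", "yes", "no", "no", "yes", "no"],
    "response": ["yes", "no", "no", "yes", "no", "yes"],
})
df["response_rate"] = np.where(df["response"] == "yes", 1, 0)

# Mean of the 0/1 flag per category = share of positive responses.
rates = {col: df.groupby(col)["response_rate"].mean() for col in ["loan", "housing"]}
for col, rate in rates.items():
    print(rate, end="\n\n")
```

Each resulting Series can be passed to `.plot.bar()` in the same way as the marital status example.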

If we analyze data by taking more than two variables/columns into consideration from a dataset, it is known as Multivariate Analysis.

Let’s see how ‘Education’, ‘Marital’, and ‘Response_rate’ vary with each other.

First, we’ll create a pivot table with the three columns and after that, we’ll create a heatmap.

```
result = pd.pivot_table(data=data, index='education', columns='marital',values='response_rate')
print(result)
#create heat map of education vs marital vs response_rate
sns.heatmap(result, annot=True, cmap = 'RdYlGn', center=0.117)
plt.show()
```

The pivot table and heatmap look like this,

Based on the heatmap, we can infer that married people with primary education are less likely to respond positively to the survey, and single people with tertiary education are most likely to respond positively.

Similarly, we can plot the graphs for Job vs marital vs response, Education vs poutcome vs response, etc.
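It may help to see what the pivot table actually computes. By default, `pd.pivot_table` aggregates with the mean, so each cell is the response rate of one education/marital combination. A small sketch on made-up records compares it with the equivalent `groupby`:

```python
import numpy as np
import pandas as pd

# Made-up records, only to show what the pivot table computes.
df = pd.DataFrame({
    "education": ["primary", "primary", "tertiary", "tertiary", "tertiary"],
    "marital": ["married", "single", "married", "single", "single"],
    "response_rate": [0, 1, 0, 1, 0],
})

# Default aggfunc is the mean of "values" per (index, columns) cell.
pivot = pd.pivot_table(df, index="education", columns="marital", values="response_rate")

# The same numbers, built by hand with groupby + unstack.
by_hand = df.groupby(["education", "marital"])["response_rate"].mean().unstack()

print(pivot)
print(np.allclose(pivot.to_numpy(), by_hand.to_numpy()))
```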

**Conclusion**

This is how we do Exploratory Data Analysis. EDA helps us look beyond the data: the more we explore the data, the more insights we draw from it. As a data analyst, almost 80% of our time will be spent understanding data and solving various business problems through EDA.

**Thank you for reading** and **Happy Coding!!!**

#dataanalysis #python

1561523460

This Matplotlib cheat sheet introduces you to the basics that you need to plot your data with Python and includes code samples.

Data visualization and storytelling with your data are essential skills that every data scientist needs to communicate insights gained from analyses effectively to any audience out there.

For most beginners, the first package that they use to get in touch with data visualization and storytelling is, naturally, Matplotlib: it is a Python 2D plotting library that enables users to make publication-quality figures. But, what might be even more convincing is the fact that other packages, such as Pandas, intend to build more plotting integration with Matplotlib as time goes on.

However, what might slow down beginners is the fact that this package is pretty extensive. There is so much that you can do with it and it might be hard to still keep a structure when you're learning how to work with Matplotlib.

DataCamp has created a Matplotlib cheat sheet for those who might already know how to use the package to their advantage to make beautiful plots in Python, but who still want to keep a one-page reference handy. Of course, for those who don't know how to work with Matplotlib, this might be the extra push they need to be convinced and to finally get started with data visualization in Python.

You'll see that this cheat sheet presents you with the six basic steps that you can go through to make beautiful plots.

With this handy reference, you'll familiarize yourself in no time with the basics of Matplotlib: you'll learn how you can prepare your data, create a new plot, use some basic plotting routines to your advantage, add customizations to your plots, and save, show and close the plots that you make.
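Put together, the steps can look like the sketch below. It is only one possible arrangement, not the cheat sheet's own listing: the `Agg` backend and the in-memory buffer are assumptions made so the example runs without a display or a file on disk.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# 1. Prepare the data
x = np.linspace(0, 10, 100)
y = np.cos(x)

# 2. Create the plot
fig, ax = plt.subplots()

# 3. Use a plotting routine
ax.plot(x, y, color="tab:blue", linewidth=2, label="cos(x)")

# 4. Customize the plot
ax.set(title="Cosine", xlabel="x", ylabel="y")
ax.legend()

# 5. Save the plot (to a buffer here; a file name such as "foo.png" also works)
buffer = io.BytesIO()
fig.savefig(buffer, format="png")

# 6. Show and close; plt.show() would open a window on an interactive backend
plt.close(fig)
```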

What might have looked difficult before will definitely be clearer once you start using this cheat sheet! Use it in combination with the **Matplotlib Gallery** and the **documentation**.

**Matplotlib**

Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.

```
>>> import numpy as np
>>> x = np.linspace(0, 10, 100)
>>> y = np.cos(x)
>>> z = np.sin(x)
```

```
>>> data = 2 * np.random.random((10, 10))
>>> data2 = 3 * np.random.random((10, 10))
>>> Y, X = np.mgrid[-3:3:100j, -3:3:100j]
>>> U = -1 - X**2 + Y
>>> V = 1 + X - Y**2
>>> from matplotlib.cbook import get_sample_data
>>> img = np.load(get_sample_data('axes_grid/bivariate_normal.npy'))
```
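As a quick check on the data preparation step, the sketch below rebuilds the 1D and 2D arrays and prints their shapes. The seeded generator `rng` is an assumption added for reproducibility; the original snippet uses `np.random.random` without a seed.

```python
import numpy as np

# 1D data: 100 evenly spaced samples on [0, 10]
x = np.linspace(0, 10, 100)
y = np.cos(x)
z = np.sin(x)

# 2D grid: a complex step (100j) tells np.mgrid to produce 100 points
# per axis with both endpoints included
Y, X = np.mgrid[-3:3:100j, -3:3:100j]

# Seeded generator (an assumption, not in the original) so runs are repeatable
rng = np.random.default_rng(0)
data = 2 * rng.random((10, 10))

print(x.shape, X.shape, data.shape)
```

Note that `Y` varies along the rows and `X` along the columns, which is why the cheat sheet unpacks the grid as `Y, X`.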

`>>> import matplotlib.pyplot as plt`

```
>>> fig = plt.figure()
>>> fig2 = plt.figure(figsize=plt.figaspect(2.0))
```

```
>>> fig.add_axes([0.1, 0.1, 0.8, 0.8]) #Add axes at [left, bottom, width, height]
>>> ax1 = fig.add_subplot(221) #row-col-num
>>> ax3 = fig.add_subplot(212)
>>> fig3, axes = plt.subplots(nrows=2,ncols=2)
>>> fig4, axes2 = plt.subplots(ncols=3)
```
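The difference between `add_subplot` and `plt.subplots` is easiest to see by inspecting what each returns. This is a minimal sketch; the `Agg` backend is an assumption made so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(221)  # 2x2 grid, position 1 (top left)
ax3 = fig.add_subplot(212)  # 2x1 grid, position 2 (bottom half)

# plt.subplots creates the figure and all axes in one call
fig3, axes = plt.subplots(nrows=2, ncols=2)  # axes is a 2x2 NumPy array
fig4, axes2 = plt.subplots(ncols=3)          # axes2 is a 1D array of length 3

print(len(fig.axes), axes.shape, axes2.shape)
```

Indexing follows the array shape: `axes[0, 1]` is the top-right subplot of `fig3`, while `axes2[2]` is the last subplot of `fig4`.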

```
>>> plt.savefig('foo.png') #Save figures
>>> plt.savefig('foo.png', transparent=True) #Save transparent figures
```

`>>> plt.show()`

```
>>> fig, ax = plt.subplots()
>>> lines = ax.plot(x,y) #Draw points with lines or markers connecting them
>>> ax.scatter(x,y) #Draw unconnected points, scaled or colored
>>> axes[0,0].bar([1,2,3],[3,4,5]) #Plot vertical rectangles (constant width)
>>> axes[1,0].barh([0.5,1,2.5],[0,1,2]) #Plot horizontal rectangles (constant height)
>>> axes[1,1].axhline(0.45) #Draw a horizontal line across axes
>>> axes[0,1].axvline(0.65) #Draw a vertical line across axes
>>> ax.fill(x,y,color='blue') #Draw filled polygons
>>> ax.fill_between(x,y,color='yellow') #Fill between y values and 0
```
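The 1D routines above can be combined into one runnable script. The sketch below is self-contained and assumes the `Agg` backend; the marker size `s=5` and the `alpha` value are illustrative choices, not part of the cheat sheet.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
y = np.cos(x)

fig, ax = plt.subplots()
lines = ax.plot(x, y)           # returns a list of Line2D objects
points = ax.scatter(x, y, s=5)  # unconnected markers on top of the line
ax.fill_between(x, y, color='yellow', alpha=0.3)  # shade between y and 0

fig3, axes = plt.subplots(nrows=2, ncols=2)
bars = axes[0, 0].bar([1, 2, 3], [3, 4, 5])       # vertical bars
axes[1, 0].barh([0.5, 1, 2.5], [0, 1, 2])         # horizontal bars
axes[1, 1].axhline(0.45)                          # horizontal reference line
axes[0, 1].axvline(0.65)                          # vertical reference line
```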

```
>>> fig, ax = plt.subplots()
>>> im = ax.imshow(img, #Colormapped or RGB arrays
                   cmap='gist_earth',
                   interpolation='nearest',
                   vmin=-2,
                   vmax=2)
>>> axes2[0].pcolor(data2) #Pseudocolor plot of 2D array
>>> axes2[0].pcolormesh(data) #Pseudocolor plot of 2D array
>>> CS = plt.contour(Y,X,U) #Plot contours
>>> axes2[2].contourf(data) #Plot filled contours
>>> ax.clabel(CS) #Label a contour plot
```
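To try the 2D routines without downloading sample data, the sketch below draws an image, contours, and contour labels from a computed surface. The surface `U` and the `extent`/`origin` settings are assumptions chosen for illustration, and the `Agg` backend is assumed so the code runs headless.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np

Y, X = np.mgrid[-3:3:100j, -3:3:100j]
U = -1 - X**2 + Y  # illustrative surface (assumption)

fig, ax = plt.subplots()
im = ax.imshow(U,
               cmap='gist_earth',
               interpolation='nearest',
               origin='lower',          # put row 0 at the bottom, like a plot
               extent=(-3, 3, -3, 3))   # map array indices to data coordinates
CS = ax.contour(X, Y, U, colors='k')    # contour lines on the same axes
ax.clabel(CS)                           # label each contour with its level
fig.colorbar(im, orientation='horizontal')
```

Calling `contour` and `clabel` on the same `ax` keeps the labels attached to the axes that actually holds the contour lines.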

```
>>> axes[0,1].arrow(0,0,0.5,0.5) #Add an arrow to the axes
>>> axes[1,1].quiver(y,z) #Plot a 2D field of arrows
>>> axes[0,1].streamplot(X,Y,U,V) #Draw streamlines of a 2D vector field
```
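`quiver` and `streamplot` show the same vector field in two ways: arrows at grid points versus continuous flow lines. The sketch below is a minimal side-by-side comparison; the field definition and the coarse 20-point grid are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np

# Coarse, evenly spaced grid (streamplot requires even spacing)
Y, X = np.mgrid[-3:3:20j, -3:3:20j]
U = -1 - X**2 + Y  # x-component of the field (illustrative)
V = 1 + X - Y**2   # y-component of the field (illustrative)

fig, (ax_q, ax_s) = plt.subplots(ncols=2, figsize=(8, 4))
q = ax_q.quiver(X, Y, U, V)       # one arrow per grid point
s = ax_s.streamplot(X, Y, U, V)   # streamlines following the flow
ax_q.set_title('quiver')
ax_s.set_title('streamplot')
```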

```
>>> ax1.hist(y) #Plot a histogram
>>> ax3.boxplot(y) #Make a box and whisker plot
>>> ax3.violinplot(z) #Make a violin plot
```
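The distribution plots return useful objects as well as drawing. A minimal sketch, assuming normally distributed sample data from a seeded generator (both are illustrative choices) and the `Agg` backend:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)   # seeded for reproducibility (assumption)
samples = rng.normal(size=500)

fig, (ax_h, ax_b, ax_v) = plt.subplots(ncols=3, figsize=(9, 3))
counts, edges, patches = ax_h.hist(samples, bins=20)  # bin counts and edges
box = ax_b.boxplot(samples)        # dict of artists: 'boxes', 'medians', ...
violin = ax_v.violinplot(samples)  # dict of artists: 'bodies', 'cbars', ...
```

`hist` returns one more bin edge than there are bins, so `edges` has 21 entries here.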

[Figure: example plot with labeled y-axis and x-axis]

The basic steps to creating plots with matplotlib are:

1. Prepare Data
2. Create Plot
3. Plot
4. Customize Plot
5. Save Plot
6. Show Plot

```
>>> import matplotlib.pyplot as plt
>>> x = [1,2,3,4] #Step 1
>>> y = [10,20,25,30]
>>> fig = plt.figure() #Step 2
>>> ax = fig.add_subplot(111) #Step 2
>>> ax.plot(x, y, color='lightblue', linewidth=3) #Step 3, 4
>>> ax.scatter([2,4,6],
               [5,15,25],
               color='darkgreen',
               marker='^')
>>> ax.set_xlim(1, 6.5)
>>> plt.savefig('foo.png') #Step 5
>>> plt.show() #Step 6
```
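The same six-step workflow can be run as a plain script. The sketch below makes two assumptions: it uses the `Agg` backend and writes to a temporary directory (the `out_path` name is hypothetical) instead of the working directory, so it runs without a display and leaves no stray file behind. It also omits `plt.show()`, which has no effect on a non-interactive backend.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

# Step 1: prepare data
x = [1, 2, 3, 4]
y = [10, 20, 25, 30]

# Step 2: create the figure and axes
fig = plt.figure()
ax = fig.add_subplot(111)

# Steps 3 and 4: plot and customize
ax.plot(x, y, color='lightblue', linewidth=3)
ax.scatter([2, 4, 6], [5, 15, 25], color='darkgreen', marker='^')
ax.set_xlim(1, 6.5)

# Step 5: save (hypothetical temporary location)
out_path = os.path.join(tempfile.mkdtemp(), 'foo.png')
fig.savefig(out_path)

# Step 6 would be plt.show() in an interactive session; here we just clean up
plt.close(fig)
```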

```
>>> plt.cla() #Clear an axis
>>> plt.clf() #Clear the entire figure
>>> plt.close() #Close a window
```
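The three calls differ in scope: clearing an axes, clearing a figure, and closing the figure altogether. The sketch below uses the object-oriented equivalents (`ax.cla()`, `fig.clf()`, `plt.close(fig)`) and records the state after each step; the variable names are illustrative and the `Agg` backend is assumed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3])
num = fig.number                  # pyplot's identifier for this figure

lines_before = len(ax.lines)      # one line drawn
ax.cla()                          # clear the axes: artists removed, axes kept
lines_after_cla = len(ax.lines)

fig.clf()                         # clear the figure: all axes removed
axes_after_clf = len(fig.axes)

plt.close(fig)                    # release the figure from pyplot
still_open = plt.fignum_exists(num)
```

In long-running scripts that create many figures, calling `plt.close` matters because pyplot keeps every open figure in memory.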

```
>>> plt.plot(x, x, x, x**2, x, x**3)
>>> ax.plot(x, y, alpha=0.4)
>>> ax.plot(x, y, c='k')
>>> fig.colorbar(im, orientation='horizontal')
>>> im = ax.imshow(img,
                   cmap='seismic')
```

```
>>> fig, ax = plt.subplots()
>>> ax.scatter(x,y,marker= ".")
>>> ax.plot(x,y,marker= "o")
```

```
>>> plt.plot(x,y,linewidth=4.0)
>>> plt.plot(x,y,ls= 'solid')
>>> plt.plot(x,y,ls= '--')
>>> plt.plot(x,y,'--',x**2,y**2,'-.')
>>> plt.setp(lines,color= 'r',linewidth=4.0)
```

```
>>> ax.text(1,
            -2.1,
            'Example Graph',
            style='italic')
>>> ax.annotate("Sine",
                xy=(8, 0),
                xycoords='data',
                xytext=(10.5, 0),
                textcoords='data',
                arrowprops=dict(arrowstyle="->",
                                connectionstyle="arc3"))
```

`>>> plt.title(r'$\sigma_i=15$', fontsize=20)`
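Text, annotations, and mathtext can be exercised together on a sine curve. In the sketch below the coordinates are illustrative assumptions chosen to stay inside the plotted range, and the `Agg` backend is assumed. The raw-string prefix `r'...'` keeps the backslash in `\sigma` from being treated as a Python escape.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))

# Free-standing text placed in data coordinates
txt = ax.text(1, -0.8, 'Example Graph', style='italic')

# Annotation: arrow from the label position (xytext) to the point (xy)
ann = ax.annotate('Sine',
                  xy=(8, np.sin(8)),
                  xycoords='data',
                  xytext=(8.5, 0.2),
                  textcoords='data',
                  arrowprops=dict(arrowstyle='->',
                                  connectionstyle='arc3'))

# Mathtext title: rendered by Matplotlib itself, no LaTeX install needed
title = ax.set_title(r'$\sigma_i=15$', fontsize=20)
fig.canvas.draw()  # force rendering so mathtext errors would surface here
```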

Limits & Autoscaling

```
>>> ax.margins(x=0.0,y=0.1) #Add padding to a plot
>>> ax.axis('equal') #Set the aspect ratio of the plot to 1
>>> ax.set(xlim=[0,10.5],ylim=[-1.5,1.5]) #Set limits for x-and y-axis
>>> ax.set_xlim(0,10.5) #Set limits for x-axis
```
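Margins and explicit limits interact: margins pad the autoscaled range, while `set_xlim`/`set_ylim` override autoscaling entirely. A minimal sketch, assuming the cosine data from the data preparation step and the `Agg` backend:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
y = np.cos(x)

fig, ax = plt.subplots()
ax.plot(x, y)

# Autoscaling with margins: no padding in x, 10% of the data range in y
ax.margins(x=0.0, y=0.1)
auto_xlim = ax.get_xlim()

# Explicit limits replace the autoscaled ones
ax.set(xlim=[0, 10.5], ylim=[-1.5, 1.5])
manual_xlim = ax.get_xlim()
manual_ylim = ax.get_ylim()

print(auto_xlim, manual_xlim, manual_ylim)
```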

Legends

```
>>> ax.set(title='An Example Axes', #Set a title and x-and y-axis labels
           ylabel='Y-Axis',
           xlabel='X-Axis')
>>> ax.legend(loc= 'best') #No overlapping plot elements
```

Ticks

```
>>> ax.xaxis.set(ticks=range(1,5), #Manually set x-ticks
                 ticklabels=[3,100,12,"foo"])
>>> ax.tick_params(axis='y', #Make y-ticks longer and go in and out
                   direction='inout',
                   length=10)
```
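An equivalent way to set tick positions and labels is through the axes-level helpers `set_xticks` and `set_xticklabels`. This is a sketch of that alternative, not the cheat sheet's own form; the data values are illustrative and the `Agg` backend is assumed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [10, 20, 25, 30])

# Fix the tick positions first, then attach one label per position
ax.set_xticks([1, 2, 3, 4])
ax.set_xticklabels(['3', '100', '12', 'foo'])

# Make y-ticks longer and let them cross the axis line
ax.tick_params(axis='y', direction='inout', length=10)

fig.canvas.draw()  # render once so tick label text is populated
labels = [t.get_text() for t in ax.get_xticklabels()]
print(labels)
```

Setting the positions before the labels keeps the two lists aligned; labels without fixed positions can drift when the view changes.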

Subplot Spacing

```
>>> fig3.subplots_adjust(wspace=0.5, #Adjust the spacing between subplots
                         hspace=0.3,
                         left=0.125,
                         right=0.9,
                         top=0.9,
                         bottom=0.1)
>>> fig.tight_layout() #Fit subplot(s) into the figure area
```
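The values passed to `subplots_adjust` are fractions: `left`, `right`, `top`, and `bottom` are fractions of the figure size, while `wspace` and `hspace` are fractions of the average axes width and height. The sketch below applies the settings and reads them back from `fig3.subplotpars`; the `Agg` backend is assumed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

fig3, axes = plt.subplots(nrows=2, ncols=2)
fig3.subplots_adjust(wspace=0.5,   # horizontal gap between subplots
                     hspace=0.3,   # vertical gap between subplots
                     left=0.125,
                     right=0.9,
                     top=0.9,
                     bottom=0.1)

params = fig3.subplotpars  # current layout parameters of the figure
print(params.wspace, params.hspace, params.left, params.right)
```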

Axis Spines

```
>>> ax1.spines['top'].set_visible(False) #Make the top axis line for a plot invisible
>>> ax1.spines['bottom'].set_position(('outward',10)) #Move the bottom axis line outward
```
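Spine settings can be verified by reading them back. A minimal sketch, assuming the `Agg` backend; hiding the right spine as well is an illustrative addition, not part of the cheat sheet.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

fig, ax1 = plt.subplots()
ax1.plot([0, 1, 2], [0, 1, 4])

# Hide the top and right border lines
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)  # illustrative addition

# Move the bottom axis line 10 points away from the plot area
ax1.spines['bottom'].set_position(('outward', 10))

print(ax1.spines['top'].get_visible(), ax1.spines['bottom'].get_position())
```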

Original article source at https://www.datacamp.com

**#matplotlib #cheatsheet #python**

1620466520

If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.

In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points form our checklist for what we perceive to be an anticipatory analytics ecosystem.

#big data #data science #big data analytics #data analysis #data architecture #data transformation #data platform #data strategy #cloud data platform #data acquisition