Karim Aya


Cleaning, Analyzing, and Visualizing Survey Data in Python

A tutorial using pandas, matplotlib, and seaborn to produce digestible insights from dirty data

If you work in data at a D2C startup, there’s a good chance you will be asked to look at survey data at least once. And since SurveyMonkey is one of the most popular survey platforms out there, there’s a good chance it’ll be SurveyMonkey data.

The way SurveyMonkey exports data is not necessarily ready for analysis right out of the box, but it’s pretty close. Here I’ll demonstrate a few examples of questions you might want to ask of your survey data, and how to extract those answers quickly. We’ll even write a few functions to make our lives easier when plotting future questions.

We’ll be using pandas, matplotlib, and seaborn to make sense of our data. I used Mockaroo to generate this data; specifically, for the survey question fields, I used “Custom List” and entered the appropriate options. You could achieve the same effect with random.choice from the random module, but I found it easier to let Mockaroo create the whole thing for me. I then tweaked the data in Excel so that it mirrored the structure of a SurveyMonkey export.
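If you would rather skip Mockaroo, here is a minimal sketch of the random.choice route. The field names ("id", "age", "interest") and the age range are assumptions made up for illustration; only the five interest labels come from the survey used later in this post.

```python
import random

# Answer options for the "interest" style questions used later in the post
INTEREST_LEVELS = [
    "Not Interested at all", "Somewhat uninterested",
    "Neutral", "Somewhat interested", "Very interested",
]

def make_mock_responses(n_rows, seed=0):
    """Return a list of dicts, one per fake respondent (hypothetical schema)."""
    rng = random.Random(seed)  # seeded so the fake data is reproducible
    rows = []
    for respondent_id in range(1, n_rows + 1):
        rows.append({
            "id": respondent_id,
            "age": rng.randint(18, 80),            # assumed age range
            "interest": rng.choice(INTEREST_LEVELS),
        })
    return rows

mock_rows = make_mock_responses(5)
```

A list of dicts like this can be handed straight to pd.DataFrame(mock_rows) if you want to keep experimenting in pandas.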

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import warnings
survey_data = pd.read_csv('MOCK_DATA.csv')

Your first reaction to this might be “Ugh. It’s horrible.” I mean, the column names didn’t read in properly, there are a ton of NaNs, instead of numerical representations like 0/1 or 1/2/3/4/5 we have the actual text answers in each cell…And should we actually be reading this in with a MultiIndex?

But don’t worry, it’s not as bad as you might think. And we’re going to ignore MultiIndexes in this post. (Nobody really likes working with them anyway.) The team needs those insights ASAP — so we’ll come up with some hacky solutions.

First order of business: we’ve been asked to find how the answers to these questions vary by age group. But age is just an age; we don’t have a column for age groups! Well, luckily for us, we can pretty easily define a function to create one.

def age_group(age):
    """Creates an age bucket for each participant using the age variable.
    Meant to be used on a DataFrame with .apply()."""

    # Convert to an int, in case the data is read in as an "object" (aka string)
    age = int(age)

    if age < 30:
        bucket = '<30'
    # Age 30 to 39 ('range' excludes upper bound)
    elif age in range(30, 40):
        bucket = '30-39'
    elif age in range(40, 50):
        bucket = '40-49'
    elif age in range(50, 60):
        bucket = '50-59'
    else:
        # Age 60 and over
        bucket = '60+'

    return bucket
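As an aside, pandas can do the same bucketing without a hand-written function. The sketch below uses pd.cut on a toy Series of ages (the values are made up) and is only one possible alternative, not what the rest of this post relies on.

```python
import numpy as np
import pandas as pd

# Toy ages stored as strings, the way they would arrive from the CSV
ages = pd.Series(["25", "30", "39", "40", "59", "60", "71"])

bins = [0, 30, 40, 50, 60, np.inf]
labels = ["<30", "30-39", "40-49", "50-59", "60+"]

# right=False makes each bin include its lower edge and exclude its upper edge,
# so 30 lands in '30-39' and 60 lands in '60+'
age_groups = pd.cut(ages.astype(int), bins=bins, labels=labels, right=False)
```

The result is an ordered categorical, which has the side benefit that the groups sort in a sensible order rather than alphabetically.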

But if we try to run it like this, we’ll get an error! That’s because we have that first row, and its value for age is the word “age” instead of a number. Since the first step is to convert each age to an int, this will fail.

We need to remove that row from the DataFrame, but it’ll be useful for us later when we rename columns, so we’ll save it as a separate variable.

# Save it as headers, and then later we can access it via slices like a list
headers = survey_data.loc[0]

# .drop() defaults to axis=0, which refers to dropping items row-wise

survey_data = survey_data.drop(0)

You will notice that, since removing headers, we’ve now lost some information when looking at the survey data by itself. Ideally, you will have a list of the questions and their options that were asked in the survey, provided to you by whoever wants the analysis. If not, you should keep a separate way to reference this info in a document or note that you can look at while working.
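To make the header-row trick concrete, here is a small sketch on a made-up CSV shaped like the export described above: the first line holds question text, the second holds the answer options. The questions and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical export: line 1 is question text, line 2 is the option labels
raw_csv = io.StringIO(
    "What is your age?,Favorite feature?,\n"
    "age,Speed,Price\n"
    "34,Speed,\n"
    "61,,Price\n"
)

toy = pd.read_csv(raw_csv)

# The option labels end up in row 0; keep them for renaming later
headers = toy.loc[0]

# Drop that row so only real responses remain
toy = toy.drop(0)
```

Note that the remaining rows keep their original index labels (1 and 2 here); call .reset_index(drop=True) if you would rather start from 0 again.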

OK, now let’s apply the age_group function to get our age_group column.

survey_data['age_group'] = survey_data['What is your age?'].apply(age_group)


Great. Next, let’s subset the data to focus on just the first question. How do the answers to this first question vary by age group?

# Subset the columns from when the question "What was the most…" is asked,
# through to all the available answers. Easiest to use .iloc for this

survey_data.iloc[:5, 3:7]

# Next, assign it to a separate variable corresponding to your question
important_consideration = survey_data.iloc[:, 3:7]

Great. We have the answers in a variable now. But when we go to plot this data, it’s not going to look very good, because of the misnamed columns. Let’s write up a quick function to make renaming the columns simple:

def rename_columns(df, new_names_list):
    """Takes a DataFrame that needs to be renamed and a list of the new
    column names, and returns the renamed DataFrame. Make sure the
    number of columns in the df matches the list length exactly,
    or function will not work as intended."""

    rename_dict = dict(zip(df.columns, new_names_list))
    df = df.rename(mapper=rename_dict, axis=1)

    return df

Remember headers from earlier? We can use it to create our new_names_list for renaming.


It’s already an array, so we can just pass it right in, or we can rename it first for readability.

ic_col_names = headers[3:7].values

important_consideration = rename_columns(important_consideration, ic_col_names)
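The docstring warns that the list length has to match the number of columns. That is because zip quietly stops at the shorter input, so a mismatch would leave some columns unrenamed without any error. If that worries you, a stricter variant could look like the sketch below; rename_columns_checked is a hypothetical name, and the toy columns are made up.

```python
import pandas as pd

def rename_columns_checked(df, new_names_list):
    """Hypothetical stricter variant: fail loudly if the lengths differ."""
    if len(df.columns) != len(new_names_list):
        raise ValueError(
            "Expected {} names, got {}".format(len(df.columns), len(new_names_list))
        )
    return df.rename(columns=dict(zip(df.columns, new_names_list)))

toy = pd.DataFrame({"Unnamed: 3": ["Speed", None], "Unnamed: 4": [None, "Price"]})
renamed = rename_columns_checked(toy, ["Speed", "Price"])
```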

# Now tack on age_group from the original DataFrame so we can use .groupby
# (You could also use pd.concat, but I find this easier)
important_consideration['age_group'] = survey_data['age_group']


Isn’t that so much nicer to look at? Don’t worry, we’re almost to the part where we get some insights.

consideration_grouped = important_consideration.groupby('age_group').agg('count')


Notice how groupby and other aggregation functions ignore NaNs automatically. That makes our lives significantly easier.
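Here is a tiny sketch of that behavior on made-up data: 'count' tallies only the non-null cells in each column, whereas .size() counts every row in the group.

```python
import numpy as np
import pandas as pd

# Toy responses: each respondent ticked exactly one option, the rest are NaN
toy = pd.DataFrame({
    "age_group": ["30-39", "30-39", "60+", "60+", "60+"],
    "Price":     ["Price", np.nan, "Price", "Price", np.nan],
    "Speed":     [np.nan, "Speed", np.nan, np.nan, "Speed"],
})

counts = toy.groupby("age_group").agg("count")  # non-null answers per column
sizes = toy.groupby("age_group").size()         # all rows per group, NaN or not
```

In this toy frame the 60+ group has three rows, but only two of them chose "Price", and that is the number 'count' reports.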

Let’s say we also don’t really care about analyzing under-30 customers right now, so we’ll plot only the other age groups.

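# Drop the last row ('<30' sorts last in the index) and plot the other groups
consideration_grouped.iloc[:-1].plot(
    kind='bar',
    figsize=(10, 10),
    title='Most Important Consideration By Age Group'
)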

OK, this is all well and good, but the 60+ group has more people in it than the other groups, and so it’s hard to make a fair comparison. What do we do? We can plot each age group in a separate plot, and then compare the distributions.
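Another option, which this post does not pursue for this question, is to turn each group's counts into shares of that group's total so the rows become directly comparable. A minimal sketch on made-up counts:

```python
import pandas as pd

# Hypothetical counts: the 60+ group is twice as large as the 50-59 group
counts = pd.DataFrame(
    {"Price": [10, 30], "Speed": [10, 10]},
    index=pd.Index(["50-59", "60+"], name="age_group"),
)

# Divide every row by its own total, so each row sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)
```

With these numbers, "Price" looks three times as popular among the 60+ group in raw counts (30 vs 10), but as a share it is 75% vs 50%.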

“But wait,” you might think. “I don’t really want to write the code for 4 different plots.”

Well of course not! Who has time for that? Let’s write another function to do it for us.

def plot_counts_by_age_group(groupby_count_obj, age_group, ax=None):
    """Takes a count-aggregated groupby object, an age group, and an
    (optional) AxesSubplot, and draws a barplot for that group."""

    sort_order = groupby_count_obj.loc[age_group].sort_index().index

    sns.barplot(y = groupby_count_obj.loc[age_group].index,
                x = groupby_count_obj.loc[age_group].values,
                order = sort_order,
                palette = 'rocket', edgecolor = 'black',
                ax = ax
                ).set_title("Age {}".format(age_group))

I believe it was Jenny Bryan, in her wonderful talk “Code Smells and Feels,” who first tipped me off to the following:

If you find yourself copying and pasting code and just changing a few values, you really ought to just write a function.

This has been a great guide for me in deciding when it is and isn’t worth it to write a function for something. A rule of thumb I like to use is that if I would be copying and pasting more than 3 times, I write a function.

There are also benefits other than convenience to this approach, such as that it:

  • reduces the possibility for error (when copying and pasting, it’s easy to accidentally forget to change a value)
  • makes for more readable code
  • builds up your personal toolbox of functions
  • forces you to think at a higher level of abstraction

(All of which improve your programming skills and make the people who need to read your code happier!)

# Setup for the 2x2 subplot grid
# Note we don't want to share the x axis since we have counts
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(8, 6), sharey=True)

# ax.flatten() avoids having to explicitly reference a subplot index in ax
# Use consideration_grouped.index[:-1] because we're not plotting the under-30s
for subplot, age_group in zip(ax.flatten(), list(consideration_grouped.index)[:-1]):
    plot_counts_by_age_group(consideration_grouped, age_group, ax=subplot)
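If the ax.flatten() plus zip pattern is new to you, this stripped-down sketch shows just that part. The group labels are hard-coded here for illustration, and the Agg backend is selected only so the snippet runs without a display; in a notebook you would not need that line.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is required
import matplotlib.pyplot as plt

groups = ["30-39", "40-49", "50-59", "60+"]  # hard-coded for the sketch

fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(8, 6), sharey=True)

# ax is a 2x2 array of Axes; flatten() yields them row by row as a flat
# sequence, so zip can pair each subplot with one age group
for subplot, group in zip(ax.flatten(), groups):
    subplot.set_title("Age {}".format(group))

titles = [a.get_title() for a in ax.flatten()]
plt.close(fig)
```

Without flatten() you would have to address each subplot as ax[0, 0], ax[0, 1], and so on.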


This is, of course, generated data from a uniform distribution, and we would thus not expect to see any significant differences between groups. Hopefully your own survey data will be more interesting.

Next, let’s address another format of question. In this one, we need to see how interested each age group is in a given benefit. Happily, these questions are actually easier to deal with than the former type. Let’s take a look:

benefits = survey_data.iloc[:, 7:]

And look, since this is a small DataFrame, age_group is appended already and we won’t have to add it.

ben_col_names = headers[7:].values

benefits = rename_columns(benefits, ben_col_names)


Cool. Now we have the subsetted data, but we can’t just aggregate it by count this time like we could with the other question — the last question had NaNs that would be excluded to give the true count for that response, but with this one, we would just get the number of responses for each age group overall:


This is definitely not what we want! The point of the question is to understand how interested the different age groups are, and we need to preserve that information. All this tells us is how many people in each age group responded to the question.

So what do we do? One way to go would be to re-encode these responses numerically. But what if we want to preserve the relationship on an even more granular level? If we encode numerically, we can take the median and average of each age group’s level of interest. But what if what we’re really interested in is the specific percentage of people per age group who chose each interest level? It’d be easier to convey that info in a barplot, with the text preserved.

That’s what we’re going to do next. And — you guessed it — it’s time to write another function.

order = ['Not Interested at all', 'Somewhat uninterested',
         'Neutral', 'Somewhat interested', 'Very interested']

def plot_benefit_question(df, col_name, age_group, order=order,
                          palette='Spectral', ax=None):
    """Takes a relevant DataFrame, the name of the column (benefit) we want info on,
    and an age group, and returns a plot of the answers to that benefit question."""

    reduced_df = df[[col_name, 'age_group']]

    # Gets the relative frequencies (percentages) for "this-age-group" only
    data_to_plot = reduced_df[reduced_df['age_group'] == age_group][col_name].value_counts(normalize=True)

    sns.barplot(y = data_to_plot.index,
                x = data_to_plot.values,
                order = order,
                ax = ax,
                palette = palette,
                edgecolor = 'black'
                ).set_title('Age {}: {}'.format(age_group, col_name))
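The key step in that function is value_counts(normalize=True). Here is a small sketch of what it returns for one age group, using made-up answers. The reindex line at the end is an optional extra, not something the function above does: it adds back any interest levels that nobody in the group picked.

```python
import pandas as pd

order = ["Not Interested at all", "Somewhat uninterested",
         "Neutral", "Somewhat interested", "Very interested"]

# Made-up answers to a single benefit question
toy = pd.DataFrame({
    "age_group": ["60+", "60+", "60+", "60+", "30-39"],
    "SQL tutorials": ["Neutral", "Neutral", "Very interested",
                      "Somewhat interested", "Neutral"],
})

in_group = toy[toy["age_group"] == "60+"]["SQL tutorials"]

# Fractions instead of raw counts; they sum to 1 within the group
shares = in_group.value_counts(normalize=True)

# Levels nobody chose are missing from value_counts; reindex restores them as 0
shares_all_levels = shares.reindex(order, fill_value=0)
```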

Quick note to new learners: Most people won’t say this explicitly, but let me be clear on how visualizations are often made. Generally speaking, it is a highly iterative process. Even the most experienced data scientists don’t just write up a plot with all of these specifications off the top of their head.

Generally, you start with .plot(kind='bar'), or similar depending on the plot you want, and then you change the size and color maps, get the groups properly sorted using order=, specify whether the labels should be rotated, set x- or y-axis labels invisible, and more, depending on what you think is best for whoever will be using the visualizations.

So don’t be intimidated by the long blocks of code you see when people are making plots. They’re usually created over a span of minutes while testing out different specifications, not by writing perfect code from scratch in one go.

Now we can plot another 2x2 for each benefit broken out by age group. But we’d have to do that for all 4 benefits! Again: who has time for that? Instead, we’ll loop over each benefit, and each age group within each benefit, using a couple of for loops. But if you’re interested, I’d challenge you to refactor this into a function if you happen to have many questions that are formatted like this.

# Exclude age_group from the list of benefits
all_benefits = list(benefits.columns[:-1])

# Exclude under-30s
buckets_except_under30 = [group for group in benefits['age_group'].unique()
                          if group != '<30']

for benefit in all_benefits:
    fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(8, 6),
                           sharey=True, sharex=True)

    for a, age_group in zip(ax.flatten(), buckets_except_under30):
        plot_benefit_question(benefits, benefit,
                              age_group=age_group, ax=a)
        # Keeps x-axis tick labels for each group of plots
        a.xaxis.set_tick_params(which='both', labelbottom=True)
        # Suppresses displaying the question along the y-axis
        a.yaxis.label.set_visible(False)

Success! And if we wanted to export each individual set of plots, we would simply add the line plt.savefig('{}_interest_by_age.png'.format(benefit)), and matplotlib would automatically save a beautifully sharp rendering of each set of plots.

This makes it especially easy for folks on other teams to use your findings; you can simply export them to a plots folder, and people can browse the images and be able to drag and drop them right into a PowerPoint presentation or other report.
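One practical wrinkle: survey question text often contains spaces, slashes, or question marks, which make for awkward (or invalid) file names. A small helper like the sketch below can clean the label before it goes into plt.savefig. The name plot_filename and the example label are hypothetical.

```python
import re

def plot_filename(benefit, suffix="interest_by_age.png"):
    """Hypothetical helper: build a filesystem-friendly name from a benefit label."""
    # Collapse every run of non-alphanumeric characters into one underscore
    slug = re.sub(r"[^A-Za-z0-9]+", "_", benefit).strip("_").lower()
    return "{}_{}".format(slug, suffix)

name = plot_filename("Drag-and-drop features / templates?")
# name == 'drag_and_drop_features_templates_interest_by_age.png'
```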

These could use a tad more padding, so if I were to do this again, I would increase the allowed height for the figure slightly.

Let’s do one more example: numerically encoding the benefits, as we mentioned earlier. Then we can generate a heatmap of the correlations between interest in different benefits.

def encode_interest(interest):
    """Takes a string indicating interest and encodes it to an ordinal
    (numerical) variable."""

    if interest == 'Not Interested at all':
        x = 1
    elif interest == 'Somewhat uninterested':
        x = 2
    elif interest == 'Neutral':
        x = 3
    elif interest == 'Somewhat interested':
        x = 4
    elif interest == 'Very interested':
        x = 5

    return x

benefits_encoded = benefits.iloc[:, :-1].copy()

# Map the ordinal variable
for column in benefits.iloc[:, :-1].columns:
    benefits_encoded[column] = benefits[column].map(encode_interest)
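If you prefer, the same encoding can be expressed as a dictionary built from the ordered list of labels and passed straight to .map. This is just an alternative sketch on made-up answers; one difference worth knowing is that anything not in the dictionary (including missing answers) comes back as NaN instead of raising an error.

```python
import pandas as pd

order = ["Not Interested at all", "Somewhat uninterested",
         "Neutral", "Somewhat interested", "Very interested"]

# Position in the ordered list, starting at 1, gives the same 1-5 scale
interest_codes = {label: code for code, label in enumerate(order, start=1)}

# Made-up answers, including one missing response
answers = pd.Series(["Neutral", "Very interested", None, "Not Interested at all"])

# Labels missing from the dict (here, the None) become NaN
encoded = answers.map(interest_codes)
```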


And lastly, we’ll generate the correlation matrix and plot the correlations.

# Use Spearman instead of default Pearson, since these
# are ordinal variables!
corr_matrix = benefits_encoded.corr(method='spearman')

fig, ax = plt.subplots(figsize=(8, 6))

# vmin and vmax control the range of the colormap
sns.heatmap(corr_matrix, cmap='RdBu', annot=True, fmt='.2f',
            vmin=-1, vmax=1)

plt.title("Correlations Between Desired Benefits")

# Add tight_layout to ensure the labels don't get cut off
plt.tight_layout()


Again, since the data is randomly generated, we would expect there to be little to no correlation, and that is indeed what we find. (It is funny to note that SQL tutorials are slightly negatively correlated with drag-and-drop features, which is actually what we might expect to see in real data!)
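In case the Spearman choice feels like magic: Spearman correlation is Pearson correlation computed on the ranks of the values rather than the values themselves, which is why it suits ordinal scales like our 1-5 encoding. The sketch below checks that on two made-up columns (benefit_a and benefit_b are hypothetical names).

```python
import pandas as pd

# Made-up ordinal answers on the 1-5 scale
toy = pd.DataFrame({
    "benefit_a": [1, 2, 3, 4, 5],
    "benefit_b": [1, 1, 2, 4, 5],  # rises with benefit_a, but not in a straight line
})

pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

# Rank each column (ties share the average rank), then take plain Pearson
pearson_on_ranks = toy.rank().corr(method="pearson")
```

Comparing spearman with pearson_on_ranks should show the same numbers, up to floating-point noise.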

Let’s do one last type of plot, one that’s closely related to the heatmap: the clustermap. Clustermaps make correlations especially informative in analyzing survey responses, because they use hierarchical clustering to (in this case) group benefits together by how closely related they are. So instead of eyeballing the heatmap for which individual benefits are positively or negatively associated, which can get a little crazy when you have 10+ benefits, the plot will be segmented into clusters, which is a little easier to look at.

You can also easily change the linkage type used in the calculation, if you’re familiar with the mathematical details of hierarchical clustering. Some of the available options are ‘single’, ‘average’, and ‘ward’ — I won’t get into the details, but ‘ward’ is generally a safe bet when starting out.

sns.clustermap(corr_matrix, method='ward', cmap='RdBu', annot=True,
               vmin=-1, vmax=1, figsize=(14,12))

plt.title("Correlations Between Desired Benefits")

Long labels often require a little tweaking, so I’d recommend renaming your benefits to shorter names prior to using a clustermap.

A quick assessment of this shows that the clustering algorithm believes drag-and-drop features and ready-made formulas cluster together, while custom dashboard templates and SQL tutorials form another cluster. Since the correlations are so weak, you can see that the “height” of when the benefits link together to form a cluster is very tall. (This means you should probably not base any business decisions on this finding!) Hopefully the example is illustrative despite the weak relationships.
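To get a feel for what those merge “heights” are, here is a rough sketch that runs Ward linkage directly with SciPy on a small, made-up correlation matrix. It is meant as an illustration of hierarchical clustering in general, under the assumption that each row of the matrix is treated as one observation; it is not a reproduction of the exact steps seaborn takes internally.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Made-up correlations between four hypothetical benefits (indices 0..3):
# 0 and 1 are strongly related, and so are 2 and 3
corr = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
])

# Each row of corr is treated as one observation to cluster
merges = linkage(corr, method="ward")

# Each row of merges describes one join: the two clusters combined,
# the height (distance) at which they join, and the new cluster's size
first_pair = {int(merges[0, 0]), int(merges[0, 1])}
heights = merges[:, 2]
```

In this toy matrix the two tight pairs join at low heights and the final join between the pairs happens much higher up, which is the same pattern to look for in the dendrogram of a clustermap.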

I hope you enjoyed this quick tutorial about working with survey data and writing functions to quickly generate visualizations of your findings! If you think you know an even more efficient way of doing things, feel free to let me know in the comments — this is just what I came up with when I needed to produce insights on individual questions as quickly as possible.

Originally published by Charlene Chambliss at https://towardsdatascience.com
