In this article, we will explore the different use cases of Pandas value_counts(). You’ll learn how to use it to deal with the following common tasks.
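As a quick preview, here is a minimal sketch of the kind of `value_counts()` calls we will cover. The toy data is my own invention, not from the article:

```python
import pandas as pd

# A toy Series of city names (hypothetical example data)
cities = pd.Series(["Paris", "London", "Paris", "Berlin", "Paris", "London"])

# Plain counts, most frequent value first
print(cities.value_counts())

# Relative frequencies instead of raw counts
print(cities.value_counts(normalize=True))

# Counts bucketed into equal-width bins (useful for numeric data)
prices = pd.Series([1, 3, 5, 7, 9, 11])
print(prices.value_counts(bins=3))
```

Each variant returns a Series indexed by the counted values (or bins), sorted by frequency unless you pass `sort=False`.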
If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.
In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points form our checklist for what we perceive to be an anticipatory analytics ecosystem.
EDA is a way to understand what the data is all about. It is very important, as it helps us understand outliers and the relationships between features in the data with the help of graphs and plots.
EDA is a time-consuming process, as we need to build visualizations between different features using libraries like Matplotlib, Seaborn, etc.
There is a way to automate this process with a single line of code, using the Pandas Visual Analysis library.
Let’s understand the different sections in the user interface:
According to the ICAO standard, a passport number should be up to 9 characters long and can contain numbers and letters. During your work as an analyst, you may come across a data set containing passport numbers and be asked to explore it.
I have recently worked with one such set and I’d like to share the steps of this analysis with you, including:
First, let’s load the data. Since the dataset contains only one column, it’s quite straightforward.
# import the packages which will be used
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv(r"path\data.csv")
df.info()
The .info() command will tell us that we have 10902 passports in the dataset and that all were imported as “object”, which means they are stored as strings.
The initial step of any analysis should be a check for duplicated values. In our case there are some, so we will remove them using pandas’s drop_duplicates().
print(len(df["PassportNumber"].unique()))  # if lower than 10902, there are duplicates
df.drop_duplicates(inplace=True)  # or df = df.drop_duplicates()
Usually, you continue by checking the longest and the shortest passport numbers.
[In]: df["PassportNumber"].agg(["min","max"])
[Out]: min    000000050
       max    ZXD244549
       Name: PassportNumber, dtype: object
You might be happy that all the passports appear to be 9 characters long, but you would be misled. The data have string format, so the lowest “string” value is the one starting with the most zeros, and the largest is the one with the most Zs at the beginning.
# ordering of strings is not the same as ordering of numbers
# "0" < "0001" < "001" < "1" < "123" < "AB" < "Z" < "Z123" < "ZZ123"
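A quick sanity check of this lexicographic ordering in plain Python (my own toy values, chosen to match the comment above):

```python
# Strings compare character by character, so "0001" sorts before "001",
# and any digit sorts before any letter
values = ["1", "0001", "Z", "123", "0", "AB", "001", "ZZ123", "Z123"]
print(sorted(values))

assert sorted(values)[0] == "0"
assert "0001" < "001" < "1"
```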
To see the actual lengths of the passport numbers, let’s compute them.
[In]: df["PassportNumber"].apply(len).agg(["min","max"])
[Out]: min     3
       max    17
       Name: PassportNumber, dtype: object
In contrast to our initial belief, the shortest passport contains only 3 characters, while the longest is 17 (way over the expected maximum of 9) characters long.
Let’s expand our data frame with a 'len' column so that we can have a look at some examples:
# Add the 'len' column
df['len'] = df["PassportNumber"].apply(len)

# look at the examples having the maximum length
[In]: df[df["len"]==df['len'].max()]
[Out]:    PassportNumber  len
       25109300000000000   17
       27006100000000000   17

# look at the examples having the minimum length
[In]: df[df["len"]==df['len'].min()]
[Out]: PassportNumber  len
                  179    3
                  917    3
                  237    3
The 3-digit passport numbers look suspicious, but they still meet the ICAO criteria. The longest ones are obviously too long; however, they contain many trailing zeros. Maybe someone just added the zeros in order to meet some data storage requirement.
Let’s have a look at the overall length distribution of our data sample.
# calculate the count of appearances of the various lengths
counts_by_value = df["len"].value_counts().reset_index()
separator = pd.Series(["|"] * df["len"].value_counts().shape[0])
separator.name = "|"
counts_by_index = df["len"].value_counts().sort_index().reset_index()
length_distribution_df = pd.concat([counts_by_value, separator, counts_by_index], axis=1)

# draw the chart
ax = df["len"].value_counts().sort_index().plot(kind="bar")
ax.set_xlabel("length")
ax.set_ylabel("number of records")
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.05))
Distribution of the passport lengths of the data sample
We see that most passport numbers in our sample are 7, 8 or 9 characters long. Quite a few, however, are 10 or 12 characters long, which is unexpected.
Maybe the long passports have leading or trailing zeros, like our example with 17 characters.
In order to explore these zero-pads let’s add two more columns to our data set — ‘leading_zeros’ and ‘trailing_zeros’ to contain the number of leading and trailing zeros.
# the number of leading zeros equals the total length of the string
# minus the length of the string with the leading zeros stripped
df["leading_zeros"] = df["PassportNumber"].apply(lambda x: len(x) - len(x.lstrip("0")))

# similarly, the number of trailing zeros equals the total length
# minus the length of the string with the trailing zeros stripped
df["trailing_zeros"] = df["PassportNumber"].apply(lambda x: len(x) - len(x.rstrip("0")))
Then we can easily display the passports which have more than 9 characters, to check whether they have any leading or trailing zeros:
[In]: df[df["len"]>9]
[Out]: PassportNumber  len  leading_zeros  trailing_zeros
       73846290957      11              0               0
       N614226700       10              0               2
       WC76717593       10              0               0
       ...
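As a follow-up, one hedged next step is to strip the padding zeros and see how many records then fall within the 9-character limit. This assumes the zeros really are storage padding, which is only a guess; the sample values below mimic the patterns seen above:

```python
import pandas as pd

# Hypothetical sample mimicking the patterns observed in the dataset
df = pd.DataFrame({"PassportNumber": ["25109300000000000", "N614226700",
                                      "179", "ZXD244549"]})

# Strip leading and trailing zeros; this treats the zeros as padding,
# which is an assumption about how the data was stored, not a documented fact
df["stripped"] = df["PassportNumber"].str.strip("0")
df["stripped_len"] = df["stripped"].apply(len)

# How many records now meet the ICAO maximum of 9 characters?
print((df["stripped_len"] <= 9).sum())
```

Note that stripping leading zeros would also alter genuinely valid passports such as 000000050, so this transformation should only be used for exploration, not for cleaning.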
Pandas is one of the predominant data analysis tools which is highly appreciated among data scientists. It provides numerous flexible and versatile functions to perform efficient data analysis.
In this article, we will go over 3 pandas tricks that I think will make you a happier pandas user. It is better to explain these tricks with some examples, so we start by creating a data frame to work on.
The data frame contains daily sales quantities of 3 different stores. We first create a period of 10 days using the date_range function of pandas.
import numpy as np
import pandas as pd

days = pd.date_range("2020-01-01", periods=10, freq="D")
The days variable will be used as a column. We also need sales quantity columns, which can be generated with the randint function of numpy. Then, we create a data frame with 3 columns, one for each store.
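Putting those pieces together, the data frame described above could be built like this. The store column names and the quantity range are my assumptions; the original article does not show them:

```python
import numpy as np
import pandas as pd

# 10 consecutive days to serve as the date column
days = pd.date_range("2020-01-01", periods=10, freq="D")

# Random daily sales quantities for 3 stores (range 10-49 is arbitrary)
df = pd.DataFrame({
    "date": days,
    "store_1": np.random.randint(10, 50, size=10),
    "store_2": np.random.randint(10, 50, size=10),
    "store_3": np.random.randint(10, 50, size=10),
})
print(df.head())
```

The result is a 10-row frame with one date column and one sales column per store, which is enough to demonstrate the tricks that follow.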
With nearly everything one can think of revolving around data, the need for people who can transform data into a form that makes the best of what is available is at its peak. This brings our attention to two major aspects of data: data science and data analysis. Many tend to get confused between the two and often misuse one in place of the other. In reality, they differ in several respects. Read on to find out how data analysis and data science are different from each other.
Before jumping straight into the differences between the two, it is critical to understand the commonalities between data analysis and data science. First things first – both these areas revolve primarily around data. Next, the prime objective of both of them remains the same – to meet the business objective and aid in the decision-making ability. Also, both these fields demand the person be well acquainted with the business problems, market size, opportunities, risks and a rough idea of what could be the possible solutions.
Now, let’s address the main topic of interest: how are data analysis and data science different from each other?
As far as data science is concerned, it is essentially about drawing actionable insights from raw data. Most data science work falls into these three areas: