Chih-Yu Lin

1612785060

How to build an Exploratory Data Analysis app using Pandas Profiling | Streamlit

In this video, I will be showing you how to build your very own EDA app in Python using Pandas Profiling and Streamlit. With this app, you will be able to upload a CSV file and it will automatically generate an exploratory data analysis (EDA) report.
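Before diving into the video and the repository, here is a minimal sketch of such an app. It is not the exact code from the repo; it assumes the pandas-profiling and streamlit-pandas-profiling packages are installed:

import pandas as pd
import streamlit as st
from pandas_profiling import ProfileReport
from streamlit_pandas_profiling import st_profile_report

st.title("Exploratory Data Analysis App")

# let the user upload a CSV file
uploaded_file = st.file_uploader("Upload your CSV file", type=["csv"])

if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)                # read the upload into a DataFrame
    st.subheader("Input DataFrame")
    st.write(df)
    profile = ProfileReport(df, explorative=True)  # build the profiling report
    st.subheader("Profiling Report")
    st_profile_report(profile)                     # render the report inside the app
else:
    st.info("Awaiting a CSV file upload.")

Save this as app.py and serve it locally with streamlit run app.py.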

👉 Code: https://github.com/dataprofessor/eda-app

👉 Demo: https://share.streamlit.io/dataprofessor/eda-app/main/app.py

Subscribe: https://www.youtube.com/channel/UCV8e2g4IWQqK71bbzGDEI4Q

#streamlit #python #pandas #data-science

Siphiwe Nair

1620466520

Your Data Architecture: Simple Best Practices for Your Data Strategy

If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.

In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points form our checklist for what we perceive to be an anticipatory analytics ecosystem.

#big data #data science #big data analytics #data analysis #data architecture #data transformation #data platform #data strategy #cloud data platform #data acquisition

Aketch Rachel

1625001660

Exploratory Data Analysis in Few Seconds

EDA is a way to understand what the data is all about. It is very important, as it helps us understand outliers and the relationships between features in the data with the help of graphs and plots.

EDA is a time-consuming process, as we need to build visualizations of different features using libraries like Matplotlib, Seaborn, etc.

There is a way to automate this process with a single line of code using the library Pandas Visual Analysis.

About Pandas Visual Analysis

  1. It is an open-source python library used for Exploratory Data Analysis.
  2. It creates an interactive user interface to visualize datasets in Jupyter Notebook.
  3. Visualizations created can be downloaded as images from the interface itself.
  4. It has a selection type that will help to visualize patterns with and without outliers.

Implementation

  1. Installation
  2. Importing the dataset
  3. EDA using Pandas Visual Analysis (a minimal sketch of these steps follows below)
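Putting the three steps together, a minimal sketch (assuming the pandas-visual-analysis package and an illustrative file name data.csv) looks like this:

# install the library first (one-time step):
# pip install pandas-visual-analysis

import pandas as pd
from pandas_visual_analysis import VisualAnalysis

df = pd.read_csv("data.csv")   # illustrative dataset path
VisualAnalysis(df)             # renders the interactive EDA interface in a Jupyter Notebook

The single VisualAnalysis(df) call is the "one line of code" that generates the whole interactive interface.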

Understanding Output

Let’s understand the different sections of the user interface:

  1. Statistical Analysis: This section shows statistical properties such as the mean, median, mode, and quantiles of all numerical features.
  2. Scatter Plot: This shows the relationship between two different features with the help of a scatter plot. You can choose the features to be plotted on the X and Y axes from the dropdowns.
  3. Histogram: This shows the distribution of the selected features with the help of histograms.

#data-analysis #machine-learning #data-visualization #data-science #data analysis #exploratory data analysis

Tia Gottlieb

1594144380

Exploratory Data Analysis — Passport Numbers in Pandas

According to the ICAO standard, a passport number should be up to 9 characters long and can contain numbers and letters. During your work as an analyst, you may come across a data set containing passport numbers, and you will be asked to explore it.
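As a quick reference, this rule can be approximated with a regular expression (an illustrative simplification, not the full ICAO specification; uppercase letters are assumed):

import re

# up to 9 characters, uppercase letters and digits only
icao_like = re.compile(r"^[A-Z0-9]{1,9}$")

print(bool(icao_like.match("ZXD244549")))          # True: 9 alphanumeric characters
print(bool(icao_like.match("25109300000000000")))  # False: 17 characters, too long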

I have recently worked with one such set and I’d like to share the steps of this analysis with you, including:

  • Number of records
  • Duplicated records
  • Length of the records
  • Analysis of the leading and trailing zeros
  • Appearance of the character at a specific position
  • Where do letters appear in the string using regular expressions (regexes)
  • Length of the sequence of letters
  • Is there a common prefix
  • Anonymize/Randomize the data while keeping the characteristics of the dataset

You can go through the steps with me. Get the randomized data from GitHub; the repository also contains the Jupyter notebook with all the steps.

Basic Dataset Exploration

First, let’s load the data. Since the dataset contains only one column, it’s quite straightforward.

# import the packages which will be used
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r"path\data.csv")
df.info()

The .info() command tells us that we have 10902 passports in the dataset and that all of them were imported as “object”, which means they are stored as strings.

Duplicates

The initial step of any analysis should be a check for duplicated values. In our case there are some, so we will remove them using pandas’ .drop_duplicates() method.

print(len(df["PassportNumber"].unique()))  # if lower than 10902, there are duplicates

df.drop_duplicates(inplace=True) # or df = df.drop_duplicates()

Length of the passports

Usually, you would continue by checking the longest and the shortest passport number.

[In]: df["PassportNumber"].agg(["min","max"])
[Out]: 
   min 000000050
   max ZXD244549
   Name: PassportNumber, dtype: object

You might be happy to conclude that all the passports are 9 characters long, but you would be misled. The data are in string format, so the lowest “string” value is the one that starts with the most zeros, and the largest is the one with the most Zs at the beginning.

# ordering of strings is not the same as the ordering of numbers
"0" < "0001" < "001" < "1" < "123" < "AB" < "Z" < "Z123" < "ZZ123"
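You can verify this ordering with Python’s built-in sorted(), which sorts strings lexicographically:

sorted(["1", "0001", "123", "Z123", "0", "AB", "001", "ZZ123", "Z"])
# ['0', '0001', '001', '1', '123', 'AB', 'Z', 'Z123', 'ZZ123']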

In order to see the real lengths of the passport numbers, let’s compute the length of each string.

[In]: df["PassportNumber"].apply(len).agg(["min","max"])
[Out]: 
   min 3
   max 17
   Name: PassportNumber, dtype: object

In contrast to our initial belief, the shortest passport contains only 3 characters, while the longest is 17 characters long (way over the expected maximum of 9).

Let’s expand our data frame with a 'len' column so that we can have a look at some examples:

# Add the 'len' column
df['len'] = df["PassportNumber"].apply(len)

# look at the examples having the maximum length
[In]: df[df["len"]==df['len'].max()]
[Out]: 
   PassportNumber    len
   25109300000000000 17
   27006100000000000 17
# look at the examples having the minimum length
[In]: df[df["len"]==df['len'].min()]
[Out]: 
   PassportNumber    len
   179               3
   917               3
   237               3

The 3-digit passport numbers look suspicious, but they do meet the ICAO criteria. The longest ones are obviously too long; however, they contain quite a few trailing zeros. Maybe someone just added the zeros in order to meet some data storage requirement.

Let’s have a look at the overall length distribution of our data sample.

# calculate the count of appearance of the various lengths
counts_by_value = df["len"].value_counts().reset_index()
separator = pd.Series(["|"]*df["len"].value_counts().shape[0])
separator.name = "|"
counts_by_index = df["len"].value_counts().sort_index().reset_index()
length_distribution_df = pd.concat([counts_by_value, separator, counts_by_index], axis=1)

# draw the chart
ax = df["len"].value_counts().sort_index().plot(kind="bar")
ax.set_xlabel("passport length")
ax.set_ylabel("number of records")
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.05))

Distribution of the passport lengths of the data sample

We see that most passport numbers in our sample are 7, 8, or 9 characters long. Quite a few, however, are 10 or 12 characters long, which is unexpected.

Leading and trailing zeros

Maybe the long passport numbers have leading or trailing zeros, like our examples with 17 characters.

In order to explore this zero-padding, let’s add two more columns to our data set, ‘leading_zeros’ and ‘trailing_zeros’, containing the number of leading and trailing zeros.

# the number of leading zeros is the total length of the string minus the length of the string with the leading zeros stripped (lstrip)
df["leading_zeros"] = df["PassportNumber"].apply(lambda x: len(x) - len(x.lstrip("0")))

# similarly, the number of trailing zeros is the total length minus the length with the trailing zeros stripped (rstrip)
df["trailing_zeros"] = df["PassportNumber"].apply(lambda x: len(x) - len(x.rstrip("0")))

Then we can easily display the passports which have more than 9 characters to check whether they have any leading or trailing zeros:

[In]: df[df["len"]>9]
[Out]:
   PassportNumber  len  leading_zeros trailing_zeros
   73846290957     11   0             0
   N614226700      10   0             2
   WC76717593      10   0             0
   ...

#pandas #exploratory-data-analysis #data-analysis #dataset #data analysis

Fredy Larson

1595059664

How long does it take to develop/build an app?

With more of us using smartphones, the popularity of mobile applications has exploded. In the digital era, the number of people looking for products and services online is growing rapidly. Smartphone owners look for mobile applications that give them quick access to companies’ products and services. As a result, mobile apps provide customers with a lot of benefits in just one device.

Likewise, companies use mobile apps to increase customer loyalty and improve their services. Mobile Developers are in high demand as companies use apps not only to create brand awareness but also to gather information. For that reason, mobile apps are used as tools to collect valuable data from customers to help companies improve their offer.

There are many types of mobile applications, each with its own advantages. For example, native apps perform better, while web apps don’t need to be customized for the platform or operating system (OS). Likewise, hybrid apps provide users with a comfortable user experience. However, you may be wondering how long it takes to develop an app.

To give you an idea of how long the app development process takes, here’s a short guide.

App Idea & Research

_Average time spent: two to five weeks_

This is the initial stage and a crucial step in setting the project in the right direction. In this stage, you brainstorm ideas and select the best one. Apart from that, you’ll need to do some research to see if your idea is viable. Remember that coming up with an idea is easy; the hard part is to make it a reality.

All your ideas may seem viable, but you still have to run some tests to keep them as realistic as possible. For that reason, when Web Developers are building a web app, they analyze the available ideas to see which one is the best match for the target audience.

Targeting the right audience is crucial when you are developing an app. It saves time when shaping the app in the right direction as you have a clear set of objectives. Likewise, analyzing how the app affects the market is essential. During the research process, App Developers must gather information about potential competitors and threats. This helps the app owners develop strategies to tackle difficulties that come up after the launch.

The research process can take several weeks, but it determines how successful your app can be. For that reason, you must take your time to know all the weaknesses and strengths of the competitors, possible app strategies, and targeted audience.

The outcomes of this stage are app prototypes and the minimum viable product (MVP).

#android app #frontend #ios app #minimum viable product (mvp) #mobile app development #web development #android app development #app development #app development for ios and android #app development process #ios and android app development #ios app development #stages in app development

George Koelpin

1603486800

Top 20 Pandas Functions which are commonly used for Exploratory Data Analysis.

When it comes to data science or data analysis, Python is pretty much always the language of choice. Its library Pandas is the one you cannot, and more importantly shouldn’t, avoid.

Pandas is the predominant Python data analysis library. It provides many functions and methods to expedite the data analysis process. What makes Pandas so popular is its functionality, flexibility, and simple syntax.

While Pandas by itself isn’t that difficult to learn, mainly due to its self-explanatory method names, having a cheat sheet is still worthwhile, especially if you want to code something up quickly. That’s why today I want to focus on how I use Pandas for Exploratory Data Analysis, by providing you with a list of my most used methods along with a detailed explanation of each.

Dataset used

I will do the examples on the California housing dataset. The dataset and code are available at this link:

https://github.com/jbolla368/Pandas-using-in-EDA.git

Here’s how to import the Pandas library and load in the dataset:

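The original code was shared as screenshots; a minimal sketch of this step (the file name housing.csv is an assumption, since the actual path lives in the linked repository) could look like this:

import pandas as pd

# load the California housing dataset from a local CSV file
df = pd.read_csv("housing.csv")

# take a first look at the data
df.head()
df.info()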

#exploratory-data-analysis #pandas #data-science #python-pandas #data-folkz