Noah Rowe

Classification Basics: Walk-through with the Iris Data Set

Learn the basics of classification with guided code from the iris data set

When I was first learning how to code, I would practice my data skills on different data sets and turn them into mini Jupyter Notebook reference guides. Since I have found this to be an incredibly useful exercise, I thought I’d start sharing code walk-throughs. Hopefully this serves as a helpful resource for those learning to use Python for data science, and as something that can be referenced in future projects. Full code is available on GitHub.

Getting Set Up

The first step is to import the preloaded data sets from the scikit-learn Python library. More info on the “toy” data sets included in the package can be found in the scikit-learn documentation. The data description will also give more information on the features, statistics, and sources.

from sklearn.datasets import load_iris

#save data information as variable
iris = load_iris()
#view data description and information
print(iris.DESCR)

Image for post

The data comes packaged as a dictionary-like object whose keys include “data” and “target”, each paired with an array of values. Printed directly, the information looks like this:

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3] ...
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ...}
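
Before building data frames, it can help to see exactly what the loaded object exposes. Here is a minimal sketch that only uses attributes load_iris is documented to provide (keys(), feature_names, and target_names):

#inspect what the loaded object actually contains
print(list(iris.keys()))
#the column and species names are stored alongside the raw arrays
print(iris.feature_names)
print(iris.target_names)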

Putting Data into a Data Frame

Feature Data

To view the data more easily, we can put this information into a data frame using the Pandas library. Let’s first create a data frame to store the flowers’ feature measurements.

import pandas as pd

#make sure to save the data frame to a variable
data = pd.DataFrame(iris.data)
data.head()

Calling data.head() outputs the first 5 rows of data by default, but we can also specify how many rows we want inside the parentheses, e.g. data.head(10).

Now we have a data frame with the iris data, but the columns are not clearly labeled. Looking at the data description we printed above, or referencing the source code, tells us more about the features. In the documentation the data features are listed as:

  • sepal length in cm
  • sepal width in cm
  • petal length in cm
  • petal width in cm

Let’s rename the columns so the features are clear.

data.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

#note: it is common practice to use underscores between words, and avoid spaces
data.head()
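
As an aside, you don’t have to type the names by hand: they can also be derived from iris.feature_names. This is just an optional sketch; the string cleanup is an assumption chosen to match the underscore convention above.

#derive the same column names from the data set's own metadata
#e.g. 'sepal length (cm)' becomes 'sepal_length'
data.columns = [name.replace(' (cm)', '').replace(' ', '_') for name in iris.feature_names]
data.head()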

Target Data

Now that the data related to the features is neatly in a data frame, we can do the same with the target data.

#put target data into data frame
target = pd.DataFrame(iris.target)

#let's rename the column so we know these are the target values
target = target.rename(columns = {0: 'target'})
target.head()

The target data frame has only one column, which lists the values 0, 1, and 2. We will use the feature data to predict whether a flower belongs to group 0, 1, or 2. But what do these numbers refer to? The species names are listed below, and the short snippet after the list confirms the mapping directly from the data set.

  • 0 is Iris Setosa
  • 1 is Iris Versicolour
  • 2 is Iris Virginica
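
As mentioned above, we can confirm this mapping from the data set itself: iris.target_names is ordered so that its index matches the integer code. A quick sketch:

#index i of target_names is the species encoded as target value i
for code, name in enumerate(iris.target_names):
    print(code, name)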

Exploratory Data Analysis (EDA)

To help us understand our data better, let’s first combine the two data frames we just created. By doing this we can see each flower’s features and its class label together.

df = pd.concat([data, target], axis = 1)

#note: it is common practice to name your data frame "df", but you can use any name as long as you are clear and consistent
#axis = 1 tells pandas to add the target data frame as another column of the data data frame; axis = 0 would add the values as more rows on the bottom
df.head()

Now we can clearly see the relationship between an individual flower’s features and its class.
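
One simple way to make that relationship concrete is to summarize the features per class. A minimal sketch on the combined data frame (nothing iris-specific here, just a standard groupby):

#average measurements for each target class
df.groupby('target').mean()

In the iris data, the petal measurements in particular tend to separate class 0 (setosa) cleanly from the other two classes.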

Data Cleaning

It’s super important to look through your data, make sure it is clean, and begin to explore relationships between features and target variables. Since this is a relatively simple data set there is not much cleaning that needs to be done, but let’s walk through the steps.

  1. Look at Data Types

df.dtypes

float = numbers with decimals (1.678)

int = integer or whole number without decimals (1, 2, 3)

object (obj) = strings or other text values (‘hello’)

The 64 after these data types refers to how many bits of storage the value occupies. You will often see 32 or 64.

In this data set, the data types are all ready for modeling. In other data sets, numeric values are sometimes coded as objects, so we would have to change the data types before performing statistical modeling.
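
If that happens, a common fix is to cast or coerce the column with pandas. Here is a sketch of both options, using sepal_length purely as an example (in this data set it is already numeric):

#cast a column that was read in as strings back to numbers
df['sepal_length'] = df['sepal_length'].astype(float)
#or coerce, turning anything unparseable into NaN
df['sepal_length'] = pd.to_numeric(df['sepal_length'], errors='coerce')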

  2. Check for Missing Values

df.isnull().sum()

This data set is not missing any values. While this makes modeling much easier, this is not usually the case: real-world data is almost always messy. If there were missing values, you could drop the rows that contain them, or fill in the missing numbers in one of several ways (with the column’s mean, the previous value, and so on).
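
For reference, a minimal sketch of those options in pandas (none of them change anything here, since this data set has no missing values):

#drop any rows that contain a missing value
df_no_missing = df.dropna()
#fill missing numeric values with each column's mean
df_filled = df.fillna(df.mean())
#or carry the previous row's value forward
df_filled = df.ffill()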
