Best way to Impute categorical data using Groupby 

We know we can replace NaN values with the mean or median using fillna(). But what if the missing data is correlated with another categorical column?

And what if the missing value is itself categorical?

Below are some useful tips for handling NaN values.

As usual, we will be working with Pandas and NumPy.

import pandas as pd
import numpy as np


cl = pd.DataFrame({
    'team': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'class': ['I', 'I', 'I', 'I', 'I', 'I', 'I', 'II', 'II', 'II'],
    'value': [1, np.nan, 2, 2, 3, 1, 3, np.nan, 3, 1]})


Let's say you have to fill missing values in a dataset of liquor consumption rates. You can simply use fillna() if no other column is relevant.

But if each person's age is also given, you may notice a pattern between age and consumption rate, because liquor consumption will not be at the same level for all people.

Another example: filling missing salary values could draw on age, job title, and/or education.

In the above example, let us assume that the columns team and class are related to value.

Using ngroup(), you can label each group with an integer index.
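Putting this together, here is one way to impute with group statistics: the mean per (team, class) group for the numeric value column, and the per-group mode when the missing column is itself categorical. The cl frame is redefined so the snippet runs standalone; the cat frame and its size column are illustrative, not from the original post.

```python
import pandas as pd
import numpy as np

cl = pd.DataFrame({
    'team': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'class': ['I', 'I', 'I', 'I', 'I', 'I', 'I', 'II', 'II', 'II'],
    'value': [1, np.nan, 2, 2, 3, 1, 3, np.nan, 3, 1]})

# Label each (team, class) group with an integer index.
cl['group'] = cl.groupby(['team', 'class']).ngroup()

# Numeric column: fill NaN with the mean of its (team, class) group.
cl['value'] = cl.groupby(['team', 'class'])['value'] \
                .transform(lambda s: s.fillna(s.mean()))

# If the missing column is categorical, the per-group mode works the same way.
cat = pd.DataFrame({
    'team': ['A', 'A', 'A', 'B', 'B', 'B'],
    'size': ['S', 'S', np.nan, 'L', 'L', np.nan]})
cat['size'] = cat.groupby('team')['size'] \
                 .transform(lambda s: s.fillna(s.mode().iloc[0]))
```

Here row 1 ('A', 'I') gets the group mean of [1, 2, 2, 3], i.e. 2.0, rather than the overall column mean, which is exactly the group-aware behaviour described above.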

#group-by #fillna #mean #mode #pandas

Siphiwe Nair


Your Data Architecture: Simple Best Practices for Your Data Strategy


If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.

In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points form our checklist for what we perceive to be an anticipatory analytics ecosystem.

#big data #data science #big data analytics #data analysis #data architecture #data transformation #data platform #data strategy #cloud data platform #data acquisition

Gerhard Brink


Getting Started With Data Lakes

Frameworks for Efficient Enterprise Analytics

The opportunities big data offers also come with very real challenges that many organizations are facing today. Often, it’s finding the most cost-effective, scalable way to store and process boundless volumes of data in multiple formats that come from a growing number of sources. Then organizations need the analytical capabilities and flexibility to turn this data into insights that can meet their specific business objectives.

This Refcard dives into how a data lake helps tackle these challenges at both ends — from its enhanced architecture that’s designed for efficient data ingestion, storage, and management to its advanced analytics functionality and performance flexibility. You’ll also explore key benefits and common use cases.


As technology continues to evolve with new data sources, such as IoT sensors and social media churning out large volumes of data, there has never been a better time to discuss the possibilities and challenges of managing such data for varying analytical insights. In this Refcard, we dig deep into how data lakes solve the problem of storing and processing enormous amounts of data. While doing so, we also explore the benefits of data lakes, their use cases, and how they differ from data warehouses (DWHs).

This is a preview of the Getting Started With Data Lakes Refcard. To read the entire Refcard, please download the PDF from the link above.

#big data #data analytics #data analysis #business analytics #data warehouse #data storage #data lake #data lake architecture #data lake governance #data lake management

Uriah Dietrich


What Is ETLT? Merging the Best of ETL and ELT Into a Single ETLT Data Integration Strategy

Data integration solutions typically advocate that one approach – either ETL or ELT – is better than the other. In reality, both ETL (extract, transform, load) and ELT (extract, load, transform) serve indispensable roles in the data integration space:

  • ETL is valuable when it comes to data quality, data security, and data compliance. It can also save money on data warehousing costs. However, ETL is slow when ingesting unstructured data, and it can lack flexibility.
  • ELT is fast when ingesting large amounts of raw, unstructured data. It also brings flexibility to your data integration and data analytics strategies. However, ELT sacrifices data quality, security, and compliance in many cases.

Because ETL and ELT present different strengths and weaknesses, many organizations are using a hybrid “ETLT” approach to get the best of both worlds. In this guide, we’ll help you understand the “why, what, and how” of ETLT, so you can determine if it’s right for your use-case.

#data science #data #data security #data integration #etl #data warehouse #data breach #elt #big data

Data Lake and Data Mesh Use Cases

As data mesh advocates come to suggest that the data mesh should replace the monolithic, centralized data lake, I wanted to check in with Dipti Borkar, co-founder and Chief Product Officer at Ahana. Dipti has been a tremendous resource for me over the years as she has held leadership positions at Couchbase, Kinetica, and Alluxio.


  • A data lake is a concept consisting of a collection of storage instances of various data assets. These assets are stored in a near-exact, or even exact, copy of the source format, in addition to the originating data stores.
  • A data mesh is a type of data platform architecture that embraces the ubiquity of data in the enterprise by leveraging a domain-oriented, self-serve design. Mesh is an abstraction layer that sits atop data sources and provides access.

According to Dipti, while data lakes and data mesh both have use cases they work well for, data mesh can’t replace the data lake unless all data sources are created equal — and for many, that’s not the case.

Data Sources

All data sources are not equal. There are different dimensions of data:

  • Amount of data being stored
  • Importance of the data
  • Type of data
  • Type of analysis to be supported
  • Longevity of the data being stored
  • Cost of managing and processing the data

Each data source has its purpose. Some are built for fast access for small amounts of data, some are meant for real transactions, some are meant for data that applications need, and some are meant for getting insights on large amounts of data.


Things changed when AWS commoditized the storage layer with the AWS S3 object-store 15 years ago. Given the ubiquity and affordability of S3 and other cloud storage, companies are moving most of this data to cloud object stores and building data lakes, where it can be analyzed in many different ways.

Because of the low cost, enterprises can store all of their data — enterprise, third-party, IoT, and streaming — into an S3 data lake. However, the data cannot be processed there. You need engines on top like Hive, Presto, and Spark to process it. Hadoop tried to do this with limited success. Presto and Spark have solved the SQL in S3 query problem.

#big data #big data analytics #data lake #data lake and data mesh #data mesh

Getting Started With Data Imputation Using Autoimpute

A large majority of datasets in the real world contain missing data. This leads to an issue since most Python machine learning models only work with clean datasets. As a result, analysts need to figure out how to deal with the missing data before proceeding on to the modeling step. Unfortunately, most data professionals are mainly focused on the modeling aspect and they do not pay much attention to the missing values. They usually either just drop the rows with missing values or rely on simple data imputation (replacement) techniques such as mean/median imputation. Such techniques can negatively impact model performance. This is where the Autoimpute library comes in — it provides you a framework for the proper handling of missing data.
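As a quick illustration of why naive mean imputation can hurt model performance, here is a small sketch on made-up normal data: filling every hole with the same mean value visibly shrinks the variable's spread, distorting any statistic or model fit that depends on its variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=1000)  # "true" fully observed data

# Knock out roughly 40% of the values at random.
mask = rng.random(1000) < 0.4
x_missing = x.copy()
x_missing[mask] = np.nan

# Naive mean imputation: every hole gets the same value.
x_imputed = np.where(np.isnan(x_missing), np.nanmean(x_missing), x_missing)

# The imputed series has a noticeably smaller standard deviation
# than the original, because ~40% of its values are now a constant.
print(x.std(), x_imputed.std())
```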

Types of imputation

  1. Univariate imputation: Impute values using only the target variable itself, for example, mean imputation.
  2. Multivariate imputation: Impute values based on other variables, such as using linear regression to estimate the missing values from other variables.
  3. Single imputation: Impute any missing values within the dataset only once to create a single imputed dataset.
  4. Multiple imputation: Impute the same missing values within the dataset multiple times. This basically involves running the single imputation multiple times to get multiple imputed datasets (explained with a detailed example in the next section).
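The difference between the first two types can be sketched with a toy dataset (the variables and numbers here are illustrative, not from the article): mean imputation ignores x entirely, while regression-based imputation exploits the relationship between x and y and lands much closer to the true values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)  # y strongly depends on x
df = pd.DataFrame({'x': x, 'y': y})

y_true = df['y'].copy()          # keep the truth for comparison
df.loc[:49, 'y'] = np.nan        # first 50 y values go missing

obs = df.dropna()
miss = df['y'].isna()

# 1. Univariate (mean) imputation: uses only 'y' itself.
y_mean = df['y'].fillna(df['y'].mean())

# 2. Multivariate (regression) imputation: predict 'y' from 'x'.
slope, intercept = np.polyfit(obs['x'], obs['y'], 1)
y_reg = df['y'].copy()
y_reg[miss] = slope * df.loc[miss, 'x'] + intercept
```

On the missing rows, the regression-based fills track y_true far better than the constant mean does, which is exactly the point of multivariate imputation.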

Using Autoimpute

Now let’s demonstrate how to tackle the issue of missingness using the Autoimpute library. This library provides a framework for handling missing data from the exploration phase up until the modeling phase. The image below shows a basic flowchart of how this process works on regression using multiple imputation.


Flowchart demonstrating how multiple imputation works with linear regression.

In the above image, the raw dataset is imputed three times to create three new datasets, each one having its own new imputed values. Separate regressions are run on each of the new datasets and the parameters obtained from these regressions are pooled to form a single model. This process can be generalized to other values of ‘n’ (number of imputed datasets) and various other models.

In order to understand one major advantage of obtaining multiple datasets, we must keep in mind that the missing values are actually unknown and we are not looking to obtain the exact point estimates for them. Instead, we are trying to capture the fact that we do not know the true value and that the value could vary. This technique of having multiple imputed datasets containing different values helps in capturing this variability.
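Autoimpute handles this for you, but the pooling idea can be illustrated by hand. The sketch below (pure NumPy, illustrative data, not Autoimpute's internals) runs stochastic regression imputation three times to create three completed datasets, fits a line to each, and averages the coefficients across datasets, which is the point-estimate part of Rubin's rules:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 1.5 * x + rng.normal(scale=0.8, size=300)
y_obs = y.copy()
y_obs[rng.random(300) < 0.3] = np.nan   # ~30% of y is missing

miss = np.isnan(y_obs)
slope, intercept = np.polyfit(x[~miss], y_obs[~miss], 1)
resid_sd = np.std(y_obs[~miss] - (slope * x[~miss] + intercept))

# Stochastic regression imputation, repeated n times -> n completed datasets.
n = 3
pooled = []
for _ in range(n):
    y_imp = y_obs.copy()
    # Draw imputed values from the regression line plus noise, so each
    # completed dataset reflects our uncertainty about the missing values.
    y_imp[miss] = (slope * x[miss] + intercept
                   + rng.normal(scale=resid_sd, size=miss.sum()))
    # Analyse each completed dataset separately...
    pooled.append(np.polyfit(x, y_imp, 1))

# ...then pool: the point estimate is the mean across the n analyses.
slope_pooled, intercept_pooled = np.mean(pooled, axis=0)
```

The added noise is what distinguishes this from single imputation: the three datasets disagree slightly about the missing values, and that disagreement is what carries the extra uncertainty into the pooled model.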

Importing Libraries

We’ll start off by importing the required libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm, binom
import seaborn as sns
from autoimpute.utils import md_pattern, proportions
from autoimpute.visuals import plot_md_locations, plot_md_percent
from autoimpute.visuals import plot_imp_dists, plot_imp_boxplots
from autoimpute.visuals import plot_imp_swarm
from autoimpute.imputations import MultipleImputer

The complete code for this article can be downloaded from this repository:

Creating Dummy Dataset

For demonstration purposes, we create a dummy dataset with 1000 observations. The dataset contains two variables: predictor 'x' and response 'y'. Forty percent of the observations in 'y' are randomly replaced by missing values, while 'x' is fully observed. The correlation between 'x' and 'y' is approximately 0.8. A scatter plot of the data is shown below.

[Image: scatter plot of the dummy dataset]
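The article's exact code lives in the linked repository; as a sketch, a dataset with the properties described above could be generated like this (the construction and seed are my own assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# x and y jointly normal with correlation ~0.8.
x = rng.normal(size=n)
y = 0.8 * x + np.sqrt(1 - 0.8 ** 2) * rng.normal(size=n)
df = pd.DataFrame({'x': x, 'y': y})

# Randomly blank out 40% of y; x stays fully observed.
df.loc[rng.choice(n, size=int(0.4 * n), replace=False), 'y'] = np.nan
```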

#python #data-visualization #data-science #data-imputation #missing-data #data analysis