Gap Statistic: Dynamically Get The Suggested Clusters in The Data

Python implementation of the Gap Statistic


Dynamically identify the suggested number of clusters in a data-set using the gap statistic.

Full example available in a notebook HERE


Bleeding edge:

pip install git+git://


pip install --upgrade gap-stat

With Rust extension:

pip install --upgrade gap-stat[rust]


pip uninstall gap-stat


This package provides several methods to assist in choosing the optimal number of clusters for a given dataset, based on the Gap method presented in "Estimating the number of clusters in a data set via the gap statistic" (Tibshirani et al.).

The methods implemented can cluster a given dataset using a range of provided k values, and provide you with statistics that can help in choosing the right number of clusters for your dataset. Three possible methods are:

  • Taking the k maximizing the Gap value, which is calculated for each k. This, however, might not always be possible, as for many datasets this value is monotonically increasing or decreasing.
  • Taking the smallest k such that Gap(k) >= Gap(k+1) - s(k+1). This is the method suggested in Tibshirani et al. (consult the paper for details). The measure diff = Gap(k) - Gap(k+1) + s(k+1) is calculated for each k; the parallel here, then, is to take the smallest k for which diff is positive. Note that in some cases this can be true for the entire range of k.
  • Taking the k maximizing the Gap* value, an alternative measure suggested in "A comparison of Gap statistic definitions with and with-out logarithm function" by Mohajer, Englmeier and Schmid. The authors claim this measure avoids the over-estimation of the number of clusters from which the original Gap statistics suffers, and can also suggest an optimal value for k for cases in which Gap cannot. They do warn, however, that the original Gap statistic performs better than Gap* in the case of overlapped clusters, due to its tendency to overestimate the number of clusters.

Note that none of the above methods is guaranteed to find an optimal value for k, and that they often contradict one another. Rather, they can provide more information on which to base your choice of k, which should take numerous other factors into account


First, construct an OptimalK object. Optional intialization parameters are:

  • n_jobs - Splits computation into this number of parallel jobs. Requires choosing a parallel backend.
  • parallel_backend - Possible values are joblib, rust or multiprocessing for the built-in Python backend. If parallel_backend == 'rust' it will use all cores.
  • clusterer - Takes a custom clusterer function to be used when clustering. See the example notebook for more details.
  • clusterer_kwargs - Any keyword arguments to be forwarded to the custom clusterer function on each call.

An example intialization:

optimalK = OptimalK(n_jobs=4, parallel_backend='joblib')

After the object is created, it can be called like a function, and provided with a dataset for which the optimal K is found and returned. Parameters are:

  • X - A pandas dataframe or numpy array of data points of shape (n_samples, n_features).
  • n_refs - The number of random reference data sets to use as inertia reference to actual data. Optional.
  • cluster_array - A 1-dimensional iterable of integers; each representing n_clusters to try on the data. Optional.

For example:

import numpy as np
n_clusters = optimalK(X, cluster_array=np.arange(1, 15))

After performing the search procedure, a DataFrame of gap values and other usefull statistics for each passed cluster count is now available as the gap_df attributre of the OptimalK object:


The columns of the dataframe are:

  • n_clusters - The number of clusters for which the statistics in this row were calculated.
  • gap_value - The Gap value for this n.
  • gap* - The Gap* value for this n.
  • ref_dispersion_std - The standard deviation of the reference distributions for this n.
  • sk - The standard error of the Gap statistic for this n.
  • sk* - The standard error of the Gap* statistic for this n.
  • diff - The diff value for this n (see the methodology section for details).
  • diff* - The diff* value for this n (corresponding to the diff value for Gap*).

Additionally, the relation between the above measures and the number of clusters can be plotted by calling the OptimalK.plot_results() method (meant to be used inside a Jupyter Notebook or a similar IPython-based notebook), which prints four plots:

  • A plot of the Gap value versus n, the number of clusters.
  • A plot of diff versus n.
  • A plot of the Gap* value versus n, the number of clusters.
  • A plot of the diff* value versus n.

Download Details:
Author: milesgranger
Source Code:
License: View license

#rust  #machine-learning #python #statistics 

What is GEEK

Buddha Community

Gap Statistic: Dynamically Get The Suggested Clusters in The Data
Shubham Ankit

Shubham Ankit


How to Automate Excel with Python | Python Excel Tutorial (OpenPyXL)

How to Automate Excel with Python

In this article, We will show how we can use python to automate Excel . A useful Python library is Openpyxl which we will learn to do Excel Automation


Openpyxl is a Python library that is used to read from an Excel file or write to an Excel file. Data scientists use Openpyxl for data analysis, data copying, data mining, drawing charts, styling sheets, adding formulas, and more.

Workbook: A spreadsheet is represented as a workbook in openpyxl. A workbook consists of one or more sheets.

Sheet: A sheet is a single page composed of cells for organizing data.

Cell: The intersection of a row and a column is called a cell. Usually represented by A1, B5, etc.

Row: A row is a horizontal line represented by a number (1,2, etc.).

Column: A column is a vertical line represented by a capital letter (A, B, etc.).

Openpyxl can be installed using the pip command and it is recommended to install it in a virtual environment.

pip install openpyxl


We start by creating a new spreadsheet, which is called a workbook in Openpyxl. We import the workbook module from Openpyxl and use the function Workbook() which creates a new workbook.

from openpyxl
import Workbook
#creates a new workbook
wb = Workbook()
#Gets the first active worksheet
ws =
#creating new worksheets by using the create_sheet method

ws1 = wb.create_sheet("sheet1", 0) #inserts at first position
ws2 = wb.create_sheet("sheet2") #inserts at last position
ws3 = wb.create_sheet("sheet3", -1) #inserts at penultimate position

#Renaming the sheet
ws.title = "Example"

#save the workbook = "example.xlsx")


We load the file using the function load_Workbook() which takes the filename as an argument. The file must be saved in the same working directory.

#loading a workbook
wb = openpyxl.load_workbook("example.xlsx")




#getting sheet names
result = ['sheet1', 'Sheet', 'sheet3', 'sheet2']

#getting a particular sheet
sheet1 = wb["sheet2"]

#getting sheet title
result = 'sheet2'

#Getting the active sheet
sheetactive =
result = 'sheet1'




#get a cell from the sheet
sheet1["A1"] <
  Cell 'Sheet1'.A1 >

  #get the cell value
ws["A1"].value 'Segment'

#accessing cell using row and column and assigning a value
d = ws.cell(row = 4, column = 2, value = 10)




#looping through each row and column
for x in range(1, 5):
  for y in range(1, 5):
  print(x, y, ws.cell(row = x, column = y)

#getting the highest row number

#getting the highest column number

There are two functions for iterating through rows and columns.

Iter_rows() => returns the rows
Iter_cols() => returns the columns {
  min_row = 4, max_row = 5, min_col = 2, max_col = 5
} => This can be used to set the boundaries
for any iteration.


#iterating rows
for row in ws.iter_rows(min_row = 2, max_col = 3, max_row = 3):
  for cell in row:
  print(cell) <
  Cell 'Sheet1'.A2 >
  Cell 'Sheet1'.B2 >
  Cell 'Sheet1'.C2 >
  Cell 'Sheet1'.A3 >
  Cell 'Sheet1'.B3 >
  Cell 'Sheet1'.C3 >

  #iterating columns
for col in ws.iter_cols(min_row = 2, max_col = 3, max_row = 3):
  for cell in col:
  print(cell) <
  Cell 'Sheet1'.A2 >
  Cell 'Sheet1'.A3 >
  Cell 'Sheet1'.B2 >
  Cell 'Sheet1'.B3 >
  Cell 'Sheet1'.C2 >
  Cell 'Sheet1'.C3 >

To get all the rows of the worksheet we use the method worksheet.rows and to get all the columns of the worksheet we use the method worksheet.columns. Similarly, to iterate only through the values we use the method worksheet.values.


for row in ws.values:
  for value in row:



Writing to a workbook can be done in many ways such as adding a formula, adding charts, images, updating cell values, inserting rows and columns, etc… We will discuss each of these with an example.




#creates a new workbook
wb = openpyxl.Workbook()

#saving the workbook"new.xlsx")




#creating a new sheet
ws1 = wb.create_sheet(title = "sheet 2")

#creating a new sheet at index 0
ws2 = wb.create_sheet(index = 0, title = "sheet 0")

#checking the sheet names
wb.sheetnames['sheet 0', 'Sheet', 'sheet 2']

#deleting a sheet
del wb['sheet 0']

#checking sheetnames
wb.sheetnames['Sheet', 'sheet 2']




#checking the sheet value

#adding value to cell
ws['B2'] = 367

#checking value




We often require formulas to be included in our Excel datasheet. We can easily add formulas using the Openpyxl module just like you add values to a cell.

For example:

import openpyxl
from openpyxl
import Workbook

wb = openpyxl.load_workbook("new1.xlsx")
ws = wb['Sheet']

ws['A9'] = '=SUM(A2:A8)'"new2.xlsx")

The above program will add the formula (=SUM(A2:A8)) in cell A9. The result will be as below.




Two or more cells can be merged to a rectangular area using the method merge_cells(), and similarly, they can be unmerged using the method unmerge_cells().

For example:
Merge cells

#merge cells B2 to C9
ws['B2'] = "Merged cells"

Adding the above code to the previous example will merge cells as below.




#unmerge cells B2 to C9

The above code will unmerge cells from B2 to C9.


To insert an image we import the image function from the module openpyxl.drawing.image. We then load our image and add it to the cell as shown in the below example.


import openpyxl
from openpyxl
import Workbook
from openpyxl.drawing.image
import Image

wb = openpyxl.load_workbook("new1.xlsx")
ws = wb['Sheet']
#loading the image(should be in same folder)
img = Image('logo.png')
ws['A1'] = "Adding image"
#adjusting size
img.height = 130
img.width = 200
#adding img to cell A3

ws.add_image(img, 'A3')"new2.xlsx")




Charts are essential to show a visualization of data. We can create charts from Excel data using the Openpyxl module chart. Different forms of charts such as line charts, bar charts, 3D line charts, etc., can be created. We need to create a reference that contains the data to be used for the chart, which is nothing but a selection of cells (rows and columns). I am using sample data to create a 3D bar chart in the below example:


import openpyxl
from openpyxl
import Workbook
from openpyxl.chart
import BarChart3D, Reference, series

wb = openpyxl.load_workbook("example.xlsx")
ws =

values = Reference(ws, min_col = 3, min_row = 2, max_col = 3, max_row = 40)
chart = BarChart3D()
ws.add_chart(chart, "E3")"MyChart.xlsx")


How to Automate Excel with Python with Video Tutorial

Welcome to another video! In this video, We will cover how we can use python to automate Excel. I'll be going over everything from creating workbooks to accessing individual cells and stylizing cells. There is a ton of things that you can do with Excel but I'll just be covering the core/base things in OpenPyXl.

⭐️ Timestamps ⭐️
00:00 | Introduction
02:14 | Installing openpyxl
03:19 | Testing Installation
04:25 | Loading an Existing Workbook
06:46 | Accessing Worksheets
07:37 | Accessing Cell Values
08:58 | Saving Workbooks
09:52 | Creating, Listing and Changing Sheets
11:50 | Creating a New Workbook
12:39 | Adding/Appending Rows
14:26 | Accessing Multiple Cells
20:46 | Merging Cells
22:27 | Inserting and Deleting Rows
23:35 | Inserting and Deleting Columns
24:48 | Copying and Moving Cells
26:06 | Practical Example, Formulas & Cell Styling

📄 Resources 📄
OpenPyXL Docs: 
Code Written in This Tutorial: 


 iOS App Dev

iOS App Dev


Your Data Architecture: Simple Best Practices for Your Data Strategy

If you accumulate data on which you base your decision-making as an organization, you should probably think about your data architecture and possible best practices.

If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.

In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points from our checklist for what we perceive to be an anticipatory analytics ecosystem.

#big data #data science #big data analytics #data analysis #data architecture #data transformation #data platform #data strategy #cloud data platform #data acquisition

Gerhard  Brink

Gerhard Brink


Getting Started With Data Lakes

Frameworks for Efficient Enterprise Analytics

The opportunities big data offers also come with very real challenges that many organizations are facing today. Often, it’s finding the most cost-effective, scalable way to store and process boundless volumes of data in multiple formats that come from a growing number of sources. Then organizations need the analytical capabilities and flexibility to turn this data into insights that can meet their specific business objectives.

This Refcard dives into how a data lake helps tackle these challenges at both ends — from its enhanced architecture that’s designed for efficient data ingestion, storage, and management to its advanced analytics functionality and performance flexibility. You’ll also explore key benefits and common use cases.


As technology continues to evolve with new data sources, such as IoT sensors and social media churning out large volumes of data, there has never been a better time to discuss the possibilities and challenges of managing such data for varying analytical insights. In this Refcard, we dig deep into how data lakes solve the problem of storing and processing enormous amounts of data. While doing so, we also explore the benefits of data lakes, their use cases, and how they differ from data warehouses (DWHs).

This is a preview of the Getting Started With Data Lakes Refcard. To read the entire Refcard, please download the PDF from the link above.

#big data #data analytics #data analysis #business analytics #data warehouse #data storage #data lake #data lake architecture #data lake governance #data lake management

Cyrus  Kreiger

Cyrus Kreiger


How Has COVID-19 Impacted Data Science?

The COVID-19 pandemic disrupted supply chains and brought economies around the world to a standstill. In turn, businesses need access to accurate, timely data more than ever before. As a result, the demand for data analytics is skyrocketing as businesses try to navigate an uncertain future. However, the sudden surge in demand comes with its own set of challenges.

Here is how the COVID-19 pandemic is affecting the data industry and how enterprises can prepare for the data challenges to come in 2021 and beyond.

#big data #data #data analysis #data security #data integration #etl #data warehouse #data breach #elt

Macey  Kling

Macey Kling


Applications Of Data Science On 3D Imagery Data

CVDC 2020, the Computer Vision conference of the year, is scheduled for 13th and 14th of August to bring together the leading experts on Computer Vision from around the world. Organised by the Association of Data Scientists (ADaSCi), the premier global professional body of data science and machine learning professionals, it is a first-of-its-kind virtual conference on Computer Vision.

The second day of the conference started with quite an informative talk on the current pandemic situation. Speaking of talks, the second session “Application of Data Science Algorithms on 3D Imagery Data” was presented by Ramana M, who is the Principal Data Scientist in Analytics at Cyient Ltd.

Ramana talked about one of the most important assets of organisations, data and how the digital world is moving from using 2D data to 3D data for highly accurate information along with realistic user experiences.

The agenda of the talk included an introduction to 3D data, its applications and case studies, 3D data alignment, 3D data for object detection and two general case studies, which are-

  • Industrial metrology for quality assurance.
  • 3d object detection and its volumetric analysis.

This talk discussed the recent advances in 3D data processing, feature extraction methods, object type detection, object segmentation, and object measurements in different body cross-sections. It also covered the 3D imagery concepts, the various algorithms for faster data processing on the GPU environment, and the application of deep learning techniques for object detection and segmentation.

#developers corner #3d data #3d data alignment #applications of data science on 3d imagery data #computer vision #cvdc 2020 #deep learning techniques for 3d data #mesh data #point cloud data #uav data