In this instance, K-Means is used to analyse market segment clusters for a hotel in Portugal.

This analysis is based on the original study by **Antonio, Almeida and Nunes** as cited in the References section below.

Given **lead time** (the period of time from when the customer makes their booking to when they actually stay at the hotel), along with **ADR** (average daily rate per customer), the k-means clustering algorithm is used to visually identify which market segments are most profitable for the hotel.

A customer with a high ADR and a low lead time is ideal: a high daily rate means a greater profit margin for the hotel, while a low lead time means that the customer pays for their booking sooner, which increases cash flow for the hotel in question.

The data is loaded and 100 samples are chosen at random:

```
import pandas as pd

df = pd.read_csv('H1full.csv')
df = df.sample(n=100)  # draw 100 rows at random
```

The interval (continuous) variables **lead time** and **ADR** are defined as below:

```
leadtime = df['LeadTime']
adr = df['ADR']
```

Categorical variables, in this case **market segment**, are encoded using `cat.codes`.

```
marketsegmentcat=df.MarketSegment.astype("category").cat.codes
marketsegmentcat=pd.Series(marketsegmentcat)
```

The purpose of this is to assign categorical codes to each market segment. For instance, here is a snippet of some of the market segment entries in the dataset:

```
10871 Online TA
7752 Online TA
35566 Offline TA/TO
1353 Online TA
17532 Online TA
...
1312 Online TA
10364 Groups
16113 Direct
23633 Online TA
23406 Direct
```

Upon applying `cat.codes`, here are the corresponding categories.

```
10871 4
7752 4
35566 3
1353 4
17532 4
..
1312 4
10364 2
16113 1
23633 4
23406 1
```

The market segment labels are as follows:

- **0** = Corporate
- **1** = Direct
- **2** = Groups
- **3** = Offline TA/TO
- **4** = Online TA
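To check which integer corresponds to which label, the mapping can be read back from the categorical itself. The sketch below uses a small made-up series as a stand-in for `df.MarketSegment`, so the values are illustrative only. pandas assigns codes in sorted order of the labels that are present, which is why the codes in a real sample depend on which segments appear in it.

```python
import pandas as pd

# Toy stand-in for df.MarketSegment (illustrative values, not from H1full.csv)
segments = pd.Series(
    ["Online TA", "Direct", "Groups", "Offline TA/TO", "Corporate", "Online TA"]
)

cat = segments.astype("category")
codes = cat.cat.codes  # one integer code per row

# Categories are stored in sorted order, so code i maps to categories[i]
code_to_label = dict(enumerate(cat.cat.categories))

print(code_to_label)
print(codes.tolist())
```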

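The lead time and ADR features are scaled using sklearn. The definition of `x1` is not shown here; judging by the output, it is the two-column array holding lead time and ADR: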

```
from sklearn.preprocessing import scale
X = scale(x1)
```

Here is a sample of X:

```
array([[ 1.07577693, -1.01441847],
[-0.75329711, 2.25432473],
[-0.60321924, -0.80994917],
[-0.20926483, 0.26328418],
[ 0.53174465, -0.40967609],
[-0.82833604, 0.40156369],
[-0.89399511, -1.01810593],
[ 0.59740372, 1.40823851],
[-0.89399511, -1.16560407],
```
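By default, `scale` standardizes each column to zero mean and unit variance. As a minimal sketch of what that involves, the same transformation can be reproduced with NumPy. The two-column array below (lead time, ADR) is hypothetical and its numbers are made up for illustration.

```python
import numpy as np

# Hypothetical lead time (days) and ADR values; not taken from the dataset
x1 = np.array([[10.0, 120.0],
               [200.0, 60.0],
               [45.0, 95.0],
               [3.0, 150.0]])

# Column-wise standardization: subtract the mean and divide by the
# population standard deviation (ddof=0), matching scale's default behaviour
X = (x1 - x1.mean(axis=0)) / x1.std(axis=0)

print(X.mean(axis=0))  # approximately 0 for each column
print(X.std(axis=0))   # approximately 1 for each column
```

Standardizing matters for k-means because the algorithm relies on distances: without it, the feature with the larger numeric range would dominate the clustering.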

When it comes to choosing the number of clusters, one possible solution is to use what is called the **elbow method**. Here is an example of an elbow curve:

This is a technique whereby the within-cluster variance is calculated for each candidate number of clusters; the lower the variance, the tighter the clusters.

In this regard, as the score starts to flatten out, this means that the reduction in variance becomes less and less as we increase the number of clusters, which allows us to determine the ideal value for **k**.
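As a rough sketch of how such a curve can be computed, the example below fits scikit-learn's `KMeans` for several values of k and records `inertia_`, the within-cluster sum of squared distances. The data is synthetic (three made-up blobs), and the range of k and the `n_init` and `random_state` settings are choices made for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic 2-D data: three well-separated blobs (illustrative only)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit k-means for k = 1..6 and record the within-cluster sum of squares
ks = range(1, 7)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` gives the elbow curve. On data like this the drop is steep up to k=3 and small afterwards, and that bend is what the method looks for.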

However, this technique is not necessarily suitable for smaller clusters. Moreover, we already know the number of clusters (k=5) that we wish to define, as we already know the number of market segments that we wish to analyse.

Additionally, while k-means clustering methods may also use PCA (Principal Component Analysis) to reduce the number of features, this is not appropriate in this case as the only two features being used (apart from market segment) are **ADR** and **lead time**.

SciPy is an open-source Python library for mathematical and scientific computing. Its many sub-packages extend its functionality, and it is widely used for data interpretation. One of its uses is separating a data set into clusters, whether a single cluster or several. We first generate the data set and then perform clustering on it. Let us learn more about SciPy clusters.

K-means is a method that can be employed to determine clusters and their centers. We can use this process on the raw data set. A cluster is a group of points that are closer to one another than to points outside the cluster. Given an initial set of k centers, the k-means method operates in two steps:

- We assign each data point to a cluster center, so that every point belongs to the center it is closest to.
- We then calculate the mean of the data points in each cluster. That mean becomes the new cluster center.

The process iterates until the centers stop changing, at which point they are fixed as the final cluster centers. SciPy provides an implementation of this process.
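The two steps and the stopping rule can be sketched in a few lines of NumPy. This is a bare-bones illustration, not SciPy's own implementation (which lives in `scipy.cluster.vq`). The function name `kmeans`, its parameters, and the random choice of initial centers are assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Bare-bones k-means: assign points, recompute means, repeat."""
    rng = np.random.default_rng(seed)
    # Initial centers: k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the mean of its assigned points
        # (a center with no points keeps its previous position)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Usage on two made-up blobs around (0, 0) and (4, 4)
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(30, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.3, size=(30, 2)),
])
centers, labels = kmeans(points, k=2)
print(np.round(centers, 1))
```

The result depends on the initial centers, which is why library implementations typically offer several restarts or a smarter initialization.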

This article provides an overview of core data science algorithms used in statistical data analysis, specifically k-means and k-medoids clustering.

Clustering is one of the major techniques used for statistical data analysis.

As the term suggests, “clustering” is the process of gathering similar objects into groups, that is, dividing a dataset into subsets according to a defined distance measure.

*K-means* clustering is touted as a foundational algorithm every data scientist ought to have in their toolbox. The popularity of the algorithm in the data science industry is due to its extraordinary features:

- Simplicity
- Speed
- Efficiency

Clustering falls under the topic of data mining. It is an active area of research, and many clustering algorithms exist.

The following are the main types of clustering algorithms.

- *K-Means*
- *Hierarchical clustering*
- *DBSCAN*

The following are some of the applications of clustering:

- Customer Segmentation: This is one of the most important use-cases of clustering in the sales and marketing domain. Here the aim is to group customers based on some similarities so that a business can come up with different action items for the people in each group. One example could be Amazon giving different offers to different people based on their buying patterns.
- Image Segmentation: Clustering is used in image segmentation, where similar image pixels are grouped together so that the pixels of each object in the image form their own group.

I consider myself an active StackOverflow user, although my activity tends to vary depending on my daily workload. I enjoy answering questions with the angular tag and I always try to create a working example to prove the correctness of my answers.

To create an Angular demo I usually use plunker, stackblitz or even jsfiddle. I like all of them, but when I run into errors I want a more usable tool to understand what’s going on.

Many people who ask questions on StackOverflow don’t want to isolate the problem and prepare a minimal reproduction, so they usually post all of their code in the question. They also tend to be inaccurate and make a lot of mistakes in template syntax. To avoid wasting time investigating where an error comes from, I tried to create a tool that helps me quickly find what causes the problem.

Angular demo runner: an online Angular editor for building demos (ng-run.com)

Let me show what I mean…

There are template parser errors that can easily be caught by stackblitz.

It gives me some information, but I want the error to be highlighted.

