1676790120
scikit-multilearn is a Python module capable of performing multi-label learning tasks. It is built on-top of various scientific Python packages (numpy, scipy) and follows a similar API to that of scikit-learn.
Native Python implementation. A native Python implementation for a variety of multi-label classification algorithms. To see the list of all supported classifiers, check this link.
Interface to Meka. A Meka wrapper class is implemented for reference purposes and integration. This provides access to all methods available in MEKA, MULAN, and WEKA — the reference standard in the field.
Builds upon giants! Team-up with the power of numpy and scikit. You can use scikit-learn's base classifiers as scikit-multilearn's classifiers. In addition, the two packages follow a similar API.
In most cases you will want to follow the requirements defined in the requirements/*.txt files in the package.
scipy
numpy
future
scikit-learn
liac-arff # for loading ARFF files
requests # for dataset module
networkx # for networkX base community detection clusterers
python-louvain # for networkX base community detection clusterers
keras
python-igraph # for igraph library based clusterers
python-graphtool # for graphtool base clusterers
Note: Installing graphtool is complicated, please see: graphtool install instructions
To install scikit-multilearn, simply type the following command:
$ pip install scikit-multilearn
This will install the latest release from the Python package index. If you wish to install the bleeding-edge version, then clone this repository and run setup.py:
$ git clone https://github.com/scikit-multilearn/scikit-multilearn.git
$ cd scikit-multilearn
$ python setup.py install
Before proceeding to classification, this library assumes that you have a dataset with the following matrices:
X_train, X_test: training and test feature matrices of size (n_samples, n_features)
y_train, y_test: training and test label matrices of size (n_samples, n_labels)
Suppose we want to use a problem-transformation method called Binary Relevance, which treats each label as a separate single-label classification problem, with a support vector machine (SVM) as the base classifier. We simply perform the following tasks:
# Import BinaryRelevance from skmultilearn
from skmultilearn.problem_transform import BinaryRelevance
# Import SVC classifier from sklearn
from sklearn.svm import SVC
# Setup the classifier
classifier = BinaryRelevance(classifier=SVC(), require_dense=[False,True])
# Train
classifier.fit(X_train, y_train)
# Predict
y_pred = classifier.predict(X_test)
More examples and use-cases can be seen in the documentation. For using the MEKA wrapper, check this link.
This project is open for contributions. Here are some of the ways for you to contribute:
In case you want to implement your own multi-label classifier, please read our Developer's Guide to help you integrate your implementation in our API.
To make a contribution, just fork this repository, push the changes in your fork, open up an issue, and make a Pull Request!
We're also available in Slack! Just go to our slack group.
If you used scikit-multilearn in your research or project, please cite our work:
@ARTICLE{2017arXiv170201460S,
author = {{Szyma{\'n}ski}, P. and {Kajdanowicz}, T.},
title = "{A scikit-based Python environment for performing multi-label classification}",
journal = {ArXiv e-prints},
archivePrefix = "arXiv",
eprint = {1702.01460},
year = 2017,
month = feb
}
Author: Scikit-multilearn
Source Code: https://github.com/scikit-multilearn/scikit-multilearn
License: BSD-2-Clause license
#machinelearning #python #clustering #scikitlearn #classification
1669975140
Identifying the Unknown With Clustering Metrics
Clustering in machine learning has a variety of applications, but how do you know which algorithm is best suited to your data? Here’s how to amplify your data insights with comparison metrics, including the F-measure.
Clustering is an unsupervised machine learning method to divide given data into groups based solely on the features of each sample. Sorting data into clusters can help identify unknown similarities between samples or reveal outliers in the data set. In the real world, clustering has significance across diverse fields from marketing to biology: Clustering applications include market segmentation, social network analysis, and diagnostic medical imaging.
Because this process is unsupervised, multiple clustering results can form around different features. For example, imagine you have a data set composed of various images of red trousers, black trousers, red shirts, and black shirts. One algorithm might find clusters based on clothing shape, while another might create groups based on color.
When analyzing a data set, we need a way to accurately measure the performance of different clustering algorithms; we may want to contrast the solutions of two algorithms, or see how close a clustering result is to an expected solution. In this article, we will explore some of the metrics that can be used for comparing different clustering results obtained from the same data.
Let’s define an example data set that we will use to explain various clustering metric concepts and examine what kinds of clusters it might produce.
First, a few common notations and terms:
Clustering results can vary based not only on sorting features but also the total number of clusters. The result depends on the algorithm, its sensitivity to small perturbations, the model’s parameters, and the data’s features. Using our previously mentioned data set of black and red trousers and shirts, there are a variety of clustering results that might be produced from different algorithms.
To distinguish between a general clustering $C$ and our example clusterings, we will use a lowercase $c$ to describe our example clusterings:
Additional clusterings might include more than four clusters based on different features, such as whether a shirt is sleeveless or sleeved.
As seen in our example, a clustering method divides all the samples in a data set into non-empty disjoint subsets. In clustering $c$, there is no image that belongs to both the trouser subset and the shirt subset: $c_1 \cap c_2 = \emptyset$. This concept can be extended; no two subsets of any clustering share a sample.
Most criteria for comparing clusterings can be described using the confusion matrix of the pair $C, C'$. The confusion matrix is a $K \times K'$ matrix whose $kk'$th element (the element in the $k$th row and $k'$th column) is the number of samples in the intersection of cluster $C_k$ of $C$ and cluster $C'_{k'}$ of $C'$:
$$n_{kk'} = |C_k \cap C'_{k'}|$$
We’ll break this down using our simplified black and red trousers and shirts example, assuming that data set $D$ has 100 red trousers, 200 black trousers, 200 red shirts, and 300 black shirts. Let’s examine the confusion matrix of $c$ and $c''$:
Since $K = 2$ and $K'' = 4$, this is a $2 \times 4$ matrix. Let’s choose $k = 2$ and $k'' = 3$. We see that element $n_{kk''} = n_{23} = 200$. This means that the intersection of $c_2$ (shirts) and $c''_3$ (red shirts) contains 200 samples, which is correct since $c_2 \cap c''_3$ is simply the set of red shirts.
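As a minimal sketch of this example, the confusion matrix can be built by counting label pairs. The label strings, the helper name `confusion_matrix`, and the ordering of the four clusters in $c''$ (red trousers, black trousers, red shirts, black shirts) are assumptions made for illustration; the ordering is chosen only so that it agrees with $n_{23} = 200$.

```python
from collections import Counter

def confusion_matrix(labels_a, labels_b, order_a, order_b):
    """Count the samples in each intersection of a cluster of A with a cluster of B."""
    counts = Counter(zip(labels_a, labels_b))
    return [[counts[(a, b)] for b in order_b] for a in order_a]

# Example data set D: (garment, color) -> number of images.
items = {
    ("trousers", "red"): 100,
    ("trousers", "black"): 200,
    ("shirt", "red"): 200,
    ("shirt", "black"): 300,
}

labels_c, labels_c2 = [], []
for (garment, color), count in items.items():
    labels_c += [garment] * count                # clustering c: by shape only
    labels_c2 += [f"{color} {garment}"] * count  # clustering c'': by shape and color

order_c = ["trousers", "shirt"]
# Assumed cluster order for c'' (hypothetical, consistent with n_23 = 200).
order_c2 = ["red trousers", "black trousers", "red shirt", "black shirt"]

matrix = confusion_matrix(labels_c, labels_c2, order_c, order_c2)
print(matrix)  # [[100, 200, 0, 0], [0, 0, 200, 300]]
```

Rows correspond to the clusters of $c$ and columns to the clusters of $c''$, so `matrix[1][2]` is the element $n_{23}$.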
Clustering metrics can be broadly categorized into three groups based on the underlying cluster comparison method:
In this article, we only touch on a few of many metrics available, but our examples will serve to help define the three clustering metric groups.
Pair-counting requires examining all pairs of samples, then counting pairs where the clusterings agree and disagree. Each pair of samples can belong to one of four sets, where the set element counts ($N_{ij}$) are obtained from the confusion matrix:
The Rand index is defined as $(N_{00} + N_{11}) / (n(n-1)/2)$, where $n$ represents the number of samples; it can also be read as (number of similarly treated pairs)/(total number of pairs). Although theoretically its value ranges between 0 and 1, its range is often much narrower in practice. A higher value means more similarity between the clusterings. (A Rand index of 1 would represent a perfect match where two clusterings have identical clusters.)
One limitation of the Rand index is its behavior when the number of clusters increases to approach the number of elements; in this case, it converges toward 1, creating challenges in accurately measuring clustering similarity. Several improved or modified versions of the Rand index have been introduced to address this issue. One variation is the adjusted Rand index; however, it assumes that two clusterings are drawn randomly with a fixed number of clusters and cluster elements.
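The Rand index can be sketched directly from a confusion matrix. The code below assumes the usual convention that $N_{11}$ counts pairs placed in the same cluster by both clusterings and $N_{00}$ counts pairs separated by both; the function name `rand_index` and the example matrix (trousers and shirts versus the four shape-and-color clusters) are illustrative assumptions.

```python
from math import comb

def rand_index(matrix):
    """Rand index of two clusterings given their confusion matrix (list of rows)."""
    n = sum(map(sum, matrix))
    total_pairs = comb(n, 2)  # n(n-1)/2
    # N11: pairs that share a cluster in both clusterings.
    same_both = sum(comb(v, 2) for row in matrix for v in row)
    # Pairs that share a cluster in the first / second clustering.
    same_in_a = sum(comb(sum(row), 2) for row in matrix)
    same_in_b = sum(comb(sum(col), 2) for col in zip(*matrix))
    # N00: pairs separated by both clusterings (inclusion-exclusion).
    diff_both = total_pairs - same_in_a - same_in_b + same_both
    return (same_both + diff_both) / total_pairs

matrix = [[100, 200, 0, 0], [0, 0, 200, 300]]
print(round(rand_index(matrix), 4))  # 0.7497
```

Here $N_{11} = 89{,}600$ and $N_{00} = 150{,}000$ out of $319{,}600$ pairs, so the two clusterings treat roughly three quarters of all pairs the same way.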
These metrics are based on generic notions of information theory. We will discuss two of them: entropy and mutual information (MI).
Entropy describes how much information there is in a clustering. If the entropy associated with a clustering is 0, then there is no uncertainty about the cluster of a randomly picked sample, which is true when there is only one cluster.
MI describes how much information one clustering gives about the other. MI can indicate how much knowing the cluster of a sample in $C$ reduces the uncertainty about the cluster of the sample in $C'$.
Normalized mutual information is MI that is normalized by the geometric or arithmetic mean of the entropies of clusterings. Standard MI is not bound by a constant value, so normalized mutual information provides a more interpretable clustering metric.
Another popular metric in this category is variation of information (VI), which depends on both the entropy and MI of clusterings. Let $H(C)$ be the entropy of a clustering and $I(C, C')$ be the MI between two clusterings. VI between two clusterings can be defined as $VI(C, C') = H(C) + H(C') - 2I(C, C')$. A VI of 0 represents a perfect match between two clusterings.
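To make these definitions concrete, here is a small sketch that computes entropy, MI, and VI from a confusion matrix. It assumes cluster probabilities are estimated as relative cluster sizes and uses natural logarithms; the function names and the example matrix are illustrative, not taken from any library.

```python
from math import log

def entropy(sizes):
    """Entropy of a clustering from its cluster sizes (natural log)."""
    n = sum(sizes)
    return -sum(s / n * log(s / n) for s in sizes if s > 0)

def mutual_information(matrix):
    """MI between two clusterings from their confusion matrix."""
    n = sum(map(sum, matrix))
    rows = [sum(row) for row in matrix]
    cols = [sum(col) for col in zip(*matrix)]
    mi = 0.0
    for k, row in enumerate(matrix):
        for k2, v in enumerate(row):
            if v > 0:  # empty intersections contribute nothing
                mi += v / n * log(v * n / (rows[k] * cols[k2]))
    return mi

def variation_of_information(matrix):
    """VI(C, C') = H(C) + H(C') - 2 I(C, C')."""
    rows = [sum(row) for row in matrix]
    cols = [sum(col) for col in zip(*matrix)]
    return entropy(rows) + entropy(cols) - 2 * mutual_information(matrix)

matrix = [[100, 200, 0, 0], [0, 0, 200, 300]]
print(entropy([300, 500]), mutual_information(matrix), variation_of_information(matrix))
```

In this example the second clustering refines the first, so knowing a sample's shape-and-color cluster fully determines its shape cluster and the MI equals the entropy of the two-cluster solution.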
Set overlap metrics involve determining the best match for clusters in $C$ with clusters in $C'$ based on maximum overlap between the clusters. For all metrics in this category, a 1 means the clusterings are identical.
The maximum matching measure scans the confusion matrix in decreasing order and matches the largest entry of the confusion matrix first. It then removes the matched clusters and repeats the process sequentially until the clusters are exhausted.
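The greedy procedure described above can be sketched as follows. Normalizing the matched overlap by the total number of samples (so that identical clusterings score 1) is an assumption of this sketch, as are the function name `maximum_matching` and the example matrix.

```python
def maximum_matching(matrix):
    """Greedily match clusters by largest overlap; return the matches and a score."""
    n = sum(map(sum, matrix))
    # All (overlap, row, column) entries, largest overlap first.
    entries = sorted(
        ((v, k, k2) for k, row in enumerate(matrix) for k2, v in enumerate(row)),
        reverse=True,
    )
    used_rows, used_cols, matches = set(), set(), []
    for v, k, k2 in entries:
        if v > 0 and k not in used_rows and k2 not in used_cols:
            matches.append((k, k2, v))  # match cluster k with cluster k2
            used_rows.add(k)
            used_cols.add(k2)
    # Assumed normalization: fraction of samples covered by matched clusters.
    score = sum(v for _, _, v in matches) / n
    return matches, score

matrix = [[100, 200, 0, 0], [0, 0, 200, 300]]
matches, score = maximum_matching(matrix)
print(matches, score)  # [(1, 3, 300), (0, 1, 200)] 0.625
```

With zero-based indices, shirts are matched to black shirts first (overlap 300), then trousers to black trousers (overlap 200), covering 500 of the 800 samples.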
The F-measure is another set overlap metric. Unlike the maximum matching measure, the F-measure is frequently used to compare a clustering to an optimal solution, instead of comparing two clusterings.
Because of the F-measure’s common use in machine learning models and important applications such as search engines, we’ll explore the F-measure in more detail with an example.
Let’s assume that $C$ is our ground truth, or optimal, solution. For any $k$th cluster in $C$, where $k \in [1, K]$, we’ll calculate an individual F-measure with every cluster in clustering result $C'$. This individual F-measure indicates how well the cluster $C'_{k'}$ describes the cluster $C_k$ and can be determined through the precision and recall (two model evaluation metrics) for these clusters. Let’s define $I_{kk'}$ as the number of elements in the intersection of $C$’s $k$th cluster and $C'$’s $k'$th cluster, and $|C_k|$ as the number of elements in the $k$th cluster.
Precision $p = \frac{I_{kk'}}{|C'_{k'}|}$
Recall $r = \frac{I_{kk'}}{|C_k|}$
Then, the individual F-measure of the $k$th and $k'$th cluster can be calculated as the harmonic mean of the precision and recall for these clusters:
$$F_{kk'} = \frac{2rp}{r + p} = \frac{2 I_{kk'}}{|C_k| + |C'_{k'}|}$$
Now, to compare $C$ and $C'$, let’s look at the overall F-measure. First, we will create a matrix similar to a contingency table whose values are the individual F-measures of the clusters. Let’s assume that we’ve mapped $C$’s clusters as rows of a table and $C'$’s clusters as columns, with table values corresponding to individual F-measures. Identify the cluster pair with the maximum individual F-measure, and remove the row and column corresponding to these clusters. Repeat this until the clusters are exhausted. Finally, we can define the overall F-measure:
$$F(C, C') = \frac{1}{n} \sum_{i=1}^{K} n_i \max_{j \in \{1, \dots, K'\}} F(C_i, C'_j)$$
Here, $n_i = |C_i|$ is the number of elements in the $i$th cluster of $C$. As you can see, the overall F-measure is the weighted sum of our maximum individual F-measures for the clusters.
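As a rough sketch, the overall F-measure can be computed from a confusion matrix whose rows are the ground-truth clusters. This version follows the formula literally, taking the maximum over all clusters of the result for each row rather than removing matched rows and columns; the function name `overall_f_measure` and the example matrix (shapes as ground truth, the four shape-and-color clusters as the result) are assumptions for illustration.

```python
def overall_f_measure(matrix):
    """Overall F-measure; rows are ground-truth clusters, columns are result clusters."""
    n = sum(map(sum, matrix))
    rows = [sum(row) for row in matrix]         # |C_k|
    cols = [sum(col) for col in zip(*matrix)]   # |C'_k'|
    total = 0.0
    for k, row in enumerate(matrix):
        # Individual F-measure: 2 * I_kk' / (|C_k| + |C'_k'|), maximized over k'.
        best = max(2 * v / (rows[k] + cols[k2]) for k2, v in enumerate(row))
        total += rows[k] * best                 # weight by cluster size n_k
    return total / n

matrix = [[100, 200, 0, 0], [0, 0, 200, 300]]
print(round(overall_f_measure(matrix), 5))  # 0.76875
```

For the trousers cluster the best individual F-measure is 0.8 (black trousers), and for the shirts cluster it is 0.75 (black shirts), giving $(300 \cdot 0.8 + 500 \cdot 0.75) / 800 = 0.76875$.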
Any Python notebook suitable for machine learning, such as a Jupyter notebook, will work as our environment. Before we start, you may want to examine my GitHub repository’s README, extended readme_help_example.ipynb example file, and requirements.txt file (the required libraries).
We’ll use the sample data in the GitHub repository, which is made up of news articles. The data is arranged with information including category, headline, date, and short_description:
| | category | headline | date | short_description |
|---|---|---|---|---|
| 49999 | THE WORLDPOST | Drug War Deaths Climb To 1,800 In Philippines | 2016-08-22 | In the last seven weeks alone. |
| 49966 | TASTE | Yes, You Can Make Real Cuban-Style Coffee At Home | 2016-08-22 | It’s all about the crema. |
| 49965 | STYLE | KFC’s Fried Chicken-Scented Sunscreen Will Kee… | 2016-08-22 | For when you want to make yourself smell finge… |
| 49964 | POLITICS | HUFFPOLLSTER: Democrats Have A Solid Chance Of… | 2016-08-22 | HuffPost’s poll-based model indicates Senate R… |
We can use pandas to read, analyze, and manipulate the data. We’ll sort the data by date and select a small sample (10,000 news headlines) for our demo since the full data set is large:
import pandas as pd
df = pd.read_json("./sample_data/example_news_data.json", lines=True)
df.sort_values(by='date', inplace=True)
df = df[:10000]
len(df['category'].unique())
Upon running, you should see the notebook output the result 30, since there are 30 categories in this data sample. You may also run df.head(4) to see how the data is stored. (It should match the table displayed in this section.)
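Before applying the clustering, we should first preprocess the text to reduce redundant features of our model, including converting the text to a uniform case, removing numeric and special characters, removing stop words, and lemmatizing the remaining words: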
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
nltk.download('stopwords')
stop_words = stopwords.words('english')
nltk.download('wordnet')
nltk.download('omw-1.4')
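def preprocess(text: str) -> str:
    # Lowercase, then replace every character that is not a-z with a space
    text = text.lower()
    text = re.sub(r'[^a-z]', ' ', text)
    # Collapse repeated whitespace (raw string avoids an invalid escape warning)
    text = re.sub(r'\s+', ' ', text)
    text = text.split(" ")
    # Drop stop words and lemmatize the remaining words as verbs
    words = [wordnet_lemmatizer.lemmatize(word, 'v') for word in text if word not in stop_words]
    return " ".join(words)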
df['processed_input'] = df['headline'].apply(preprocess)
The resulting preprocessed headlines are shown as processed_input, which you can observe by again running df.head(4):
| | category | headline | date | short_description | processed_input |
|---|---|---|---|---|---|
| 49999 | THE WORLDPOST | Drug War Deaths Climb To 1,800 In Philippines | 2016-08-22 | In the last seven weeks alone. | drug war deaths climb philippines |
| 49966 | TASTE | Yes, You Can Make Real Cuban-Style Coffee At Home | 2016-08-22 | It’s all about the crema. | yes make real cuban style coffee home |
| 49965 | STYLE | KFC’s Fried Chicken-Scented Sunscreen Will Kee… | 2016-08-22 | For when you want to make yourself smell finge… | kfc fry chicken scent sunscreen keep skin get … |
| 49964 | POLITICS | HUFFPOLLSTER: Democrats Have A Solid Chance Of… | 2016-08-22 | HuffPost’s poll-based model indicates Senate R… | huffpollster democrats solid chance retake senate |
Now, we need to represent each headline as a numeric vector to be able to apply any machine learning model to it. There are various feature extraction techniques to achieve this; we will be using TF-IDF (term frequency-inverse document frequency). This technique reduces the effect of words occurring with high frequency in documents (in our example, news headlines), as these clearly should not be the deciding features in clustering or classifying them.
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=300, tokenizer=lambda x: x.split(' '))
tfidf_mat = vectorizer.fit_transform(df['processed_input'])
X = tfidf_mat.toarray()  # dense ndarray; recent scikit-learn versions reject np.matrix input
X[X==0]=0.00001
Next, we will try out our first clustering method, agglomerative clustering, on these feature vectors.
Considering the given news categories as the optimal solution, let’s compare these results to those of agglomerative clustering (with the desired number of clusters as 30 since there are 30 categories in the data set):
clusters_agg = AgglomerativeClustering(n_clusters=30).fit_predict(X)
df['class_prd'] = clusters_agg.astype(int)
We will identify the resulting clusters by integer labels; headlines belonging to the same cluster are assigned the same integer label. The cluster_measure function from the compare_clusters module of our GitHub repository returns the aggregate F-measure and number of perfectly matching clusters so we can see how accurate our clustering result was:
from clustering.compare_clusters import cluster_measure
# `cluster_measure` requires given text categories to be in the column `text_category`
df['text_category'] = df['category']
res_df, fmeasure_aggregate, true_matches = cluster_measure(df, gt_column='class_gt')
fmeasure_aggregate, len(true_matches)
# Outputs: (0.19858339749319176, 0)
Comparing these cluster results with the optimal solution, we get a low F-measure of 0.198 and 0 clusters matching the actual class groups, indicating that the agglomerative clusters do not align with the headline categories we chose. Let’s check out a cluster in the result to see what it looks like.
df[df['class_prd'] == 0]['category'].value_counts()
Upon examining the results, we see that this cluster contains headlines from all the categories:
POLITICS 1268
ENTERTAINMENT 712
THE WORLDPOST 373
HEALTHY LIVING 272
QUEER VOICES 251
PARENTS 212
BLACK VOICES 211
...
FIFTY 24
EDUCATION 23
COLLEGE 14
ARTS 13
So, our low F-measure makes sense considering that our result’s clusters do not align with the optimal solution. However, it is important to recall that the given category classification we chose reflects only one possible division of the data set. A low F-measure here doesn’t imply that the clustering result is wrong, but that the clustering result didn’t match our desired method of partitioning the data.
Let’s try another popular clustering algorithm on the same data set: k-means clustering. We will create a new dataframe and use the cluster_measure function again:
kmeans = KMeans(n_clusters=30, random_state=0).fit(X)
df2 = df.copy()
df2['class_prd'] = kmeans.predict(X).astype(int)
res_df, fmeasure_aggregate, true_matches = cluster_measure(df2)
fmeasure_aggregate, len(true_matches)
# Outputs: (0.18332960871141976, 0)
Like the agglomerative clustering result, our k-means clustering result has formed clusters that are dissimilar to our given categories: It has an F-measure of 0.18 when compared to the optimal solution. Since the two clustering results have similar F-measures, it would be interesting to compare them to each other. We already have the clusterings, so we just need to calculate the F-measure. First, we’ll bring both results into one dataframe, with class_gt holding the agglomerative clustering output and class_prd holding the k-means clustering output:
df1 = df2.copy()
df1['class_gt'] = df['class_prd']
res_df, fmeasure_aggregate, true_matches = cluster_measure(df1, gt_column='class_gt')
fmeasure_aggregate, len(true_matches)
# Outputs: (0.4030316435020922, 0)
With a higher F-measure of 0.4, we can observe that the clusterings of the two algorithms are more similar to each other than they are to the optimal solution.
An understanding of the available clustering comparison metrics will expand your machine learning model analysis. We have seen the F-measure clustering metric in action and covered the basics you need to apply these learnings to your next clustering result. To learn even more, here are my top picks for further reading:
The Toptal Engineering Blog extends its gratitude to Luis Bronchal for reviewing the code samples presented in this article.
Original article source at: https://www.toptal.com/
1667874180
Cluster is an easy map annotation clustering library. This repository uses an efficient method (QuadTree) to aggregate pins into a cluster.
The Example is a great place to get started. It demonstrates how to:
$ pod try Cluster
Cluster is available via CocoaPods and Carthage.
To install Cluster with CocoaPods, add this to your Podfile:
pod "Cluster"
To install Cluster with Carthage, add this to your Cartfile:
github "efremidze/Cluster"
The ClusterManager class generates, manages and displays annotation clusters.
let clusterManager = ClusterManager()
Create an object that conforms to the MKAnnotation protocol, or extend an existing one. Next, add the annotation object to an instance of ClusterManager with add(annotation:).
let annotation = Annotation(coordinate: CLLocationCoordinate2D(latitude: 21.283921, longitude: -157.831661))
clusterManager.add(annotation)
Implement the map view’s mapView(_:viewFor:) delegate method to configure the annotation view. Return an instance of MKAnnotationView to display as a visual representation of the annotations.
To display clusters, return an instance of ClusterAnnotationView.
extension ViewController: MKMapViewDelegate {
    func mapView(_ mapView: MKMapView, viewFor annotation: MKAnnotation) -> MKAnnotationView? {
        if let annotation = annotation as? ClusterAnnotation {
            return CountClusterAnnotationView(annotation: annotation, reuseIdentifier: "cluster")
        } else {
            return MKPinAnnotationView(annotation: annotation, reuseIdentifier: "pin")
        }
    }
}
For performance reasons, you should generally reuse MKAnnotationView objects in your map views. See the Example to learn more.
The ClusterAnnotationView class exposes a countLabel property. You can subclass ClusterAnnotationView to provide custom behavior as needed. Here's an example of subclassing the ClusterAnnotationView and customizing the layer borderColor.
class CountClusterAnnotationView: ClusterAnnotationView {
    override func configure() {
        super.configure()
        self.layer.cornerRadius = self.frame.width / 2
        self.layer.masksToBounds = true
        self.layer.borderColor = UIColor.white.cgColor
        self.layer.borderWidth = 1.5
    }
}
See the AnnotationView to learn more.
You can customize the appearance of the StyledClusterAnnotationView by setting the style property of the annotation.
let annotation = Annotation(coordinate: CLLocationCoordinate2D(latitude: 21.283921, longitude: -157.831661))
annotation.style = .color(color, radius: 25)
clusterManager.add(annotation)
Several styles are available in the ClusterAnnotationStyle enum:
- color(UIColor, radius: CGFloat) - Displays the annotations as a circle.
- image(UIImage?) - Displays the annotation as an image.
Once you have added the annotation, you need to return an instance of the StyledClusterAnnotationView to display the styled annotation.
func mapView(_ mapView: MKMapView, viewFor annotation: MKAnnotation) -> MKAnnotationView? {
    if let annotation = annotation as? ClusterAnnotation {
        return StyledClusterAnnotationView(annotation: annotation, reuseIdentifier: identifier, style: style)
    }
    // Returning nil lets the map view use its default view for other annotations.
    return nil
}
To remove annotations, you can call remove(annotation:). However, the annotations will still display until you call reload().
clusterManager.remove(annotation)
If shouldRemoveInvisibleAnnotations is set to false, annotations that have been removed may still appear on the map until you call reload() on the visible region.
Implement the map view’s mapView(_:regionDidChangeAnimated:) delegate method to reload the ClusterManager when the region changes.
func mapView(_ mapView: MKMapView, regionDidChangeAnimated animated: Bool) {
    clusterManager.reload(mapView: mapView) { finished in
        // handle completion
    }
}
You should call reload() anytime you add or remove annotations.
The ClusterManager class exposes several properties to configure clustering:
var zoomLevel: Double // The current zoom level of the visible map region.
var maxZoomLevel: Double // The maximum zoom level before disabling clustering.
var minCountForClustering: Int // The minimum number of annotations for a cluster. The default is `2`.
var shouldRemoveInvisibleAnnotations: Bool // Whether to remove invisible annotations. The default is `true`.
var shouldDistributeAnnotationsOnSameCoordinate: Bool // Whether to arrange annotations in a circle if they have the same coordinate. The default is `true`.
var distanceFromContestedLocation: Double // The distance in meters from contested location when the annotations have the same coordinate. The default is `3`.
var clusterPosition: ClusterPosition // The position of the cluster annotation. The default is `.nearCenter`.
The ClusterManagerDelegate protocol provides a number of functions to manage clustering and configure cells.
// The size of each cell on the grid at a given zoom level.
func cellSize(for zoomLevel: Double) -> Double? { ... }
// Whether to cluster the given annotation.
func shouldClusterAnnotation(_ annotation: MKAnnotation) -> Bool { ... }
Author: Efremidze
Source Code: https://github.com/efremidze/Cluster
License: MIT license
1665752640
A Julia package for linear manifold clustering.
Prior to Julia v0.7.0
Pkg.clone("https://github.com/wildart/LMCLUS.jl.git")
For Julia v0.7.0/1.0.0
pkg> add https://github.com/wildart/LMCLUS.jl.git#0.4.0
For Julia 1.1+, add BoffinStuff registry in the package manager before installing the package.
pkg> registry add https://github.com/wildart/BoffinStuff.git
pkg> add LMCLUS
Julia Version | LMCLUS version |
---|---|
v0.3.* | v0.0.2 |
v0.4.* | v0.1.2 |
v0.5.* | v0.2.0 |
v0.6.* | v0.3.0 |
≥v0.7.* | v0.4.0 |
≥v1.1.* | ≥v0.4.1 |
Author: Wildart
Source Code: https://github.com/wildart/LMCLUS.jl
License: MIT license
1661420520
Go Clustering SQL Driver - A clustering, implementation-agnostic "meta"-driver for any backend implementing "database/sql/driver".
It does (latency-based) load-balancing and error-recovery over all registered nodes.
It is assumed that database-state is transparently replicated over all nodes by some database-side clustering solution. This driver ONLY handles the client side of such a cluster.
This package simply multiplexes the driver.Open() function of sql/driver to every attached node. The function is called on each node, returning the first successfully opened connection. (Any connections opening subsequently will be closed.) If opening does not succeed for any node, the latest error gets returned. Any other errors will be masked by default. However, any given latest error for any attached node will remain exposed through expvar, as well as some basic counters and timestamps.
To make use of this kind of clustering, use this package with any backend driver implementing "database/sql/driver" like so:
import "database/sql"
import "github.com/go-sql-driver/mysql"
import "github.com/EnumApps/clustersql"
const (
    WriteDriver = "write_conn"
    ReadDriver  = "read_conn"
    SessDriver  = "sess_conn"
)
There is currently no way around instantiating the backend driver explicitly
mysqlDriver := mysql.MySQLDriver{}
You can perform backend-driver specific settings such as
err := mysql.SetLogger(mylogger)
Create a new clustering driver with the backend driver
readerDriver := clustersql.NewDriver(mysqlDriver, ReadDriver)
Add nodes, including driver-specific name format, in this case Go-MySQL DSN. Here, we add three nodes belonging to a galera cluster
readerDriver.AddNode("galera1", "reader:password@tcp(dbhost1:3306)/db")
readerDriver.AddNode("galera2", "reader:password@tcp(dbhost2:3306)/db")
readerDriver.AddNode("galera3", "reader:password@tcp(dbhost3:3306)/db")
Make the clusterDriver available to the go sql interface under an arbitrary name
sql.Register(ReadDriver, readerDriver)
Create a new clustering driver with the backend driver
sessionDriver := clustersql.NewDriver(mysqlDriver, SessDriver)
Add nodes, including driver-specific name format, in this case Go-MySQL DSN. Here, we add three nodes belonging to a galera cluster
sessionDriver.AddNode("galera1", "sess_user:password@tcp(dbhost1:3306)/sessdb")
sessionDriver.AddNode("galera2", "sess_user:password@tcp(dbhost2:3306)/sessdb")
sessionDriver.AddNode("galera3", "sess_user:password@tcp(dbhost3:3306)/sessdb")
Make the clusterDriver available to the go sql interface under an arbitrary name
sql.Register(SessDriver, sessionDriver)
Open the registered clusterDriver with an arbitrary DSN string (not used)
db, err := sql.Open(WriteDriver, "")
readonly_db, err := sql.Open(ReadDriver, "")
session_db, err := sql.Open(SessDriver, "")
Continue to use the sql interface as documented at http://golang.org/pkg/database/sql/
Before using this in production, you should configure your cluster details in config.toml and run
go test -v .
Note however, that non-failure of the above is no guarantee for a correctly set-up cluster.
Author: EnumApps
Source Code: https://github.com/EnumApps/clustersql
License: BSD-2-Clause license
1658384229
QuickShift [1] is a fast method for hierarchical clustering. It first constructs the clustering tree and then lets you quickly cut links in the tree that exceed a specified length. This second step can be performed for different link lengths without having to re-run the clustering itself. Care has been taken to provide a high-performance implementation.
[1] Quick Shift and Kernel Methods for Mode Seeking
a = quickshift(data)
a = quickshift(data, sigma)
# cluster ndim x nsamples matrix data.
# sigma: Gaussian kernel width, see paper
labels = quickshiftlabels(a::QuickShift)
labels = quickshiftlabels(a::QuickShift, maxlinklength)
# cut links in the tree with length > maxlinklength
# return cluster labels for data points.
quickshiftplot(a, data, labels)
# plot data points and hierarchical links
# needs PyPlot installed, only for 2D
data 2 x N | Runtime quickshift | Runtime quickshiftlabels |
---|---|---|
1000 | 0.06 sec | 0.0002 sec |
10000 | 0.27 sec | 0.004 sec |
100000 | 9.67 sec | 0.04 sec |
For larger numbers of data points, you might want to use KShiftsClustering.jl to cluster the N data points to e.g. 10,000 cluster centers, and then perform QuickShift on those.
Comparison with kmedoids for 20,000 points:
using Clustering, QuickShiftClustering, FunctionalDataUtils
data = rand(2,20000)
@time a = kmedoids(1-exp(-distance(data,data)*10),10)
# => elapsed time: 56.666481916 seconds (41126243444 bytes allocated, 15.31% gc time)
@time labels = quickshiftlabels(quickshift(data))
# => elapsed time: 1.187448525 seconds (277816624 bytes allocated, 28.79% gc time)
using FunctionalData
data = @p map unstack(1:10) (x->10*randn(2,1).+randn(2,100)) | flatten
using QuickShiftClustering
a = quickshift(data)
labels = quickshiftlabels(a)
quickshiftplot(a, data, labels)
Author: rened
Source Code: https://github.com/rened/QuickShiftClustering.jl
License: View license
1655942280
Clustering.jl
Methods for data clustering and evaluation of clustering quality.
Pkg.add("Clustering")
Julia packages providing other clustering methods:
Documentation:
Author: JuliaStats
Source Code: https://github.com/JuliaStats/Clustering.jl
License: View license
1646860440
bottleneck
Bottleneck is a lightweight and zero-dependency Task Scheduler and Rate Limiter for Node.js and the browser.
Bottleneck is an easy solution as it adds very little complexity to your code. It is battle-hardened, reliable and production-ready and used on a large scale in private companies and open source software.
It supports Clustering: it can rate limit jobs across multiple Node.js instances. It uses Redis and strictly atomic operations to stay reliable in the presence of unreliable clients and networks. It also supports Redis Cluster and Redis Sentinel.
submit()
schedule()
wrap()
updateSettings()
incrementReservoir()
currentReservoir()
stop()
chain()
npm install --save bottleneck
import Bottleneck from "bottleneck";
// Note: To support older browsers and Node <6.0, you must import the ES5 bundle instead.
var Bottleneck = require("bottleneck/es5");
Most APIs have a rate limit. For example, to execute 3 requests per second:
const limiter = new Bottleneck({
minTime: 333
});
If there's a chance some requests might take longer than 333ms and you want to prevent more than 1 request from running at a time, add maxConcurrent: 1:
const limiter = new Bottleneck({
maxConcurrent: 1,
minTime: 333
});
minTime and maxConcurrent are enough for the majority of use cases. They work well together to ensure a smooth rate of requests. If your use case requires executing requests in bursts or every time a quota resets, look into Reservoir Intervals.
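As a rough sketch of where a value like 333 comes from: minTime is the gap between job launches, so a limit of N requests per second corresponds to roughly 1000 / N milliseconds. The helper minTimeForRate below is hypothetical and not part of Bottleneck. It rounds up so the computed gap never exceeds the limit, which gives 334 rather than the 333 used above.

```javascript
// Hypothetical helper (not part of Bottleneck): derive a minTime value in ms
// from a "requests per interval" limit.
function minTimeForRate(requests, intervalMs = 1000) {
  if (!Number.isInteger(requests) || requests <= 0) {
    throw new RangeError("requests must be a positive integer");
  }
  // Round up so that launching one job every minTime ms stays within the limit.
  return Math.ceil(intervalMs / requests);
}

// An options object of the shape shown above; it could be passed to new Bottleneck().
const options = { maxConcurrent: 1, minTime: minTimeForRate(3) };
console.log(options);
```

Rounding down instead would allow slightly more than the stated rate, which may or may not matter for the API being called.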
Instead of this:
myFunction(arg1, arg2)
.then((result) => {
/* handle result */
});
Do this:
limiter.schedule(() => myFunction(arg1, arg2))
.then((result) => {
/* handle result */
});
Or this:
const wrapped = limiter.wrap(myFunction);
wrapped(arg1, arg2)
.then((result) => {
/* handle result */
});
Instead of this:
const result = await myFunction(arg1, arg2);
Do this:
const result = await limiter.schedule(() => myFunction(arg1, arg2));
Or this:
const wrapped = limiter.wrap(myFunction);
const result = await wrapped(arg1, arg2);
Instead of this:
someAsyncCall(arg1, arg2, callback);
Do this:
limiter.submit(someAsyncCall, arg1, arg2, callback);
Remember...
Bottleneck builds a queue of jobs and executes them as soon as possible. By default, the jobs will be executed in the order they were received.
Read the 'Gotchas' and you're good to go. Or keep reading to learn about all the fine tuning and advanced options available. If your rate limits need to be enforced across a cluster of computers, read the Clustering docs.
Need help debugging your application?
Instead of throttling maybe you want to batch up requests into fewer calls?
Make sure the function you pass to schedule() or wrap() only returns once all the work it does has completed.
Instead of this:
limiter.schedule(() => {
tasksArray.forEach(x => processTask(x));
// BAD, we return before our processTask() functions are finished processing!
});
Do this:
limiter.schedule(() => {
const allTasks = tasksArray.map(x => processTask(x));
// GOOD, we wait until all tasks are done.
return Promise.all(allTasks);
});
If you're passing an object's method as a job, you'll probably need to bind() the object:
// instead of this:
limiter.schedule(object.doSomething);
// do this:
limiter.schedule(object.doSomething.bind(object));
// or, wrap it in an arrow function instead:
limiter.schedule(() => object.doSomething());
Bottleneck requires Node 6+ to function. However, an ES5 build is included: var Bottleneck = require("bottleneck/es5");
Make sure you're catching "error" events emitted by your limiters!
Consider setting a maxConcurrent value instead of leaving it null. This can help your application's performance, especially if you think the limiter's queue might become very long.
If you plan on using priorities, make sure to set a maxConcurrent value.
When using submit(), if a callback isn't necessary, you must pass null or an empty function instead. It will not work otherwise.
When using submit(), make sure all the jobs will eventually complete by calling their callback, or set an expiration. Even if you submitted your job with a null callback, it still needs to call its callback. This is particularly important if you are using a maxConcurrent value that isn't null (unlimited); otherwise the jobs that never complete will clog up the limiter and no new jobs will be allowed to run. It's safe to call the callback more than once; subsequent calls are ignored.
Using tools like mockdate in your tests to change time in JavaScript will likely result in undefined behavior from Bottleneck.
const limiter = new Bottleneck({/* options */});
Basic options:
Option | Default | Description |
---|---|---|
maxConcurrent | null (unlimited) | How many jobs can be executing at the same time. Consider setting a value instead of leaving it null , it can help your application's performance, especially if you think the limiter's queue might get very long. |
minTime | 0 ms | How long to wait after launching a job before launching another one. |
highWater | null (unlimited) | How long can the queue be? When the queue length exceeds that value, the selected strategy is executed to shed the load. |
strategy | Bottleneck.strategy.LEAK | Which strategy to use when the queue gets longer than the high water mark. Read about strategies. Strategies are never executed if highWater is null . |
penalty | 15 * minTime , or 5000 when minTime is 0 | The penalty value used by the BLOCK strategy. |
reservoir | null (unlimited) | How many jobs can be executed before the limiter stops executing jobs. If reservoir reaches 0 , no jobs will be executed until it is no longer 0 . New jobs will still be queued up. |
reservoirRefreshInterval | null (disabled) | Every reservoirRefreshInterval milliseconds, the reservoir value will be automatically updated to the value of reservoirRefreshAmount . The reservoirRefreshInterval value should be a multiple of 250 (5000 for Clustering). |
reservoirRefreshAmount | null (disabled) | The value to set reservoir to when reservoirRefreshInterval is in use. |
reservoirIncreaseInterval | null (disabled) | Every reservoirIncreaseInterval milliseconds, the reservoir value will be automatically incremented by reservoirIncreaseAmount . The reservoirIncreaseInterval value should be a multiple of 250 (5000 for Clustering). |
reservoirIncreaseAmount | null (disabled) | The increment applied to reservoir when reservoirIncreaseInterval is in use. |
reservoirIncreaseMaximum | null (disabled) | The maximum value that reservoir can reach when reservoirIncreaseInterval is in use. |
Promise | Promise (built-in) | This lets you override the Promise library used by Bottleneck. |
Reservoir Intervals let you execute requests in bursts, by automatically controlling the limiter's reservoir value. The reservoir is simply the number of jobs the limiter is allowed to execute. Once the value reaches 0, it stops starting new jobs.
There are 2 types of Reservoir Intervals: Refresh Intervals and Increase Intervals.
In this example, we throttle to 100 requests every 60 seconds:
const limiter = new Bottleneck({
reservoir: 100, // initial value
reservoirRefreshAmount: 100,
reservoirRefreshInterval: 60 * 1000, // must be divisible by 250
// also use maxConcurrent and/or minTime for safety
maxConcurrent: 1,
minTime: 333 // pick a value that makes sense for your use case
});
reservoir is a counter decremented every time a job is launched; we set its initial value to 100. Then, every reservoirRefreshInterval (60000 ms), reservoir is automatically updated to be equal to the reservoirRefreshAmount (100).
In this example, we throttle jobs to meet the Shopify API Rate Limits. Users are allowed to send 40 requests initially, then every second grants 2 more requests up to a maximum of 40.
const limiter = new Bottleneck({
reservoir: 40, // initial value
reservoirIncreaseAmount: 2,
reservoirIncreaseInterval: 1000, // must be divisible by 250
reservoirIncreaseMaximum: 40,
// also use maxConcurrent and/or minTime for safety
maxConcurrent: 5,
minTime: 250 // pick a value that makes sense for your use case
});
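To make the counter behaviour concrete, here is a toy model of an Increase Interval using the Shopify numbers above. It is an illustration under simplifying assumptions, not Bottleneck's implementation: the class ReservoirModel and its methods tryLaunch() and tick() are hypothetical names, and tick() stands in for one reservoirIncreaseInterval elapsing.

```javascript
// Toy model of the reservoir counter; an illustration only, not Bottleneck's code.
class ReservoirModel {
  constructor({ reservoir, increaseAmount, increaseMaximum }) {
    this.reservoir = reservoir;
    this.increaseAmount = increaseAmount;
    this.increaseMaximum = increaseMaximum;
  }

  // Launching a job consumes `weight` units; refuse when not enough remain.
  tryLaunch(weight = 1) {
    if (this.reservoir < weight) return false;
    this.reservoir -= weight;
    return true;
  }

  // One interval elapsing: add the increase amount, capped at the maximum.
  tick() {
    this.reservoir = Math.min(
      this.reservoir + this.increaseAmount,
      this.increaseMaximum
    );
    return this.reservoir;
  }
}

const model = new ReservoirModel({
  reservoir: 40,
  increaseAmount: 2,
  increaseMaximum: 40,
});

let launched = 0;
while (model.tryLaunch()) launched++; // burst until the reservoir is empty
console.log(launched, model.reservoir);

model.tick(); // one second later, two more jobs may start
console.log(model.reservoir);
```

In this model the initial burst drains the reservoir, after which throughput is bounded by the increase amount per interval.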
Reservoir Intervals are an advanced feature, please take the time to read and understand the following warnings.
Reservoir Intervals are not a replacement for minTime and maxConcurrent. It's strongly recommended to also use minTime and/or maxConcurrent to spread out the load. For example, suppose a lot of jobs are queued up because the reservoir is 0. Every time the Refresh Interval is triggered, a number of jobs equal to reservoirRefreshAmount will automatically be launched, all at the same time! To prevent this flooding effect and keep your application running smoothly, use minTime and maxConcurrent to stagger the jobs.
The Reservoir Interval starts from the moment the limiter is created. Let's suppose we're using reservoirRefreshAmount: 5. If you happen to add 10 jobs just 1ms before the refresh is triggered, the first 5 will run immediately, then 1ms later it will refresh the reservoir value and that will make the last 5 also run right away. It will have run 10 jobs in just over 1ms no matter what your reservoir interval was!
Reservoir Intervals prevent a limiter from being garbage collected. Call limiter.disconnect() to clear the interval and allow the memory to be freed. However, it's not necessary to call .disconnect() to allow the Node.js process to exit.
Adds a job to the queue. This is the callback version of schedule().
limiter.submit(someAsyncCall, arg1, arg2, callback);
You can pass null instead of an empty function if there is no callback, but someAsyncCall still needs to call its callback to let the limiter know it has completed its work.
submit() can also accept advanced options.
Adds a job to the queue. This is the Promise and async/await version of submit().
const fn = function(arg1, arg2) {
return httpGet(arg1, arg2); // Here httpGet() returns a promise
};
limiter.schedule(fn, arg1, arg2)
.then((result) => {
/* ... */
});
In other words, schedule() takes a function fn and a list of arguments. schedule() returns a promise, and fn itself is executed according to the rate limits.
schedule() can also accept advanced options.
Here's another example:
// suppose that `client.get(url)` returns a promise
const url = "https://wikipedia.org";
limiter.schedule(() => client.get(url))
.then(response => console.log(response.body));
Takes a function that returns a promise. Returns a function identical to the original, but rate limited.
const wrapped = limiter.wrap(fn);
wrapped()
.then(function (result) {
/* ... */
})
.catch(function (error) {
// Bottleneck might need to fail the job even if the original function can never fail.
// For example, your job is taking longer than the `expiration` time you've set.
});
submit(), schedule(), and wrap() all accept advanced options.
// Submit
limiter.submit({/* options */}, someAsyncCall, arg1, arg2, callback);
// Schedule
limiter.schedule({/* options */}, fn, arg1, arg2);
// Wrap
const wrapped = limiter.wrap(fn);
wrapped.withOptions({/* options */}, arg1, arg2);
Option | Default | Description |
---|---|---|
priority | 5 | A priority between 0 and 9 . A job with a priority of 4 will be queued ahead of a job with a priority of 5 . Important: You must set a low maxConcurrent value for priorities to work, otherwise there is nothing to queue because jobs will be scheduled immediately! |
weight | 1 | Must be an integer equal to or higher than 0 . The weight is what increases the number of running jobs (up to maxConcurrent ) and decreases the reservoir value. |
expiration | null (unlimited) | The number of milliseconds a job is given to complete. Jobs that execute for longer than expiration ms will be failed with a BottleneckError . |
id | <no-id> | You should give an ID to your jobs, it helps with debugging. |
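The priority rule in the table can be sketched as a sort: lower numbers run first, and jobs with equal priority keep the order in which they were received. This is a toy model, not Bottleneck's internal queue. The function executionOrder and the job ids are hypothetical, and the default of 5 is taken from the table above.

```javascript
// Toy model of priority ordering; not Bottleneck's internal data structure.
const DEFAULT_PRIORITY = 5;

function executionOrder(jobs) {
  return jobs
    .map((job, index) => ({
      id: job.id,
      priority: job.priority ?? DEFAULT_PRIORITY,
      index, // arrival position, used to break ties
    }))
    // Lower priority number first; ties keep arrival order.
    .sort((a, b) => a.priority - b.priority || a.index - b.index)
    .map((job) => job.id);
}

const order = executionOrder([
  { id: "report" },
  { id: "cleanup", priority: 9 },
  { id: "checkout", priority: 4 },
  { id: "email" },
]);
console.log(order);
```

The model assumes every job is already waiting in the queue, which is why the table insists on a low maxConcurrent: with no queue, there is nothing to reorder.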
A strategy is a simple algorithm that is executed every time adding a job would cause the number of queued jobs to exceed highWater. Strategies are never executed if highWater is null.
Bottleneck.strategy.LEAK: When adding a new job to a limiter, if the queue length reaches highWater, drop the oldest job with the lowest priority. This is useful when jobs that have been waiting for too long are not important anymore. If all the queued jobs are more important (based on their priority value) than the one being added, it will not be added.
Bottleneck.strategy.OVERFLOW_PRIORITY: Same as LEAK, except it will only drop jobs that are less important than the one being added. If all the queued jobs are as or more important than the new one, it will not be added.
Bottleneck.strategy.OVERFLOW: When adding a new job to a limiter, if the queue length reaches highWater, do not add the new job. This strategy totally ignores priority levels.
Bottleneck.strategy.BLOCK: When adding a new job to a limiter, if the queue length reaches highWater, the limiter falls into "blocked mode". All queued jobs are dropped and no new jobs will be accepted until the limiter unblocks. It will unblock after penalty milliseconds have passed without receiving a new job. penalty is equal to 15 * minTime (or 5000 if minTime is 0) by default. This strategy is ideal when bruteforce attacks are to be expected. This strategy totally ignores priority levels.
Note: By default, Bottleneck does not keep track of DONE jobs, to save memory. You can enable this feature by passing trackDoneStatus: true as an option when creating a limiter.
const counts = limiter.counts();
console.log(counts);
/*
{
RECEIVED: 0,
QUEUED: 0,
RUNNING: 0,
EXECUTING: 0,
DONE: 0
}
*/
Returns an object with the current number of jobs per status in the limiter.
console.log(limiter.jobStatus("some-job-id"));
// Example: QUEUED
Returns the status of the job with the provided job id in the limiter. Returns null if no job with that id exists.
console.log(limiter.jobs("RUNNING"));
// Example: ['id1', 'id2']
Returns an array of all the job ids with the specified status in the limiter. Not passing a status string returns all the known ids.
const count = limiter.queued(priority);
console.log(count);
priority is optional. Returns the number of QUEUED jobs with the given priority level. Omitting the priority argument returns the total number of queued jobs in the limiter.
const count = await limiter.clusterQueued();
console.log(count);
Returns the number of QUEUED jobs in the Cluster.
if (limiter.empty()) {
// do something...
}
Returns a boolean which indicates whether there are any RECEIVED or QUEUED jobs in the limiter.
limiter.running()
.then((count) => console.log(count));
Returns a promise that returns the total weight of the RUNNING and EXECUTING jobs in the Cluster.
limiter.done()
.then((count) => console.log(count));
Returns a promise that returns the total weight of DONE jobs in the Cluster. Does not require passing the trackDoneStatus: true option.
limiter.check()
.then((wouldRunNow) => console.log(wouldRunNow));
Checks if a new job would be executed immediately if it was submitted now. Returns a promise that returns a boolean.
'error'
limiter.on("error", function (error) {
/* handle errors here */
});
The two main causes of error events are: uncaught exceptions in your event handlers, and network errors when Clustering is enabled.
'failed'
limiter.on("failed", function (error, jobInfo) {
// This will be called every time a job fails.
});
'retry'
See Retries to learn how to automatically retry jobs.
limiter.on("retry", function (message, jobInfo) {
// This will be called every time a job is retried.
});
'empty'
limiter.on("empty", function () {
// This will be called when `limiter.empty()` becomes true.
});
'idle'
limiter.on("idle", function () {
// This will be called when `limiter.empty()` is `true` and `limiter.running()` is `0`.
});
'dropped'
limiter.on("dropped", function (dropped) {
// This will be called when a strategy was triggered.
// The dropped request is passed to this event listener.
});
'depleted'
limiter.on("depleted", function (empty) {
// This will be called every time the reservoir drops to 0.
// The `empty` (boolean) argument indicates whether `limiter.empty()` is currently true.
});
'debug'
limiter.on("debug", function (message, data) {
// Useful to figure out what the limiter is doing in real time
// and to help debug your application
});
'received' 'queued' 'scheduled' 'executing' 'done'
limiter.on("queued", function (info) {
// This event is triggered when a job transitions from one Lifecycle stage to another
});
See Jobs Lifecycle for more information.
These Lifecycle events are not triggered for jobs located on another limiter in a Cluster, for performance reasons.
Use removeAllListeners() with an optional event name as first argument to remove listeners.
Use .once() instead of .on() to only receive a single event.
The following example:
const limiter = new Bottleneck();
// Listen to the "failed" event
limiter.on("failed", async (error, jobInfo) => {
const id = jobInfo.options.id;
console.warn(`Job ${id} failed: ${error}`);
if (jobInfo.retryCount === 0) { // Here we only retry once
console.log(`Retrying job ${id} in 25ms!`);
return 25;
}
});
// Listen to the "retry" event
limiter.on("retry", (error, jobInfo) => console.log(`Now retrying ${jobInfo.options.id}`));
const main = async function () {
let executions = 0;
// Schedule one job
const result = await limiter.schedule({ id: 'ABC123' }, async () => {
executions++;
if (executions === 1) {
throw new Error("Boom!");
} else {
return "Success!";
}
});
console.log(`Result: ${result}`);
}
main();
will output
Job ABC123 failed: Error: Boom!
Retrying job ABC123 in 25ms!
Now retrying ABC123
Result: Success!
To re-run your job, simply return an integer from the 'failed' event handler. The number returned is how many milliseconds to wait before retrying it. Return 0 to retry it immediately.
IMPORTANT: When you ask the limiter to retry a job it will not send it back into the queue. It will stay in the EXECUTING state until it succeeds or until you stop retrying it. This means that it counts as a concurrent job for maxConcurrent even while it's just waiting to be retried. The number of milliseconds to wait ignores your minTime settings.
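Since the 'failed' handler only has to return a number of milliseconds, the retry policy can be written as a small pure function. The sketch below shows one possible policy, exponential backoff with a retry cap. The factory makeRetryPolicy and its option names are hypothetical; only jobInfo.retryCount comes from the example above.

```javascript
// Hypothetical policy: retry up to maxRetries times with exponential backoff.
function makeRetryPolicy({ baseDelayMs = 25, maxRetries = 3 } = {}) {
  return function onFailed(error, jobInfo) {
    if (jobInfo.retryCount >= maxRetries) {
      return undefined; // returning nothing means: do not retry
    }
    // Double the wait on every attempt: 25, 50, 100, ...
    return baseDelayMs * 2 ** jobInfo.retryCount;
  };
}

const onFailed = makeRetryPolicy();
const delays = [0, 1, 2, 3].map((retryCount) =>
  onFailed(new Error("Boom!"), { retryCount })
);
console.log(delays);

// A handler like this could be registered with limiter.on("failed", onFailed).
```

Because a job waiting to be retried still counts toward maxConcurrent, a capped number of retries keeps a persistently failing job from occupying a slot indefinitely.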
limiter.updateSettings(options);
The options are the same as the limiter constructor.
Note: Changes don't affect SCHEDULED jobs.
limiter.incrementReservoir(incrementBy);
Returns a promise that returns the new reservoir value.
limiter.currentReservoir()
.then((reservoir) => console.log(reservoir));
Returns a promise that returns the current reservoir value.
The stop() method is used to safely shut down a limiter. It prevents any new jobs from being added to the limiter and waits for all EXECUTING jobs to complete.
limiter.stop(options)
.then(() => {
console.log("Shutdown completed!")
});
stop() returns a promise that resolves once all the EXECUTING jobs have completed and, if desired, once all non-EXECUTING jobs have been dropped.
Option | Default | Description |
---|---|---|
dropWaitingJobs | true | When true , drop all the RECEIVED , QUEUED and RUNNING jobs. When false , allow those jobs to complete before resolving the Promise returned by this method. |
dropErrorMessage | This limiter has been stopped. | The error message used to drop jobs when dropWaitingJobs is true . |
enqueueErrorMessage | This limiter has been stopped and cannot accept new jobs. | The error message used to reject a job added to the limiter after stop() has been called. |
Tasks that are ready to be executed will be added to that other limiter. Suppose you have 2 types of tasks, A and B. They both have their own limiter with their own settings, but both must also follow a global limiter G:
const limiterA = new Bottleneck( /* some settings */ );
const limiterB = new Bottleneck( /* some different settings */ );
const limiterG = new Bottleneck( /* some global settings */ );
limiterA.chain(limiterG);
limiterB.chain(limiterG);
// Requests added to limiterA must follow the A and G rate limits.
// Requests added to limiterB must follow the B and G rate limits.
// Requests added to limiterG must follow the G rate limits.
To unchain, call limiter.chain(null);
The Group feature of Bottleneck manages many limiters automatically for you. It creates limiters dynamically and transparently.
Let's take a DNS server as an example of how Bottleneck can be used. It's a service that sees a lot of abuse and where incoming DNS requests need to be rate limited. Bottleneck is so tiny, it's acceptable to create one limiter for each origin IP, even if it means creating thousands of limiters. The Group feature is perfect for this use case. Create one Group and use the origin IP to rate limit each IP independently. Each call with the same key (IP) will be routed to the same underlying limiter. A Group is created like a limiter:
const group = new Bottleneck.Group(options);
The options object will be used for every limiter created by the Group.
The Group is then used with the .key(str) method:
// In this example, the key is an IP
group.key("77.66.54.32").schedule(() => {
/* process the request */
});
str: The key to use. All jobs added with the same key will use the same underlying limiter. Default: ""
The return value of .key(str) is a limiter. If it doesn't already exist, it is generated for you. Calling key() is how limiters are created inside a Group.
Limiters that have been idle for longer than 5 minutes are deleted to avoid memory leaks. This value can be changed by passing a different timeout option, in milliseconds.
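The lazy creation behind .key(str) can be pictured as a map from keys to limiters that is filled on first access. The following toy model illustrates only that idea. KeyedRegistry is a hypothetical class, the "limiter" is a plain stand-in object, and the idle-timeout cleanup described above is left out.

```javascript
// Toy model of a keyed registry; the limiter here is a stand-in object,
// not a real Bottleneck limiter.
class KeyedRegistry {
  constructor(createLimiter) {
    this.createLimiter = createLimiter;
    this.instances = new Map();
  }

  // Return the limiter for `str`, creating it on first use.
  key(str = "") {
    if (!this.instances.has(str)) {
      this.instances.set(str, this.createLimiter(str));
    }
    return this.instances.get(str);
  }

  keys() {
    return [...this.instances.keys()];
  }
}

const registry = new KeyedRegistry((key) => ({ key, jobs: 0 }));
registry.key("77.66.54.32").jobs++;
registry.key("77.66.54.32").jobs++; // same key, same underlying object
registry.key("10.0.0.1").jobs++;
console.log(registry.keys(), registry.key("77.66.54.32").jobs);
```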
group.on("created", (limiter, key) => {
console.log("A new limiter was created for key: " + key)
// Prepare the limiter, for example we'll want to listen to its "error" events!
limiter.on("error", (err) => {
// Handle errors here
})
});
Listening for the "created" event is the recommended way to set up a new limiter. Your event handler is executed before key() returns the newly created limiter.
const group = new Bottleneck.Group({ maxConcurrent: 2, minTime: 250 });
group.updateSettings({ minTime: 500 });
After executing the above commands, new limiters will be created with { maxConcurrent: 2, minTime: 500 }.
str: The key for the limiter to delete.
Manually deletes the limiter at the specified key. When using Clustering, the Redis data is immediately deleted and the other Groups in the Cluster will eventually delete their local key automatically, unless it is still being used.
Returns an array containing all the keys in the Group.
Same as group.keys(), but returns all keys in this Group ID across the Cluster.
const limiters = group.limiters();
console.log(limiters);
// [ { key: "some key", limiter: <limiter> }, { key: "some other key", limiter: <some other limiter> } ]
Some APIs can accept multiple operations in a single call. Bottleneck's Batching feature helps you take advantage of those APIs:
const batcher = new Bottleneck.Batcher({
maxTime: 1000,
maxSize: 10
});
batcher.on("batch", (batch) => {
console.log(batch); // ["some-data", "some-other-data"]
// Handle batch here
});
batcher.add("some-data");
batcher.add("some-other-data");
batcher.add() returns a Promise that resolves once the request has been flushed to a "batch" event.
Option | Default | Description |
---|---|---|
maxTime | null (unlimited) | Maximum acceptable time (in milliseconds) a request can have to wait before being flushed to the "batch" event. |
maxSize | null (unlimited) | Maximum number of requests in a batch. |
Batching doesn't throttle requests, it only groups them up optimally according to your maxTime and maxSize settings.
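The grouping rule for maxSize can be sketched without any timers. The toy model below flushes a batch as soon as it holds maxSize items, and an explicit flush() call stands in for the maxTime timer firing. SizeBatcher is a hypothetical class written for illustration. It is not Bottleneck.Batcher and does not return promises from add().

```javascript
// Toy model of size-based batching; time-based flushing (maxTime) is left out.
class SizeBatcher {
  constructor(maxSize, onBatch) {
    this.maxSize = maxSize;
    this.onBatch = onBatch;
    this.pending = [];
  }

  add(item) {
    this.pending.push(item);
    if (this.pending.length >= this.maxSize) this.flush();
  }

  // Hand the pending items to the callback as one batch.
  flush() {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.onBatch(batch);
  }
}

const batches = [];
const sizeBatcher = new SizeBatcher(2, (batch) => batches.push(batch));
["a", "b", "c", "d", "e"].forEach((item) => sizeBatcher.add(item));
sizeBatcher.flush(); // stands in for the maxTime timer firing
console.log(batches);
```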
Clustering lets many limiters access the same shared state, stored in Redis. Changes to the state are Atomic, Consistent and Isolated (and fully ACID with the right Durability configuration), to eliminate any chances of race conditions or state corruption. Your settings, such as maxConcurrent, minTime, etc., are shared across the whole cluster, which means, for example, that { maxConcurrent: 5 } guarantees no more than 5 jobs can ever run at a time in the entire cluster of limiters. 100% of Bottleneck's features are supported in Clustering mode. Enabling Clustering is as simple as changing a few settings. It's also a convenient way to store or export state for later use.
Bottleneck will attempt to spread load evenly across limiters.
First, add redis or ioredis to your application's dependencies:
# NodeRedis (https://github.com/NodeRedis/node_redis)
npm install --save redis
# or ioredis (https://github.com/luin/ioredis)
npm install --save ioredis
Then create a limiter or a Group:
const limiter = new Bottleneck({
/* Some basic options */
maxConcurrent: 5,
minTime: 500,
id: "my-super-app", // All limiters with the same id will be clustered together
/* Clustering options */
datastore: "redis", // or "ioredis"
clearDatastore: false,
clientOptions: {
host: "127.0.0.1",
port: 6379
// Redis client options
// Using NodeRedis? See https://github.com/NodeRedis/node_redis#options-object-properties
// Using ioredis? See https://github.com/luin/ioredis/blob/master/API.md#new-redisport-host-options
}
});
Option | Default | Description |
---|---|---|
datastore | "local" | Where the limiter stores its internal state. The default ("local" ) keeps the state in the limiter itself. Set it to "redis" or "ioredis" to enable Clustering. |
clearDatastore | false | When set to true , on initial startup, the limiter will wipe any existing Bottleneck state data on the Redis db. |
clientOptions | {} | This object is passed directly to the redis client library you've selected. |
clusterNodes | null | ioredis only. When clusterNodes is not null, the client will be instantiated by calling new Redis.Cluster(clusterNodes, clientOptions) instead of new Redis(clientOptions) . |
timeout | null (no TTL) | The Redis TTL in milliseconds (TTL) for the keys created by the limiter. When timeout is set, the limiter's state will be automatically removed from Redis after timeout milliseconds of inactivity. |
Redis | null | Overrides the import/require of the redis/ioredis library. You shouldn't need to set this option unless your application is failing to start due to a failure to require/import the client library. |
Note: When using Groups, the timeout option has a default of 300000 milliseconds and the generated limiters automatically receive an id with the pattern ${group.id}-${KEY}.
Note: If you are seeing a runtime error due to the require() function not being able to load redis/ioredis, then directly pass the module as the Redis option. Example:
import Redis from "ioredis"
const limiter = new Bottleneck({
id: "my-super-app",
datastore: "ioredis",
clientOptions: { host: '12.34.56.78', port: 6379 },
Redis
});
Unfortunately, this is a side effect of having to disable inlining, which is necessary to make Bottleneck easy to use in the browser.
The first limiter connecting to Redis will store its constructor options on Redis and all subsequent limiters will be using those settings. You can alter the constructor options used by all the connected limiters by calling updateSettings(). The clearDatastore option instructs a new limiter to wipe any previous Bottleneck data (for that id), including previously stored settings.
Queued jobs are NOT stored on Redis. They are local to each limiter. Exiting the Node.js process will lose those jobs. This is because Bottleneck has no way to propagate the JS code to run a job across a different Node.js process than the one it originated on. Bottleneck doesn't keep track of the queue contents of the limiters on a cluster for performance and reliability reasons. You can use something like BeeQueue in addition to Bottleneck to get around this limitation.
Due to the above, functionality relying on the queue length happens purely locally:
highWater and load shedding (strategies) are per limiter. However, one limiter entering Blocked mode will put the entire cluster in Blocked mode until penalty milliseconds have passed. See Strategies.
The "empty" event is triggered when the (local) queue is empty.
The "idle" event is triggered when the (local) queue is empty and no jobs are currently running anywhere in the cluster.
You must work around these limitations in your application code if they are an issue to you. The publish() method could be useful here.
The current design guarantees reliability, is highly performant and lets limiters come and go. Your application can scale up or down, and clients can be disconnected at any time without issues.
It is strongly recommended that you give an id to every limiter and Group since it is used to build the name of your limiter's Redis keys! Limiters with the same id inside the same Redis db will be sharing the same datastore.
It is strongly recommended that you set an expiration (See Job Options) on every job, since that lets the cluster recover from crashed or disconnected clients. Otherwise, a client crashing while executing a job would not be able to tell the cluster to decrease its number of "running" jobs. By using expirations, those lost jobs are automatically cleared after the specified time has passed. Using expirations is essential to keeping a cluster reliable in the face of unpredictable application bugs, network hiccups, and so on.
Network latency between Node.js and Redis is not taken into account when calculating timings (such as minTime). To minimize the impact of latency, Bottleneck only performs a single Redis call per lifecycle transition. Keeping the Redis server close to your limiters will help you get a more consistent experience. Keeping the system time consistent across all clients will also help.
It is strongly recommended to set up an "error" listener on all your limiters and on your Groups.
The ready(), publish() and clients() methods also exist when using the local datastore, for code compatibility reasons: code written for redis/ioredis won't break with local.
This method returns a promise that resolves once the limiter is connected to Redis.
As of v2.9.0, it's no longer necessary to wait for .ready() to resolve before issuing commands to a limiter. The commands will be queued until the limiter successfully connects. Make sure to listen to the "error" event to handle connection errors.
const limiter = new Bottleneck({/* options */});
limiter.on("error", (err) => {
// handle network errors
});
limiter.ready()
.then(() => {
// The limiter is ready
});
This method broadcasts the message string to every limiter in the Cluster. It returns a promise.
const limiter = new Bottleneck({/* options */});
limiter.on("message", (msg) => {
console.log(msg); // prints "this is a string"
});
limiter.publish("this is a string");
To send objects, stringify them first:
limiter.on("message", (msg) => {
console.log(JSON.parse(msg).hello) // prints "world"
});
limiter.publish(JSON.stringify({ hello: "world" }));
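Because every limiter in the Cluster receives every published string, a "message" handler may see both plain strings and stringified objects. One way to handle that is a small envelope format with a tolerant decoder, sketched below. The helpers encodeMessage and decodeMessage and the { type, payload } shape are assumptions made for illustration, not part of Bottleneck.

```javascript
// Hypothetical envelope helpers for messages sent with publish().
function encodeMessage(type, payload) {
  return JSON.stringify({ type, payload });
}

function decodeMessage(raw) {
  try {
    const parsed = JSON.parse(raw);
    // Only accept objects that look like our envelope.
    if (parsed && typeof parsed.type === "string") return parsed;
  } catch (err) {
    // Not JSON: fall through and treat it as a plain string message.
  }
  return { type: "text", payload: raw };
}

const wire = encodeMessage("greeting", { hello: "world" });
console.log(decodeMessage(wire).payload.hello);
console.log(decodeMessage("this is a string"));

// Possible use: limiter.publish(wire) on one side and
// limiter.on("message", (msg) => decodeMessage(msg)) on the other.
```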
If you need direct access to the redis clients, use .clients():
console.log(limiter.clients());
// { client: <Redis Client>, subscriber: <Redis Client> }
Bottleneck is compatible with Redis Clusters, but you must use the ioredis datastore and the clusterNodes option.
Bottleneck is compatible with Redis Sentinel, but you must use the ioredis datastore.
Bottleneck's data is stored in Redis keys starting with b_. It also uses pubsub channels starting with b_. It will not interfere with any other data stored on the server.
Bottleneck loads a few Lua scripts on the Redis server using the SCRIPT LOAD command. These scripts only take up a few Kb of memory. Running the SCRIPT FLUSH command will cause any connected limiters to experience critical errors until a new limiter connects to Redis and loads the scripts again.
Bottleneck needs to create 2 Redis Clients to function, one for normal operations and one for pubsub subscriptions. These 2 clients are kept in a Bottleneck.RedisConnection (NodeRedis) or a Bottleneck.IORedisConnection (ioredis) object, referred to as the Connection object.
By default, every Group and every standalone limiter (a limiter not created by a Group) will create their own Connection object, but it is possible to manually control this behavior. In this example, every Group and limiter is sharing the same Connection object and therefore the same 2 clients:
const connection = new Bottleneck.RedisConnection({
clientOptions: {/* NodeRedis/ioredis options */}
// ioredis also accepts `clusterNodes` here
});
const limiter = new Bottleneck({ connection: connection });
const group = new Bottleneck.Group({ connection: connection });
You can access and reuse the Connection object of any Group or limiter:
const group = new Bottleneck.Group({ connection: limiter.connection });
When a Connection object is created manually, the connectivity "error" events are emitted on the Connection itself.
connection.on("error", (err) => { /* handle connectivity errors here */ });
If you already have a NodeRedis/ioredis client, you can ask Bottleneck to reuse it, although currently the Connection object will still create a second client for pubsub operations:
import Redis from "redis";
const client = Redis.createClient({/* options */});
const connection = new Bottleneck.RedisConnection({
// `clientOptions` and `clusterNodes` will be ignored since we're passing a raw client
client: client
});
const limiter = new Bottleneck({ connection: connection });
const group = new Bottleneck.Group({ connection: connection });
Depending on your application, using more clients can improve performance.
Use the disconnect(flush) method to close the Redis clients.
limiter.disconnect();
group.disconnect();
If you created the Connection object manually, you need to call connection.disconnect() instead, for safety reasons.
Debugging complex scheduling logic can be difficult, especially when priorities, weights, and network latency all interact with one another.
If your application is not behaving as expected, start by making sure you're catching "error" events emitted by your limiters and your Groups. Those errors are most likely uncaught exceptions from your application code.
Make sure you've read the 'Gotchas' section.
To see exactly what a limiter is doing in real time, listen to the "debug" event. It contains detailed information about how the limiter is executing your code. Adding job IDs to all your jobs makes the debug output more readable.
When Bottleneck has to fail one of your jobs, it does so by using BottleneckError objects. This lets you tell those errors apart from your own code's errors:
limiter.schedule(fn)
.then((result) => { /* ... */ } )
.catch((error) => {
if (error instanceof Bottleneck.BottleneckError) {
/* ... */
}
});
The internal algorithms essentially haven't changed from v1, but many small changes to the interface were made to introduce new features.
All the breaking changes:
Use require("bottleneck/es5") if you need ES5 support in v2. Bottleneck v1 will continue to use ES5 only.
The Cluster feature is now called Group. This is to distinguish it from the new v2 Clustering feature.
The Group constructor takes an options object to match the limiter constructor.
Removed submitPriority(), use submit() with an options object instead.
Removed schedulePriority(), use schedule() with an options object instead.
The rejectOnDrop option is now true by default. It can be set to false if you wish to retain v1 behavior. However this option is left undocumented as enabling it is considered to be a poor practice.
Use null instead of 0 to indicate an unlimited maxConcurrent value.
Use null instead of -1 to indicate an unlimited highWater value.
Renamed changeSettings() to updateSettings(); it now returns a promise to indicate completion. It takes the same options object as the constructor.
Renamed nbQueued() to queued().
Renamed nbRunning to running(); it now returns its result using a promise.
Removed isBlocked().
Removed changePenalty(); it is now done through the options object like any other limiter setting.
Removed changeReservoir(); it is now done through the options object like any other limiter setting.
Removed stopAll(). Use the new stop() method.
check() now accepts an optional weight argument, and returns its result using a promise.
Removed the Group changeTimeout() method. Instead, pass a timeout option when creating a Group.
Version 2 is more user-friendly and powerful.
After upgrading your code, please take a minute to read the Debugging your application chapter.
This README is always in need of improvements. If wording can be clearer and simpler, please consider forking this repo and submitting a Pull Request, or simply opening an issue.
Suggestions and bug reports are also welcome.
To work on the Bottleneck code, simply clone the repo, make your changes to the files located in src/ only, then run ./scripts/build.sh && npm test to ensure that everything is set up correctly.
To speed up compilation time during development, run ./scripts/build.sh dev instead. Make sure to build and test without dev before submitting a PR.
The tests must also pass in Clustering mode and using the ES5 bundle. You'll need a Redis server running locally (latency needs to be minimal to run the tests). If the server isn't using the default hostname and port, you can set those in the .env file. Then run ./scripts/build.sh && npm run test-all.
All contributions are appreciated and will be considered.
Author: SGrondin
Source Code: https://github.com/SGrondin/bottleneck
License: MIT License
1643303820
Bottleneck is a lightweight and zero-dependency Task Scheduler and Rate Limiter for Node.js and the browser.
Bottleneck is an easy solution as it adds very little complexity to your code. It is battle-hardened, reliable and production-ready and used on a large scale in private companies and open source software.
It supports Clustering: it can rate limit jobs across multiple Node.js instances. It uses Redis and strictly atomic operations to stay reliable in the presence of unreliable clients and networks. It also supports Redis Cluster and Redis Sentinel.
submit()
schedule()
wrap()
updateSettings()
incrementReservoir()
currentReservoir()
stop()
chain()
npm install --save bottleneck
import Bottleneck from "bottleneck";
// Note: To support older browsers and Node <6.0, you must import the ES5 bundle instead.
var Bottleneck = require("bottleneck/es5");
Most APIs have a rate limit. For example, to execute 3 requests per second:
const limiter = new Bottleneck({
minTime: 333
});
If there's a chance some requests might take longer than 333ms and you want to prevent more than 1 request from running at a time, add maxConcurrent: 1:
const limiter = new Bottleneck({
maxConcurrent: 1,
minTime: 333
});
minTime and maxConcurrent are enough for the majority of use cases. They work well together to ensure a smooth rate of requests. If your use case requires executing requests in bursts or every time a quota resets, look into Reservoir Intervals.
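As a rough aid, the arithmetic behind the 333 ms figure can be wrapped in a small helper. The sketch below is not part of Bottleneck: optionsForRate is a hypothetical name, and it rounds up so that the target rate is never exceeded, which gives 334 rather than 333 for 3 requests per second.

```javascript
// Hypothetical helper (not part of Bottleneck): derive limiter options
// from a target rate expressed in requests per second.
function optionsForRate(requestsPerSecond, maxConcurrent = 1) {
  if (!(requestsPerSecond > 0)) {
    throw new RangeError("requestsPerSecond must be a positive number");
  }
  return {
    maxConcurrent,
    // Round up so that jobs are never launched faster than the target rate.
    minTime: Math.ceil(1000 / requestsPerSecond),
  };
}

console.log(optionsForRate(3)); // { maxConcurrent: 1, minTime: 334 }
```

The returned object has the same shape as the constructor options shown above, so it could be passed directly, for example new Bottleneck(optionsForRate(3)).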
Instead of this:
myFunction(arg1, arg2)
.then((result) => {
/* handle result */
});
Do this:
limiter.schedule(() => myFunction(arg1, arg2))
.then((result) => {
/* handle result */
});
Or this:
const wrapped = limiter.wrap(myFunction);
wrapped(arg1, arg2)
.then((result) => {
/* handle result */
});
Instead of this:
const result = await myFunction(arg1, arg2);
Do this:
const result = await limiter.schedule(() => myFunction(arg1, arg2));
Or this:
const wrapped = limiter.wrap(myFunction);
const result = await wrapped(arg1, arg2);
Instead of this:
someAsyncCall(arg1, arg2, callback);
Do this:
limiter.submit(someAsyncCall, arg1, arg2, callback);
Remember...
Bottleneck builds a queue of jobs and executes them as soon as possible. By default, the jobs will be executed in the order they were received.
Read the 'Gotchas' and you're good to go. Or keep reading to learn about all the fine tuning and advanced options available. If your rate limits need to be enforced across a cluster of computers, read the Clustering docs.
Need help debugging your application?
Instead of throttling maybe you want to batch up requests into fewer calls?
Make sure the function you pass to schedule() or wrap() only returns once all the work it does has completed.
Instead of this:
limiter.schedule(() => {
tasksArray.forEach(x => processTask(x));
// BAD, we return before our processTask() functions are finished processing!
});
Do this:
limiter.schedule(() => {
const allTasks = tasksArray.map(x => processTask(x));
// GOOD, we wait until all tasks are done.
return Promise.all(allTasks);
});
If you pass an object's method as a job, you will probably need to bind() the object:
// instead of this:
limiter.schedule(object.doSomething);
// do this:
limiter.schedule(object.doSomething.bind(object));
// or, wrap it in an arrow function instead:
limiter.schedule(() => object.doSomething());
Bottleneck requires Node 6+ to function. However, an ES5 build is included: var Bottleneck = require("bottleneck/es5");.
Make sure you're catching "error" events emitted by your limiters!
Consider setting a maxConcurrent value instead of leaving it null. This can help your application's performance, especially if you think the limiter's queue might become very long.
If you plan on using priorities, make sure to set a maxConcurrent value.
When using submit(), if a callback isn't necessary, you must pass null or an empty function instead. It will not work otherwise.
When using submit(), make sure all the jobs will eventually complete by calling their callback, or set an expiration. Even if you submitted your job with a null callback, it still needs to call its callback. This is particularly important if you are using a maxConcurrent value that isn't null (unlimited); otherwise the jobs that never complete will clog up the limiter and no new jobs will be allowed to run. It's safe to call the callback more than once; subsequent calls are ignored.
Using tools like mockdate in your tests to change time in JavaScript will likely result in undefined behavior from Bottleneck.
const limiter = new Bottleneck({/* options */});
Basic options:
Option | Default | Description |
---|---|---|
maxConcurrent | null (unlimited) | How many jobs can be executing at the same time. Consider setting a value instead of leaving it null , it can help your application's performance, especially if you think the limiter's queue might get very long. |
minTime | 0 ms | How long to wait after launching a job before launching another one. |
highWater | null (unlimited) | How long can the queue be? When the queue length exceeds that value, the selected strategy is executed to shed the load. |
strategy | Bottleneck.strategy.LEAK | Which strategy to use when the queue gets longer than the high water mark. Read about strategies. Strategies are never executed if highWater is null . |
penalty | 15 * minTime , or 5000 when minTime is 0 | The penalty value used by the BLOCK strategy. |
reservoir | null (unlimited) | How many jobs can be executed before the limiter stops executing jobs. If reservoir reaches 0 , no jobs will be executed until it is no longer 0 . New jobs will still be queued up. |
reservoirRefreshInterval | null (disabled) | Every reservoirRefreshInterval milliseconds, the reservoir value will be automatically updated to the value of reservoirRefreshAmount . The reservoirRefreshInterval value should be a multiple of 250 (5000 for Clustering). |
reservoirRefreshAmount | null (disabled) | The value to set reservoir to when reservoirRefreshInterval is in use. |
reservoirIncreaseInterval | null (disabled) | Every reservoirIncreaseInterval milliseconds, the reservoir value will be automatically incremented by reservoirIncreaseAmount . The reservoirIncreaseInterval value should be a multiple of 250 (5000 for Clustering). |
reservoirIncreaseAmount | null (disabled) | The increment applied to reservoir when reservoirIncreaseInterval is in use. |
reservoirIncreaseMaximum | null (disabled) | The maximum value that reservoir can reach when reservoirIncreaseInterval is in use. |
Promise | Promise (built-in) | This lets you override the Promise library used by Bottleneck. |
Reservoir Intervals let you execute requests in bursts, by automatically controlling the limiter's reservoir value. The reservoir is simply the number of jobs the limiter is allowed to execute. Once the value reaches 0, it stops starting new jobs.
There are 2 types of Reservoir Intervals: Refresh Intervals and Increase Intervals.
In this example, we throttle to 100 requests every 60 seconds:
const limiter = new Bottleneck({
reservoir: 100, // initial value
reservoirRefreshAmount: 100,
reservoirRefreshInterval: 60 * 1000, // must be divisible by 250
// also use maxConcurrent and/or minTime for safety
maxConcurrent: 1,
minTime: 333 // pick a value that makes sense for your use case
});
reservoir is a counter decremented every time a job is launched. We set its initial value to 100. Then, every reservoirRefreshInterval (60000 ms), reservoir is automatically updated to be equal to the reservoirRefreshAmount (100).
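To make the counter behavior concrete, here is a simplified, synchronous model of a Refresh Interval. It is a sketch and not Bottleneck's implementation: the RefreshingReservoir class and its launch() and advance() methods are hypothetical names, and time is advanced by hand instead of using real timers.

```javascript
// Simplified, synchronous model of a Refresh Interval. It only mimics the
// counter logic described above and is not Bottleneck's implementation.
class RefreshingReservoir {
  constructor({ reservoir, reservoirRefreshAmount, reservoirRefreshInterval }) {
    this.reservoir = reservoir;
    this.refreshAmount = reservoirRefreshAmount;
    this.refreshInterval = reservoirRefreshInterval;
    this.elapsed = 0; // simulated ms since the last refresh
  }

  // Try to launch one job; returns false when the reservoir is too low.
  launch(weight = 1) {
    if (this.reservoir < weight) return false;
    this.reservoir -= weight;
    return true;
  }

  // Advance the simulated clock; a completed interval resets the counter.
  advance(ms) {
    this.elapsed += ms;
    if (this.elapsed >= this.refreshInterval) {
      this.elapsed %= this.refreshInterval;
      this.reservoir = this.refreshAmount;
    }
  }
}

const reservoirModel = new RefreshingReservoir({
  reservoir: 100,
  reservoirRefreshAmount: 100,
  reservoirRefreshInterval: 60 * 1000,
});
for (let i = 0; i < 100; i++) reservoirModel.launch();
console.log(reservoirModel.launch()); // false: the reservoir is at 0
reservoirModel.advance(60 * 1000);
console.log(reservoirModel.launch()); // true: the reservoir was reset to 100
```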
In this example, we throttle jobs to meet the Shopify API Rate Limits. Users are allowed to send 40 requests initially, then every second grants 2 more requests up to a maximum of 40.
const limiter = new Bottleneck({
reservoir: 40, // initial value
reservoirIncreaseAmount: 2,
reservoirIncreaseInterval: 1000, // must be divisible by 250
reservoirIncreaseMaximum: 40,
// also use maxConcurrent and/or minTime for safety
maxConcurrent: 5,
minTime: 250 // pick a value that makes sense for your use case
});
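The Increase Interval settings above can be sanity-checked with a little arithmetic. The function below is a standalone sketch, not library code: reservoirAfter is a hypothetical name, and it assumes that no job is launched during the elapsed time, that increases fire exactly on schedule, and that a null maximum means no upper bound.

```javascript
// Standalone sketch of the Increase Interval arithmetic (not library code).
function reservoirAfter(current, elapsedMs, settings) {
  const {
    reservoirIncreaseAmount,
    reservoirIncreaseInterval,
    reservoirIncreaseMaximum,
  } = settings;
  // Only whole intervals add to the reservoir.
  const ticks = Math.floor(elapsedMs / reservoirIncreaseInterval);
  const raised = current + ticks * reservoirIncreaseAmount;
  // Assumption: a null maximum means the reservoir has no upper bound.
  return reservoirIncreaseMaximum === null
    ? raised
    : Math.min(raised, reservoirIncreaseMaximum);
}

const shopifySettings = {
  reservoirIncreaseAmount: 2,
  reservoirIncreaseInterval: 1000,
  reservoirIncreaseMaximum: 40,
};
console.log(reservoirAfter(0, 5000, shopifySettings));  // 10
console.log(reservoirAfter(39, 1000, shopifySettings)); // 40 (capped)
```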
Reservoir Intervals are an advanced feature. Please take the time to read and understand the following warnings.
Reservoir Intervals are not a replacement for minTime and maxConcurrent. It's strongly recommended to also use minTime and/or maxConcurrent to spread out the load. For example, suppose a lot of jobs are queued up because the reservoir is 0. Every time the Refresh Interval is triggered, a number of jobs equal to reservoirRefreshAmount will automatically be launched, all at the same time! To prevent this flooding effect and keep your application running smoothly, use minTime and maxConcurrent to stagger the jobs.
The Reservoir Interval starts from the moment the limiter is created. Let's suppose we're using reservoirRefreshAmount: 5. If you happen to add 10 jobs just 1ms before the refresh is triggered, the first 5 will run immediately, then 1ms later it will refresh the reservoir value and that will make the last 5 also run right away. It will have run 10 jobs in just over 1ms no matter what your reservoir interval was!
Reservoir Intervals prevent a limiter from being garbage collected. Call limiter.disconnect() to clear the interval and allow the memory to be freed. However, it's not necessary to call .disconnect() to allow the Node.js process to exit.
Adds a job to the queue. This is the callback version of schedule().
limiter.submit(someAsyncCall, arg1, arg2, callback);
You can pass null instead of an empty function if there is no callback, but someAsyncCall still needs to call its callback to let the limiter know it has completed its work.
submit() can also accept advanced options.
Adds a job to the queue. This is the Promise and async/await version of submit().
const fn = function(arg1, arg2) {
return httpGet(arg1, arg2); // Here httpGet() returns a promise
};
limiter.schedule(fn, arg1, arg2)
.then((result) => {
/* ... */
});
In other words, schedule() takes a function fn and a list of arguments. schedule() returns a promise that resolves with the result of fn, which is executed according to the rate limits.
schedule() can also accept advanced options.
Here's another example:
// suppose that `client.get(url)` returns a promise
const url = "https://wikipedia.org";
limiter.schedule(() => client.get(url))
.then(response => console.log(response.body));
Takes a function that returns a promise. Returns a function identical to the original, but rate limited.
const wrapped = limiter.wrap(fn);
wrapped()
.then(function (result) {
/* ... */
})
.catch(function (error) {
// Bottleneck might need to fail the job even if the original function can never fail.
// For example, your job is taking longer than the `expiration` time you've set.
});
submit(), schedule(), and wrap() all accept advanced options.
// Submit
limiter.submit({/* options */}, someAsyncCall, arg1, arg2, callback);
// Schedule
limiter.schedule({/* options */}, fn, arg1, arg2);
// Wrap
const wrapped = limiter.wrap(fn);
wrapped.withOptions({/* options */}, arg1, arg2);
Option | Default | Description |
---|---|---|
priority | 5 | A priority between 0 and 9. A job with a priority of 4 will be queued ahead of a job with a priority of 5. Important: You must set a low maxConcurrent value for priorities to work, otherwise there is nothing to queue because jobs will be scheduled immediately! |
weight | 1 | Must be an integer equal to or higher than 0 . The weight is what increases the number of running jobs (up to maxConcurrent ) and decreases the reservoir value. |
expiration | null (unlimited) | The number of milliseconds a job is given to complete. Jobs that execute for longer than expiration ms will be failed with a BottleneckError . |
id | <no-id> | You should give an ID to your jobs, it helps with debugging. |
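The effect of the priority option can be pictured with a small ordering model. This is an illustration only, not Bottleneck's queue implementation: executionOrder is a hypothetical name, and it assumes that jobs with equal priority keep their arrival order, consistent with the default behavior described earlier.

```javascript
// Illustrative model of queue ordering by priority (not Bottleneck internals).
// Jobs are listed in arrival order; a lower `priority` number runs first.
function executionOrder(jobs) {
  return jobs
    .map((job, arrival) => ({ priority: 5, ...job, arrival }))
    .sort((a, b) => a.priority - b.priority || a.arrival - b.arrival)
    .map((job) => job.id);
}

console.log(
  executionOrder([
    { id: "report" },                // default priority 5
    { id: "checkout", priority: 1 },
    { id: "cleanup", priority: 9 },
    { id: "email" },                 // default priority 5
  ])
); // [ 'checkout', 'report', 'email', 'cleanup' ]
```

As the table notes, this ordering only matters when jobs actually wait in the queue, which requires a low maxConcurrent value.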
A strategy is a simple algorithm that is executed every time adding a job would cause the number of queued jobs to exceed highWater. Strategies are never executed if highWater is null.
Bottleneck.strategy.LEAK
When adding a new job to a limiter, if the queue length reaches highWater, drop the oldest job with the lowest priority. This is useful when jobs that have been waiting for too long are not important anymore. If all the queued jobs are more important (based on their priority value) than the one being added, it will not be added.
Bottleneck.strategy.OVERFLOW_PRIORITY
Same as LEAK, except it will only drop jobs that are less important than the one being added. If all the queued jobs are as or more important than the new one, it will not be added.
Bottleneck.strategy.OVERFLOW
When adding a new job to a limiter, if the queue length reaches highWater, do not add the new job. This strategy totally ignores priority levels.
Bottleneck.strategy.BLOCK
When adding a new job to a limiter, if the queue length reaches highWater, the limiter falls into "blocked mode". All queued jobs are dropped and no new jobs will be accepted until the limiter unblocks. It will unblock after penalty milliseconds have passed without receiving a new job. penalty is equal to 15 * minTime (or 5000 if minTime is 0) by default. This strategy is ideal when bruteforce attacks are to be expected. This strategy totally ignores priority levels.
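The difference between the three queue-based strategies (LEAK, OVERFLOW_PRIORITY and OVERFLOW in Bottleneck.strategy) is easier to see side by side. The function below is a standalone model written from the descriptions above, not Bottleneck's internal code: shedLoad is a hypothetical name, the queue is a plain array ordered oldest first, and the time-based BLOCK strategy is deliberately left out.

```javascript
// Standalone model of the load-shedding rules described above; this is not
// Bottleneck's internal code. Lower `priority` numbers are more important.
const ACCEPT_RULES = new Map([
  // LEAK: the new job is refused only if every queued job is more important.
  ["LEAK", (victim, job) => victim.priority >= job.priority],
  // OVERFLOW_PRIORITY: only drop a job strictly less important than the new one.
  ["OVERFLOW_PRIORITY", (victim, job) => victim.priority > job.priority],
  // OVERFLOW: never make room, priorities are ignored.
  ["OVERFLOW", () => false],
]);

function shedLoad(queue, incoming, highWater, strategy) {
  const accepts = ACCEPT_RULES.get(strategy);
  if (!accepts) throw new Error(`Unsupported strategy: ${strategy}`);
  // Below the high-water mark nothing is shed: the job is simply queued.
  if (highWater === null || queue.length < highWater) {
    return { queue: [...queue, incoming], dropped: null };
  }
  if (queue.length === 0) return { queue, dropped: incoming };
  // Victim: the oldest job among those with the least important priority.
  const worst = Math.max(...queue.map((job) => job.priority));
  const victim = queue.find((job) => job.priority === worst);
  if (!accepts(victim, incoming)) return { queue, dropped: incoming };
  return {
    queue: [...queue.filter((job) => job !== victim), incoming],
    dropped: victim,
  };
}

const waiting = [{ id: "a", priority: 5 }, { id: "b", priority: 5 }];
const incomingJob = { id: "c", priority: 5 };
console.log(shedLoad(waiting, incomingJob, 2, "LEAK").dropped.id);              // a
console.log(shedLoad(waiting, incomingJob, 2, "OVERFLOW_PRIORITY").dropped.id); // c
console.log(shedLoad(waiting, incomingJob, 2, "OVERFLOW").dropped.id);          // c
```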
minTime setting.
Note: By default, Bottleneck does not keep track of DONE jobs, to save memory. You can enable this feature by passing trackDoneStatus: true as an option when creating a limiter.
const counts = limiter.counts();
console.log(counts);
/*
{
RECEIVED: 0,
QUEUED: 0,
RUNNING: 0,
EXECUTING: 0,
DONE: 0
}
*/
Returns an object with the current number of jobs per status in the limiter.
console.log(limiter.jobStatus("some-job-id"));
// Example: QUEUED
Returns the status of the job with the provided job id in the limiter. Returns null if no job with that id exists.
console.log(limiter.jobs("RUNNING"));
// Example: ['id1', 'id2']
Returns an array of all the job ids with the specified status in the limiter. Not passing a status string returns all the known ids.
const count = limiter.queued(priority);
console.log(count);
priority is optional. Returns the number of QUEUED jobs with the given priority level. Omitting the priority argument returns the total number of queued jobs in the limiter.
const count = await limiter.clusterQueued();
console.log(count);
Returns the number of QUEUED
jobs in the Cluster.
if (limiter.empty()) {
// do something...
}
Returns a boolean which indicates whether there are any RECEIVED
or QUEUED
jobs in the limiter.
limiter.running()
.then((count) => console.log(count));
Returns a promise that returns the total weight of the RUNNING
and EXECUTING
jobs in the Cluster.
limiter.done()
.then((count) => console.log(count));
Returns a promise that returns the total weight of DONE
jobs in the Cluster. Does not require passing the trackDoneStatus: true
option.
limiter.check()
.then((wouldRunNow) => console.log(wouldRunNow));
Checks if a new job would be executed immediately if it was submitted now. Returns a promise that returns a boolean.
'error'
limiter.on("error", function (error) {
/* handle errors here */
});
The two main causes of error events are: uncaught exceptions in your event handlers, and network errors when Clustering is enabled.
'failed'
limiter.on("failed", function (error, jobInfo) {
// This will be called every time a job fails.
});
'retry'
See Retries to learn how to automatically retry jobs.
limiter.on("retry", function (message, jobInfo) {
// This will be called every time a job is retried.
});
'empty'
limiter.on("empty", function () {
// This will be called when `limiter.empty()` becomes true.
});
'idle'
limiter.on("idle", function () {
// This will be called when `limiter.empty()` is `true` and `limiter.running()` is `0`.
});
'dropped'
limiter.on("dropped", function (dropped) {
// This will be called when a strategy was triggered.
// The dropped request is passed to this event listener.
});
'depleted'
limiter.on("depleted", function (empty) {
// This will be called every time the reservoir drops to 0.
// The `empty` (boolean) argument indicates whether `limiter.empty()` is currently true.
});
'debug'
limiter.on("debug", function (message, data) {
// Useful to figure out what the limiter is doing in real time
// and to help debug your application
});
'received' 'queued' 'scheduled' 'executing' 'done'
limiter.on("queued", function (info) {
// This event is triggered when a job transitions from one Lifecycle stage to another
});
See Jobs Lifecycle for more information.
These Lifecycle events are not triggered for jobs located on another limiter in a Cluster, for performance reasons.
Use removeAllListeners()
with an optional event name as first argument to remove listeners.
Use .once()
instead of .on()
to only receive a single event.
The following example:
const limiter = new Bottleneck();
// Listen to the "failed" event
limiter.on("failed", async (error, jobInfo) => {
const id = jobInfo.options.id;
console.warn(`Job ${id} failed: ${error}`);
if (jobInfo.retryCount === 0) { // Here we only retry once
console.log(`Retrying job ${id} in 25ms!`);
return 25;
}
});
// Listen to the "retry" event
limiter.on("retry", (error, jobInfo) => console.log(`Now retrying ${jobInfo.options.id}`));
const main = async function () {
let executions = 0;
// Schedule one job
const result = await limiter.schedule({ id: 'ABC123' }, async () => {
executions++;
if (executions === 1) {
throw new Error("Boom!");
} else {
return "Success!";
}
});
console.log(`Result: ${result}`);
}
main();
will output
Job ABC123 failed: Error: Boom!
Retrying job ABC123 in 25ms!
Now retrying ABC123
Result: Success!
To re-run your job, simply return an integer from the 'failed' event handler. The number returned is how many milliseconds to wait before retrying it. Return 0 to retry it immediately.
IMPORTANT: When you ask the limiter to retry a job it will not send it back into the queue. It will stay in the EXECUTING state until it succeeds or until you stop retrying it. This means that it counts as a concurrent job for maxConcurrent even while it's just waiting to be retried. The number of milliseconds to wait ignores your minTime settings.
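Because the handler only has to return a number of milliseconds, a retry policy can be written as a plain function. The sketch below shows one possible policy, exponential backoff with a retry cap. The backoff factory and its baseDelay and maxRetries parameters are hypothetical names; only jobInfo.retryCount comes from the example above.

```javascript
// Hypothetical "failed" handler factory: retry up to `maxRetries` times,
// doubling the wait each time. Returning a number asks for a retry after
// that many milliseconds; returning undefined lets the job fail.
function backoff({ baseDelay = 25, maxRetries = 3 } = {}) {
  return function onFailed(error, jobInfo) {
    if (jobInfo.retryCount >= maxRetries) return undefined;
    return baseDelay * 2 ** jobInfo.retryCount;
  };
}

const onFailed = backoff();
console.log(onFailed(new Error("Boom!"), { retryCount: 0 })); // 25
console.log(onFailed(new Error("Boom!"), { retryCount: 2 })); // 100
console.log(onFailed(new Error("Boom!"), { retryCount: 3 })); // undefined
// With a real limiter: limiter.on("failed", backoff());
```

Keep the warning above in mind when choosing delays: a job waiting to be retried still occupies a maxConcurrent slot, so long backoff delays reduce throughput for other jobs.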
limiter.updateSettings(options);
The options are the same as the limiter constructor.
Note: Changes don't affect SCHEDULED
jobs.
limiter.incrementReservoir(incrementBy);
Returns a promise that returns the new reservoir value.
limiter.currentReservoir()
.then((reservoir) => console.log(reservoir));
Returns a promise that returns the current reservoir value.
The stop()
method is used to safely shutdown a limiter. It prevents any new jobs from being added to the limiter and waits for all EXECUTING
jobs to complete.
limiter.stop(options)
.then(() => {
console.log("Shutdown completed!")
});
stop()
returns a promise that resolves once all the EXECUTING
jobs have completed and, if desired, once all non-EXECUTING
jobs have been dropped.
Option | Default | Description |
---|---|---|
dropWaitingJobs | true | When true , drop all the RECEIVED , QUEUED and RUNNING jobs. When false , allow those jobs to complete before resolving the Promise returned by this method. |
dropErrorMessage | This limiter has been stopped. | The error message used to drop jobs when dropWaitingJobs is true . |
enqueueErrorMessage | This limiter has been stopped and cannot accept new jobs. | The error message used to reject a job added to the limiter after stop() has been called. |
Tasks that are ready to be executed will be added to that other limiter. Suppose you have 2 types of tasks, A and B. They both have their own limiter with their own settings, but both must also follow a global limiter G:
const limiterA = new Bottleneck( /* some settings */ );
const limiterB = new Bottleneck( /* some different settings */ );
const limiterG = new Bottleneck( /* some global settings */ );
limiterA.chain(limiterG);
limiterB.chain(limiterG);
// Requests added to limiterA must follow the A and G rate limits.
// Requests added to limiterB must follow the B and G rate limits.
// Requests added to limiterG must follow the G rate limits.
To unchain, call limiter.chain(null);
.
The Group
feature of Bottleneck manages many limiters automatically for you. It creates limiters dynamically and transparently.
Let's take a DNS server as an example of how Bottleneck can be used. It's a service that sees a lot of abuse and where incoming DNS requests need to be rate limited. Bottleneck is so tiny, it's acceptable to create one limiter for each origin IP, even if it means creating thousands of limiters. The Group
feature is perfect for this use case. Create one Group and use the origin IP to rate limit each IP independently. Each call with the same key (IP) will be routed to the same underlying limiter. A Group is created like a limiter:
const group = new Bottleneck.Group(options);
The options
object will be used for every limiter created by the Group.
The Group is then used with the .key(str)
method:
// In this example, the key is an IP
group.key("77.66.54.32").schedule(() => {
/* process the request */
});
str
: The key to use. All jobs added with the same key will use the same underlying limiter. Default: ""
The return value of .key(str)
is a limiter. If it doesn't already exist, it is generated for you. Calling key()
is how limiters are created inside a Group.
Limiters that have been idle for longer than 5 minutes are deleted to avoid memory leaks, this value can be changed by passing a different timeout
option, in milliseconds.
group.on("created", (limiter, key) => {
console.log("A new limiter was created for key: " + key)
// Prepare the limiter, for example we'll want to listen to its "error" events!
limiter.on("error", (err) => {
// Handle errors here
})
});
Listening for the "created"
event is the recommended way to set up a new limiter. Your event handler is executed before key()
returns the newly created limiter.
const group = new Bottleneck.Group({ maxConcurrent: 2, minTime: 250 });
group.updateSettings({ minTime: 500 });
After executing the above commands, new limiters will be created with { maxConcurrent: 2, minTime: 500 }
.
str: The key for the limiter to delete.
Manually deletes the limiter at the specified key. When using Clustering, the Redis data is immediately deleted and the other Groups in the Cluster will eventually delete their local key automatically, unless it is still being used.
Returns an array containing all the keys in the Group.
Same as group.keys()
, but returns all keys in this Group ID across the Cluster.
const limiters = group.limiters();
console.log(limiters);
// [ { key: "some key", limiter: <limiter> }, { key: "some other key", limiter: <some other limiter> } ]
Some APIs can accept multiple operations in a single call. Bottleneck's Batching feature helps you take advantage of those APIs:
const batcher = new Bottleneck.Batcher({
maxTime: 1000,
maxSize: 10
});
batcher.on("batch", (batch) => {
console.log(batch); // ["some-data", "some-other-data"]
// Handle batch here
});
batcher.add("some-data");
batcher.add("some-other-data");
batcher.add() returns a Promise that resolves once the request has been flushed to a "batch" event.
Option | Default | Description |
---|---|---|
maxTime | null (unlimited) | Maximum acceptable time (in milliseconds) a request can have to wait before being flushed to the "batch" event. |
maxSize | null (unlimited) | Maximum number of requests in a batch. |
Batching doesn't throttle requests; it only groups them up optimally according to your maxTime and maxSize settings.
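To show how maxSize shapes the batches, here is a simplified, synchronous model. It is not the Batcher implementation: BatchModel is a hypothetical class, and the maxTime timer is replaced by an explicit flush() call so that the example stays deterministic.

```javascript
// Simplified, synchronous model of size-based batching. The maxTime timer
// is replaced here by an explicit flush() call.
class BatchModel {
  constructor({ maxSize }, onBatch) {
    this.maxSize = maxSize;
    this.onBatch = onBatch;
    this.pending = [];
  }

  // Queue one item and flush as soon as the batch is full.
  add(item) {
    this.pending.push(item);
    if (this.maxSize !== null && this.pending.length >= this.maxSize) {
      this.flush();
    }
  }

  // Hand the pending items to the callback (stands in for the maxTime timer).
  flush() {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.onBatch(batch);
  }
}

const batches = [];
const batchModel = new BatchModel({ maxSize: 2 }, (batch) => batches.push(batch));
["a", "b", "c"].forEach((item) => batchModel.add(item));
batchModel.flush(); // simulates maxTime elapsing for the leftover item
console.log(batches); // [ [ 'a', 'b' ], [ 'c' ] ]
```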
Clustering lets many limiters access the same shared state, stored in Redis. Changes to the state are Atomic, Consistent and Isolated (and fully ACID with the right Durability configuration), to eliminate any chances of race conditions or state corruption. Your settings, such as maxConcurrent
, minTime
, etc., are shared across the whole cluster, which means —for example— that { maxConcurrent: 5 }
guarantees no more than 5 jobs can ever run at a time in the entire cluster of limiters. 100% of Bottleneck's features are supported in Clustering mode. Enabling Clustering is as simple as changing a few settings. It's also a convenient way to store or export state for later use.
Bottleneck will attempt to spread load evenly across limiters.
First, add redis
or ioredis
to your application's dependencies:
# NodeRedis (https://github.com/NodeRedis/node_redis)
npm install --save redis
# or ioredis (https://github.com/luin/ioredis)
npm install --save ioredis
Then create a limiter or a Group:
const limiter = new Bottleneck({
/* Some basic options */
maxConcurrent: 5,
minTime: 500,
id: "my-super-app" // All limiters with the same id will be clustered together
/* Clustering options */
datastore: "redis", // or "ioredis"
clearDatastore: false,
clientOptions: {
host: "127.0.0.1",
port: 6379
// Redis client options
// Using NodeRedis? See https://github.com/NodeRedis/node_redis#options-object-properties
// Using ioredis? See https://github.com/luin/ioredis/blob/master/API.md#new-redisport-host-options
}
});
Option | Default | Description |
---|---|---|
datastore | "local" | Where the limiter stores its internal state. The default ("local" ) keeps the state in the limiter itself. Set it to "redis" or "ioredis" to enable Clustering. |
clearDatastore | false | When set to true , on initial startup, the limiter will wipe any existing Bottleneck state data on the Redis db. |
clientOptions | {} | This object is passed directly to the redis client library you've selected. |
clusterNodes | null | ioredis only. When clusterNodes is not null, the client will be instantiated by calling new Redis.Cluster(clusterNodes, clientOptions) instead of new Redis(clientOptions) . |
timeout | null (no TTL) | The Redis TTL in milliseconds (TTL) for the keys created by the limiter. When timeout is set, the limiter's state will be automatically removed from Redis after timeout milliseconds of inactivity. |
Redis | null | Overrides the import/require of the redis/ioredis library. You shouldn't need to set this option unless your application is failing to start due to a failure to require/import the client library. |
Note: When using Groups, the timeout
option has a default of 300000
milliseconds and the generated limiters automatically receive an id
with the pattern ${group.id}-${KEY}
.
Note: If you are seeing a runtime error due to the require()
function not being able to load redis
/ioredis
, then directly pass the module as the Redis
option. Example:
import Redis from "ioredis"
const limiter = new Bottleneck({
id: "my-super-app",
datastore: "ioredis",
clientOptions: { host: '12.34.56.78', port: 6379 },
Redis
});
Unfortunately, this is a side effect of having to disable inlining, which is necessary to make Bottleneck easy to use in the browser.
The first limiter connecting to Redis will store its constructor options on Redis and all subsequent limiters will be using those settings. You can alter the constructor options used by all the connected limiters by calling updateSettings()
. The clearDatastore
option instructs a new limiter to wipe any previous Bottleneck data (for that id
), including previously stored settings.
Queued jobs are NOT stored on Redis. They are local to each limiter. Exiting the Node.js process will lose those jobs. This is because Bottleneck has no way to propagate the JS code to run a job across a different Node.js process than the one it originated on. Bottleneck doesn't keep track of the queue contents of the limiters on a cluster for performance and reliability reasons. You can use something like BeeQueue
in addition to Bottleneck to get around this limitation.
Due to the above, functionality relying on the queue length happens purely locally:
highWater and load shedding (strategies) are per limiter. However, one limiter entering Blocked mode will put the entire cluster in Blocked mode until penalty milliseconds have passed. See Strategies.
The "empty" event is triggered when the (local) queue is empty.
The "idle" event is triggered when the (local) queue is empty and no jobs are currently running anywhere in the cluster.
You must work around these limitations in your application code if they are an issue to you. The publish() method could be useful here.
The current design guarantees reliability, is highly performant and lets limiters come and go. Your application can scale up or down, and clients can be disconnected at any time without issues.
It is strongly recommended that you give an id to every limiter and Group since it is used to build the name of your limiter's Redis keys! Limiters with the same id inside the same Redis db will be sharing the same datastore.
It is strongly recommended that you set an expiration (See Job Options) on every job, since that lets the cluster recover from crashed or disconnected clients. Otherwise, a client crashing while executing a job would not be able to tell the cluster to decrease its number of "running" jobs. By using expirations, those lost jobs are automatically cleared after the specified time has passed. Using expirations is essential to keeping a cluster reliable in the face of unpredictable application bugs, network hiccups, and so on.
Network latency between Node.js and Redis is not taken into account when calculating timings (such as minTime). To minimize the impact of latency, Bottleneck only performs a single Redis call per lifecycle transition. Keeping the Redis server close to your limiters will help you get a more consistent experience. Keeping the system time consistent across all clients will also help.
It is strongly recommended to set up an "error" listener on all your limiters and on your Groups.
The ready(), publish() and clients() methods also exist when using the local datastore, for code compatibility reasons: code written for redis/ioredis won't break with local.
This method returns a promise that resolves once the limiter is connected to Redis.
As of v2.9.0, it's no longer necessary to wait for .ready() to resolve before issuing commands to a limiter. The commands will be queued until the limiter successfully connects. Make sure to listen to the "error" event to handle connection errors.
const limiter = new Bottleneck({/* options */});
limiter.on("error", (err) => {
  // handle network errors
});
limiter.ready()
  .then(() => {
    // The limiter is ready
  });
This method broadcasts the message string to every limiter in the Cluster. It returns a promise.
const limiter = new Bottleneck({/* options */});
limiter.on("message", (msg) => {
  console.log(msg); // prints "this is a string"
});
limiter.publish("this is a string");
To send objects, stringify them first:
limiter.on("message", (msg) => {
  console.log(JSON.parse(msg).hello); // prints "world"
});
limiter.publish(JSON.stringify({ hello: "world" }));
If you need direct access to the redis clients, use .clients():
console.log(limiter.clients());
// { client: <Redis Client>, subscriber: <Redis Client> }
Bottleneck is compatible with Redis Clusters, but you must use the ioredis datastore and the clusterNodes option.
Bottleneck is compatible with Redis Sentinel, but you must use the ioredis datastore.
Bottleneck's data is stored in Redis keys starting with b_. It also uses pubsub channels starting with b_. It will not interfere with any other data stored on the server.
Bottleneck loads a few Lua scripts on the Redis server using the SCRIPT LOAD command. These scripts only take up a few Kb of memory. Running the SCRIPT FLUSH command will cause any connected limiters to experience critical errors until a new limiter connects to Redis and loads the scripts again.
Bottleneck needs to create 2 Redis Clients to function, one for normal operations and one for pubsub subscriptions. These 2 clients are kept in a Bottleneck.RedisConnection (NodeRedis) or a Bottleneck.IORedisConnection (ioredis) object, referred to as the Connection object.
By default, every Group and every standalone limiter (a limiter not created by a Group) will create their own Connection object, but it is possible to manually control this behavior. In this example, every Group and limiter is sharing the same Connection object and therefore the same 2 clients:
const connection = new Bottleneck.RedisConnection({
  clientOptions: {/* NodeRedis/ioredis options */}
  // ioredis also accepts `clusterNodes` here
});
const limiter = new Bottleneck({ connection: connection });
const group = new Bottleneck.Group({ connection: connection });
You can access and reuse the Connection object of any Group or limiter:
const group = new Bottleneck.Group({ connection: limiter.connection });
When a Connection object is created manually, the connectivity "error" events are emitted on the Connection itself.
connection.on("error", (err) => { /* handle connectivity errors here */ });
If you already have a NodeRedis/ioredis client, you can ask Bottleneck to reuse it, although currently the Connection object will still create a second client for pubsub operations:
import Redis from "redis";
// createClient() is a factory function, so it is called without `new`
const client = Redis.createClient({/* options */});
const connection = new Bottleneck.RedisConnection({
  // `clientOptions` and `clusterNodes` will be ignored since we're passing a raw client
  client: client
});
const limiter = new Bottleneck({ connection: connection });
const group = new Bottleneck.Group({ connection: connection });
Depending on your application, using more clients can improve performance.
Use the disconnect(flush) method to close the Redis clients.
limiter.disconnect();
group.disconnect();
If you created the Connection object manually, you need to call connection.disconnect() instead, for safety reasons.
Debugging complex scheduling logic can be difficult, especially when priorities, weights, and network latency all interact with one another.
If your application is not behaving as expected, start by making sure you're catching "error" events emitted by your limiters and your Groups. Those errors are most likely uncaught exceptions from your application code.
Make sure you've read the 'Gotchas' section.
To see exactly what a limiter is doing in real time, listen to the "debug" event. It contains detailed information about how the limiter is executing your code. Adding job IDs to all your jobs makes the debug output more readable.
When Bottleneck has to fail one of your jobs, it does so by using BottleneckError objects. This lets you tell those errors apart from your own code's errors:
limiter.schedule(fn)
  .then((result) => { /* ... */ } )
  .catch((error) => {
    if (error instanceof Bottleneck.BottleneckError) {
      /* ... */
    }
  });
The internal algorithms essentially haven't changed from v1, but many small changes to the interface were made to introduce new features.
All the breaking changes:
Use require("bottleneck/es5") if you need ES5 support in v2. Bottleneck v1 will continue to use ES5 only.
The Cluster feature is now called Group. This is to distinguish it from the new v2 Clustering feature.
The Group constructor takes an options object to match the limiter constructor.
Removed submitPriority(), use submit() with an options object instead.
Removed schedulePriority(), use schedule() with an options object instead.
The rejectOnDrop option is now true by default. It can be set to false if you wish to retain v1 behavior. However this option is left undocumented as enabling it is considered to be a poor practice.
Use null instead of 0 to indicate an unlimited maxConcurrent value.
Use null instead of -1 to indicate an unlimited highWater value.
Renamed changeSettings() to updateSettings(), it now returns a promise to indicate completion. It takes the same options object as the constructor.
Renamed nbQueued() to queued().
Renamed nbRunning to running(), it now returns its result using a promise.
Removed isBlocked().
Removed changePenalty(), it is now done through the options object like any other limiter setting.
Removed changeReservoir(), it is now done through the options object like any other limiter setting.
Removed stopAll(). Use the new stop() method.
check() now accepts an optional weight argument, and returns its result using a promise.
Removed the Group changeTimeout() method. Instead, pass a timeout option when creating a Group.
Version 2 is more user-friendly and powerful.
After upgrading your code, please take a minute to read the Debugging your application chapter.
This README is always in need of improvements. If wording can be clearer and simpler, please consider forking this repo and submitting a Pull Request, or simply opening an issue.
Suggestions and bug reports are also welcome.
To work on the Bottleneck code, simply clone the repo, make your changes to the files located in src/ only, then run ./scripts/build.sh && npm test to ensure that everything is set up correctly.
To speed up compilation time during development, run ./scripts/build.sh dev instead. Make sure to build and test without dev before submitting a PR.
The tests must also pass in Clustering mode and using the ES5 bundle. You'll need a Redis server running locally (latency needs to be minimal to run the tests). If the server isn't using the default hostname and port, you can set those in the .env file. Then run ./scripts/build.sh && npm run test-all.
All contributions are appreciated and will be considered.
Author: SGrondin
Source Code: https://github.com/SGrondin/bottleneck
License: MIT License
1641506040
dedupe is a Python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on structured data.
dedupe will help you:
dedupe takes in human training data and comes up with the best rules for your dataset to quickly and automatically find similar records, even with very large databases.
If you or your organization would like professional assistance in working with the dedupe library, Dedupe.io LLC offers consulting services. Read more about pricing and available services here.
A cloud service powered by the dedupe library for de-duplicating and finding matches in your data. It provides a step-by-step wizard for uploading your data, setting up a model, training, clustering and reviewing the results.
Dedupe.io also supports record linkage across data sources and continuous matching and training through an API.
For more, see the Dedupe.io product site, tutorials on how to use it, and differences between it and the dedupe library.
Dedupe is well adopted by the Python community. Check out this blogpost, a YouTube video on how to use Dedupe with Python, and a YouTube video on how to apply Dedupe at scale using Spark.
Command line tool for de-duplicating and linking CSV files. Read about it on Source Knight-Mozilla OpenNews.
If you only want to use dedupe, install it this way:
pip install dedupe
Familiarize yourself with dedupe's API, and get started on your project. Need inspiration? Have a look at some examples.
We recommend using virtualenv and virtualenvwrapper for working in a virtualized development environment. Read how to set up virtualenv.
Once you have virtualenvwrapper set up,
mkvirtualenv dedupe
git clone https://github.com/dedupeio/dedupe.git
cd dedupe
pip install "numpy>=1.9"
pip install -r requirements.txt
cython src/*.pyx
pip install -e .
If these tests pass, then everything should have been installed correctly!
pytest
Afterwards, whenever you want to work on dedupe,
workon dedupe
Unit tests of core dedupe functions
pytest
Using Deduplication
python tests/canonical.py
Using Record Linkage
python tests/canonical_matching.py
Dedupe is based on Mikhail Yuryevich Bilenko's Ph.D. dissertation: Learnable Similarity Functions and their Application to Record Linkage and Clustering.
If something is not behaving intuitively, it is a bug, and should be reported. Report it here
Copyright (c) 2019 Forest Gregg and Derek Eder. Released under the MIT License.
Third-party copyright in this distribution is noted where applicable.
If you use Dedupe in an academic work, please give this citation:
Forest Gregg and Derek Eder. 2019. Dedupe. https://github.com/dedupeio/dedupe.
Author: Dedupeio
Source Code: https://github.com/dedupeio/dedupe
License: MIT License
1634759280
Amazon Elastic Container Service is a managed container orchestration service which allows you to deploy and scale containerized applications. An overview of the features and pricing can be found at the AWS website.
ECS consists of a few components:
Here you will specify the Docker image to be used, memory, CPU, etc. for your container. You will create a Docker image for a basic Spring Boot Application, upload it to ECR, create a Task Definition for the image, create a Cluster and deploy the container by means of a Service to the Cluster.
1627911476
This Edureka video on "Clustering Algorithms" will help you understand the various aspects of clustering using K Means in Python.
#clustering #algorithms #datascience #python #kmeans
1624333080
K-means is one of the simplest unsupervised machine learning algorithms for the well-known data clustering problem. Clustering is one of the most common data analysis tasks used to get an intuition about the structure of the data. It is defined as finding subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different. In other words, we are trying to find homogeneous subgroups within the data. How similar the points in a group are is measured with a similarity metric such as a Euclidean-based distance or a correlation-based distance.
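To make the two similarity metrics mentioned above concrete, here is a minimal sketch in plain Python. The function names are illustrative choices, not part of any library, and correlation-based distance is taken here to mean one minus the Pearson correlation, which is one common convention.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """One minus the Pearson correlation (assumes neither vector is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Two vectors with the same shape but a different scale.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]
d_euclid = euclidean_distance(a, b)   # far apart in Euclidean terms
d_corr = correlation_distance(a, b)   # perfectly correlated, so distance 0
```

The two metrics can disagree: the vectors above are far apart by Euclidean distance yet identical in shape, so a correlation-based metric would place them in the same group.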
Clustering analysis can be done on the basis of features or samples: we either try to find subgroups of samples based on their features, or subgroups of features based on the samples. The practical applications are many: clustering is used in the recommender systems of Amazon and Netflix; given a medical image of a group of cells, a clustering algorithm could aid in identifying the centers of the cells; looking at the GPS data of a user's mobile device, their more frequently visited locations within a certain radius can be revealed; and for any set of unlabeled observations, clustering helps establish the existence of some structure that might indicate that the data is separable.
K-means is a clustering algorithm whose primary goal is to group similar elements or data points into a cluster.
K in k-means represents the number of clusters.
A cluster refers to a collection of data points aggregated together because of certain similarities.
K-means clustering is an iterative algorithm that starts with k randomly chosen values used as the means that define the clusters. Each data point belongs to the group represented by the mean value to which it is closest. The coordinates of this mean value are called the centroid.
Iteratively, the mean of each cluster's data points is computed, and the new means are used to restart the process until the means stop changing. The disadvantage of k-means is that it is a local search procedure and could miss global patterns.
The k initial centroids can be randomly selected. Another approach to choosing the initial centroids is to compute the entire dataset's mean and add k random coordinates to it to make k initial points. Another method is to determine the principal component of the data and divide it into k equal partitions. The mean of each partition can be used as an initial centroid.
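The first two initialization strategies can be sketched as follows. This is an illustrative sketch only: the function names, the `scale` parameter, and the choice of uniform random offsets are assumptions made for the example, not something the article specifies.

```python
import random

def init_random_points(points, k, rng):
    """Pick k distinct data points as the starting centroids."""
    return [list(p) for p in rng.sample(points, k)]

def init_mean_plus_offsets(points, k, rng, scale=1.0):
    """Start from the dataset mean and add k random offsets to it.

    `scale` (an assumed parameter) bounds each offset coordinate.
    """
    dims = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    return [
        [mean[d] + rng.uniform(-scale, scale) for d in range(dims)]
        for _ in range(k)
    ]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (5.0, 1.0)]
random_start = init_random_points(points, 3, rng)
offset_start = init_mean_plus_offsets(points, 3, rng, scale=2.0)
```

Because k-means is a local search, different starting centroids can lead to different final clusters, which is why implementations often run several initializations and keep the best result.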
#data-science #algorithms #clustering #k-means #machine-learning
1623087480
The k-means clustering algorithm is a foundational algorithm that every data scientist should know. It is popular because it is simple, fast, and efficient. It works by dividing all the points into a preselected number (k) of clusters based on the distance between the point and the center of each cluster. The original k-means algorithm is limited because it works only in the Euclidean space and results in suboptimal cluster assignments when the real clusters are unequal in size. Despite its shortcomings, k-means remains one of the most powerful tools for clustering and has been used in healthcare, natural language processing, and physical sciences.
Extensions of the k-means algorithms include smarter starting positions for its k centers, allowing variable cluster sizes, and including more distances than Euclidean distance. In this article, we will focus on methods like PAM, CLARA, and CLARANS, which incorporate distance measures beyond the Euclidean distance. These methods are yet to enjoy the fame of k-means because they are slower than k-means for large datasets without a comparable gain in optimality. However, as we will see in this article, researchers have developed newer versions of these algorithms that promise to provide better accuracy and speeds than k-means.
For anyone who needs a quick reminder, StatQuest has a great video on k-means clustering.
For this article, we will focus on where k-means fails. Vanilla k-means, as explained in the video, has several disadvantages:
The above figure shows an example of k-means applied to the mouse data set, where it performs poorly because the cluster sizes vary.
Instead of using the mean of the cluster to partition, the medoid, or the most centrally located data point in the cluster, can be used to partition the data points. The medoid is the point least dissimilar to all other points in the cluster, and it is also less sensitive to outliers in the data. These partitions can also use arbitrary distances instead of relying on the Euclidean distance. This is the crux of the clustering algorithm named Partitioning Around Medoids (PAM), and its extensions CLARA and CLARANS. Watch this video for a succinct explanation of the method.
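A small worked example shows why the medoid is less sensitive to outliers than the mean. The helper names below are illustrative, and the data is a made-up one-dimensional sample with a single outlier.

```python
def mean_point(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def medoid(items, distance):
    """Return the item with the smallest total distance to all items."""
    return min(items, key=lambda c: sum(distance(c, other) for other in items))

def abs_diff(a, b):
    """Any dissimilarity function works here; this one is |a - b|."""
    return abs(a - b)

values = [1.0, 2.0, 3.0, 4.0, 100.0]      # 100.0 is the outlier
center_mean = mean_point(values)          # 22.0, dragged toward the outlier
center_medoid = medoid(values, abs_diff)  # 3.0, an actual data point
```

The medoid is always one of the data points, and because it only needs a pairwise dissimilarity function, it works with distances for which a mean is not even defined.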
In short, the following are the steps involved in the PAM method (reference):
The time complexity of the PAM algorithm is on the order of O(k(n - k)^2), which makes it much slower than the k-means algorithm. Kaufman and Rousseeuw (1990) proposed an improvement that traded optimality for speed, named CLARA (Clustering LARge Applications). In CLARA, the main dataset is split into several smaller, randomly sampled subsets of the data. The PAM algorithm is applied to each subset to obtain the medoids for each set, and the set of medoids that gives the best performance on the main dataset is kept. Dudoit and Fridlyand (2003) improve the CLARA workflow by combining the medoids from different samples by voting or bagging, which aims to reduce the variability that would come from applying CLARA.
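The relationship between PAM and CLARA can be sketched in a few lines of plain Python. This is a simplified illustration, not the reference algorithm: the `pam` function below skips PAM's BUILD phase and simply starts from the first k points, and all function and parameter names (`n_samples`, `sample_size`, `seed`) are assumptions made for the example.

```python
import random

def total_cost(points, medoids, distance):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(distance(p, m) for m in medoids) for p in points)

def pam(points, k, distance):
    """Simplified PAM: greedy medoid/non-medoid swaps until nothing improves."""
    medoids = list(points[:k])  # assumption: naive start instead of BUILD
    best = total_cost(points, medoids, distance)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial, distance)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids, best

def clara(points, k, distance, n_samples=5, sample_size=4, seed=0):
    """Run PAM on random subsets; keep the medoids that score best on all points."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        subset = rng.sample(points, min(sample_size, len(points)))
        medoids, _subset_cost = pam(subset, k, distance)
        cost = total_cost(points, medoids, distance)  # evaluated on the full data
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
pam_medoids, pam_cost = pam(points, 2, manhattan)      # medoids (8, 8) and (1, 1), cost 4
clara_medoids, clara_cost = clara(points, 2, manhattan)
```

Each candidate swap in `pam` rescans every point, which is where the cost per pass comes from; `clara` pays that price only on small subsets and uses the full dataset just to score the resulting medoids.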
Another variation named CLARANS (Clustering Large Applications based upon RANdomized Search) (Ng and Han 2002) works by combining sampling and searching on a graph. In this graph, each node represents a set of k medoids. Each node is connected to another node if the set of k medoids in each node differs by one. The graph can be traversed until a local minimum is reached, and that minimum provides the best estimate for the medoids of the dataset.
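The graph search described above can be sketched as follows. This is a minimal illustration under stated assumptions: the parameter names `num_local` (number of restarts) and `max_neighbors` (failed tries before a node is declared a local minimum) are chosen for the example, and drawing an already-selected medoid is simply counted as a failed try.

```python
import random

def total_cost(points, medoids, distance):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(distance(p, m) for m in medoids) for p in points)

def clarans(points, k, distance, num_local=3, max_neighbors=20, seed=0):
    """Randomized search over the graph whose nodes are sets of k medoids."""
    rng = random.Random(seed)
    best_node, best_cost = None, float("inf")
    for _ in range(num_local):
        node = rng.sample(points, k)  # random starting node
        cost = total_cost(points, node, distance)
        tried = 0
        while tried < max_neighbors:
            # A neighbor differs from the current node in exactly one medoid.
            i = rng.randrange(k)
            candidate = rng.choice(points)
            if candidate in node:
                tried += 1
                continue
            neighbor = node[:i] + [candidate] + node[i + 1:]
            neighbor_cost = total_cost(points, neighbor, distance)
            if neighbor_cost < cost:
                # Move to the better neighbor and restart the counter.
                node, cost, tried = neighbor, neighbor_cost, 0
            else:
                tried += 1
        if cost < best_cost:
            best_node, best_cost = node, cost
    return best_node, best_cost

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
medoids, cost = clarans(points, 2, manhattan)
```

Unlike PAM, which examines every possible swap, this search inspects only a random sample of neighbors at each node, so the result is a local minimum with respect to the neighbors that were actually tried.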
Schubert and Rousseeuw (2019) proposed a faster version of PAM, which can be extended to CLARA, by changing how the algorithm caches the distance values. They summarize it well here:
“This caching was enabled by changing the nesting order of the loops in the algorithm, showing once more how much seemingly minor-looking implementation details can matter (Kriegel et al., 2017). As a second improvement, we propose to find the best swap for each medoid and execute as many as possible in each iteration, which reduces the number of iterations needed for convergence without loss of quality, as demonstrated in the experiments, and as supported by theoretical considerations. In this article, we proposed a modification of the popular PAM algorithm that typically yields an O(k) fold speedup, by clever caching of partial results in order to avoid recomputation.”
In another variation, Yue et al. (2016) proposed a MapReduce framework for speeding up the calculations of the k-medoids algorithm and named it the K-Medoids++ algorithm.
More recently, Tiwari et al. (2020) cast the problem of choosing k medoids as a multi-armed bandit problem and solved it using the Upper Confidence Bound algorithm. This variation was faster than PAM and matched its accuracy.
#2020 dec tutorials #overviews #algorithms #clustering #explained
1622135220
Machine Learning is one of the hottest technologies of 2020. As the amount of data grows day by day, the need for Machine Learning grows with it. Machine Learning is a vast topic with different algorithms and use cases in each domain and industry. One of its branches is Unsupervised Learning, which is where Clustering is used.
Unsupervised learning is a technique in which the machine learns from unlabeled data. Because the labels are unknown, there is no right answer for the machine to learn from; instead, the machine itself finds patterns in the given data to come up with answers to the business problem.
Clustering is an unsupervised Machine Learning technique that involves grouping the given unlabeled data. Given a cleaned data set, a clustering algorithm assigns each data point to a group. The algorithm assumes that data points in the same cluster should have similar properties, while data points in different clusters should have highly dissimilar properties.
In this article, we are going to learn about the need for clustering and the different types of clustering, along with their pros and cons.
Clustering is a widely used ML technique that allows us to find hidden relationships between the data points in our dataset.
Examples:
1) Customers are segmented by their similarity to previous customers, and the segments can be used for recommendations.
2) Based on a collection of text data, we can organize the data according to the content similarities in order to create a topic hierarchy.
3) Image processing mainly in biology research for identifying the underlying patterns.
4) Spam filtering.
5) Identifying Fraudulent and Criminal activities.
6) It can also be used for fantasy football and sports.
There are many types of Clustering Algorithms in Machine learning. We are going to discuss the below three algorithms in this article:
1) K-Means Clustering.
2) Mean-Shift Clustering.
3) DBSCAN.
K-Means is the most popular clustering algorithm in Machine Learning. It is used in many industries and taught in a lot of introductory courses. It is one of the easiest models to start with, both in implementation and understanding.
Step-1 We first choose the number of clusters k and randomly initialize their respective center points.
Step-2 Each data point is then classified by calculating the distance (Euclidean or Manhattan) between that point and each group center, and assigning the point to the cluster whose center is closest to it.
Step-3 We recompute each group center by taking the mean of all the vectors in the group.
Step-4 We repeat steps 2 and 3 for n iterations or until the group centers don't change much.
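The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: the function and parameter names (`max_iterations`, `tolerance`, `seed`) are assumptions made for the example, the centers are initialized from randomly chosen data points, and an empty cluster simply keeps its previous center.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, distance=euclidean, max_iterations=100, tolerance=1e-6, seed=0):
    rng = random.Random(seed)
    # Step 1: choose k and initialize the centers (here: k random data points).
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iterations):
        # Step 2: assign each point to the cluster with the nearest center.
        labels = [min(range(k), key=lambda j: distance(p, centers[j])) for p in points]
        # Step 3: recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centers.append(centers[j])  # empty cluster: keep the old center
        # Step 4: stop once the centers barely move.
        shift = max(distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tolerance:
            break
    return centers, labels

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 8.0)]
centers, labels = k_means(points, 2)
```

Passing a different `distance` function, for example a Manhattan distance, changes only the assignment in Step 2; the centers in Step 3 are still computed as means.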
#artificial intelligence #clustering