
A Scikit-learn Based Module for Multi-label Et. Al. Classification

Scikit-multilearn

scikit-multilearn is a Python module for performing multi-label learning tasks. It is built on top of various scientific Python packages (numpy, scipy) and follows an API similar to that of scikit-learn.

Features

Native Python implementation. A native Python implementation for a variety of multi-label classification algorithms. To see the list of all supported classifiers, check this link.

Interface to Meka. A Meka wrapper class is implemented for reference purposes and integration. This provides access to all methods available in MEKA, MULAN, and WEKA — the reference standard in the field.

Builds upon giants! Team up with the power of numpy and scikit-learn: you can use scikit-learn's base classifiers as scikit-multilearn's classifiers. In addition, the two packages follow a similar API.

Dependencies

In most cases you will want to follow the requirements defined in the requirements/*.txt files in the package.

Base dependencies

scipy
numpy
future
scikit-learn
liac-arff # for loading ARFF files
requests # for dataset module
networkx # for networkX base community detection clusterers
python-louvain # for networkX base community detection clusterers
keras

GPL-incurring dependencies for two clusterers

python-igraph # for igraph library based clusterers
python-graphtool # for graphtool base clusterers

Note: Installing graphtool is complicated; please see the graphtool install instructions.

Installation

To install scikit-multilearn, simply type the following command:

$ pip install scikit-multilearn

This will install the latest release from the Python package index. If you wish to install the bleeding-edge version, then clone this repository and run setup.py:

$ git clone https://github.com/scikit-multilearn/scikit-multilearn.git
$ cd scikit-multilearn
$ python setup.py install

Basic Usage

Before proceeding to classification, make sure you have a dataset with the following matrices:

  • X_train, X_test: training and test feature matrices of size (n_samples, n_features)
  • y_train, y_test: training and test label matrices of size (n_samples, n_labels)

Suppose we want to apply a problem-transformation method called Binary Relevance, which treats each label as a separate single-label classification problem, with a support-vector machine (SVM) as the base classifier. We simply perform the following steps:

# Import BinaryRelevance from skmultilearn
from skmultilearn.problem_transform import BinaryRelevance

# Import SVC classifier from sklearn
from sklearn.svm import SVC

# Setup the classifier
classifier = BinaryRelevance(classifier=SVC(), require_dense=[False,True])

# Train
classifier.fit(X_train, y_train)

# Predict
y_pred = classifier.predict(X_test)
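
The prediction comes back as a sparse matrix, which scikit-learn's metrics accept directly. A minimal follow-up sketch (assuming the X/y matrices described above; accuracy_score on multi-label data computes subset accuracy, the strictest measure):

# Evaluate: a sample counts as correct only if *all* of its labels match
import sklearn.metrics as metrics

print(metrics.accuracy_score(y_test, y_pred))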

More examples and use-cases can be seen in the documentation. For using the MEKA wrapper, check this link.

Contributing

This project is open for contributions. Here are some of the ways for you to contribute:

  • Bug reports and fixes
  • Feature requests
  • Use-case demonstrations
  • Documentation updates

In case you want to implement your own multi-label classifier, please read our Developer's Guide to help you integrate your implementation in our API.

To make a contribution, just fork this repository, push the changes in your fork, open up an issue, and make a Pull Request!

We're also available on Slack! Just join our Slack group.

Cite

If you used scikit-multilearn in your research or project, please cite our work:

@ARTICLE{2017arXiv170201460S,
   author = {{Szyma{\'n}ski}, P. and {Kajdanowicz}, T.},
   title = "{A scikit-based Python environment for performing multi-label classification}",
   journal = {ArXiv e-prints},
   archivePrefix = "arXiv",
   eprint = {1702.01460},
   year = 2017,
   month = feb
}


Download Details:

Author: Scikit-multilearn
Source Code: https://github.com/scikit-multilearn/scikit-multilearn 
License: BSD-2-Clause license


Identifying the Unknown With Clustering Metrics

Clustering in machine learning has a variety of applications, but how do you know which algorithm is best suited to your data? Here’s how to amplify your data insights with comparison metrics, including the F-measure.

Clustering is an unsupervised machine learning method to divide given data into groups based solely on the features of each sample. Sorting data into clusters can help identify unknown similarities between samples or reveal outliers in the data set. In the real world, clustering has significance across diverse fields from marketing to biology: Clustering applications include market segmentation, social network analysis, and diagnostic medical imaging.

Because this process is unsupervised, multiple clustering results can form around different features. For example, imagine you have a data set composed of various images of red trousers, black trousers, red shirts, and black shirts. One algorithm might find clusters based on clothing shape, while another might create groups based on color.

When analyzing a data set, we need a way to accurately measure the performance of different clustering algorithms; we may want to contrast the solutions of two algorithms, or see how close a clustering result is to an expected solution. In this article, we will explore some of the metrics that can be used for comparing different clustering results obtained from the same data.

Understanding Clustering: A Brief Example

Let’s define an example data set that we will use to explain various clustering metric concepts and examine what kinds of clusters it might produce.

First, a few common notations and terms:

  • D: the data set
  • A, B: two clusters that are subsets of our data set
  • C: the ground truth clustering of D that we will compare another clustering to
    • Clustering C has K clusters, C = {C_1, …, C_K}
  • C′: a second clustering of D
    • Clustering C′ has K′ clusters, C′ = {C′_1, …, C′_K′}

Clustering results can vary based not only on sorting features but also on the total number of clusters. The result depends on the algorithm, its sensitivity to small perturbations, the model's parameters, and the data's features. Using our previously mentioned data set of black and red trousers and shirts, there are a variety of clustering results that might be produced from different algorithms.

To distinguish between a general clustering C and our example clusterings, we will use a lowercase c to describe our example clusterings:

  • c, with clusters based on shape: c = {c_1, c_2}, where c_1 represents trousers and c_2 represents shirts
  • c′, with clusters based on color: c′ = {c′_1, c′_2}, where c′_1 represents red clothes and c′_2 represents black clothes
  • c″, with clusters based on shape and color: c″ = {c″_1, c″_2, c″_3, c″_4}, where c″_1 represents red trousers, c″_2 represents black trousers, c″_3 represents red shirts, and c″_4 represents black shirts

Additional clusterings might include more than four clusters based on different features, such as whether a shirt is sleeveless or sleeved.

As seen in our example, a clustering method divides all the samples in a data set into non-empty disjoint subsets. In clustering c, no image belongs to both the trouser subset and the shirt subset: c_1 ∩ c_2 = ∅. This concept can be extended: no two subsets of any clustering share a sample.

An Overview of Clustering Comparison Metrics

Most criteria for comparing clusterings can be described using the confusion matrix of the pair C, C′. The confusion matrix is a K × K′ matrix whose kk′th element (the element in the kth row and k′th column) is the number of samples in the intersection of clusters C_k of C and C′_k′ of C′:

n_kk′ = |C_k ∩ C′_k′|

We’ll break this down using our simplified black and red trousers and shirts example, assuming that data set D has 100 red trousers, 200 black trousers, 200 red shirts, and 300 black shirts. Let’s examine the confusion matrix of c and c″:

 

     | c″_1 | c″_2 | c″_3 | c″_4
c_1  |  100 |  200 |    0 |    0
c_2  |    0 |    0 |  200 |  300

For example, the element in the second row and third column is n_23 = |c_2 ∩ c″_3| = 200.

 

Since K = 2 and K″ = 4, this is a 2 × 4 matrix. Let’s choose k = 2 and k″ = 3. We see that element n_23 = 200. This means that the intersection of c_2 (shirts) and c″_3 (red shirts) is 200, which is correct, since c_2 ∩ c″_3 is simply the set of red shirts.
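
As a quick cross-check, here is a minimal Python sketch (assuming numpy and scikit-learn are installed) that reproduces this confusion matrix from raw label arrays; contingency_matrix is scikit-learn's name for this structure:

import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# Encode the example data set: 100 red trousers, 200 black trousers,
# 200 red shirts, 300 black shirts.
c = np.repeat([1, 1, 2, 2], [100, 200, 200, 300])    # shape clusters: trousers, shirts
c2 = np.repeat([1, 2, 3, 4], [100, 200, 200, 300])   # shape-and-color clusters

print(contingency_matrix(c, c2))
# [[100 200   0   0]
#  [  0   0 200 300]]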

Clustering metrics can be broadly categorized into three groups based on the underlying cluster comparison method:

 

  • Pair counting: Rand index, adjusted Rand index
  • Information theory: normalized mutual information, variation of information
  • Set overlap: maximum matching measure, F-measure

 

In this article, we only touch on a few of many metrics available, but our examples will serve to help define the three clustering metric groups.

Pair-counting

Pair-counting requires examining all pairs of samples, then counting the pairs where the clusterings agree and disagree. Each pair of samples can belong to one of four sets, where the set element counts (N_ij) are obtained from the confusion matrix:

  • S_11, with N_11 elements: the pair's elements are in the same cluster under both C and C′
    • A pair of two red shirts would fall under S_11 when comparing c and c″
  • S_00, with N_00 elements: the pair's elements are in different clusters under both C and C′
    • A pair of a red shirt and black trousers would fall under S_00 when comparing c and c″
  • S_10, with N_10 elements: the pair's elements are in the same cluster in C and different clusters in C′
    • A pair of a red shirt and a black shirt would fall under S_10 when comparing c and c″
  • S_01, with N_01 elements: the pair's elements are in different clusters in C and the same cluster in C′
    • S_01 has no elements (N_01 = 0) when comparing c and c″

The Rand index is defined as (N_00 + N_11) / (n(n−1)/2), where n represents the number of samples; it can also be read as (number of similarly treated pairs) / (total number of pairs). Although theoretically its value ranges between 0 and 1, its range is often much narrower in practice. A higher value means more similarity between the clusterings. (A Rand index of 1 would represent a perfect match where two clusterings have identical clusters.)

One limitation of the Rand index is its behavior when the number of clusters increases to approach the number of elements; in this case, it converges toward 1, creating challenges in accurately measuring clustering similarity. Several improved or modified versions of the Rand index have been introduced to address this issue. One variation is the adjusted Rand index; however, it assumes that two clusterings are drawn randomly with a fixed number of clusters and cluster elements.
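
Both variants ship with scikit-learn, so we can verify them on the label arrays from the confusion-matrix sketch above (a hedged example; rand_score requires scikit-learn 0.24 or newer):

import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score

c = np.repeat([1, 1, 2, 2], [100, 200, 200, 300])    # shape clusters
c2 = np.repeat([1, 2, 3, 4], [100, 200, 200, 300])   # shape-and-color clusters

print(rand_score(c, c2))           # fraction of similarly treated pairs
print(adjusted_rand_score(c, c2))  # corrected for chance agreement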

Information Theory

These metrics are based on generic notions of information theory. We will discuss two of them: entropy and mutual information (MI).

Entropy describes how much information there is in a clustering. If the entropy associated with a clustering is 0, then there is no uncertainty about the cluster of a randomly picked sample, which is true when there is only one cluster.

MI describes how much information one clustering gives about the other. MI can indicate how much knowing the cluster of a sample in C reduces the uncertainty about the cluster of the same sample in C′.

Normalized mutual information is MI that is normalized by the geometric or arithmetic mean of the entropies of clusterings. Standard MI is not bound by a constant value, so normalized mutual information provides a more interpretable clustering metric.

Another popular metric in this category is variation of information (VI), which depends on both the entropy and MI of clusterings. Let H(C) be the entropy of a clustering and I(C, C′) be the MI between two clusterings. The VI between two clusterings is defined as VI(C, C′) = H(C) + H(C′) − 2·I(C, C′). A VI of 0 represents a perfect match between two clusterings.
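
To make these definitions concrete, here is a short sketch: normalized_mutual_info_score comes straight from scikit-learn, while VI is assembled from entropies and MI exactly as in the formula above (everything computed in nats):

import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

def variation_of_information(labels_a, labels_b):
    h_a = entropy(np.bincount(labels_a))          # H(C)
    h_b = entropy(np.bincount(labels_b))          # H(C')
    mi = mutual_info_score(labels_a, labels_b)    # I(C, C')
    return h_a + h_b - 2 * mi                     # VI = H(C) + H(C') - 2I(C, C')

c = np.repeat([1, 1, 2, 2], [100, 200, 200, 300])
c2 = np.repeat([1, 2, 3, 4], [100, 200, 200, 300])
print(normalized_mutual_info_score(c, c2))
print(variation_of_information(c, c2))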

Set Overlap

Set overlap metrics involve determining the best match for clusters in C with clusters in C′, based on the maximum overlap between the clusters. For all metrics in this category, a score of 1 means the clusterings are identical.

The maximum matching measure scans the confusion matrix in decreasing order and matches the largest entry of the confusion matrix first. It then removes the matched clusters and repeats the process sequentially until the clusters are exhausted.

The F-measure is another set overlap metric. Unlike the maximum matching measure, the F-measure is frequently used to compare a clustering to an optimal solution, instead of comparing two clusterings.

Applying Clustering Metrics With F-measure

Because of the F-measure’s common use in machine learning models and important applications such as search engines, we’ll explore the F-measure in more detail with an example.

F-measure Definition

Let’s assume that C is our ground truth, or optimal, solution. For the kth cluster in C, where k ∈ [1, K], we’ll calculate an individual F-measure with every cluster in clustering result C′. This individual F-measure indicates how well the cluster C′_k′ describes the cluster C_k, and can be determined through the precision and recall (two model evaluation metrics) for these clusters. Let’s define I_kk′ as the intersection of elements in C’s kth cluster and C′’s k′th cluster, and |C_k| as the number of elements in the kth cluster.

Precision: p = I_kk′ / |C′_k′|

Recall: r = I_kk′ / |C_k|

Then, the individual F-measure of the kth and k′th cluster can be calculated as the harmonic mean of the precision and recall for these clusters:

F_kk′ = 2rp / (r + p) = 2·I_kk′ / (|C_k| + |C′_k′|)

Now, to compare C and C′, let’s look at the overall F-measure. First, we will create a matrix similar to a contingency table whose values are the individual F-measures of the clusters. Let’s assume that we’ve mapped C’s clusters as rows of a table and C′’s clusters as columns, with table values corresponding to individual F-measures. Identify the cluster pair with the maximum individual F-measure, and remove the row and column corresponding to these clusters. Repeat this until the clusters are exhausted. Finally, we can define the overall F-measure:

F(C, C′) = (1/n) · Σ_{i=1..K} n_i · max_{j ∈ {1, …, K′}} F(C_i, C′_j), where n_i = |C_i|

As you can see, the overall F-measure is the weighted sum of our maximum individual F-measures for the clusters.
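
Here is a minimal sketch of that computation. It implements the closing formula directly (taking each ground truth cluster's best match, weighted by its size n_i = |C_i|), with the intersections I_kk′ read off the contingency matrix:

import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def overall_f_measure(labels_true, labels_pred):
    I = contingency_matrix(labels_true, labels_pred)   # intersections I_kk'
    size_true = I.sum(axis=1, keepdims=True)           # |C_k|, the weights n_i
    size_pred = I.sum(axis=0, keepdims=True)           # |C'_k'|
    F = 2 * I / (size_true + size_pred)                # individual F-measures F_kk'
    return float((size_true.ravel() * F.max(axis=1)).sum() / I.sum())

c = np.repeat([1, 1, 2, 2], [100, 200, 200, 300])
c2 = np.repeat([1, 2, 3, 4], [100, 200, 200, 300])
print(overall_f_measure(c, c2))   # 1.0 only for identical clusterings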

Data Setup and Expected Results

Any Python notebook suitable for machine learning, such as a Jupyter notebook, will work as our environment. Before we start, you may want to examine my GitHub repository’s README, extended readme_help_example.ipynb example file, and requirements.txt file (the required libraries).

We’ll use the sample data in the GitHub repository, which is made up of news articles. The data is arranged with information including category, headline, date, and short_description:

      | category      | headline                                          | date       | short_description
49999 | THE WORLDPOST | Drug War Deaths Climb To 1,800 In Philippines     | 2016-08-22 | In the last seven weeks alone.
49966 | TASTE         | Yes, You Can Make Real Cuban-Style Coffee At Home | 2016-08-22 | It’s all about the crema.
49965 | STYLE         | KFC’s Fried Chicken-Scented Sunscreen Will Kee…   | 2016-08-22 | For when you want to make yourself smell finge…
49964 | POLITICS      | HUFFPOLLSTER: Democrats Have A Solid Chance Of…   | 2016-08-22 | HuffPost’s poll-based model indicates Senate R…

We can use pandas to read, analyze, and manipulate the data. We’ll sort the data by date and select a small sample (10,000 news headlines) for our demo since the full data set is large:

import pandas as pd
df = pd.read_json("./sample_data/example_news_data.json", lines=True)
df.sort_values(by='date', inplace=True)
df = df[:10000]
len(df['category'].unique())

Upon running, you should see the notebook output the result 30, since there are 30 categories in this data sample. You may also run df.head(4) to see how the data is stored. (It should match the table displayed in this section.)

Optimizing Clustering Features

Before applying the clustering, we should first preprocess the text to reduce redundant features of our model, including:

  • Updating the text to have a uniform case.
  • Removing numeric or special characters.
  • Performing lemmatization.
  • Removing stop words.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
nltk.download('stopwords')
stop_words = stopwords.words('english')
nltk.download('wordnet')
nltk.download('omw-1.4')

def preprocess(text: str) -> str:
    text = text.lower()
    text = re.sub(r'[^a-z]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.split(" ")
    words = [wordnet_lemmatizer.lemmatize(word, 'v') for word in text if word not in stop_words]    
    return " ".join(words)

df['processed_input'] = df['headline'].apply(preprocess)

The resulting preprocessed headlines are shown as processed_input, which you can observe by again running df.head(4):

      | category      | headline                                          | date       | short_description                               | processed_input
49999 | THE WORLDPOST | Drug War Deaths Climb To 1,800 In Philippines     | 2016-08-22 | In the last seven weeks alone.                  | drug war deaths climb philippines
49966 | TASTE         | Yes, You Can Make Real Cuban-Style Coffee At Home | 2016-08-22 | It’s all about the crema.                       | yes make real cuban style coffee home
49965 | STYLE         | KFC’s Fried Chicken-Scented Sunscreen Will Kee…   | 2016-08-22 | For when you want to make yourself smell finge… | kfc fry chicken scent sunscreen keep skin get …
49964 | POLITICS      | HUFFPOLLSTER: Democrats Have A Solid Chance Of…   | 2016-08-22 | HuffPost’s poll-based model indicates Senate R… | huffpollster democrats solid chance retake senate

Now, we need to represent each headline as a numeric vector to be able to apply any machine learning model to it. There are various feature extraction techniques to achieve this; we will be using TF-IDF (term frequency-inverse document frequency). This technique reduces the effect of words occurring with high frequency in documents (in our example, news headlines), as these clearly should not be the deciding features in clustering or classifying them.

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=300, tokenizer=lambda x: x.split(' '))
tfidf_mat = vectorizer.fit_transform(df['processed_input'])
X = tfidf_mat.todense()
X[X==0]=0.00001

Next, we will try out our first clustering method, agglomerative clustering, on these feature vectors.

Clustering Method 1: Agglomerative Clustering

Considering the given news categories as the optimal solution, let’s compare these results to those of agglomerative clustering (with the desired number of clusters as 30 since there are 30 categories in the data set):

clusters_agg = AgglomerativeClustering(n_clusters=30).fit_predict(X)
df['class_prd'] = clusters_agg.astype(int) 

We will identify the resulting clusters by integer labels; headlines belonging to the same cluster are assigned the same integer label. The cluster_measure function from the compare_clusters module of our GitHub repository returns the aggregate F-measure and number of perfectly matching clusters so we can see how accurate our clustering result was:

from clustering.compare_clusters import cluster_measure
# `cluster_measure` requires the given text categories to be in the column `text_category`
df['text_category'] = df['category']
res_df, fmeasure_aggregate, true_matches = cluster_measure(df)
fmeasure_aggregate, len(true_matches)  

# Outputs: (0.19858339749319176, 0)

On comparing these cluster results with the optimal solution, we get a low F-measure of 0.198 and 0 clusters matching the actual class groups, indicating that the agglomerative clusters do not align with the headline categories we chose. Let’s check out a cluster in the result to see what it looks like.

df[df['class_prd'] == 0]['category'].value_counts()

Upon examining the results, we see that this cluster contains headlines from all the categories:

POLITICS          1268
ENTERTAINMENT      712
THE WORLDPOST      373
HEALTHY LIVING     272
QUEER VOICES       251
PARENTS            212
BLACK VOICES       211
...
FIFTY               24
EDUCATION           23
COLLEGE             14
ARTS                13

So, our low F-measure makes sense considering that our result’s clusters do not align with the optimal solution. However, it is important to recall that the given category classification we chose reflects only one possible division of the data set. A low F-measure here doesn’t imply that the clustering result is wrong, but that the clustering result didn’t match our desired method of partitioning the data.

Clustering Method 2: K-means

Let’s try another popular clustering algorithm on the same data set: k-means clustering. We will create a new dataframe and use the cluster_measure function again:

kmeans = KMeans(n_clusters=30, random_state=0).fit(X)
df2 = df.copy()
df2['class_prd'] = kmeans.predict(X).astype(int)
res_df, fmeasure_aggregate, true_matches = cluster_measure(df2)
fmeasure_aggregate, len(true_matches)  

# Outputs: (0.18332960871141976, 0)

Like the agglomerative clustering result, our k-means clustering result has formed clusters that are dissimilar to our given categories: It has an F-measure of 0.18 when compared to the optimal solution. Since the two clustering results have similar F-measures, it would be interesting to compare them to each other. We already have the clusterings, so we just need to calculate the F-measure. First, we’ll bring both results into one column, with class_gt having the agglomerative clustering output, and class_prd having the k-means clustering output:

df1 = df2.copy()
df1['class_gt'] = df['class_prd']   
res_df, fmeasure_aggregate, true_matches = cluster_measure(df1, gt_column='class_gt')
fmeasure_aggregate, len(true_matches)

# Outputs: (0.4030316435020922, 0)

With a higher F-measure of 0.4, we can observe that the clusterings of the two algorithms are more similar to each other than they are to the optimal solution.

Discover More About Enhanced Clustering Results

An understanding of the available clustering comparison metrics will expand your machine learning model analysis. We have seen the F-measure clustering metric in action and have given you the basics you need to apply these learnings to your next clustering result. To learn even more, here are my top picks for further reading:



The Toptal Engineering Blog extends its gratitude to Luis Bronchal for reviewing the code samples presented in this article.

Original article source at: https://www.toptal.com/


Cluster: Easy Map annotation Clustering

Cluster

Cluster is an easy map annotation clustering library. This repository uses an efficient method (QuadTree) to aggregate pins into a cluster.

Demo Screenshots

Features

  •  Adding/Removing Annotations
  •  Clustering Annotations
  •  Multiple Managers
  •  Dynamic Cluster Disabling
  •  Custom Cell Size
  •  Custom Annotation Views
  •  Animation Support
  •  Documentation

Requirements

  • iOS 8.0+
  • Xcode 9.0+
  • Swift 5 (Cluster 3.x), Swift 4 (Cluster 2.x), Swift 3 (Cluster 1.x)

Demo

The Example is a great place to get started. It demonstrates how to:

  • integrate the library
  • add/remove annotations
  • reload annotations
  • configure the annotation view
  • configure the manager

Demo GIF

Demo Video

$ pod try Cluster

Installation

Cluster is available via CocoaPods and Carthage.

CocoaPods

To install Cluster with CocoaPods, add this to your Podfile:

pod "Cluster"

Carthage

To install Cluster with Carthage, add this to your Cartfile:

github "efremidze/Cluster"

Usage

The Basics

The ClusterManager class generates, manages and displays annotation clusters.

let clusterManager = ClusterManager()

Adding an Annotation

Create an object that conforms to the MKAnnotation protocol, or extend an existing one. Next, add the annotation object to an instance of ClusterManager with add(annotation:).

let annotation = Annotation(coordinate: CLLocationCoordinate2D(latitude: 21.283921, longitude: -157.831661))
manager.add(annotation)

Configuring the Annotation View

Implement the map view’s mapView(_:viewFor:) delegate method to configure the annotation view. Return an instance of MKAnnotationView to display as a visual representation of the annotations.

To display clusters, return an instance of ClusterAnnotationView.

extension ViewController: MKMapViewDelegate {
    func mapView(_ mapView: MKMapView, viewFor annotation: MKAnnotation) -> MKAnnotationView? {
        if let annotation = annotation as? ClusterAnnotation {
            return CountClusterAnnotationView(annotation: annotation, reuseIdentifier: "cluster")
        } else {
            return MKPinAnnotationView(annotation: annotation, reuseIdentifier: "pin")
        }
    }
}

For performance reasons, you should generally reuse MKAnnotationView objects in your map views. See the Example to learn more.

Customizing the Appearance

The ClusterAnnotationView class exposes a countLabel property. You can subclass ClusterAnnotationView to provide custom behavior as needed. Here's an example of subclassing the ClusterAnnotationView and customizing the layer borderColor.

class CountClusterAnnotationView: ClusterAnnotationView {
    override func configure() {
        super.configure()

        self.layer.cornerRadius = self.frame.width / 2
        self.layer.masksToBounds = true
        self.layer.borderColor = UIColor.white.cgColor
        self.layer.borderWidth = 1.5
    }
}

See the AnnotationView to learn more.

Annotation Styling

You can customize the appearance of the StyledClusterAnnotationView by setting the style property of the annotation.

let annotation = Annotation(coordinate: CLLocationCoordinate2D(latitude: 21.283921, longitude: -157.831661))
annotation.style = .color(color, radius: 25)
manager.add(annotation)

Several styles are available in the ClusterAnnotationStyle enum:

  • color(UIColor, radius: CGFloat) - Displays the annotations as a circle.
  • image(UIImage?) - Displays the annotation as an image.

Once you have added the annotation, you need to return an instance of the StyledClusterAnnotationView to display the styled annotation.

func mapView(_ mapView: MKMapView, viewFor annotation: MKAnnotation) -> MKAnnotationView? {
    if let annotation = annotation as? ClusterAnnotation {
        return StyledClusterAnnotationView(annotation: annotation, reuseIdentifier: identifier, style: style)
    } else {
        return nil
    }
}

Removing Annotations

To remove annotations, call remove(annotation:). However, the annotations will still be displayed until you call reload().

manager.remove(annotation)

If shouldRemoveInvisibleAnnotations is set to false, annotations that have been removed may still appear on the map until reload() is called on the visible region.

Reloading Annotations

Implement the map view’s mapView(_:regionDidChangeAnimated:) delegate method to reload the ClusterManager when the region changes.

func mapView(_ mapView: MKMapView, regionDidChangeAnimated animated: Bool) {
    clusterManager.reload(mapView: mapView) { finished in
        // handle completion
    }
}

You should call reload() anytime you add or remove annotations.

Configuring the Manager

The ClusterManager class exposes several properties to configure clustering:

var zoomLevel: Double // The current zoom level of the visible map region.
var maxZoomLevel: Double // The maximum zoom level before disabling clustering.
var minCountForClustering: Int // The minimum number of annotations for a cluster. The default is `2`.
var shouldRemoveInvisibleAnnotations: Bool // Whether to remove invisible annotations. The default is `true`.
var shouldDistributeAnnotationsOnSameCoordinate: Bool // Whether to arrange annotations in a circle if they have the same coordinate. The default is `true`.
var distanceFromContestedLocation: Double // The distance in meters from contested location when the annotations have the same coordinate. The default is `3`.
var clusterPosition: ClusterPosition // The position of the cluster annotation. The default is `.nearCenter`.

ClusterManagerDelegate

The ClusterManagerDelegate protocol provides a number of functions to manage clustering and configure cells.

// The size of each cell on the grid at a given zoom level.
func cellSize(for zoomLevel: Double) -> Double? { ... }

// Whether to cluster the given annotation.
func shouldClusterAnnotation(_ annotation: MKAnnotation) -> Bool { ... }

Communication

  • If you found a bug, open an issue.
  • If you have a feature request, open an issue.
  • If you want to contribute, submit a pull request.


Download Details:

Author: Efremidze
Source Code: https://github.com/efremidze/Cluster 
License: MIT license


LMCLUS.jl: The Julia Package for Linear Manifold Clustering

LMCLUS

A Julia package for linear manifold clustering.

Installation

Prior to Julia v0.7.0

Pkg.clone("https://github.com/wildart/LMCLUS.jl.git")

For Julia v0.7.0/1.0.0

pkg> add https://github.com/wildart/LMCLUS.jl.git#0.4.0

For Julia 1.1+, add the BoffinStuff registry in the package manager before installing the package.

pkg> registry add https://github.com/wildart/BoffinStuff.git
pkg> add LMCLUS

Julia Compatibility

Julia Version | LMCLUS version
v0.3.*        | v0.0.2
v0.4.*        | v0.1.2
v0.5.*        | v0.2.0
v0.6.*        | v0.3.0
≥v0.7.*       | v0.4.0
≥v1.1.*       | ≥v0.4.1


Download Details:

Author: Wildart
Source Code: https://github.com/wildart/LMCLUS.jl 
License: MIT license


Clustersql: A Clustering SQL Driver in Go

clustersql, forked to allow multiple clustered datasources.

Go Clustering SQL Driver - A clustering, implementation-agnostic "meta"-driver for any backend implementing "database/sql/driver".

It does (latency-based) load-balancing and error-recovery over all registered nodes.

It is assumed that database-state is transparently replicated over all nodes by some database-side clustering solution. This driver ONLY handles the client side of such a cluster.

This package simply multiplexes the driver.Open() function of sql/driver to every attached node. The function is called on each node, and the first successfully opened connection is returned. (Any connections that open subsequently will be closed.) If opening does not succeed for any node, the latest error is returned. Any other errors are masked by default; however, the latest error for any attached node remains exposed through expvar, along with some basic counters and timestamps.

To make use of this kind of clustering, use this package with any backend driver implementing "database/sql/driver" like so:

import "database/sql"
import "github.com/go-sql-driver/mysql"
import "github.com/EnumApps/clustersql"

const (
    WriteDriver = "write_conn"
    ReadDriver  = "read_conn"
    SessDriver  = "sess_conn"
)

There is currently no way around instantiating the backend driver explicitly:

mysqlDriver := mysql.MySQLDriver{}

You can perform backend-driver specific settings such as

err := mysql.SetLogger(mylogger)

Create a new clustering driver with the backend driver

readerDriver := clustersql.NewDriver(mysqlDriver, ReadDriver)

Add nodes, including driver-specific name format, in this case Go-MySQL DSN. Here, we add three nodes belonging to a galera cluster

readerDriver.AddNode("galera1", "reader:password@tcp(dbhost1:3306)/db")
readerDriver.AddNode("galera2", "reader:password@tcp(dbhost2:3306)/db")
readerDriver.AddNode("galera3", "reader:password@tcp(dbhost3:3306)/db")

Make the clusterDriver available to the go sql interface under an arbitrary name

sql.Register(ReadDriver, readerDriver)

Create a new clustering driver with the backend driver

sessionDriver := clustersql.NewDriver(mysqlDriver, SessDriver)

Add nodes, including driver-specific name format, in this case Go-MySQL DSN. Here, we add three nodes belonging to a galera cluster

sessionDriver.AddNode("galera1", "sess_user:password@tcp(dbhost1:3306)/sessdb")
sessionDriver.AddNode("galera2", "sess_user:password@tcp(dbhost2:3306)/sessdb")
sessionDriver.AddNode("galera3", "sess_user:password@tcp(dbhost3:3306)/sessdb")

Make the clusterDriver available to the go sql interface under an arbitrary name

sql.Register(SessDriver, sessionDriver)

Open the registered clusterDrivers with an arbitrary DSN string (not used). Note that a writer driver must have been created and registered under WriteDriver in the same way as the read and session drivers above:

db, err := sql.Open(WriteDriver, "")

readonly_db, err := sql.Open(ReadDriver, "")

session_db, err := sql.Open(SessDriver, "")

Continue to use the sql interface as documented at http://golang.org/pkg/database/sql/

Before using this in production, you should configure your cluster details in config.toml and run

go test -v .

Note, however, that non-failure of the above is no guarantee of a correctly set-up cluster.

Download Details:

Author: EnumApps
Source Code: https://github.com/EnumApps/clustersql 
License: BSD-2-Clause license


QuickShiftClustering.jl: Fast Hierarchical Medoid Clustering

QuickShiftClustering 

QuickShift [1] is a fast method for hierarchical clustering. It first constructs the clustering tree, then allows you to quickly cut links in the tree which exceed a specified length. This second step can be performed for different link lengths without having to re-run the clustering itself. Care has been taken to provide a high-performance implementation.

[1] Quick Shift and Kernel Methods for Mode Seeking

Functions

a = quickshift(data)
a = quickshift(data, sigma)
# cluster the ndim x nsamples matrix data.
# sigma: Gaussian kernel width, see paper

labels = quickshiftlabels(a::QuickShift)
labels = quickshiftlabels(a::QuickShift, maxlinklength)
# cut links in the tree with length > maxlinklength
# return cluster labels for data points.

quickshiftplot(a, data, labels)
# plot data points and hierarchical links
# needs PyPlot installed, only for 2D

Performance

data 2 x N | Runtime quickshift | Runtime quickshiftlabels
1000       | 0.06 sec           | 0.0002 sec
10000      | 0.27 sec           | 0.004 sec
100000     | 9.67 sec           | 0.04 sec

For larger numbers of data points, you might want to use KShiftsClustering.jl to cluster the N data points to, e.g., 10,000 cluster centers, and then perform QuickShift on those.

Comparison with kmedoids for 20,000 points:

using Clustering, QuickShiftClustering, FunctionalDataUtils

data = rand(2,20000)
@time a = kmedoids(1-exp(-distance(data,data)*10),10)
#  =>  elapsed time: 56.666481916 seconds (41126243444 bytes allocated, 15.31% gc time)

@time labels = quickshiftlabels(quickshift(data))
#  =>  elapsed time: 1.187448525 seconds (277816624 bytes allocated, 28.79% gc time)

Example

using FunctionalData
data = @p map unstack(1:10) (x->10*randn(2,1).+randn(2,100)) | flatten

using QuickShiftClustering
a = quickshift(data)           
labels = quickshiftlabels(a)   

quickshiftplot(a, data, labels)

Author: rened
Source Code: https://github.com/rened/QuickShiftClustering.jl 
License: View license


Clustering.jl: A Julia Package for Data Clustering

Clustering.jl

Methods for data clustering and evaluation of clustering quality. 

Installation

Pkg.add("Clustering")

Features

Clustering Algorithms

  • K-means
  • K-medoids
  • Affinity Propagation
  • Density-based spatial clustering of applications with noise (DBSCAN)
  • Markov Clustering Algorithm (MCL)
  • Fuzzy C-Means Clustering
  • Hierarchical Clustering
    • Single Linkage
    • Average Linkage
    • Complete Linkage
    • Ward's Linkage

Clustering Validation

  • Silhouettes
  • Variation of Information
  • Rand index
  • V-Measure

See Also

Julia packages providing other clustering methods:

Documentation:

Author: JuliaStats
Source Code: https://github.com/JuliaStats/Clustering.jl 
License: View license


Bottleneck: Rate Limiter That Makes Throttling Easy

bottleneck

Bottleneck is a lightweight and zero-dependency Task Scheduler and Rate Limiter for Node.js and the browser.

Bottleneck is an easy solution, as it adds very little complexity to your code. It is battle-hardened, reliable, production-ready, and used at a large scale in private companies and open source software.

It supports Clustering: it can rate limit jobs across multiple Node.js instances. It uses Redis and strictly atomic operations to stay reliable in the presence of unreliable clients and networks. It also supports Redis Cluster and Redis Sentinel.

Upgrading from version 1?

Install

npm install --save bottleneck
import Bottleneck from "bottleneck";

// Note: To support older browsers and Node <6.0, you must import the ES5 bundle instead.
var Bottleneck = require("bottleneck/es5");

Quick Start

Step 1 of 3

Most APIs have a rate limit. For example, to execute 3 requests per second:

const limiter = new Bottleneck({
  minTime: 333
});

If there's a chance some requests might take longer than 333ms and you want to prevent more than 1 request from running at a time, add maxConcurrent: 1:

const limiter = new Bottleneck({
  maxConcurrent: 1,
  minTime: 333
});

minTime and maxConcurrent are enough for the majority of use cases. They work well together to ensure a smooth rate of requests. If your use case requires executing requests in bursts or every time a quota resets, look into Reservoir Intervals.

Step 2 of 3

➤ Using promises?

Instead of this:

myFunction(arg1, arg2)
.then((result) => {
  /* handle result */
});

Do this:

limiter.schedule(() => myFunction(arg1, arg2))
.then((result) => {
  /* handle result */
});

Or this:

const wrapped = limiter.wrap(myFunction);

wrapped(arg1, arg2)
.then((result) => {
  /* handle result */
});

➤ Using async/await?

Instead of this:

const result = await myFunction(arg1, arg2);

Do this:

const result = await limiter.schedule(() => myFunction(arg1, arg2));

Or this:

const wrapped = limiter.wrap(myFunction);

const result = await wrapped(arg1, arg2);

➤ Using callbacks?

Instead of this:

someAsyncCall(arg1, arg2, callback);

Do this:

limiter.submit(someAsyncCall, arg1, arg2, callback);

Step 3 of 3

Remember...

Bottleneck builds a queue of jobs and executes them as soon as possible. By default, the jobs will be executed in the order they were received.

Read the 'Gotchas' and you're good to go. Or keep reading to learn about all the fine tuning and advanced options available. If your rate limits need to be enforced across a cluster of computers, read the Clustering docs.

Need help debugging your application?

Instead of throttling, maybe you want to batch up requests into fewer calls?

Gotchas & Common Mistakes

  • Make sure the function you pass to schedule() or wrap() only returns once all the work it does has completed.

Instead of this:

limiter.schedule(() => {
  tasksArray.forEach(x => processTask(x));
  // BAD, we return before our processTask() functions are finished processing!
});

Do this:

limiter.schedule(() => {
  const allTasks = tasksArray.map(x => processTask(x));
  // GOOD, we wait until all tasks are done.
  return Promise.all(allTasks);
});
  • If you're passing an object's method as a job, you'll probably need to bind() the object:
// instead of this:
limiter.schedule(object.doSomething);
// do this:
limiter.schedule(object.doSomething.bind(object));
// or, wrap it in an arrow function instead:
limiter.schedule(() => object.doSomething());

Bottleneck requires Node 6+ to function. However, an ES5 build is included: var Bottleneck = require("bottleneck/es5");.

Make sure you're catching "error" events emitted by your limiters!

Consider setting a maxConcurrent value instead of leaving it null. This can help your application's performance, especially if you think the limiter's queue might become very long.

If you plan on using priorities, make sure to set a maxConcurrent value.

When using submit(), if a callback isn't necessary, you must pass null or an empty function instead. It will not work otherwise.

When using submit(), make sure all the jobs will eventually complete by calling their callback, or set an expiration. Even if you submitted your job with a null callback, it still needs to call its callback. This is particularly important if you are using a maxConcurrent value that isn't null (unlimited); otherwise those uncompleted jobs will clog up the limiter and no new jobs will be allowed to run. It's safe to call the callback more than once; subsequent calls are ignored.

Using tools like mockdate in your tests to change time in JavaScript will likely result in undefined behavior from Bottleneck.

Docs

Constructor

const limiter = new Bottleneck({/* options */});

Basic options:

Option | Default | Description
maxConcurrent | null (unlimited) | How many jobs can be executing at the same time. Consider setting a value instead of leaving it null; it can help your application's performance, especially if you think the limiter's queue might get very long.
minTime | 0 ms | How long to wait after launching a job before launching another one.
highWater | null (unlimited) | How long can the queue be? When the queue length exceeds that value, the selected strategy is executed to shed the load.
strategy | Bottleneck.strategy.LEAK | Which strategy to use when the queue gets longer than the high water mark. Read about strategies. Strategies are never executed if highWater is null.
penalty | 15 * minTime, or 5000 when minTime is 0 | The penalty value used by the BLOCK strategy.
reservoir | null (unlimited) | How many jobs can be executed before the limiter stops executing jobs. If reservoir reaches 0, no jobs will be executed until it is no longer 0. New jobs will still be queued up.
reservoirRefreshInterval | null (disabled) | Every reservoirRefreshInterval milliseconds, the reservoir value will be automatically updated to the value of reservoirRefreshAmount. The reservoirRefreshInterval value should be a multiple of 250 (5000 for Clustering).
reservoirRefreshAmount | null (disabled) | The value to set reservoir to when reservoirRefreshInterval is in use.
reservoirIncreaseInterval | null (disabled) | Every reservoirIncreaseInterval milliseconds, the reservoir value will be automatically incremented by reservoirIncreaseAmount. The reservoirIncreaseInterval value should be a multiple of 250 (5000 for Clustering).
reservoirIncreaseAmount | null (disabled) | The increment applied to reservoir when reservoirIncreaseInterval is in use.
reservoirIncreaseMaximum | null (disabled) | The maximum value that reservoir can reach when reservoirIncreaseInterval is in use.
Promise | Promise (built-in) | This lets you override the Promise library used by Bottleneck.

Reservoir Intervals

Reservoir Intervals let you execute requests in bursts, by automatically controlling the limiter's reservoir value. The reservoir is simply the number of jobs the limiter is allowed to execute. Once the value reaches 0, it stops starting new jobs.

There are 2 types of Reservoir Intervals: Refresh Intervals and Increase Intervals.

Refresh Interval

In this example, we throttle to 100 requests every 60 seconds:

const limiter = new Bottleneck({
  reservoir: 100, // initial value
  reservoirRefreshAmount: 100,
  reservoirRefreshInterval: 60 * 1000, // must be divisible by 250

  // also use maxConcurrent and/or minTime for safety
  maxConcurrent: 1,
  minTime: 333 // pick a value that makes sense for your use case
});

reservoir is a counter decremented every time a job is launched; we set its initial value to 100. Then, every reservoirRefreshInterval (60000 ms), reservoir is automatically updated to be equal to the reservoirRefreshAmount (100).

Increase Interval

In this example, we throttle jobs to meet the Shopify API Rate Limits. Users are allowed to send 40 requests initially, then every second grants 2 more requests up to a maximum of 40.

const limiter = new Bottleneck({
  reservoir: 40, // initial value
  reservoirIncreaseAmount: 2,
  reservoirIncreaseInterval: 1000, // must be divisible by 250
  reservoirIncreaseMaximum: 40,

  // also use maxConcurrent and/or minTime for safety
  maxConcurrent: 5,
  minTime: 250 // pick a value that makes sense for your use case
});

Warnings

Reservoir Intervals are an advanced feature, please take the time to read and understand the following warnings.

Reservoir Intervals are not a replacement for minTime and maxConcurrent. It's strongly recommended to also use minTime and/or maxConcurrent to spread out the load. For example, suppose a lot of jobs are queued up because the reservoir is 0. Every time the Refresh Interval is triggered, a number of jobs equal to reservoirRefreshAmount will automatically be launched, all at the same time! To prevent this flooding effect and keep your application running smoothly, use minTime and maxConcurrent to stagger the jobs.

The Reservoir Interval starts from the moment the limiter is created. Let's suppose we're using reservoirRefreshAmount: 5. If you happen to add 10 jobs just 1ms before the refresh is triggered, the first 5 will run immediately, then 1ms later it will refresh the reservoir value and that will make the last 5 also run right away. It will have run 10 jobs in just over 1ms no matter what your reservoir interval was!

Reservoir Intervals prevent a limiter from being garbage collected. Call limiter.disconnect() to clear the interval and allow the memory to be freed. However, it's not necessary to call .disconnect() to allow the Node.js process to exit.

submit()

Adds a job to the queue. This is the callback version of schedule().

limiter.submit(someAsyncCall, arg1, arg2, callback);

You can pass null instead of an empty function if there is no callback, but someAsyncCall still needs to call its callback to let the limiter know it has completed its work.

submit() can also accept advanced options.

schedule()

Adds a job to the queue. This is the Promise and async/await version of submit().

const fn = function(arg1, arg2) {
  return httpGet(arg1, arg2); // Here httpGet() returns a promise
};

limiter.schedule(fn, arg1, arg2)
.then((result) => {
  /* ... */
});

In other words, schedule() takes a function fn and a list of arguments. schedule() returns a promise that will be executed according to the rate limits.

schedule() can also accept advanced options.

Here's another example:

// suppose that `client.get(url)` returns a promise

const url = "https://wikipedia.org";

limiter.schedule(() => client.get(url))
.then(response => console.log(response.body));

wrap()

Takes a function that returns a promise. Returns a function identical to the original, but rate limited.

const wrapped = limiter.wrap(fn);

wrapped()
.then(function (result) {
  /* ... */
})
.catch(function (error) {
  // Bottleneck might need to fail the job even if the original function can never fail.
  // For example, your job is taking longer than the `expiration` time you've set.
});

Job Options

submit(), schedule(), and wrap() all accept advanced options.

// Submit
limiter.submit({/* options */}, someAsyncCall, arg1, arg2, callback);

// Schedule
limiter.schedule({/* options */}, fn, arg1, arg2);

// Wrap
const wrapped = limiter.wrap(fn);
wrapped.withOptions({/* options */}, arg1, arg2);

Option | Default | Description
priority | 5 | A priority between 0 and 9. A job with a priority of 4 will be queued ahead of a job with a priority of 5. Important: You must set a low maxConcurrent value for priorities to work, otherwise there is nothing to queue because jobs will be scheduled immediately!
weight | 1 | Must be an integer equal to or higher than 0. The weight is what increases the number of running jobs (up to maxConcurrent) and decreases the reservoir value.
expiration | null (unlimited) | The number of milliseconds a job is given to complete. Jobs that execute for longer than expiration ms will be failed with a BottleneckError.
id | <no-id> | You should give an ID to your jobs, it helps with debugging.

Strategies

A strategy is a simple algorithm that is executed every time adding a job would cause the number of queued jobs to exceed highWater. Strategies are never executed if highWater is null.

Bottleneck.strategy.LEAK

When adding a new job to a limiter, if the queue length reaches highWater, drop the oldest job with the lowest priority. This is useful when jobs that have been waiting for too long are not important anymore. If all the queued jobs are more important (based on their priority value) than the one being added, it will not be added.

Bottleneck.strategy.OVERFLOW_PRIORITY

Same as LEAK, except it will only drop jobs that are less important than the one being added. If all the queued jobs are as or more important than the new one, it will not be added.

Bottleneck.strategy.OVERFLOW

When adding a new job to a limiter, if the queue length reaches highWater, do not add the new job. This strategy totally ignores priority levels.

Bottleneck.strategy.BLOCK

When adding a new job to a limiter, if the queue length reaches highWater, the limiter falls into "blocked mode". All queued jobs are dropped and no new jobs will be accepted until the limiter unblocks. It will unblock after penalty milliseconds have passed without receiving a new job. penalty is equal to 15 * minTime (or 5000 if minTime is 0) by default. This strategy is ideal when bruteforce attacks are to be expected. This strategy totally ignores priority levels.

Jobs lifecycle

  1. Received. Your new job has been added to the limiter. Bottleneck needs to check whether it can be accepted into the queue.
  2. Queued. Bottleneck has accepted your job, but it cannot tell at what exact timestamp it will run yet, because that depends on previous jobs.
  3. Running. Your job is not in the queue anymore, it will be executed after a delay that was computed according to your minTime setting.
  4. Executing. Your job is executing its code.
  5. Done. Your job has completed.

Note: By default, Bottleneck does not keep track of DONE jobs, to save memory. You can enable this feature by passing trackDoneStatus: true as an option when creating a limiter.

counts()

const counts = limiter.counts();

console.log(counts);
/*
{
  RECEIVED: 0,
  QUEUED: 0,
  RUNNING: 0,
  EXECUTING: 0,
  DONE: 0
}
*/

Returns an object with the current number of jobs per status in the limiter.

jobStatus()

console.log(limiter.jobStatus("some-job-id"));
// Example: QUEUED

Returns the status of the job with the provided job id in the limiter. Returns null if no job with that id exists.

jobs()

console.log(limiter.jobs("RUNNING"));
// Example: ['id1', 'id2']

Returns an array of all the job ids with the specified status in the limiter. Not passing a status string returns all the known ids.

queued()

const count = limiter.queued(priority);

console.log(count);

priority is optional. Returns the number of QUEUED jobs with the given priority level. Omitting the priority argument returns the total number of queued jobs in the limiter.

clusterQueued()

const count = await limiter.clusterQueued();

console.log(count);

Returns the number of QUEUED jobs in the Cluster.

empty()

if (limiter.empty()) {
  // do something...
}

Returns a boolean which indicates whether there are any RECEIVED or QUEUED jobs in the limiter.

running()

limiter.running()
.then((count) => console.log(count));

Returns a promise that returns the total weight of the RUNNING and EXECUTING jobs in the Cluster.

done()

limiter.done()
.then((count) => console.log(count));

Returns a promise that returns the total weight of DONE jobs in the Cluster. Does not require passing the trackDoneStatus: true option.

check()

limiter.check()
.then((wouldRunNow) => console.log(wouldRunNow));

Checks if a new job would be executed immediately if it was submitted now. Returns a promise that returns a boolean.

Events

'error'

limiter.on("error", function (error) {
  /* handle errors here */
});

The two main causes of error events are: uncaught exceptions in your event handlers, and network errors when Clustering is enabled.

'failed'

limiter.on("failed", function (error, jobInfo) {
  // This will be called every time a job fails.
});

'retry'

See Retries to learn how to automatically retry jobs.

limiter.on("retry", function (message, jobInfo) {
  // This will be called every time a job is retried.
});

'empty'

limiter.on("empty", function () {
  // This will be called when `limiter.empty()` becomes true.
});

'idle'

limiter.on("idle", function () {
  // This will be called when `limiter.empty()` is `true` and `limiter.running()` is `0`.
});

'dropped'

limiter.on("dropped", function (dropped) {
  // This will be called when a strategy was triggered.
  // The dropped request is passed to this event listener.
});

'depleted'

limiter.on("depleted", function (empty) {
  // This will be called every time the reservoir drops to 0.
  // The `empty` (boolean) argument indicates whether `limiter.empty()` is currently true.
});

'debug'

limiter.on("debug", function (message, data) {
  // Useful to figure out what the limiter is doing in real time
  // and to help debug your application
});

'received' 'queued' 'scheduled' 'executing' 'done'

limiter.on("queued", function (info) {
  // This event is triggered when a job transitions from one Lifecycle stage to another
});

See Jobs Lifecycle for more information.

These Lifecycle events are not triggered for jobs located on another limiter in a Cluster, for performance reasons.

Other event methods

Use removeAllListeners() with an optional event name as first argument to remove listeners.

Use .once() instead of .on() to only receive a single event.

Retries

The following example:

const limiter = new Bottleneck();

// Listen to the "failed" event
limiter.on("failed", async (error, jobInfo) => {
  const id = jobInfo.options.id;
  console.warn(`Job ${id} failed: ${error}`);

  if (jobInfo.retryCount === 0) { // Here we only retry once
    console.log(`Retrying job ${id} in 25ms!`);
    return 25;
  }
});

// Listen to the "retry" event
limiter.on("retry", (error, jobInfo) => console.log(`Now retrying ${jobInfo.options.id}`));

const main = async function () {
  let executions = 0;

  // Schedule one job
  const result = await limiter.schedule({ id: 'ABC123' }, async () => {
    executions++;
    if (executions === 1) {
      throw new Error("Boom!");
    } else {
      return "Success!";
    }
  });

  console.log(`Result: ${result}`);
}

main();

will output

Job ABC123 failed: Error: Boom!
Retrying job ABC123 in 25ms!
Now retrying ABC123
Result: Success!

To re-run your job, simply return an integer from the 'failed' event handler. The number returned is how many milliseconds to wait before retrying it. Return 0 to retry it immediately.

IMPORTANT: When you ask the limiter to retry a job it will not send it back into the queue. It will stay in the EXECUTING state until it succeeds or until you stop retrying it. This means that it counts as a concurrent job for maxConcurrent even while it's just waiting to be retried. The number of milliseconds to wait ignores your minTime settings.

updateSettings()

limiter.updateSettings(options);

The options are the same as the limiter constructor.

Note: Changes don't affect SCHEDULED jobs.
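For example, a minimal sketch that throttles a running limiter more aggressively (per the v2 notes, updateSettings() returns a promise):

limiter.updateSettings({ maxConcurrent: 1, minTime: 1000 })
.then(() => {
  // new jobs now follow the updated settings
});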

incrementReservoir()

limiter.incrementReservoir(incrementBy);

Returns a promise that returns the new reservoir value.

currentReservoir()

limiter.currentReservoir()
.then((reservoir) => console.log(reservoir));

Returns a promise that returns the current reservoir value.

stop()

The stop() method is used to safely shut down a limiter. It prevents any new jobs from being added to the limiter and waits for all EXECUTING jobs to complete.

limiter.stop(options)
.then(() => {
  console.log("Shutdown completed!")
});

stop() returns a promise that resolves once all the EXECUTING jobs have completed and, if desired, once all non-EXECUTING jobs have been dropped.

Options:

  • dropWaitingJobs (default: true): When true, drop all the RECEIVED, QUEUED and RUNNING jobs. When false, allow those jobs to complete before resolving the Promise returned by this method.
  • dropErrorMessage (default: "This limiter has been stopped."): The error message used to drop jobs when dropWaitingJobs is true.
  • enqueueErrorMessage (default: "This limiter has been stopped and cannot accept new jobs."): The error message used to reject a job added to the limiter after stop() has been called.
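For example, a sketch of a graceful shutdown that drains the queue instead of dropping it:

limiter.stop({ dropWaitingJobs: false })
.then(() => console.log("All queued jobs completed, safe to exit"));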

chain()

The chain() method connects a limiter to another limiter: tasks that are ready to be executed on the first limiter will also be added to that other limiter, and must respect its rate limits too. Suppose you have 2 types of tasks, A and B. They both have their own limiter with their own settings, but both must also follow a global limiter G:

const limiterA = new Bottleneck( /* some settings */ );
const limiterB = new Bottleneck( /* some different settings */ );
const limiterG = new Bottleneck( /* some global settings */ );

limiterA.chain(limiterG);
limiterB.chain(limiterG);

// Requests added to limiterA must follow the A and G rate limits.
// Requests added to limiterB must follow the B and G rate limits.
// Requests added to limiterG must follow the G rate limits.

To unchain, call limiter.chain(null);.

Group

The Group feature of Bottleneck manages many limiters automatically for you. It creates limiters dynamically and transparently.

Let's take a DNS server as an example of how Bottleneck can be used. It's a service that sees a lot of abuse and where incoming DNS requests need to be rate limited. Bottleneck is so tiny, it's acceptable to create one limiter for each origin IP, even if it means creating thousands of limiters. The Group feature is perfect for this use case. Create one Group and use the origin IP to rate limit each IP independently. Each call with the same key (IP) will be routed to the same underlying limiter. A Group is created like a limiter:

const group = new Bottleneck.Group(options);

The options object will be used for every limiter created by the Group.

The Group is then used with the .key(str) method:

// In this example, the key is an IP
group.key("77.66.54.32").schedule(() => {
  /* process the request */
});

key()

  • str: The key to use. All jobs added with the same key will use the same underlying limiter. Default: ""

The return value of .key(str) is a limiter. If it doesn't already exist, it is generated for you. Calling key() is how limiters are created inside a Group.

Limiters that have been idle for longer than 5 minutes are deleted to avoid memory leaks; this value can be changed by passing a different timeout option, in milliseconds.
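A minimal sketch, assuming you want idle limiters kept for 10 minutes instead of the default 5:

const group = new Bottleneck.Group({
  maxConcurrent: 1,
  timeout: 10 * 60 * 1000 // idle limiters are deleted after 10 minutes
});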

on("created")

group.on("created", (limiter, key) => {
  console.log("A new limiter was created for key: " + key)

  // Prepare the limiter, for example we'll want to listen to its "error" events!
  limiter.on("error", (err) => {
    // Handle errors here
  })
});

Listening for the "created" event is the recommended way to set up a new limiter. Your event handler is executed before key() returns the newly created limiter.

updateSettings()

const group = new Bottleneck.Group({ maxConcurrent: 2, minTime: 250 });
group.updateSettings({ minTime: 500 });

After executing the above commands, new limiters will be created with { maxConcurrent: 2, minTime: 500 }.

deleteKey()

  • str: The key for the limiter to delete.

Manually deletes the limiter at the specified key. When using Clustering, the Redis data is immediately deleted and the other Groups in the Cluster will eventually delete their local key automatically, unless it is still being used.
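For example, reusing the IP key from the Group example above:

group.deleteKey("77.66.54.32");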

keys()

Returns an array containing all the keys in the Group.

clusterKeys()

Same as group.keys(), but returns all keys in this Group ID across the Cluster.
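A sketch of both; clusterKeys() reaches into Redis, so it is assumed here to return a promise:

console.log(group.keys());
// e.g. [ "77.66.54.32", "177.66.54.32" ]

group.clusterKeys()
.then((allKeys) => console.log(allKeys));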

limiters()

const limiters = group.limiters();

console.log(limiters);
// [ { key: "some key", limiter: <limiter> }, { key: "some other key", limiter: <some other limiter> } ]

Batching

Some APIs can accept multiple operations in a single call. Bottleneck's Batching feature helps you take advantage of those APIs:

const batcher = new Bottleneck.Batcher({
  maxTime: 1000,
  maxSize: 10
});

batcher.on("batch", (batch) => {
  console.log(batch); // ["some-data", "some-other-data"]

  // Handle batch here
});

batcher.add("some-data");
batcher.add("some-other-data");

batcher.add() returns a Promise that resolves once the request has been flushed to a "batch" event.

Options:

  • maxTime (default: null, unlimited): Maximum time, in milliseconds, a request may wait before being flushed to the "batch" event.
  • maxSize (default: null, unlimited): Maximum number of requests in a batch.

Batching doesn't throttle requests, it only groups them up optimally according to your maxTime and maxSize settings.
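Since Batching doesn't throttle on its own, one pattern is to pair the Batcher with a limiter. A sketch, where sendBulkRequest() is a hypothetical function that calls your bulk API:

const limiter = new Bottleneck({ minTime: 1000 });

batcher.on("batch", (batch) => {
  // throttle the bulk calls themselves
  limiter.schedule(() => sendBulkRequest(batch)); // sendBulkRequest() is hypothetical
});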

Clustering

Clustering lets many limiters access the same shared state, stored in Redis. Changes to the state are Atomic, Consistent and Isolated (and fully ACID with the right Durability configuration), to eliminate any chances of race conditions or state corruption. Your settings, such as maxConcurrent, minTime, etc., are shared across the whole cluster, which means —for example— that { maxConcurrent: 5 } guarantees no more than 5 jobs can ever run at a time in the entire cluster of limiters. 100% of Bottleneck's features are supported in Clustering mode. Enabling Clustering is as simple as changing a few settings. It's also a convenient way to store or export state for later use.

Bottleneck will attempt to spread load evenly across limiters.

Enabling Clustering

First, add redis or ioredis to your application's dependencies:

# NodeRedis (https://github.com/NodeRedis/node_redis)
npm install --save redis

# or ioredis (https://github.com/luin/ioredis)
npm install --save ioredis

Then create a limiter or a Group:

const limiter = new Bottleneck({
  /* Some basic options */
  maxConcurrent: 5,
  minTime: 500,
  id: "my-super-app", // All limiters with the same id will be clustered together

  /* Clustering options */
  datastore: "redis", // or "ioredis"
  clearDatastore: false,
  clientOptions: {
    host: "127.0.0.1",
    port: 6379

    // Redis client options
    // Using NodeRedis? See https://github.com/NodeRedis/node_redis#options-object-properties
    // Using ioredis? See https://github.com/luin/ioredis/blob/master/API.md#new-redisport-host-options
  }
});

Options:

  • datastore (default: "local"): Where the limiter stores its internal state. The default ("local") keeps the state in the limiter itself. Set it to "redis" or "ioredis" to enable Clustering.
  • clearDatastore (default: false): When set to true, on initial startup, the limiter will wipe any existing Bottleneck state data on the Redis db.
  • clientOptions (default: {}): This object is passed directly to the redis client library you've selected.
  • clusterNodes (default: null): ioredis only. When clusterNodes is not null, the client will be instantiated by calling new Redis.Cluster(clusterNodes, clientOptions) instead of new Redis(clientOptions).
  • timeout (default: null, no TTL): The Redis TTL, in milliseconds, for the keys created by the limiter. When timeout is set, the limiter's state will be automatically removed from Redis after timeout milliseconds of inactivity.
  • Redis (default: null): Overrides the import/require of the redis/ioredis library. You shouldn't need to set this option unless your application is failing to start due to a failure to require/import the client library.

Note: When using Groups, the timeout option has a default of 300000 milliseconds and the generated limiters automatically receive an id with the pattern ${group.id}-${KEY}.

Note: If you are seeing a runtime error due to the require() function not being able to load redis/ioredis, then directly pass the module as the Redis option. Example:

import Redis from "ioredis"

const limiter = new Bottleneck({
  id: "my-super-app",
  datastore: "ioredis",
  clientOptions: { host: '12.34.56.78', port: 6379 },
  Redis
});

Unfortunately, this is a side effect of having to disable inlining, which is necessary to make Bottleneck easy to use in the browser.

Important considerations when Clustering

The first limiter connecting to Redis will store its constructor options on Redis and all subsequent limiters will be using those settings. You can alter the constructor options used by all the connected limiters by calling updateSettings(). The clearDatastore option instructs a new limiter to wipe any previous Bottleneck data (for that id), including previously stored settings.

Queued jobs are NOT stored on Redis. They are local to each limiter. Exiting the Node.js process will lose those jobs. This is because Bottleneck has no way to propagate the JS code to run a job across a different Node.js process than the one it originated on. Bottleneck doesn't keep track of the queue contents of the limiters on a cluster for performance and reliability reasons. You can use something like BeeQueue in addition to Bottleneck to get around this limitation.

Due to the above, functionality relying on the queue length happens purely locally:

  • Priorities are local. A higher priority job will run before a lower priority job on the same limiter. Another limiter on the cluster might run a lower priority job before our higher priority one.
  • Assuming constant priority levels, Bottleneck guarantees that jobs will be run in the order they were received on the same limiter. Another limiter on the cluster might run a job received later before ours runs.
  • highWater and load shedding (strategies) are per limiter. However, one limiter entering Blocked mode will put the entire cluster in Blocked mode until penalty milliseconds have passed. See Strategies.
  • The "empty" event is triggered when the (local) queue is empty.
  • The "idle" event is triggered when the (local) queue is empty and no jobs are currently running anywhere in the cluster.

You must work around these limitations in your application code if they are an issue to you. The publish() method could be useful here.

The current design guarantees reliability, is highly performant and lets limiters come and go. Your application can scale up or down, and clients can be disconnected at any time without issues.

It is strongly recommended that you give an id to every limiter and Group since it is used to build the name of your limiter's Redis keys! Limiters with the same id inside the same Redis db will be sharing the same datastore.

It is strongly recommended that you set an expiration (See Job Options) on every job, since that lets the cluster recover from crashed or disconnected clients. Otherwise, a client crashing while executing a job would not be able to tell the cluster to decrease its number of "running" jobs. By using expirations, those lost jobs are automatically cleared after the specified time has passed. Using expirations is essential to keeping a cluster reliable in the face of unpredictable application bugs, network hiccups, and so on.
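For example, a sketch combining both recommendations, with a hypothetical processJob():

const limiter = new Bottleneck({
  id: "my-super-app", // used to build the Redis key names
  datastore: "ioredis"
});

// Jobs running for more than 30 seconds are failed and their
// "running" slot is reclaimed by the cluster.
limiter.schedule({ expiration: 30 * 1000, id: "job-42" }, () => processJob()); // processJob() is hypothetical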

Network latency between Node.js and Redis is not taken into account when calculating timings (such as minTime). To minimize the impact of latency, Bottleneck only performs a single Redis call per lifecycle transition. Keeping the Redis server close to your limiters will help you get a more consistent experience. Keeping the system time consistent across all clients will also help.

It is strongly recommended to set up an "error" listener on all your limiters and on your Groups.

Clustering Methods

The ready(), publish() and clients() methods also exist when using the local datastore, for code compatibility reasons: code written for redis/ioredis won't break with local.

ready()

This method returns a promise that resolves once the limiter is connected to Redis.

As of v2.9.0, it's no longer necessary to wait for .ready() to resolve before issuing commands to a limiter. The commands will be queued until the limiter successfully connects. Make sure to listen to the "error" event to handle connection errors.

const limiter = new Bottleneck({/* options */});

limiter.on("error", (err) => {
  // handle network errors
});

limiter.ready()
.then(() => {
  // The limiter is ready
});

publish(message)

This method broadcasts the message string to every limiter in the Cluster. It returns a promise.

const limiter = new Bottleneck({/* options */});

limiter.on("message", (msg) => {
  console.log(msg); // prints "this is a string"
});

limiter.publish("this is a string");

To send objects, stringify them first:

limiter.on("message", (msg) => {
  console.log(JSON.parse(msg).hello) // prints "world"
});

limiter.publish(JSON.stringify({ hello: "world" }));

clients()

If you need direct access to the redis clients, use .clients():

console.log(limiter.clients());
// { client: <Redis Client>, subscriber: <Redis Client> }

Additional Clustering information

  • Bottleneck is compatible with Redis Clusters, but you must use the ioredis datastore and the clusterNodes option.
  • Bottleneck is compatible with Redis Sentinel, but you must use the ioredis datastore.
  • Bottleneck's data is stored in Redis keys starting with b_. It also uses pubsub channels starting with b_. It will not interfere with any other data stored on the server.
  • Bottleneck loads a few Lua scripts on the Redis server using the SCRIPT LOAD command. These scripts only take up a few Kb of memory. Running the SCRIPT FLUSH command will cause any connected limiters to experience critical errors until a new limiter connects to Redis and loads the scripts again.
  • The Lua scripts are highly optimized and designed to use as few resources as possible.

Managing Redis Connections

Bottleneck needs to create 2 Redis Clients to function, one for normal operations and one for pubsub subscriptions. These 2 clients are kept in a Bottleneck.RedisConnection (NodeRedis) or a Bottleneck.IORedisConnection (ioredis) object, referred to as the Connection object.

By default, every Group and every standalone limiter (a limiter not created by a Group) will create their own Connection object, but it is possible to manually control this behavior. In this example, every Group and limiter is sharing the same Connection object and therefore the same 2 clients:

const connection = new Bottleneck.RedisConnection({
  clientOptions: {/* NodeRedis/ioredis options */}
  // ioredis also accepts `clusterNodes` here
});


const limiter = new Bottleneck({ connection: connection });
const group = new Bottleneck.Group({ connection: connection });

You can access and reuse the Connection object of any Group or limiter:

const group = new Bottleneck.Group({ connection: limiter.connection });

When a Connection object is created manually, the connectivity "error" events are emitted on the Connection itself.

connection.on("error", (err) => { /* handle connectivity errors here */ });

If you already have a NodeRedis/ioredis client, you can ask Bottleneck to reuse it, although currently the Connection object will still create a second client for pubsub operations:

import Redis from "redis";
const client = Redis.createClient({/* options */});

const connection = new Bottleneck.RedisConnection({
  // `clientOptions` and `clusterNodes` will be ignored since we're passing a raw client
  client: client
});

const limiter = new Bottleneck({ connection: connection });
const group = new Bottleneck.Group({ connection: connection });

Depending on your application, using more clients can improve performance.

Use the disconnect(flush) method to close the Redis clients.

limiter.disconnect();
group.disconnect();

If you created the Connection object manually, you need to call connection.disconnect() instead, for safety reasons.

Debugging your application

Debugging complex scheduling logic can be difficult, especially when priorities, weights, and network latency all interact with one another.

If your application is not behaving as expected, start by making sure you're catching "error" events emitted by your limiters and your Groups. Those errors are most likely uncaught exceptions from your application code.

Make sure you've read the 'Gotchas' section.

To see exactly what a limiter is doing in real time, listen to the "debug" event. It contains detailed information about how the limiter is executing your code. Adding job IDs to all your jobs makes the debug output more readable.
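A sketch of such a listener; the "debug" event signature is shown in the Events section above:

limiter.on("debug", (message, data) => {
  // adding ids to your jobs (see Job Options) makes these messages easier to follow
  console.log(new Date().toISOString(), message, JSON.stringify(data));
});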

When Bottleneck has to fail one of your jobs, it does so by using BottleneckError objects. This lets you tell those errors apart from your own code's errors:

limiter.schedule(fn)
.then((result) => { /* ... */ } )
.catch((error) => {
  if (error instanceof Bottleneck.BottleneckError) {
    /* ... */
  }
});

Upgrading to v2

The internal algorithms essentially haven't changed from v1, but many small changes to the interface were made to introduce new features.

All the breaking changes:

  • Bottleneck v2 requires Node 6+ or a modern browser. Use require("bottleneck/es5") if you need ES5 support in v2. Bottleneck v1 will continue to use ES5 only.
  • The Bottleneck constructor now takes an options object. See Constructor.
  • The Cluster feature is now called Group. This is to distinguish it from the new v2 Clustering feature.
  • The Group constructor takes an options object to match the limiter constructor.
  • Jobs take an optional options object. See Job options.
  • Removed submitPriority(), use submit() with an options object instead (see the sketch after this list).
  • Removed schedulePriority(), use schedule() with an options object instead.
  • The rejectOnDrop option is now true by default. It can be set to false if you wish to retain v1 behavior. However, this option is left undocumented, as disabling it is considered poor practice.
  • Use null instead of 0 to indicate an unlimited maxConcurrent value.
  • Use null instead of -1 to indicate an unlimited highWater value.
  • Renamed changeSettings() to updateSettings(); it now returns a promise to indicate completion. It takes the same options object as the constructor.
  • Renamed nbQueued() to queued().
  • Renamed nbRunning() to running(); it now returns its result using a promise.
  • Removed isBlocked().
  • Changing the Promise library is now done through the options object like any other limiter setting.
  • Removed changePenalty(), it is now done through the options object like any other limiter setting.
  • Removed changeReservoir(), it is now done through the options object like any other limiter setting.
  • Removed stopAll(). Use the new stop() method.
  • check() now accepts an optional weight argument, and returns its result using a promise.
  • Removed the Group changeTimeout() method. Instead, pass a timeout option when creating a Group.
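For example, the priority migration mentioned above (a hypothetical before/after sketch):

// v1
limiter.submitPriority(4, someAsyncCall, arg1, callback);

// v2: the priority moves into the job options object
limiter.submit({ priority: 4 }, someAsyncCall, arg1, callback);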

Version 2 is more user-friendly and powerful.

After upgrading your code, please take a minute to read the Debugging your application chapter.

Contributing

This README is always in need of improvements. If wording can be clearer and simpler, please consider forking this repo and submitting a Pull Request, or simply opening an issue.

Suggestions and bug reports are also welcome.

To work on the Bottleneck code, simply clone the repo, make your changes to the files located in src/ only, then run ./scripts/build.sh && npm test to ensure that everything is set up correctly.

To speed up compilation time during development, run ./scripts/build.sh dev instead. Make sure to build and test without dev before submitting a PR.

The tests must also pass in Clustering mode and using the ES5 bundle. You'll need a Redis server running locally (latency needs to be minimal to run the tests). If the server isn't using the default hostname and port, you can set those in the .env file. Then run ./scripts/build.sh && npm run test-all.

All contributions are appreciated and will be considered.

Author: SGrondin
Source Code: https://github.com/SGrondin/bottleneck 
License: MIT License

#node #clustering #scheduling 

Bottleneck: Rate Limiter That Makes Throttling Easy
Dexter  Goodwin

Dexter Goodwin

1643303820

Bottleneck: A powerful rate limiter that makes throttling easy

bottleneck

Bottleneck is a lightweight and zero-dependency Task Scheduler and Rate Limiter for Node.js and the browser.

Bottleneck is an easy solution as it adds very little complexity to your code. It is battle-hardened, reliable and production-ready and used on a large scale in private companies and open source software.

It supports Clustering: it can rate limit jobs across multiple Node.js instances. It uses Redis and strictly atomic operations to stay reliable in the presence of unreliable clients and networks. It also supports Redis Cluster and Redis Sentinel.

Upgrading from version 1?

Install

npm install --save bottleneck
import Bottleneck from "bottleneck";

// Note: To support older browsers and Node <6.0, you must import the ES5 bundle instead.
var Bottleneck = require("bottleneck/es5");

Quick Start

Step 1 of 3

Most APIs have a rate limit. For example, to execute 3 requests per second:

const limiter = new Bottleneck({
  minTime: 333
});

If there's a chance some requests might take longer than 333ms and you want to prevent more than 1 request from running at a time, add maxConcurrent: 1:

const limiter = new Bottleneck({
  maxConcurrent: 1,
  minTime: 333
});

minTime and maxConcurrent are enough for the majority of use cases. They work well together to ensure a smooth rate of requests. If your use case requires executing requests in bursts or every time a quota resets, look into Reservoir Intervals.

Step 2 of 3

➤ Using promises?

Instead of this:

myFunction(arg1, arg2)
.then((result) => {
  /* handle result */
});

Do this:

limiter.schedule(() => myFunction(arg1, arg2))
.then((result) => {
  /* handle result */
});

Or this:

const wrapped = limiter.wrap(myFunction);

wrapped(arg1, arg2)
.then((result) => {
  /* handle result */
});

➤ Using async/await?

Instead of this:

const result = await myFunction(arg1, arg2);

Do this:

const result = await limiter.schedule(() => myFunction(arg1, arg2));

Or this:

const wrapped = limiter.wrap(myFunction);

const result = await wrapped(arg1, arg2);

➤ Using callbacks?

Instead of this:

someAsyncCall(arg1, arg2, callback);

Do this:

limiter.submit(someAsyncCall, arg1, arg2, callback);

Step 3 of 3

Remember...

Bottleneck builds a queue of jobs and executes them as soon as possible. By default, the jobs will be executed in the order they were received.

Read the 'Gotchas' and you're good to go. Or keep reading to learn about all the fine tuning and advanced options available. If your rate limits need to be enforced across a cluster of computers, read the Clustering docs.

Need help debugging your application?

Instead of throttling maybe you want to batch up requests into fewer calls?

Gotchas & Common Mistakes

  • Make sure the function you pass to schedule() or wrap() only returns once all the work it does has completed.

Instead of this:

limiter.schedule(() => {
  tasksArray.forEach(x => processTask(x));
  // BAD, we return before our processTask() functions are finished processing!
});

Do this:

limiter.schedule(() => {
  const allTasks = tasksArray.map(x => processTask(x));
  // GOOD, we wait until all tasks are done.
  return Promise.all(allTasks);
});
  • If you're passing an object's method as a job, you'll probably need to bind() the object:
// instead of this:
limiter.schedule(object.doSomething);
// do this:
limiter.schedule(object.doSomething.bind(object));
// or, wrap it in an arrow function instead:
limiter.schedule(() => object.doSomething());

Bottleneck requires Node 6+ to function. However, an ES5 build is included: var Bottleneck = require("bottleneck/es5");.

Make sure you're catching "error" events emitted by your limiters!

Consider setting a maxConcurrent value instead of leaving it null. This can help your application's performance, especially if you think the limiter's queue might become very long.

If you plan on using priorities, make sure to set a maxConcurrent value.

When using submit(), if a callback isn't necessary, you must pass null or an empty function instead. It will not work otherwise.

When using submit(), make sure all the jobs will eventually complete by calling their callback, or set an expiration. Even if you submitted your job with a null callback , it still needs to call its callback. This is particularly important if you are using a maxConcurrent value that isn't null (unlimited), otherwise those not completed jobs will be clogging up the limiter and no new jobs will be allowed to run. It's safe to call the callback more than once, subsequent calls are ignored.

Using tools like mockdate in your tests to change time in JavaScript will likely result in undefined behavior from Bottleneck.

Docs

Constructor

const limiter = new Bottleneck({/* options */});

Basic options:

OptionDefaultDescription
maxConcurrentnull (unlimited)How many jobs can be executing at the same time. Consider setting a value instead of leaving it null, it can help your application's performance, especially if you think the limiter's queue might get very long.
minTime0 msHow long to wait after launching a job before launching another one.
highWaternull (unlimited)How long can the queue be? When the queue length exceeds that value, the selected strategy is executed to shed the load.
strategyBottleneck.strategy.LEAKWhich strategy to use when the queue gets longer than the high water mark. Read about strategies. Strategies are never executed if highWater is null.
penalty15 * minTime, or 5000 when minTime is 0The penalty value used by the BLOCK strategy.
reservoirnull (unlimited)How many jobs can be executed before the limiter stops executing jobs. If reservoir reaches 0, no jobs will be executed until it is no longer 0. New jobs will still be queued up.
reservoirRefreshIntervalnull (disabled)Every reservoirRefreshInterval milliseconds, the reservoir value will be automatically updated to the value of reservoirRefreshAmount. The reservoirRefreshInterval value should be a multiple of 250 (5000 for Clustering).
reservoirRefreshAmountnull (disabled)The value to set reservoir to when reservoirRefreshInterval is in use.
reservoirIncreaseIntervalnull (disabled)Every reservoirIncreaseInterval milliseconds, the reservoir value will be automatically incremented by reservoirIncreaseAmount. The reservoirIncreaseInterval value should be a multiple of 250 (5000 for Clustering).
reservoirIncreaseAmountnull (disabled)The increment applied to reservoir when reservoirIncreaseInterval is in use.
reservoirIncreaseMaximumnull (disabled)The maximum value that reservoir can reach when reservoirIncreaseInterval is in use.
PromisePromise (built-in)This lets you override the Promise library used by Bottleneck.

Reservoir Intervals

Reservoir Intervals let you execute requests in bursts, by automatically controlling the limiter's reservoir value. The reservoir is simply the number of jobs the limiter is allowed to execute. Once the value reaches 0, it stops starting new jobs.

There are 2 types of Reservoir Intervals: Refresh Intervals and Increase Intervals.

Refresh Interval

In this example, we throttle to 100 requests every 60 seconds:

const limiter = new Bottleneck({
  reservoir: 100, // initial value
  reservoirRefreshAmount: 100,
  reservoirRefreshInterval: 60 * 1000, // must be divisible by 250

  // also use maxConcurrent and/or minTime for safety
  maxConcurrent: 1,
  minTime: 333 // pick a value that makes sense for your use case
});

reservoir is a counter decremented every time a job is launched, we set its initial value to 100. Then, every reservoirRefreshInterval (60000 ms), reservoir is automatically updated to be equal to the reservoirRefreshAmount (100).

Increase Interval

In this example, we throttle jobs to meet the Shopify API Rate Limits. Users are allowed to send 40 requests initially, then every second grants 2 more requests up to a maximum of 40.

const limiter = new Bottleneck({
  reservoir: 40, // initial value
  reservoirIncreaseAmount: 2,
  reservoirIncreaseInterval: 1000, // must be divisible by 250
  reservoirIncreaseMaximum: 40,

  // also use maxConcurrent and/or minTime for safety
  maxConcurrent: 5,
  minTime: 250 // pick a value that makes sense for your use case
});

Warnings

Reservoir Intervals are an advanced feature, please take the time to read and understand the following warnings.

Reservoir Intervals are not a replacement for minTime and maxConcurrent. It's strongly recommended to also use minTime and/or maxConcurrent to spread out the load. For example, suppose a lot of jobs are queued up because the reservoir is 0. Every time the Refresh Interval is triggered, a number of jobs equal to reservoirRefreshAmount will automatically be launched, all at the same time! To prevent this flooding effect and keep your application running smoothly, use minTime and maxConcurrent to stagger the jobs.

The Reservoir Interval starts from the moment the limiter is created. Let's suppose we're using reservoirRefreshAmount: 5. If you happen to add 10 jobs just 1ms before the refresh is triggered, the first 5 will run immediately, then 1ms later it will refresh the reservoir value and that will make the last 5 also run right away. It will have run 10 jobs in just over 1ms no matter what your reservoir interval was!

Reservoir Intervals prevent a limiter from being garbage collected. Call limiter.disconnect() to clear the interval and allow the memory to be freed. However, it's not necessary to call .disconnect() to allow the Node.js process to exit.

submit()

Adds a job to the queue. This is the callback version of schedule().

limiter.submit(someAsyncCall, arg1, arg2, callback);

You can pass null instead of an empty function if there is no callback, but someAsyncCall still needs to call its callback to let the limiter know it has completed its work.

submit() can also accept advanced options.

schedule()

Adds a job to the queue. This is the Promise and async/await version of submit().

const fn = function(arg1, arg2) {
  return httpGet(arg1, arg2); // Here httpGet() returns a promise
};

limiter.schedule(fn, arg1, arg2)
.then((result) => {
  /* ... */
});

In other words, schedule() takes a function fn and a list of arguments. schedule() returns a promise that will be executed according to the rate limits.

schedule() can also accept advanced options.

Here's another example:

// suppose that `client.get(url)` returns a promise

const url = "https://wikipedia.org";

limiter.schedule(() => client.get(url))
.then(response => console.log(response.body));

wrap()

Takes a function that returns a promise. Returns a function identical to the original, but rate limited.

const wrapped = limiter.wrap(fn);

wrapped()
.then(function (result) {
  /* ... */
})
.catch(function (error) {
  // Bottleneck might need to fail the job even if the original function can never fail.
  // For example, your job is taking longer than the `expiration` time you've set.
});

Job Options

submit(), schedule(), and wrap() all accept advanced options.

// Submit
limiter.submit({/* options */}, someAsyncCall, arg1, arg2, callback);

// Schedule
limiter.schedule({/* options */}, fn, arg1, arg2);

// Wrap
const wrapped = limiter.wrap(fn);
wrapped.withOptions({/* options */}, arg1, arg2);
OptionDefaultDescription
priority5A priority between 0 and 9. A job with a priority of 4 will be queued ahead of a job with a priority of 5. Important: You must set a low maxConcurrent value for priorities to work, otherwise there is nothing to queue because jobs will be be scheduled immediately!
weight1Must be an integer equal to or higher than 0. The weight is what increases the number of running jobs (up to maxConcurrent) and decreases the reservoir value.
expirationnull (unlimited)The number of milliseconds a job is given to complete. Jobs that execute for longer than expiration ms will be failed with a BottleneckError.
id<no-id>You should give an ID to your jobs, it helps with debugging.

Strategies

A strategy is a simple algorithm that is executed every time adding a job would cause the number of queued jobs to exceed highWater. Strategies are never executed if highWater is null.

Bottleneck.strategy.LEAK

When adding a new job to a limiter, if the queue length reaches highWater, drop the oldest job with the lowest priority. This is useful when jobs that have been waiting for too long are not important anymore. If all the queued jobs are more important (based on their priority value) than the one being added, it will not be added.

Bottleneck.strategy.OVERFLOW_PRIORITY

Same as LEAK, except it will only drop jobs that are less important than the one being added. If all the queued jobs are as or more important than the new one, it will not be added.

Bottleneck.strategy.OVERFLOW

When adding a new job to a limiter, if the queue length reaches highWater, do not add the new job. This strategy totally ignores priority levels.

Bottleneck.strategy.BLOCK

When adding a new job to a limiter, if the queue length reaches highWater, the limiter falls into "blocked mode". All queued jobs are dropped and no new jobs will be accepted until the limiter unblocks. It will unblock after penalty milliseconds have passed without receiving a new job. penalty is equal to 15 * minTime (or 5000 if minTime is 0) by default. This strategy is ideal when bruteforce attacks are to be expected. This strategy totally ignores priority levels.

Jobs lifecycle

  1. Received. Your new job has been added to the limiter. Bottleneck needs to check whether it can be accepted into the queue.
  2. Queued. Bottleneck has accepted your job, but it can not tell at what exact timestamp it will run yet, because it is dependent on previous jobs.
  3. Running. Your job is not in the queue anymore, it will be executed after a delay that was computed according to your minTime setting.
  4. Executing. Your job is executing its code.
  5. Done. Your job has completed.

Note: By default, Bottleneck does not keep track of DONE jobs, to save memory. You can enable this feature by passing trackDoneStatus: true as an option when creating a limiter.

counts()

const counts = limiter.counts();

console.log(counts);
/*
{
  RECEIVED: 0,
  QUEUED: 0,
  RUNNING: 0,
  EXECUTING: 0,
  DONE: 0
}
*/

Returns an object with the current number of jobs per status in the limiter.

jobStatus()

console.log(limiter.jobStatus("some-job-id"));
// Example: QUEUED

Returns the status of the job with the provided job id in the limiter. Returns null if no job with that id exist.

jobs()

console.log(limiter.jobs("RUNNING"));
// Example: ['id1', 'id2']

Returns an array of all the job ids with the specified status in the limiter. Not passing a status string returns all the known ids.

queued()

const count = limiter.queued(priority);

console.log(count);

priority is optional. Returns the number of QUEUED jobs with the given priority level. Omitting the priority argument returns the total number of queued jobs in the limiter.

clusterQueued()

const count = await limiter.clusterQueued();

console.log(count);

Returns the number of QUEUED jobs in the Cluster.

empty()

if (limiter.empty()) {
  // do something...
}

Returns a boolean which indicates whether there are any RECEIVED or QUEUED jobs in the limiter.

running()

limiter.running()
.then((count) => console.log(count));

Returns a promise that returns the total weight of the RUNNING and EXECUTING jobs in the Cluster.

done()

limiter.done()
.then((count) => console.log(count));

Returns a promise that returns the total weight of DONE jobs in the Cluster. Does not require passing the trackDoneStatus: true option.

check()

limiter.check()
.then((wouldRunNow) => console.log(wouldRunNow));

Checks if a new job would be executed immediately if it was submitted now. Returns a promise that returns a boolean.

Events

'error'

limiter.on("error", function (error) {
  /* handle errors here */
});

The two main causes of error events are: uncaught exceptions in your event handlers, and network errors when Clustering is enabled.

'failed'

limiter.on("failed", function (error, jobInfo) {
  // This will be called every time a job fails.
});

'retry'

See Retries to learn how to automatically retry jobs.

limiter.on("retry", function (message, jobInfo) {
  // This will be called every time a job is retried.
});

'empty'

limiter.on("empty", function () {
  // This will be called when `limiter.empty()` becomes true.
});

'idle'

limiter.on("idle", function () {
  // This will be called when `limiter.empty()` is `true` and `limiter.running()` is `0`.
});

'dropped'

limiter.on("dropped", function (dropped) {
  // This will be called when a strategy was triggered.
  // The dropped request is passed to this event listener.
});

'depleted'

limiter.on("depleted", function (empty) {
  // This will be called every time the reservoir drops to 0.
  // The `empty` (boolean) argument indicates whether `limiter.empty()` is currently true.
});

'debug'

limiter.on("debug", function (message, data) {
  // Useful to figure out what the limiter is doing in real time
  // and to help debug your application
});

'received' 'queued' 'scheduled' 'executing' 'done'

limiter.on("queued", function (info) {
  // This event is triggered when a job transitions from one Lifecycle stage to another
});

See Jobs Lifecycle for more information.

These Lifecycle events are not triggered for jobs located on another limiter in a Cluster, for performance reasons.

Other event methods

Use removeAllListeners() with an optional event name as first argument to remove listeners.

Use .once() instead of .on() to only receive a single event.

Retries

The following example:

const limiter = new Bottleneck();

// Listen to the "failed" event
limiter.on("failed", async (error, jobInfo) => {
  const id = jobInfo.options.id;
  console.warn(`Job ${id} failed: ${error}`);

  if (jobInfo.retryCount === 0) { // Here we only retry once
    console.log(`Retrying job ${id} in 25ms!`);
    return 25;
  }
});

// Listen to the "retry" event
limiter.on("retry", (error, jobInfo) => console.log(`Now retrying ${jobInfo.options.id}`));

const main = async function () {
  let executions = 0;

  // Schedule one job
  const result = await limiter.schedule({ id: 'ABC123' }, async () => {
    executions++;
    if (executions === 1) {
      throw new Error("Boom!");
    } else {
      return "Success!";
    }
  });

  console.log(`Result: ${result}`);
}

main();

will output

Job ABC123 failed: Error: Boom!
Retrying job ABC123 in 25ms!
Now retrying ABC123
Result: Success!

To re-run your job, simply return an integer from the 'failed' event handler. The number returned is how many milliseconds to wait before retrying it. Return 0 to retry it immediately.

IMPORTANT: When you ask the limiter to retry a job it will not send it back into the queue. It will stay in the EXECUTING state until it succeeds or until you stop retrying it. This means that it counts as a concurrent job for maxConcurrent even while it's just waiting to be retried. The number of milliseconds to wait ignores your minTime settings.

updateSettings()

limiter.updateSettings(options);

The options are the same as the limiter constructor.

Note: Changes don't affect SCHEDULED jobs.

incrementReservoir()

limiter.incrementReservoir(incrementBy);

Returns a promise that returns the new reservoir value.

currentReservoir()

limiter.currentReservoir()
.then((reservoir) => console.log(reservoir));

Returns a promise that returns the current reservoir value.

stop()

The stop() method is used to safely shutdown a limiter. It prevents any new jobs from being added to the limiter and waits for all EXECUTING jobs to complete.

limiter.stop(options)
.then(() => {
  console.log("Shutdown completed!")
});

stop() returns a promise that resolves once all the EXECUTING jobs have completed and, if desired, once all non-EXECUTING jobs have been dropped.

OptionDefaultDescription
dropWaitingJobstrueWhen true, drop all the RECEIVED, QUEUED and RUNNING jobs. When false, allow those jobs to complete before resolving the Promise returned by this method.
dropErrorMessageThis limiter has been stopped.The error message used to drop jobs when dropWaitingJobs is true.
enqueueErrorMessageThis limiter has been stopped and cannot accept new jobs.The error message used to reject a job added to the limiter after stop() has been called.

chain()

Tasks that are ready to be executed will be added to that other limiter. Suppose you have 2 types of tasks, A and B. They both have their own limiter with their own settings, but both must also follow a global limiter G:

const limiterA = new Bottleneck( /* some settings */ );
const limiterB = new Bottleneck( /* some different settings */ );
const limiterG = new Bottleneck( /* some global settings */ );

limiterA.chain(limiterG);
limiterB.chain(limiterG);

// Requests added to limiterA must follow the A and G rate limits.
// Requests added to limiterB must follow the B and G rate limits.
// Requests added to limiterG must follow the G rate limits.

To unchain, call limiter.chain(null);.

Group

The Group feature of Bottleneck manages many limiters automatically for you. It creates limiters dynamically and transparently.

Let's take a DNS server as an example of how Bottleneck can be used. It's a service that sees a lot of abuse and where incoming DNS requests need to be rate limited. Bottleneck is so tiny, it's acceptable to create one limiter for each origin IP, even if it means creating thousands of limiters. The Group feature is perfect for this use case. Create one Group and use the origin IP to rate limit each IP independently. Each call with the same key (IP) will be routed to the same underlying limiter. A Group is created like a limiter:

const group = new Bottleneck.Group(options);

The options object will be used for every limiter created by the Group.

The Group is then used with the .key(str) method:

// In this example, the key is an IP
group.key("77.66.54.32").schedule(() => {
  /* process the request */
});

key()

  • str : The key to use. All jobs added with the same key will use the same underlying limiter. Default: ""

The return value of .key(str) is a limiter. If it doesn't already exist, it is generated for you. Calling key() is how limiters are created inside a Group.

Limiters that have been idle for longer than 5 minutes are deleted to avoid memory leaks, this value can be changed by passing a different timeout option, in milliseconds.

on("created")

group.on("created", (limiter, key) => {
  console.log("A new limiter was created for key: " + key)

  // Prepare the limiter, for example we'll want to listen to its "error" events!
  limiter.on("error", (err) => {
    // Handle errors here
  })
});

Listening for the "created" event is the recommended way to set up a new limiter. Your event handler is executed before key() returns the newly created limiter.

updateSettings()

const group = new Bottleneck.Group({ maxConcurrent: 2, minTime: 250 });
group.updateSettings({ minTime: 500 });

After executing the above commands, new limiters will be created with { maxConcurrent: 2, minTime: 500 }.

deleteKey()

  • str: The key for the limiter to delete.

Manually deletes the limiter at the specified key. When using Clustering, the Redis data is immediately deleted and the other Groups in the Cluster will eventually delete their local key automatically, unless it is still being used.

keys()

Returns an array containing all the keys in the Group.

clusterKeys()

Same as group.keys(), but returns all keys in this Group ID across the Cluster.

limiters()

const limiters = group.limiters();

console.log(limiters);
// [ { key: "some key", limiter: <limiter> }, { key: "some other key", limiter: <some other limiter> } ]

Batching

Some APIs can accept multiple operations in a single call. Bottleneck's Batching feature helps you take advantage of those APIs:

const batcher = new Bottleneck.Batcher({
  maxTime: 1000,
  maxSize: 10
});

batcher.on("batch", (batch) => {
  console.log(batch); // ["some-data", "some-other-data"]

  // Handle batch here
});

batcher.add("some-data");
batcher.add("some-other-data");

batcher.add() returns a Promise that resolves once the request has been flushed to a "batch" event.

OptionDefaultDescription
maxTimenull (unlimited)Maximum acceptable time (in milliseconds) a request can have to wait before being flushed to the "batch" event.
maxSizenull (unlimited)Maximum number of requests in a batch.

Batching doesn't throttle requests, it only groups them up optimally according to your maxTime and maxSize settings.

Clustering

Clustering lets many limiters access the same shared state, stored in Redis. Changes to the state are Atomic, Consistent and Isolated (and fully ACID with the right Durability configuration), to eliminate any chances of race conditions or state corruption. Your settings, such as maxConcurrent, minTime, etc., are shared across the whole cluster, which means —for example— that { maxConcurrent: 5 } guarantees no more than 5 jobs can ever run at a time in the entire cluster of limiters. 100% of Bottleneck's features are supported in Clustering mode. Enabling Clustering is as simple as changing a few settings. It's also a convenient way to store or export state for later use.

Bottleneck will attempt to spread load evenly across limiters.

Enabling Clustering

First, add redis or ioredis to your application's dependencies:

# NodeRedis (https://github.com/NodeRedis/node_redis)
npm install --save redis

# or ioredis (https://github.com/luin/ioredis)
npm install --save ioredis

Then create a limiter or a Group:

const limiter = new Bottleneck({
  /* Some basic options */
  maxConcurrent: 5,
  minTime: 500
  id: "my-super-app" // All limiters with the same id will be clustered together

  /* Clustering options */
  datastore: "redis", // or "ioredis"
  clearDatastore: false,
  clientOptions: {
    host: "127.0.0.1",
    port: 6379

    // Redis client options
    // Using NodeRedis? See https://github.com/NodeRedis/node_redis#options-object-properties
    // Using ioredis? See https://github.com/luin/ioredis/blob/master/API.md#new-redisport-host-options
  }
});
OptionDefaultDescription
datastore"local"Where the limiter stores its internal state. The default ("local") keeps the state in the limiter itself. Set it to "redis" or "ioredis" to enable Clustering.
clearDatastorefalseWhen set to true, on initial startup, the limiter will wipe any existing Bottleneck state data on the Redis db.
clientOptions{}This object is passed directly to the redis client library you've selected.
clusterNodesnullioredis only. When clusterNodes is not null, the client will be instantiated by calling new Redis.Cluster(clusterNodes, clientOptions) instead of new Redis(clientOptions).
timeoutnull (no TTL)The Redis TTL in milliseconds (TTL) for the keys created by the limiter. When timeout is set, the limiter's state will be automatically removed from Redis after timeout milliseconds of inactivity.
RedisnullOverrides the import/require of the redis/ioredis library. You shouldn't need to set this option unless your application is failing to start due to a failure to require/import the client library.

Note: When using Groups, the timeout option has a default of 300000 milliseconds and the generated limiters automatically receive an id with the pattern ${group.id}-${KEY}.

Note: If you are seeing a runtime error due to the require() function not being able to load redis/ioredis, then directly pass the module as the Redis option. Example:

import Redis from "ioredis"

const limiter = new Bottleneck({
  id: "my-super-app",
  datastore: "ioredis",
  clientOptions: { host: '12.34.56.78', port: 6379 },
  Redis
});

Unfortunately, this is a side effect of having to disable inlining, which is necessary to make Bottleneck easy to use in the browser.

Important considerations when Clustering

The first limiter connecting to Redis will store its constructor options on Redis and all subsequent limiters will be using those settings. You can alter the constructor options used by all the connected limiters by calling updateSettings(). The clearDatastore option instructs a new limiter to wipe any previous Bottleneck data (for that id), including previously stored settings.

Queued jobs are NOT stored on Redis. They are local to each limiter. Exiting the Node.js process will lose those jobs. This is because Bottleneck has no way to propagate the JS code to run a job across a different Node.js process than the one it originated on. Bottleneck doesn't keep track of the queue contents of the limiters on a cluster for performance and reliability reasons. You can use something like BeeQueue in addition to Bottleneck to get around this limitation.

Due to the above, functionality relying on the queue length happens purely locally:

  • Priorities are local. A higher priority job will run before a lower priority job on the same limiter. Another limiter on the cluster might run a lower priority job before our higher priority one.
  • Assuming constant priority levels, Bottleneck guarantees that jobs will be run in the order they were received on the same limiter. Another limiter on the cluster might run a job received later before ours runs.
  • highWater and load shedding (strategies) are per limiter. However, one limiter entering Blocked mode will put the entire cluster in Blocked mode until penalty milliseconds have passed. See Strategies.
  • The "empty" event is triggered when the (local) queue is empty.
  • The "idle" event is triggered when the (local) queue is empty and no jobs are currently running anywhere in the cluster.

You must work around these limitations in your application code if they are an issue to you. The publish() method could be useful here.

The current design guarantees reliability, is highly performant and lets limiters come and go. Your application can scale up or down, and clients can be disconnected at any time without issues.

It is strongly recommended that you give an id to every limiter and Group since it is used to build the name of your limiter's Redis keys! Limiters with the same id inside the same Redis db will be sharing the same datastore.

It is strongly recommended that you set an expiration (See Job Options) on every job, since that lets the cluster recover from crashed or disconnected clients. Otherwise, a client crashing while executing a job would not be able to tell the cluster to decrease its number of "running" jobs. By using expirations, those lost jobs are automatically cleared after the specified time has passed. Using expirations is essential to keeping a cluster reliable in the face of unpredictable application bugs, network hiccups, and so on.

Network latency between Node.js and Redis is not taken into account when calculating timings (such as minTime). To minimize the impact of latency, Bottleneck only performs a single Redis call per lifecycle transition. Keeping the Redis server close to your limiters will help you get a more consistent experience. Keeping the system time consistent across all clients will also help.

It is strongly recommended to set up an "error" listener on all your limiters and on your Groups.

Clustering Methods

The ready(), publish() and clients() methods also exist when using the local datastore, for code compatibility reasons: code written for redis/ioredis won't break with local.

ready()

This method returns a promise that resolves once the limiter is connected to Redis.

As of v2.9.0, it's no longer necessary to wait for .ready() to resolve before issuing commands to a limiter. The commands will be queued until the limiter successfully connects. Make sure to listen to the "error" event to handle connection errors.

const limiter = new Bottleneck({/* options */});

limiter.on("error", (err) => {
  // handle network errors
});

limiter.ready()
.then(() => {
  // The limiter is ready
});

publish(message)

This method broadcasts the message string to every limiter in the Cluster. It returns a promise.

const limiter = new Bottleneck({/* options */});

limiter.on("message", (msg) => {
  console.log(msg); // prints "this is a string"
});

limiter.publish("this is a string");

To send objects, stringify them first:

limiter.on("message", (msg) => {
  console.log(JSON.parse(msg).hello) // prints "world"
});

limiter.publish(JSON.stringify({ hello: "world" }));

clients()

If you need direct access to the redis clients, use .clients():

console.log(limiter.clients());
// { client: <Redis Client>, subscriber: <Redis Client> }

Additional Clustering information

  • Bottleneck is compatible with Redis Clusters, but you must use the ioredis datastore and the clusterNodes option.
  • Bottleneck is compatible with Redis Sentinel, but you must use the ioredis datastore.
  • Bottleneck's data is stored in Redis keys starting with b_. It also uses pubsub channels starting with b_ It will not interfere with any other data stored on the server.
  • Bottleneck loads a few Lua scripts on the Redis server using the SCRIPT LOAD command. These scripts only take up a few Kb of memory. Running the SCRIPT FLUSH command will cause any connected limiters to experience critical errors until a new limiter connects to Redis and loads the scripts again.
  • The Lua scripts are highly optimized and designed to use as few resources as possible.

Managing Redis Connections

Bottleneck needs to create 2 Redis Clients to function, one for normal operations and one for pubsub subscriptions. These 2 clients are kept in a Bottleneck.RedisConnection (NodeRedis) or a Bottleneck.IORedisConnection (ioredis) object, referred to as the Connection object.

By default, every Group and every standalone limiter (a limiter not created by a Group) will create their own Connection object, but it is possible to manually control this behavior. In this example, every Group and limiter is sharing the same Connection object and therefore the same 2 clients:

const connection = new Bottleneck.RedisConnection({
  clientOptions: {/* NodeRedis/ioredis options */}
  // ioredis also accepts `clusterNodes` here
});


const limiter = new Bottleneck({ connection: connection });
const group = new Bottleneck.Group({ connection: connection });

You can access and reuse the Connection object of any Group or limiter:

const group = new Bottleneck.Group({ connection: limiter.connection });

When a Connection object is created manually, the connectivity "error" events are emitted on the Connection itself.

connection.on("error", (err) => { /* handle connectivity errors here */ });

If you already have a NodeRedis/ioredis client, you can ask Bottleneck to reuse it, although currently the Connection object will still create a second client for pubsub operations:

import Redis from "redis";
const client = new Redis.createClient({/* options */});

const connection = new Bottleneck.RedisConnection({
  // `clientOptions` and `clusterNodes` will be ignored since we're passing a raw client
  client: client
});

const limiter = new Bottleneck({ connection: connection });
const group = new Bottleneck.Group({ connection: connection });

Depending on your application, using more clients can improve performance.

Use the disconnect(flush) method to close the Redis clients.

limiter.disconnect();
group.disconnect();

If you created the Connection object manually, you need to call connection.disconnect() instead, for safety reasons.

Debugging your application

Debugging complex scheduling logic can be difficult, especially when priorities, weights, and network latency all interact with one another.

If your application is not behaving as expected, start by making sure you're catching "error" events emitted by your limiters and your Groups. Those errors are most likely uncaught exceptions from your application code.

Make sure you've read the 'Gotchas' section.

To see exactly what a limiter is doing in real time, listen to the "debug" event. It contains detailed information about how the limiter is executing your code. Adding job IDs to all your jobs makes the debug output more readable.

When Bottleneck has to fail one of your jobs, it does so by using BottleneckError objects. This lets you tell those errors apart from your own code's errors:

limiter.schedule(fn)
.then((result) => { /* ... */ } )
.catch((error) => {
  if (error instanceof Bottleneck.BottleneckError) {
    /* ... */
  }
});

Upgrading to v2

The internal algorithms essentially haven't changed from v1, but many small changes to the interface were made to introduce new features.

All the breaking changes:

  • Bottleneck v2 requires Node 6+ or a modern browser. Use require("bottleneck/es5") if you need ES5 support in v2. Bottleneck v1 will continue to use ES5 only.
  • The Bottleneck constructor now takes an options object. See Constructor.
  • The Cluster feature is now called Group. This is to distinguish it from the new v2 Clustering feature.
  • The Group constructor takes an options object to match the limiter constructor.
  • Jobs take an optional options object. See Job options.
  • Removed submitPriority(), use submit() with an options object instead.
  • Removed schedulePriority(), use schedule() with an options object instead.
  • The rejectOnDrop option is now true by default. It can be set to false if you wish to retain v1 behavior; that setting is left undocumented, as disabling rejectOnDrop is considered poor practice.
  • Use null instead of 0 to indicate an unlimited maxConcurrent value.
  • Use null instead of -1 to indicate an unlimited highWater value.
  • Renamed changeSettings() to updateSettings(); it now returns a promise to indicate completion. It takes the same options object as the constructor.
  • Renamed nbQueued() to queued().
  • Renamed nbRunning() to running(); it now returns its result using a promise.
  • Removed isBlocked().
  • Changing the Promise library is now done through the options object like any other limiter setting.
  • Removed changePenalty(), it is now done through the options object like any other limiter setting.
  • Removed changeReservoir(), it is now done through the options object like any other limiter setting.
  • Removed stopAll(). Use the new stop() method.
  • check() now accepts an optional weight argument, and returns its result using a promise.
  • Removed the Group changeTimeout() method. Instead, pass a timeout option when creating a Group.

Version 2 is more user-friendly and powerful.

After upgrading your code, please take a minute to read the Debugging your application chapter.

Contributing

This README is always in need of improvements. If wording can be clearer and simpler, please consider forking this repo and submitting a Pull Request, or simply opening an issue.

Suggestions and bug reports are also welcome.

To work on the Bottleneck code, simply clone the repo, make your changes to the files located in src/ only, then run ./scripts/build.sh && npm test to ensure that everything is set up correctly.

To speed up compilation time during development, run ./scripts/build.sh dev instead. Make sure to build and test without dev before submitting a PR.

The tests must also pass in Clustering mode and using the ES5 bundle. You'll need a Redis server running locally (latency needs to be minimal to run the tests). If the server isn't using the default hostname and port, you can set those in the .env file. Then run ./scripts/build.sh && npm run test-all.

All contributions are appreciated and will be considered.

Author: SGrondin
Source Code: https://github.com/SGrondin/bottleneck 
License: MIT License

#javascript #node #clustering  

Bottleneck: A powerful rate limiter that makes throttling easy
Reid  Rohan

Reid Rohan

1641506040

A Python Library for Accurate and Scalable Fuzzy Matching

Dedupe Python Library

dedupe is a python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on structured data.

dedupe will help you:

  • remove duplicate entries from a spreadsheet of names and addresses
  • link a list with customer information to another with order history, even without unique customer IDs
  • take a database of campaign contributions and figure out which ones were made by the same person, even if the names were entered slightly differently for each record

dedupe takes in human training data and comes up with the best rules for your dataset to quickly and automatically find similar records, even with very large databases.
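
To make this workflow concrete, here is a minimal sketch of dedupe's active-learning loop, written against the 2.x API as commonly documented (method names and field definitions can differ between versions, and the tiny data dict below is a made-up example):

# A hedged sketch of the dedupe workflow, assuming the 2.x API.
# `data` is a made-up dict mapping record IDs to field dicts.
import dedupe

data = {
    0: {"name": "Jane Doe", "address": "123 Main St"},
    1: {"name": "Jane  Doe", "address": "123 Main Street"},
    2: {"name": "John Smith", "address": "45 Oak Ave"},
}

# Declare which fields to compare and how to compare them.
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)

# Interactively label candidate pairs in the console as duplicates or not;
# this is the "human training data" the library learns its rules from.
dedupe.console_label(deduper)
deduper.train()

# Group records into clusters of likely duplicates, with confidence scores.
for record_ids, scores in deduper.partition(data, threshold=0.5):
    print(record_ids, scores)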

dedupe library consulting

If you or your organization would like professional assistance in working with the dedupe library, Dedupe.io LLC offers consulting services. Read more about pricing and available services here.

Tools built with dedupe

Dedupe.io

A cloud service powered by the dedupe library for de-duplicating and finding matches in your data. It provides a step-by-step wizard for uploading your data, setting up a model, training, clustering and reviewing the results.

Dedupe.io also supports record linkage across data sources and continuous matching and training through an API.

For more, see the Dedupe.io product site, tutorials on how to use it, and differences between it and the dedupe library.

Dedupe is well adopted by the Python community. Check out this blogpost, a YouTube video on how to use Dedupe with Python, and a YouTube video on how to apply Dedupe at scale using Spark.

csvdedupe

Command line tool for de-duplicating and linking CSV files. Read about it on Source Knight-Mozilla OpenNews.

Installation

Using dedupe

If you only want to use dedupe, install it this way:

pip install dedupe

Familiarize yourself with dedupe's API, and get started on your project. Need inspiration? Have a look at some examples.

Developing dedupe

We recommend using virtualenv and virtualenvwrapper for working in a virtualized development environment. Read how to set up virtualenv.

Once you have virtualenvwrapper set up,

mkvirtualenv dedupe
git clone git://github.com/dedupeio/dedupe.git
cd dedupe
pip install "numpy>=1.9"
pip install -r requirements.txt
cython src/*.pyx
pip install -e .

If these tests pass, then everything should have been installed correctly!

pytest

Afterwards, whenever you want to work on dedupe,

workon dedupe

Testing

Unit tests of core dedupe functions

pytest

Test using canonical dataset from Bilenko's research

Using Deduplication

python tests/canonical.py

Using Record Linkage

python tests/canonical_matching.py

Team

  • Forest Gregg, DataMade
  • Derek Eder, DataMade

Credits

Dedupe is based on Mikhail Yuryevich Bilenko's Ph.D. dissertation: Learnable Similarity Functions and their Application to Record Linkage and Clustering.

Errors / Bugs

If something is not behaving intuitively, it is a bug, and should be reported. Report it here

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Send us a pull request. Bonus points for topic branches.

Copyright

Copyright (c) 2019 Forest Gregg and Derek Eder. Released under the MIT License.

Third-party copyright in this distribution is noted where applicable.

Citing Dedupe

If you use Dedupe in an academic work, please give this citation:

Forest Gregg and Derek Eder. 2019. Dedupe. https://github.com/dedupeio/dedupe.

Author: Dedupeio
Source Code: https://github.com/dedupeio/dedupe 
License: MIT License

#python #clustering 

A Python Library for Accurate and Scalable Fuzzy Matching

How to Deploy a Spring Boot App on AWS ECS Cluster Step by Step 2021

Amazon Elastic Container Service is a managed container orchestration service which allows you to deploy and scale containerized applications. An overview of the features and pricing can be found at the AWS website.

ECS consists of a few components:

  • Elastic Container Registry (ECR): A Docker repository to store your Docker images (similar to DockerHub, but provisioned by AWS).
  • Task Definition: A versioned template of a task which you would like to run. Here you will specify the Docker image to be used, memory, CPU, etc. for your container.
  • ECS Cluster: The Cluster definition itself where you will specify how many instances you would like to have and how it should scale.
  • Service: Based on a Task Definition, you will deploy the container by means of a Service into your Cluster.

You will create a Docker image for a basic Spring Boot application, upload it to ECR, create a Task Definition for the image, create a Cluster, and deploy the container by means of a Service to the Cluster.

#aws #clustering #spring-boot 

How to Deploy a Spring Boot App on AWS ECS Cluster Step by Step 2021
Edureka Fan

Edureka Fan

1627911476

Clustering Algorithms Used in Data Science

This Edureka video on "Clustering Algorithms" will help you understand the various aspects of clustering using K Means in Python.

#clustering #algorithms #datascience #python #kmeans 

 

Clustering Algorithms Used in Data Science
Nabunya  Jane

Nabunya Jane

1624333080

K-Means Clustering Algorithm

Clustering

K-means is one of the simplest unsupervised machine learning algorithms that solves the well-known data clustering problem. Clustering is one of the most common data analysis tasks, used to get an intuition about the structure of the data. It is defined as finding subgroups in the data such that data points in the same subgroup are very similar, while data points in different clusters are very different. In other words, we are trying to find homogeneous subgroups within the data, where the data points in each group are similar according to a similarity metric such as Euclidean distance or correlation-based distance.

The algorithm can perform clustering analysis based on features or on samples: we can look for subgroups of samples based on their attributes, or for subgroups of features based on the samples. The practical applications of such a procedure are many: clustering powers recommendation systems such as those of Amazon and Netflix; given a medical image of a group of cells, a clustering algorithm can help identify the centers of the cells; looking at the GPS data of a user's mobile device, their most frequently visited locations within a certain radius can be revealed; and for any set of unlabeled observations, clustering helps establish whether the data has some structure that might indicate it is separable.

What is K-Means Clustering?

K-means is a clustering algorithm whose primary goal is to group similar elements or data points into clusters.

K in k-means represents the number of clusters.

A cluster refers to a collection of data points aggregated together because of certain similarities.

K-means clustering is an iterative algorithm that starts with k random points used as mean values to define clusters. Data points belong to the group represented by the mean value to which they are closest. These mean-value coordinates are called centroids.

Iteratively, the mean value of each cluster's data points is recomputed, and the new mean values are used to restart the process, until the means stop changing. The disadvantage of k-means is that it is a local search procedure and can miss global patterns.

The k initial centroids can be selected at random. Another approach is to compute the mean of the entire dataset and add k random offsets to it to make k initial points. Yet another method is to determine the principal component of the data and divide it into k equal partitions; the mean of each partition can then be used as an initial centroid.
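
To make the iteration concrete, here is a minimal from-scratch sketch of the loop just described, assuming NumPy, Euclidean distance, and randomly chosen data points as the initial centroids (it ignores edge cases such as empty clusters):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: use k randomly chosen data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to the cluster whose centroid is closest.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # the means stopped changing
        centroids = new_centroids
    return labels, centroids

# Toy usage: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)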

#data-science #algorithms #clustering #k-means #machine-learning

K-Means Clustering Algorithm
Lina  Biyinzika

Lina Biyinzika

1623087480

Key Data Science Algorithms Explained: From k-means to k-medoids clustering

The k-means clustering algorithm is a foundational algorithm that every data scientist should know. It is popular because it is simple, fast, and efficient. It works by dividing all the points into a preselected number (k) of clusters based on the distance between the point and the center of each cluster. The original k-means algorithm is limited because it works only in the Euclidean space and results in suboptimal cluster assignments when the real clusters are unequal in size. Despite its shortcomings, k-means remains one of the most powerful tools for clustering and has been used in healthcare, natural language processing, and physical sciences.

Extensions of the k-means algorithm include smarter starting positions for its k centers, allowing variable cluster sizes, and supporting distances other than the Euclidean distance. In this article, we will focus on methods like PAM, CLARA, and CLARANS, which incorporate distance measures beyond the Euclidean distance. These methods have yet to enjoy the fame of k-means because they are slower than k-means for large datasets without a comparable gain in optimality. However, as we will see in this article, researchers have developed newer versions of these algorithms that promise better accuracy and speed than k-means.

What are the shortcomings of k-means clustering?

For anyone who needs a quick reminder, StatQuest has a great video on k-means clustering.

For this article, we will focus on where k-means fails. Vanilla k-means, as explained in the video, has several disadvantages:

  1. It is difficult to predict the correct number of centroids (k) to partition the data (a common heuristic for choosing k is sketched after this list).
  2. The algorithm always divides the space into k clusters, even when the partitions don’t make sense.
  3. The initial positions of the k centroids can affect the results significantly.
  4. It does not work well when the expected clusters differ in size and density.
  5. Since it is a centroid-based approach, outliers in the data can drag the centroids to inaccurate centers.
  6. Since it is a hard clustering method, clusters cannot overlap.
  7. It is sensitive to the scale of the dimensions, and rescaling the data can change the results significantly.
  8. It uses the Euclidean distance to divide points. The Euclidean distance becomes ineffective in high dimensional spaces since all points tend to become uniformly distant from each other. Read a great explanation here.
  9. The centroid is an imaginary point in the dataset and may be meaningless.
  10. Categorical variables cannot be defined by a mean and should be described by their mode.

The figure above shows an example of clustering the mouse data set, where k-means performs poorly due to the varying cluster sizes.
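
Shortcoming 1 in particular is usually handled empirically. Here is a minimal sketch of one common heuristic, fitting k-means for several values of k and keeping the k with the best silhouette score using scikit-learn (the three-blob toy data is made up purely for illustration):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up data: three well-separated blobs.
X = np.vstack([np.random.randn(50, 2) + offset for offset in (0, 5, 10)])

# Fit k-means for a range of k and score each clustering.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)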

Introducing Partitioning Around Medoids (PAM) algorithm

Instead of using the mean of the cluster to partition the data, we can use the medoid, the most centrally located data point in the cluster. The medoid is the point least dissimilar to all other points in the cluster, and it is less sensitive to outliers than the mean. These partitions can also use arbitrary distances instead of relying on the Euclidean distance. This is the crux of the clustering algorithm named Partitioning Around Medoids (PAM) and of its extensions CLARA and CLARANS. Watch this video for a succinct explanation of the method.

In short, the following are the steps involved in the PAM method (reference):

  1. BUILD: greedily select k initial medoids from the data points.
  2. Assign every point to its least dissimilar (closest) medoid.
  3. SWAP: for each pair of a medoid and a non-medoid point, compute how the total dissimilarity would change if the two were swapped.
  4. Perform the swap that most reduces the total dissimilarity, then repeat steps 2 and 3 until no swap yields an improvement.
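
To illustrate the idea, here is a hedged sketch of k-medoids clustering. Note that it implements the simpler alternating ("Voronoi iteration") variant rather than PAM's full BUILD/SWAP search, so it can settle in poorer local minima than PAM, but it works from a precomputed dissimilarity matrix, so any distance can be plugged in:

import numpy as np

def k_medoids(D, k, n_iters=100, seed=0):
    """Alternating k-medoids on a precomputed n x n dissimilarity matrix D."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(n_iters):
        # Assign each point to its least dissimilar medoid.
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue  # keep the old medoid if the cluster emptied out
            # The new medoid is the member minimizing total dissimilarity
            # to the rest of its cluster.
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[costs.argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return medoids, D[:, medoids].argmin(axis=1)

# Example with Manhattan distance, which plain k-means cannot use directly.
X = np.random.default_rng(0).normal(size=(100, 2))
D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
medoids, labels = k_medoids(D, k=3)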

Improving PAM with sampling

The time complexity of the PAM algorithm is on the order of O(k(n - k)^2), which makes it much slower than the k-means algorithm. Kaufman and Rousseeuw (1990) proposed an improvement that traded optimality for speed, named CLARA (Clustering LARge Applications). In CLARA, the main dataset is split into several smaller, randomly sampled subsets of the data. The PAM algorithm is applied to each subset to obtain medoids for each sample, and the set of medoids that gives the best performance on the main dataset is kept. Dudoit and Fridlyand (2003) improved the CLARA workflow by combining the medoids from different samples by voting or bagging, which aims to reduce the variability that would come from applying CLARA.
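
Here is a hedged sketch of the CLARA idea: run a PAM-style solver on several random subsamples and keep the medoids that score best on the full dataset. It assumes the separately installed scikit-learn-extra package for its KMedoids estimator (pip install scikit-learn-extra), and the sample sizes are illustrative:

import numpy as np
from sklearn_extra.cluster import KMedoids  # assumes scikit-learn-extra is installed

def clara(X, k, n_samples=5, sample_size=200, seed=0):
    rng = np.random.default_rng(seed)
    best_cost, best_medoids = np.inf, None
    for _ in range(n_samples):
        # Run PAM-style k-medoids on a small random subsample...
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        km = KMedoids(n_clusters=k, method="pam", random_state=0).fit(X[idx])
        medoids = km.cluster_centers_
        # ...then score those medoids against the FULL dataset.
        dists = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
        cost = dists.min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    return best_medoids, best_cost

medoids, cost = clara(np.random.default_rng(1).normal(size=(1000, 2)), k=4)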

Another variation named CLARANS (Clustering Large Applications based upon RANdomized Search) (Ng and Han 2002) works by combining sampling and searching on a graph. In this graph, each node represents a set of k medoids. Each node is connected to another node if the set of k medoids in each node differs by one. The graph can be traversed until a local minimum is reached, and that minimum provides the best estimate for the medoids of the dataset.

Making PAM faster

Schubert and Rousseeuw (2019) proposed a faster version of PAM, which can be extended to CLARA, by changing how the algorithm caches the distance values. They summarize it well here:

“This caching was enabled by changing the nesting order of the loops in the algorithm, showing once more how much seemingly minor-looking implementation details can matter (Kriegel et al., 2017). As a second improvement, we propose to find the best swap for each medoid and execute as many as possible in each iteration, which reduces the number of iterations needed for convergence without loss of quality, as demonstrated in the experiments, and as supported by theoretical considerations. In this article, we proposed a modification of the popular PAM algorithm that typically yields an O(k) fold speedup, by clever caching of partial results in order to avoid recomputation.”

In another variation, Yue et al. (2016) proposed a MapReduce framework for speeding up the calculations of the k-medoids algorithm and named it the K-Medoids++ algorithm.

More recently, Tiwari et al. (2020) cast the problem of choosing k medoids into a multi-arm bandit problem and solved it using the Upper Confidence Bound algorithm. This variation was faster than PAM and matched its accuracy.

#2020 dec tutorials #overviews #algorithms #clustering #explained

Key Data Science Algorithms Explained: From k-means to k-medoids clustering

Clustering in Machine Learning: 3 Types of Clustering Explained

Introduction

Machine Learning is one of the hottest technologies in 2020. As data grows day by day, the need for Machine Learning grows with it. Machine Learning is a very vast topic, with different algorithms and use cases in each domain and industry. One of them is Unsupervised Learning, in which we see the use of Clustering.

Unsupervised learning is a technique in which the machine learns from unlabeled data. Since we do not know the labels, there is no "right answer" for the machine to learn from; instead, the machine itself finds patterns in the given data to come up with answers to the business problem.

Clustering is an Unsupervised Learning technique in Machine Learning that involves grouping unlabeled data. Given a cleaned data set, a clustering algorithm groups the data points into clusters. The algorithm assumes that data points in the same cluster should have similar properties, while data points in different clusters should have highly dissimilar properties.

In this article, we are going to learn the need for clustering and the different types of clustering, along with their pros and cons.

What is the need for Clustering?

Clustering is a widely used ML algorithm that allows us to find hidden relationships between the data points in our dataset.

Examples:

1)    Customers can be segmented according to their similarity to previous customers, and the segments can be used for recommendations.

2)    Based on a collection of text data, we can organize the documents according to content similarity in order to create a topic hierarchy.

3)    Image processing, mainly in biology research, for identifying underlying patterns.

4)    Spam filtering.

5)    Identifying fraudulent and criminal activities.

6)    It can also be used for fantasy football and sports analytics.

Types of Clustering

There are many types of Clustering Algorithms in Machine learning. We are going to discuss the below three algorithms in this article:

1)    K-Means Clustering.

2)    Mean-Shift Clustering.

3)    DBSCAN.

1. K-Means Clustering

K-Means is the most popular of the clustering algorithms in Machine Learning. We can see this algorithm used in many industries and even in a lot of introductory courses. It is one of the easiest models to start with, in both implementation and understanding.

Step-1 We first select the number of clusters k and randomly initialize the k center points.

Step-2 Each data point is then classified by calculating the distance (Euclidean or Manhattan) between that point and each group center, and assigning the data point to the cluster whose center is closest to it.

Step-3 We recompute the group center by taking the mean of all the vectors in the group.

Step-4 We repeat these steps for a set number of iterations or until the group centers don't change much (a usage sketch with scikit-learn follows below).
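
The four steps above map directly onto scikit-learn's KMeans estimator; here is a minimal usage sketch (the two-blob toy data is illustrative only):

import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])

# n_clusters is k (Step 1); fit() runs the assign/recompute loop
# (Steps 2-3) until convergence or max_iter iterations (Step 4).
kmeans = KMeans(n_clusters=2, n_init=10, max_iter=300, random_state=0).fit(X)
print(kmeans.labels_[:10])
print(kmeans.cluster_centers_)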

#artificial intelligence #clustering

Clustering in Machine Learning: 3 Types of Clustering Explained