Royce  Reinger



Five Video Classification Methods


The five video classification methods:

  1. Classify one frame at a time with a ConvNet
  2. Extract features from each frame with a ConvNet, passing the sequence to an RNN, in a separate network
  3. Use a time-distributed ConvNet, passing the features to an RNN, much like #2 but all in one network (this is the lrcn network in the code).
  4. Extract features from each frame with a ConvNet and pass the sequence to an MLP
  5. Use a 3D convolutional network (has two versions of 3d conv to choose from)
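As a rough illustration of method 1, the sketch below turns per-frame class probabilities into a single video-level label by averaging them. The averaging step and the names `average_frame_predictions`, `predict_video_class` and `frame_probs` are assumptions made for illustration; the repository's own code may combine frame predictions differently, and in practice the probabilities would come from the ConvNet.

```python
from typing import List, Sequence


def average_frame_predictions(frame_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame class probabilities into one distribution per video."""
    if not frame_probs:
        raise ValueError("need at least one frame")
    n_classes = len(frame_probs[0])
    totals = [0.0] * n_classes
    for probs in frame_probs:
        if len(probs) != n_classes:
            raise ValueError("every frame must score the same classes")
        for i, p in enumerate(probs):
            totals[i] += p
    return [t / len(frame_probs) for t in totals]


def predict_video_class(frame_probs: Sequence[Sequence[float]]) -> int:
    """Return the index of the class with the highest averaged probability."""
    averaged = average_frame_predictions(frame_probs)
    return max(range(len(averaged)), key=averaged.__getitem__)


# Three frames scored over two classes; the values are made up for the example.
frames = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]]
video_class = predict_video_class(frames)
```

Here the second frame on its own would vote for class 1, but the averaged distribution favours class 0.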

See the accompanying blog post for full details:


This code requires you have Keras 2 and TensorFlow 1 or greater installed. Please see the requirements.txt file. To ensure you're up to date, run:

pip install -r requirements.txt

You must also have ffmpeg installed in order to extract the video files. If ffmpeg isn't in your system path (i.e. which ffmpeg doesn't return its path, or you're on an OS other than *nix), you'll need to update the path to ffmpeg in data/
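If you prefer to check for ffmpeg from Python rather than from the shell, a minimal sketch using the standard library's `shutil.which` could look like this. The helper name `find_ffmpeg` and its `fallback` parameter are hypothetical and not part of the repository.

```python
import shutil
from typing import Optional


def find_ffmpeg(fallback: Optional[str] = None) -> Optional[str]:
    """Return the ffmpeg executable found on PATH, else the fallback path."""
    found = shutil.which("ffmpeg")
    return found if found is not None else fallback


ffmpeg_path = find_ffmpeg()
if ffmpeg_path is None:
    print("ffmpeg not found on PATH; set its location manually")
else:
    print(f"using ffmpeg at {ffmpeg_path}")
```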

Getting the data

First, download the dataset from UCF into the data folder:

cd data && wget

Then extract it with unrar e UCF101.rar.

Next, create folders (still in the data folder) with mkdir train && mkdir test && mkdir sequences && mkdir checkpoints.
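The same four folders can also be created from Python, which helps on systems where the chained mkdir command is not available. This is only a sketch: the function name `make_data_folders` is made up for the example, and the folder names are taken from the command above.

```python
from pathlib import Path
from typing import List

FOLDERS = ("train", "test", "sequences", "checkpoints")


def make_data_folders(data_dir: str) -> List[Path]:
    """Create the four working folders inside data_dir, skipping existing ones."""
    created = []
    for name in FOLDERS:
        folder = Path(data_dir) / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `make_data_folders("data")` from the project root is safe to repeat, because `exist_ok=True` leaves existing folders untouched.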

Now you can run the scripts in the data folder to move the videos to the appropriate place, extract their frames and make the CSV file the rest of the code references. You need to run these in order. Example:



Extracting features

Before you can run the lstm and mlp, you need to extract features from the images with the CNN. This is done by running On my Dell with a GeForce 960m GPU, this takes about 8 hours. If you want to limit to just the first N classes, you can set that option in the file.

Training models

The CNN-only method (method #1 in the blog post) is run from

The rest of the models are run from There are configuration options you can set in that file to choose which model you want to run.

The models are all defined in Reference that file to see which models you are able to run in

Training logs are saved to CSV and also to TensorBoard files. To see progress while training, run tensorboard --logdir=data/logs from the project root folder.

Demo/Using models

I have not yet implemented a demo where you can pass a video file to a model and get a prediction. Pull requests are welcome if you'd like to help out!


  •  Add data augmentation to fight overfitting
  •  Support multiple workers in the data generator for faster training
  •  Add a demo script
  •  Support other datasets
  •  Implement optical flow
  •  Implement more complex network architectures, like optical flow/CNN fusion

UCF101 Citation

Khurram Soomro, Amir Roshan Zamir and Mubarak Shah, UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild., CRCV-TR-12-01, November, 2012.

Download Details:

Author: Harvitronix
Source Code: 
License: MIT license

#machinelearning #deeplearning #tensorflow #keras #classification 

Monty  Boehm



Ros2-tensorflow: ROS2 Nodes for Computer Vision Tasks in Tensorflow


Use Tensorflow to load pretrained neural networks and perform inference through ROS2 interfaces.


The output can be directly visualized through Rviz


In order to build the ros2-tensorflow package, the following dependencies are needed

Required dependencies:

Rosdep dependencies:

Optional dependencies:

The provided Dockerfile contains an Ubuntu 18.04 environment with all the dependencies and this package already installed.

To use the Dockerfile:

$ git clone
$ cd ros2-tensorflow/docker
$ bash
$ bash


This section describes how to build the ros2-tensorflow package and the required dependencies in case you are not using the provided Dockerfile.

Get the source code and create the ROS 2 workspace

$ git clone $HOME/ros2-tensorflow
$ mkdir -p $HOME/tf_ws/src
$ cd $HOME/tf_ws
$ ln -s $HOME/ros2-tensorflow/ros2-tensorflow src

Install required dependencies using rosdep

$ rosdep install --from-paths src --ignore-src --rosdistro foxy -y

Install the Tensorflow Object Detection Models (optional). Make sure to specify the correct Python version according to your system.

$ sudo apt-get install -y protobuf-compiler python-lxml python-tk
$ pip install --user Cython contextlib2 jupyter matplotlib Pillow
$ git clone /usr/local/lib/python3.8/dist-packages/tensorflow/models
$ cd /usr/local/lib/python3.8/dist-packages/tensorflow/models/research
$ protoc object_detection/protos/*.proto --python_out=.
$ echo 'export PYTHONPATH=$PYTHONPATH:/usr/local/lib/python3.8/dist-packages/tensorflow/models/research' >> $HOME/.bashrc

Install Tensorflow Slim (optional)

$ pip install tf_slim

Build and install the ros2-tensorflow package

$ colcon build
$ source install/


The basic usage consists of creating a ROS 2 node that loads a Tensorflow model and another ROS 2 node that acts as a client and receives the result of the inference.

It is possible to specify which model a node should load. Note that if the model is specified via url, as it is by default, the first time the node is executed a network connection will be required in order to download the model.

Object Detection Task

Test the object detection server by running in separate terminals

$ ros2 run tf_detection_py server
$ ros2 run tf_detection_py client_test

Setup a real object detection pipeline using a stream of images coming from a ROS 2 camera node

$ rviz2
$ ros2 run tf_detection_py server
$ ros2 run image_tools cam2image --ros-args -p frequency:=2.0

Image Classification Task

Test the image classification server by running in separate terminals

$ ros2 run tf_classification_py server
$ ros2 run tf_classification_py client_test

Loading different models

The repository contains convenient APIs for loading Tensorflow models into the ROS 2 nodes.

Models are defined using the ModelDescriptor class, which contains all the information required for loading a model and performing inference on it. It can contain either a path where the model can be found on the machine or a URL from which the model is downloaded on first use.

Different model formats are also supported, such as frozen models and saved models.
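To make the path-or-URL idea concrete, here is a small stand-alone sketch. It is not the repository's ModelDescriptor: the class `ModelSource`, the function `resolve_model`, the `cache_dir` argument and the injected `download` callable are all hypothetical names chosen for illustration. It only shows how a model given by URL needs the network on first use and is served from a local cache afterwards.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class ModelSource:
    """Hypothetical stand-in for a descriptor holding a local path or a URL."""
    name: str
    path: Optional[str] = None
    url: Optional[str] = None
    model_format: str = "saved_model"  # or "frozen_model"

    def __post_init__(self) -> None:
        if self.path is None and self.url is None:
            raise ValueError("provide a local path or a download URL")


def resolve_model(source: ModelSource, cache_dir: str,
                  download: Callable[[str, Path], None]) -> Path:
    """Return a local model path, downloading only if it is not cached yet."""
    if source.path is not None:
        return Path(source.path)
    target = Path(cache_dir) / source.name
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        download(source.url, target)  # network access needed on first use only
    return target
```

Passing the downloader in as a callable keeps the sketch free of network code and makes the caching behaviour easy to check in isolation.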

Some known supported models are already present as examples. See classification models and detection models

The Tensorflow models repository contains many pretrained models that can be used. For example, you can get additional Tensorflow models for object detection from the detection model zoo.

Download Details:

Author: Alsora
Source Code: 
License: Apache-2.0 license

#tensorflow #computervision #image #classification 

Royce  Reinger



Lightning-flash: Your PyTorch AI Factory



Flash makes complex AI recipes for over 15 tasks across 7 data domains accessible to all. In a nutshell, Flash is the production grade research framework you always dreamed of but didn't have time to build.

Getting Started

From PyPI:

pip install lightning-flash

See our installation guide for more options.

Flash in 3 Steps

Step 1. Load your data

All data loading in Flash is performed via a from_* classmethod on a DataModule. Which DataModule to use, and which from_* methods are available, depends on the task you want to perform. For example, for image segmentation where your data is stored in folders, you would use the from_folders method of the SemanticSegmentationData class:

from flash.image import SemanticSegmentationData

dm = SemanticSegmentationData.from_folders(
    image_size=(256, 256),
)

Step 2: Configure your model

Our tasks come loaded with pre-trained backbones and (where applicable) heads. You can view the available backbones to use with your task using available_backbones. Once you've chosen one, create the model:

from flash.image import SemanticSegmentation

# ['deeplabv3', 'deeplabv3plus', 'fpn', ..., 'unetplusplus']

# ['densenet121', ..., 'xception'] # + 113 models

# ['imagenet', 'advprop']

model = SemanticSegmentation(
  head="fpn", backbone='efficientnet-b0', pretrained="advprop", num_classes=dm.num_classes)

Step 3: Finetune!

from flash import Trainer

trainer = Trainer(max_epochs=3)
trainer.finetune(model, datamodule=dm, strategy="freeze")

PyTorch Recipes

Make predictions with Flash!

Serve in just 2 lines:

from flash.image import SemanticSegmentation

model = SemanticSegmentation.load_from_checkpoint("")

or make predictions from raw data directly.

from flash import Trainer
from flash.image import SemanticSegmentationData

trainer = Trainer(strategy='ddp', accelerator="gpu", gpus=2)
dm = SemanticSegmentationData.from_folders(predict_folder="data/CameraRGB")
predictions = trainer.predict(model, dm)

Flash Training Strategies

Training strategies are PyTorch SOTA Training Recipes which can be utilized with a given task.

Check out this example where the ImageClassifier supports 4 Meta Learning Algorithms from Learn2Learn. This is particularly useful if you use this model in production and want to make sure the model adapts quickly to its new environment with minimal labelled data.

from flash.image import ImageClassifier

model = ImageClassifier(
    optimizer_kwargs={"lr": 0.001},
    training_strategy_kwargs={
        "epoch_length": 10 * 16,
        "meta_batch_size": 4,
        "num_tasks": 200,
        "test_num_tasks": 2000,
        "ways": datamodule.num_classes,
        "shots": 1,
        "test_ways": 5,
        "test_shots": 1,
        "test_queries": 15,
    },
)

In detail, the following methods are currently implemented:

Flash Optimizers / Schedulers

With Flash, swapping among 40+ optimizers and 15+ scheduler recipes is simple. Find the lists of available optimizers and schedulers as follows:

from flash.image import ImageClassifier

# ['A2GradExp', ..., 'Yogi']

# ['CosineAnnealingLR', 'CosineAnnealingWarmRestarts', ..., 'polynomial_decay_schedule_with_warmup']

Once you've chosen, create the model:

import functools

import torch
from torch.optim.lr_scheduler import CyclicLR

from flash.image import ImageClassifier

#### The optimizer of choice can be passed as a

# - String value
model = ImageClassifier(backbone="resnet18", num_classes=2, optimizer="Adam", lr_scheduler=None)

# - Callable
model = ImageClassifier(backbone="resnet18", num_classes=2, optimizer=functools.partial(torch.optim.Adadelta, eps=0.5), lr_scheduler=None)

# - Tuple[string, dict]: (The dict takes in the optimizer kwargs)
model = ImageClassifier(backbone="resnet18", num_classes=2, optimizer=("Adadelta", {"eps": 0.5}), lr_scheduler=None)

#### The scheduler of choice can be passed as a
# - String value
model = ImageClassifier(backbone="resnet18", num_classes=2, optimizer="Adam", lr_scheduler="constant_schedule")

# - Callable
model = ImageClassifier(backbone="resnet18", num_classes=2, optimizer="Adam", lr_scheduler=functools.partial(CyclicLR, step_size_up=1500, mode='exp_range', gamma=0.5))

# - Tuple[string, dict]: (The dict takes in the scheduler kwargs)
model = ImageClassifier(backbone="resnet18", num_classes=2, optimizer="Adam", lr_scheduler=("StepLR", {"step_size": 10}))

You can also register your own custom scheduler recipes beforehand and use them as shown above:

import torch
from flash.image import ImageClassifier

def my_steplr_recipe(optimizer):
    return torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

model = ImageClassifier(backbone="resnet18", num_classes=2, optimizer="Adam", lr_scheduler="my_steplr_recipe")

Flash Transforms

Flash includes some simple augmentations for each task by default. However, you will often want to override these and control your own augmentation recipe. To this end, Flash supports custom transformations with the InputTransform. The InputTransform is like a callback for transforms, with hooks that can be used to apply transforms to samples or batches, on and off the device / accelerator. In addition, hooks can be specialized to apply transforms only to the input or target. With these hooks, complex transforms like MixUp can be implemented with ease. Here's an example (with an albumentations transform thrown in too!):

import torch
import numpy as np
import albumentations
from flash import InputTransform
from flash.image import ImageClassificationData
from flash.image.classification.input_transform import AlbumentationsAdapter

def mixup(batch, alpha=1.0):
    images = batch["input"]
    targets = batch["target"].float().unsqueeze(1)

    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(images.size(0))

    batch["input"] = images * lam + images[perm] * (1 - lam)
    batch["target"] = targets * lam + targets[perm] * (1 - lam)
    return batch

class MixUpInputTransform(InputTransform):

    def train_input_per_sample_transform(self):
        return AlbumentationsAdapter(albumentations.HorizontalFlip(p=0.5))

    # This will be applied after transferring the batch to the device!
    def train_per_batch_transform_on_device(self):
        return mixup

datamodule = ImageClassificationData.from_folders(

Flash Zero - PyTorch Recipes from the Command Line!


Flash Zero is a zero-code machine learning platform built directly into lightning-flash using the Lightning CLI.

To get started and view the available tasks, run:

  flash --help

For example, to train an image classifier for 10 epochs with a resnet50 backbone on 2 GPUs using your own data, you can do:

  flash image_classification --trainer.max_epochs 10 --trainer.gpus 2 --model.backbone resnet50 from_folders --train_folder {PATH_TO_DATA}

Kaggle Notebook Examples


The lightning + Flash team is hard at work building more tasks for common deep-learning use cases. But we're looking for incredible contributors like you to submit new tasks!

Join our Slack and/or read our CONTRIBUTING guidelines to get help becoming a contributor!

Note: Flash is currently being tested on real-world use cases and is in active development. Please open an issue if you find anything that isn't working as expected.


Flash is maintained by our core contributors.

For help or questions, join our huge community on Slack!


We’re excited to continue the strong legacy of opensource software and have been inspired over the years by Caffe, Theano, Keras, PyTorch, torchbearer, and When/if additional papers are written about this, we’ll be happy to cite these frameworks and the corresponding authors.

Flash leverages models from many different frameworks in order to cover such a wide range of domains and tasks. The full list of providers can be found in our documentation.

Download Details:

Author: Lightning-Universe
Source Code: 
License: Apache-2.0 license

#machinelearning #deeplearning #pytorch #classification 


How to Solve Key Issues Associated with Classification Accuracy

In this blog post, we cover the key problems associated with classification accuracy, such as imbalanced classes, overfitting, and data bias, along with proven ways to address them.

Imbalanced Classes

Accuracy can be deceptive if the dataset's classes are unevenly represented. For instance, a model that simply predicts the majority class will be 99% accurate if that class makes up 99% of the data, yet it will fail to classify the minority class correctly. Other metrics, including precision, recall, and F1-score, should be used to address this issue.

The 5 most common techniques for addressing class imbalance are:

Imbalanced class | Knowledge Engineering

  1. Upsampling the minority class: duplicate examples from the minority class to balance the class distribution.
  2. Downsampling the majority class: remove examples from the majority class to balance the class distribution.
  3. Synthetic data generation: create new minority-class samples, either by adding random noise to existing examples or by generating new examples through interpolation or extrapolation.
  4. Anomaly detection: treat the minority class as an anomaly and the majority class as the normal data.
  5. Changing the decision threshold: adjust the classifier's decision threshold to increase its sensitivity to the minority class.


Overfitting

A model is said to be overfit when it is overtrained on the training data and underperforms on the test data. As a result, accuracy may be high on the training set but poor on the test set. Techniques like cross-validation and regularization should be applied to solve this issue.


Overfitting | Freepik

There are several techniques that can be used to address overfitting.

  1. Train the model with more data: this lets the algorithm detect the signal better and minimize errors.
  2. Regularization: add a penalty term to the cost function during training, which constrains the model's complexity and reduces overfitting.
  3. Cross-validation: evaluate the model's performance by splitting the data into training and validation sets, then training and evaluating the model on each split.
  4. Ensemble methods: train multiple models and combine their predictions, which helps reduce the model's variance and bias.

Data Bias

A model will produce biased predictions if its training dataset is biased. This may yield high accuracy on the training data but subpar performance on unseen data. Techniques like data augmentation and resampling should be used to address this issue. Some other ways to address the problem are listed below:

Data Bias | Explorium

  1. Ensure that the data is representative of the population it is intended to model. This can be done by randomly sampling data from the population, or by using techniques such as oversampling or undersampling to balance the data.
  2. Test and evaluate models carefully by measuring accuracy for different demographic categories and sensitive groups. This can help identify and address biases in the data and the model.
  3. Be aware of observer bias, which occurs when you impose your opinions or desires on the data, whether consciously or accidentally. Guard against it by recognizing the potential for bias and taking steps to minimize it.
  4. Use preprocessing techniques, such as data cleaning, normalization, and scaling, to remove or correct data bias.

Confusion Matrix

Image by Author

A confusion matrix describes the performance of a classification algorithm. It is a table in which actual values are compared against predicted values. Some ways to act on what it shows are:

  1. Analyze the values in the matrix and identify any patterns or trends in the errors. For example, many false negatives may indicate that the model is not sensitive enough to certain classes.
  2. Use metrics like precision, recall, and F1-score to evaluate the model's performance. These metrics give a more detailed picture of how the model is performing and can help pinpoint specific areas where it is struggling.
  3. Adjust the model's decision threshold. A threshold that is too low produces more false positives, and one that is too high produces more false negatives.
  4. Use ensemble methods, such as bagging and boosting, which can improve the model's performance by combining the predictions of multiple models.

Contribution of Classification Accuracy to Machine Learning

In conclusion, classification accuracy is a helpful metric for assessing a machine learning model's performance, but it can be deceptive. For a more thorough view of the model's performance, additional metrics including precision, recall, F1-score, and the confusion matrix should also be used. To overcome issues like imbalanced classes, overfitting, and data bias, techniques including cross-validation, normalization, data augmentation, and resampling should be applied.

Original article source at:

#machinelearning #classification #accuracy #key #issues 

佐藤  桃子







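Accuracy can be deceptive if the dataset's classes are unevenly represented. For instance, if the dominant class makes up 99% of the data, a model that only predicts the majority class will reach 99% accuracy, yet it will fail to classify the minority class correctly. Other metrics, including precision, recall, and F1-score, should be used to address this issue.

The 5 most common techniques for addressing class imbalance are: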


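Imbalanced class | Knowledge Engineering

  1. Upsampling the minority class: duplicate examples from the minority class to balance the class distribution.
  2. Downsampling the majority class: remove examples from the majority class to balance the class distribution.
  3. Synthetic data generation: create new minority-class samples, either by adding random noise to existing examples or by generating new examples through interpolation or extrapolation.
  4. Anomaly detection: treat the minority class as an anomaly and the majority class as the normal data.
  5. Changing the decision threshold: adjust the classifier's decision threshold to increase its sensitivity to the minority class.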





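Overfitting | Freepik


  1. Train the model with more data: this lets the algorithm detect the signal better and minimize errors.
  2. Regularization: add a penalty term to the cost function during training, which constrains the model's complexity and reduces overfitting.
  3. Cross-validation: evaluate the model's performance by splitting the data into training and validation sets, then training and evaluating the model on each split.
  4. Ensemble methods: train multiple models and combine their predictions, which helps reduce the model's variance and bias.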




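Data Bias | Explorium

  1. Ensure that the data is representative of the population it is intended to model. This can be done by randomly sampling data from the population, or by using techniques such as oversampling or undersampling to balance the data.
  2. Test and evaluate models carefully by measuring accuracy for different demographic categories and sensitive groups. This can help identify and address biases in the data and the model.
  3. Be aware of observer bias, which occurs when you impose your opinions or desires on the data, whether consciously or accidentally. Guard against it by recognizing the potential for bias and taking steps to minimize it.
  4. Use preprocessing techniques, such as data cleaning, normalization, and scaling, to remove or correct data bias.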







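  1. Analyze the values in the matrix and identify any patterns or trends in the errors. For example, many false negatives may indicate that the model is not sensitive enough to certain classes.
  2. Use metrics like precision, recall, and F1-score to evaluate the model's performance. These metrics give a more detailed picture of how the model is performing and can help pinpoint specific areas where it is struggling.
  3. Adjust the model's decision threshold. A threshold that is too low produces more false positives, and one that is too high produces more false negatives.
  4. Use ensemble methods, such as bagging and boosting, which can improve the model's performance by combining the predictions of multiple models.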


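In conclusion, classification accuracy is a helpful metric for assessing a machine learning model's performance, but it can be deceptive. For a more thorough view of the model's performance, additional metrics including precision, recall, F1-score, and the confusion matrix should also be used. To overcome issues like imbalanced classes, overfitting, and data bias, techniques including cross-validation, normalization, data augmentation, and resampling should be applied.

Original article source at: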

#machinelearning #classification #accuracy #key #issues 

Desmond  Gerber

Desmond Gerber


How to Solve Key Issues Associated with Classification Accuracy

In this blog post, we cover the key problems associated with classification accuracy, such as imbalanced classes, overfitting, and data bias, along with proven ways to address them.

Imbalanced Classes

Accuracy can be deceptive if the dataset's classes are unevenly represented. For instance, a model that simply predicts the majority class will be 99% accurate if that class makes up 99% of the data, yet it will fail to classify the minority class correctly. Other metrics, including precision, recall, and F1-score, should be used to address this issue.
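The 99% scenario can be reproduced in a few lines of plain Python. The sketch below is illustrative only: the helper `binary_metrics` is a made-up name, label 1 stands for the minority class, and undefined ratios (division by zero) are reported as 0.0, which is one common convention rather than the only one.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# 99 majority examples (label 0) and 1 minority example (label 1).
y_true = [0] * 99 + [1]
# A model that always predicts the majority class.
y_pred = [0] * 100
metrics = binary_metrics(y_true, y_pred)
# Accuracy is 0.99, yet precision, recall and F1 for the minority class are 0.0.
```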

The 5 most common techniques for addressing class imbalance are:


Imbalanced class | Knowledge Engineering

  1. Upsampling the minority class: In this technique, we duplicate examples from the minority class to balance the class distribution.
  2. Downsampling the majority class: In this technique, we remove examples from the majority class to balance the class distribution.
  3. Synthetic data generation: This technique creates new samples of the minority class, either by adding random noise to existing examples or by generating new examples through interpolation or extrapolation.
  4. Anomaly detection: In this technique, the minority class is treated as an anomaly, whereas the majority class is treated as the normal data.
  5. Changing the decision threshold: This technique adjusts the decision threshold of the classifier to increase its sensitivity to the minority class.
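The first technique in the list, upsampling by duplication, can be sketched as follows. This is a minimal illustration and not a library routine: `upsample_minority` is a hypothetical helper, samples are assumed to be `(features, label)` pairs, and a fixed seed is used only to make the example repeatable.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def upsample_minority(samples: Sequence[Tuple[object, int]],
                      seed: int = 0) -> List[Tuple[object, int]]:
    """Duplicate randomly chosen examples until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append((features, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap to the largest class.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


data = [(i, 0) for i in range(8)] + [("a", 1), ("b", 1)]
balanced = upsample_minority(data)
counts = Counter(label for _, label in balanced)
```

Resampling is normally applied to the training split only, so that duplicated examples do not leak into the data used for evaluation.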


Overfitting

A model is said to be overfit when it is overtrained on the training data and underperforms on the test data. As a result, accuracy may be high on the training set but poor on the test set. Techniques like cross-validation and regularization should be applied to solve this issue.



Overfitting | Freepik

There are several techniques that can be used to address overfitting. 

  1. Train the model with more data: This allows the algorithm to detect the signal better and minimize errors. 
  2. Regularization: This involves adding a penalty term to the cost function during training, which helps to constrain the model's complexity and reduce overfitting. 
  3. Cross-validation: This technique evaluates the model's performance by splitting the data into several folds, training on all but one fold, validating on the held-out fold, and rotating until every fold has served as the validation set.
  4. Ensemble methods: This technique trains multiple models and then combines their predictions, which helps to reduce the variance and bias of the model.
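Cross-validation is most often run as k-fold cross-validation. The sketch below only generates the index splits; training and scoring a model on each split is left out. The function name `k_fold_indices` is an assumption for this example, and the folds are contiguous, so the data should be shuffled beforehand if it is ordered.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_samples=10, k=3))
```

Every sample lands in exactly one validation fold, so averaging the per-fold scores uses all of the data for evaluation without ever scoring a model on examples it was trained on.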

Data Bias 

A model will produce biased predictions if its training dataset is biased. This may yield high accuracy on the training data but subpar performance on unseen data. Techniques like data augmentation and resampling should be used to address this issue. Some other ways to address the problem are listed below:


Data Bias | Explorium

  1. One technique is to ensure that the data used is representative of the population it is intended to model. This can be done by randomly sampling data from the population, or by using techniques such as oversampling or undersampling to balance the data.
  2. Test and evaluate the models carefully by measuring accuracy levels for different demographic categories and sensitive groups. This can help identify and address any biases in the data and the model.
  3. Be aware of observer bias, which happens when you impose your opinions or desires on data, whether consciously or accidentally. Guard against it by recognizing the potential for bias and taking steps to minimize it.
  4. Use preprocessing techniques, such as data cleaning, data normalization, and data scaling, to remove or correct data bias.
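Measuring accuracy separately for each group, as suggested in the second point above, needs very little code. In this sketch the helper `accuracy_by_group` and the group labels "A" and "B" are invented for illustration; real group labels would come from your dataset.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def accuracy_by_group(y_true: Sequence[int], y_pred: Sequence[int],
                      groups: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Return the accuracy computed separately for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0]
groups = ["A", "A", "A", "B", "B", "B"]
per_group = accuracy_by_group(y_true, y_pred, groups)
# Group A is classified perfectly while group B is always wrong,
# even though overall accuracy is 0.5.
```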

Confusion Matrix 




Image by Author

A confusion matrix describes the performance of a classification algorithm. It is a table in which actual values are compared against predicted values. Some ways to act on what it shows are:

  1. Analyze the values in the matrix and identify any patterns or trends in the errors. For example, if there are many false negatives, it might indicate that the model is not sensitive enough to certain classes.
  2. Use metrics like precision, recall, and F1-score to evaluate the model's performance. These metrics provide a more detailed understanding of how the model is performing and can help to identify any specific areas where the model is struggling.
  3. Adjust the threshold of the model. A threshold that is too low produces more false positives, and one that is too high produces more false negatives.
  4. Use ensemble methods, such as bagging and boosting, which can help improve the model's performance by combining the predictions of multiple models. 
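The link between the decision threshold and the cells of a binary confusion matrix can be seen in a short sketch. The function `confusion_counts` and the scores below are made up for the example; a score at or above the threshold is treated as a positive prediction.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], scores: Sequence[float],
                     threshold: float = 0.5) -> Dict[str, int]:
    """Count TP, FP, FN and TN after thresholding predicted scores."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if truth == 1 and predicted == 1:
            counts["tp"] += 1
        elif truth == 0 and predicted == 1:
            counts["fp"] += 1
        elif truth == 1 and predicted == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.4, 0.2, 0.1]
default = confusion_counts(y_true, scores, threshold=0.5)
lowered = confusion_counts(y_true, scores, threshold=0.25)
```

With the threshold at 0.5 one positive example is missed (one false negative, no false positives); lowering it to 0.25 recovers that example at the cost of one false positive.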

Contribution of Classification Accuracy in Machine Learning 

In conclusion, classification accuracy is a helpful metric for assessing a machine learning model's performance, but it can be deceptive. For a more thorough view of the model's performance, additional metrics including precision, recall, F1-score, and the confusion matrix should also be used. To overcome issues like imbalanced classes, overfitting, and data bias, techniques including cross-validation, normalization, data augmentation, and resampling should be applied.

Original article source at:

#machinelearning  #classification #accuracy #key #issues 

Royce  Reinger



A Scikit-learn Based Module for Multi-label Et. Al. Classification


scikit-multilearn is a Python module capable of performing multi-label learning tasks. It is built on top of various scientific Python packages (numpy, scipy) and follows a similar API to that of scikit-learn.


Native Python implementation. A native Python implementation for a variety of multi-label classification algorithms. To see the list of all supported classifiers, check this link.

Interface to Meka. A Meka wrapper class is implemented for reference purposes and integration. This provides access to all methods available in MEKA, MULAN, and WEKA — the reference standard in the field.

Builds upon giants! Team-up with the power of numpy and scikit. You can use scikit-learn's base classifiers as scikit-multilearn's classifiers. In addition, the two packages follow a similar API.


In most cases you will want to follow the requirements defined in the requirements/*.txt files in the package.

Base dependencies

liac-arff # for loading ARFF files
requests # for dataset module
networkx # for networkX base community detection clusterers
python-louvain # for networkX base community detection clusterers

GPL-incurring dependencies for two clusterers

python-igraph # for igraph library based clusterers
python-graphtool # for graphtool base clusterers

Note: Installing graphtool is complicated; please see the graphtool install instructions.


To install scikit-multilearn, simply type the following command:

$ pip install scikit-multilearn

This will install the latest release from the Python package index. If you wish to install the bleeding-edge version, then clone this repository and run

$ git clone
$ cd scikit-multilearn
$ python

Basic Usage

Before proceeding to classification, this library assumes that you have a dataset with the following matrices:

  • X_train, X_test: training and test feature matrices of size (n_samples, n_features)
  • y_train, y_test: training and test label matrices of size (n_samples, n_labels)

Suppose we want to apply a problem-transformation method called Binary Relevance, which treats each label as a separate single-label classification problem, to a support-vector machine (SVM) classifier. We simply perform the following steps:

# Import BinaryRelevance from skmultilearn
from skmultilearn.problem_transform import BinaryRelevance

# Import SVC classifier from sklearn
from sklearn.svm import SVC

# Setup the classifier
classifier = BinaryRelevance(classifier=SVC(), require_dense=[False,True])

# Train
classifier.fit(X_train, y_train)

# Predict
y_pred = classifier.predict(X_test)

More examples and use-cases can be seen in the documentation. For using the MEKA wrapper, check this link.
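To show what the Binary Relevance transformation described above does under the hood, here is a dependency-free sketch. It is not scikit-multilearn's implementation: `SimpleBinaryRelevance` and the toy `MajorityClassifier` are hypothetical classes written for this example, and label matrices are plain lists of 0/1 rows.

```python
from typing import List, Sequence


class MajorityClassifier:
    """Toy single-label classifier: always predicts the most common training label."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> "MajorityClassifier":
        self.prediction_ = 1 if 2 * sum(y) >= len(y) else 0
        return self

    def predict(self, X: Sequence[Sequence[float]]) -> List[int]:
        return [self.prediction_ for _ in X]


class SimpleBinaryRelevance:
    """Train one independent binary classifier per label column."""

    def __init__(self, make_classifier=MajorityClassifier):
        self.make_classifier = make_classifier

    def fit(self, X, Y):
        n_labels = len(Y[0])
        self.classifiers_ = [
            self.make_classifier().fit(X, [row[j] for row in Y])
            for j in range(n_labels)
        ]
        return self

    def predict(self, X):
        # Each classifier predicts one column; transpose back to one row per sample.
        columns = [clf.predict(X) for clf in self.classifiers_]
        return [list(row) for row in zip(*columns)]


X_train = [[0.0], [1.0], [2.0]]
y_train = [[1, 0], [1, 0], [0, 1]]
model = SimpleBinaryRelevance().fit(X_train, y_train)
y_pred = model.predict([[3.0], [4.0]])
```

The base classifier is passed in as a factory so that each label gets its own fresh model; any object with compatible `fit` and `predict` methods could take the place of the toy classifier.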


This project is open for contributions. Here are some of the ways for you to contribute:

  • Bug reports/fixes
  • Feature requests
  • Use-case demonstrations
  • Documentation updates

In case you want to implement your own multi-label classifier, please read our Developer's Guide to help you integrate your implementation in our API.

To make a contribution, just fork this repository, push the changes in your fork, open up an issue, and make a Pull Request!

We're also available in Slack! Just go to our slack group.


If you used scikit-multilearn in your research or project, please cite our work:

   author = {{Szyma{\'n}ski}, P. and {Kajdanowicz}, T.},
   title = "{A scikit-based Python environment for performing multi-label classification}",
   journal = {ArXiv e-prints},
   archivePrefix = "arXiv",
   eprint = {1702.01460},
   year = 2017,
   month = feb

Download Details:

Author: Scikit-multilearn
Source Code: 
License: BSD-2-Clause license

Nigel Uys


How to Semi-supervised Image Classification With Unlabeled Data

Supervised learning is the key to computer vision and deep learning. However, what happens when you don’t have access to large, human-labeled datasets? In this article, Toptal Computer Vision Developer Urwa Muaz demonstrates the potential of semi-supervised image classification using unlabeled datasets.

Supervised learning has been at the forefront of research in computer vision and deep learning over the past decade.

In a supervised learning setting, humans are required to annotate large amounts of data manually. Models then use this data to learn the complex underlying relationships between the data and the labels, and develop the capability to predict the label given the data. Deep learning models are generally data-hungry and require enormous amounts of data to achieve good performance. Ever-improving hardware and the availability of large human-labeled datasets have been the reasons for the recent successes of deep learning.

One major drawback of supervised deep learning is that it relies on extensive human-labeled datasets for training. This luxury is not available across all domains, as it can be logistically difficult and very expensive to get huge datasets annotated by professionals. While acquiring labeled data can be a challenging and costly endeavor, we usually have access to large amounts of unlabeled data, especially images and text. Therefore, we need to find a way to tap into these underused datasets and use them for learning.


Labeled and unlabeled images


Transfer Learning from Pretrained Models

In the absence of large amounts of labeled data, we usually resort to using transfer learning. So what is transfer learning?

Transfer learning means using knowledge from a similar task to solve a problem at hand. In practice, it usually means initializing a deep neural network with weights learned on a similar task, rather than with random weights, and then further training the model on the available labeled data to solve the task at hand.

Transfer learning enables us to train models on datasets as small as a few thousand examples, and it can deliver a very good performance. Transfer learning from pretrained models can be performed in three ways:

1. Feature Extraction

Usually, the last layers of the neural network are doing the most abstract and task-specific calculations, which are generally not easily transferable to other tasks. By contrast, the initial layers of the network learn some basic features like edges and common shapes, which are easily transferable across tasks.

The image sets below depict what the convolution kernels at different levels in a convolutional neural network (CNN) are essentially learning. We see a hierarchical representation, with the initial layers learning basic shapes, and progressively, higher layers learning more complex semantic concepts.


Hierarchical representation: initial layers  and higher layers


A common practice is to take a model pretrained on large labeled image datasets (such as ImageNet) and chop off the fully connected layers at the end. New, fully connected layers are then attached and configured according to the required number of classes. Transferred layers are frozen, and the new layers are trained on the available labeled data for your task.

In this setup, the pretrained model is being used as a feature extractor, and the fully connected layers on top can be considered a shallow classifier. This setup is more robust to overfitting because the number of trainable parameters is relatively small, so this configuration works well when the available labeled data is very scarce. What size of dataset qualifies as very small is a tricky question that depends on many factors, including the problem at hand and the size of the model backbone. Roughly speaking, I would use this strategy for a dataset consisting of a couple of thousand images.

2. Fine-tuning

Alternatively, we can transfer the layers from a pretrained network and train the entire network on the available labeled data. This setup needs a little more labeled data because you are training the entire network and hence a large number of parameters. This setup is more prone to overfitting when there is a scarcity of data.

3. Two-stage Transfer Learning

This approach is my personal favorite and usually yields the best results, at least in my experience. Here, we train the newly attached layers while freezing the transferred layers for a few epochs before fine-tuning the entire network.

Fine-tuning the entire network without giving a few epochs to the final layers can result in the propagation of harmful gradients from randomly initialized layers to the base network. Furthermore, fine-tuning requires a comparatively smaller learning rate, and a two-stage approach is a convenient solution to it.
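
The two-stage recipe can be written down as a simple training plan. The sketch below is framework-agnostic and only illustrates the schedule: which parameter groups are trainable in each epoch, and at what learning rate. The function name, the group names "backbone" and "head", and the learning-rate values are all assumptions for illustration, not recommendations from a specific library.

```python
def two_stage_schedule(warmup_epochs, finetune_epochs,
                       head_lr=1e-3, finetune_lr=1e-5):
    """Return a list of (epoch, trainable_groups, learning_rate) tuples.

    Stage 1: only the newly attached head is trained (backbone frozen).
    Stage 2: backbone and head are trained together at a smaller rate.
    """
    plan = []
    for epoch in range(warmup_epochs):
        plan.append((epoch, ("head",), head_lr))
    for epoch in range(warmup_epochs, warmup_epochs + finetune_epochs):
        plan.append((epoch, ("backbone", "head"), finetune_lr))
    return plan


plan = two_stage_schedule(warmup_epochs=3, finetune_epochs=2)
for epoch, groups, lr in plan:
    print(f"epoch {epoch}: train {' + '.join(groups)} at lr={lr}")
```

In a real framework, the first stage corresponds to marking the transferred layers as non-trainable, and the second stage to unfreezing them and lowering the optimizer's learning rate.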

The Need for Semi-supervised and Unsupervised Methods

Transfer learning usually works very well for most image classification tasks because we have huge image datasets like ImageNet that cover a good portion of the possible image space, and the weights learned from them usually transfer to custom image classification tasks. Moreover, the pretrained networks are readily available off the shelf, which simplifies the process.

However, this approach will not work well if the distribution of images in your task is drastically different from the images that the base network was trained on. For example, if you are dealing with grayscale images generated by a medical imaging device, transfer learning from ImageNet weights will not be that effective and you will need more than a couple of thousand labeled images for training your network to satisfactory performance.

In contrast, you might have access to large amounts of unlabeled data for your problem. That is why the ability to learn from unlabeled data is crucial. Additionally, unlabeled data is typically far greater in variety and volume than even the largest labeled datasets.

Semi-supervised approaches have been shown to yield superior performance to supervised approaches on large benchmarks like ImageNet. Yann LeCun’s famous cake analogy stresses the importance of unsupervised learning:


Yann LeCun’s cake analogy


Semi-supervised Learning

This approach leverages both labeled and unlabeled data for learning, hence it is termed semi-supervised learning. This is usually the preferred approach when you have a small amount of labeled data and a large amount of unlabeled data. There are techniques where you learn from labeled and unlabeled data simultaneously, but we will discuss the problem in the context of a two-stage approach: unsupervised learning on unlabeled data, and transfer learning using one of the strategies described above to solve your classification task.

In these cases, unsupervised learning is a rather confusing term. These approaches are not truly unsupervised, in the sense that there is a supervision signal that guides the learning of weights, but that supervision signal is derived from the data itself. Hence, it is sometimes referred to as self-supervised learning, but the two terms have been used interchangeably in the literature to refer to the same approach.

The major techniques in self-supervised learning can be divided by how they generate this supervision signal from the data, as discussed below.

Generative Methods


Generative Methods - Autoencoders: encoder and decoder networks


Generative methods aim at the accurate reconstruction of data after passing it through a bottleneck. One example of such networks is autoencoders. They reduce the input into a low-dimensional representation space using an encoder network and reconstruct the image using the decoder network.

In this setup, the input itself becomes the supervision signal (label) for training the network. The encoder network can then be extracted and used as a starting point to build your classifier, using one of the transfer learning techniques discussed in the section above.

Similarly, another form of generative network, the Generative Adversarial Network (GAN), can be used for pretraining on unlabeled data. The discriminator can then be adapted and further fine-tuned for the classification task.

Discriminative Methods

Discriminative approaches train a neural network to learn an auxiliary classification task. An auxiliary task is chosen such that the supervision signal can be derived from the data itself, without human annotation.

Examples of such tasks are learning the relative positions of image patches, colorizing grayscale images, or learning the geometric transformations applied to images. We will discuss two of them in further detail.

Learning Relative Positions of Image Patches


Learning Relative Positions of Image Patches


In this technique, image patches are extracted from the source image to form a jigsaw puzzle-like grid. The patch positions are shuffled, and the shuffled input is fed into the network, which is trained to correctly predict the location of each patch in the grid. Thus, the supervision signal is the actual position of each patch in the grid.

In learning to do that, the network learns the relative structure and orientation of objects as well as the continuity of low-level visual features like color. The results show that the features learned by solving this jigsaw puzzle are highly transferable to tasks like image classification and object detection.
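
To illustrate how the jigsaw supervision signal comes for free, here is a minimal sketch that cuts a toy image into a grid of patches, shuffles them, and keeps the original grid index of each patch as its label. The function name and the list-of-rows image format are assumptions made for this example; real implementations work on image tensors and typically add gaps and jitter between patches.

```python
import random


def make_jigsaw_example(image, grid=3, rng=random):
    """Cut a square image (list of rows) into grid x grid patches and shuffle them.

    Returns (shuffled_patches, positions), where positions[i] is the original
    grid index (row-major) of shuffled_patches[i]. That index is the label.
    """
    size = len(image) // grid
    patches = []
    for gr in range(grid):
        for gc in range(grid):
            rows = image[gr * size:(gr + 1) * size]
            patches.append([row[gc * size:(gc + 1) * size] for row in rows])
    positions = list(range(grid * grid))
    rng.shuffle(positions)
    shuffled = [patches[p] for p in positions]
    return shuffled, positions


# A 6x6 toy "image" whose pixel values encode their own location.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
shuffled, positions = make_jigsaw_example(image, grid=3, rng=random.Random(0))
print(positions)
```

No human annotation is involved: the labels are produced by the same code that produces the inputs.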

Learning Geometric Transformations Applied to Images


Learning Geometric Transformations Applied to Images


These approaches apply a small set of geometric transformations to the input images and train a classifier to predict the applied transformation by looking at the transformed image alone. One example of these approaches is to apply a 2D rotation to the unlabeled images to obtain a set of rotated images and then train the network to predict the rotation of each image.

This simple supervision signal forces the network to learn to localize the objects in an image and understand their orientation. Features learned by these approaches have proven to be highly transferable and yield state-of-the-art performance for classification tasks in semi-supervised settings.
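
The rotation pretext task is easy to sketch: every unlabeled image is rotated by 0, 90, 180, and 270 degrees, and the rotation index becomes the class label. The snippet below uses plain nested lists as toy images; the helper names are assumptions made for this example.

```python
def rotate90(image, k):
    """Rotate a 2D image (a list of rows) clockwise by k * 90 degrees."""
    rotated = [list(row) for row in image]
    for _ in range(k % 4):
        # Reversing the rows and transposing is a 90-degree clockwise turn.
        rotated = [list(row) for row in zip(*rotated[::-1])]
    return rotated


def make_rotation_dataset(images):
    """Return (rotated_image, label) pairs; label k means k * 90 degrees."""
    return [(rotate90(image, k), k) for image in images for k in range(4)]


unlabeled = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
dataset = make_rotation_dataset(unlabeled)
for rotated, label in dataset[:4]:
    print(label, rotated)
```

A four-way classifier trained on such pairs never sees a human-provided label, yet it has to recognize object structure to tell the rotations apart.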

Similarity-based Approaches

These approaches project the images into a fixed-sized representation space where similar images are closer together and different images are further apart. One way to achieve this is to use siamese networks based on triplet loss, which minimizes the distance between semantically similar images. Triplet loss needs an anchor, a positive example, and a negative example, and tries to bring the positive closer to the anchor than the negative in terms of Euclidean distance in latent space. The anchor and the positive are from the same class, and the negative example is chosen randomly from the remaining classes.

With unlabeled data, we need a strategy to produce this triplet of anchor, positive, and negative examples without knowing the classes of the images. One way to do so is to use a random affine transformation of the anchor image as the positive example and randomly select another image as the negative example.


Triplet loss
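
As a worked example, the triplet loss for a single (anchor, positive, negative) triple can be computed as below. This is a minimal sketch on plain Python lists standing in for embedding vectors; the margin value of 1.0 is an assumption, and some formulations use squared distances instead of the Euclidean distance used here.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: zero once the negative is at least
    `margin` further from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]          # distance 1.0 from the anchor
easy_negative = [3.0, 4.0]     # distance 5.0 from the anchor
hard_negative = [0.0, 1.5]     # distance 1.5 from the anchor

print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
```

The easy negative already satisfies the margin and contributes no loss, while the hard negative is penalized because it sits too close to the anchor.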



In this section, I will relate an experiment that empirically establishes the potential of unsupervised pre-training for image classification. This was my semester project for a Deep Learning class I took with Yann LeCun at NYU last spring.

  • Dataset. It is composed of 128K labeled examples, half of which are for training and the other half for validation. Furthermore, we are provided 512K unlabeled images. The data contains 1,000 classes in total.
  • Unsupervised pre-training. AlexNet was trained for rotation classification using extensive data augmentation for 63 epochs. We used the hyperparameters documented in the RotNet paper.
  • Classifier training. Features were extracted from the fourth convolution layer, and three fully connected layers were appended to it. These layers were randomly initialized and trained with a scheduled decreasing learning rate, and early stopping was implemented to stop training.
  • Whole network fine-tuning. Finally, we fine-tuned the whole network on the entire labeled dataset. Both the feature extractor and the classifier, which were trained separately before, were fine-tuned together with a small learning rate for 15 epochs.

We trained seven models, each using a different number of labeled training examples per class. This was done to understand how the size of the training data influences the performance of our semi-supervised setup.




We were able to get an 82% accuracy rate for pre-training on rotation classification. For classifier training, the top-5 accuracy saturated at around 46.24%, and fine-tuning the entire network yielded the final figure of 50.17%. By leveraging the pre-training, we got better performance than purely supervised training, which gives 40% top-5 accuracy.
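
The classification figures above are top-5 accuracies: a prediction counts as correct when the true class is among the five highest-scoring classes. A minimal sketch of that metric follows; the function name and the toy scores are made up for illustration and are not taken from the experiment.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Three samples, three classes (toy numbers).
scores = [
    [0.1, 0.5, 0.4],
    [0.7, 0.2, 0.1],
    [0.2, 0.3, 0.5],
]
labels = [2, 1, 0]

print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
```

With 1,000 classes, top-5 accuracy is a more forgiving metric than top-1, which is why it is commonly reported alongside it on ImageNet-style benchmarks.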

As expected, the validation accuracy decreases with the decrease in labeled training data. However, the decrease in performance is not as significant as one would expect in a supervised setting. A 50% decrease in training data from 64 examples per class to 32 examples per class only results in a 15% decrease in the validation accuracy.




By using only 32 examples per class, our semi-supervised model achieves superior performance to the supervised model trained using 64 examples per class. This provides empirical evidence of the potential of semi-supervised approaches for image classification on low-resource labeled datasets.

Wrapping Up

We can conclude that unsupervised learning is a powerful paradigm that has the capability to boost performance for low-resource datasets. Unsupervised learning is currently in its infancy but will gradually expand its share in the computer vision space by enabling learning from cheap and easily accessible unlabeled data.

Original article source at:


Collection Of tools for Chemometrics & Machine Learning


This package contains a collection of tools to perform fundamental and advanced chemometric analyses in Julia. It is currently richer than any other free chemometrics package available in any other language. If you are uninformed as to what chemometrics is, it could inelegantly be described as the marriage between data science and chemistry. Traditionally it is the symbiosis of applied linear algebra/statistics which is disciplined by the physics and meaning of chemical measurements. This is somewhat orthogonal to most specializations of machine learning where "add more layers" is the modus operandi. Sometimes chemometricians also weigh the pros and cons of black box modelling and break out pure machine learning methods - so some of those techniques are in this package.

Package Status => Closer to Acceptability (v 0.5.8)

ChemometricsTools has been accepted as an official Julia package! Yep, so you can Pkg.add("ChemometricsTools") to install it. A lot of features have been added since the first public release (v 0.2.3). In 0.5.8 almost all of the functionality available can be used/abused. If you find a bug or want a new feature don't be shy - file an issue. In v0.5.1 Plots was removed as a dependency, new plot recipes were added, and now the package compiles much faster! Multilinear modeling, univariate modeling, and DOE functions are now available. Making headway into the release plan for v0.6.0. Convenience functions, documentation, bug fixes, refactoring and clean up are in progress; bear with me. The git repo's master branch typically has the most advanced version, but the features on it may be less reliable because I like to do development on it.

Seeking Collaborators

So my time and efforts for building this package are constrained. I really would like to find some collaborators to help flesh this package out, use it, find bugs. Even if your interests are more leaning towards machine learning/statistics I'd love to hear from you. Please file an issue if you are interested - or send me a message on Julia Discourse (ckneale)!

Version Release Strategy

  • < 0.3.0 : Mapping functionality, prototyping
  • < 0.5.0 : Testing via actual usage on real data, look for missing essentials
  • < 0.6.0 : Bake in convenience functions for ease of use. Flesh out Documentation.
  • < 0.7.5 : Public input (find those bugs!). Adequate Unit Tests.
  • < 1.0.0 : Focus on performance, stability, generalizability, lock down the package syntax.

Package Highlights


Two design choices introduced in this package are "Transformations" and "Pipelines". We can use transformations to treat data from multiple sources the same way. This helps mitigate user error for cases where test data is scaled based on training data, calibration transfer, etc.

Multiple transformations can easily be chained together and stored using "Pipelines". Pipelines aren't "pipes" like those in Bash, R and base Julia. They are flexible, yet immutable, convenience objects that allow for sequential preprocessing and data transformations to be reused, chained, or automated for reliable analytic throughput.
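
To show the idea in a language-neutral way, here is a small Python sketch of fitted transformations chained into a pipeline, so that test data is treated exactly the way training data was. This is not the ChemometricsTools API (the package is written in Julia); the `Center`, `UnitRange`, and `Pipeline` classes below are hypothetical stand-ins for the concept.

```python
class Center:
    """Stores the column means of the data it is built from."""

    def __init__(self, rows):
        self.means = [sum(col) / len(rows) for col in zip(*rows)]

    def __call__(self, rows):
        return [[x - m for x, m in zip(row, self.means)] for row in rows]


class UnitRange:
    """Stores the column-wise maximum absolute value of the data it is built from."""

    def __init__(self, rows):
        self.scales = [max(abs(x) for x in col) or 1.0 for col in zip(*rows)]

    def __call__(self, rows):
        return [[x / s for x, s in zip(row, self.scales)] for row in rows]


class Pipeline:
    """Fits each step on the output of the previous one, then freezes the chain."""

    def __init__(self, rows, *steps):
        fitted = []
        for step in steps:
            transform = step(rows)
            rows = transform(rows)
            fitted.append(transform)
        self.steps = tuple(fitted)  # a tuple, so the chain cannot be reordered

    def __call__(self, rows):
        for transform in self.steps:
            rows = transform(rows)
        return rows


train = [[1.0, 10.0], [3.0, 30.0]]
test = [[2.0, 40.0]]

pipe = Pipeline(train, Center, UnitRange)
print(pipe(train))
print(pipe(test))
```

The test row is centered and scaled with statistics learned from the training rows only, which is the kind of user error the design is meant to prevent.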

Model training

ChemometricsTools offers easy-to-use iterators for K-fold validation and moving-window sampling/training. More advanced sampling methods, like Kennard Stone, are just a function call away. Convenience functions for interval selections, weighting regression ensembles, etc. are also available. These allow for ensemble models like SIPLS, P-DS, P-OSC, etc. to be built quickly. With the tools included both in this package and Base Julia, nothing should stand in your way.
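
For readers unfamiliar with K-fold iterators, the sketch below shows what such an iterator yields: on each pass, one contiguous block of sample indices is held out for testing and the rest is used for training. It is written in Python purely for illustration and is not the package's Julia implementation; the function name is an assumption.

```python
def k_folds(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds.

    The first n_samples % k folds get one extra sample so every
    sample is held out exactly once.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


folds = list(k_folds(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

In practice the sample order is usually shuffled first, since contiguous folds can be biased when the data is sorted.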

Regression Modeling

This package features dozens of regression performance metrics, and a few built-in plots (Bland Altman, QQ, Interval Overlays etc.) are included. The list of regression methods currently includes: CLS, Ridge, Kernel Ridge, LS-SVM, PCR, PLS(1/2), ELMs, Regression Trees, Random Forest, Monotone Regression... More to come. Chemometricians love regressions! I've also added some convenience functions for univariate calibrations, standard addition experiments and some automated plot functions for them.

Classification Modeling

In-house classification encodings (one cold/one hot), and easy to retrieve global or multiclass performance statistics. ChemometricsTools currently includes: LDA/PCA with Gaussian discriminants, Hierarchical LDA, SIMCA, multinomial softmax/logistic regression, PLS-DA, K-NN, Gaussian Naive Bayes, Classification Trees, Random Forest, Probabilistic Neural Networks, Linear Perceptrons, and more to come. You can also conveniently dump classification statistics to LaTeX/CSV reports!

Multiway/Multilinear Modeling

I've been working to fill an obvious gap in the available tooling. Standard methods for Tucker decomposition (HOSVD and HOOI) have been included, along with some preprocessing methods and even an early view at multilinear PLS. There's a lot that could be done here, please feel free to contribute!

Specialized tools?

This package has tools for specialized fields of analysis. For instance, fractional derivatives for the electrochemists (and the adventurous), a handful of smoothing methods for spectroscopists, curve resolution (unimodal and nonnegativity constraints available) for forensics, process fault detection methods, etc. There are certainly plans for other tools for analyzing chemical data that packages in other languages have seemingly left out. Stay tuned.

Where's the Data?

Please check out ChemometricsData.jl for access to more publicly available datasets.

Right now the 2002 International Diffuse Reflectance Conference Pharmaceutical NIR, iris, Tecator aka 'meat', and ball gear fault detection (NASA) dataset are included in this package. But, this will be factored out eventually into ChemometricsData.jl.

I'd love for a collaborator to contribute some: spectra, chromatograms, etc. Please reach out to me if you wish to collaborate/contribute. In the meantime you can load in your own datasets using the full extent of the Julia ecosystem (XLSX.jl, CSV.jl, JSON.jl, MATLAB.jl, LibPQ.jl, Feather.jl, Arrow.jl, etc).

What about Time Series? Cluster modeling?

Well, I'd love to hammer in some time series methods. That was originally part of the plan. Then I realized OnlineStats.jl already has the essentials for online learning covered, and there are many efforts for actual time series (TimeSeries.jl) modelling in the works.

Similarly, if clustering methods are important to you, check out Clustering.jl. I may add a few supportive odds and ends in here (or contribute to the packages directly) but really, most of the Julia 1.0+ ecosystem is really reliable, well made, and community supported.


  • Clean up.
  • Performance improvements.
  • Syntax improvements.
  • Documentation improvements.
  • Unit tests.


  • Design of Experiment tools (Partial Factorial design, D/I-optimal, etc...)?
  • Convenience fns propagation of error, multiequilibria, kinetics?
  • Electrochemical simulations and optical simulations (maybe separate packages...)?


Shootouts/Modeling Examples:

Download Details:

Author: Caseykneale
Source Code: 
License: View license

Nat Grady


SeaClass: an interactive R tool for Classification Problems

The SeaClass R Package

The Advanced Analytics group at Seagate Technology has decided to share an internal project which helps accelerate development for classification problems. The interactive SeaClass tool is contained in an R-based package built using R Shiny and other CRAN packages commonly used for binary classification. The package is free to use and develop further, but any analysis mistakes are the sole responsibility of the user. Check out the demo video here.


The SeaClass R package provides tools for analyzing classification problems. In particular, specialized tools are available for addressing the problem of imbalanced data sets. The SeaClass application provides an easy to use interface which requires only minimal R programming knowledge to get started, and can be launched using the RStudio Addins menu. The application allows the user to explore numerous methods by simply clicking on the available options and interacting with the generated results. The user can choose to download the codes for any procedures they wish to explore further. SeaClass was designed to jump start the analysis process for both novice and advanced R users. See screenshots below for one demonstration.


Install Instructions

The SeaClass application depends on numerous R packages. To install SeaClass and its dependencies run:


Usage Instructions

Step 1. Begin by loading and preparing your data in R. Some general advice:

  • Your data set must be saved as an R data frame object.
  • The data set must contain a binary response variable (0/1, PASS/FAIL, A/B, etc.)
  • All other variables must be predictor variables.
  • Predictor variables can be numeric, categorical, or factors.
  • Including too many predictors may slow down the application and weaken performance.
  • Categorical predictors are often ignored when the number of levels exceeds 10 since they tend to have improper influences.
  • Missing values are not allowed and will throw a flag. Please remove or impute NAs prior to starting the app.
  • Keep the number of observations (rows) to a medium or small size.
  • Data sets with many rows (>10,000) or many columns (>30) may slow down the app's interactive responses.

Step 2. After data preparation, start the application by either loading SeaClass from the RStudio Addins dropdown menu or by loading the SeaClass function from the command line. For example:


### Make some fake data:
X <- matrix(rnorm(10000,0,1),ncol=10,nrow=1000)
X[1:100,1:2] <- X[1:100,1:2] + 3
Y <- c(rep(1,100), rep(0,900))
Fake_Data <- data.frame(Y = Y , X)

### Load the SeaClass rare failure data:

### Start the interactive GUI:

If the application fails to load, you may need to first specify your favorite browser path. For example:

options(browser = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe")

Step 3. The user has various options for configuring their analysis within the GUI. Once the analysis runs, the user can view the results, interact with the results (module dependent), save the underlying R script, or start over. Additional help is provided within the application. See above screenshots for one depiction of these steps.

Step 4. Besides the SeaClass function, several other functions are contained within the library. For example:

### List available functions:
### Note this is a sample data set:
# data(rareFailData)
### Note code_output is a support function for SeaClass, not for general use.

### View help:

### Run example from help file:
### General Use: ###
x <- c(rnorm(100,0,1),rnorm(100,2,1))
group <- c(rep(0,100),rep(2,100))
accuracy_threshold(x=x, group=group, pos_class=2)
accuracy_threshold(x=x, group=group, pos_class=0)
### Bagged Example ###
replicate_function = function(index){accuracy_threshold(x=x[index], group=group[index], pos_class=2)[[2]]}
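sample_cuts <- replicate(100, {
  # draw a bootstrap sample of observation indices
  sample_index = sample.int(n=length(x),replace=TRUE)
  replicate_function(index=sample_index)
})
bagged_scores <- sapply(x, function(x) mean(x > sample_cuts))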
unbagged_cut    <- accuracy_threshold(x=x, group=group, pos_class=2)[[2]]
unbagged_scores <- ifelse(x > unbagged_cut, 1, 0)
# Compare AUC:
PRROC::roc.curve(scores.class0 = bagged_scores,weights.class0 = ifelse(group==2,1,0))[[2]]
PRROC::roc.curve(scores.class0 = unbagged_scores,weights.class0 = ifelse(group==2,1,0))[[2]]
bagged_prediction <- ifelse(bagged_scores > 0.50, 2, 0)
unbagged_prediction <- ifelse(x > unbagged_cut, 2, 0)
# Compare Confusion Matrix:
table(bagged_prediction, group)
table(unbagged_prediction, group)

Download Details:

Author: ChrisDienes
Source Code: 

Monty Boehm


OBC.jl: Optimal Bayesian Classification for RNA-Seq Data


An optimal Bayesian classification library and runtime for RNA-Seq data.

Installation Instructions



You are now ready to use the OBC Julia library. The core operations look something like the following,

using OBC
data1,data2 = ... # your datasets as integer valued matrices (samples x genes)
d1,d2 = ... # the normalization factors for each dataset (float arrays)
cls = MPM.mpm_classifier(data1, data2, d1=d1, d2=d2)
MPM.sample(cls, 10000)
bemc = MPM.bee_e_mc(cls, (mean(d1),mean(d2)))

For a full example (with code to generate synthetic data) see the run.jl runner script.

Download Details:

Author: Binarybana
Source Code: 
License: View license

Royce Reinger


Stuff-classifier: Simple Text Classifier(s) Implementation in Ruby



A library for classifying text into multiple categories.

Currently provided classifiers:

I ran a benchmark of 1345 items that I had previously classified manually with multiple categories. Here's the rate at which the 2 algorithms correctly detected one of those categories:

  • Bayes: 79.26%
  • Tf-Idf: 81.34%

I prefer the Naive Bayes approach, because while having lower stats on this benchmark, it seems to make better decisions than I did in many cases. For example, an item with title "Paintball Session, 100 Balls and Equipment" was classified as "Activities" by me, but the bayes classifier identified it as "Sports", at which point I had an intellectual orgasm. Also, the Tf-Idf classifier seems to do better on clear-cut cases, but doesn't seem to handle uncertainty so well. Of course, these are just quick tests I made and I have no idea which is really better.


gem install stuff-classifier


You either instantiate one class or the other. Both have the same signature:

require 'stuff-classifier'

# for the naive bayes implementation
cls ="Cats or Dogs")

# for the Tf-Idf based implementation
cls ="Cats or Dogs")

# these classifiers use word stemming by default, but if it has weird
# behavior, then you can disable it on init:
cls ="Cats or Dogs", :stemming => false)

# also by default, the parsing phase filters out stop words, to
# disable or to come up with your own list of stop words, on a
# classifier instance you can do this:
cls.ignore_words = [ 'the', 'my', 'i', 'dont' ]

Training the classifier:

cls.train(:dog, "Dogs are awesome, cats too. I love my dog")
cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")    
cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
cls.train(:dog, "So which one should you choose? A dog, definitely.")
cls.train(:cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy")
cls.train(:dog, "A dog will eat anything, including birds or whatever meat")
cls.train(:cat, "My cat's favorite place to purr is on my keyboard")
cls.train(:dog, "My dog's favorite place to take a leak is the tree in front of our house")

And finally, classifying stuff:

cls.classify("This test is about cats.")
#=> :cat
cls.classify("I hate ...")
#=> :cat
cls.classify("The most annoying animal on earth.")
#=> :cat
cls.classify("The preferred company of software developers.")
#=> :cat
cls.classify("My precious, my favorite!")
#=> :cat
cls.classify("Get off my keyboard!")
#=> :cat
cls.classify("Kill that bird!")
#=> :cat

cls.classify("This test is about dogs.")
#=> :dog
cls.classify("Cats or Dogs?") 
#=> :dog
cls.classify("What pet will I love more?")    
#=> :dog
cls.classify("Willy, where the heck are you?")
#=> :dog
cls.classify("I like big buts and I cannot lie.") 
#=> :dog
cls.classify("Why is the front door of our house open?")
#=> :dog
cls.classify("Who is eating my meat?")
#=> :dog
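
For the curious, here is roughly what a naive Bayes text classifier does under the hood, as a minimal Python sketch. It is not this gem's implementation: there is no stemming and no stop-word filtering, it uses plain add-one smoothing, and `TinyNaiveBayes` is a hypothetical name, so don't expect it to reproduce the outputs above.

```python
import math
import re
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of docs
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    def train(self, category, text):
        words = self.tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for word in self.tokenize(text):
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best, best_score = category, score
        return best


nb = TinyNaiveBayes()
nb.train("dog", "dogs bark and fetch sticks")
nb.train("cat", "cats purr and chase mice")
print(nb.classify("my cats purr"))
print(nb.classify("dogs fetch"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.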


The following layers for saving the training data between sessions are implemented:

  • in memory (by default)
  • on disk
  • Redis
  • (coming soon) in a RDBMS

To persist the data in Redis, you can do this:

# defaults to redis running on localhost on default port
store =

# pass in connection args
store =, {host:'', port: 4829})

To persist the data on disk, you can do this:

store =

# global setting = store

# or alternative local setting on instantiation, by means of an
# optional param ...
cls ="Cats or Dogs", :storage => store)

# after training is done, to persist the data ...

# or you could just do this:"Cats or Dogs") do |cls|
  # when done, save_state is called on END

# to start fresh, deleting the saved training data for this classifier"Cats or Dogs", :purge_state => true)

The name you give your classifier is important, because the data is loaded and saved based on it. For instance, the following 3 classifiers will be stored in different buckets, independent of each other.

cls1 ="Cats or Dogs")
cls2 ="True or False")
cls3 ="Spam or Ham")    

No longer maintained

This repository has not been maintained for some time. If you're interested in maintaining a fork, contact the author so that I can place a link here.

Author: Alexandru
Source Code: 
License: MIT license

#ruby #text #classification 

Royce Reinger


Naive Bayes Text Classification Implementation As OmniCat

OmniCat Bayes 

A Naive Bayes text classification implementation as an OmniCat classifier strategy.


Installation

Add this line to your application's Gemfile:

gem 'omnicat-bayes'

And then execute:

$ bundle

Or install it yourself as:

$ gem install omnicat-bayes


See rdoc for detailed usage.


Optional configuration sample:

OmniCat.configure do |config|
  # you can enable auto train mode with :unique or :continues
  # unique: only unique docs will be added to training docs on prediction
  # continues: always add docs to training docs on prediction
  config.auto_train = :off
  config.exclude_tokens = ['something', 'anything'] # exclude token list
  config.token_patterns = {
    # exclude tokens with Regex patterns
    minus: [/[\s\t\n\r]+/, /(@[\w\d]+)/],
    # include tokens with Regex patterns
    plus: [/[\p{L}\-0-9]{2,}/, /[\!\?]/, /[\:\)\(\;\-\|]{2,3}/]
  }
end
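As a rough illustration of what these token patterns do, the sketch below applies them in plain Ruby. The `tokenize` helper is hypothetical and not part of OmniCat's API. It assumes that `minus` matches are stripped first and that `plus` matches are then collected, which may differ from what the gem actually does.

```ruby
# Hypothetical helper, not OmniCat's API: shows one way the patterns from the
# configuration sample could turn a document into tokens.
MINUS = [/[\s\t\n\r]+/, /(@[\w\d]+)/].freeze
PLUS  = [/[\p{L}\-0-9]{2,}/, /[\!\?]/, /[\:\)\(\;\-\|]{2,3}/].freeze
EXCLUDE_TOKENS = ['something', 'anything'].freeze

def tokenize(text)
  # Replace every "minus" match with a space so neighbouring words stay apart.
  cleaned = MINUS.reduce(text) { |str, pattern| str.gsub(pattern, ' ') }
  # Collect every "plus" match, then drop the explicitly excluded tokens.
  cleaned.scan(Regexp.union(PLUS)) - EXCLUDE_TOKENS
end

tokens = tokenize('a feel-good picture in the best sense of the term...')
puts tokens.inspect
```

Under these assumptions the single-letter word "a" and the trailing dots are dropped, while "feel-good" survives as one token.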

Bayes classifier

Create a classifier object with Bayes strategy.

# If you need to change strategy at runtime, you should prefer this initialization
bayes =


# If you only need Bayes classification, then you can use
bayes =

Create categories

Create a classification category.



Train category with a document.

bayes.train('positive', 'great if you are in a slap happy mood .')
bayes.train('negative', 'bad tracking issue')


Untrain category with a document.

bayes.untrain('positive', 'great if you are in a slap happy mood .')
bayes.untrain('negative', 'bad tracking issue')

Train batch

Train category with multiple documents.

bayes.train_batch('positive', [
  'a feel-good picture in the best sense of the term...',
  'it is a feel-good movie about which you can actually feel good.',
  'love and money both of them are good choises'
])
bayes.train_batch('negative', [
  'simplistic , silly and tedious .',
  'interesting , but not compelling . ',
  'seems clever but not especially compelling'
])

Untrain batch

Untrain category with multiple documents.

bayes.untrain_batch('positive', [
  'a feel-good picture in the best sense of the term...',
  'it is a feel-good movie about which you can actually feel good.',
  'love and money both of them are good choises'
])
bayes.untrain_batch('negative', [
  'simplistic , silly and tedious .',
  'interesting , but not compelling . ',
  'seems clever but not especially compelling'
])


Classify a document.

result = bayes.classify('I feel so good and happy')
=> #<OmniCat::Result:0x007febb152af68 @top_score_key="positive", @scores={"positive"=>#<OmniCat::Score:0x007febb152add8 @key="positive", @value=6.813226744186048e-09, @percentage=58>, "negative"=>#<OmniCat::Score:0x007febb152ac70 @key="negative", @value=4.875003449064939e-09, @percentage=42>}, @total_score=1.1688230193250986e-08>
=> {:top_score_key=>"positive", :scores=>{"positive"=>{:key=>"positive", :value=>6.813226744186048e-09, :percentage=>58}, "negative"=>{:key=>"negative", :value=>4.875003449064939e-09, :percentage=>42}}, :total_score=>1.1688230193250986e-08}
=> #<OmniCat::Score:0x007febb152add8 @key="positive", @value=6.813226744186048e-09, @percentage=58>
=> {:key=>"positive", :value=>6.813226744186048e-09, :percentage=>58}
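The percentages in the result appear to be each score's share of the total score, rounded to a whole number. The plain-Ruby sketch below checks that reading against the values shown above. It is an assumption about how OmniCat derives them, not code from the gem.

```ruby
# Raw scores copied from the classification result shown above.
scores = {
  'positive' => 6.813226744186048e-09,
  'negative' => 4.875003449064939e-09
}

total = scores.values.sum
# Each category's share of the total, rounded to a whole percent.
percentages = scores.transform_values { |value| (value / total * 100).round }
# The top score key is simply the category with the largest raw score.
top_key = scores.max_by { |_key, value| value }.first

puts percentages.inspect
puts top_key
```

With these inputs the shares come out as 58 and 42, matching the percentages in the result.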

Classify batch

Classify multiple documents at a time.

results = bayes.classify_batch(
  [
    'the movie is silly so not compelling enough',
    'a good piece of work'
  ]
)
=> [#<OmniCat::Result:0x007febb14f3680 @top_score_key="negative", @scores={"positive"=>#<OmniCat::Score:0x007febb14f34a0 @key="positive", @value=7.971480930520432e-14, @percentage=22>, "negative"=>#<OmniCat::Score:0x007febb14f32c0 @key="negative", @value=2.834304330851709e-13, @percentage=78>}, @total_score=3.6314524239037524e-13>, #<OmniCat::Result:0x007febb14f2aa0 @top_score_key="positive", @scores={"positive"=>#<OmniCat::Score:0x007febb14f2960 @key="positive", @value=3.802731206057328e-07, @percentage=72>, "negative"=>#<OmniCat::Score:0x007febb14f2820 @key="negative", @value=1.4625010347194818e-07, @percentage=28>}, @total_score=5.26523224077681e-07>]

Convert to hash

Convert full Bayes object to hash.

# For storing and restoring model data
bayes_hash = bayes.to_hash
=> {:categories=>{"positive"=>{:doc_count=>4, :docs=>{"28fd29bbf840c86db65e510ff3cd07a9"=>{:content=>"great if you are in a slap happy mood .", :content_md5=>"28fd29bbf840c86db65e510ff3cd07a9", :count=>1, :tokens=>{"great"=>1, "if"=>1, "you"=>1, "are"=>1, "in"=>1, "slap"=>1, "happy"=>1, "mood"=>1}}, "82b4cd9513f448dea0024f2d0e2ccd44"=>{:content=>"a feel-good picture in the best sense of the term...", :content_md5=>"82b4cd9513f448dea0024f2d0e2ccd44", :count=>1, :tokens=>{"feel-good"=>1, "picture"=>1, "in"=>1, "the"=>2, "best"=>1, "sense"=>1, "of"=>1, "term"=>1}}, "f917bf1cf1256c78c5436d850dab3104"=>{:content=>"it is a feel-good movie about which you can actually feel good.", :content_md5=>"f917bf1cf1256c78c5436d850dab3104", :count=>1, :tokens=>{"it"=>1, "is"=>1, "feel-good"=>1, "movie"=>1, "about"=>1, "which"=>1, "you"=>1, "can"=>1, "actually"=>1, "feel"=>1, "good"=>1}}, "4343bbe84c035733708c3f58136f321e"=>{:content=>"love and money both of them are good choises", :content_md5=>"4343bbe84c035733708c3f58136f321e", :count=>1, :tokens=>{"love"=>1, "and"=>1, "money"=>1, "both"=>1, "of"=>1, "them"=>1, "are"=>1, "good"=>1, "choises"=>1}}}, :name=>"positive", :tokens=>{"great"=>1, "if"=>1, "you"=>2, "are"=>2, "in"=>2, "slap"=>1, "happy"=>1, "mood"=>1, "feel-good"=>2, "picture"=>1, "the"=>2, "best"=>1, "sense"=>1, "of"=>2, "term"=>1, "it"=>1, "is"=>1, "movie"=>1, "about"=>1, "which"=>1, "can"=>1, "actually"=>1, "feel"=>1, "good"=>2, "love"=>1, "and"=>1, "money"=>1, "both"=>1, "them"=>1, "choises"=>1}, :token_count=>37, :prior=>0.5}, "negative"=>{:doc_count=>4, :docs=>{"89b36e774579662591ea21b3283d9b35"=>{:content=>"bad tracking issue", :content_md5=>"89b36e774579662591ea21b3283d9b35", :count=>1, :tokens=>{"bad"=>1, "tracking"=>1, "issue"=>1}}, "b0ec48bc87527e285b26d6cce8e278e7"=>{:content=>"simplistic , silly and tedious .", :content_md5=>"b0ec48bc87527e285b26d6cce8e278e7", :count=>1, :tokens=>{"simplistic"=>1, "silly"=>1, "and"=>1, "tedious"=>1}}, 
"ae9d4fbaf40906614ca712a888648c5f"=>{:content=>"interesting , but not compelling . ", :content_md5=>"ae9d4fbaf40906614ca712a888648c5f", :count=>1, :tokens=>{"interesting"=>1, "but"=>1, "not"=>1, "compelling"=>1}}, "0e495f5d88d8049746a1b6961bf3cc90"=>{:content=>"seems clever but not especially compelling", :content_md5=>"0e495f5d88d8049746a1b6961bf3cc90", :count=>1, :tokens=>{"seems"=>1, "clever"=>1, "but"=>1, "not"=>1, "especially"=>1, "compelling"=>1}}}, :name=>"negative", :tokens=>{"bad"=>1, "tracking"=>1, "issue"=>1, "simplistic"=>1, "silly"=>1, "and"=>1, "tedious"=>1, "interesting"=>1, "but"=>2, "not"=>2, "compelling"=>2, "seems"=>1, "clever"=>1, "especially"=>1}, :token_count=>17, :prior=>0.5}}, :category_count=>2, :category_size_limit=>0, :doc_count=>8, :token_count=>54, :unique_token_count=>43, :k_value=>1.0}

Load from hash

Load full Bayes object from hash.

another_bayes_obj =
=> #<OmniCat::Classifiers::Bayes:0x007febb14d15a8 @categories={"positive"=>#<OmniCat::Classifiers::BayesInternals::Category:0x007febb14d1530 @doc_count=4, @docs={"28fd29bbf840c86db65e510ff3cd07a9"=>{:content=>"great if you are in a slap happy mood .", :content_md5=>"28fd29bbf840c86db65e510ff3cd07a9", :count=>1, :tokens=>{"great"=>1, "if"=>1, "you"=>1, "are"=>1, "in"=>1, "slap"=>1, "happy"=>1, "mood"=>1}}, "82b4cd9513f448dea0024f2d0e2ccd44"=>{:content=>"a feel-good picture in the best sense of the term...", :content_md5=>"82b4cd9513f448dea0024f2d0e2ccd44", :count=>1, :tokens=>{"feel-good"=>1, "picture"=>1, "in"=>1, "the"=>2, "best"=>1, "sense"=>1, "of"=>1, "term"=>1}}, "f917bf1cf1256c78c5436d850dab3104"=>{:content=>"it is a feel-good movie about which you can actually feel good.", :content_md5=>"f917bf1cf1256c78c5436d850dab3104", :count=>1, :tokens=>{"it"=>1, "is"=>1, "feel-good"=>1, "movie"=>1, "about"=>1, "which"=>1, "you"=>1, "can"=>1, "actually"=>1, "feel"=>1, "good"=>1}}, "4343bbe84c035733708c3f58136f321e"=>{:content=>"love and money both of them are good choises", :content_md5=>"4343bbe84c035733708c3f58136f321e", :count=>1, :tokens=>{"love"=>1, "and"=>1, "money"=>1, "both"=>1, "of"=>1, "them"=>1, "are"=>1, "good"=>1, "choises"=>1}}}, @name="positive", @tokens={"great"=>1, "if"=>1, "you"=>2, "are"=>2, "in"=>2, "slap"=>1, "happy"=>1, "mood"=>1, "feel-good"=>2, "picture"=>1, "the"=>2, "best"=>1, "sense"=>1, "of"=>2, "term"=>1, "it"=>1, "is"=>1, "movie"=>1, "about"=>1, "which"=>1, "can"=>1, "actually"=>1, "feel"=>1, "good"=>2, "love"=>1, "and"=>1, "money"=>1, "both"=>1, "them"=>1, "choises"=>1}, @token_count=37, @prior=0.5>, "negative"=>#<OmniCat::Classifiers::BayesInternals::Category:0x007febb14d14e0 @doc_count=4, @docs={"89b36e774579662591ea21b3283d9b35"=>{:content=>"bad tracking issue", :content_md5=>"89b36e774579662591ea21b3283d9b35", :count=>1, :tokens=>{"bad"=>1, "tracking"=>1, "issue"=>1}}, "b0ec48bc87527e285b26d6cce8e278e7"=>{:content=>"simplistic , silly 
and tedious .", :content_md5=>"b0ec48bc87527e285b26d6cce8e278e7", :count=>1, :tokens=>{"simplistic"=>1, "silly"=>1, "and"=>1, "tedious"=>1}}, "ae9d4fbaf40906614ca712a888648c5f"=>{:content=>"interesting , but not compelling . ", :content_md5=>"ae9d4fbaf40906614ca712a888648c5f", :count=>1, :tokens=>{"interesting"=>1, "but"=>1, "not"=>1, "compelling"=>1}}, "0e495f5d88d8049746a1b6961bf3cc90"=>{:content=>"seems clever but not especially compelling", :content_md5=>"0e495f5d88d8049746a1b6961bf3cc90", :count=>1, :tokens=>{"seems"=>1, "clever"=>1, "but"=>1, "not"=>1, "especially"=>1, "compelling"=>1}}}, @name="negative", @tokens={"bad"=>1, "tracking"=>1, "issue"=>1, "simplistic"=>1, "silly"=>1, "and"=>1, "tedious"=>1, "interesting"=>1, "but"=>2, "not"=>2, "compelling"=>2, "seems"=>1, "clever"=>1, "especially"=>1}, @token_count=17, @prior=0.5>}, @category_count=2, @category_size_limit=0, @doc_count=8, @token_count=54, @unique_token_count=43, @k_value=1.0>
another_bayes_obj.classify('best senses')
=> #<OmniCat::Result:0x007febb14c0fc8 @top_score_key="positive", @scores={"positive"=>#<OmniCat::Score:0x007febb14c0ed8 @key="positive", @value=0.00029069767441860465, @percentage=52>, "negative"=>#<OmniCat::Score:0x007febb14c0de8 @key="negative", @value=0.0002704164413196322, @percentage=48>}, @total_score=0.0005611141157382368>
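To keep a model between sessions, the hash has to be written somewhere and read back before it is loaded again. A minimal sketch using Ruby's standard `Marshal` follows. The `model_hash` here is a small hand-written stand-in for real `to_hash` output, and the file name is hypothetical. `Marshal` is used because the hash mixes symbol keys and string keys, which a JSON round trip would not keep apart.

```ruby
require 'tmpdir'

# A small stand-in for the hash returned by to_hash (structure abbreviated).
model_hash = {
  categories: {
    'positive' => { doc_count: 4, tokens: { 'good' => 2 }, prior: 0.5 }
  },
  category_count: 1,
  k_value: 1.0
}

# Hypothetical location for the saved model.
path = File.join(Dir.tmpdir, 'bayes_model.bin')

# Marshal keeps symbol keys and string keys distinct.
File.binwrite(path, Marshal.dump(model_hash))

# Only call Marshal.load on files you wrote yourself; it is unsafe on
# untrusted input.
restored = Marshal.load(File.binread(path))
puts restored == model_hash
```

The restored hash can then be handed to whatever load-from-hash call the classifier provides.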

Best practices

For Bayes classification, always try to train the same number of documents for each category. For that reason, do not activate auto-training mode: it unbalances the number of trained docs per category and makes the algorithm go crazy :). To get the best results on text classification, apply cleaning steps such as spellchecking, stemming and stop-word removal before training and prediction.
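The two recommendations above, balanced categories and cleaned text, can be sketched in plain Ruby. Everything here is hypothetical: the `clean` and `balance` helpers, the tiny stop-word list and the crude plural stripping are stand-ins for real preprocessing tools and are not part of OmniCat.

```ruby
# Hypothetical preprocessing helpers; the stop-word list and the suffix
# stripping are deliberately tiny stand-ins for real tools.
STOP_WORDS = %w[a an the is it of and in to].freeze

def clean(text)
  words = text.downcase.scan(/[\p{L}0-9\-]+/)
  words -= STOP_WORDS
  # Very crude "stemming": strip one trailing plural "s" from longer words,
  # leaving words that end in "ss" alone.
  words.map { |word| word.length > 3 ? word.sub(/(?<!s)s\z/, '') : word }.join(' ')
end

# Trim every category to the size of the smallest one to keep training balanced.
def balance(docs_by_category)
  limit = docs_by_category.values.map(&:size).min
  docs_by_category.transform_values { |docs| docs.first(limit) }
end

training = balance({
  'positive' => ['The movies are great', 'A feel-good picture', 'Love it'],
  'negative' => ['Bad tracking issues', 'Silly and boring']
})
cleaned = training.transform_values { |docs| docs.map { |doc| clean(doc) } }
puts cleaned.inspect
```

Each cleaned document could then be passed to `bayes.train` for its category, and the same `clean` step applied to any text before `bayes.classify`.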


Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

Author: Mustafaturan
Source Code: 
License: MIT license

#ruby #tokenize #text #classification 

Royce Reinger


Omnicat: A Generalized Rack Framework for Text Classifications


A generalized framework for text classifications.


Installation

Add this line to your application's Gemfile:

gem 'omnicat'

And then execute:

$ bundle

Or install it yourself as:

$ gem install omnicat


The stand-alone version of omnicat is just a strategy holder for developers. Its aim is to unify the methods of text classification gems and to allow lossless conversion from one strategy to another. End users should see the 'Classifier strategies' section and the 'Changing classifier strategy' subsection.

Changing classifier strategy

OmniCat allows you to change strategy at runtime.

# Declare classifier with Naive Bayes classifier
classifier =
# do some operations like adding category, training, etc...
# make some classification using Bayes
classifier.classify('I am happy :)')
# change strategy to Support Vector Machine (SVM) at runtime
classifier.strategy =
# now you do not need to re-train, add category and so on..
# just classify with new strategy
classifier.classify('I am happy :)')
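The idea behind swapping strategies without re-training can be sketched with a plain strategy pattern. The classes below are hypothetical toys, not OmniCat's classes. The point is only that the training data lives in the classifier, so a new strategy can read it immediately.

```ruby
# Hypothetical strategies: both read the same training data, so swapping one
# for the other needs no re-training.
class OverlapStrategy
  # Picks the category whose training words overlap most with the document.
  def classify(docs_by_category, text)
    words = text.downcase.split
    best = docs_by_category.max_by do |_category, docs|
      (docs.join(' ').downcase.split & words).size
    end
    best.first
  end
end

class MajorityStrategy
  # Ignores the text and picks the category with the most training documents.
  def classify(docs_by_category, _text)
    docs_by_category.max_by { |_category, docs| docs.size }.first
  end
end

class Classifier
  attr_accessor :strategy

  def initialize(strategy)
    @strategy = strategy
    @docs_by_category = Hash.new { |hash, key| hash[key] = [] }
  end

  def train(category, doc)
    @docs_by_category[category] << doc
  end

  def classify(text)
    # Training data is kept here, not in the strategy, so it survives a swap.
    strategy.classify(@docs_by_category, text)
  end
end

classifier = Classifier.new(OverlapStrategy.new)
classifier.train('positive', 'I am so happy today')
classifier.train('negative', 'bad tracking issue')
classifier.train('negative', 'silly and tedious')

first = classifier.classify('I am happy :)')
classifier.strategy = MajorityStrategy.new
second = classifier.classify('I am happy :)')
puts first, second
```

The two toy strategies disagree on the same input, which makes it easy to see that the swap took effect while the trained documents stayed in place.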

Classifier strategies

Here is the list of classifiers available for OmniCat.

Naive Bayes classifier


Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

Author: Mustafaturan
Source Code: 
License: MIT license

#ruby #classification #texts #framework 

Royce Reinger


Nbayes: A Robust, Full-featured Ruby Implementation Of Naive Bayes


gem install nbayes

NBayes is a full-featured, Ruby implementation of Naive Bayes. Some of the features include:

  • allows prior distribution on classes to be assumed uniform (optional)
  • generic to work with all types of tokens, not just text
  • outputs probabilities, instead of just class w/max probability
  • customizable constant value for Laplacian smoothing
  • optional and customizable purging of low-frequency tokens (for performance)
  • optional binarized mode
  • uses log probabilities to avoid underflow
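Three of these features (probability output, a customizable Laplacian smoothing constant, and log probabilities to avoid underflow) can be illustrated with a small multinomial Naive Bayes in plain Ruby. This is a sketch for illustration only: `TinyNaiveBayes` and its methods are hypothetical and do not reflect NBayes' internals or API.

```ruby
# Minimal multinomial Naive Bayes; an illustration, not NBayes' code or API.
class TinyNaiveBayes
  def initialize(k: 1.0)
    @k = k # Laplacian smoothing constant
    @token_counts = Hash.new { |hash, category| hash[category] = Hash.new(0) }
    @doc_counts = Hash.new(0)
    @vocab = {}
  end

  def train(tokens, category)
    @doc_counts[category] += 1
    tokens.each do |token|
      @token_counts[category][token] += 1
      @vocab[token] = true
    end
  end

  # Returns a hash of category => probability (the values sum to 1).
  def classify(tokens)
    total_docs = @doc_counts.values.sum.to_f
    log_scores = @doc_counts.to_h do |category, doc_count|
      counts = @token_counts[category]
      denominator = counts.values.sum + @k * @vocab.size
      # Summing logs instead of multiplying probabilities avoids underflow.
      log_likelihood = tokens.sum { |t| Math.log((counts[t] + @k) / denominator) }
      [category, Math.log(doc_count / total_docs) + log_likelihood]
    end
    normalize(log_scores)
  end

  private

  # Subtract the largest log score before exponentiating, then normalize.
  def normalize(log_scores)
    max = log_scores.values.max
    exp = log_scores.transform_values { |score| Math.exp(score - max) }
    total = exp.values.sum
    exp.transform_values { |value| value / total }
  end
end

nb = TinyNaiveBayes.new
nb.train(%w[cheap pills cheap], 'spam')
nb.train(%w[meeting notes agenda], 'ham')
probs = nb.classify(%w[cheap pills])
puts probs.inspect
```

Because scores are kept as logs until the final normalization, even a document with thousands of tokens yields finite probabilities rather than collapsing to zero for every class.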

For more information, view this blog post:

Contributing to nbayes

  • Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
  • Check out the issue tracker to make sure someone hasn't already requested it and/or contributed it.
  • Fork the project.
  • Start a feature/bugfix branch.
  • Commit and push until you are happy with your contribution.
  • Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
  • Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or it is otherwise necessary, that is fine, but please isolate the change to its own commit so I can cherry-pick around it.


This project is supported by the GrammarBot grammar checker

Author: oasic
Source Code: 
License: MIT license

#ruby #classification
