The five video classification methods:
1. Classify one frame at a time with a ConvNet
2. Extract features from each frame with a ConvNet, passing the sequence to a separate RNN
3. Use a time-distributed ConvNet, passing the features to an RNN, much like #2 but all in one network (this is the lrcn network in the code)
4. Extract features from each frame with a ConvNet and pass the sequence to an MLP
5. Use a 3D convolutional network
See the accompanying blog post for full details: https://medium.com/@harvitronix/five-video-classification-methods-implemented-in-keras-and-tensorflow-99cad29cc0b5
This code requires that you have Keras 2 and TensorFlow 1 or greater installed. Please see the requirements.txt file. To ensure you're up to date, run:
pip install -r requirements.txt
You must also have ffmpeg installed in order to extract the video files. If ffmpeg isn't in your system path (i.e., which ffmpeg doesn't return its path, or you're on an OS other than *nix), you'll need to update the path to ffmpeg in data/2_extract_files.py.
First, download the dataset from UCF into the data folder:
cd data && wget http://crcv.ucf.edu/data/UCF101/UCF101.rar
Then extract it with unrar e UCF101.rar.
Next, create folders (still in the data folder) with mkdir train && mkdir test && mkdir sequences && mkdir checkpoints.
Now you can run the scripts in the data folder to move the videos to the appropriate place, extract their frames and make the CSV file the rest of the code references. You need to run these in order. Example:
python 1_move_files.py
python 2_extract_files.py
Before you can run the lstm and mlp models, you need to extract features from the images with the CNN. This is done by running extract_features.py. On my Dell with a GeForce 960M GPU, this takes about 8 hours. If you want to limit to just the first N classes, you can set that option in the file.
The CNN-only method (method #1 in the blog post) is run from train_cnn.py.
The rest of the models are run from train.py. There are configuration options you can set in that file to choose which model you want to run.
The models are all defined in models.py. Reference that file to see which models you are able to run in train.py.
Training logs are saved to CSV and also to TensorBoard files. To see progress while training, run tensorboard --logdir=data/logs from the project root folder.
I have not yet implemented a demo where you can pass a video file to a model and get a prediction. Pull requests are welcome if you'd like to help out!
Khurram Soomro, Amir Roshan Zamir and Mubarak Shah, UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild., CRCV-TR-12-01, November, 2012.
Author: Harvitronix
Source Code: https://github.com/harvitronix/five-video-classification-methods
License: MIT license
#machinelearning #deeplearning #tensorflow #keras #classification
Use TensorFlow to load pretrained neural networks and perform inference through ROS 2 interfaces. The output can be directly visualized through RViz.
In order to build the ros2-tensorflow package, the following dependencies are needed:
Required dependencies:
Rosdep dependencies:
Optional dependencies:
The provided Dockerfile contains an Ubuntu 18.04 environment with all the dependencies and this package already installed.
To use the Dockerfile:
$ git clone https://github.com/alsora/ros2-tensorflow.git
$ cd ros2-tensorflow/docker
$ bash build.sh
$ bash run.sh
This section describes how to build the ros2-tensorflow package and the required dependencies in case you are not using the provided Dockerfile.
Get the source code and create the ROS 2 workspace
$ git clone https://github.com/alsora/ros2-tensorflow.git $HOME/ros2-tensorflow
$ mkdir -p $HOME/tf_ws/src
$ cd $HOME/tf_ws
$ ln -s $HOME/ros2-tensorflow/ros2-tensorflow src
Install required dependencies using rosdep
$ rosdep install --from-paths src --ignore-src --rosdistro foxy -y
Install the Tensorflow Object Detection Models (optional). Make sure to specify the correct Python version according to your system.
$ sudo apt-get install -y protobuf-compiler python-lxml python-tk
$ pip install --user Cython contextlib2 jupyter matplotlib Pillow
$ git clone https://github.com/tensorflow/models.git /usr/local/lib/python3.8/dist-packages/tensorflow/models
$ cd /usr/local/lib/python3.8/dist-packages/tensorflow/models/research
$ protoc object_detection/protos/*.proto --python_out=.
$ echo 'export PYTHONPATH=$PYTHONPATH:/usr/local/lib/python3.8/dist-packages/tensorflow/models/research' >> $HOME/.bashrc
Install Tensorflow Slim (optional)
$ pip install tf_slim
Build and install the ros2-tensorflow package:
$ colcon build
$ source install/local_setup.sh
The basic usage consists of creating a ROS 2 node which loads a TensorFlow model, and another ROS 2 node that acts as a client and receives the result of the inference.
It is possible to specify which model a node should load. Note that if the model is specified via URL, as it is by default, a network connection will be required the first time the node is executed in order to download the model.
Test the object detection server by running in separate terminals
$ ros2 run tf_detection_py server
$ ros2 run tf_detection_py client_test
Setup a real object detection pipeline using a stream of images coming from a ROS 2 camera node
$ rviz2
$ ros2 run tf_detection_py server
$ ros2 run image_tools cam2image --ros-args -p frequency:=2.0
Test the image classification server by running in separate terminals
$ ros2 run tf_classification_py server
$ ros2 run tf_classification_py client_test
The repository contains convenient APIs for loading Tensorflow models into the ROS 2 nodes.
Models are defined using the ModelDescriptor class, which contains all the information required for loading a model and performing inference on it. It can contain either a path where the model can be found on the machine or a URL from which the model can be downloaded the first time.
Different model formats are also supported, such as frozen models and saved models.
Some known supported models are already present as examples. See classification models and detection models.
The Tensorflow models repository contains many pretrained models that can be used. For example, you can get additional Tensorflow model for object detection from the detection model zoo.
Author: Alsora
Source Code: https://github.com/alsora/ros2-tensorflow
License: Apache-2.0 license
Your PyTorch AI Factory
Flash makes complex AI recipes for over 15 tasks across 7 data domains accessible to all. In a nutshell, Flash is the production-grade research framework you always dreamed of but didn't have time to build.
From PyPI:
pip install lightning-flash
See our installation guide for more options.
All data loading in Flash is performed via a from_* classmethod on a DataModule. Which DataModule to use and which from_* methods are available depends on the task you want to perform. For example, for image segmentation where your data is stored in folders, you would use the from_folders method of the SemanticSegmentationData class:
from flash.image import SemanticSegmentationData
dm = SemanticSegmentationData.from_folders(
    train_folder="data/CameraRGB",
    train_target_folder="data/CameraSeg",
    val_split=0.1,
    image_size=(256, 256),
    num_classes=21,
)
Our tasks come loaded with pre-trained backbones and (where applicable) heads. You can view the available backbones to use with your task using available_backbones. Once you've chosen one, create the model:
from flash.image import SemanticSegmentation
print(SemanticSegmentation.available_heads())
# ['deeplabv3', 'deeplabv3plus', 'fpn', ..., 'unetplusplus']
print(SemanticSegmentation.available_backbones('fpn'))
# ['densenet121', ..., 'xception'] # + 113 models
print(SemanticSegmentation.available_pretrained_weights('efficientnet-b0'))
# ['imagenet', 'advprop']
model = SemanticSegmentation(
    head="fpn", backbone="efficientnet-b0", pretrained="advprop", num_classes=dm.num_classes
)
from flash import Trainer
trainer = Trainer(max_epochs=3)
trainer.finetune(model, datamodule=dm, strategy="freeze")
trainer.save_checkpoint("semantic_segmentation_model.pt")
Serve in just 2 lines:
from flash.image import SemanticSegmentation
model = SemanticSegmentation.load_from_checkpoint("semantic_segmentation_model.pt")
model.serve()
or make predictions from raw data directly.
from flash import Trainer
trainer = Trainer(strategy='ddp', accelerator="gpu", gpus=2)
dm = SemanticSegmentationData.from_folders(predict_folder="data/CameraRGB")
predictions = trainer.predict(model, dm)
Training strategies are PyTorch SOTA Training Recipes which can be utilized with a given task.
Check out this example where the ImageClassifier supports 4 Meta-Learning Algorithms from Learn2Learn. This is particularly useful if you use this model in production and want to make sure the model adapts quickly to its new environment with minimal labelled data.
import torch

from flash.image import ImageClassifier

model = ImageClassifier(
    backbone="resnet18",
    optimizer=torch.optim.Adam,
    optimizer_kwargs={"lr": 0.001},
    training_strategy="prototypicalnetworks",
    training_strategy_kwargs={
        "epoch_length": 10 * 16,
        "meta_batch_size": 4,
        "num_tasks": 200,
        "test_num_tasks": 2000,
        "ways": datamodule.num_classes,  # assumes a datamodule created as shown earlier
        "shots": 1,
        "test_ways": 5,
        "test_shots": 1,
        "test_queries": 15,
    },
)
In detail, the following methods are currently implemented:
With Flash, swapping among 40+ optimizer and 15+ scheduler recipes is simple. Find the list of available optimizers and schedulers as follows:
from flash.image import ImageClassifier
ImageClassifier.available_optimizers()
# ['A2GradExp', ..., 'Yogi']
ImageClassifier.available_schedulers()
# ['CosineAnnealingLR', 'CosineAnnealingWarmRestarts', ..., 'polynomial_decay_schedule_with_warmup']
Once you've chosen, create the model:
#### The optimizer of choice can be passed as
import functools

import torch
from torch.optim.lr_scheduler import CyclicLR

from flash.image import ImageClassifier

# - String value
model = ImageClassifier(backbone="resnet18", num_classes=2, optimizer="Adam", lr_scheduler=None)
# - Callable
model = ImageClassifier(backbone="resnet18", num_classes=2, optimizer=functools.partial(torch.optim.Adadelta, eps=0.5), lr_scheduler=None)
# - Tuple[string, dict]: (the dict takes in the optimizer kwargs)
model = ImageClassifier(backbone="resnet18", num_classes=2, optimizer=("Adadelta", {"eps": 0.5}), lr_scheduler=None)

#### The scheduler of choice can be passed as a
# - String value
model = ImageClassifier(backbone="resnet18", num_classes=2, optimizer="Adam", lr_scheduler="constant_schedule")
# - Callable
model = ImageClassifier(backbone="resnet18", num_classes=2, optimizer="Adam", lr_scheduler=functools.partial(CyclicLR, step_size_up=1500, mode="exp_range", gamma=0.5))
# - Tuple[string, dict]: (the dict takes in the scheduler kwargs)
model = ImageClassifier(backbone="resnet18", num_classes=2, optimizer="Adam", lr_scheduler=("StepLR", {"step_size": 10}))
You can also register your own custom scheduler recipes beforehand and use them as shown above:
import torch

from flash.image import ImageClassifier

@ImageClassifier.lr_schedulers_registry
def my_steplr_recipe(optimizer):
    return torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

model = ImageClassifier(backbone="resnet18", num_classes=2, optimizer="Adam", lr_scheduler="my_steplr_recipe")
Flash includes some simple augmentations for each task by default; however, you will often want to override these and control your own augmentation recipe. To this end, Flash supports custom transformations with the InputTransform. The InputTransform is like a callback for transforms, with hooks that can be used to apply transforms to samples or batches, on and off the device/accelerator. In addition, hooks can be specialized to apply transforms only to the input or target. With these hooks, complex transforms like MixUp can be implemented with ease. Here's an example (with an albumentations transform thrown in too!):
import torch
import numpy as np
import albumentations
from flash import InputTransform
from flash.image import ImageClassificationData
from flash.image.classification.input_transform import AlbumentationsAdapter
def mixup(batch, alpha=1.0):
    images = batch["input"]
    targets = batch["target"].float().unsqueeze(1)
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(images.size(0))
    batch["input"] = images * lam + images[perm] * (1 - lam)
    batch["target"] = targets * lam + targets[perm] * (1 - lam)
    return batch

class MixUpInputTransform(InputTransform):
    def train_input_per_sample_transform(self):
        return AlbumentationsAdapter(albumentations.HorizontalFlip(p=0.5))

    # This will be applied after transferring the batch to the device!
    def train_per_batch_transform_on_device(self):
        return mixup

datamodule = ImageClassificationData.from_folders(
    train_folder="data/train",
    transform=MixUpInputTransform,
    batch_size=2,
)
Flash Zero is a zero-code machine learning platform built directly into lightning-flash using the Lightning CLI.
To get started and view the available tasks, run:
flash --help
For example, to train an image classifier for 10 epochs with a resnet50 backbone on 2 GPUs using your own data, you can do:
flash image_classification --trainer.max_epochs 10 --trainer.gpus 2 --model.backbone resnet50 from_folders --train_folder {PATH_TO_DATA}
The Lightning + Flash team is hard at work building more tasks for common deep-learning use cases. But we're looking for incredible contributors like you to submit new tasks!
Join our Slack and/or read our CONTRIBUTING guidelines to get help becoming a contributor!
Note: Flash is currently being tested on real-world use cases and is in active development. Please open an issue if you find anything that isn't working as expected.
Flash is maintained by our core contributors.
For help or questions, join our huge community on Slack!
We're excited to continue the strong legacy of open-source software and have been inspired over the years by Caffe, Theano, Keras, PyTorch, torchbearer, and fast.ai. When/if additional papers are written about this, we'll be happy to cite these frameworks and the corresponding authors.
Flash leverages models from many different frameworks in order to cover such a wide range of domains and tasks. The full list of providers can be found in our documentation.
Author: Lightning-Universe
Source Code: https://github.com/Lightning-Universe/lightning-flash
License: Apache-2.0 license
In this blog, we will unfold the key problems associated with classification accuracies, such as imbalanced classes, overfitting, and data bias, and proven ways to address those issues successfully.
Imbalanced Classes
The accuracy may be deceptive if the dataset contains an uneven class distribution. For instance, a model that merely predicts the majority class will be 99% accurate if the dominant class comprises 99% of the data. Unfortunately, it will not be able to appropriately classify the minority class. Other metrics, including precision, recall, and F1-score, should be used to address this issue.
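To make this concrete, here is a minimal sketch (using scikit-learn, which the article itself does not prescribe) of how accuracy hides a useless majority-class predictor:
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0] * 9 + [1]      # 90% majority class
y_pred = [0] * 10           # a model that always predicts the majority class
# Accuracy is 90%, yet the minority class is never detected:
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred))                      # 0.0
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0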
The 5 most common techniques that can be used to address the problem of imbalanced classes in classification accuracy are:
Imbalanced class | Knowledge Engineering
Overfitting
When a model is overtrained on the training data and underperforms on the test data, it is said to be overfit. As a result, the accuracy may be high on the training set but poor on the test set. Techniques like cross-validation and regularisation should be applied to solve this issue.
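As a brief sketch of both remedies (again assuming scikit-learn; the model and parameters are illustrative):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
# C controls the strength of L2 regularisation; 5-fold cross-validation
# averages performance over held-out splits, exposing overfitting.
model = LogisticRegression(C=1.0, max_iter=1000)
print(cross_val_score(model, X, y, cv=5).mean())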
Overfitting | Freepik
There are several techniques that can be used to address overfitting.
Data Bias
The model will produce biased predictions if the training dataset is biased. This may result in high accuracy on the training data, but performance on unseen data may be subpar. Techniques like data augmentation and resampling should be utilised to address this issue. Some other ways to address this problem are listed below:
Data Bias | Explorium
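As a concrete sketch of the resampling technique mentioned above (using scikit-learn's resample utility on synthetic data; all names here are illustrative):
import numpy as np
from sklearn.utils import resample

X = np.random.randn(1000, 5)
y = np.array([0] * 950 + [1] * 50)   # a heavily skewed label distribution
# Upsample the minority class until both classes have equal counts.
X_up, y_up = resample(X[y == 1], y[y == 1], replace=True, n_samples=950, random_state=0)
X_balanced = np.vstack([X[y == 0], X_up])
y_balanced = np.concatenate([y[y == 0], y_up])
print(np.bincount(y_balanced))       # [950 950]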
Confusion Matrix
A classification algorithm's performance is described using a confusion matrix. It is a table layout in which actual values are contrasted with predicted values to define the performance of a classification algorithm. Some ways to address this problem are:
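A quick sketch of computing one with scikit-learn (an illustrative choice of library, not mandated by the article):
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
# [[3 1]
#  [1 3]]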
Contribution of Classification Accuracy in Machine Learning
In conclusion, classification accuracy is a helpful metric for assessing a machine learning model's performance, but it can be deceptive. To acquire a more thorough perspective of the model's performance, additional metrics including precision, recall, F1-score, and confusion matrix should also be used. To overcome issues like imbalanced classes, overfitting, and data bias, techniques including cross-validation, normalisation, data augmentation, and re-sampling should be applied.
Original article source at: https://www.kdnuggets.com/
scikit-multilearn is a Python module capable of performing multi-label learning tasks. It is built on top of various scientific Python packages (numpy, scipy) and follows a similar API to that of scikit-learn.
Native Python implementation. A native Python implementation for a variety of multi-label classification algorithms. To see the list of all supported classifiers, check this link.
Interface to Meka. A Meka wrapper class is implemented for reference purposes and integration. This provides access to all methods available in MEKA, MULAN, and WEKA — the reference standard in the field.
Builds upon giants! Team-up with the power of numpy and scikit. You can use scikit-learn's base classifiers as scikit-multilearn's classifiers. In addition, the two packages follow a similar API.
In most cases you will want to follow the requirements defined in the requirements/*.txt files in the package.
scipy
numpy
future
scikit-learn
liac-arff # for loading ARFF files
requests # for dataset module
networkx # for networkX base community detection clusterers
python-louvain # for networkX base community detection clusterers
keras
python-igraph # for igraph library based clusterers
python-graphtool # for graphtool base clusterers
Note: Installing graphtool is complicated, please see: graphtool install instructions
To install scikit-multilearn, simply type the following command:
$ pip install scikit-multilearn
This will install the latest release from the Python package index. If you wish to install the bleeding-edge version, then clone this repository and run setup.py:
$ git clone https://github.com/scikit-multilearn/scikit-multilearn.git
$ cd scikit-multilearn
$ python setup.py install
Before proceeding to classification, this library assumes that you have a dataset with the following matrices:
x_train, x_test: training and test feature matrices of size (n_samples, n_features)
y_train, y_test: training and test label matrices of size (n_samples, n_labels)
Suppose we want to use a problem-transformation method called Binary Relevance, which treats each label as a separate single-label classification problem, with a support-vector machine (SVM) base classifier. We simply perform the following tasks:
# Import BinaryRelevance from skmultilearn
from skmultilearn.problem_transform import BinaryRelevance
# Import SVC classifier from sklearn
from sklearn.svm import SVC
# Setup the classifier
classifier = BinaryRelevance(classifier=SVC(), require_dense=[False,True])
# Train
classifier.fit(X_train, y_train)
# Predict
y_pred = classifier.predict(X_test)
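Continuing the snippet above, standard multi-label metrics from scikit-learn apply directly to the predictions (a sketch; the choice of metric depends on your problem):
from sklearn.metrics import accuracy_score, hamming_loss

# Subset accuracy: a sample counts only if its entire label set matches.
print(accuracy_score(y_test, y_pred))
# Hamming loss: the fraction of individual labels predicted incorrectly.
print(hamming_loss(y_test, y_pred))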
More examples and use-cases can be seen in the documentation. For using the MEKA wrapper, check this link.
This project is open for contributions. Here are some of the ways for you to contribute:
In case you want to implement your own multi-label classifier, please read our Developer's Guide to help you integrate your implementation in our API.
To make a contribution, just fork this repository, push the changes in your fork, open up an issue, and make a Pull Request!
We're also available in Slack! Just go to our slack group.
If you used scikit-multilearn in your research or project, please cite our work:
@ARTICLE{2017arXiv170201460S,
author = {{Szyma{\'n}ski}, P. and {Kajdanowicz}, T.},
title = "{A scikit-based Python environment for performing multi-label classification}",
journal = {ArXiv e-prints},
archivePrefix = "arXiv",
eprint = {1702.01460},
year = 2017,
month = feb
}
Author: Scikit-multilearn
Source Code: https://github.com/scikit-multilearn/scikit-multilearn
License: BSD-2-Clause license
#machinelearning #python #clustering #scikitlearn #classification
Supervised learning is the key to computer vision and deep learning. However, what happens when you don’t have access to large, human-labeled datasets? In this article, Toptal Computer Vision Developer Urwa Muaz demonstrates the potential of semi-supervised image classification using unlabeled datasets.
Supervised learning has been at the forefront of research in computer vision and deep learning over the past decade.
In a supervised learning setting, humans are required to annotate large amounts of data manually. Then, models use this data to learn complex underlying relationships between the data and labels, and develop the capability to predict labels given the data. Deep learning models are generally data-hungry and require enormous datasets to achieve good performance. Ever-improving hardware and the availability of large human-labeled datasets have been the reasons for the recent successes of deep learning.
One major drawback of supervised deep learning is that it relies on the presence of an extensive amount of human-labeled datasets for training. This luxury is not available across all domains as it might be logistically difficult and very expensive to get huge datasets annotated by professionals. While the acquisition of labeled data can be a challenging and costly endeavor, we usually have access to large amounts of unlabeled datasets, especially image and text data. Therefore, we need to find a way to tap into these underused datasets and use them for learning.
In the absence of large amounts of labeled data, we usually resort to using transfer learning. So what is transfer learning?
Transfer learning means using knowledge from a similar task to solve a problem at hand. In practice, it usually means using as initializations the deep neural network weights learned from a similar task, rather than starting from a random initialization of the weights, and then further training the model on the available labeled data to solve the task at hand.
Transfer learning enables us to train models on datasets as small as a few thousand examples, and it can deliver very good performance. Transfer learning from pretrained models can be performed in three ways:
Usually, the last layers of the neural network are doing the most abstract and task-specific calculations, which are generally not easily transferable to other tasks. By contrast, the initial layers of the network learn some basic features like edges and common shapes, which are easily transferable across tasks.
The image sets below depict what the convolution kernels at different levels in a convolutional neural network (CNN) are essentially learning. We see a hierarchical representation, with the initial layers learning basic shapes, and progressively, higher layers learning more complex semantic concepts.
A common practice is to take a model pretrained on large labeled image datasets (such as ImageNet) and chop off the fully connected layers at the end. New, fully connected layers are then attached and configured according to the required number of classes. Transferred layers are frozen, and the new layers are trained on the available labeled data for your task.
In this setup, the pretrained model is being used as a feature extractor, and the fully connected layers on top can be considered a shallow classifier. This setup is more robust against overfitting, as the number of trainable parameters is relatively small, so this configuration works well when the available labeled data is very scarce. What size of dataset qualifies as very small is usually a tricky question with many aspects to consider, including the problem at hand and the size of the model backbone. Roughly speaking, I would use this strategy for a dataset consisting of a couple of thousand images.
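A minimal sketch of this setup in PyTorch (the article is framework-agnostic; torchvision and the 10-class head are assumptions for illustration):
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(pretrained=True)   # transferred layers
for param in backbone.parameters():
    param.requires_grad = False                # freeze the feature extractor
backbone.fc = nn.Linear(backbone.fc.in_features, 10)  # new shallow classifier head
# Only the new head's parameters are trainable now.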
Alternatively, we can transfer the layers from a pretrained network and train the entire network on the available labeled data. This setup needs a little more labeled data because you are training the entire network and hence a large number of parameters. This setup is more prone to overfitting when there is a scarcity of data.
This approach is my personal favorite and usually yields the best results, at least in my experience. Here, we train the newly attached layers while freezing the transferred layers for a few epochs before fine-tuning the entire network.
Fine-tuning the entire network without giving a few epochs to the final layers can result in the propagation of harmful gradients from randomly initialized layers to the base network. Furthermore, fine-tuning requires a comparatively smaller learning rate, and a two-stage approach is a convenient solution to it.
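Continuing the sketch above, the two-stage schedule might look like this (optimizer choices and learning rates are illustrative assumptions, not the author's exact recipe):
import torch

# Stage 1: train only the newly attached head for a few epochs.
head_optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)

# Stage 2: unfreeze everything and fine-tune with a much smaller learning rate.
for param in backbone.parameters():
    param.requires_grad = True
full_optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-5)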
This usually works very well for most image classification tasks because we have huge image datasets like ImageNet that cover a good portion of possible image space—and usually, weights learned from it are transferable to custom image classification tasks. Moreover, the pretrained networks are readily available off the shelf, thus facilitating the process.
However, this approach will not work well if the distribution of images in your task is drastically different from the images that the base network was trained on. For example, if you are dealing with grayscale images generated by a medical imaging device, transfer learning from ImageNet weights will not be that effective and you will need more than a couple of thousand labeled images for training your network to satisfactory performance.
In contrast, you might have access to large amounts of unlabeled datasets for your problem. That is why the ability to learn from unlabeled datasets is crucial. Additionally, the unlabeled dataset is typically far greater in variety and volume than even the largest labeled datasets.
Semi-supervised approaches have been shown to yield performance superior to supervised approaches on large benchmarks like ImageNet. Yann LeCun's famous cake analogy stresses the importance of unsupervised learning:
This approach leverages both labeled and unlabeled data for learning, hence it is termed semi-supervised learning. This is usually the preferred approach when you have a small amount of labeled data and a large amount of unlabeled data. There are techniques where you learn from labeled and unlabeled data simultaneously, but we will discuss the problem in the context of a two-stage approach: unsupervised learning on unlabeled data, and transfer learning using one of the strategies described above to solve your classification task.
In these cases, unsupervised learning is a rather confusing term. These approaches are not truly unsupervised in the sense that there is a supervision signal that guides the learning of weights, but the supervision signal is derived from the data itself. Hence, it is sometimes referred to as self-supervised learning, but these terms have been used interchangeably in the literature to refer to the same approach.
The major techniques in self-supervised learning can be divided by how they generate this supervision signal from the data, as discussed below.
Generative methods aim at the accurate reconstruction of data after passing it through a bottleneck. One example of such networks is autoencoders. They reduce the input into a low-dimensional representation space using an encoder network and reconstruct the image using the decoder network.
In this setup, the input itself becomes the supervision signal (label) for training the network. The encoder network can then be extracted and used as a starting point to build your classifier, using one of the transfer learning techniques discussed in the section above.
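A toy autoencoder sketch in PyTorch (sizes assume flattened 28x28 images; purely illustrative):
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 64), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training minimizes reconstruction error, e.g. nn.MSELoss()(model(x), x.flatten(1));
# afterwards model.encoder can seed a downstream classifier.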
Similarly, another form of generative networks - Generative Adversarial Networks (GANs) - can be used for pretraining on unlabeled data. Then, a discriminator can be adopted and further fine-tuned for the classification task.
Discriminative approaches train a neural network to learn an auxiliary classification task. An auxiliary task is chosen such that the supervision signal can be derived from the data itself, without human annotation.
Examples of this type of tasks are learning the relative positions of image patches, colorizing grayscale images, or learning the geometric transformations applied on images. We will discuss two of them in further detail.
In this technique, image patches are extracted from the source image to form a jigsaw puzzle-like grid. The patch positions are shuffled, and the shuffled input is fed into the network, which is trained to correctly predict the location of each patch in the grid. Thus, the supervision signal is the actual position of each patch in the grid.
In learning to do that, the network learns the relative structure and orientation of objects as well as the continuity of low-level visual features like color. The results show that the features learned by solving this jigsaw puzzle are highly transferable to tasks like image classification and object detection.
These approaches apply a small set of geometric transformations to the input images and train a classifier to predict the applied transformation by looking at the transformed image alone. One example of these approaches is to apply a 2D rotation to the unlabeled images to obtain a set of rotated images and then train the network to predict the rotation of each image.
This simple supervision signal forces the network to learn to localize the objects in an image and understand their orientation. Features learned by these approaches have proven to be highly transferable and yield state-of-the-art performance for classification tasks in semi-supervised settings.
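A sketch of generating the rotation pretext labels in PyTorch (the four-way rotation scheme follows the description above; the helper name is made up):
import torch

def rotation_batch(images):
    """Rotate each image in an (N, C, H, W) batch by 0/90/180/270 degrees;
    the rotation index becomes the classification label."""
    rotated, labels = [], []
    for k in range(4):
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)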
These approaches project the images into a fixed-size representation space where similar images are closer together and different images are farther apart. One way to achieve this is to use siamese networks based on triplet loss, which minimizes the distance between semantically similar images. Triplet loss needs an anchor, a positive example, and a negative example, and tries to bring the positive closer to the anchor than the negative in terms of Euclidean distance in the latent space. The anchor and positive are from the same class, and the negative example is chosen randomly from the remaining classes.
In unlabeled data, we need to come up with a strategy to produce this triplet of anchor positive and negative examples without knowing the classes of images. One way to do so is to use a random affine transformation of the anchor image as a positive example and randomly select another image as a negative example.
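In PyTorch terms, the loss itself is one call (a sketch; the embeddings come from whatever siamese network you train, and the variable names are illustrative):
import torch.nn.functional as F

# anchor: embedding of the image; positive: embedding of its affine-transformed
# copy; negative: embedding of a randomly chosen other image.
loss = F.triplet_margin_loss(anchor_emb, positive_emb, negative_emb, margin=1.0)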
In this section, I will relate an experiment that empirically establishes the potential of unsupervised pre-training for image classification. This was my semester project for a Deep Learning class I took with Yann LeCun at NYU last spring.
We trained seven models, each using a different number of labeled training examples per class. This was done to understand how the size of the training data influences the performance of our semi-supervised setup.
We were able to get an 82% accuracy rate for pre-training on rotation classification. For classifier training, the top-5 accuracy saturated around 46.24%, and fine-tuning of the entire network yielded a final figure of 50.17%. By leveraging the pre-training, we got better performance than supervised training, which gives 40% top-5 accuracy.
As expected, the validation accuracy decreases with the decrease in labeled training data. However, the decrease in performance is not as significant as one would expect in a supervised setting. A 50% decrease in training data from 64 examples per class to 32 examples per class only results in a 15% decrease in the validation accuracy.
By using only 32 examples per class, our semi-supervised model achieves superior performance to the supervised model trained using 64 examples per class. This provides empirical evidence of the potential of semi-supervised approaches for image classification on low-resource labeled datasets.
We can conclude that unsupervised learning is a powerful paradigm that has the capability to boost performance for low-resource datasets. Unsupervised learning is currently in its infancy but will gradually expand its share in the computer vision space by enabling learning from cheap and easily accessible unlabeled data.
Original article source at: https://www.toptal.com/
This package contains a collection of tools to perform fundamental and advanced chemometric analyses in Julia. It is currently richer than any other free chemometrics package available in any other language. If you are unfamiliar with chemometrics, it could inelegantly be described as the marriage between data science and chemistry. Traditionally it is the symbiosis of applied linear algebra/statistics disciplined by the physics and meaning of chemical measurements. This is somewhat orthogonal to most specializations of machine learning, where "add more layers" is the modus operandi. Sometimes chemometricians also weigh the pros and cons of black-box modelling and break out pure machine learning methods - so some of those techniques are in this package.
ChemometricsTools has been accepted as an official Julia package! Yep, so you can Pkg.add("ChemometricsTools") to install it. A lot of features have been added since the first public release (v0.2.3). In 0.5.8, almost all of the functionality available can be used/abused. If you find a bug or want a new feature, don't be shy - file an issue. In v0.5.1, Plots was removed as a dependency, new plot recipes were added, and now the package compiles much faster! Multilinear modeling, univariate modeling, and DOE functions are now available. Making headway into the release plan for v0.6.0: convenience functions, documentation, bug fixes, refactoring, and cleanup are in progress - bear with me. The git repo's master branch typically has the most advanced version, but its features may be less reliable because I like to do development on it.
So my time and efforts for building this package are constrained. I really would like to find some collaborators to help flesh this package out, use it, and find bugs. Even if your interests lean more towards machine learning/statistics, I'd love to hear from you. Please file an issue if you are interested - or send me a message on Julia Discourse (ckneale)!
Package Highlights
Two design choices introduced in this package are "Transformations" and "Pipelines". We can use transformations to treat data from multiple sources the same way. This helps mitigate user error for cases where test data is scaled based on training data, calibration transfer, etc.
Multiple transformations can easily be chained together and stored using "Pipelines". Pipelines aren't "pipes" like those present in Bash, R, and base Julia. They are flexible, yet immutable, convenience objects that allow sequential preprocessing and data transformations to be reused, chained, or automated for reliable analytic throughput.
ChemometricsTools offers easy-to-use iterators for K-fold validation and moving-window sampling/training. More advanced sampling methods, like Kennard-Stone, are just a function call away. Convenience functions for interval selections, weighting regression ensembles, etc. are also available. These allow ensemble models like SIPLS, P-DS, P-OSC, etc. to be built quickly. With the tools included both in this package and base Julia, nothing should stand in your way.
This package features dozens of regression performance metrics, and a few built-in plots (Bland-Altman, QQ, interval overlays, etc.) are included. The list of regression methods currently includes: CLS, Ridge, Kernel Ridge, LS-SVM, PCR, PLS(1/2), ELMs, Regression Trees, Random Forest, Monotone Regression... More to come. Chemometricians love regressions! I've also added some convenience functions for univariate calibrations, standard addition experiments, and some automated plot functions for them.
In-house classification encodings (one cold/one hot), and easy-to-retrieve global or multiclass performance statistics. ChemometricsTools currently includes: LDA/PCA with Gaussian discriminants, Hierarchical LDA, SIMCA, multinomial softmax/logistic regression, PLS-DA, K-NN, Gaussian Naive Bayes, Classification Trees, Random Forest, Probabilistic Neural Networks, LinearPerceptrons, and more to come. You can also conveniently dump classification statistics to LaTeX/CSV reports!
I've been working to fulfill an obvious gap in the available tooling. Standard methods for Tucker decomposition (HOSVD, and HOOI) have been included. Some preprocessing methods, and even an early view at multilinear PLS. There's a lot that could be done here, please feel free to contribute!
This package has tools for specialized fields of analysis. For instance, fractional derivatives for electrochemists (and the adventurous), a handful of smoothing methods for spectroscopists, curve resolution (unimodal and nonnegativity constraints available) for forensics, process fault detection methods, etc. There are certainly plans for other tools for analyzing chemical data that packages in other languages have seemingly left out. Stay tuned.
Please check out ChemometricsData.jl for access to more publicly available datasets.
Right now the 2002 International Diffuse Reflectance Conference Pharmaceutical NIR, iris, Tecator aka 'meat', and ball gear fault detection (NASA) dataset are included in this package. But, this will be factored out eventually into ChemometricsData.jl.
I'd love for a collaborator to contribute some: spectra, chromatograms, etc. Please reach out to me if you wish to collaborate/contribute. In the mean time you can load in your own datasets using the full extent of Julia ecosystem (XLSX.jl, CSV.jl, JSON.jl, MATLAB.jl, LibPQ.jl, Feather.jl, Arrow.jl, etc).
Well, I'd love to hammer in some time-series methods. That was originally part of the plan. Then I realized OnlineStats.jl already has the essentials for online learning covered, and there are many efforts for actual time-series modelling (TimeSeries.jl: https://github.com/JuliaStats/TimeSeries.jl) in the works.
Similarly, if clustering methods are important to you, check out Clustering.jl. I may add a few supportive odds and ends in here (or contribute to the packages directly) but really, most of the Julia 1.0+ ecosystem is really reliable, well made, and community supported.
Author: Caseykneale
Source Code: https://github.com/caseykneale/ChemometricsTools.jl
License: View license
The SeaClass R Package
The Advanced Analytics group at Seagate Technology has decided to share an internal project which helps accelerate development for classification problems. The interactive SeaClass tool is contained in an R-based package built using R Shiny and other CRAN packages commonly used for binary classification. The package is free to use and develop further, but any analysis mistakes are the sole responsibility of the user. Check out the demo video here.
The SeaClass R package provides tools for analyzing classification problems. In particular, specialized tools are available for addressing the problem of imbalanced data sets. The SeaClass application provides an easy-to-use interface which requires only minimal R programming knowledge to get started, and can be launched using the RStudio Addins menu. The application allows the user to explore numerous methods by simply clicking on the available options and interacting with the generated results. The user can choose to download the code for any procedures they wish to explore further. SeaClass was designed to jump-start the analysis process for both novice and advanced R users. See the screenshots below for one demonstration.
The SeaClass application depends on numerous R packages. To install SeaClass and its dependencies run:
install.packages('devtools')
devtools::install_github('ChrisDienes/SeaClass')
Step 1. Begin by loading and preparing your data in R. Some general advice:
Step 2. After data preparation, start the application by either loading SeaClass from the RStudio Addins dropdown menu or by loading the SeaClass function from the command line. For example:
library(SeaClass)
### Make some fake data:
X <- matrix(rnorm(10000,0,1),ncol=10,nrow=1000)
X[1:100,1:2] <- X[1:100,1:2] + 3
Y <- c(rep(1,100), rep(0,900))
Fake_Data <- data.frame(Y = Y , X)
### Load the SeaClass rare failure data:
data("rareFailData")
### Start the interactive GUI:
SeaClass()
If the application fails to load, you may need to first specify your favorite browser path. For example:
options(browser = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe")
Step 3. The user has various options for configuring their analysis within the GUI. Once the analysis runs, the user can view the results, interact with the results (module dependent), save the underlying R script, or start over. Additional help is provided within the application. See above screenshots for one depiction of these steps.
Step 4. Besides the SeaClass function, several other functions are contained within the library. For example:
### List available functions:
ls("package:SeaClass")
### Note this is a sample data set:
# data(rareFailData)
### Note code_output is a support function for SeaClass, not for general use.
### View help:
?accuracy_threshold
### Run example from help file:
### General Use: ###
set.seed(123)
x <- c(rnorm(100,0,1),rnorm(100,2,1))
group <- c(rep(0,100),rep(2,100))
accuracy_threshold(x=x, group=group, pos_class=2)
accuracy_threshold(x=x, group=group, pos_class=0)
### Bagged Example ###
set.seed(123)
replicate_function = function(index){accuracy_threshold(x=x[index], group=group[index], pos_class=2)[[2]]}
sample_cuts <- replicate(100, {
  sample_index = sample.int(n=length(x), replace=TRUE)
  replicate_function(index=sample_index)
})
bagged_scores <- sapply(x, function(x) mean(x > sample_cuts))
unbagged_cut <- accuracy_threshold(x=x, group=group, pos_class=2)[[2]]
unbagged_scores <- ifelse(x > unbagged_cut, 1, 0)
# Compare AUC:
PRROC::roc.curve(scores.class0 = bagged_scores,weights.class0 = ifelse(group==2,1,0))[[2]]
PRROC::roc.curve(scores.class0 = unbagged_scores,weights.class0 = ifelse(group==2,1,0))[[2]]
bagged_prediction <- ifelse(bagged_scores > 0.50, 2, 0)
unbagged_prediction <- ifelse(x > unbagged_cut, 2, 0)
# Compare Confusion Matrix:
table(bagged_prediction, group)
table(unbagged_prediction, group)
Author: ChrisDienes
Source Code: https://github.com/ChrisDienes/SeaClass
An optimal Bayesian classification library and runtime for RNA-Seq data.
Pkg.update()
Pkg.clone("git://github.com/binarybana/OBC.jl.git")
You are now ready to use the OBC Julia library. The core operations look something like the following:
using OBC
data1,data2 = ... # your datasets as integer valued matrices (samples x genes)
d1,d2 = ... # the normalization factors for each dataset (float arrays)
cls = MPM.mpm_classifier(data1, data2, d1=d1, d2=d2)
MPM.sample(cls, 10000)
bemc = MPM.bee_e_mc(cls, (mean(d1),mean(d2)))
For a full example (with code to generate synthetic data), see the run.jl runner script.
Author: Binarybana
Source Code: https://github.com/binarybana/OBC.jl
License: View license
A library for classifying text into multiple categories.
Currently provided classifiers:
I ran a benchmark of 1,345 items that I had previously manually classified with multiple categories. Here's the rate at which the two algorithms correctly detected one of those categories:
I prefer the Naive Bayes approach because, while it has lower stats on this benchmark, it seems to make better decisions than I did in many cases. For example, an item with the title "Paintball Session, 100 Balls and Equipment" was classified as "Activities" by me, but the Bayes classifier identified it as "Sports", at which point I had an intellectual orgasm. Also, the Tf-Idf classifier seems to do better on clear-cut cases, but doesn't seem to handle uncertainty so well. Of course, these are just quick tests I made and I have no idea which is really better.
gem install stuff-classifier
You either instantiate one class or the other. Both have the same signature:
require 'stuff-classifier'
# for the naive bayes implementation
cls = StuffClassifier::Bayes.new("Cats or Dogs")
# for the Tf-Idf based implementation
cls = StuffClassifier::TfIdf.new("Cats or Dogs")
# these classifiers use word stemming by default, but if it has weird
# behavior, then you can disable it on init:
cls = StuffClassifier::TfIdf.new("Cats or Dogs", :stemming => false)
# also by default, the parsing phase filters out stop words, to
# disable or to come up with your own list of stop words, on a
# classifier instance you can do this:
cls.ignore_words = [ 'the', 'my', 'i', 'dont' ]
Training the classifier:
cls.train(:dog, "Dogs are awesome, cats too. I love my dog")
cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")
cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
cls.train(:dog, "So which one should you choose? A dog, definitely.")
cls.train(:cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy")
cls.train(:dog, "A dog will eat anything, including birds or whatever meat")
cls.train(:cat, "My cat's favorite place to purr is on my keyboard")
cls.train(:dog, "My dog's favorite place to take a leak is the tree in front of our house")
And finally, classifying stuff:
cls.classify("This test is about cats.")
#=> :cat
cls.classify("I hate ...")
#=> :cat
cls.classify("The most annoying animal on earth.")
#=> :cat
cls.classify("The preferred company of software developers.")
#=> :cat
cls.classify("My precious, my favorite!")
#=> :cat
cls.classify("Get off my keyboard!")
#=> :cat
cls.classify("Kill that bird!")
#=> :cat
cls.classify("This test is about dogs.")
#=> :dog
cls.classify("Cats or Dogs?")
#=> :dog
cls.classify("What pet will I love more?")
#=> :dog
cls.classify("Willy, where the heck are you?")
#=> :dog
cls.classify("I like big buts and I cannot lie.")
#=> :dog
cls.classify("Why is the front door of our house open?")
#=> :dog
cls.classify("Who is eating my meat?")
#=> :dog
The following layers for saving the training data between sessions are implemented:
To persist the data in Redis, you can do this:
# defaults to redis running on localhost on default port
store = StuffClassifier::RedisStorage.new(@key)
# pass in connection args
store = StuffClassifier::RedisStorage.new(@key, {host:'my.redis.server.com', port: 4829})
To persist the data on disk, you can do this:
store = StuffClassifier::FileStorage.new(@storage_path)
# global setting
StuffClassifier::Base.storage = store
# or alternative local setting on instantiation, by means of an
# optional param ...
cls = StuffClassifier::Bayes.new("Cats or Dogs", :storage => store)
# after training is done, to persist the data ...
cls.save_state
# or you could just do this:
StuffClassifier::Bayes.open("Cats or Dogs") do |cls|
# when done, save_state is called on END
end
# to start fresh, deleting the saved training data for this classifier
StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true)
The name you give your classifier is important, as the data will be loaded and saved based on it. For instance, the following 3 classifiers will be stored in different buckets, independent of each other.
cls1 = StuffClassifier::Bayes.new("Cats or Dogs")
cls2 = StuffClassifier::Bayes.new("True or False")
cls3 = StuffClassifier::Bayes.new("Spam or Ham")
This repository is no longer maintained for some time. If you're interested in maintaining a fork, contact the author such that I can place a link here.
Author: Alexandru
Source Code: https://github.com/alexandru/stuff-classifier
License: MIT license
A Naive Bayes text classification implementation as an OmniCat classifier strategy.
Add this line to your application's Gemfile:
gem 'omnicat-bayes'
And then execute:
$ bundle
Or install it yourself as:
$ gem install omnicat-bayes
See rdoc for detailed usage.
Optional configuration sample:
OmniCat.configure do |config|
  # you can enable auto train mode by :unique or :continues
  # unique: only unique docs will be added to training docs on prediction
  # continues: always add docs to training docs on prediction
  config.auto_train = :off
  config.exclude_tokens = ['something', 'anything'] # exclude token list
  config.token_patterns = {
    # exclude tokens with Regex patterns
    minus: [/[\s\t\n\r]+/, /(@[\w\d]+)/],
    # include tokens with Regex patterns
    plus: [/[\p{L}\-0-9]{2,}/, /[\!\?]/, /[\:\)\(\;\-\|]{2,3}/]
  }
end
Create a classifier object with Bayes strategy.
# If you need to change the strategy at runtime, you should prefer this initialization
bayes = OmniCat::Classifier.new(OmniCat::Classifiers::Bayes.new)
or
# If you only need Bayes classification, then you can use
bayes = OmniCat::Classifiers::Bayes.new
Create a classification category.
bayes.add_category('positive')
bayes.add_category('negative')
Train category with a document.
bayes.train('positive', 'great if you are in a slap happy mood .')
bayes.train('negative', 'bad tracking issue')
Untrain category with a document.
bayes.untrain('positive', 'great if you are in a slap happy mood .')
bayes.untrain('negative', 'bad tracking issue')
Train category with multiple documents.
bayes.train_batch('positive', [
  'a feel-good picture in the best sense of the term...',
  'it is a feel-good movie about which you can actually feel good.',
  'love and money both of them are good choises'
])
bayes.train_batch('negative', [
  'simplistic , silly and tedious .',
  'interesting , but not compelling . ',
  'seems clever but not especially compelling'
])
Untrain category with multiple documents.
bayes.untrain_batch('positive', [
  'a feel-good picture in the best sense of the term...',
  'it is a feel-good movie about which you can actually feel good.',
  'love and money both of them are good choises'
])
bayes.untrain_batch('negative', [
  'simplistic , silly and tedious .',
  'interesting , but not compelling . ',
  'seems clever but not especially compelling'
])
Classify a document.
result = bayes.classify('I feel so good and happy')
=> #<OmniCat::Result:0x007febb152af68 @top_score_key="positive", @scores={"positive"=>#<OmniCat::Score:0x007febb152add8 @key="positive", @value=6.813226744186048e-09, @percentage=58>, "negative"=>#<OmniCat::Score:0x007febb152ac70 @key="negative", @value=4.875003449064939e-09, @percentage=42>}, @total_score=1.1688230193250986e-08>
result.to_hash
=> {:top_score_key=>"positive", :scores=>{"positive"=>{:key=>"positive", :value=>6.813226744186048e-09, :percentage=>58}, "negative"=>{:key=>"negative", :value=>4.875003449064939e-09, :percentage=>42}}, :total_score=>1.1688230193250986e-08}
result.top_score
=> #<OmniCat::Score:0x007febb152add8 @key="positive", @value=6.813226744186048e-09, @percentage=58>
result.top_score.to_hash
=> {:key=>"positive", :value=>6.813226744186048e-09, :percentage=>58}
Classify multiple documents at a time.
results = bayes.classify_batch(
  [
    'the movie is silly so not compelling enough',
    'a good piece of work'
  ]
)
=> [#<OmniCat::Result:0x007febb14f3680 @top_score_key="negative", @scores={"positive"=>#<OmniCat::Score:0x007febb14f34a0 @key="positive", @value=7.971480930520432e-14, @percentage=22>, "negative"=>#<OmniCat::Score:0x007febb14f32c0 @key="negative", @value=2.834304330851709e-13, @percentage=78>}, @total_score=3.6314524239037524e-13>, #<OmniCat::Result:0x007febb14f2aa0 @top_score_key="positive", @scores={"positive"=>#<OmniCat::Score:0x007febb14f2960 @key="positive", @value=3.802731206057328e-07, @percentage=72>, "negative"=>#<OmniCat::Score:0x007febb14f2820 @key="negative", @value=1.4625010347194818e-07, @percentage=28>}, @total_score=5.26523224077681e-07>]
Convert full Bayes object to hash.
# For storing, restoring modal data
bayes_hash = bayes.to_hash
=> {:categories=>{"positive"=>{:doc_count=>4, :docs=>{"28fd29bbf840c86db65e510ff3cd07a9"=>{:content=>"great if you are in a slap happy mood .", :content_md5=>"28fd29bbf840c86db65e510ff3cd07a9", :count=>1, :tokens=>{"great"=>1, "if"=>1, "you"=>1, "are"=>1, "in"=>1, "slap"=>1, "happy"=>1, "mood"=>1}}, "82b4cd9513f448dea0024f2d0e2ccd44"=>{:content=>"a feel-good picture in the best sense of the term...", :content_md5=>"82b4cd9513f448dea0024f2d0e2ccd44", :count=>1, :tokens=>{"feel-good"=>1, "picture"=>1, "in"=>1, "the"=>2, "best"=>1, "sense"=>1, "of"=>1, "term"=>1}}, "f917bf1cf1256c78c5436d850dab3104"=>{:content=>"it is a feel-good movie about which you can actually feel good.", :content_md5=>"f917bf1cf1256c78c5436d850dab3104", :count=>1, :tokens=>{"it"=>1, "is"=>1, "feel-good"=>1, "movie"=>1, "about"=>1, "which"=>1, "you"=>1, "can"=>1, "actually"=>1, "feel"=>1, "good"=>1}}, "4343bbe84c035733708c3f58136f321e"=>{:content=>"love and money both of them are good choises", :content_md5=>"4343bbe84c035733708c3f58136f321e", :count=>1, :tokens=>{"love"=>1, "and"=>1, "money"=>1, "both"=>1, "of"=>1, "them"=>1, "are"=>1, "good"=>1, "choises"=>1}}}, :name=>"positive", :tokens=>{"great"=>1, "if"=>1, "you"=>2, "are"=>2, "in"=>2, "slap"=>1, "happy"=>1, "mood"=>1, "feel-good"=>2, "picture"=>1, "the"=>2, "best"=>1, "sense"=>1, "of"=>2, "term"=>1, "it"=>1, "is"=>1, "movie"=>1, "about"=>1, "which"=>1, "can"=>1, "actually"=>1, "feel"=>1, "good"=>2, "love"=>1, "and"=>1, "money"=>1, "both"=>1, "them"=>1, "choises"=>1}, :token_count=>37, :prior=>0.5}, "negative"=>{:doc_count=>4, :docs=>{"89b36e774579662591ea21b3283d9b35"=>{:content=>"bad tracking issue", :content_md5=>"89b36e774579662591ea21b3283d9b35", :count=>1, :tokens=>{"bad"=>1, "tracking"=>1, "issue"=>1}}, "b0ec48bc87527e285b26d6cce8e278e7"=>{:content=>"simplistic , silly and tedious .", :content_md5=>"b0ec48bc87527e285b26d6cce8e278e7", :count=>1, :tokens=>{"simplistic"=>1, "silly"=>1, "and"=>1, "tedious"=>1}}, "ae9d4fbaf40906614ca712a888648c5f"=>{:content=>"interesting , but not compelling . ", :content_md5=>"ae9d4fbaf40906614ca712a888648c5f", :count=>1, :tokens=>{"interesting"=>1, "but"=>1, "not"=>1, "compelling"=>1}}, "0e495f5d88d8049746a1b6961bf3cc90"=>{:content=>"seems clever but not especially compelling", :content_md5=>"0e495f5d88d8049746a1b6961bf3cc90", :count=>1, :tokens=>{"seems"=>1, "clever"=>1, "but"=>1, "not"=>1, "especially"=>1, "compelling"=>1}}}, :name=>"negative", :tokens=>{"bad"=>1, "tracking"=>1, "issue"=>1, "simplistic"=>1, "silly"=>1, "and"=>1, "tedious"=>1, "interesting"=>1, "but"=>2, "not"=>2, "compelling"=>2, "seems"=>1, "clever"=>1, "especially"=>1}, :token_count=>17, :prior=>0.5}}, :category_count=>2, :category_size_limit=>0, :doc_count=>8, :token_count=>54, :unique_token_count=>43, :k_value=>1.0}
Load the full Bayes object from a hash.
another_bayes_obj = OmniCat::Classifiers::Bayes.new(bayes_hash)
=> #<OmniCat::Classifiers::Bayes:0x007febb14d15a8 @categories={"positive"=>#<OmniCat::Classifiers::BayesInternals::Category:0x007febb14d1530 @doc_count=4, @docs={"28fd29bbf840c86db65e510ff3cd07a9"=>{:content=>"great if you are in a slap happy mood .", :content_md5=>"28fd29bbf840c86db65e510ff3cd07a9", :count=>1, :tokens=>{"great"=>1, "if"=>1, "you"=>1, "are"=>1, "in"=>1, "slap"=>1, "happy"=>1, "mood"=>1}}, "82b4cd9513f448dea0024f2d0e2ccd44"=>{:content=>"a feel-good picture in the best sense of the term...", :content_md5=>"82b4cd9513f448dea0024f2d0e2ccd44", :count=>1, :tokens=>{"feel-good"=>1, "picture"=>1, "in"=>1, "the"=>2, "best"=>1, "sense"=>1, "of"=>1, "term"=>1}}, "f917bf1cf1256c78c5436d850dab3104"=>{:content=>"it is a feel-good movie about which you can actually feel good.", :content_md5=>"f917bf1cf1256c78c5436d850dab3104", :count=>1, :tokens=>{"it"=>1, "is"=>1, "feel-good"=>1, "movie"=>1, "about"=>1, "which"=>1, "you"=>1, "can"=>1, "actually"=>1, "feel"=>1, "good"=>1}}, "4343bbe84c035733708c3f58136f321e"=>{:content=>"love and money both of them are good choises", :content_md5=>"4343bbe84c035733708c3f58136f321e", :count=>1, :tokens=>{"love"=>1, "and"=>1, "money"=>1, "both"=>1, "of"=>1, "them"=>1, "are"=>1, "good"=>1, "choises"=>1}}}, @name="positive", @tokens={"great"=>1, "if"=>1, "you"=>2, "are"=>2, "in"=>2, "slap"=>1, "happy"=>1, "mood"=>1, "feel-good"=>2, "picture"=>1, "the"=>2, "best"=>1, "sense"=>1, "of"=>2, "term"=>1, "it"=>1, "is"=>1, "movie"=>1, "about"=>1, "which"=>1, "can"=>1, "actually"=>1, "feel"=>1, "good"=>2, "love"=>1, "and"=>1, "money"=>1, "both"=>1, "them"=>1, "choises"=>1}, @token_count=37, @prior=0.5>, "negative"=>#<OmniCat::Classifiers::BayesInternals::Category:0x007febb14d14e0 @doc_count=4, @docs={"89b36e774579662591ea21b3283d9b35"=>{:content=>"bad tracking issue", :content_md5=>"89b36e774579662591ea21b3283d9b35", :count=>1, :tokens=>{"bad"=>1, "tracking"=>1, "issue"=>1}}, "b0ec48bc87527e285b26d6cce8e278e7"=>{:content=>"simplistic , silly and tedious .", :content_md5=>"b0ec48bc87527e285b26d6cce8e278e7", :count=>1, :tokens=>{"simplistic"=>1, "silly"=>1, "and"=>1, "tedious"=>1}}, "ae9d4fbaf40906614ca712a888648c5f"=>{:content=>"interesting , but not compelling . ", :content_md5=>"ae9d4fbaf40906614ca712a888648c5f", :count=>1, :tokens=>{"interesting"=>1, "but"=>1, "not"=>1, "compelling"=>1}}, "0e495f5d88d8049746a1b6961bf3cc90"=>{:content=>"seems clever but not especially compelling", :content_md5=>"0e495f5d88d8049746a1b6961bf3cc90", :count=>1, :tokens=>{"seems"=>1, "clever"=>1, "but"=>1, "not"=>1, "especially"=>1, "compelling"=>1}}}, @name="negative", @tokens={"bad"=>1, "tracking"=>1, "issue"=>1, "simplistic"=>1, "silly"=>1, "and"=>1, "tedious"=>1, "interesting"=>1, "but"=>2, "not"=>2, "compelling"=>2, "seems"=>1, "clever"=>1, "especially"=>1}, @token_count=17, @prior=0.5>}, @category_count=2, @category_size_limit=0, @doc_count=8, @token_count=54, @unique_token_count=43, @k_value=1.0>
another_bayes_obj.classify('best senses')
=> #<OmniCat::Result:0x007febb14c0fc8 @top_score_key="positive", @scores={"positive"=>#<OmniCat::Score:0x007febb14c0ed8 @key="positive", @value=0.00029069767441860465, @percentage=52>, "negative"=>#<OmniCat::Score:0x007febb14c0de8 @key="negative", @value=0.0002704164413196322, @percentage=48>}, @total_score=0.0005611141157382368>
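The hash round-trip above enables simple model persistence. Below is a minimal sketch, assuming bayes is the trained classifier from the earlier examples; it uses Ruby's built-in Marshal so the symbol keys survive the round trip, and the file name is illustrative.
# Persist the trained model to disk (Marshal preserves symbol keys)
File.open('bayes_model.dump', 'wb') { |f| f.write(Marshal.dump(bayes.to_hash)) }
# Later: restore the model and classify without re-training
restored_hash = Marshal.load(File.binread('bayes_model.dump'))
restored_bayes = OmniCat::Classifiers::Bayes.new(restored_hash)
restored_bayes.classify('best senses')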
For Bayes classification, always try to train roughly the same number of documents for each category. For that reason, do not activate auto-training mode: it unbalances the document counts across categories and skews the algorithm's results. To get the best results on text classification, apply cleaning steps such as spellchecking, stemming, and stop-word removal before training and prediction, as in the sketch below.
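Here is a minimal pre-processing sketch in plain Ruby; the clean helper and the stop-word list are illustrative, not part of OmniCat.
# Hypothetical helper: normalize text before training or classification
STOP_WORDS = %w[a an and are in is it of the you].freeze

def clean(text)
  text.downcase
      .gsub(/[^a-z\s-]/, '') # strip punctuation and digits, keep hyphens
      .split
      .reject { |token| STOP_WORDS.include?(token) }
      .join(' ')
end

bayes.train('positive', clean('Great, if you are in a slap happy mood!'))
bayes.classify(clean('It is a feel-good movie.'))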
git checkout -b my-new-feature
git commit -am 'Add some feature'
git push origin my-new-feature
Author: Mustafaturan
Source Code: https://github.com/mustafaturan/omnicat-bayes
License: MIT license
A generalized framework for text classification.
Add this line to your application's Gemfile:
gem 'omnicat'
And then execute:
$ bundle
Or install it yourself as:
$ gem install omnicat
The stand-alone version of omnicat is just a strategy holder for developers. Its aim is to provide a unified ("omnified") set of methods for text classification gems, with lossless conversion from one strategy to another. End users should see the 'classifier strategies' section and the 'changing classifier strategy' subsection.
OmniCat allows you to change the strategy at runtime.
# Declare a classifier with the Naive Bayes strategy
classifier = OmniCat::Classifier.new(OmniCat::Classifiers::Bayes.new())
...
# do some operations like adding category, training, etc...
...
# make some classification using Bayes
classifier.classify('I am happy :)')
...
# change strategy to Support Vector Machine (SVM) at runtime
classifier.strategy = OmniCat::Classifiers::SVM.new
# no need to re-train or re-add categories
# just classify with the new strategy
classifier.classify('I am happy :)')
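For reference, here is a minimal self-contained sketch of the same flow with the Bayes strategy from the omnicat-bayes gem; the training sentences are illustrative, and it assumes OmniCat::Classifier delegates these calls to the underlying strategy as shown above.
require 'omnicat/bayes'

classifier = OmniCat::Classifier.new(OmniCat::Classifiers::Bayes.new)
classifier.add_category('positive')
classifier.add_category('negative')
classifier.train('positive', 'great if you are in a slap happy mood')
classifier.train('negative', 'simplistic , silly and tedious')
result = classifier.classify('I am happy :)')
result.top_score_key # => "positive"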
Here is the list of classifiers available for OmniCat.
git checkout -b my-new-feature
git commit -am 'Add some feature'
git push origin my-new-feature
Author: Mustafaturan
Source Code: https://github.com/mustafaturan/omnicat
License: MIT license
gem install nbayes
NBayes is a full-featured Ruby implementation of Naive Bayes.
For more information, view this blog post: http://blog.oasic.net/2012/06/naive-bayes-for-ruby.html
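A minimal usage sketch based on the gem's documented train/classify flow follows; the tokens and labels are illustrative. NBayes works on pre-tokenized input, so each document is passed as an array of tokens.
require 'nbayes'

nbayes = NBayes::Base.new
# train takes an array of tokens plus a category label
nbayes.train(%w[great happy mood], 'positive')
nbayes.train(%w[silly tedious bad], 'negative')

result = nbayes.classify(%w[happy])
result.max_class   # => "positive"
result['positive'] # probability assigned to the "positive" label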
This project is supported by the GrammarBot grammar checker.
Author: oasic
Source Code: https://github.com/oasic/nbayes
License: MIT license