In this post, we will cover how to train and deploy a machine learning model leveraging a scalable stream processing architecture for an automated text prediction use-case. We will use Sklearn and SpaCy to train an ML model on the Reddit /r/science Content Moderation dataset, and we will deploy that model with Seldon Core for real-time processing of text data streamed through Kafka. This is the content of the talk presented at the NLP Summit 2020.
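To make the deployment side of that architecture concrete, the sketch below shows roughly what a Seldon Core Python model wrapper looks like: a plain class that loads the trained model on startup and exposes a predict method that Seldon calls for each incoming request. The class name, model filename, and payload handling here are illustrative assumptions, not the exact code from the talk.

```python
# Sketch of a Seldon Core Python model wrapper (names are illustrative).
# Assumes the trained sklearn/SpaCy pipeline was saved to "model.joblib".
import joblib


class RedditClassifier:
    def __init__(self):
        # Load the trained pipeline once when the microservice starts.
        self._model = joblib.load("model.joblib")

    def predict(self, X, features_names=None):
        # Seldon Core invokes predict() with the request payload; here we
        # assume X is a list of raw comment strings and return the class
        # probabilities ("removed" vs "not removed").
        return self._model.predict_proba(X)
```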

You can find the full code for this article at the following links:

Model Training with SpaCy & Sklearn

For this use-case we will be using the Reddit /r/science Content Moderation Dataset. It consists of over 200,000 Reddit comments, labelled according to whether they were removed by moderators. Our task is to train an ML model that can predict which comments would be removed by the Reddit moderators.
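Before walking through the details, here is a rough sketch of the kind of training pipeline this implies: SpaCy for tokenisation and lemmatisation, feeding an Sklearn TF-IDF plus logistic regression pipeline. The column names, classifier choice, and split parameters below are assumptions for illustration rather than the exact setup from the original code.

```python
# Rough sketch of a SpaCy + Sklearn training pipeline for the moderation task.
# Assumes the dataset is loaded into a pandas DataFrame `df` with a text
# column ("body") and a binary label ("removed"); both names are illustrative.
# Also assumes the "en_core_web_sm" SpaCy model is installed.
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])


def spacy_tokenizer(text):
    """Tokenize and lemmatize a comment with SpaCy, dropping stop words."""
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop]


X_train, X_test, y_train, y_test = train_test_split(
    df["body"], df["removed"], test_size=0.2, random_state=42
)

# TF-IDF features over SpaCy lemmas, classified with logistic regression.
model = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=spacy_tokenizer)),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```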
