LDA, or Latent Dirichlet Allocation, is one of the most widely used topic modelling algorithms. It is scalable, computationally fast and, more importantly, it generates simple and comprehensible topics that are close to what the human mind assigns when reading a text. While LDA is mostly used for unsupervised tasks, e.g. topic modelling or document clustering, it can also be used as a feature extraction system for supervised tasks such as text classification. In this article we are going to assemble an LDA-based classifier and see how it performs! Let’s go!
Folders were the classic solution to many text categorization problems!
For simplicity, we’re going to use the [**lda_classification**](https://pypi.org/project/lda-classification/) Python package, which offers simple wrappers, compatible with the scikit-learn estimator API, for text preprocessing and text vectorization.
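To make the idea of "LDA as a feature extractor" concrete, here is a minimal sketch built with plain scikit-learn only. It is not the lda_classification API (that package's own classes are not reproduced here); the helper name `build_lda_classifier`, the toy corpus and the choice of logistic regression as the final classifier are all assumptions made for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_lda_classifier(n_topics=10, random_state=0):
    """Hypothetical helper: word counts -> LDA topic proportions -> classifier."""
    return make_pipeline(
        CountVectorizer(stop_words="english"),
        LatentDirichletAllocation(n_components=n_topics, random_state=random_state),
        LogisticRegression(max_iter=1000),  # assumption: any classifier could go here
    )


# Tiny made-up corpus, only to show the calls; real documents would replace it.
texts = [
    "the engine and the brakes of the car",
    "new tires and engine oil for my car",
    "the pitcher threw the ball to the catcher",
    "a home run won the baseball game",
]
labels = ["autos", "autos", "baseball", "baseball"]

model = build_lda_classifier(n_topics=2)
model.fit(texts, labels)

# Everything except the last step turns raw text into topic proportions:
# one row per document, one column per topic.
topic_features = model[:-1].transform(texts)
print(topic_features.shape)
```

The classifier never sees the words themselves, only the per-document topic proportions, which is what makes the resulting features compact and fairly easy to interpret.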
The 20 Newsgroups dataset is one of the best-known and most heavily referenced datasets in the field of natural language processing. It consists of around 18K newsgroup posts spread over 20 categories. To make the task a little less resource heavy, we choose a subset of this dataset for our text classification problem. Since I really like following sports culture, I decided to choose the sport-related section of this dataset. The categories in this subset are as follows:
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
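One possible way to load just these four categories is scikit-learn's `fetch_20newsgroups`, sketched below. The helper name `load_subset` is hypothetical, and stripping headers, footers and quotes via `remove` is an assumption rather than something this article prescribes.

```python
# The four categories listed above.
CATEGORIES = [
    "rec.autos",
    "rec.motorcycles",
    "rec.sport.baseball",
    "rec.sport.hockey",
]


def load_subset(subset="train"):
    """Hypothetical helper: return (texts, integer labels, label names)."""
    # Imported lazily so scikit-learn is only needed when data is loaded.
    from sklearn.datasets import fetch_20newsgroups

    data = fetch_20newsgroups(
        subset=subset,  # "train", "test" or "all"
        categories=CATEGORIES,
        remove=("headers", "footers", "quotes"),  # assumption: drop metadata
    )
    return data.data, data.target, data.target_names


# Example call (downloads the dataset on first use):
# texts, labels, names = load_subset("train")
```

Removing headers, footers and quotes tends to make the task harder but more honest, since a classifier can otherwise pick up on metadata instead of the content of the post.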