Machine Learning - Text2Image: A new way to NLP?

The Problem

Natural Language Processing (NLP) has long been considered a tough nut to crack, at least in comparison to Computer Vision. NLP models take longer to run, are generally more difficult to implement and require substantially more computational resources. Image recognition models, on the other hand, have become much simpler to implement and are less taxing on your GPUs. This got me thinking: can we convert a corpus of text to an image? Can we interpret text as an image? It turns out we can, and with surprisingly promising results! We apply this approach to the problem of classifying fake and real news.

In this article, we will explore this approach in detail, along with the results, conclusions and further improvements. Buckle up, everyone!

Introduction

Inspiration

The idea of converting text to an image was initially inspired by this post by Gleb Esman on fraud detection. In that approach, various data points such as the speed, direction and acceleration of mouse movements were converted into a colour image. An image recognition model was then run on these images, producing highly accurate results.

The Data

The data used for all experiments is a subset of the Fake News dataset by George McIntire. It contains approximately 1000 articles of fake and real news combined.

Text2Image from 20,000 feet

Let’s first discuss Text2Image at a high level. The basic idea is to convert text into something we can plot as a heatmap. Wait, but what is that something? The TF-IDF values of each word. Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure of how important a word is to a document, relative to how common that word is across the corpus. After basic pre-processing and calculating the TF-IDF values, we plot them as heatmaps on a log scale, using some Gaussian filtering for smoothness. Once the heatmaps are plotted, we use a fast.ai implementation of a CNN to try to distinguish between the real and fake heatmaps. We ended up with a stable accuracy of around 71%, a great start for this new approach. Here’s a quick little flow chart of our approach:

[Figure: Text2Image flow chart]

Not quite clear yet? Read on.

Text2Image from the ground

Pre-Processing

The text is lowercased, all special characters are removed, and the title and text are concatenated. Words appearing in more than 85% of the documents are also removed. Additionally, a list of stop-words is explicitly excluded. The list used is a standard one, made up mostly of uninformative, repetitive words. Tailoring the stop-words to fake news is an area to explore in the future, especially to bring out the writing style particular to fake news.
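The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `preprocess` and `drop_frequent` and the tiny stop-word set are hypothetical, and treating anything outside `a-z`, `0-9` and whitespace as a special character is our guess at what "special characters" means here.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the article uses a standard one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def preprocess(title, text, stopwords=STOPWORDS):
    """Concatenate title and text, lowercase, strip special characters, drop stop-words."""
    combined = f"{title} {text}".lower()
    # Assumption: anything that is not a-z, 0-9 or whitespace is a special character.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", combined)
    return [tok for tok in cleaned.split() if tok not in stopwords]


def drop_frequent(docs, max_df=0.85):
    """Remove words that appear in more than max_df of the tokenised documents."""
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    too_common = {w for w, df in doc_freq.items() if df / n_docs > max_df}
    return [[tok for tok in tokens if tok not in too_common] for tokens in docs]


tokens = preprocess("Breaking: Aliens!", "The aliens landed in Ohio.")
```

With scikit-learn, the 85% cut-off would more likely be handled by the vectorizer's `max_df` parameter than by hand.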

Calculating TF-IDF

For scoring and extracting the keywords, a scikit-learn implementation of smoothed term frequency-inverse document frequency (tf-idf) is used. The IDF is calculated separately for the fake and the real news corpus. Calculating separate IDF scores resulted in a massive jump in accuracy compared with a single IDF score for the entire corpus. The tf-idf scores are then calculated iteratively for each document. Here, the title and the text are not scored separately, but together.

Multiplying each word's term frequency by its IDF gives its tf-idf score. We do this for each document in turn.
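To make the calculation concrete, here is a minimal pure-Python sketch. It assumes the smoothed IDF that scikit-learn documents, ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the word. The helper names and the toy corpora are hypothetical, and this is not the code used in the experiments.

```python
import math
from collections import Counter


def smooth_idf(docs):
    """Smoothed IDF per word: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = Counter(word for tokens in docs for word in set(tokens))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}


def tfidf(tokens, idf):
    """Raw term count times IDF for each word of one document (no normalisation)."""
    counts = Counter(tokens)
    return {w: counts[w] * idf[w] for w in counts if w in idf}


# Toy corpora (made up): one IDF table per class, as described above.
fake_docs = [["aliens", "landed", "ohio"], ["aliens", "hoax"]]
real_docs = [["senate", "passed", "bill"], ["senate", "vote"]]
idf_fake = smooth_idf(fake_docs)
idf_real = smooth_idf(real_docs)

scores = tfidf(fake_docs[0], idf_fake)
```

A word that occurs in every document of its corpus ("aliens" here) gets the minimum IDF of 1, while rarer words score higher. Note that scikit-learn's `TfidfVectorizer` also L2-normalises each document vector by default, which this sketch leaves out.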

Processing TF-IDF values

For each document, the 121 words with the highest TF-IDF values are extracted. These words are then used to create an 11x11 array. Here, the number of words chosen can act as a hyper-parameter. For shorter, simpler pieces of text, fewer words can be used, while a larger number of words can be used to represent longer, more complex text. Empirically, 11x11 was found to be an ideal size for this dataset. Instead of arranging the TF-IDF values in descending order of magnitude, they are mapped according to their position in the text. The values are mapped this way because it is more representative of the text and provides richer features for the model to train on. Since a single word can appear multiple times in a piece of text, only its first occurrence is taken into account.
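The selection and ordering described above might look like the following sketch. The function name `tfidf_grid` is hypothetical, and padding short documents with zeros is our assumption; the article does not say how documents with fewer than 121 distinct words are handled.

```python
def tfidf_grid(tokens, scores, side=11):
    """Return a side x side grid of the top side*side scores, in order of first occurrence."""
    k = side * side
    # Position of the first occurrence of every scored word.
    first_pos = {}
    for pos, word in enumerate(tokens):
        if word in scores and word not in first_pos:
            first_pos[word] = pos
    # Keep the k highest-scoring words, then restore document order.
    top = sorted(first_pos, key=lambda w: scores[w], reverse=True)[:k]
    top.sort(key=first_pos.get)
    values = [scores[w] for w in top]
    # Assumption: pad with zeros when there are fewer than k distinct words.
    values += [0.0] * (k - len(values))
    return [values[r * side:(r + 1) * side] for r in range(side)]


# Small 2x2 example so the ordering is easy to follow.
example = tfidf_grid(
    ["b", "a", "c", "a", "d"],
    {"a": 3.0, "b": 1.0, "c": 2.0, "d": 0.5},
    side=2,
)
```

In the example the grid follows the order b, a, c, d in which the words first appear, not the order of their scores.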

Instead of plotting the TF-IDF values as is, all values are plotted on a log scale. This is done to reduce the large difference between the top and bottom values. Because of this difference, most of a heatmap plotted on a linear scale would not show any colour variance; the log scale brings the differences out better.

Figure 1 (left) shows the TF-IDF values plotted as is. Figure 2 (right) shows the same values plotted on a log scale.
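A small numeric sketch shows why the log scale helps. The values below are made up, and the `eps` floor is our own guard against taking the log of zero, not something the article specifies.

```python
import math


def log_scale(grid, eps=1e-6):
    """Apply a natural log to every cell; eps is an assumed floor that avoids log(0)."""
    return [[math.log(max(v, eps)) for v in row] for row in grid]


# One dominant value and several small ones, as is typical of TF-IDF scores.
raw = [[0.9, 0.01], [0.02, 0.015]]
scaled = log_scale(raw)

raw_ratio = max(map(max, raw)) / min(map(min, raw))
log_spread = max(map(max, scaled)) - min(map(min, scaled))
```

On the linear scale the largest value is 90 times the smallest, so the small values would share nearly the same colour. After the log transform the whole range spans only ln(90), roughly 4.5 units, and the small values are clearly separated.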

One drawback is a large amount of overfitting while training the model. This can be attributed to the lack of data augmentation, as no form of data augmentation currently seems viable for this use case. Hence, Gaussian filtering was applied to the entire dataset in a bid to smooth out the plots. While it did decrease the accuracy a little, there was a significant decrease in overfitting, especially during the initial stages of training.
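For readers unfamiliar with Gaussian filtering, here is a dependency-free sketch of what it does to a grid. The `sigma` and `radius` values and the edge handling (clamping to the nearest cell) are assumptions, since the article does not state them; in practice a library routine such as `scipy.ndimage.gaussian_filter` would be the usual choice.

```python
import math


def gaussian_kernel(sigma=1.0, radius=2):
    """1-D Gaussian weights, normalised to sum to 1."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(values, kernel):
    """Weighted average along one row, clamping indices at the edges."""
    radius = len(kernel) // 2
    last = len(values) - 1
    return [
        sum(k * values[min(max(i + j - radius, 0), last)] for j, k in enumerate(kernel))
        for i in range(len(values))
    ]


def gaussian_blur(grid, sigma=1.0, radius=2):
    """Separable 2-D Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma, radius)
    rows = [blur_1d(row, kernel) for row in grid]
    cols = [blur_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


# A single bright cell gets spread over its neighbours.
spike = [[0.0] * 5 for _ in range(5)]
spike[2][2] = 1.0
smoothed = gaussian_blur(spike)
```

Each cell becomes a weighted average of its neighbourhood, so isolated peaks are flattened and neighbouring cells look more alike, which is the smoothing effect described above.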

Final Plots

The final heatmaps have dimensions 11 x 11 and are plotted with seaborn. As the x-axis, the y-axis and the colour bar do not convey any information during training, we remove them. The colour map used is ‘plasma’, as it showed ideal colour variation. Experimenting with different colour combinations can be an area to explore in the future. Here is an example of how the final plots look.

[Figure: examples of the final plots]

Training our model

The model is a ResNet-34 trained using fast.ai. In total, 489 fake articles and 511 real articles were used. A standard 80:20 split between training and test sets was used, with no data augmentation. All code used can be found here.
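As a quick sanity check on the numbers, the split can be sketched as follows. The file names and the fixed seed are hypothetical; fast.ai has its own data-splitting utilities, which this sketch does not try to reproduce.

```python
import random


def split_80_20(items, seed=0):
    """Shuffle a copy of items and split it 80:20 into train and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


# 489 fake and 511 real heatmaps, represented by hypothetical file names.
samples = [(f"fake_{i}.png", "fake") for i in range(489)]
samples += [(f"real_{i}.png", "real") for i in range(511)]

train, test = split_80_20(samples)
```

With 1000 articles this gives 800 heatmaps for training and 200 for testing.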

Results

[Figure: training results]

Conclusions

After 9 epochs, the model had an accuracy of above 70%. Although far from state-of-the-art for this dataset, the new approach does seem promising. Here are a few observations made during the training process:

The model overfits by a huge amount. Increasing the data did not have any effect on the overfitting, contrary to what we expected. Further training or changing learning rates did not have any effect either.
Increasing the plot size helped up to 11x11, after which larger plots led to a decrease in accuracy.
Using some amount of Gaussian filtering on the plots helped increase the accuracy.

Future work

Currently, we are working on visualizing part-of-speech (POS) tags and GloVe word embeddings. We are also looking into modifying the stop-words and playing around with the size of the plot and the colour scheme. We will keep you posted on the progress!

towardsdatascience.com
