Natural Language Processing (NLP) has long been considered a tough nut to crack, at least in comparison to Computer Vision. NLP models take longer to run, are generally more difficult to implement and require substantially more computational resources. Image recognition models, on the other hand, have become much simpler to implement and are less taxing on your GPUs. This got me thinking: can we convert a corpus of text to an image? Can we interpret text as an image? It turns out we can, and with surprisingly promising results! We apply this approach to the problem of classifying fake and real news.
In this article, we will explore the approach in detail, along with the results, conclusions and possible improvements. Buckle up, everyone!
The idea of converting text to an image was inspired by this post by Gleb Esman on fraud detection. In that approach, data points such as the speed, direction and acceleration of mouse movements were converted into a colour image. An image recognition model was then run on these images, producing highly accurate results.
The data used for all experiments is a subset of the Fake News dataset by George McIntire. The subset contains approximately 1000 articles, fake and real news combined.
Let’s first discuss Text2Image at a high level. The basic idea is to convert text into something we can plot as a heatmap. Wait, but what is that something? The TF-IDF values of each word. Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure of how important a word is to a document relative to the rest of the corpus. After basic pre-processing and calculating the TF-IDF values, we plot them as heatmaps on a log scale, with some Gaussian filtering for smoothness. Once the heatmaps are plotted, we use a fast.ai implementation of a CNN and try to distinguish between the real and fake heatmaps. We ended up with a stable accuracy of around 71%, a great start for this new approach. Here’s a quick little flow chart of our approach.
Not quite clear yet? Read on.
The data is lowercased, all special characters are removed, and the text and title are concatenated. Words appearing in more than 85% of the documents are also removed. Additionally, a standard list of stop words (mostly uninformative, repetitive words) is explicitly excluded. Tailoring the stop-word list to fake news is an area to explore in the future, particularly to bring out the writing style specific to fake news.
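The cleaning step can be sketched in a few lines. This is a minimal illustration, not the code used for the experiments: the function name `preprocess` and the exact regular expression are assumptions, and the stop-word and 85% filters are left to the TF-IDF step.

```python
import re

def preprocess(title: str, text: str) -> str:
    """Concatenate title and body, lowercase, and strip special characters."""
    combined = f"{title} {text}".lower()
    # Assumption: "special characters" means anything that is not a
    # lowercase letter, a digit, or whitespace.
    combined = re.sub(r"[^a-z0-9\s]", " ", combined)
    # Collapse the runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", combined).strip()
```

Calling `preprocess(title, text)` on each article gives one cleaned string per document, ready for scoring.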
For scoring and extracting the keywords, the scikit-learn implementation of smooth term frequency-inverse document frequency (tf-idf) is used. The IDF is calculated separately for the fake and the real news corpus. Calculating separate IDF scores resulted in a massive jump in accuracy in comparison with a single IDF score for the entire corpus. Here, the title and the text are not scored separately, but together.
Multiplying the term frequency by the inverse document frequency gives the tf-idf score. We do this iteratively for each document.
For each document, the 121 words with the highest TF-IDF values are extracted. These words are then used to create an 11x11 array. Here, the number of words chosen acts as a hyper-parameter. For shorter, simpler pieces of text, fewer words can be used, while a larger number of words can be used to represent longer, more complex text. Empirically, 11x11 was found to be an ideal size for this dataset. Instead of arranging the TF-IDF values in descending order of magnitude, they are mapped according to their position in the text. The values are mapped this way because it is more representative of the text and provides richer features for the model to train on. Since a single word can appear multiple times in a piece of text, only its first occurrence is taken into account.
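The selection and layout described above might look like the sketch below. `build_grid` is a hypothetical helper; it takes the document's tokens in reading order plus a word-to-score dictionary. Zero-padding documents with fewer than 121 scored words is an assumption, since the text does not say how short documents are handled.

```python
import numpy as np

def build_grid(tokens, scores, side=11):
    """Place the top side*side TF-IDF scores on a grid, ordered by first occurrence."""
    n_cells = side * side
    # Keep the n_cells words with the highest scores.
    top_words = set(sorted(scores, key=scores.get, reverse=True)[:n_cells])
    # Walk the text once; each selected word is placed where it first appears.
    ordered, seen = [], set()
    for tok in tokens:
        if tok in top_words and tok not in seen:
            seen.add(tok)
            ordered.append(scores[tok])
    # Assumption: documents with fewer than n_cells scored words are zero-padded.
    grid = np.zeros(n_cells)
    grid[: len(ordered)] = ordered
    return grid.reshape(side, side)
```

For example, `build_grid(cleaned_text.split(), scores)` returns an 11x11 array whose cells follow the order in which the top words first appear in the article.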
Instead of plotting the TF-IDF values as is, all values are plotted on a log scale. This is done to reduce the large difference between the top and bottom values.
Plotted as is, this difference leaves the majority of the heatmap without any visible colour variance. Hence, the values are plotted on a log scale to better bring out the differences.
Figure 1 (left) shows the TF-IDF values plotted as is. Figure 2 (right) shows the same values plotted on a log scale.
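One simple way to get the log-scale values is to transform the grid before plotting, as sketched here. The base-10 logarithm and the small floor `eps` (which keeps empty cells finite) are assumptions; the original plots may have used a different base or a logarithmic colour normalisation instead.

```python
import numpy as np

def to_log_scale(grid, eps=1e-6):
    """Return log10 of the grid; eps is an assumed floor for empty cells."""
    return np.log10(np.maximum(grid, eps))
```

After the transform, two values that differ by a factor of ten sit exactly one unit apart, so the small values no longer collapse into a single colour.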
One drawback of the approach is heavy overfitting while training the model. This can be attributed to the lack of data augmentation, as no augmentation method currently seems viable for this use case. Hence, Gaussian filtering was applied to the entire dataset to smooth out the plots. While it decreased the accuracy slightly, it significantly reduced overfitting, especially during the initial stages of training.
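The smoothing step could be done with SciPy's `gaussian_filter`, as in this sketch. The choice of SciPy and the value of `sigma` are assumptions; the text does not state which implementation or kernel width was used.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth(grid, sigma=1.0):
    """Blur the grid with a Gaussian kernel; sigma is an assumed value."""
    return gaussian_filter(grid, sigma=sigma)
```

A larger `sigma` spreads each value further over its neighbours, trading fine detail for smoother plots.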
The final heatmaps have dimensions 11 x 11 and are plotted with seaborn. As the x-axis, the y-axis and the colour bar do not convey any information during training, we remove them. The colour map used is ‘plasma’, as it showed ideal colour variation. Experimenting with different colour combinations is an area to explore in the future. Here is an example of how the final plots look.
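A bare heatmap of this kind can be produced roughly as follows. The helper name `save_heatmap`, the output size in pixels and the off-screen `Agg` backend are assumptions made so the sketch runs anywhere; only the 'plasma' colour map and the removal of axes and colour bar come from the description above.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def save_heatmap(grid, path, pixels=224):
    """Save the grid as a bare 'plasma' heatmap with no axes or colour bar."""
    fig, ax = plt.subplots(figsize=(pixels / 100, pixels / 100), dpi=100)
    sns.heatmap(grid, cmap="plasma", cbar=False,
                xticklabels=False, yticklabels=False, ax=ax)
    # Let the heatmap fill the whole canvas so no margin is saved.
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1)
    fig.savefig(path)
    plt.close(fig)
```

Saving one image per article into a folder per class gives a dataset that an image classifier can read directly.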
The model is a resnet34 trained using fast.ai. In total, 489 fake articles and 511 real articles were used. A standard 80:20 split was used between the train and test sets, with no data augmentation. All code used can be found here.
After 9 epochs, the model reached an accuracy above 70%. Although this is far from state-of-the-art for this dataset, the new approach does seem promising. Here are a couple of observations made during the training process:
Currently, we are working on visualizing part-of-speech (POS) tags and GloVe word embeddings. We are also looking into modifying the stop words and playing around with the size of the plot and the colour scheme. We will keep you posted on the progress!