The amount of written text available and relevant to us is staggering and, like most things these days, growing rapidly. Yet the tools we use to read it have remained largely unchanged.
When researching a topic, reading the news or trying to get an update on an event, we tend to follow a T-shaped process.
We want to understand, from a body of documents:
The hunter-gatherer equivalent for information collection.
Solving this would be a tremendous step forward in how we consume information, and I will definitely NOT be able to solve it by the end of this article. My aim, however, is to propose an approach that takes a tiny step forward.
What I am proposing here is a utility that helps us explore large collections of text quickly. I call it PictureText (a silly name, but it is a WIP).
Given a set of short documents (think news headlines), it arranges them into a hierarchy of groups whose members are semantically related. An interactive treemap then lets the reader explore each group in more detail by drilling deeper into the hierarchy and zooming back out when needed.
The approach is intended for grouping large sets of short texts that are not domain-specific. News headlines, natural-language questions and social media posts, for instance, would be good candidates.
This is largely the result of combining three tools, which I will say more about later. Credit goes mostly to the SBERT, plotly and fastcluster teams, as I simply stitched the pieces together.
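To give a feel for how the three pieces could fit together, here is a minimal sketch. It is not the PictureText code itself. The function names (`embed`, `treemap_rows`, `show`), the model name `all-MiniLM-L6-v2`, the Ward linkage method and the `max_depth` cut-off are all assumptions made for illustration. SciPy's `linkage` stands in for fastcluster here, since fastcluster offers a compatible `linkage` function that could be swapped in.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree


def embed(texts, model_name="all-MiniLM-L6-v2"):
    """Encode texts with SBERT. The model name is an assumption, not the article's choice."""
    # Imported lazily so the clustering part below works without the model installed.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name).encode(texts)


def treemap_rows(texts, vectors, max_depth=3):
    """Cluster the vectors hierarchically and flatten the tree into parallel
    ids / parents / labels lists, the layout a plotly treemap takes as input."""
    # Ward linkage is an illustrative choice; needs at least two documents.
    Z = linkage(np.asarray(vectors, dtype=float), method="ward")
    ids, parents, labels = [], [], []

    def add(node_id, parent_id, label):
        ids.append(node_id)
        parents.append(parent_id)
        labels.append(label)

    def walk(node, parent_id, depth):
        node_id = f"n{node.id}"
        if node.is_leaf():
            # Leaf ids are the positions of the original documents.
            add(node_id, parent_id, texts[node.id])
            return
        members = node.pre_order()  # document indices under this node
        add(node_id, parent_id, f"{len(members)} documents")
        if depth >= max_depth:
            # Stop splitting: hang every remaining document directly on this group.
            for i in members:
                add(f"n{i}", node_id, texts[i])
        else:
            walk(node.get_left(), node_id, depth + 1)
            walk(node.get_right(), node_id, depth + 1)

    walk(to_tree(Z), "", 0)  # the root's parent is the empty string
    return ids, parents, labels


def show(ids, parents, labels):
    """Render the hierarchy as an interactive treemap."""
    import plotly.graph_objects as go  # lazy import, only needed for plotting
    go.Figure(go.Treemap(ids=ids, parents=parents, labels=labels)).show()
```

With a list of headlines in hand, `show(*treemap_rows(headlines, embed(headlines)))` would draw the treemap. The group labels in this sketch are just document counts; choosing a meaningful label for each group is a separate problem that the sketch does not attempt.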
Check out the Colab notebook and the GitHub repository for the results and examples shown in this article.