The amount of written text available and relevant to us is staggering and, like most things these days, increasing exponentially. Yet the tools we use to read it have remained largely unchanged.

When researching a topic, reading the news, or trying to get an update on an event, we tend to follow a T-shaped process: broad first, then deep.

Given a body of documents, we want to:

  • Understand what they are mostly about. What is generally going on?
  • Identify the main themes.
  • Dive deep, once we have the broad strokes, into something we find interesting.
  • Keep track of what we have read, so that we stay on course.
  • Pull up, then dive again into another narrow topic.
  • Repeat…

It is the hunter-gatherer equivalent of information collection.

Making this process easier would be a tremendous step forward in how we consume information, and I will definitely NOT be able to solve it by the end of this article. My aim, however, is to propose an approach for a tiny step forward.

What can we do about this?

What I am proposing here is a utility that helps us explore large collections of text quickly. I call it PictureText (a silly name, but it is a work in progress).

Given a set of short documents (think news headlines), it arranges them into a hierarchy of groups whose members belong together semantically. An interactive treemap lets the reader explore each group in more detail by going deeper into the hierarchy and pulling back out when needed.

The approach is intended for grouping large sets of short texts that are not domain-specific. News headlines, natural-language questions and social media posts would all be good candidates.

This is largely the result of mixing three tools, which I will talk about more later. Credit goes mostly to the SBERT, plotly and fastcluster teams, as I simply stitched the pieces together.
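To make the grouping idea concrete, here is a toy sketch of the hierarchy-building step. It is not the PictureText code. To stay dependency-free it swaps in a bag-of-words count vector where the real tool would use SBERT embeddings, and a naive average-linkage loop where the real tool would use fastcluster. The treemap rendering is left out entirely. All function names (`embed`, `cosine`, `build_hierarchy`, `leaves`) are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_hierarchy(docs):
    """Naive average-linkage agglomerative clustering.

    Returns a nested tuple tree whose leaves are indices into docs.
    """
    if not docs:
        raise ValueError("need at least one document")
    vectors = [embed(d) for d in docs]
    # Each cluster is (subtree, list of member indices).
    clusters = [(i, [i]) for i in range(len(docs))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[x][1] for j in clusters[y][1]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, x, y)
        _, x, y = best
        # Merge the two most similar clusters into one parent node.
        merged = ((clusters[x][0], clusters[y][0]),
                  clusters[x][1] + clusters[y][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0][0]


def leaves(tree):
    """Flatten a subtree into the list of document indices it contains."""
    if isinstance(tree, int):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


headlines = [
    "stock markets rally as tech shares surge",
    "tech shares lift stock markets higher",
    "storm floods coastal towns overnight",
    "coastal towns hit by storm floods",
]
tree = build_hierarchy(headlines)
for branch in tree:
    print([headlines[i] for i in leaves(branch)])
```

With these four made-up headlines, the two top-level branches separate the market stories from the storm stories. Each branch is itself a tree, which is what lets a reader dive into one group and pull back out again. Word overlap is a crude measure of meaning, so real sentence embeddings should group headlines that share a topic but not a vocabulary.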

Check out the Colab notebook and the GitHub repository for the results and examples seen in the article.


PictureText: Interactive Visuals of Text