There is a substantial amount of data generated on the internet every second — posts, comments, photos, and videos. These different data types mean that there is a lot of ground to cover, so let’s focus on one — text.

Most social conversations happen in writing: tweets, Facebook posts, comments, online reviews, and so on. Whether you are a social media marketer, a Facebook group or profile moderator, or a business owner promoting your brand on social media, you need to know how your audience reacts to the content you publish. One way is to read it all, mark hateful comments, divide them into similar topic groups, calculate statistics and… lose a big chunk of your time just to see that there are thousands of new comments to add to your calculations. Fortunately, there is another solution to this problem: machine learning. From this text you will learn:

  • Why do you need specialised tools for social media analyses?
  • What can you get from topic modeling and how is it done?
  • How to automatically look for hate speech in comments?

Why are social media texts unique?

Before jumping to the analyses, it is really important to understand why social media texts are so unique:

  • Posts and comments are short. They mostly contain one simple sentence or even a single word or expression. This limits how much information we can extract from a single post.

  • Emojis and smiley faces — used almost exclusively on social media. They give additional details about the author’s emotions and context.

  • Slang phrases, which make posts resemble spoken rather than written language. This makes statements appear more casual.

These features make social media a very different source of information, one that demands special attention when you analyse it with machine learning. In contrast, most open-source machine learning solutions are based on long, formal text, like Wikipedia articles and other website posts. As a result, these models perform badly on social media data, because they do not understand the additional forms of expression these posts contain. This is known as domain shift, a common problem in NLP. Different data also require customised data preparation, known as preprocessing. This step consists of cleaning the text of uninformative tokens such as URLs or mentions and converting it to a machine-readable format (more about how we do it in Sotrender). This is why it is crucial to use tools created especially for your data source to get the best results.
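
To make the preprocessing step more concrete, here is a minimal Python sketch of cleaning a post of URLs and mentions and splitting it into tokens. It is only an illustration, not the pipeline used in Sotrender: the regular expressions and the function names `clean_post` and `tokenize` are assumptions made for this example, and a real pipeline would likely need broader rules (hashtags, repeated characters, language-specific handling) plus a further step that turns tokens into numbers.

```python
import re

# Hypothetical patterns chosen for illustration; real data usually needs more rules.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_post(text: str) -> str:
    """Remove URLs and @mentions, collapse whitespace, and lowercase the text."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    text = WHITESPACE_PATTERN.sub(" ", text).strip()
    return text.lower()


def tokenize(text: str) -> list[str]:
    """Split a cleaned post on whitespace, keeping emojis and slang as tokens."""
    return clean_post(text).split()


if __name__ == "__main__":
    post = "@anna OMG this is sooo good 😍 check https://example.com/offer"
    print(tokenize(post))
```

Note that the emoji and the slang spelling are kept on purpose: as described above, they carry information about the author's emotions, so this sketch removes only the tokens that say nothing about the content.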

Social media and topic modeling: how to analyze posts in practice