A discussion of Arabic natural language processing (NLP) for social media text, with code examples and in-depth analysis of the cutting-edge technology driving the most recent advancements.
Natural language processing (NLP) is not a new discipline; its roots date back to the 1600s when philosophers such as Descartes and Leibniz proposed theoretical codes for language. In the past decade, the results of this long history have led to the integration of NLP into our own homes, in the form of digital assistants like Siri and Alexa. Although machine learning has remarkably accelerated the improvement of English NLP techniques, the study of NLP for other languages has always lagged behind.
As the official language of 22 countries spread across the Middle-East North Africa (MENA) region, Arabic is the 4th most used language on the Internet. Statistics from 2018 show 164 million internet users in the Middle East and 121 million internet users in North Africa.
As a language, Arabic has complex morphology and various dialects. The complexity increases significantly when considering the informal nature of social-media text and the distinction between Modern Standard Arabic (MSA) and Dialectical Arabic (DA). MSA is used for formal writing and DA is used for informal daily communication; however, both forms are present on social media with the latter being the most common form.
Further complicating matters, there are numerous dialects, for example the Egyptian dialect is different from the Levantine dialect which is used in Palestine, Jordan, Syria, Lebanon and Israel. Both of these dialects are also distinct from the Gulf dialect used in Kuwait, Bahrain, Qatar, and the United Arab Emirates. In a paper from May, 2019 researchers commented that the inflectional and derivational nature of Arabic language makes monophonic analysis on Arabic more difficult. Simply put, due to the differences between English and Arabic, the advancements to English NLP are not easily transferred to the development of Arabic NLP resources. Additionally, crude Arabic-to-English translations cannot be relied upon as sufficient preprocessing before the application of sophisticated English NLP methods; much is lost in translation.
Despite all these challenges, the past four years have been fruitful, and Arabic NLP research in areas such as sentiment analysis and machine translation have produced extremely useful resources. My focus is on Arabic social media, since these platforms have been instrumental in the democratic development of the MENA region. Furthermore, in my opinion, social media text is the most accessible form of data available to study this often inscrutable region.
This will be the first part of a three-part series about Arabic NLP. In this post, I will focus on the two most effective and accessible tools I have used in my research, AraVec and AraBERT. Those of you familiar with current research trends in English NLP, will notice the similarities between these names and the popular English NLP tools, Word2Vec and BERT. For a beginner-friendly overview of English NLP resources, check out my earlier post on Sentiment Analysis with Python where I discuss both Word2Vec and BERT in detail.
We supply you with world class machine learning experts / ML Developers with years of domain experience who can add more value to your business.
What is neuron analysis of a machine? Learn machine learning by designing Robotics algorithm. Click here for best machine learning course models with AI
AI, Machine learning, as its title defines, is involved as a process to make the machine operate a task automatically to know more join CETPA
You got intrigued by the machine learning world and wanted to get started as soon as possible, read all the articles, watched all the videos, but still isn’t sure about where to start, welcome to the club.
Introduction to Transfer Learning for NLP using fast.ai. This is the third part of a series of posts showing the improvements in NLP modeling approaches.