After discussing a wide range of aspects of Azure Machine Learning, it is time to move into a new area: Text Analytics. This article is dedicated to Language Detection in Azure Machine Learning for Text Analytics.

Earlier articles in this series covered multiple machine learning techniques in Azure Machine Learning, such as Regression Analysis, Classification Analysis, Clustering, Recommender Systems, and Anomaly Detection of Time Series. We have also discussed basic cleaning techniques, feature selection techniques, Principal Component Analysis, comparing models, Cross-Validation, and hyperparameter tuning. Now it is time to focus on a different aspect of machine learning: text analytics.

Text Analytics

Text analytics has become an important technique, and it underpins many problems such as content-based recommendation and text classification. As you can imagine, text analytics is challenging in many respects because the techniques to apply depend on the language you are working with. Some of those techniques are tokenization, stemming, and lemmatization. Azure Machine Learning supports different options for text analytics, as shown in the figure below.
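To make the language dependence concrete, here is a minimal sketch of tokenization and stemming in plain Python. It is not how Azure Machine Learning implements these steps; the suffix list `ENGLISH_SUFFIXES` and the helper names `tokenize` and `naive_stem` are assumptions made for illustration, and the rules only make sense for English.

```python
import re

# Hypothetical, English-only suffix list; real stemmers use far richer rules.
ENGLISH_SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Split text into lowercase word tokens (suits space-delimited scripts only)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def naive_stem(token):
    """Strip the first matching English suffix, keeping at least three characters."""
    for suffix in ENGLISH_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The users entered comments in many languages")
stems = [naive_stem(token) for token in tokens]
print(tokens)
print(stems)
```

Applying the same suffix rules to Spanish or Japanese text would produce meaningless stems, which is why the language has to be known before this kind of pre-processing.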

Before we carry out any of this pre-processing or other tasks, it is essential to perform Language Detection in Azure Machine Learning for Text Analytics.

Language Detection in Azure Machine Learning

There are more than 7,000 languages in use in the world today, so user-generated data is very likely to contain multiple languages. Every language has unique features, so we need to identify the language before we can perform pre-processing tasks. Text data also has no fixed schema, which means users can enter any type of text, including text in multiple languages. Language detection is therefore an important step in Text Analytics before we start any other pre-processing techniques.
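The idea of language detection can be sketched with a toy stop-word counter. This is only an illustration under stated assumptions: the word lists in `STOP_WORDS` are tiny and hypothetical, the function names are invented for this example, and the approach is not the method used by Azure Machine Learning.

```python
import re

# Tiny, hypothetical stop-word lists; a real detector covers far more languages.
STOP_WORDS = {
    "English": {"the", "is", "and", "this", "of", "to"},
    "Spanish": {"el", "la", "es", "y", "de", "que"},
    "French": {"le", "la", "est", "et", "de", "un"},
}

def language_scores(text):
    """Return, per language, the share of tokens found in its stop-word list."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return {language: 0.0 for language in STOP_WORDS}
    return {
        language: sum(token in words for token in tokens) / len(tokens)
        for language, words in STOP_WORDS.items()
    }

def detect_language(text):
    """Pick the highest-scoring language, or 'Unknown' when nothing matches."""
    scores = language_scores(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Unknown"

print(detect_language("This is the end of the story"))
print(detect_language("El libro es de la biblioteca"))
```

A production detector would rely on much larger statistical models, but the shape is the same: score each candidate language, then choose the best one.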

Let us look at a data sample that contains text in multiple languages, such as English, Spanish, French, Sinhala, Japanese, and Arabic.

Let us import this data into Azure Machine Learning to detect languages, as we did in the first article of the series. The CSV file is uploaded as a new data set in Azure Machine Learning with relevant comments. Let us drag and drop the uploaded data set and add the **Select Columns in Dataset** control to remove unnecessary columns from the data set.

Let us look at the data output from the **Select Columns in Dataset** control as shown below.

Now the task is to identify the language using the **Detect Languages** control. In this control, you need to select the text column on which you want to perform language detection.

You can select only one column per control for language detection in Azure Machine Learning. To detect the language in multiple columns, you need to add multiple Detect Languages controls.

The Detect Languages control has very simple configurations, as shown in the above screenshot. The upper bound on the number of languages to detect option decides how many languages are reported for each text. By default, this option is set to one, which means every text is assigned to a single language even when it contains multiple languages.
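The effect of an upper bound can be sketched as a simple top-k selection over per-language scores. The function `top_languages`, its `upper_bound` parameter, and the sample scores are hypothetical names and values chosen for this illustration; they are not the control's actual interface.

```python
def top_languages(scores, upper_bound=1):
    """Return up to `upper_bound` (language, score) pairs, highest score first."""
    if upper_bound < 1:
        raise ValueError("upper_bound must be at least 1")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(language, score) for language, score in ranked[:upper_bound] if score > 0]

# Hypothetical scores for one mixed-language comment.
mixed_scores = {"English": 0.6, "Spanish": 0.3, "French": 0.1}

print(top_languages(mixed_scores))                 # only the dominant language
print(top_languages(mixed_scores, upper_bound=2))  # the two most likely languages
```

With an upper bound of one, the Spanish and French portions of the comment go unreported, which matches the default behaviour described above.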

Let us look at the output of the Detect Languages control as shown below.

Language Detection in Azure Machine Learning has identified the language of each text along with a score. A score of 100% for English means the text was detected as entirely English. You can now keep only the English text by using the Split Data control, as shown below.

After the data split is completed, the first port provides only the English text, as shown below, while the second port provides the non-English data.
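The two-port split can be mimicked in a few lines of Python. This is a sketch, not the Split Data control itself; the `split_by_language` helper, the `language` column name, and the sample rows are assumptions made for the example.

```python
def split_by_language(rows, language="English", column="language"):
    """Return (matching_rows, remaining_rows), mirroring a two-output split."""
    matching, remaining = [], []
    for row in rows:
        (matching if row.get(column) == language else remaining).append(row)
    return matching, remaining

# Hypothetical rows: a comment plus the language a detector assigned to it.
rows = [
    {"comment": "Great product", "language": "English"},
    {"comment": "Muy buen producto", "language": "Spanish"},
    {"comment": "Fast delivery", "language": "English"},
]

english_rows, other_rows = split_by_language(rows)
print(len(english_rows), len(other_rows))
```

The first returned list plays the role of the first port (English rows) and the second list plays the role of the second port (everything else).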
