Improving the Performance of Machine Learning Models

In any decision-making process, you need to gather all the different types of data and information you have, break them down, learn more about them, consult experts and more before making a sound decision.

The same holds in machine learning with ensemble techniques. Ensemble models combine a variety of models to support the prediction (decision-making) process. A single model may not be able to produce the right prediction for a given dataset, which raises the risk of high variance, low accuracy, noise and bias. By combining several models, we have a better chance of improving accuracy.

The simplest building block is the decision tree: a tree-structured model that repeatedly splits the data and makes predictions based on the answers to the preceding sequence of questions.

Why would you use an ensemble technique?

To answer the question behind this article, "When would ensemble techniques be a good choice?": when you want to improve the performance of machine learning models. It is as simple as that.

For example, if you are working on a classification task and want to increase your model's accuracy, use ensemble techniques. If you want to reduce the mean error on your regression task, use ensemble techniques.

The 2 main reasons to use an ensemble learning algorithm are:

  • Improve predictions - you get better predictive skill than with a single model.
  • Improve robustness - you get more stable predictions than with a single model.

Your overall goal when using ensemble techniques should be to reduce the generalization error of the prediction. Combining base models that are diverse, so that their errors are not strongly correlated, tends to reduce the prediction error.

In essence, the aim is to build a more stable, reliable and accurate model that you can trust.

There are 3 types of ensemble modeling techniques:

  1. Bagging
  2. Boosting
  3. Stacking

Bagging

Bagging is short for Bootstrap Aggregation: the technique combines bootstrapping and aggregation to form an ensemble model. It creates several training sets by sampling with replacement from the original training data, fits a model (typically a decision tree) on each of them, and then aggregates their outputs into a final prediction.

Each model is trained independently of the others on a different sample of the training dataset. Bagging aims to avoid overfitting and to reduce the variance of the predictions, and it can be used for both regression and classification models.
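As a rough illustration of the idea, the sketch below compares a single decision tree with a bagged ensemble of trees using scikit-learn. The synthetic dataset, the number of estimators and the split are assumptions chosen for the example, not settings from the article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A single decision tree as the baseline
single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 50 trees, each fitted on its own bootstrap sample of the training set
# (BaggingClassifier uses a decision tree as its default base model)
bagged_trees = BaggingClassifier(
    n_estimators=50, bootstrap=True, random_state=0
).fit(X_train, y_train)

tree_accuracy = single_tree.score(X_test, y_test)
bagging_accuracy = bagged_trees.score(X_test, y_test)
print("Single tree accuracy:", tree_accuracy)
print("Bagged trees accuracy:", bagging_accuracy)
```

Whether the bagged ensemble beats the single tree depends on the data; the point here is only the shape of the API.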

Random Forest

Random Forest is a bagging algorithm with a slight difference. It uses a subset of the training samples and a subset of the features to grow each of its many trees. You can think of it as several decision trees, each fitted to a randomly drawn version of the training set.

Each split decision is based on a random selection of features, which makes the trees differ from one another. This tends to produce a more accurate aggregated result and final prediction.
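A minimal sketch of a random forest with scikit-learn follows. The dataset and hyperparameter values are assumptions for illustration; `max_features="sqrt"` is what limits each split to a random subset of the features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each tree sees a bootstrap sample of the rows, and each split
# considers only a random subset of sqrt(n_features) features
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=0
).fit(X_train, y_train)

forest_accuracy = forest.score(X_test, y_test)
importances = forest.feature_importances_
print("Random forest accuracy:", forest_accuracy)
print("Feature importances:", importances.round(3))
```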

Other examples of algorithms are:

  • Bagged Decision Trees
  • Extra Trees
  • Custom Bagging

Boosting

Boosting is the act of converting weak learners into strong learners. A weak learner is one whose limited capacity keeps it from making accurate predictions on its own. A new weak prediction rule is generated at each step by applying a base learning algorithm. This is done by taking a random sample of the data, feeding it to a model, and then training models sequentially so that each weak learner tries to correct the errors of its predecessor.

Examples of boosting include AdaBoost and XGBoost.

AdaBoost

AdaBoost is short for Adaptive Boosting and is used to boost the performance of a machine learning algorithm. It builds its model from many weak learners, typically very shallow decision trees, and gives more weight to the examples that earlier learners got wrong.

class sklearn.ensemble.AdaBoostClassifier(base_estimator=None, *, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)
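A minimal usage sketch follows. It relies only on `n_estimators`, `learning_rate` and `random_state` from the signature above and leaves the other parameters at their defaults; the synthetic dataset is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# By default the weak learners are depth-1 decision trees ("stumps"),
# fitted one after another on reweighted training data
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
ada.fit(X_train, y_train)

ada_accuracy = ada.score(X_test, y_test)
print("Weak learners fitted:", len(ada.estimators_))
print("AdaBoost accuracy:", ada_accuracy)
```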

XGBoost

XGBoost stands for Extreme Gradient Boosting and is one of the most popular boosting algorithms; it can be used for both regression and classification tasks. It is a supervised machine learning algorithm that aims to predict a target variable accurately by combining an ensemble of weaker models.

Other examples of algorithms are:

  • Gradient Boosting Machine
  • Stochastic Gradient Boosting
  • LightGBM
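To keep the example free of extra dependencies, the sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for the gradient boosting family rather than XGBoost or LightGBM themselves. Setting `subsample` below 1.0 gives the stochastic variant listed above. The dataset and hyperparameter values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Trees are added sequentially, each one fitted to the errors left by
# the ensemble so far; subsample=0.8 trains each tree on 80% of the rows
gbm = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3,
    subsample=0.8, random_state=0,
).fit(X_train, y_train)

gbm_accuracy = gbm.score(X_test, y_test)
print("Gradient boosting accuracy:", gbm_accuracy)
```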

When should I use Bagging vs Boosting?

The simplest way to decide between bagging and boosting is:

  • If the classifier is unstable and has high variance - use Bagging
  • If the classifier is stable but has high bias - use Boosting

Stacking

Stacking is short for Stacked Generalization and is similar to boosting in that it aims to produce more robust predictors. It takes the predictions of the base models and uses them as inputs to build a stronger model. To do so, it learns how best to combine the predictions of several models fitted on the same dataset.

In essence it asks: "If you had a variety of machine learning models that all perform well on a specific problem, how would you choose which model to trust?"
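One way to sketch this with scikit-learn is `StackingClassifier`, where a final estimator learns how to combine the base models' predictions. The choice of base models, the logistic regression meta-model and the synthetic dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Two different base models fitted on the same training data
base_models = [
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("knn", KNeighborsClassifier()),
]

# The final estimator is trained on cross-validated predictions
# of the base models and learns how to weigh them
stack = StackingClassifier(
    estimators=base_models, final_estimator=LogisticRegression(), cv=5
).fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print("Stacking accuracy:", stack_accuracy)
```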

Voting

Voting is an example of stacking, but it works differently for classification and regression tasks.

For regression, the prediction is the average of the predictions of the individual regression models.

class sklearn.ensemble.VotingRegressor(estimators, *, weights=None, n_jobs=None, verbose=False)
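The sketch below checks the averaging behaviour directly: with the default `weights=None`, the ensemble's prediction should equal the plain mean of its fitted members' predictions. The two base regressors and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

voter = VotingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
    ]
).fit(X, y)

# Predictions of the ensemble and of each fitted member
ensemble_pred = voter.predict(X[:5])
member_preds = np.array([est.predict(X[:5]) for est in voter.estimators_])
manual_average = member_preds.mean(axis=0)

print("Ensemble prediction:", ensemble_pred)
print("Mean of members:    ", manual_average)
```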

For classification, voting can be either hard or soft. Hard voting picks the class that receives the most votes, while soft voting combines the probabilities that each model assigns to each class and then picks the class with the highest total probability.

class sklearn.ensemble.VotingClassifier(estimators, *, voting='hard', weights=None, n_jobs=None, flatten_transform=True, verbose=False)
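To see how the two schemes can disagree, here is a small hand-worked sketch in plain Python. The three sets of class probabilities are made-up numbers, not outputs of real models.

```python
from collections import Counter

classes = ["A", "B"]

# Hypothetical class probabilities from three models for one sample
model_probabilities = [
    [0.90, 0.10],  # model 1 is very confident in A
    [0.40, 0.60],  # model 2 leans towards B
    [0.45, 0.55],  # model 3 leans towards B
]

# Hard voting: each model votes for its most likely class
votes = [classes[probs.index(max(probs))] for probs in model_probabilities]
hard_winner = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities per class, then take the maximum
n_models = len(model_probabilities)
average_probs = [
    sum(probs[i] for probs in model_probabilities) / n_models
    for i in range(len(classes))
]
soft_winner = classes[average_probs.index(max(average_probs))]

print("Votes:", votes)                  # ['A', 'B', 'B']
print("Hard voting picks:", hard_winner)  # B
print("Soft voting picks:", soft_winner)  # A
```

Two of the three models prefer B, so hard voting returns B, but model 1's strong confidence pulls the averaged probability of A to about 0.58, so soft voting returns A.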

Other examples of algorithms are:

  • Weighted Average
  • Blending
  • Stacking
  • Super Learner

What is the difference between stacking and bagging/boosting?

Bagging typically uses decision trees, whereas stacking uses different kinds of models. Bagging trains on samples drawn from the training dataset, whereas stacking fits its models on the same dataset.

Boosting uses a sequence of models that turns weak learners into strong learners by correcting the predictions of earlier models, whereas stacking uses a single model to learn how best to combine the predictions of the contributing models.

Wrapping it up

You always need to understand what you are trying to achieve before you attempt to solve a task. Once you do, you can determine whether your task is a classification or a regression task, and then choose which ensemble algorithm is best suited to improve the predictions and robustness of your models.

Source: https://www.kdnuggets.com

#machine-learning 

Thierry  Perret

1657272480

5 Ways to Perform Sentiment Analysis in Python

Whether it is Twitter, Goodreads or Amazon, there is hardly a digital space that is not saturated with opinions. In today's world, it is crucial for organizations to dig into these opinions and gain insights about their products or services. However, these data exist in such astounding quantities that evaluating them manually is a nearly impossible pursuit. This is where another boon of data science comes in: sentiment analysis. In this article, we will explore what sentiment analysis covers and the different ways to implement it in Python.

What is sentiment analysis?

Sentiment analysis is a use case of Natural Language Processing (NLP) and falls under the category of text classification. Put simply, sentiment analysis consists of classifying a text into different sentiments, such as positive or negative, happy, sad or neutral, and so on. The ultimate goal of sentiment analysis is thus to decipher the underlying mood, emotion or sentiment of a text. It is also known as Opinion Mining.

Let's see how a quick Google search defines sentiment analysis:

[Image: definition of sentiment analysis]

Gaining insights and making decisions with sentiment analysis

By now we are somewhat familiar with what sentiment analysis is. But why does it matter, and how do organizations benefit from it? Let's explore this with an example. Suppose you start a business that sells perfumes on an online platform. You offer a wide range of fragrances and soon customers start pouring in. After a while, you decide to change the pricing strategy: you plan to raise the prices of the popular fragrances and at the same time offer discounts on the unpopular ones. To determine which fragrances are popular, you start going through the customer reviews of all the fragrances. But you are stuck! There are so many reviews that you could not go through them all in a lifetime. This is where sentiment analysis can get you out of the impasse.

You simply gather all the reviews in one place and apply sentiment analysis to them. What follows is a schematic representation of sentiment analysis on the reviews of three fragrances: Lavender, Rose and Lemon. (Please note that these reviews may contain spelling, grammar and punctuation mistakes, as in real-world scenarios.)

[Image: sentiment analysis of the reviews of the three fragrances]

From these results, we can clearly see that:

Fragrance-1 (Lavender) has very positive reviews from customers, which indicates that your company can raise its price given its popularity.

Fragrance-2 (Rose) has a neutral outlook among customers, which means your company should not change its price.

Fragrance-3 (Lemon) has an overall negative sentiment associated with it, so your company should consider offering a discount on it to balance things out.

This was just a simple example of how sentiment analysis can help you better understand your products or services and help your organization make decisions.

Use cases of sentiment analysis

We have just seen how sentiment analysis can give organizations insights that help them make data-driven decisions. Let's now look at some other use cases of sentiment analysis.

  1. Social media monitoring for brand management: brands can use sentiment analysis to gauge the public's view of their brand. For example, a company can gather all the Tweets that mention or tag the company and run sentiment analysis on them to learn how the public perceives it.
  2. Product/service analysis: brands and organizations can run sentiment analysis on customer reviews to see how well a product or service is performing in the market and make future decisions accordingly.
  3. Stock price prediction: predicting whether a company's stock will go up or down is crucial for investors. One can estimate this by running sentiment analysis on the headlines of news articles that contain the company's name. If the headlines about a particular organization carry a positive sentiment, its stock price is expected to rise, and vice versa.

Ways to perform sentiment analysis in Python

Python is one of the most powerful tools for data science tasks, and it offers a multitude of ways to perform sentiment analysis. The most popular ones are listed here:

  1. Using TextBlob
  2. Using VADER
  3. Using bag-of-words vectorization-based models
  4. Using LSTM-based models
  5. Using Transformer-based models

Let's dive into them one by one.

Note: For the demonstrations of methods 3 and 4 (using bag-of-words vectorization-based models and using LSTM-based models), a labeled sentiment analysis dataset was used. It comprises more than 5000 text snippets labeled as positive, negative or neutral. The dataset is licensed under Creative Commons.

Using TextBlob

TextBlob is a Python library for natural language processing. Using TextBlob for sentiment analysis is quite simple. It takes text as input and can return polarity and subjectivity as output.

Polarity determines the sentiment of the text. Its values lie in [-1, 1], where -1 denotes a highly negative sentiment and 1 denotes a highly positive sentiment.

Subjectivity determines whether a text input is factual information or a personal opinion. Its value lies in [0, 1], where a value closer to 0 denotes factual information and a value closer to 1 denotes a personal opinion.

Installation:

pip install textblob

Importing TextBlob:

from textblob import TextBlob

Code for sentiment analysis using TextBlob:

Writing code for sentiment analysis with TextBlob is fairly simple. Just import the TextBlob object and pass it the text to analyze, then read the appropriate attributes as follows:

from textblob import TextBlob

text_1 = "The movie was so awesome."
text_2 = "The food here tastes terrible."

# Determining the polarity
p_1 = TextBlob(text_1).sentiment.polarity
p_2 = TextBlob(text_2).sentiment.polarity

# Determining the subjectivity
s_1 = TextBlob(text_1).sentiment.subjectivity
s_2 = TextBlob(text_2).sentiment.subjectivity

print("Polarity of Text 1 is", p_1)
print("Polarity of Text 2 is", p_2)
print("Subjectivity of Text 1 is", s_1)
print("Subjectivity of Text 2 is", s_2)

Output:

Polarity of Text 1 is 1.0
Polarity of Text 2 is -1.0
Subjectivity of Text 1 is 1.0
Subjectivity of Text 2 is 1.0

Using VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analyzer attuned to social media text. Just like TextBlob, it is quite simple to use in Python. We will see how to use it in the code example shortly.

Installation:

pip install vaderSentiment

Importing the SentimentIntensityAnalyzer class from VADER:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

Code for sentiment analysis using VADER:

First, we create an object of the SentimentIntensityAnalyzer class; then we pass the text to the object's polarity_scores() method as follows:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sentiment = SentimentIntensityAnalyzer()
text_1 = "The book was a perfect balance between wrtiting style and plot."
text_2 =  "The pizza tastes terrible."
sent_1 = sentiment.polarity_scores(text_1)
sent_2 = sentiment.polarity_scores(text_2)
print("Sentiment of text 1:", sent_1)
print("Sentiment of text 2:", sent_2)

Output:

Sentiment of text 1: {'neg': 0.0, 'neu': 0.73, 'pos': 0.27, 'compound': 0.5719} 
Sentiment of text 2: {'neg': 0.508, 'neu': 0.492, 'pos': 0.0, 'compound': -0.4767}

As we can see, the polarity_scores() method returns a dictionary of sentiment scores for the text being analyzed.

Using bag-of-words vectorization-based models

In the two approaches discussed so far, TextBlob and VADER, we simply used Python libraries to perform sentiment analysis. We will now discuss an approach in which we train our own model for the task. The steps involved in sentiment analysis with the bag-of-words vectorization method are as follows:

  1. Pre-process the text of the training data (text pre-processing involves normalization, tokenization, stop-word removal and stemming/lemmatization).
  2. Create a bag of words for the pre-processed text data using either count vectorization or TF-IDF vectorization.
  3. Train a suitable classification model on the processed data for sentiment classification.
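To make step 2 concrete, here is a hand-rolled sketch of count vectorization in plain Python. It is only an illustration of the idea on two made-up sentences; a real project would normally rely on a library vectorizer instead.

```python
from collections import Counter

# Two tiny made-up documents
documents = ["the cat sat", "the cat and the dog"]

# Tokenize by whitespace (real pipelines also normalize and drop stop words)
tokenized = [doc.split() for doc in documents]

# The vocabulary is the sorted set of all words seen in the corpus
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Each document becomes a vector of word counts, one slot per vocabulary word
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['and', 'cat', 'dog', 'sat', 'the']
print(vectors)     # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```

Word order is discarded: each document is described only by how often each vocabulary word occurs, which is what the classifier in step 3 is trained on.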

Code for sentiment analysis using the bag-of-words vectorization approach:

To build a sentiment analysis model with the BOW vectorization approach, we need a labeled dataset. The dataset used for this demonstration was obtained from Kaggle. We simply used sklearn's count vectorizer to create the BOW. Then we trained a Multinomial Naive Bayes classifier and measured its accuracy score on the held-out test data.

The dataset can be obtained from here.

# Loading the dataset
import pandas as pd
data = pd.read_csv('Finance_data.csv')

# Pre-processing and bag-of-words vectorization using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(data['sentences'])

# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, data['feedback'], test_size=0.25, random_state=5)

# Training the model
from sklearn.naive_bayes import MultinomialNB
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

# Calculating the accuracy score of the model
from sklearn import metrics
predicted = MNB.predict(X_test)
accuracy_score = metrics.accuracy_score(Y_test, predicted)
print("Accuracy Score: ", accuracy_score)

Output:

Accuracy Score:  0.9111675126903553

The trained classifier can be used to predict the sentiment of any given text input.

Using LSTM-based models

Although we obtained a decent accuracy score with the bag-of-words vectorization method, it may not give the same results on larger datasets. This creates the need for deep learning-based models to train the sentiment analysis model.

For NLP tasks, we generally use RNN-based models because they are designed to handle sequential data. Here, we will train an LSTM (Long Short Term Memory) model using TensorFlow with Keras. The steps to perform sentiment analysis with LSTM-based models are as follows:

  1. Pre-process the text of the training data (text pre-processing involves normalization, tokenization, stop-word removal and stemming/lemmatization).
  2. Import Tokenizer from keras.preprocessing.text and create an instance of it. Fit the tokenizer on the whole training text (so that the Tokenizer learns the vocabulary of the training data). Convert the texts to integer sequences with the Tokenizer's texts_to_sequences() method and pad them to equal length. (These sequences are numeric representations of the text. Since we cannot feed raw text to the model directly, we first convert it to numbers, which the model's embedding layer then turns into vectors.)
  3. Once the sequences are generated, we are ready to build the model. We build the model with TensorFlow, adding Input, LSTM and dense layers to it. Add dropout and tune the hyperparameters to get a decent accuracy score. We generally tend to use the ReLU or LeakyReLU activation functions in the inner layers, as they help avoid the vanishing gradient problem. At the output layer, we use the Softmax or Sigmoid activation function.
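The plain-Python sketch below mimics what step 2 does conceptually: words are mapped to integer indices (most frequent first, starting at 1) and shorter sequences are left-padded with zeros. It is an illustration of the idea on two made-up sentences, not the Keras implementation itself, and the helper names are hypothetical.

```python
from collections import Counter

texts = ["good movie", "bad movie really bad"]

# Count word frequencies across the corpus (insertion order is preserved)
word_counts = Counter(word for text in texts for word in text.split())

# Most frequent words get the smallest indices; index 0 is kept for padding
ordered_words = sorted(word_counts.items(), key=lambda item: -item[1])
word_index = {word: i + 1 for i, (word, _) in enumerate(ordered_words)}

def to_sequence(text):
    """Replace each known word with its integer index (hypothetical helper)."""
    return [word_index[word] for word in text.split() if word in word_index]

def pad(sequences):
    """Left-pad every sequence with zeros to the longest length (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    return [[0] * (max_len - len(seq)) + seq for seq in sequences]

sequences = [to_sequence(text) for text in texts]
padded = pad(sequences)

print(word_index)  # {'movie': 1, 'bad': 2, 'good': 3, 'really': 4}
print(sequences)   # [[3, 1], [2, 1, 4, 2]]
print(padded)      # [[0, 0, 3, 1], [2, 1, 4, 2]]
```

The padded rows all have the same length, which is what lets them be stacked into a single array and fed to the model.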

Code for sentiment analysis using an LSTM-based model approach:

Here, we used the same dataset as for the BOW approach. A training accuracy of 0.90 was obtained.

# Importing necessary libraries (written for TensorFlow 2.x / Keras 2.x)
import pandas as pd
from textblob import Word
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Embedding, LSTM, LeakyReLU, SpatialDropout1D

# Loading the dataset
data = pd.read_csv('Finance_data.csv')

# Pre-processing the text
def cleaning(df, stop_words):
    # Lowercasing
    df['sentences'] = df['sentences'].apply(lambda x: ' '.join(w.lower() for w in x.split()))
    # Removing digits/numbers
    df['sentences'] = df['sentences'].str.replace(r'\d+', '', regex=True)
    # Removing stop words
    df['sentences'] = df['sentences'].apply(lambda x: ' '.join(w for w in x.split() if w not in stop_words))
    # Lemmatization
    df['sentences'] = df['sentences'].apply(lambda x: ' '.join(Word(w).lemmatize() for w in x.split()))
    return df

stop_words = stopwords.words('english')
data_cleaned = cleaning(data, stop_words)

# Converting the text to padded integer sequences using the tokenizer
# (assumes the text lives in the 'sentences' column, as in the cleaning step)
tokenizer = Tokenizer(num_words=500, split=' ')
tokenizer.fit_on_texts(data_cleaned['sentences'].values)
X = tokenizer.texts_to_sequences(data_cleaned['sentences'].values)
X = pad_sequences(X)

# One-hot encoding the labels and splitting the data
# (assumption: labels are in the 'feedback' column and the split mirrors the BOW example)
y = pd.get_dummies(data_cleaned['feedback']).values.astype('float32')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=5)

# Model building
model = Sequential()
model.add(Embedding(500, 120, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(704, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(352))
model.add(LeakyReLU())
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

# Model training
model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=1)

# Model testing
model.evaluate(X_test, y_test)

Using Transformer-based models

Transformer-based models are among the most advanced natural language processing techniques. They follow an encoder-decoder architecture and use self-attention to deliver impressive results. Although one can always build a transformer model from scratch, it is quite a tedious task. Instead, we can use the pre-trained transformer models available on Hugging Face. Hugging Face is an open-source AI community that offers a multitude of pre-trained models for NLP applications. These models can be used as-is or fine-tuned for specific tasks.

Installation:

pip install transformers

Importing the transformers library:

import transformers

Code for sentiment analysis using Transformer-based models:

To perform any task with transformers, we first need to import the pipeline function from transformers. Then a pipeline object is created, and the task to perform is passed as an argument (sentiment analysis in our case). We can also specify the model to use for the task. Here, since we have not specified a model, distilbert-base-uncased-finetuned-sst-2-english is used by default for sentiment analysis. You can consult the list of available tasks and models on the Hugging Face site.

from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["It was the best of times.", "t was the worst of times."]
sentiment_pipeline(data)

Output:

[{'label': 'POSITIVE', 'score': 0.999457061290741}, {'label': 'NEGATIVE', 'score': 0.9987301230430603}]

Conclusion

In this era where users can express their views effortlessly and data is generated in abundance within fractions of a second, drawing insights from that data is vital for organizations to make effective decisions, and sentiment analysis proves to be the missing piece of the puzzle!

We have now covered in detail what exactly sentiment analysis involves and the different methods one can use to perform it in Python. But these were only a few rudimentary demonstrations; you should definitely go ahead and play with the models and try them on your own data.

Source: https://www.analyticsvidhya.com/blog/2022/07/sentiment-analysis-using-python/

Go-perfbook: Thoughts on Go Performance Optimization

go-perfbook

This document outlines best practices for writing high-performance Go code.

The first sections cover writing optimized code in any language. The later sections cover Go-specific techniques.

Writing and Optimizing Go code

While there will be some discussion of making individual services faster (caching, etc.), designing performant distributed systems is beyond the scope of this work. There are already good texts on monitoring and distributed system design. Optimizing distributed systems encompasses an entirely different set of research and design trade-offs.

All the content will be licensed under CC-BY-SA.

This book is split into different sections:

  1. Basic tips for writing not-slow software
    • CS 101-level stuff
  2. Tips for writing fast software
    • Go-specific sections on how to get the best from Go
  3. Advanced tips for writing really fast software
    • For when your optimized code isn't fast enough

We can summarize these three sections as:

  1. "Be reasonable"
  2. "Be deliberate"
  3. "Be dangerous"

When and Where to Optimize

I'm putting this first because it's really the most important step. Should you even be doing this at all?

Every optimization has a cost. Generally, this cost is expressed in terms of code complexity or cognitive load -- optimized code is rarely simpler than the unoptimized version.

But there's another side that I'll call the economics of optimization. As a programmer, your time is valuable. There's the opportunity cost of what else you could be working on for your project, which bugs to fix, which features to add. Optimizing things is fun, but it's not always the right task to choose. Performance is a feature, but so is shipping, and so is correctness.

Choose the most important thing to work on. Sometimes it's not an actual CPU optimization, but a user-experience one. Something as simple as adding a progress bar, or making a page more responsive by doing computation in the background after rendering the page.

Sometimes this will be obvious: an hourly report that completes in three hours is probably less useful than one that completes in less than one.

Just because something is easy to optimize doesn't mean it's worth optimizing. Ignoring low-hanging fruit is a valid development strategy.

Think of this as optimizing your time.

You get to choose what to optimize and when to optimize. You can move the slider between "Fast Software" and "Fast Deployment".

People hear and mindlessly repeat "premature optimization is the root of all evil", but they miss the full context of the quote.

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

-- Knuth

Add: https://www.youtube.com/watch?time_continue=429&v=3WBaY61c9sE

  • don't ignore the easy optimizations
  • more knowledge of algorithms and data structures makes more optimizations "easy" or "obvious"

Should you optimize?

"Yes, but only if the problem is important, the program is genuinely too slow, and there is some expectation that it can be made faster while maintaining correctness, robustness, and clarity."

-- The Practice of Programming, Kernighan and Pike

Premature optimization can also hurt you by tying you into certain decisions. The optimized code can be harder to modify if requirements change and harder to throw away (sunk-cost fallacy) if needed.

BitFunnel performance estimation has some numbers that make this trade-off explicit. Imagine a hypothetical search engine needing 30,000 machines across multiple data centers. These machines have a cost of approximately $1,000 USD per year. If you can double the speed of the software, this can save the company $15M USD per year. Even a single developer spending an entire year to improve performance by only 1% will pay for itself.
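The arithmetic behind these figures is easy to check. The sketch below uses only the numbers quoted above; the function name, and the simplifying assumption that the machine count scales inversely with software speed, are introduced here for illustration.

```python
def yearly_savings(machines: int, cost_per_machine: float, speedup: float) -> float:
    """Yearly savings if software that is `speedup` times faster needs
    proportionally fewer machines (a simplifying assumption)."""
    current_cost = machines * cost_per_machine
    return current_cost - current_cost / speedup


doubled = yearly_savings(30_000, 1_000, 2.0)       # software twice as fast
one_percent = yearly_savings(30_000, 1_000, 1.01)  # software 1% faster

print(f"2x speedup saves ${doubled:,.0f} per year")
print(f"1% speedup saves ${one_percent:,.0f} per year")
```

Under this model a 1% improvement is worth a little under $300K per year, which is the basis for the claim that a year of one developer's time can pay for itself.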

In the vast majority of cases, the size and speed of a program are not a concern. The easiest optimization is not having to do it. The second easiest optimization is just buying faster hardware.

Once you've decided you're going to change your program, keep reading.

How to Optimize

Optimization Workflow

Before we get into the specifics, let's talk about the general process of optimization.

Optimization is a form of refactoring. But each step, rather than improving some aspect of the source code (code duplication, clarity, etc), improves some aspect of the performance: lower CPU, memory usage, latency, etc. This improvement generally comes at the cost of readability. This means that in addition to a comprehensive set of unit tests (to ensure your changes haven't broken anything), you also need a good set of benchmarks to ensure your changes are having the desired effect on performance. You must be able to verify that your change really is lowering CPU. Sometimes a change you thought would improve performance will actually turn out to have a zero or negative change. Always make sure you undo your fix in these cases.

What is the best comment in source code you have ever encountered? - Stack Overflow:

//
// Dear maintainer:
//
// Once you are done trying to 'optimize' this routine,
// and have realized what a terrible mistake that was,
// please increment the following counter as a warning
// to the next guy:
//
// total_hours_wasted_here = 42
//

The benchmarks you are using must be correct and provide reproducible numbers on representative workloads. If individual runs have too high a variance, it will make small improvements more difficult to spot. You will need to use benchstat or equivalent statistical tests and won't be able just to eyeball it. (Note that using statistical tests is a good idea anyway.) The steps to run the benchmarks should be documented, and any custom scripts and tooling should be committed to the repository with instructions for how to run them. Be mindful of large benchmark suites that take a long time to run: it will make the development iterations slower.
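As a language-neutral illustration of why a single run is not enough, the sketch below times a function several times and reports the mean and spread. It uses Python's timeit and statistics modules rather than benchstat; the `work` function, the `measure` helper, and the repeat counts are placeholders chosen for this example.

```python
import statistics
import timeit


def work() -> int:
    # Placeholder workload; substitute the code you actually care about.
    return sum(i * i for i in range(10_000))


def measure(func, repeats: int = 10, number: int = 20) -> dict:
    """Collect `repeats` samples, each averaging `number` calls of `func`,
    and summarize the per-call time in seconds."""
    samples = [t / number for t in timeit.repeat(func, repeat=repeats, number=number)]
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean, "min": min(samples)}


stats = measure(work)
print(f"mean {stats['mean'] * 1e6:.1f} us, "
      f"stdev {stats['stdev'] * 1e6:.1f} us, cv {stats['cv']:.1%}")
```

If the coefficient of variation (cv) is larger than the improvement you hope to detect, eyeballing two runs will not tell you anything; you need more samples, a quieter machine, or a proper statistical comparison.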

Note also that anything that can be measured can be optimized. Make sure you're measuring the right thing.

The next step is to decide what you are optimizing for. If the goal is to improve CPU, what is an acceptable speed? Do you want to improve the current performance by 2x? 10x? Can you state it as "a problem of size N in less than time T"? Are you trying to reduce memory usage? By how much? How much slower is acceptable for what change in memory usage? What are you willing to give up in exchange for lower space requirements?

Optimizing for service latency is a trickier proposition. Entire books have been written on how to performance test web servers. The primary issue is that for a single function, performance is fairly consistent for a given problem size. For webservices, you don't have a single number. A proper web-service benchmark suite will provide a latency distribution for a given reqs/second level. This talk gives a good overview of some of the issues: "How NOT to Measure Latency" by Gil Tene
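To see why a single number is misleading for a service, the sketch below summarizes a synthetic set of request latencies. The data and the nearest-rank `percentile` helper are illustrative assumptions, not output from a real load-testing tool.

```python
import math
import statistics


def percentile(sorted_values, p: int) -> float:
    """Nearest-rank percentile of an already sorted list (0 < p <= 100)."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


# Synthetic latencies in milliseconds: 98% fast requests, 2% slow outliers.
latencies = sorted([10.0] * 980 + [500.0] * 20)

mean = statistics.mean(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"mean={mean:.1f} ms  p50={p50:.1f} ms  p99={p99:.1f} ms")
```

The mean sits between the two populations and describes neither of them, while the p50 and p99 values show that most requests are fast and that one in fifty is dramatically slower.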

TODO: See the later section on optimizing web services

The performance goals must be specific. You will (almost) always be able to make something faster. Optimizing is frequently a game of diminishing returns. You need to know when to stop. How much effort are you going to put into getting the last little bit of performance? How much uglier and harder to maintain are you willing to make the code?

Dan Luu's previously mentioned talk on BitFunnel performance estimation shows an example of using rough calculations to determine if your target performance figures are reasonable.

Simon Eskildsen has a talk from SRECon covering this topic in more depth: Advanced Napkin Math: Estimating System Performance from First Principles

Finally, Jon Bentley's "Programming Pearls" has a chapter titled "The Back of the Envelope" covering Fermi problems. Sadly, these kinds of estimation skills got a bad rap thanks to their use in Microsoft-style "puzzle interview questions" in the 1990s and early 2000s.

For greenfield development, you shouldn't leave all benchmarking and performance numbers until the end. It's easy to say "we'll fix it later", but if performance is really important it will be a design consideration from the start. Any significant architectural changes required to fix performance issues will be too risky near the deadline. Note that during development, the focus should be on reasonable program design, algorithms, and data structures. Optimizing at lower-levels of the stack should wait until later in the development cycle when a more complete view of the system performance is available. Any full-system profiles you do while the system is incomplete will give a skewed view of where the bottlenecks will be in the finished system.

TODO: How to avoid/detect "Death by 1000 cuts" from poorly written software. Solution: "Premature pessimization is the root of all evil". This matches with my Rule 1: Be deliberate. You don't need to write every line of code to be fast, but neither should you do wasteful things by default.

"Premature pessimization is when you write code that is slower than it needs to be, usually by asking for unnecessary extra work, when equivalently complex code would be faster and should just naturally flow out of your fingers."

-- Herb Sutter

Benchmarking as part of CI is hard because of noisy neighbours, and even different CI boxes if it's just you, so it's hard to gate on performance metrics. A good middle ground is to have benchmarks run by the developer (on appropriate hardware) and included in the commit message for commits that specifically address performance. For those that are just general patches, try to catch performance degradations "by eye" in code review.

TODO: how to track performance over time?

Write code that you can benchmark. Profiling is something you can do on larger systems; benchmarking is for testing isolated pieces. You need to be able to extract and set up enough context that the benchmarks test enough and are representative.

The difference between what your target is and the current performance will also give you an idea of where to start. If you need only a 10-20% performance improvement, you can probably get that with some implementation tweaks and smaller fixes. If you need a factor of 10x or more, then just replacing a multiplication with a left-shift isn't going to cut it. That's probably going to call for changes up and down your stack, possibly redesigning large portions of the system with these performance goals in mind.

Good performance work requires knowledge at many different levels, from system design, networking, and hardware (CPU, caches, storage) to algorithms, tuning, and debugging. With limited time and resources, consider which level will give the most improvement: it won't always be an algorithm or program tuning.

In general, optimizations should proceed from top to bottom. Optimizations at the system level will have more impact than expression-level ones. Make sure you're solving the problem at the appropriate level.

This book is mostly going to talk about reducing CPU usage, reducing memory usage, and reducing latency. It's worth pointing out that you can very rarely do all three. Maybe CPU time is lower, but now your program uses more memory. Maybe you need to reduce memory space, but now the program will take longer.

Amdahl's Law tells us to focus on the bottlenecks. If you double the speed of a routine that only takes 5% of the runtime, that's only a 2.5% speedup in total wall-clock time. On the other hand, speeding up a routine that takes 80% of the time by only 10% will improve runtime by almost 8%. Profiles will help identify where time is actually spent.
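The two figures above follow from the usual statement of Amdahl's Law: if a fraction p of the runtime is made s times faster, the overall speedup is 1 / ((1 - p) + p / s). A minimal sketch, with function names chosen here for illustration:

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's Law: total speedup when `fraction` of the runtime
    is made `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def runtime_reduction(fraction: float, local_speedup: float) -> float:
    """Fraction of total wall-clock time saved by the same change."""
    return 1.0 - 1.0 / overall_speedup(fraction, local_speedup)


small_hot_fix = runtime_reduction(0.05, 2.0)   # double the speed of a 5% routine
big_modest_fix = runtime_reduction(0.80, 1.1)  # make an 80% routine 10% faster

print(f"5% routine, 2x faster:   {small_hot_fix:.1%} of runtime saved")
print(f"80% routine, 10% faster: {big_modest_fix:.1%} of runtime saved")
```

The modest change to the dominant routine saves roughly three times as much wall-clock time as the dramatic change to the minor one.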

When optimizing, you want to reduce the amount of work the CPU has to do. Quicksort is faster than bubble sort because it solves the same problem (sorting) in fewer steps. It's a more efficient algorithm. You've reduced the work the CPU needs to do in order to accomplish the same task.

Program tuning, like compiler optimizations, will generally make only a small dent in the total runtime. Large wins will almost always come from an algorithmic change or data structure change, a fundamental shift in how your program is organized. Compiler technology improves, but slowly. Proebsting's Law says compilers double in performance every 18 years, a stark contrast with the (slightly misunderstood) interpretation of Moore's Law that doubles processor performance every 18 months. Algorithmic improvements work at larger magnitudes. Algorithms for mixed integer programming improved by a factor of 30,000 between 1991 and 2008. For a more concrete example, consider this breakdown of replacing a brute force geo-spatial algorithm described in an Uber blog post with a more specialized one better suited to the presented task. There is no compiler switch that will give you an equivalent boost in performance.

TODO: Optimizing floating point FFT and MMM algorithm differences in gttse07.pdf

A profiler might show you that lots of time is spent in a particular routine. It could be this is an expensive routine, or it could be a cheap routine that is just called many many times. Rather than immediately trying to speed up that one routine, see if you can reduce the number of times it's called or eliminate it completely. We'll discuss more concrete optimization strategies in the next section.

The Three Optimization Questions:

  • Do we have to do this at all? The fastest code is the code that's never run.
  • If yes, is this the best algorithm?
  • If yes, is this the best implementation of this algorithm?

Concrete optimization tips

Jon Bentley's 1982 work "Writing Efficient Programs" approached program optimization as an engineering problem: Benchmark. Analyze. Improve. Verify. Iterate. A number of his tips are now done automatically by compilers. A programmer's job is to use the transformations compilers can't do.

Summaries of the book and of its program tuning rules are available online.

When thinking of changes you can make to your program, there are two basic options: you can either change your data or you can change your code.

Data Changes

Changing your data means either adding to or altering the representation of the data you're processing. From a performance perspective, some of these will end up changing the O() associated with different aspects of the data structure. This may even include preprocessing the input to be in a different, more useful format.

Ideas for augmenting your data structure:

Extra fields

The classic example of this is storing the length of a linked list in a field in the root node. It takes a bit more work to keep it updated, but then querying the length becomes a simple field lookup instead of an O(n) traversal. Your data structure might present a similar win: a bit of bookkeeping during some operations in exchange for some faster performance on a common use case.
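A minimal sketch of the length-field idea follows, written in Python for brevity. The class and method names are hypothetical; `slow_len` is included only to show the O(n) traversal that the extra field avoids.

```python
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


class LinkedList:
    def __init__(self):
        self.head = None
        self._length = 0  # extra field, kept up to date on every mutation

    def push_front(self, value):
        self.head = Node(value, self.head)
        self._length += 1

    def pop_front(self):
        if self.head is None:
            raise IndexError("pop from empty list")
        node = self.head
        self.head = node.next
        self._length -= 1
        return node.value

    def __len__(self):
        return self._length  # O(1): a simple field lookup

    def slow_len(self):
        # O(n): what querying the length costs without the extra field
        count, node = 0, self.head
        while node is not None:
            count += 1
            node = node.next
        return count


lst = LinkedList()
for v in ("a", "b", "c"):
    lst.push_front(v)
lst.pop_front()
print(len(lst), lst.slow_len())
```

The bookkeeping cost is one integer update per insert or removal, and every mutation path has to remember to do it.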

Similarly, storing pointers to frequently needed nodes instead of performing additional searches. This covers things like the "backwards" links in a doubly-linked list to make node removal O(1). Some skip lists keep a "search finger", where you store a pointer to where you just were in your data structure on the assumption it's a good starting point for your next operation.

Extra search indexes

Most data structures are designed for a single type of query. If you need two different query types, having an additional "view" onto your data can be a large improvement. For example, a set of structs might have a primary ID (integer) that you use to look up in a slice, but sometimes need to look up with a secondary ID (string). Instead of iterating over the slice, you can augment your data structure with a map either from string to ID or directly to the struct itself.
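The primary-ID-plus-secondary-index arrangement might look like the sketch below. It is written in Python, with a list standing in for the slice and a dict for the map; `Record`, `Store`, and their methods are names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    id: int    # primary ID, also the record's position in the list
    name: str  # secondary ID


class Store:
    def __init__(self):
        self._records = []  # primary storage, indexed by integer ID
        self._by_name = {}  # extra index: name -> primary ID

    def add(self, name: str) -> Record:
        rec = Record(id=len(self._records), name=name)
        self._records.append(rec)
        self._by_name[name] = rec.id  # keep the index in sync
        return rec

    def get(self, record_id: int) -> Record:
        return self._records[record_id]

    def find(self, name: str) -> Optional[Record]:
        # O(1) average instead of scanning every record
        record_id = self._by_name.get(name)
        return None if record_id is None else self._records[record_id]


store = Store()
for n in ("alice", "bob", "carol"):
    store.add(n)
print(store.find("bob"), store.find("dave"))
```

The cost is extra memory for the index and the obligation to update it wherever records are added, renamed, or removed.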

Extra information about elements

For example, keeping a bloom filter of all the elements you've inserted can let you quickly return "no match" for lookups. These need to be small and fast to not overwhelm the rest of the data structure. (If a lookup in your main data structure is cheap, the cost of the bloom filter will outweigh any savings.)
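The sketch below shows the shape of such a filter. The sizes, the class name, and the choice of SHA-256 with double hashing are assumptions made to keep the example short and dependency-free; a filter meant to be "small and fast" would more likely use a cheap non-cryptographic hash.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter()
for word in ("apple", "banana"):
    seen.add(word)
print(seen.might_contain("apple"), seen.might_contain("cherry"))
```

A lookup would consult `might_contain` first and only fall through to the expensive main structure when it returns True.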

If queries are expensive, add a cache.

At a larger level, an in-process or external cache (like memcache) can help. It might be excessive for a single data structure. We'll cover more about caches below.

These sorts of changes are useful when the data you need is cheap to store and easy to keep up-to-date.

These are all clear examples of "do less work" at the data structure level. They all cost space. Most of the time if you're optimizing for CPU, your program will use more memory. This is the classic space-time trade-off.

It's important to examine how this tradeoff can affect your solutions -- it's not always straight-forward. Sometimes a small amount of memory can give a significant speedup, sometimes the tradeoff is linear (2x memory usage == 2x performance speedup), sometimes it's significantly worse: a huge amount of memory gives only a small speedup. Where you need to be on this memory/performance curve can affect what algorithm choices are reasonable. It's not always possible to just tune an algorithm parameter. Different memory usages might be completely different algorithmic approaches.

Lookup tables also fall into this space-time trade-off. A simple lookup table might just be a cache of previously requested computations.

If the domain is small enough, the entire set of results could be precomputed and stored in the table. As an example, this could be the approach taken for a fast popcount implementation, whereby the number of set bits in a byte is stored in a 256-entry table. A larger table could store the bits required for all 16-bit words. In this case, they're storing exact results.
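A table-driven popcount along these lines can be sketched as follows. The table and function names are chosen here, and the byte-at-a-time loop is one simple way to apply the table to wider integers.

```python
# 256-entry table: POPCOUNT_TABLE[b] is the number of set bits in byte b.
POPCOUNT_TABLE = bytes(bin(b).count("1") for b in range(256))


def popcount(x: int) -> int:
    """Count the set bits of a non-negative integer one byte at a time."""
    count = 0
    while x:
        count += POPCOUNT_TABLE[x & 0xFF]  # exact precomputed result
        x >>= 8
    return count


print(popcount(0b1011), popcount(0xFFFF))
```

The table costs 256 bytes and replaces up to eight bit tests per byte with a single lookup.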

A number of algorithms for trigonometric functions use lookup tables as a starting point for a calculation.

If your program uses too much memory, it's also possible to go the other way. Reduce space usage in exchange for increased computation. Rather than storing things, calculate them every time. You can also compress the data in memory and decompress it on the fly when you need it.

If the data you're processing is on disk, instead of loading everything into RAM, you could create an index for the pieces you need and keep that in memory, or pre-process the file into smaller workable chunks.

Small Memory Software is a book available online covering techniques for reducing the space used by your programs. While it was originally written targeting embedded developers, the ideas are applicable for programs on modern hardware dealing with huge amounts of data.

Rearrange your data

Eliminate structure padding. Remove extra fields. Use a smaller data type.

Change to a slower data structure

Simpler data structures frequently have lower memory requirements. For example, moving from a pointer-heavy tree structure to a slice with linear search instead.

Custom compression format for your data

Compression algorithms depend very heavily on what is being compressed. It's best to choose one that suits your data. If you have []byte, then something like snappy, gzip, or lz4 behaves well. For floating point data there is go-tsz for time series and fpc for scientific data. Lots of research has been done around compressing integers, generally for information retrieval in search engines. Examples range from delta encoding and varints to more complex schemes involving Huffman encoded xor-differences. You can also come up with custom compression formats optimized for exactly your data.
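As a small illustration of the integer case, the sketch below combines delta encoding with varints for a sorted list of non-negative integers, such as document IDs in a posting list. The function names and sample values are made up for this example, and the varint layout (7 data bits per byte, high bit as a continuation flag) is the common little-endian form.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int using 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress_sorted(values) -> bytes:
    """Store each value as the varint-encoded gap from its predecessor."""
    out = bytearray()
    prev = 0
    for v in values:
        out += encode_varint(v - prev)
        prev = v
    return bytes(out)


def decompress_sorted(data: bytes) -> list:
    values, current, shift, prev = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += current  # undo the delta
            values.append(prev)
            current, shift = 0, 0
    return values


ids = [1000, 1003, 1010, 1500, 100000]
packed = compress_sorted(ids)
print(len(packed), "bytes instead of", 8 * len(ids))
```

Because sorted IDs tend to have small gaps, most deltas fit in one or two bytes, which is where the saving over fixed 8-byte integers comes from.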

Do you need to inspect the data or can it stay compressed? Do you need random access or only streaming? If you need access to individual entries but don't want to decompress the entire thing, you can compress the data in smaller blocks and keep an index indicating what range of entries are in each block. Access to a single entry just needs to check the index and unpack the smaller data block.
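The block-plus-index layout described above can be sketched like this. The example uses zlib from the Python standard library as a stand-in for whichever compressor suits your data; `BlockStore`, the block size, and the newline-joined payload (which assumes entries contain no newlines) are all illustrative choices.

```python
import bisect
import zlib


class BlockStore:
    """Keeps entries in independently compressed blocks with a small index."""

    def __init__(self, entries, block_size: int = 4):
        self.first_index = []  # index of the first entry held by each block
        self.blocks = []       # compressed payload of each block
        self.count = len(entries)
        for start in range(0, len(entries), block_size):
            chunk = entries[start:start + block_size]
            payload = "\n".join(chunk).encode("utf-8")
            self.first_index.append(start)
            self.blocks.append(zlib.compress(payload))

    def get(self, i: int) -> str:
        if not 0 <= i < self.count:
            raise IndexError(i)
        # Check the index to find the one block that holds entry i ...
        b = bisect.bisect_right(self.first_index, i) - 1
        # ... and unpack only that block.
        chunk = zlib.decompress(self.blocks[b]).decode("utf-8").split("\n")
        return chunk[i - self.first_index[b]]


entries = [f"entry-{i}" for i in range(10)]
store = BlockStore(entries)
print(store.get(7), len(store.blocks))
```

Smaller blocks make single-entry access cheaper but usually compress worse, so the block size is a tuning knob to benchmark.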

If your data is not just in-process but will be written to disk, what about data migration or adding/removing fields? You'll now be dealing with raw []byte instead of nice structured Go types, so you'll need unsafe and to consider serialization options.

We will talk more about data layouts later.

Modern computers and the memory hierarchy make the space/time trade-off less clear. It's very easy for lookup tables to be "far away" in memory (and therefore expensive to access) making it faster to just recompute a value every time it's needed.

This also means that benchmarking will frequently show improvements that are not realized in the production system due to cache contention (e.g., lookup tables are in the processor cache during benchmarking but always flushed by "real data" when used in a real system). Google's Jump Hash paper in fact addressed this directly, comparing performance on both a contended and uncontended processor cache. (See graphs 4 and 5 in the Jump Hash paper.)

TODO: how to simulate a contended cache, show incremental cost

TODO: sync.Map as a Go-ish example of cache-contention addressing

Another aspect to consider is data-transfer time. Generally network and disk access are very slow, and so being able to load a compressed chunk will be much faster than the extra CPU time required to decompress the data once it has been fetched. As always, benchmark. A binary format will generally be smaller and faster to parse than a text one, but at the cost of no longer being as human readable.

For data transfer, move to a less chatty protocol, or augment the API to allow partial queries. For example, an incremental query rather than being forced to fetch the entire dataset each time.

Algorithmic Changes

If you're not changing the data, the other main option is to change the code.

The biggest improvement is likely to come from an algorithmic change. This is the equivalent of replacing bubble sort (O(n^2)) with quicksort (O(n log n)) or replacing a linear scan through an array (O(n)) with a binary search (O(log n)) or a map lookup (O(1)).

This is how software becomes slow. Structures originally designed for one use are repurposed for something they weren't designed for. This happens gradually.

It's important to have an intuitive grasp of the different big-O levels. Choose the right data structure for your problem. You don't have to always shave cycles, but this just prevents dumb performance issues that might not be noticed until much later.

The basic classes of complexity are:

O(1): a field access, array or map lookup

Advice: don't worry about it (but keep in mind the constant factor.)

O(log n): binary search

Advice: only a problem if it's in a loop

O(n): simple loop

Advice: you're doing this all the time

O(n log n): divide-and-conquer, sorting

Advice: still fairly fast

O(n*m): nested loop / quadratic

Advice: be careful and constrain your set sizes

Anything else between quadratic and subexponential

Advice: don't run this on a million rows

O(b ^ n), O(n!): exponential and up

Advice: good luck if you have more than a dozen or two data points

Link: http://bigocheatsheet.com

Let's say you need to search through an unsorted set of data. "I should use a binary search" you think, knowing that a binary search is O(log n) which is faster than the O(n) linear scan. However, a binary search requires that the data is sorted, which means you'll need to sort it first, which will take O(n log n) time. If you're doing lots of searches, then the upfront cost of sorting will pay off. On the other hand, if you're mostly doing lookups, maybe having an array was the wrong choice and you'd be better off paying the O(1) lookup cost for a map instead.
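The three options in this paragraph can be put side by side. In the sketch below (Python, with a set playing the role of the map), the helper names and the sample data are invented for illustration.

```python
import bisect


def linear_contains(items, target) -> bool:
    # O(n) per lookup, no preparation needed
    return any(x == target for x in items)


def sorted_contains(sorted_items, target) -> bool:
    # O(log n) per lookup, but only valid on sorted input
    i = bisect.bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target


data = [42, 7, 19, 3, 88, 61]

sorted_data = sorted(data)  # O(n log n), paid once up front
lookup = set(data)          # O(n) to build, O(1) average per lookup

for q in (7, 8, 88, 100):
    print(q, linear_contains(data, q), sorted_contains(sorted_data, q), q in lookup)
```

Which one wins depends on how many lookups amortize the preparation cost and on whether you also need ordered traversal, which the set does not give you.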

Being able to analyze your problem in terms of big-O notation also means you can figure out if you're already at the limit for what is possible for your problem, and if you need to change approaches in order to speed things up. For example, finding the minimum of an unsorted list is O(n), because you have to look at every single item. There's no way to make that faster.

If your data structure is static, then you can generally do much better than the dynamic case. It becomes easier to build an optimal data structure customized for exactly your lookup patterns. Solutions like minimal perfect hashing can make sense here, or precomputed bloom filters. This also makes sense if your data structure is "static" for long enough and you can amortize the up-front cost of construction across many lookups.

Choose the simplest reasonable data structure and move on. This is CS 101 for writing "not-slow software". This should be your default development mode. If you know you need random access, don't choose a linked-list. If you know you need in-order traversal, don't use a map. Requirements change and you can't always guess the future. Make a reasonable guess at the workload.

http://daslab.seas.harvard.edu/rum-conjecture/

Data structures for similar problems will differ in when they do a piece of work. A binary tree sorts a little at a time as inserts happen. An unsorted array is faster to insert into but it's unsorted: at the end, to "finalize", you need to do the sorting all at once.

When writing a package to be used by others, avoid the temptation to optimize upfront for every single use case. This will result in unreadable code. Data structures by design are effectively single-purpose. You can neither read minds nor predict the future. If a user says "Your package is too slow for this use case", a reasonable answer might be "Then use this other package over here". A package should "do one thing well".

Sometimes hybrid data structures will provide the performance improvement you need. For example, by bucketing your data you can limit your search to a single bucket. This still pays the theoretical cost of O(n), but the constant will be smaller. We'll revisit these kinds of tweaks when we get to program tuning.

Two things that people forget when discussing big-O notation:

One, there's a constant factor involved. Two algorithms which have the same algorithmic complexity can have different constant factors. Imagine looping over a list 100 times vs just looping over it once. Even though both are O(n), one has a constant factor that's 100 times higher.

These constant factors are why, even though merge sort, quicksort, and heapsort are all O(n log n), everybody uses quicksort because it's the fastest. It has the smallest constant factor.

The second thing is that big-O only says "as n grows to infinity". It talks about the growth trend, "As the numbers get big, this is the growth factor that will dominate the run time." It says nothing about the actual performance, or how it behaves with small n.

There's frequently a cut-off point below which a dumber algorithm is faster. A nice example comes from the Go standard library's sort package. Most of the time it's using quicksort, but it has a shell-sort pass then insertion sort when the partition size drops below 12 elements.
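The cut-off idea can be sketched as a quicksort that hands small partitions to insertion sort. This is not the Go sort package's implementation: it is a simplified Python illustration with a random pivot and no shell-sort pass, and only the threshold of 12 is taken from the text above.

```python
import random

CUTOFF = 12  # partitions this small or smaller go to insertion sort


def insertion_sort(a, lo, hi):
    """Sort a[lo:hi] in place; quadratic, but cheap for tiny ranges."""
    for i in range(lo + 1, hi):
        x = a[i]
        j = i - 1
        while j >= lo and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x


def hybrid_sort(a, lo=0, hi=None):
    """Quicksort a[lo:hi] in place, switching algorithms below CUTOFF."""
    if hi is None:
        hi = len(a)
    while hi - lo > CUTOFF:
        # Partition around a randomly chosen pivot (Lomuto scheme).
        p = random.randrange(lo, hi)
        a[p], a[hi - 1] = a[hi - 1], a[p]
        pivot = a[hi - 1]
        store = lo
        for i in range(lo, hi - 1):
            if a[i] < pivot:
                a[i], a[store] = a[store], a[i]
                store += 1
        a[store], a[hi - 1] = a[hi - 1], a[store]
        # Recurse into the smaller side, keep looping on the larger one.
        if store - lo < hi - store - 1:
            hybrid_sort(a, lo, store)
            lo = store + 1
        else:
            hybrid_sort(a, store + 1, hi)
            hi = store
    insertion_sort(a, lo, hi)


values = [5, 2, 9, 1, 5, 6, 0, 3, 8, 7, 4, 11, 10, 13, 12]
hybrid_sort(values)
print(values)
```

The right threshold depends on the language, the element type, and the hardware, so it should come from benchmarks on representative inputs rather than from this sketch.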

For some algorithms, the constant factor might be so large that this cut-off point may be larger than all reasonable inputs. That is, the O(n^2) algorithm is faster than the O(n) algorithm for all inputs that you're ever likely to deal with.

This also means you need to know representative input sizes, both for choosing the most appropriate algorithm and for writing good benchmarks. 10 items? 1000 items? 1000000 items?

This also goes the other way: For example, choosing to use a more complicated data structure to give you O(n) scaling instead of O(n^2), even though the benchmarks for small inputs got slower. This also applies to most lock-free data structures. They're generally slower in the single-threaded case but more scalable when many threads are using it.

The memory hierarchy in modern computers confuses the issue here a little bit, in that caches prefer the predictable access of scanning a slice to the effectively random access of chasing a pointer. Still, it's best to begin with a good algorithm. We will talk about this in the hardware-specific section.

TODO: extending last paragraph, mention O() notation is a model where each operation has fixed cost. That's a wrong assumption on modern hardware.

The fight may not always go to the strongest, nor the race to the fastest, but that's the way to bet. -- Rudyard Kipling

Sometimes the best algorithm for a particular problem is not a single algorithm, but a collection of algorithms specialized for slightly different input classes. This "polyalgorithm" quickly detects what kind of input it needs to deal with and then dispatches to the appropriate code path. This is what the sorting package mentioned above does: determine the problem size and choose a different algorithm. In addition to combining quicksort, shell sort, and insertion sort, it also tracks recursion depth of quicksort and calls heapsort if necessary. The string and bytes packages do something similar, detecting and specializing for different cases. As with data compression, the more you know about what your input looks like, the better your custom solution can be. Even if an optimization is not always applicable, complicating your code by determining that it's safe to use and executing different logic can be worth it.

This also applies to subproblems your algorithm needs to solve. For example, being able to use radix sort can have a significant impact on performance, or using quickselect if you only need a partial sort.

Sometimes rather than specialization for your particular task, the best approach is to abstract it into a more general problem space that has been well-studied by researchers. Then you can apply the more general solution to your specific problem. Mapping your problem into a domain that already has well-researched implementations can be a significant win.

Similarly, using a simpler algorithm means that its tradeoffs, analysis, and implementation details are more likely to be well studied and well understood than those of a more esoteric or complex one.

Simpler algorithms can also be faster. These two examples are not isolated cases https://go-review.googlesource.com/c/crypto/+/169037 https://go-review.googlesource.com/c/go/+/170322/

TODO: notes on algorithm selection

TODO: improve worst-case behaviour at slight cost to average runtime linear-time regexp matching

While most algorithms are deterministic, there is a class of algorithms that use randomness as a way to simplify an otherwise complex decision-making step. Instead of having code that does the Right Thing, you use randomness to select a probably-not-bad thing. For example, a treap is a probabilistically balanced binary tree. Each node has a key, but is also assigned a random value. When inserting into the tree, the normal binary tree insertion path is followed, but the nodes also obey the heap property based on each node's randomly assigned weight. This simpler approach replaces otherwise complicated tree rotating solutions (like AVL and Red Black trees) but still maintains a balanced tree with O(log n) insert/lookup "with high probability". Skip lists are another similar, simple data structure that uses randomness to produce "probably" O(log n) insertion and lookups.

Similarly, choosing a random pivot for quicksort can be simpler than a more complex median-of-medians approach to finding a good pivot, and the probability that bad pivots are continually (randomly) chosen and degrading quicksort's performance to O(n^2) is vanishingly small.
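As a rough sketch of the random-pivot idea, assuming a plain Lomuto partition (which is simple but still degrades on inputs with many equal keys):

```go
package main

import (
	"fmt"
	"math/rand"
)

// quicksort sorts a in place, choosing each pivot uniformly at random.
func quicksort(a []int) {
	for len(a) > 1 {
		// Move a randomly chosen pivot to the end, then partition around it.
		p := rand.Intn(len(a))
		last := len(a) - 1
		a[p], a[last] = a[last], a[p]
		pivot := a[last]
		store := 0
		for i := 0; i < last; i++ {
			if a[i] < pivot {
				a[i], a[store] = a[store], a[i]
				store++
			}
		}
		a[store], a[last] = a[last], a[store]

		// Recurse into the smaller side and loop on the larger one,
		// which keeps the stack depth logarithmic.
		left, right := a[:store], a[store+1:]
		if len(left) < len(right) {
			quicksort(left)
			a = right
		} else {
			quicksort(right)
			a = left
		}
	}
}

func main() {
	// Already-sorted input is the classic bad case for a fixed pivot choice.
	data := make([]int, 1000)
	for i := range data {
		data[i] = i
	}
	quicksort(data)
	fmt.Println(data[0], data[len(data)-1])
}
```

The only "clever" line is the call to rand.Intn; everything else is the textbook algorithm.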

Randomized algorithms are classed as either "Monte Carlo" algorithms or "Las Vegas" algorithms, after two well known gambling locations. A Monte Carlo algorithm gambles with correctness: it might output a wrong answer (or in the case of the above, an unbalanced binary tree). A Las Vegas algorithm always outputs a correct answer, but might take a very long time to terminate.

Another well-known example of a randomized algorithm is the Miller-Rabin primality testing algorithm. Each iteration will output either "not prime" or "maybe prime". While "not prime" is certain, the "maybe prime" is correct with probability at least 1/2. That is, there are non-primes for which "maybe prime" will still be output. By running many iterations of Miller-Rabin, we can make the probability of failure (that is, outputting "maybe prime" for a composite number) as small as we'd like. If it passes 200 iterations, then we can say the number is composite with probability at most 1/(2^200).
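In Go you rarely need to write this yourself. A small sketch using math/big, whose ProbablyPrime method takes the number of Miller-Rabin rounds as its argument (the standard library also adds a Baillie-PSW test on top, so this is not a pure Miller-Rabin run). The wrapper name probablyPrime is just for illustration.

```go
package main

import (
	"fmt"
	"math/big"
)

// probablyPrime reports whether n survives the given number of
// Miller-Rabin rounds. More rounds lower the chance of a false "prime".
func probablyPrime(n int64, rounds int) bool {
	return big.NewInt(n).ProbablyPrime(rounds)
}

func main() {
	// 561 is a Carmichael number: composite, but it fools the simpler Fermat test.
	for _, n := range []int64{97, 561, 2305843009213693951} {
		fmt.Println(n, probablyPrime(n, 20))
	}
}
```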

Another area where randomness plays a part is called "The power of two random choices". While initially the research was applied to load balancing, it turned out to be widely applicable to a number of selection problems. The idea is that rather than trying to find the best selection out of a group of items, pick two at random and select the best from that. Returning to load balancing (or hash table chains), the power of two random choices reduces the expected maximum load (or hash chain length) from O(log n) items to O(log log n) items. For more information, see The Power of Two Random Choices: A Survey of Techniques and Results.
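A small simulation sketch of the idea, throwing balls into bins. The function names and the seed are assumptions for illustration; the exact numbers printed depend on the random source, so none are quoted here.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickOne returns a uniformly random bin.
func pickOne(loads []int, rng *rand.Rand) int {
	return rng.Intn(len(loads))
}

// pickBestOfTwo samples two bins at random and returns the less loaded one.
func pickBestOfTwo(loads []int, rng *rand.Rand) int {
	a, b := rng.Intn(len(loads)), rng.Intn(len(loads))
	if loads[b] < loads[a] {
		return b
	}
	return a
}

// maxLoad throws balls into bins using pick and returns the load of the
// fullest bin.
func maxLoad(bins, balls int, seed int64, pick func([]int, *rand.Rand) int) int {
	rng := rand.New(rand.NewSource(seed))
	loads := make([]int, bins)
	worst := 0
	for i := 0; i < balls; i++ {
		j := pick(loads, rng)
		loads[j]++
		if loads[j] > worst {
			worst = loads[j]
		}
	}
	return worst
}

func main() {
	const n = 100000
	fmt.Println("one choice: ", maxLoad(n, n, 1, pickOne))
	fmt.Println("two choices:", maxLoad(n, n, 1, pickBestOfTwo))
}
```

Note that the two-choice version never scans all the bins; it only needs the load of the two candidates.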

randomized algorithms: other caching algorithms statistical approximations (frequently depend on sample size and not population size)

TODO: batching to reduce overhead: https://lemire.me/blog/2018/04/17/iterating-in-batches-over-data-structures-can-be-much-faster/

TODO: - Algorithm Design Manual: http://algorist.com/algorist.html - How To Solve It By Computer - to what extent is this a "how to write algorithms" book? If you're going to change the code to speed it up, by definition you're writing new algorithms. Soo... maybe?

Benchmark Inputs

Real-world inputs rarely match the theoretical "worst case". Benchmarking is vital to understanding how your system behaves in production.

You need to know what class of inputs your system will be seeing once deployed, and your benchmarks must use instances pulled from that same distribution. As we've seen, different algorithms make sense at different input sizes. If your expected input range is <100, then your benchmarks should reflect that. Otherwise, choosing an algorithm which is optimal for n=10^6 might not be the fastest.

Be able to generate representative test data. Different distributions of data can provoke different behaviours in your algorithm: think of the classic "quicksort is O(n^2) when the data is sorted" example. Similarly, interpolation search is O(log log n) for uniform random data, but O(n) worst case. Knowing what your inputs look like is the key to both representative benchmarks and for choosing the best algorithm. If the data you're using to test isn't representative of real workloads, you can easily end up optimizing for one particular data set, "overfitting" your code to work best with one specific set of inputs.

This also means your benchmark data needs to be representative of the real world. Using purely randomized inputs may skew the behaviour of your algorithm. Caching and compression algorithms both exploit skewed distributions not present in random data and so will perform worse, while a binary tree will perform better with random values as they will tend to keep the tree balanced. (This is the idea behind a treap, by the way.)

On the other hand, consider the case of testing a system with a cache. If your benchmark input consists of only a single query, then every request will hit the cache, giving a potentially very unrealistic view of how the system will behave in the real world with a more varied request pattern.

Also, note that some issues that are not apparent on your laptop might be visible once you deploy to production and are hitting 250k reqs/second on a 40 core server. Similarly, the behaviour of the garbage collector during benchmarking can misrepresent real-world impact. There are (rare) cases where a microbenchmark will show a slow-down, but real-world performance improves. Microbenchmarks can help nudge you in the right direction but being able to fully test the impact of a change across the entire system is best.

Writing good benchmarks can be difficult.

Use geometric mean to compare groups of benchmarks.
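A minimal sketch of a geometric mean over benchmark ratios, computed through logarithms so that a long list of ratios does not overflow. It assumes every input is strictly positive; the function name is illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// geoMean returns the geometric mean of strictly positive values.
func geoMean(xs []float64) float64 {
	if len(xs) == 0 {
		return math.NaN()
	}
	sum := 0.0
	for _, x := range xs {
		sum += math.Log(x)
	}
	return math.Exp(sum / float64(len(xs)))
}

func main() {
	// new/old time ratios for three benchmarks: one twice as slow,
	// one twice as fast, one unchanged.
	ratios := []float64{2.0, 0.5, 1.0}
	fmt.Printf("geometric mean: %.3f\n", geoMean(ratios))
}
```

Unlike the arithmetic mean, a 2x slowdown and a 2x speedup cancel out, which is usually what you want when summarizing ratios.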

Evaluating Benchmark Accuracy:

Program Tuning

Program tuning used to be an art form, but then compilers got better. So now it turns out that compilers can optimize straight-forward code better than complicated code. The Go compiler still has a long way to go to match gcc and clang, but it does mean that you need to be careful when tuning and especially when upgrading Go versions that your code doesn't become "worse". There are definitely cases where tweaks to work around the lack of a particular compiler optimization became slower once the compiler was improved.

My RC6 cipher implementation had a 10% speed up for the inner loop just by switching to encoding/binary and math/bits instead of my hand-rolled versions.

Similarly, the compress/bzip2 package was sped up by switching to simpler code the compiler was better able to optimize.

If you are working around a specific runtime or compiler code generation issue, always document your change with a link to the upstream issue. This will allow you to quickly revisit your optimization once the bug is fixed.

Fight the temptation to cargo cult folklore-based "performance tips", or even over-generalize from your own experience. Each performance bug needs to be approached on its own merits. Even if something has worked previously, make sure to profile to ensure the fix is still applicable. Your previous work can guide you, but don't apply previous optimizations blindly.

Program tuning is an iterative process. Keep revisiting your code and seeing what changes can be made. Ensure you're making progress at each step. Frequently one improvement will enable others to be made. (Now that I'm not doing A, I can simplify B by doing C instead.) This means you need to keep looking at the entire picture and not get too obsessed with one small set of lines.

Once you've settled on the right algorithm, program tuning is the process of improving the implementation of that algorithm. In Big-O notation, this is the process of reducing the constants associated with your program.

All program tuning is either making a slow thing fast, or doing a slow thing fewer times. Algorithmic changes also fall into these categories, but we're going to be looking at smaller changes. Exactly how you do this varies as technologies change.

Making a slow thing fast might be replacing SHA1 or hash/fnv with a faster hash function. Doing a slow thing fewer times might be saving the result of the hash calculation of a large file so you don't have to do it multiple times.

Keep comments. If something doesn't need to be done, explain why. Frequently when optimizing an algorithm you'll discover steps that don't need to be performed under some circumstances. Document them. Somebody else might think it's a bug and needs to be put back.

An empty program gives the wrong answer in no time at all.

It's easy to be fast if you don't have to be correct.

"Correctness" can depend on the problem. Heuristic algorithms that are mostly-right most of the time can be fast, as can algorithms which guess and improve allowing you to stop when you hit an acceptable limit.

Cache common cases:

We're all familiar with memcache, but there are also in-process caches. Using an in-process cache saves the cost of both the network call and the cost of serialization. On the other hand, this increases GC pressure as there is more memory to keep track of. You also need to consider eviction strategies, cache invalidation, and thread-safety. An external cache will generally handle eviction for you, but cache invalidation remains a problem. Thread-safety can also be an issue with external caches as it becomes effectively shared mutable state either between different goroutines in the same service or even different service instances if the external cache is shared.

A cache saves information you've just spent time computing in the hopes that you'll be able to reuse it again soon and save the computation time. A cache doesn't need to be complex. Even storing a single item -- the most recently seen query/response -- can be a big win, as seen in the time.Parse() example below.

With caches it's important to compare the cost (in terms of actual wall-clock and code complexity) of your caching logic to simply refetching or recomputing the data. The more complex algorithms that give higher hit rates are generally not cheap themselves. Randomized cache eviction is simple and fast and can be effective in many cases. Similarly, randomized cache insertion can limit your cache to only popular items with minimal logic. While these may not be as effective as the more complex algorithms, the big improvement will be adding a cache in the first place: choosing exactly which caching algorithm gives only minor improvements.
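To show how little logic randomized eviction needs, here is a minimal sketch of a fixed-capacity cache that evicts a random entry when full. The type and method names are made up for illustration, and the sketch is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomCache is a fixed-capacity cache with random eviction.
type randomCache struct {
	capacity int
	values   map[string]string
	keys     []string       // dense key list, so a victim can be picked in O(1)
	index    map[string]int // position of each key in keys
}

func newRandomCache(capacity int) *randomCache {
	if capacity < 1 {
		capacity = 1
	}
	return &randomCache{
		capacity: capacity,
		values:   make(map[string]string),
		index:    make(map[string]int),
	}
}

func (c *randomCache) Get(k string) (string, bool) {
	v, ok := c.values[k]
	return v, ok
}

func (c *randomCache) Put(k, v string) {
	if _, ok := c.values[k]; ok {
		c.values[k] = v
		return
	}
	if len(c.keys) >= c.capacity {
		// Pick a random victim, move the last key into its slot, shrink.
		i := rand.Intn(len(c.keys))
		victim := c.keys[i]
		last := len(c.keys) - 1
		c.keys[i] = c.keys[last]
		c.index[c.keys[i]] = i
		c.keys = c.keys[:last]
		delete(c.values, victim)
		delete(c.index, victim)
	}
	c.index[k] = len(c.keys)
	c.keys = append(c.keys, k)
	c.values[k] = v
}

func main() {
	c := newRandomCache(2)
	c.Put("a", "1")
	c.Put("b", "2")
	c.Put("c", "3") // evicts either "a" or "b"
	_, ok := c.Get("c")
	fmt.Println(len(c.values), ok)
}
```

There is no recency or frequency bookkeeping on the Get path at all, which is where the speed comes from.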

It's important to benchmark your choice of cache eviction algorithm with real-world traces. If in the real world repeated requests are sufficiently rare, it can be more expensive to keep cached responses around than to simply recompute them when needed. I've had services where testing with production data showed even an optimal cache wasn't worth it. We simply didn't have sufficient repeated requests to make the added complexity of a cache make sense.

Your expected cache hit ratio is important. You'll want to export the ratio to your monitoring stack. Changing ratios will show a shift in traffic. Then it's time to revisit the cache size or the expiration policy.

A large cache can increase GC pressure. At the extreme (little or no eviction, caching all requests to an expensive function) this can turn into memoization.

Program tuning:

Program tuning is the art of iteratively improving a program in small steps. Egon Elbre lays out his procedure:

  • Come up with a hypothesis as to why your program is slow.
  • Come up with N solutions to solve it
  • Try them all and keep the fastest.
  • Keep the second fastest just in case.
  • Repeat.

Tunings can take many forms.

  • If possible, keep the old implementation around for testing.
  • If not possible, generate sufficient golden test cases to compare output to.
  • "Sufficient" means including edge cases, as those are the ones likely to get affected by tuning as you aim to improve performance in the general case.
  • Exploit a mathematical identity:
    • Note that implementing and optimizing numerical calculations is almost its own field
    • "pay only for what you use, not what you could have used"
      • zero only part of an array, rather than the whole thing
    • best done in tiny steps, a few statements at a time
    • cheap checks before more expensive checks:
      • e.g., strcmp before regexp, (q.v., bloom filter before query) "do expensive things fewer times"
    • common cases before rare cases i.e., avoid extra tests that always fail
    • unrolling still effective: https://play.golang.org/p/6tnySwNxG6O
      • code size. vs branch test overhead
    • using offsets instead of slice assignment can help with bounds checks, data dependencies, and code gen (less to copy in inner loop).
    • remove bounds checks and nil checks from loops: https://go-review.googlesource.com/c/go/+/151158
    • other tricks for the prove pass
    • this is where pieces of Hacker's Delight fall

Many folklore performance tips for tuning rely on poorly optimizing compilers and encourage the programmer to do these transformations by hand. Compilers have been using shifts instead of multiplying or dividing by a power of two for 15 years now -- nobody should be doing that by hand. Similarly, hoisting invariant calculations out of loops, basic loop unrolling, common sub-expression elimination and many others are all done automatically by gcc and clang and the like. Go's compiler does many of these and continues to improve. As always, benchmark before committing to the new version.

The transformations the compiler can't do rely on you knowing things about the algorithm, about your input data, about invariants in your system, and other assumptions you can make, and factoring that implicit knowledge into removing or altering steps in the data structure.

Every optimization codifies an assumption about your data. These must be documented and, even better, tested for. These assumptions are going to be where your program crashes, slows down, or starts returning incorrect data as the system evolves.

Program tuning improvements are cumulative. Five 3% improvements add up to roughly a 15% improvement. When making optimizations, it's worth it to think about the expected performance improvement. Replacing a hash function with a faster one is a constant factor improvement.

Understanding your requirements and where they can be altered can lead to performance improvements. One issue that was presented in the #performance Gophers Slack channel was the amount of time that was spent creating a unique identifier for a map of string key/value pairs. The original solution was to extract the keys, sort them, and pass the resulting string to a hash function. The improved solution we came up with was to individually hash the keys/values as they were added to the map, then xor all these hashes together to create the identifier.
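A rough sketch of that xor approach. The wrapper type, the choice of FNV-1a, and the zero-byte separator are all assumptions for illustration; the original Slack discussion is not reproduced here. The sketch assumes keys never contain a zero byte.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pairHash hashes one key/value pair. The zero byte keeps ("ab", "c")
// distinct from ("a", "bc").
func pairHash(k, v string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(k))
	h.Write([]byte{0})
	h.Write([]byte(v))
	return h.Sum64()
}

// identifiedMap maintains an order-independent identifier incrementally.
type identifiedMap struct {
	m  map[string]string
	id uint64
}

func newIdentifiedMap() *identifiedMap {
	return &identifiedMap{m: make(map[string]string)}
}

func (im *identifiedMap) Set(k, v string) {
	if old, ok := im.m[k]; ok {
		im.id ^= pairHash(k, old) // xor is its own inverse: remove the old pair
	}
	im.m[k] = v
	im.id ^= pairHash(k, v)
}

func (im *identifiedMap) Delete(k string) {
	if old, ok := im.m[k]; ok {
		im.id ^= pairHash(k, old)
		delete(im.m, k)
	}
}

// ID is O(1): no key extraction, no sorting.
func (im *identifiedMap) ID() uint64 { return im.id }

func main() {
	a, b := newIdentifiedMap(), newIdentifiedMap()
	a.Set("host", "db1")
	a.Set("region", "eu")
	b.Set("region", "eu")
	b.Set("host", "db1")
	fmt.Println(a.ID() == b.ID())
}
```

Because xor is commutative, insertion order does not matter, which is what made the sort unnecessary. The tradeoff is that xor is a weaker combiner than hashing the sorted sequence, so this only fits if the requirements allow it.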

Here's an example of specialization.

Let's say we're processing a massive log file for a single day, and each line begins with a time stamp.

Sun  4 Mar 2018 14:35:09 PST <...........................>

For each line, we're going to call time.Parse() to turn it into an epoch. If profiling shows us time.Parse() is the bottleneck, we have a few options to speed things up.

The easiest is to keep a single-item cache of the previously seen time stamp and the associated epoch. As long as our log file has multiple lines for a single second, this will be a win. For the case of a 10 million line log file, this strategy reduces the number of expensive calls to time.Parse() from 10,000,000 to at most 86,400 -- one for each unique second.
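One possible shape for such a single-item cache, as a sketch. The layout string is my reading of the sample line above, and the type and field names are made up. Note that time.Parse cannot resolve an abbreviation like "PST" to an offset when parsing in UTC, so real code would probably want time.ParseInLocation.

```go
package main

import (
	"fmt"
	"time"
)

// layout is assumed to match stamps such as "Sun  4 Mar 2018 14:35:09 PST".
const layout = "Mon _2 Jan 2006 15:04:05 MST"

// stampLen is the fixed width of the leading time stamp.
const stampLen = len("Sun  4 Mar 2018 14:35:09 PST")

// stampCache remembers the most recently parsed time stamp.
type stampCache struct {
	last   string // text of the previous time stamp
	epoch  int64  // its parsed value
	misses int    // number of real time.Parse calls, for illustration
}

func (c *stampCache) parse(line string) (int64, error) {
	if len(line) < stampLen {
		return 0, fmt.Errorf("line too short: %q", line)
	}
	stamp := line[:stampLen]
	if stamp == c.last {
		return c.epoch, nil // same second as the previous line
	}
	c.misses++
	t, err := time.Parse(layout, stamp)
	if err != nil {
		return 0, err
	}
	c.last, c.epoch = stamp, t.Unix()
	return c.epoch, nil
}

func main() {
	lines := []string{
		"Sun  4 Mar 2018 14:35:09 PST request A",
		"Sun  4 Mar 2018 14:35:09 PST request B",
		"Sun  4 Mar 2018 14:35:10 PST request C",
	}
	var c stampCache
	for _, l := range lines {
		epoch, err := c.parse(l)
		fmt.Println(epoch, err)
	}
	fmt.Println("time.Parse calls:", c.misses)
}
```

The hit path is a single string comparison, which is why this works even though the cache holds only one entry.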

TODO: code example for single-item cache

Can we do more? Because we know exactly what format the timestamps are in and that they all fall in a single day, we can write custom time parsing logic that takes this into account. We can calculate the epoch for midnight, then extract hour, minute, and second from the timestamp string -- they'll all be in fixed offsets in the string -- and do some integer math.
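A sketch of the fixed-offset version. The byte offsets (16, 19, 22) are derived from the sample line above and are an assumption about the format, as are the function names. It also assumes the UTC offset does not change within the day.

```go
package main

import (
	"fmt"
	"time"
)

const layout = "Mon _2 Jan 2006 15:04:05 MST"
const stampLen = len("Sun  4 Mar 2018 14:35:09 PST")

// slowEpoch is the general-purpose baseline built on time.Parse.
func slowEpoch(line string) (int64, error) {
	if len(line) < stampLen {
		return 0, fmt.Errorf("line too short: %q", line)
	}
	t, err := time.Parse(layout, line[:stampLen])
	if err != nil {
		return 0, err
	}
	return t.Unix(), nil
}

// midnightEpoch parses one line the slow way and returns the epoch of
// 00:00:00 on that day. It is called once per log file.
func midnightEpoch(line string) (int64, error) {
	if len(line) < stampLen {
		return 0, fmt.Errorf("line too short: %q", line)
	}
	t, err := time.Parse(layout, line[:stampLen])
	if err != nil {
		return 0, err
	}
	y, m, d := t.Date()
	return time.Date(y, m, d, 0, 0, 0, 0, t.Location()).Unix(), nil
}

// twoDigits decodes the two ASCII digits at s[i] and s[i+1].
func twoDigits(s string, i int) (int64, bool) {
	a, b := s[i]-'0', s[i+1]-'0'
	if a > 9 || b > 9 {
		return 0, false
	}
	return int64(a)*10 + int64(b), true
}

// fastEpoch reads hh:mm:ss from fixed offsets and adds them to midnight.
func fastEpoch(midnight int64, line string) (int64, bool) {
	if len(line) < stampLen || line[18] != ':' || line[21] != ':' {
		return 0, false
	}
	h, ok1 := twoDigits(line, 16)
	m, ok2 := twoDigits(line, 19)
	s, ok3 := twoDigits(line, 22)
	if !ok1 || !ok2 || !ok3 {
		return 0, false
	}
	return midnight + h*3600 + m*60 + s, true
}

func main() {
	line := "Sun  4 Mar 2018 14:35:09 PST request A"
	midnight, err := midnightEpoch(line)
	if err != nil {
		panic(err)
	}
	fast, ok := fastEpoch(midnight, line)
	slow, _ := slowEpoch(line)
	fmt.Println(fast == slow, ok)
}
```

Everything the general parser has to discover at run time (field positions, month names, the date itself) is baked in here as an assumption.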

TODO: code example for string offset version

In my benchmarks, this reduced the time parsing from 275ns/op to 5ns/op. (Of course, even at 275 ns/op, you're more likely to be blocked on I/O and not CPU for time parsing.)

The general algorithm is slow because it has to handle more cases. Your algorithm can be faster because you know more about your problem. But the code is more closely tied to exactly what you need. It's much more difficult to update if the time format changes.

Optimization is specialization, and specialized code is more fragile to change than general purpose code.

The standard library implementations need to be "fast enough" for most cases. If you have higher performance needs you will probably need specialized implementations.

Profile regularly to track the performance characteristics of your system and be prepared to re-optimize as your traffic changes. Know the limits of your system and have good metrics that allow you to predict when you will hit those limits.

When the usage of your application changes, different pieces may become hotspots. Revisit previous optimizations and decide if they're still worth it, and revert to more readable code when possible. I had one system where I had optimized process startup time with a complex set of mmap, reflect, and unsafe. Once we changed how the system was deployed, this code was no longer required and I replaced it with much more readable regular file operations.

TODO(dgryski): hash function work should fall here; manually inlining, removing structs, unrolling loops, removing bounds checks

Optimization workflow summary

All optimizations should follow these steps:

  1. determine your performance goals and confirm you are not meeting them
  2. profile to identify the areas to improve.
    • This can be CPU, heap allocations, or goroutine blocking.
  3. benchmark to determine the speed up your solution will provide using the built-in benchmarking framework (http://golang.org/pkg/testing/)
    • Make sure you're benchmarking the right thing on your target operating system and architecture.
  4. profile again afterwards to verify the issue is gone
  5. use https://godoc.org/golang.org/x/perf/benchstat or https://github.com/codahale/tinystat to verify that a set of timings are 'sufficiently' different for an optimization to be worth the added code complexity.
  6. use https://github.com/tsenart/vegeta for load testing http services (+ other fancy ones: k6, fortio, fbender)
    • if possible, test ramp-up/ramp-down in addition to steady-state load
  7. make sure your latency numbers make sense

TODO: mention github.com/aclements/perflock as cpu noise reduction tool

The first step is important. It tells you when and where to start optimizing. More importantly, it also tells you when to stop. Pretty much all optimizations add code complexity in exchange for speed. And you can always make code faster. It's a balancing act.

Garbage Collection

You pay for memory allocation more than once. The first is obviously when you allocate it. But you also pay every time the garbage collection runs.

Reduce/Reuse/Recycle. -- @bboreham

  • Stack vs. heap allocations
  • What causes heap allocations?
  • Understanding escape analysis (and the current limitation)
  • /debug/pprof/heap , and -base
  • API design to limit allocations:
    • allow passing in buffers so caller can reuse rather than forcing an allocation
    • you can even modify a slice in place carefully while you scan over it
    • passing in a struct could allow caller to stack allocate it
  • reducing pointers to reduce gc scan times
    • pointer-free slices
    • maps with both pointer-free keys and values
  • GOGC
  • buffer reuse (sync.Pool vs. custom via go-slab, etc.)
  • slicing vs. offset: pointer writes while GC is running need writebarrier: https://github.com/golang/go/commit/b85433975aedc2be2971093b6bbb0a7dc264c8fd
  • use error variables instead of errors.New() / fmt.Errorf() at call site (performance or style? interface requires pointer, so it escapes to heap anyway)
  • use structured errors to reduce allocation (pass struct value), create string at error printing time
  • size classes
  • beware pinning larger allocation with smaller substrings or slices
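The "allow passing in buffers" bullet above can be sketched as follows, in the style of strconv.AppendInt. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatPoint allocates a new string on every call.
func formatPoint(x, y int) string {
	return "(" + strconv.Itoa(x) + "," + strconv.Itoa(y) + ")"
}

// appendPoint appends the same text to dst and returns the extended
// slice, so the caller decides whether and when to allocate.
func appendPoint(dst []byte, x, y int) []byte {
	dst = append(dst, '(')
	dst = strconv.AppendInt(dst, int64(x), 10)
	dst = append(dst, ',')
	dst = strconv.AppendInt(dst, int64(y), 10)
	return append(dst, ')')
}

func main() {
	buf := make([]byte, 0, 64) // allocated once, reused for every point
	for i := 0; i < 3; i++ {
		buf = appendPoint(buf[:0], i, i*i)
		fmt.Println(string(buf))
	}
	fmt.Println(formatPoint(3, 9))
}
```

A caller who does not care can still pass nil as dst and get a freshly allocated slice back.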

Runtime and compiler

  • cost of calls via interfaces (indirect calls on the CPU level)
  • runtime.convT2E / runtime.convT2I
  • type assertions vs. type switches
  • defer
  • special-case map implementations for ints, strings
  • bounds check elimination
  • []byte <-> string copies, map optimizations
  • two-value range will copy an array, use the slice instead:
  • use string concatenation instead of fmt.Sprintf where possible; runtime has optimized routines for it
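The "two-value range will copy an array" bullet is easiest to see through its semantics rather than a benchmark. In this sketch the loop body overwrites the array during iteration; ranging over the array value does not see the writes because it iterates over a copy, while ranging over a slice of it does.

```go
package main

import "fmt"

// sumOverArray ranges over the array value. The range expression is
// evaluated (copied) once, so writes made inside the loop are not seen.
func sumOverArray() int {
	arr := [4]int{1, 2, 3, 4}
	sum := 0
	for i, v := range arr {
		if i == 0 {
			arr[1], arr[2], arr[3] = 0, 0, 0
		}
		sum += v
	}
	return sum
}

// sumOverSlice ranges over arr[:], which shares the array's storage,
// so no copy is made and the writes are visible.
func sumOverSlice() int {
	arr := [4]int{1, 2, 3, 4}
	sum := 0
	for i, v := range arr[:] {
		if i == 0 {
			arr[1], arr[2], arr[3] = 0, 0, 0
		}
		sum += v
	}
	return sum
}

func main() {
	fmt.Println(sumOverArray(), sumOverSlice()) // prints "10 1"
}
```

For a large array, that hidden copy is the cost the bullet is warning about.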

Unsafe

Common gotchas with the standard library

  • time.After() leaks until it fires; use t := time.NewTimer(d); t.Stop() / t.Reset(d)
  • Reusing HTTP connections...; ensure the body is drained (issue #?)
  • rand.Int() and friends are 1) mutex protected and 2) expensive to create
    • consider alternate random number generation (go-pcgr, xorshift)
  • binary.Read and binary.Write use reflection and are slow; do it by hand. (https://github.com/conformal/yubikey/commit/613e3b04ae2eeb78e6a19636b8ff8e9106d2e7bc)
  • use strconv instead of fmt if possible
  • Use strings.EqualFold(str1, str2) instead of strings.ToLower(str1) == strings.ToLower(str2) or strings.ToUpper(str1) == strings.ToUpper(str2) to efficiently compare strings if possible.
  • ...
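As an illustration of the binary.Read bullet above, here is a sketch of decoding a small fixed-size header by hand with the encoding/binary byte-order helpers. The header layout is hypothetical and exists only for this example.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// header is a made-up 8-byte big-endian wire format.
type header struct {
	Version uint16
	Flags   uint16
	Length  uint32
}

const headerSize = 8

// errShort is a package-level error value, so returning it does not allocate.
var errShort = errors.New("header: short buffer")

// decodeHeader reads the fields directly, with no reflection involved.
func decodeHeader(b []byte) (header, error) {
	if len(b) < headerSize {
		return header{}, errShort
	}
	return header{
		Version: binary.BigEndian.Uint16(b[0:2]),
		Flags:   binary.BigEndian.Uint16(b[2:4]),
		Length:  binary.BigEndian.Uint32(b[4:8]),
	}, nil
}

// encodeHeader appends the encoded header to dst.
func encodeHeader(dst []byte, h header) []byte {
	var tmp [headerSize]byte
	binary.BigEndian.PutUint16(tmp[0:2], h.Version)
	binary.BigEndian.PutUint16(tmp[2:4], h.Flags)
	binary.BigEndian.PutUint32(tmp[4:8], h.Length)
	return append(dst, tmp[:]...)
}

func main() {
	wire := encodeHeader(nil, header{Version: 1, Flags: 2, Length: 256})
	h, err := decodeHeader(wire)
	fmt.Println(h, err)
}
```

The cost is that the offsets now live in your code and must be kept in sync with the struct by hand.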

Alternate implementations

Popular replacements for standard library packages:

  • encoding/json -> ffjson, easyjson, jingo (only encoder), etc
  • net/http
    • fasthttp (but incompatible API, not RFC compliant in subtle ways)
    • httprouter (has other features besides speed; I've never actually seen routing in my profiles)
  • regexp -> ragel (or other regular expression package)
  • serialization
  • database/sql -> has tradeoffs that affect performance
    • look for drivers that don't use it: jackc/pgx, crawshaw sqlite, ...
  • gccgo (benchmark!), gollvm (WIP)
  • container/list: use a slice instead (almost always)

cgo

cgo is not go -- Rob Pike

  • Performance characteristics of cgo calls
  • Tricks to reduce the costs: batching
  • Rules on passing pointers between Go and C
  • syso files (race detector, dev.boringssl)

Advanced Techniques

Techniques specific to the architecture running the code

introduction to CPU caches

  • performance cliffs
  • building intuition around cache-lines: sizes, padding, alignment
  • OS tools to view cache-misses (perf)
  • maps vs. slices
  • SOA vs AOS layouts: row-major vs. column major; when you have an X, do you need another X or do you need a Y?
  • temporal and spatial locality: use what you have and what's nearby as much as possible
  • reducing pointer chasing
  • explicit memory prefetching; frequently ineffective; lack of intrinsics means function call overhead (removed from runtime)
  • make the first 64-bytes of your struct count

branch prediction

  • remove branches from inner loops: if a { for { } } else { for { } } instead of for { if a { } else { } }; benchmark, due to branch prediction
  • structure code to avoid the branch: if i % 2 == 0 { evens++ } else { odds++ } becomes counts[i & 1]++
  • "branch-free code": benchmark; not always faster, but frequently harder to read

TODO: ASCII class counts example, with benchmarks
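The even/odd fragment above, written out as two complete functions so they can be compared and benchmarked. The function names are illustrative, and the sketch makes no claim about which version is faster on your hardware.

```go
package main

import "fmt"

// countBranchy uses a data-dependent branch in the inner loop.
func countBranchy(xs []int) (evens, odds int) {
	for _, x := range xs {
		if x%2 == 0 {
			evens++
		} else {
			odds++
		}
	}
	return evens, odds
}

// countBranchFree uses the low bit as an index, so the loop body has no
// conditional jump that depends on the data.
func countBranchFree(xs []int) (evens, odds int) {
	var counts [2]int
	for _, x := range xs {
		counts[x&1]++
	}
	return counts[0], counts[1]
}

func main() {
	xs := []int{1, 2, 3, 4, 5, -6, -7}
	fmt.Println(countBranchy(xs))
	fmt.Println(countBranchFree(xs))
}
```

Both functions return the same counts, including for negative values, since the low bit of a two's complement integer is set exactly when the value is odd.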

sorting data can help improve performance via both cache locality and branch prediction, even taking into account the time it takes to sort

function call overhead: inliner is getting better

reduce data copies (including for repeated large lists of function params)

Comment about Jeff Dean's 2002 numbers (plus updates)

  • cpus have gotten faster, but memory hasn't kept up

TODO: little comment about code-alignment free optimization (or unoptimization)

Concurrency

  • Figure out which pieces can be done in parallel and which must be sequential
  • goroutines are cheap, but not free.
  • Optimizing multi-threaded code
    • false-sharing -> pad to cache-line size
    • true sharing -> sharding
  • Overlap with previous section on caches and false/true sharing
  • Lazy synchronization; it's expensive, so duplicating work may be cheaper
  • things you can control: number of workers, batch size

You need a mutex to protect shared mutable state. If you have lots of mutex contention, you need to either reduce the shared, or reduce the mutable. Two ways to reduce the shared are 1) shard the locks or 2) process independently and combine afterwards. To reduce mutable: well, make your data structure read-only. You can also reduce the time the data needs to be shared by reducing the critical section -- hold the lock as little as needed. Sometimes a RWMutex will be sufficient, although note that it is slower than a plain Mutex in exchange for allowing multiple readers in.

If you're sharding the locks, be careful of shared cache-lines. You'll need to pad to avoid cache-line bouncing between processors.

var stripe [8]struct{ sync.Mutex; _ [7]uint64 } // mutex is 64-bits; padding fills the rest of the cacheline
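Extending the one-liner above into a runnable sketch of a sharded counter. The 64-byte cache-line size, the stripe count, and the names are assumptions; the padding arithmetic relies on sync.Mutex being 8 bytes, as the comment above states.

```go
package main

import (
	"fmt"
	"sync"
)

const stripes = 8

// paddedCounter is sized to fill one assumed 64-byte cache line:
// 8 bytes of mutex, 8 bytes of count, 48 bytes of padding.
type paddedCounter struct {
	mu sync.Mutex
	n  uint64
	_  [6]uint64
}

// stripedCounter spreads contention over several independent locks.
type stripedCounter struct {
	shards [stripes]paddedCounter
}

// Add picks a shard from the key, so different keys rarely share a lock.
func (c *stripedCounter) Add(key, delta uint64) {
	s := &c.shards[key%stripes]
	s.mu.Lock()
	s.n += delta
	s.mu.Unlock()
}

// Total combines the shards afterwards.
func (c *stripedCounter) Total() uint64 {
	var total uint64
	for i := range c.shards {
		s := &c.shards[i]
		s.mu.Lock()
		total += s.n
		s.mu.Unlock()
	}
	return total
}

// hammer starts workers goroutines that each add 1 perWorker times.
func hammer(c *stripedCounter, workers, perWorker int) {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id uint64) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				c.Add(id, 1)
			}
		}(uint64(w))
	}
	wg.Wait()
}

func main() {
	var c stripedCounter
	hammer(&c, 16, 10000)
	fmt.Println(c.Total())
}
```

Without the padding, neighbouring shards would sit on the same cache line and the locks would still interfere with each other at the hardware level.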

Don't do anything expensive in your critical section if you can help it. This includes things like I/O (which are cheap but slow).

TODO: how to decompose problem for concurrency

TODO: reasons parallel implementation might be slower (communication overhead, best algorithm is sequential, ... )

Assembly

  • Stuff about writing assembly code for Go
  • compilers improve; the bar is high
  • replace as little as possible to make an impact; maintenance cost is high
  • good reasons: SIMD instructions or other things outside of what Go and the compiler can provide
  • very important to benchmark: improvements can be huge (10x for go-highway), zero (go-speck/rc6/farm32), or even slower (no inlining)
  • rebenchmark with new versions to see if you can delete your code yet
    • TODO: link to 1.11 patches removing asm code
  • always have pure-Go version (purego build tag): testing, arm, gccgo
  • brief intro to syntax
  • how to type the middle dot
  • calling convention: everything is on the stack, followed by the return values.
  • using opcodes unsupported by the asm (asm2plan9, but this is getting rarer)
  • notes about why inline assembly is hard: golang/go#26891
  • all the tooling to make this easier:
  • https://github.com/golang/go/wiki/AssemblyPolicy
  • Design of the Go Assembler: https://talks.golang.org/2016/asm.slide

Optimizing an entire service

Most of the time you won't be presented with a single CPU-bound routine. That's the easy case. If you have a service to optimize, you need to look at the entire system. Monitoring. Metrics. Log lots of things over time so you can see them getting worse and so you can see the impact your changes have in production.

tip.golang.org/doc/diagnostics.html

  • references for system design: SRE Book, practical distributed system design
  • extra tooling: more logging + analysis
  • The two basic rules: either speed up the slow things or do them less frequently.
  • distributed tracing to track bottlenecks at a higher level
  • query patterns for querying a single server instead of in bulk
  • your performance issues may not be your code, but you'll have to work around them anyway
  • https://docs.microsoft.com/en-us/azure/architecture/antipatterns/

Tooling

Introductory Profiling

This is a quick cheat-sheet for using the pprof tooling. There are plenty of other guides available on this. Check out https://github.com/davecheney/high-performance-go-workshop.

TODO(dgryski): videos?

  1. Introduction to pprof
  2. Writing and running (micro)benchmarks
    • small, like unit tests
    • profile, extract hot code to benchmark, optimize benchmark, profile.
    • -cpuprofile / -memprofile / -benchmem
    • 0.5 ns/op means it was optimized away -> how to avoid
    • tips for writing good microbenchmarks (remove unnecessary work, but add baselines)
  3. How to read pprof output
  4. What are the different pieces of the runtime that show up
    • malloc, gc workers
    • runtime._ExternalCode
  5. Macro-benchmarks (Profiling in production)
    • larger, like end-to-end tests
    • net/http/pprof, debug muxer
    • because it's sampling, hitting 10 servers at 100hz is the same as hitting 1 server at 1000hz
  6. Using -base to look at differences
  7. Memory options: -inuse_space, -inuse_objects, -alloc_space, -alloc_objects
  8. Profiling in production; localhost+ssh tunnels, auth headers, using curl.
  9. How to read flame graphs
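On the "0.5 ns/op means it was optimized away" point in the list above, one common workaround is to store the result in a package-level variable so the compiler cannot treat the work as dead code. A sketch, using testing.Benchmark so it runs as a plain program; in a real project the benchmark function would live in a _test.go file and be run with go test -bench. The names sink and iterations are illustrative.

```go
package main

import (
	"fmt"
	"math/bits"
	"testing"
)

// sink is package-level so the compiler cannot prove the result is unused.
var sink int

// BenchmarkPopcount measures bits.OnesCount64 over varying inputs.
func BenchmarkPopcount(b *testing.B) {
	total := 0
	for i := 0; i < b.N; i++ {
		total += bits.OnesCount64(uint64(i))
	}
	sink = total // publish the result once, outside the timed loop body
}

// iterations runs the benchmark and reports how many iterations it used.
func iterations() int {
	return testing.Benchmark(BenchmarkPopcount).N
}

func main() {
	r := testing.Benchmark(BenchmarkPopcount)
	fmt.Println(r)
	fmt.Println("iterations:", iterations() > 0)
}
```

Using the loop index as input also keeps the compiler from folding the call into a constant. If the reported time still looks implausibly small, check the generated assembly before trusting the number.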

Tracer

Look at some more interesting/advanced tooling

Appendix: Implementing Research Papers

Tips for implementing papers: (For algorithm read also data structure)

  • Don't. Start with the obvious solution and reasonable data structures.

"Modern" algorithms tend to have lower theoretical complexities but high constant factors and lots of implementation complexity. One of the classic examples of this is Fibonacci heaps. They're notoriously difficult to get right and have a huge constant factor. There have been a number of papers published comparing different heap implementations on different workloads, and in general the 4- or 8-ary implicit heaps consistently come out on top. And even in the cases where a Fibonacci heap should be faster (due to O(1) "decrease-key"), experiments with Dijkstra's shortest-path algorithm show it's faster when they use the straight heap removal and addition.

Similarly, treaps or skiplists vs. the more complex red-black or AVL trees. On modern hardware, the "slower" algorithm may be fast enough, or even faster.

The fastest algorithm can frequently be replaced by one that is almost as fast and much easier to understand.

-- Douglas W. Jones, University of Iowa

and

Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy.

Rule 4. Fancy algorithms are buggier than simple ones, and they're much harder to implement. Use simple algorithms as well as simple data structures. -- "Notes on C Programming" (Rob Pike, 1989)

The added complexity has to be enough that the payoff is actually worth it. Another example is cache eviction algorithms. Different algorithms can have much higher complexity for only a small improvement in hit ratio. Of course, you may not be able to test this until you have a working implementation and have integrated it into your program.

Sometimes the paper will have graphs, but much like the trend towards publishing only positive results, these will tend to be skewed in favour of showing how good the new algorithm is.

  • Choose the right paper.
  • Look for the paper their algorithm claims to beat and implement that.

Frequently, earlier papers will be easier to understand and necessarily have simpler algorithms.

Not all papers are good.

Look at the context the paper was written in. Determine assumptions about the hardware: disk space, memory usage, etc. Some older papers make different tradeoffs that were reasonable in the 70s or 80s but don't necessarily apply to your use case. For example, what they determine to be "reasonable" memory vs. disk usage tradeoffs. Memory sizes are now orders of magnitude larger, and SSDs have altered the latency penalty for using disk. Similarly, some streaming algorithms are designed for router hardware, which can make it a pain to translate into software.

Make sure the assumptions the algorithm makes about your data hold.

This will take some digging. You probably don't want to implement the first paper you find.

Make sure you understand the algorithm. This sounds obvious, but it will be impossible to debug otherwise.

https://blizzard.cs.uwaterloo.ca/keshav/home/Papers/data/07/paper-reading.pdf

A good understanding may allow you to extract the key idea from the paper and possibly apply just that to your problem, which may be simpler than reimplementing the entire thing.

The original paper for a data structure or algorithm isn't always the best. Later papers may have better explanations.

Some papers release reference source code which you can compare against, but

  1. academic code is almost universally terrible
  2. beware licensing restrictions ("research purposes only")
  3. beware bugs; edge cases, error checking, performance etc.


Contributing

This is a work-in-progress book on Go performance.

There are different ways to contribute:

  1. add to or summarize the resources in TODO
  2. add bullet points or new topics to be covered
  3. write prose and flesh out the sections in the book

Eventually sample programs to optimize and exercises will be needed (maybe).

Coordination will be done in the #performance channel on the Gophers slack.

Author: dgryski
Source Code: https://github.com/dgryski/go-perfbook/

Hoang Kim

5 Ways to Perform Sentiment Analysis in Python

Whether you look at Twitter, Goodreads, or Amazon, there is hardly a digital space that is not saturated with people's opinions. Organizations today need to dig into these opinions to gain insights about their products or services. However, this data exists in such astounding volumes that evaluating it manually is next to impossible. This is where another benefit of data science comes in: sentiment analysis. In this article, we explore what sentiment analysis covers and the various ways to implement it in Python.

What is Sentiment Analysis?

Sentiment analysis is a use case of Natural Language Processing (NLP) and falls under the category of text classification. Put simply, sentiment analysis involves classifying a text into sentiments such as positive or negative, or happy, sad, or neutral. The ultimate goal is to decipher the underlying mood, emotion, or sentiment of a text. This is also known as opinion mining.

Let us look at how a quick Google search defines sentiment analysis:

[Figure: definition of sentiment analysis]

Gaining Insights and Making Decisions with Sentiment Analysis

By now we have a rough idea of what sentiment analysis is. But why does it matter, and how do organizations benefit from it? Let us explore this with an example. Suppose you start a company that sells perfumes on an online platform. You offer a wide range of fragrances and soon customers start flooding in. After some time, you decide to change the pricing strategy: you plan to raise the prices of the popular fragrances and at the same time offer discounts on the unpopular ones. To determine which fragrances are popular, you start going through the customer reviews of all the fragrances. But you are stuck! There are so many reviews that you could not get through them all in a lifetime. This is where sentiment analysis can pull you out of the pit.

You simply gather all the reviews in one place and apply sentiment analysis to them. The following is a schematic representation of sentiment analysis on the reviews of three fragrances: Lavender, Rose, and Lemon. (Note that these reviews may contain incorrect spelling, grammar, and punctuation, as in real-world scenarios.)

[Figure: sentiment analysis of the fragrance reviews]

From these results, we can clearly see that:

Fragrance-1 (Lavender) has highly positive reviews from customers, which indicates that your company can raise its price given its popularity.

Fragrance-2 (Rose) has a neutral outlook among customers, which means your company should not change its price.

Fragrance-3 (Lemon) has an overall negative sentiment associated with it, so your company should consider offering a discount on it to balance the scales.

This was just a simple example of how sentiment analysis can help you gain insights into your products or services and help your organization make decisions.

Sentiment Analysis Use Cases

We just saw how sentiment analysis can give organizations insights that help them make data-driven decisions. Now let us look at some more use cases.

  1. Social media monitoring for brand management: Brands can use sentiment analysis to gauge how the public views them. For example, a company can gather all tweets that mention or tag the company and run sentiment analysis on them to learn its public outlook.
  2. Product/service analysis: Brands and organizations can run sentiment analysis on customer reviews to see how well a product or service is doing in the market and make future decisions accordingly.
  3. Stock price prediction: Predicting whether a company's stock will go up or down is crucial for investors. One input is sentiment analysis of the headlines of news articles that contain the company's name. If the headlines about a particular organization carry a positive sentiment, its stock price is expected to rise, and vice versa.

Ways to Perform Sentiment Analysis in Python

Python is one of the most powerful tools for data science tasks, and it offers many ways to perform sentiment analysis. The most popular ones are listed here:

  1. Using TextBlob
  2. Using VADER
  3. Using Bag of Words vectorization-based models
  4. Using LSTM-based models
  5. Using Transformer-based models

Let us go through them one by one.

Note: To demonstrate methods 3 and 4 (Bag of Words vectorization-based models and LSTM-based models), a labeled sentiment analysis dataset was used. It comprises more than 5,000 text excerpts labeled as positive, negative, or neutral. The dataset is available under a Creative Commons license.

Using TextBlob

TextBlob is a Python library for natural language processing. Using TextBlob for sentiment analysis is quite simple. It takes text as input and can return polarity and subjectivity as output.

Polarity determines the sentiment of the text. Its value lies in [-1, 1], where -1 denotes a highly negative sentiment and 1 denotes a highly positive sentiment.

Subjectivity determines whether a text input is factual information or a personal opinion. Its value lies in [0, 1], where a value closer to 0 denotes factual information and a value closer to 1 denotes a personal opinion.
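
A polarity value is often more useful once it is mapped to a label. The helper below is a small sketch of that step; the function name and the `neutral_band` cutoff of 0.1 are assumptions chosen for illustration and are not part of TextBlob.

```python
def polarity_label(polarity, neutral_band=0.1):
    """Map a polarity in [-1, 1] to a coarse sentiment label.

    neutral_band is a hypothetical cutoff: values within
    +/- neutral_band of zero are treated as neutral.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

for value in (1.0, -1.0, 0.05):
    print(value, polarity_label(value))
```

The right cutoff depends on the data; it is worth checking it against a handful of hand-labeled examples.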

Installation:

pip install textblob

Importing TextBlob:

from textblob import TextBlob

Code implementation for sentiment analysis using TextBlob:

Writing code for sentiment analysis with TextBlob is fairly simple. Just import the TextBlob class, wrap the text to be analyzed in it, and read the appropriate attributes as follows:

from textblob import TextBlob
text_1 = "The movie was so awesome."
text_2 = "The food here tastes terrible."
#Determining the Polarity
p_1 = TextBlob(text_1).sentiment.polarity
p_2 = TextBlob(text_2).sentiment.polarity
#Determining the Subjectivity
s_1 = TextBlob(text_1).sentiment.subjectivity
s_2 = TextBlob(text_2).sentiment.subjectivity
print("Polarity of Text 1 is", p_1)
print("Polarity of Text 2 is", p_2)
print("Subjectivity of Text 1 is", s_1)
print("Subjectivity of Text 2 is", s_2)

Output:

Polarity of Text 1 is 1.0 
Polarity of Text 2 is -1.0 
Subjectivity of Text 1 is 1.0 
Subjectivity of Text 2 is 1.0

Using VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analyzer that is tuned to social media text. Just like TextBlob, it is simple to use in Python. The code below shows its usage with an example.

Installation:

pip install vaderSentiment

Importing the SentimentIntensityAnalyzer class from VADER:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

Code for sentiment analysis using VADER:

First, we create an object of the SentimentIntensityAnalyzer class; then we pass the text to the object's polarity_scores() function as follows:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sentiment = SentimentIntensityAnalyzer()
text_1 = "The book was a perfect balance between wrtiting style and plot."
text_2 =  "The pizza tastes terrible."
sent_1 = sentiment.polarity_scores(text_1)
sent_2 = sentiment.polarity_scores(text_2)
print("Sentiment of text 1:", sent_1)
print("Sentiment of text 2:", sent_2)

Output:

Sentiment of text 1: {'neg': 0.0, 'neu': 0.73, 'pos': 0.27, 'compound': 0.5719} 
Sentiment of text 2: {'neg': 0.508, 'neu': 0.492, 'pos': 0.0, 'compound': -0.4767}

As we can see, polarity_scores() returns a dictionary of sentiment scores for the analyzed text.

Using Bag of Words Vectorization-Based Models

In the two approaches discussed so far, TextBlob and VADER, we simply used Python libraries to perform sentiment analysis. Now we discuss an approach in which we train our own model for the task. The steps involved in performing sentiment analysis with the Bag of Words vectorization method are as follows:

  1. Pre-process the text of the training data (text pre-processing involves normalization, tokenization, stop word removal, and stemming/lemmatization).
  2. Create a Bag of Words for the pre-processed text using count vectorization or TF-IDF vectorization.
  3. Train a suitable classification model on the processed data for sentiment classification.
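
To make step 2 concrete, here is a minimal pure-Python sketch of count vectorization. It is an illustration of the idea only, not how scikit-learn implements it; the function names and the two sample sentences are made up for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(documents):
    """Return (vocabulary, vectors) with one count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # One column per vocabulary term, holding its count in this document.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

docs = ["The movie was great, great fun.", "The food was terrible."]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

Word order is discarded: each document becomes a fixed-length vector of counts, which is what the classifier in step 3 is trained on.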

Code for sentiment analysis using the Bag of Words vectorization approach:

To build a sentiment analysis model with the Bag of Words vectorization approach, we need a labeled dataset. As stated earlier, the dataset used for this demonstration was obtained from Kaggle. We simply used sklearn's CountVectorizer to create the Bag of Words. We then trained a Multinomial Naive Bayes classifier, for which an accuracy score of 0.84 was obtained.

The dataset can be downloaded from Kaggle.

#Loading the dataset
import pandas as pd
data = pd.read_csv('Finance_data.csv')
#Pre-processing and Bag of Words vectorization using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(stop_words='english',ngram_range = (1,1),tokenizer = token.tokenize)
text_counts = cv.fit_transform(data['sentences'])
#Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, data['feedback'], test_size=0.25, random_state=5)
#Training the model
from sklearn.naive_bayes import MultinomialNB
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)
#Calculating the accuracy score of the model
from sklearn import metrics
predicted = MNB.predict(X_test)
accuracy_score = metrics.accuracy_score(Y_test, predicted)
print("Accuracy Score: ",accuracy_score)

Output:

Accuracy Score:  0.9111675126903553

The trained classifier can be used to predict the sentiment of any given text input.

Using LSTM-Based Models

Although we obtained a decent accuracy score with the Bag of Words vectorization method, it may not yield the same results on larger datasets. This motivates the use of deep learning models for training the sentiment analysis model.

For NLP tasks, we generally use RNN-based models since they are designed to handle sequential data. Here, we train an LSTM (Long Short-Term Memory) model using TensorFlow with Keras. The steps to perform sentiment analysis using LSTM-based models are as follows:

  1. Pre-process the text of the training data (text pre-processing involves normalization, tokenization, stop word removal, and stemming/lemmatization).
  2. Import Tokenizer from keras.preprocessing.text and create an object of it. Fit the tokenizer on the entire training text (so that it learns the vocabulary of the training data). Generate text embeddings using the tokenizer's texts_to_sequences() method and store them after padding them to an equal length. (Embeddings are numerical/vectorized representations of text. Since we cannot feed text directly to the model, we first need to convert it into embeddings.)
  3. After generating the embeddings, we are ready to build the model. We build the model using TensorFlow, adding Input, LSTM, and dense layers to it. Add dropout and tune the hyperparameters to get a decent accuracy score. Generally, we tend to use ReLU or LeakyReLU activation functions in the inner layers of LSTM models, as they avoid the vanishing gradient problem. At the output layer, we use a Softmax or Sigmoid activation function.
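
Step 2 can be illustrated without Keras. The sketch below mimics, in plain Python, what fitting a tokenizer, converting text to integer sequences, and left-padding with zeros do conceptually. The function names are hypothetical and the behavior is simplified (no filtering of punctuation, no vocabulary size limit).

```python
from collections import Counter

def fit_word_index(texts):
    """Assign integer ids to words, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace each known word by its integer id; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_left(sequences, maxlen=None):
    """Left-pad with 0 (id 0 is reserved for padding) to a common length."""
    maxlen = maxlen or max(len(seq) for seq in sequences)
    return [[0] * (maxlen - len(seq)) + seq[-maxlen:] for seq in sequences]

texts = ["good movie", "bad movie very bad"]
word_index = fit_word_index(texts)
sequences = to_sequences(texts, word_index)
padded = pad_left(sequences)
print(word_index)
print(padded)
```

The padded rows all have the same length, which is what lets them be stacked into a single array and fed to the embedding layer.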

Code for sentiment analysis using the LSTM-based model approach:

Here, we used the same dataset as in the Bag of Words approach. A training accuracy of 0.90 was obtained.

#Importing necessary libraries
import nltk
import pandas as pd
from textblob import Word
from nltk.corpus import stopwords
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Embedding, LSTM, LeakyReLU, SpatialDropout1D
from sklearn.model_selection import train_test_split
nltk.download('stopwords')  # needed once for the stop word list
nltk.download('wordnet')    # needed once for lemmatization
#Loading the dataset
data = pd.read_csv('Finance_data.csv')
#Pre-processing the text
def cleaning(df, stop_words):
    # Lowercasing
    df['sentences'] = df['sentences'].apply(lambda x: ' '.join(w.lower() for w in x.split()))
    # Removing digits/numbers
    df['sentences'] = df['sentences'].str.replace(r'\d+', '', regex=True)
    # Removing stop words
    df['sentences'] = df['sentences'].apply(lambda x: ' '.join(w for w in x.split() if w not in stop_words))
    # Lemmatization
    df['sentences'] = df['sentences'].apply(lambda x: ' '.join([Word(w).lemmatize() for w in x.split()]))
    return df
stop_words = stopwords.words('english')
data_cleaned = cleaning(data, stop_words)
#Converting the text to padded integer sequences using the tokenizer
tokenizer = Tokenizer(num_words=500, split=' ')
tokenizer.fit_on_texts(data_cleaned['sentences'].values)
X = tokenizer.texts_to_sequences(data_cleaned['sentences'].values)
X = pad_sequences(X)
#One-hot encoding the labels and splitting the data
#(assumes the labels are in the 'feedback' column, as in the Bag of Words example)
y = pd.get_dummies(data_cleaned['feedback']).values.astype('float32')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=5)
#Model Building
model = Sequential()
model.add(Embedding(500, 120, input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(704, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(352))
model.add(LeakyReLU())
model.add(Dense(3, activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])
print(model.summary())
#Model Training
model.fit(X_train, y_train, epochs = 20, batch_size=32, verbose =1)
#Model Testing
model.evaluate(X_test,y_test)

Using Transformer-Based Models

Transformer-based models are among the most advanced natural language processing techniques. They follow an encoder-decoder architecture and use self-attention to yield impressive results. Although one can always build a transformer model from scratch, it is quite a tedious task. Instead, we can use the pre-trained transformer models available on Hugging Face. Hugging Face is an open-source AI community that offers a multitude of pre-trained models for NLP applications. These models can be used as they are or fine-tuned for specific tasks.

Installation:

pip install transformers

Importing the transformers library:

import transformers

Code for sentiment analysis using Transformer-based models:

To perform any task with transformers, we first import the pipeline function from transformers. Then a pipeline object is created and the task to be performed is passed as an argument (sentiment analysis in our case). We can also specify the model to use for the task. Here, since we have not named a model, distilbert-base-uncased-finetuned-sst-2-english is used by default for sentiment analysis. The list of available tasks and models is in the Hugging Face documentation.

from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["It was the best of times.", "t was the worst of times."]
sentiment_pipeline(data)

Output:

[{'label': 'POSITIVE', 'score': 0.999457061290741},  {'label': 'NEGATIVE', 'score': 0.9987301230430603}]
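
The pipeline returns one dictionary per input, with a label and a confidence score. For downstream use it can be handy to fold both into a single signed number. The helper below is a sketch under that assumption; `signed_score` is a hypothetical name, and the sample results are the ones printed above.

```python
def signed_score(result):
    """Turn a {'label', 'score'} result into a value in [-1, 1].

    Assumes the label is either 'POSITIVE' or 'NEGATIVE', as in the
    output shown above; other label sets would need their own mapping.
    """
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]

results = [
    {"label": "POSITIVE", "score": 0.999457061290741},
    {"label": "NEGATIVE", "score": 0.9987301230430603},
]
signed = [signed_score(r) for r in results]
print(signed)
```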

Conclusion

In an era when users can express their views effortlessly and data is generated in vast amounts within fractions of a second, drawing insights from that data is vital for organizations to make efficient decisions, and sentiment analysis proves to be the missing piece of the puzzle.

We have now covered in some detail what sentiment analysis entails and the various methods one can use to perform it in Python. These were only rudimentary demonstrations, so do go ahead and experiment with the models and try them out on your own data.

Source: https://www.analyticsvidhya.com/blog/2022/07/sentiment-analysis-using-python/

Iara Simões

5 Ways to Perform Sentiment Analysis in Python

Whether you look at Twitter, Goodreads, or Amazon, there is hardly a digital space that is not saturated with people's opinions. Organizations today need to dig into these opinions to gain insights about their products or services. However, this data exists in such astounding volumes that evaluating it manually is next to impossible. This is where another benefit of data science comes in: sentiment analysis. In this article, we explore what sentiment analysis covers and the various ways to implement it in Python.

What is Sentiment Analysis?

Sentiment analysis is a use case of Natural Language Processing (NLP) and falls under the category of text classification. Put simply, sentiment analysis involves classifying a text into sentiments such as positive or negative, or happy, sad, or neutral. The ultimate goal is to decipher the underlying mood, emotion, or sentiment of a text. This is also known as opinion mining.

Let us look at how a quick Google search defines sentiment analysis:

[Figure: definition of sentiment analysis]

Gaining Insights and Making Decisions with Sentiment Analysis

By now we have a rough idea of what sentiment analysis is. But why does it matter, and how do organizations benefit from it? Let us explore this with an example. Suppose you start a company that sells perfumes on an online platform. You offer a wide range of fragrances and soon customers start flooding in. After some time, you decide to change the pricing strategy: you plan to raise the prices of the popular fragrances and at the same time offer discounts on the unpopular ones. To determine which fragrances are popular, you start going through the customer reviews of all the fragrances. But you are stuck! There are so many reviews that you could not get through them all in a lifetime. This is where sentiment analysis can pull you out of the pit.

You simply gather all the reviews in one place and apply sentiment analysis to them. The following is a schematic representation of sentiment analysis on the reviews of three fragrances: Lavender, Rose, and Lemon. (Note that these reviews may contain incorrect spelling, grammar, and punctuation, as in real-world scenarios.)

[Figure: sentiment analysis of the fragrance reviews]

From these results, we can clearly see that:

Fragrance-1 (Lavender) has highly positive reviews from customers, which indicates that your company can raise its price given its popularity.

Fragrance-2 (Rose) has a neutral outlook among customers, which means your company should not change its price.

Fragrance-3 (Lemon) has an overall negative sentiment associated with it, so your company should consider offering a discount on it to balance the scales.

This was just a simple example of how sentiment analysis can help you gain insights into your products or services and help your organization make decisions.

Sentiment Analysis Use Cases

We just saw how sentiment analysis can give organizations insights that help them make data-driven decisions. Now let us look at some more use cases.

  1. Social media monitoring for brand management: Brands can use sentiment analysis to gauge how the public views them. For example, a company can gather all tweets that mention or tag the company and run sentiment analysis on them to learn its public outlook.
  2. Product/service analysis: Brands and organizations can run sentiment analysis on customer reviews to see how well a product or service is doing in the market and make future decisions accordingly.
  3. Stock price prediction: Predicting whether a company's stock will go up or down is crucial for investors. One input is sentiment analysis of the headlines of news articles that contain the company's name. If the headlines about a particular organization carry a positive sentiment, its stock price is expected to rise, and vice versa.

Ways to Perform Sentiment Analysis in Python

Python is one of the most powerful tools for data science tasks, and it offers many ways to perform sentiment analysis. The most popular ones are listed here:

  1. Using TextBlob
  2. Using VADER
  3. Using Bag of Words vectorization-based models
  4. Using LSTM-based models
  5. Using Transformer-based models

Let us go through them one by one.

Note: To demonstrate methods 3 and 4 (Bag of Words vectorization-based models and LSTM-based models), a labeled sentiment analysis dataset was used. It comprises more than 5,000 text excerpts labeled as positive, negative, or neutral. The dataset is available under a Creative Commons license.

Using TextBlob

TextBlob is a Python library for natural language processing. Using TextBlob for sentiment analysis is quite simple. It takes text as input and can return polarity and subjectivity as output.

Polarity determines the sentiment of the text. Its value lies in [-1, 1], where -1 denotes a highly negative sentiment and 1 denotes a highly positive sentiment.

Subjectivity determines whether a text input is factual information or a personal opinion. Its value lies in [0, 1], where a value closer to 0 denotes factual information and a value closer to 1 denotes a personal opinion.

Installation:

pip install textblob

Importing TextBlob:

from textblob import TextBlob

Code implementation for sentiment analysis using TextBlob:

Writing code for sentiment analysis with TextBlob is fairly simple. Just import the TextBlob class, wrap the text to be analyzed in it, and read the appropriate attributes as follows:

from textblob import TextBlob
text_1 = "The movie was so awesome."
text_2 = "The food here tastes terrible."
#Determining the Polarity
p_1 = TextBlob(text_1).sentiment.polarity
p_2 = TextBlob(text_2).sentiment.polarity
#Determining the Subjectivity
s_1 = TextBlob(text_1).sentiment.subjectivity
s_2 = TextBlob(text_2).sentiment.subjectivity
print("Polarity of Text 1 is", p_1)
print("Polarity of Text 2 is", p_2)
print("Subjectivity of Text 1 is", s_1)
print("Subjectivity of Text 2 is", s_2)

Output:

Polarity of Text 1 is 1.0 
Polarity of Text 2 is -1.0 
Subjectivity of Text 1 is 1.0 
Subjectivity of Text 2 is 1.0

Using VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analyzer that is tuned to social media text. Just like TextBlob, it is simple to use in Python. The code below shows its usage with an example.

Installation:

pip install vaderSentiment

Importing the SentimentIntensityAnalyzer class from VADER:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

Code for sentiment analysis using VADER:

First, we create an object of the SentimentIntensityAnalyzer class; then we pass the text to the object's polarity_scores() function as follows:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sentiment = SentimentIntensityAnalyzer()
text_1 = "The book was a perfect balance between wrtiting style and plot."
text_2 =  "The pizza tastes terrible."
sent_1 = sentiment.polarity_scores(text_1)
sent_2 = sentiment.polarity_scores(text_2)
print("Sentiment of text 1:", sent_1)
print("Sentiment of text 2:", sent_2)

Output:

Sentiment of text 1: {'neg': 0.0, 'neu': 0.73, 'pos': 0.27, 'compound': 0.5719} 
Sentiment of text 2: {'neg': 0.508, 'neu': 0.492, 'pos': 0.0, 'compound': -0.4767}

As we can see, polarity_scores() returns a dictionary of sentiment scores for the analyzed text.
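
The compound value is the one usually used to pick a single label. The sketch below applies the commonly used cutoffs of +0.05 and -0.05 to the two dictionaries printed above; the function name is hypothetical and the threshold is a convention that can be adjusted.

```python
def vader_label(scores, threshold=0.05):
    """Classify a VADER score dictionary by its compound value."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

sent_1 = {"neg": 0.0, "neu": 0.73, "pos": 0.27, "compound": 0.5719}
sent_2 = {"neg": 0.508, "neu": 0.492, "pos": 0.0, "compound": -0.4767}
print(vader_label(sent_1), vader_label(sent_2))
```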

Using Bag of Words Vectorization-Based Models

In the two approaches discussed so far, TextBlob and VADER, we simply used Python libraries to perform sentiment analysis. Now we discuss an approach in which we train our own model for the task. The steps involved in performing sentiment analysis with the Bag of Words vectorization method are as follows:

  1. Pre-process the text of the training data (text pre-processing involves normalization, tokenization, stop word removal, and stemming/lemmatization).
  2. Create a Bag of Words for the pre-processed text using count vectorization or TF-IDF vectorization.
  3. Train a suitable classification model on the processed data for sentiment classification.
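
Step 2 mentions TF-IDF as an alternative to raw counts. The following is a minimal sketch of the textbook formula (term frequency times the log of inverse document frequency) in plain Python. It is an illustration only: scikit-learn's TfidfVectorizer uses a smoothed variant with normalization, so its numbers will differ. The function name and sample sentences are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dictionary per document.

    Uses tf = count / document length and idf = ln(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    weights = []
    for doc in tokenized:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(["good movie", "bad movie"])
print(weights)
```

A word that appears in every document, such as "movie" here, gets weight zero, while words that distinguish one document from another keep a positive weight.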

Code for sentiment analysis using the Bag of Words vectorization approach:

To build a sentiment analysis model with the Bag of Words vectorization approach, we need a labeled dataset. As stated earlier, the dataset used for this demonstration was obtained from Kaggle. We simply used sklearn's CountVectorizer to create the Bag of Words. We then trained a Multinomial Naive Bayes classifier, for which an accuracy score of 0.84 was obtained.

The dataset can be downloaded from Kaggle.

#Loading the dataset
import pandas as pd
data = pd.read_csv('Finance_data.csv')
#Pre-processing and Bag of Words vectorization using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(stop_words='english',ngram_range = (1,1),tokenizer = token.tokenize)
text_counts = cv.fit_transform(data['sentences'])
#Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, data['feedback'], test_size=0.25, random_state=5)
#Training the model
from sklearn.naive_bayes import MultinomialNB
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)
#Calculating the accuracy score of the model
from sklearn import metrics
predicted = MNB.predict(X_test)
accuracy_score = metrics.accuracy_score(Y_test, predicted)
print("Accuracy Score: ",accuracy_score)

Output:

Accuracy Score:  0.9111675126903553

The trained classifier can be used to predict the sentiment of any given text input.

Using LSTM-Based Models

Although we obtained a decent accuracy score with the Bag of Words vectorization method, it may not yield the same results on larger datasets. This motivates the use of deep learning models for training the sentiment analysis model.

For NLP tasks, we generally use RNN-based models since they are designed to handle sequential data. Here, we train an LSTM (Long Short-Term Memory) model using TensorFlow with Keras. The steps to perform sentiment analysis using LSTM-based models are as follows:

  1. Pre-process the text of the training data (text pre-processing involves normalization, tokenization, stop word removal, and stemming/lemmatization).
  2. Import Tokenizer from keras.preprocessing.text and create an object of it. Fit the tokenizer on the entire training text (so that it learns the vocabulary of the training data). Generate text embeddings using the tokenizer's texts_to_sequences() method and store them after padding them to an equal length. (Embeddings are numerical/vectorized representations of text. Since we cannot feed text directly to the model, we first need to convert it into embeddings.)
  3. After generating the embeddings, we are ready to build the model. We build the model using TensorFlow, adding Input, LSTM, and dense layers to it. Add dropout and tune the hyperparameters to get a decent accuracy score. Generally, we tend to use ReLU or LeakyReLU activation functions in the inner layers of LSTM models, as they avoid the vanishing gradient problem. At the output layer, we use a Softmax or Sigmoid activation function.
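
The softmax activation mentioned in step 3 turns the raw outputs of the last layer into class probabilities. Here is a small, numerically stable sketch in plain Python; the logits and the order of the labels are invented for the example and do not come from the trained model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical outputs of a 3-unit final layer and a hypothetical label order.
labels = ["negative", "neutral", "positive"]
probs = softmax([2.0, 1.0, 0.1])
predicted = labels[probs.index(max(probs))]
print(probs, predicted)
```

With three output units, as in the model below, the predicted class is simply the one with the largest probability.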

Code for sentiment analysis using the LSTM-based model approach:

Here, we used the same dataset as in the Bag of Words approach. A training accuracy of 0.90 was obtained.

# Importing the necessary libraries
import nltk
import pandas as pd
from textblob import Word
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
# Note: keras.preprocessing.text ships with older Keras / TensorFlow 2.x releases;
# this code assumes such a version is installed.
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Embedding, LSTM, LeakyReLU, SpatialDropout1D

# Downloading the NLTK resources used for stop-word removal and lemmatization
nltk.download('stopwords')
nltk.download('wordnet')

# Loading the dataset
data = pd.read_csv('Finance_data.csv')

# Pre-processing the text
def cleaning(df, stop_words):
    # Lowercasing
    df['sentences'] = df['sentences'].str.lower()
    # Removing digits/numbers
    df['sentences'] = df['sentences'].str.replace(r'\d+', '', regex=True)
    # Removing stop words
    df['sentences'] = df['sentences'].apply(lambda text: ' '.join(w for w in text.split() if w not in stop_words))
    # Lemmatization
    df['sentences'] = df['sentences'].apply(lambda text: ' '.join(Word(w).lemmatize() for w in text.split()))
    return df

stop_words = set(stopwords.words('english'))
data_cleaned = cleaning(data, stop_words)

# Generating integer sequences using the tokenizer
# (the text column is 'sentences', the same column that was cleaned above)
tokenizer = Tokenizer(num_words=500, split=' ')
tokenizer.fit_on_texts(data_cleaned['sentences'].values)
X = tokenizer.texts_to_sequences(data_cleaned['sentences'].values)
X = pad_sequences(X)

# One-hot encoding the labels
# (assumption: the label column is named 'label' and holds three classes;
# adjust the name to match your copy of the dataset)
y = pd.get_dummies(data_cleaned['label']).values.astype('float32')

# Splitting into training and test sets (the 80/20 split is an assumption)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model building
model = Sequential()
model.add(Embedding(500, 120, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(704, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(352))
model.add(LeakyReLU())
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

# Model training
model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=1)

# Model testing (returns the loss and the accuracy on the test set)
print(model.evaluate(X_test, y_test))
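The model above ends in a three-unit softmax layer trained with categorical cross-entropy, which expects one-hot targets and returns one probability per class. The following pure-Python sketch illustrates that encoding and how a prediction could be mapped back to a label. The helper names and the class names are hypothetical, chosen only for illustration.

```python
def one_hot(labels):
    """Encode string labels as one-hot vectors; classes are ordered alphabetically."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    vectors = [[1.0 if i == index[label] else 0.0 for i in range(len(classes))]
               for label in labels]
    return vectors, classes

def decode(probabilities, classes):
    """Return the class whose predicted probability is highest."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return classes[best]

# Hypothetical labels for illustration
vectors, classes = one_hot(["positive", "negative", "neutral", "positive"])
print(classes)                           # class order used by the encoding
print(vectors[0])                        # one-hot vector of the first label
print(decode([0.1, 0.7, 0.2], classes))  # label for a sample softmax output
```

The class order used when encoding the training labels must be the same one used when decoding predictions.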

Using Transformer-Based Models

Transformer-based models are among the most advanced natural language processing techniques. They follow an encoder-decoder architecture and use self-attention to produce impressive results. Although one can always build a transformer model from scratch, it is quite a tedious task. Instead, we can use the pre-trained transformer models available on Hugging Face. Hugging Face is an open-source AI community that offers a multitude of pre-trained models for NLP applications. These models can be used as they are or fine-tuned for specific tasks.
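As a rough illustration of the self-attention idea mentioned above, here is a toy sketch of scaled dot-product attention in plain Python. It assumes the queries, keys, and values are all the raw input vectors, with no learned projection matrices, so it is a simplification and not how a full transformer layer is implemented.

```python
import math

def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (toy assumption)."""
    d = len(x[0])
    output = []
    for query in x:
        # Similarity of this token to every token, scaled by sqrt(d)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        weights = softmax(scores)
        # Weighted average of all token vectors
        output.append([sum(w * value[j] for w, value in zip(weights, x))
                       for j in range(d)])
    return output

tokens = [[1.0, 0.0], [0.0, 1.0]]
print(self_attention(tokens))
```

Each output vector is a weighted mix of all input vectors, with more weight given to the inputs most similar to the token in question.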

Installation:

pip install transformers

Importing the transformers library:

import transformers

Code for sentiment analysis using Transformer-based models:

To run any task with transformers, we first import the pipeline function from transformers. We then call it to create a pipeline object, passing the task to perform as an argument (sentiment analysis, in our case). We can also specify the model to use for the task. Here, since we do not name a model, distilbert-base-uncased-finetuned-sst-2-english is used by default for sentiment analysis. The lists of available tasks and models can be found on the Hugging Face website.

from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["It was the best of times.", "It was the worst of times."]
sentiment_pipeline(data)

Output:

[{'label': 'POSITIVE', 'score': 0.999457061290741},
 {'label': 'NEGATIVE', 'score': 0.9987301230430603}]
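As noted above, the model can also be named explicitly instead of relying on the default. The sketch below passes the model identifier through the model argument of pipeline and adds a small helper that tallies the predicted labels. The names build_pipeline and label_counts are hypothetical helpers, not part of the transformers library.

```python
from collections import Counter

MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"

def build_pipeline(model_name=MODEL_NAME):
    """Create a sentiment-analysis pipeline for an explicitly named model."""
    # Imported here so label_counts can be used without transformers installed
    from transformers import pipeline
    return pipeline("sentiment-analysis", model=model_name)

def label_counts(results):
    """Count how many inputs received each label in a list of pipeline results."""
    return Counter(result["label"] for result in results)

# Usage (downloads the model on first run):
# classifier = build_pipeline()
# results = classifier(["It was the best of times.", "It was the worst of times."])
# print(label_counts(results))
```

Naming the model keeps results reproducible if the library's default model changes in a later release.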

Conclusion

In an era where users can express their views effortlessly and data is generated in abundance within fractions of a second, extracting insights from that data is vital for organizations to make effective decisions, and sentiment analysis proves to be the missing piece of the puzzle!

So far, we have covered in detail what sentiment analysis involves and the various methods we can use to perform it in Python. But these were only rudimentary demonstrations; you should go ahead and tinker with the models and try them on your own data.

Source: https://www.analyticsvidhya.com/blog/2022/07/sentiment-analysis-using-python/
