1647072000

[Matplotlib Beginner Tutorial] Legends (Legend) and Annotations (Text, Annotate)

``````import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-4, 4, 100)
y1 = 2 * x
y2 = x ** 2``````

Legend

``````plt.plot(x, y1)
plt.plot(x, y2)
plt.legend(['line1', 'line2'])
plt.show()``````

``````plt.plot(x, y1, label = "line 1") # legend needs a label in order to show the line
plt.plot(x, y2, label = 'line 2')
plt.legend()
plt.show()``````

``````plt.plot(x, y1, label = "line 1")
plt.plot(x, y2, label = 'line 2')
plt.legend(loc = "upper right") # 其他loc参数: upper right, upper left, lower right, lower left, ...
plt.show()``````

``````plt.plot(x, y1, label = "linear line") # legend需要label才能展示
plt.plot(x, y2, label = "line 2")
plt.legend(loc = 0, title = "legend title", shadow=True, ncol = 2, facecolor = "#F5F5F5")
plt.show()``````

Annotations

Matplotlib offers two kinds of annotations: non-directional text annotations created with text, and directional annotations with an arrow created with annotate.

Text: Non-Directional Annotations

``````plt.plot(x, y1)
plt.plot(x, y2)
plt.text(-0.5, 5, "two functions") # start the text annotation at x = -0.5, y = 5
plt.show()``````

``````plt.plot(x, y1)
plt.plot(x, y2)
plt.text(-1, 5, "two functions", family="Times New Roman", fontsize=18, fontweight="bold", color='red',
bbox=dict(boxstyle="round", fc="none", ec="black"))
plt.show()``````

Annotate: Directional Annotations

Annotate creates directional annotations: in addition to the text, the annotation includes an arrow pointing at a location in the plot.

The annotate() function adds such an annotation; the xy parameter is the point the arrow points to, and xytext is where the annotation text starts:

``````plt.plot(x, y1, label = "line1")
plt.plot(x, y2, label = "line2")
plt.annotate("y = 2x", xy = (1, 2), xytext= (2, 0), arrowprops = dict(arrowstyle="->"))
plt.legend()
plt.show()``````

``````plt.plot(x, y1, label = "line1")
plt.plot(x, y2, label = "line2")
plt.annotate("y = 2x", xy = (1, 2), xytext= (2, 0),
arrowprops = dict(arrowstyle="->"),
bbox=dict(boxstyle="round", fc="none", ec="gray")) # boxstyle方形外框: facecolor, edgecolor
plt.legend()
plt.show()``````

``````plt.plot(x, y1, label = "line1")
plt.plot(x, y2, label = "line2")
plt.annotate("y = 2x", xy = (1, 2), xytext= (2, 0),
bbox=dict(boxstyle="round", fc="none", ec="gray"))
plt.legend()
plt.show()``````

00:24 - Legend
02:49 - Non-directional annotations (Text)
04:41 - Directional annotations (Annotate)

1650870267

Navigating Between DOM Nodes in JavaScript

In the previous chapters you've learnt how to select individual elements on a web page. But there are many occasions where you need to access a child, parent or ancestor element. See the JavaScript DOM nodes chapter to understand the logical relationships between the nodes in a DOM tree.

A DOM node provides several properties and methods that allow you to navigate or traverse the tree structure of the DOM and make changes easily. In the following section, we will learn how to navigate up, down, and sideways in the DOM tree using JavaScript.

Accessing the Child Nodes

You can use the `firstChild` and `lastChild` properties of a DOM node to access its first and last direct child node, respectively. If the node doesn't have any child nodes, these properties return `null`.

Example

``````<div id="main">
<p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");
console.log(main.firstChild.nodeName); // Prints: #text

var hint = document.getElementById("hint");
console.log(hint.firstChild.nodeName); // Prints: SPAN
</script>``````

Note: The `nodeName` is a read-only property that returns the name of the current node as a string. For example, it returns the tag name for an element node, `#text` for a text node, `#comment` for a comment node, `#document` for the document node, and so on.

If you look at the example above, the `nodeName` of the first child node of the main DIV element returns #text instead of P. Whitespace characters such as spaces, tabs, and newlines are valid characters; they form #text nodes and become part of the DOM tree. Since the `<div>` tag contains a newline before the `<p>` tag, a #text node is created.

To avoid the issue of `firstChild` and `lastChild` returning #text or #comment nodes, you can alternatively use the `firstElementChild` and `lastElementChild` properties, which return only the first and last element node, respectively. However, they do not work in IE 9 and earlier.

Example

``````<div id="main">
<p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");
main.firstElementChild.style.color = "red";

var hint = document.getElementById("hint");
hint.firstElementChild.style.color = "blue";
</script>``````

Similarly, you can use the `childNodes` property to access all child nodes of a given element, where the first child node is assigned index 0. Here's an example:

Example

``````<div id="main">
<p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");

// First check that the element has child nodes
if(main.hasChildNodes()) {
    var nodes = main.childNodes;

    // Loop through the node list and display each node's name
    for(var i = 0; i < nodes.length; i++) {
        console.log(nodes[i].nodeName);
    }
}
</script>``````

The `childNodes` property returns all child nodes, including non-element nodes like text and comment nodes. To get a collection of elements only, use the `children` property instead.

Example

``````<div id="main">
<p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");

// First check that the element has child nodes
if(main.hasChildNodes()) {
    var nodes = main.children;

    // Loop through the element list and display each element's node name
    for(var i = 0; i < nodes.length; i++) {
        console.log(nodes[i].nodeName);
    }
}
</script>``````
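
You can move up and sideways in the tree in the same way. The following is a small sketch (the extra `<p id="more">` element is added here just for illustration) using the standard `parentNode` and `nextElementSibling` properties:

``````<div id="main">
<p id="hint"><span>This is some text.</span></p>
<p id="more">Another paragraph.</p>
</div>

<script>
var hint = document.getElementById("hint");
console.log(hint.parentNode.id);         // Prints: main
console.log(hint.nextElementSibling.id); // Prints: more
</script>``````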

1646033280

Detecting Fake News in Python

Exploring the fake news dataset, performing data analysis such as word clouds and n-grams, and fine-tuning the BERT transformer to build a fake news detector in Python using the Transformers library.

Fake news is the deliberate spreading of false or misleading claims as news, where the statements are purposely deceptive.

Newspapers, tabloids, and magazines have been supplanted by digital news platforms, blogs, social media feeds, and a host of mobile news applications. News organizations benefited from the increased use of social media and mobile platforms by supplying subscribers with up-to-the-minute information.

Consumers now have instant access to the latest news. These digital media platforms have gained importance because of their easy connectivity to the rest of the world; they allow users to discuss and share ideas and to debate topics such as democracy, education, health, research, and history. Fake news on digital platforms is getting ever more popular and is used for profit, such as political and financial gain.

How Big Is This Problem?

Because the Internet, social media, and digital platforms are widely used, anyone can propagate inaccurate and biased information. It is almost impossible to prevent the spread of fake news. There is a tremendous surge in the distribution of false news, which is not restricted to one sector such as politics but also covers sports, health, history, entertainment, and science and research.

The Solution

It is vital to recognize and differentiate between false and accurate news. One method is to have an expert decide and fact-check every piece of information, but this takes time and requires expertise that cannot be shared. Second, we can use machine learning and artificial intelligence tools to automate the identification of fake news.

Online news information includes various unstructured-format data (such as documents, videos, and audio), but we will focus on text-format news here. With the progress of machine learning and natural language processing, we can now recognize the misleading and false character of an article or statement.

Several studies and experiments are being conducted to detect fake news across all media.

Our main goals for this tutorial are:

• Explore and analyze the fake news dataset.
• Build a classifier that can distinguish fake news as accurately as possible.

Here is the table of contents:

• Introduction
• How Big Is This Problem?
• The Solution
• Data Exploration
• Distribution of Classes
• Data Cleaning for Analysis
• Exploratory Data Analysis
• Single-Word Cloud
• Most Frequent Bigram (Two-Word Combination)
• Most Frequent Trigram (Three-Word Combination)
• Building a Classifier by Fine-Tuning BERT
• Data Preparation
• Tokenizing the Dataset
• Loading and Fine-Tuning the Model
• Model Evaluation
• Appendix: Creating a Submission File for Kaggle
• Conclusion

Data Exploration

In this work, we use the fake news dataset from Kaggle to classify untrustworthy news articles as fake news. We have a complete training dataset with the following characteristics:

• `id`: unique ID for a news article
• `title`: title of a news article
• `author`: author of the news article
• `text`: text of the article; could be incomplete
• `label`: a label that marks the article as potentially unreliable, denoted by 1 (unreliable or fake) or 0 (reliable).

It is a binary classification problem in which we must predict whether a given news article is reliable or not.

If you have a Kaggle account, you can simply download the dataset from the website there and extract the ZIP file.

I also uploaded the dataset to Google Drive, and you can get it here or use the `gdown` library to download it automatically into Google Colab or Jupyter notebooks:

``\$ pip install gdown``
``````# download from Google Drive``````
``````Downloading...
To: /content/fake-news.zip
100% 48.7M/48.7M [00:00<00:00, 74.6MB/s]``````

Unzipping the files:

``\$ unzip fake-news.zip``

Three files will appear in the current working directory: `train.csv`, `test.csv`, and `submit.csv`; we will use `train.csv` for most of the tutorial.

Installing the required dependencies:

``\$ pip install transformers nltk pandas numpy matplotlib seaborn wordcloud``

Note: If you are in a local environment, make sure you install PyTorch for GPU; head to this page for a proper installation.

Let's import the essential libraries for the analysis:

``````import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns``````

``````import nltk
# download the NLTK resources needed below for stop word removal and lemmatization
nltk.download("stopwords")
nltk.download("wordnet")``````

The fake news dataset includes original and fictitious article titles and texts from various authors. Let's import our dataset:

``````# load the dataset
news_d = pd.read_csv("train.csv")``````
``````print("Shape of News data:", news_d.shape)
print("News data columns", news_d.columns)``````

Output:

`````` Shape of News data: (20800, 5)
News data columns Index(['id', 'title', 'author', 'text', 'label'], dtype='object')``````

Here is what the dataset looks like:

``````# by using df.head(), we can immediately familiarize ourselves with the dataset
news_d.head()``````

Output:

``````id	title	author	text	label
0	0	House Dem Aide: We Didn’t Even See Comey’s Let...	Darrell Lucus	House Dem Aide: We Didn’t Even See Comey’s Let...	1
1	1	FLYNN: Hillary Clinton, Big Woman on Campus - ...	Daniel J. Flynn	Ever get the feeling your life circles the rou...	0
2	2	Why the Truth Might Get You Fired	Consortiumnews.com	Why the Truth Might Get You Fired October 29, ...	1
3	3	15 Civilians Killed In Single US Airstrike Hav...	Jessica Purkiss	Videos 15 Civilians Killed In Single US Airstr...	1
4	4	Iranian woman jailed for fictional unpublished...	Howard Portnoy	Print \nAn Iranian woman has been sentenced to...	1``````

We have 20,800 rows and five columns. Let's look at some statistics of the `text` column:

``````# Text word statistics: min, mean, max, and interquartile range

txt_length = news_d.text.str.split().str.len()
txt_length.describe()``````

Output:

``````count    20761.000000
mean       760.308126
std        869.525988
min          0.000000
25%        269.000000
50%        556.000000
75%       1052.000000
max      24234.000000
Name: text, dtype: float64``````

Statistics for the `title` column:

``````#Title statistics

title_length = news_d.title.str.split().str.len()
title_length.describe()``````

Output:

``````count    20242.000000
mean        12.420709
std          4.098735
min          1.000000
25%         10.000000
50%         13.000000
75%         15.000000
max         72.000000
Name: title, dtype: float64``````

The statistics for the training set are as follows:

• The `text` attribute has the higher word count, with an average of 760 words; the 75th percentile is roughly 1,050 words.
• The `title` attribute is a short statement, with an average of 12 words; the 75th percentile is about 15 words.

Our experiment will use both the text and the title together.

Distribution of Classes

Count plot for both labels:

``````sns.countplot(x="label", data=news_d);
print("1: Unreliable")
print("0: Reliable")
print("Distribution of labels:")
print(news_d.label.value_counts());``````

Output:

``````1: Unreliable
0: Reliable
Distribution of labels:
1    10413
0    10387
Name: label, dtype: int64``````

``print(round(news_d.label.value_counts(normalize=True),2)*100);``

Output:

``````1    50.0
0    50.0
Name: label, dtype: float64``````

The number of untrustworthy articles (fake, or 1) is 10,413, while the number of trustworthy articles (reliable, or 0) is 10,387. Almost 50% of the articles are fake, so the classes are nearly balanced and accuracy is a meaningful metric for judging how well our classifier performs.

Data Cleaning for Analysis

In this section, we will clean our dataset to do some analysis:

• Drop unused rows and columns.
• Perform null value imputation.
• Remove special characters.
• Remove stop words.
``````# Constants that are used to sanitize the datasets

column_n = ['id', 'title', 'author', 'text', 'label']
remove_c = ['id','author']
categorical_features = []
target_col = ['label']
text_f = ['title', 'text']``````
``````# Clean Datasets
import nltk
from nltk.corpus import stopwords
import re
from nltk.stem.porter import PorterStemmer
from collections import Counter

ps = PorterStemmer()
wnl = nltk.stem.WordNetLemmatizer()

stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)

# Remove unused columns
def remove_unused_c(df, column_n=remove_c):
    df = df.drop(column_n, axis=1)
    return df

# Impute null values with "None"
def null_process(feature_df):
    for col in text_f:
        feature_df.loc[feature_df[col].isnull(), col] = "None"
    return feature_df

def clean_dataset(df):
    # remove unused columns
    df = remove_unused_c(df)
    # impute null values
    df = null_process(df)
    return df

# Cleaning text from unused characters
def clean_text(text):
    text = re.sub(r'http[\w:/\.]+', ' ', str(text))  # remove urls
    text = re.sub(r'[^\.\w\s]', ' ', text)           # remove everything but characters and punctuation
    text = re.sub(r'[^a-zA-Z]', ' ', text)           # keep alphabetic characters only
    text = re.sub(r'\s\s+', ' ', text)               # collapse repeated whitespace
    text = text.lower().strip()
    #text = ' '.join(text)
    return text

## NLTK preprocessing includes:
# stop word removal, stemming and lemmatization
# For our project we use stop word removal together with lemmatization
def nltk_preprocess(text):
    text = clean_text(text)
    wordlist = re.sub(r'[^\w\s]', '', text).split()
    #text = ' '.join([word for word in wordlist if word not in stopwords_dict])
    #text = [ps.stem(word) for word in wordlist if not word in stopwords_dict]
    text = ' '.join([wnl.lemmatize(word) for word in wordlist if word not in stopwords_dict])
    return text``````

In the above code block:

• We imported NLTK, a well-known platform for developing Python applications that interact with human language. Next, we import `re` for regular expressions.
• We import stop words from `nltk.corpus`. When working with words, particularly when considering semantics, we sometimes need to eliminate common words that do not add any significant meaning to a statement, such as `"but"`, `"can"`, `"we"`, etc.
• `PorterStemmer` is used to perform stemming with NLTK. Stemmers strip words of their morphological affixes, leaving only the word stem.
• We import `WordNetLemmatizer()` from the NLTK library for lemmatization. Lemmatization is much more effective than stemming. It goes beyond word reduction and evaluates a language's entire lexicon to apply morphological analysis to words, with the goal of removing only inflectional endings and returning the base or dictionary form of a word, known as the lemma.
• `stopwords.words('english')` lets us look at the list of all English stop words supported by NLTK.
• The `remove_unused_c()` function is used to remove the unused columns.
• We impute null values with `None` using the `null_process()` function.
• Inside the `clean_dataset()` function, we call the `remove_unused_c()` and `null_process()` functions. This function is responsible for cleaning the data.
• To clean text of unused characters, we created the `clean_text()` function.
• For preprocessing, we use stop word removal (together with lemmatization, as in the code above). We created the `nltk_preprocess()` function for that purpose.

Preprocessing the `text` and `title`:

``````# Perform data cleaning on train and test dataset by calling clean_dataset function
df = clean_dataset(news_d)
# apply preprocessing on text through apply method by calling the function nltk_preprocess
df["text"] = df.text.apply(nltk_preprocess)
# apply preprocessing on title through apply method by calling the function nltk_preprocess
df["title"] = df.title.apply(nltk_preprocess)``````
``````# Dataset after cleaning and preprocessing step
df.head()``````

Output:

``````title	text	label
0	house dem aide didnt even see comeys letter ja...	house dem aide didnt even see comeys letter ja...	1
1	flynn hillary clinton big woman campus breitbart	ever get feeling life circle roundabout rather...	0
2	truth might get fired	truth might get fired october 29 2016 tension ...	1
3	15 civilian killed single u airstrike identified	video 15 civilian killed single u airstrike id...	1
4	iranian woman jailed fictional unpublished sto...	print iranian woman sentenced six year prison ...	1``````

Exploratory Data Analysis

In this section, we will perform:

• Univariate analysis: a statistical analysis of the text. We will use word clouds for this purpose. A word cloud is a visualization approach for text data in which the most common term is shown in the largest font size.
• Bivariate analysis: bigrams and trigrams will be used here. According to Wikipedia: "an n-gram is a contiguous sequence of n items from a given sample of text or speech. Depending on the application, the items can be phonemes, syllables, letters, words, or base pairs. The n-grams are typically collected from a text or speech corpus" (a tiny toy example follows right after this list).
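
To make the n-gram idea concrete, here is a small sketch (not part of the original analysis) of what `nltk.ngrams` produces for word bigrams; the sample sentence is made up:

``````import nltk

# toy sentence (hypothetical, not from the dataset)
sample = "the quick brown fox jumps"
# word bigrams: contiguous pairs of words
print(list(nltk.ngrams(sample.split(), 2)))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]``````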

Single-Word Cloud

The most frequent words appear in bold and in a larger font in a word cloud. This section builds a word cloud of all the words in the dataset.

The `WordCloud()` class from the wordcloud library is used, and its `generate()` method produces the word cloud image:

``````from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# initialize the word cloud
wordcloud = WordCloud( background_color='black', width=800, height=600)
# generate the word cloud by passing the corpus
text_cloud = wordcloud.generate(' '.join(df['text']))
# plotting the word cloud
plt.figure(figsize=(20,30))
plt.imshow(text_cloud)
plt.axis('off')
plt.show()``````

Output:

Word cloud for reliable news only:

``````true_n = ' '.join(df[df['label']==0]['text'])
wc = wordcloud.generate(true_n)
plt.figure(figsize=(20,30))
plt.imshow(wc)
plt.axis('off')
plt.show()``````

Output:

Word cloud for fake news only:

``````fake_n = ' '.join(df[df['label']==1]['text'])
wc= wordcloud.generate(fake_n)
plt.figure(figsize=(20,30))
plt.imshow(wc)
plt.axis('off')
plt.show()``````

Output:

Most Frequent Bigram (Two-Word Combination)

An N-gram is a sequence of letters or words. A character unigram consists of a single character, while a bigram comprises a series of two characters. Similarly, word N-grams consist of a series of n words. The word "united" is a 1-gram (unigram), the combination of the words "united state" is a 2-gram (bigram), and "new york city" is a 3-gram.

Let's plot the most common bigrams in the reliable news:

``````def plot_top_ngrams(corpus, title, ylabel, xlabel="Number of Occurences", n=2):
    """Utility function to plot top n-grams"""
    true_b = (pd.Series(nltk.ngrams(corpus.split(), n)).value_counts())[:20]
    true_b.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
    plt.title(title)
    plt.ylabel(ylabel)
    plt.xlabel(xlabel)
    plt.show()``````
``plot_top_ngrams(true_n, 'Top 20 Frequently Occuring True news Bigrams', "Bigram", n=2)``

The most common bigrams in the fake news:

``plot_top_ngrams(fake_n, 'Top 20 Frequently Occuring Fake news Bigrams', "Bigram", n=2)``

Most Frequent Trigram (Three-Word Combination)

The most common trigrams in the reliable news:

``plot_top_ngrams(true_n, 'Top 20 Frequently Occuring True news Trigrams', "Trigrams", n=3)``

Now for the fake news:

``plot_top_ngrams(fake_n, 'Top 20 Frequently Occuring Fake news Trigrams', "Trigrams", n=3)``

The above plots give us some idea of what both classes look like. In the next section, we will use the Transformers library to build a fake news detector.

Building a Classifier by Fine-Tuning BERT

This section borrows code extensively from the fine-tuning BERT tutorial to make a fake news classifier using the Transformers library. For more detailed information, you can head to the original tutorial.

If you haven't installed transformers yet, you have to:

``\$ pip install transformers``

Let's import the necessary libraries:

``````import torch
from transformers.file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
from sklearn.model_selection import train_test_split

import random``````

We want to make our results reproducible even if we restart our environment:

``````def set_seed(seed: int):
    """
    Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch`` and/or ``tf`` (if
    installed).

    Args:
        seed (:obj:`int`): The seed to set.
    """
    random.seed(seed)
    np.random.seed(seed)
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # ^^ safe to call this function even if cuda is not available
    if is_tf_available():
        import tensorflow as tf

        tf.random.set_seed(seed)

set_seed(1)``````

The model we are going to use is `bert-base-uncased`:

``````# the model we gonna train, base uncased BERT
# check text classification models here: https://huggingface.co/models?filter=text-classification
model_name = "bert-base-uncased"
# max sequence length for each document/sentence sample
max_length = 512``````

``````# load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)``````

Data Preparation

Let's now clean the `NaN` values from the `text`, `author`, and `title` columns:

``````news_df = news_d[news_d['text'].notna()]
news_df = news_df[news_df["author"].notna()]
news_df = news_df[news_df["title"].notna()]``````

Next, let's make a function that takes the dataset as a Pandas dataframe and returns the train/validation splits of texts and labels as lists:

``````def prepare_data(df, test_size=0.2, include_title=True, include_author=True):
    texts = []
    labels = []
    for i in range(len(df)):
        text = df["text"].iloc[i]
        label = df["label"].iloc[i]
        if include_title:
            text = df["title"].iloc[i] + " - " + text
        if include_author:
            text = df["author"].iloc[i] + " : " + text
        if text and label in [0, 1]:
            texts.append(text)
            labels.append(label)
    return train_test_split(texts, labels, test_size=test_size)

train_texts, valid_texts, train_labels, valid_labels = prepare_data(news_df)``````

The above function takes the dataset as a dataframe and returns it as lists split into training and validation sets. Setting `include_title` to `True` means we add the `title` column to the `text` we are going to use for training, and setting `include_author` to `True` means we also prepend the `author` column to the text.

Let's make sure the labels and texts have the same length:

``````print(len(train_texts), len(train_labels))
print(len(valid_texts), len(valid_labels))``````

Output:

``````14628 14628
3657 3657``````

Tokenizing the Dataset

Let's use the BERT tokenizer to tokenize our dataset:

``````# tokenize the dataset, truncate when passed `max_length`,
# and pad with 0's when less than `max_length`
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)``````

Converting the encodings into a PyTorch dataset:

``````class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

# convert our tokenized data into a torch Dataset
train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)``````

We will use `BertForSequenceClassification` to load our BERT transformer model:

``````# load the model
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)``````

We set `num_labels` to 2 since it is a binary classification. The function below is a callback to compute the accuracy on each validation step:

``````from sklearn.metrics import accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # calculate accuracy using sklearn's function
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
    }``````

Let's initialize the training parameters:

``````training_args = TrainingArguments(
output_dir='./results',          # output directory
num_train_epochs=1,              # total number of training epochs
per_device_train_batch_size=10,  # batch size per device during training
per_device_eval_batch_size=20,   # batch size for evaluation
warmup_steps=100,                # number of warmup steps for learning rate scheduler
logging_dir='./logs',            # directory for storing logs
load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
# but you can specify `metric_for_best_model` argument to change to accuracy or other metric
logging_steps=200,               # log & save weights each logging_steps
save_steps=200,
evaluation_strategy="steps",     # evaluate each `logging_steps`
)``````

I've set `per_device_train_batch_size` to 10, but you should set it as high as your GPU can fit. We set `logging_steps` and `save_steps` to 200, meaning we perform evaluation and save the model weights every 200 training steps.

You can check this page for more detailed information about the available training parameters.

Let's instantiate the trainer:

``````trainer = Trainer(
model=model,                         # the instantiated Transformers model to be trained
args=training_args,                  # training arguments, defined above
train_dataset=train_dataset,         # training dataset
eval_dataset=valid_dataset,          # evaluation dataset
compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)``````

Training the model:

``````# train the model
trainer.train()``````

The training takes a few hours to finish, depending on your GPU. If you're on the free version of Colab, it should take about an hour with an NVIDIA Tesla K80. Here is the output:

``````***** Running training *****
Num examples = 14628
Num Epochs = 1
Instantaneous batch size per device = 10
Total train batch size (w. parallel, distributed & accumulation) = 10
Total optimization steps = 1463
[1463/1463 41:07, Epoch 1/1]
Step	Training Loss	Validation Loss	Accuracy
200		0.250800		0.100533		0.983867
400		0.027600		0.043009		0.993437
600		0.023400		0.017812		0.997539
800		0.014900		0.030269		0.994258
1000	0.022400		0.012961		0.998086
1200	0.009800		0.010561		0.998633
1400	0.007700		0.010300		0.998633
***** Running Evaluation *****
Num examples = 3657
Batch size = 20
Saving model checkpoint to ./results/checkpoint-200
Configuration saved in ./results/checkpoint-200/config.json
Model weights saved in ./results/checkpoint-200/pytorch_model.bin
<SNIPPED>
***** Running Evaluation *****
Num examples = 3657
Batch size = 20
Saving model checkpoint to ./results/checkpoint-1400
Configuration saved in ./results/checkpoint-1400/config.json
Model weights saved in ./results/checkpoint-1400/pytorch_model.bin

Training completed. Do not forget to share your model on huggingface.co/models =)

TrainOutput(global_step=1463, training_loss=0.04888018785440506, metrics={'train_runtime': 2469.1722, 'train_samples_per_second': 5.924, 'train_steps_per_second': 0.593, 'total_flos': 3848788517806080.0, 'train_loss': 0.04888018785440506, 'epoch': 1.0})``````

Model Evaluation

Since `load_best_model_at_end` is set to `True`, the best weights will be loaded when the training is complete. Let's evaluate it with our validation set:

``````# evaluate the current model after training
trainer.evaluate()``````

Output:

``````***** Running Evaluation *****
Num examples = 3657
Batch size = 20
[183/183 02:11]
{'epoch': 1.0,
'eval_accuracy': 0.998632759092152,
'eval_loss': 0.010299865156412125,
'eval_runtime': 132.0374,
'eval_samples_per_second': 27.697,
'eval_steps_per_second': 1.386}``````

Saving the model and the tokenizer:

``````# saving the fine tuned model & tokenizer
model_path = "fake-news-bert-base-uncased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)``````

After running the above cell, you will see a new folder containing the model configuration and weights. If you want to perform inference later, simply use the `from_pretrained()` method we used when loading the model, and you're good to go.
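
As a quick sketch (this cell is not in the original notebook), reloading the saved model and tokenizer later for inference could look like this, reusing the `model_path` folder defined above:

``````# minimal sketch: reload the fine-tuned model & tokenizer from disk for inference
from transformers import BertTokenizerFast, BertForSequenceClassification

model_path = "fake-news-bert-base-uncased"
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path).to("cuda")``````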

Next, let's make a function that accepts the article text as an argument and returns whether it is fake or not:

``````def get_prediction(text, convert_to_label=False):
    # prepare our text into tokenized sequence
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to("cuda")
    # perform inference to our model
    outputs = model(**inputs)
    # get output probabilities by doing softmax
    probs = outputs[0].softmax(1)
    # executing argmax function to get the candidate label
    d = {
        0: "reliable",
        1: "fake"
    }
    if convert_to_label:
        return d[int(probs.argmax())]
    else:
        return int(probs.argmax())``````

I took an example from `test.csv` that the model has never seen during training; I checked it, and it is an actual article from The New York Times:

``````real_news = """
Tim Tebow Will Attempt Another Comeback, This Time in Baseball - The New York Times",Daniel Victor,"If at first you don’t succeed, try a different sport. Tim Tebow, who was a Heisman   quarterback at the University of Florida but was unable to hold an N. F. L. job, is pursuing a career in Major League Baseball. <SNIPPED>
"""``````

The original text is in the Colab environment in case you want to copy it, as it is a complete article. Let's pass it to the model and see the results:

``get_prediction(real_news, convert_to_label=True)``

Output:

``reliable``

Appendix: Creating a Submission File for Kaggle

In this section, we will run predictions on all the articles in `test.csv` to create a submission file and see our accuracy on the test set of the Kaggle competition:

``````# read the test set
test_df = pd.read_csv("test.csv")
# make a copy of the testing set
new_df = test_df.copy()
# add a new column that contains the author, title and article content
new_df["new_text"] = new_df["author"].astype(str) + " : " + new_df["title"].astype(str) + " - " + new_df["text"].astype(str)
# get the prediction of all the test set
new_df["label"] = new_df["new_text"].apply(get_prediction)
# make the submission file
final_df = new_df[["id", "label"]]
final_df.to_csv("submit_final.csv", index=False)``````

After concatenating the author, title, and article text together, we pass the new column to the `get_prediction()` function to fill the `label` column, and then use the `to_csv()` method to create the submission file for Kaggle. Here is my submission score:

We got 99.78% accuracy on the private leaderboard and 100% on the public leaderboard. That's awesome!

Conclusion

Alright, we're done with the tutorial. You can check this page to see the various training parameters you can tweak.

If you have a custom fake news dataset for fine-tuning, you simply have to pass a list of samples to the tokenizer as we did; you won't need to change any other code after that.
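
For instance, a minimal sketch (assuming you already have your own Python lists of texts and 0/1 labels; the variable names below are made up) would reuse the tokenizer and Dataset class from above:

``````# hypothetical custom data: plain Python lists of article texts and 0/1 labels
custom_texts = ["first custom article text ...", "second custom article text ..."]
custom_labels = [0, 1]
# tokenize exactly as we did for the Kaggle data, then wrap it in the same Dataset class
custom_encodings = tokenizer(custom_texts, truncation=True, padding=True, max_length=max_length)
custom_dataset = NewsGroupsDataset(custom_encodings, custom_labels)``````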

Check out the full code here, or the Colab environment here.

1646029620

How to Build a Fake News Detector in Python

Exploring the fake news dataset, performing data analysis such as word clouds and n-grams, and fine-tuning the BERT transformer to build a fake news detector in Python using the Transformers library.

Fake news is the deliberate spreading of false or misleading claims as news, where the statements are purposely deceptive.

Newspapers, tabloids, and magazines have been supplanted by digital news platforms, blogs, social media feeds, and a host of mobile news applications. News organizations benefited from the increased use of social media and mobile platforms by supplying subscribers with up-to-the-minute information.

Consumers now have instant access to the latest news. These digital media platforms have gained importance because of their easy connectivity to the rest of the world; they allow users to discuss and share ideas and to debate topics such as democracy, education, health, research, and history. Fake news on digital platforms is getting ever more popular and is used for profit, such as political and financial gain.

How Big Is This Problem?

Because the Internet, social media, and digital platforms are widely used, anyone can propagate inaccurate and biased information. It is almost impossible to prevent the spread of fake news. There is a tremendous surge in the distribution of false news, which is not restricted to one sector such as politics but also covers sports, health, history, entertainment, and science and research.

The Solution

It is vital to recognize and differentiate between false and accurate news. One method is to have an expert decide and fact-check every piece of information, but this takes time and requires expertise that cannot be shared. Second, we can use machine learning and artificial intelligence tools to automate the identification of fake news.

Online news information includes various unstructured-format data (such as documents, videos, and audio), but we will focus on text-format news here. With the progress of machine learning and natural language processing, we can now recognize the misleading and false character of an article or statement.

Several studies and experiments are being conducted to detect fake news across all media.

Our main goals for this tutorial are:

• Explore and analyze the fake news dataset.
• Build a classifier that can distinguish fake news as accurately as possible.

Here is the table of contents:

• Introduction
• How Big Is This Problem?
• The Solution
• Distribution of Classes
• Data Cleaning for Analysis
• Single-Word Cloud
• Most Frequent Bigram (Two-Word Combination)
• Most Frequent Trigram (Three-Word Combination)
• Building a Classifier by Fine-Tuning BERT
• Tokenizing the Dataset
• Loading and Fine-Tuning the Model
• Model Evaluation
• Appendix: Creating a Submission File for Kaggle
• Conclusion

In this work, we use the fake news dataset from Kaggle to classify untrustworthy news articles as fake news. We have a complete training dataset with the following characteristics:

• `id`: unique ID for a news article
• `title`: title of a news article
• `author`: author of the news article
• `text`: text of the article; could be incomplete
• `label`: a label that marks the article as potentially unreliable, denoted by 1 (unreliable or fake) or 0 (reliable).

It is a binary classification problem in which we must predict whether a given news article is reliable or not.

If you have a Kaggle account, you can simply download the dataset from the website there and extract the ZIP file.

I also uploaded the dataset to Google Drive, and you can get it here or use the `gdown` library to download it automatically into Google Colab or Jupyter notebooks:

``\$ pip install gdown``
``````# download from Google Drive``````
``````Downloading...
To: /content/fake-news.zip
100% 48.7M/48.7M [00:00<00:00, 74.6MB/s]``````

Unzipping the files:

``\$ unzip fake-news.zip``

Three files will appear in the current working directory: `train.csv`, `test.csv`, and `submit.csv`; we will use `train.csv` for most of the tutorial.

Installing the required dependencies:

``\$ pip install transformers nltk pandas numpy matplotlib seaborn wordcloud``

Note: If you are in a local environment, make sure you install PyTorch for GPU; head to this page for a proper installation.

Let's import the essential libraries for the analysis:

``````import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns``````

``````import nltk
# download the NLTK resources needed below for stop word removal and lemmatization
nltk.download("stopwords")
nltk.download("wordnet")``````

The fake news dataset includes original and fictitious article titles and texts from various authors. Let's import our dataset:

``````# load the dataset
news_d = pd.read_csv("train.csv")``````
``````print("Shape of News data:", news_d.shape)
print("News data columns", news_d.columns)``````

Output:

`````` Shape of News data: (20800, 5)
News data columns Index(['id', 'title', 'author', 'text', 'label'], dtype='object')``````

Here is what the dataset looks like:

``````# by using df.head(), we can immediately familiarize ourselves with the dataset
news_d.head()``````

Output:

``````id	title	author	text	label
0	0	House Dem Aide: We Didn’t Even See Comey’s Let...	Darrell Lucus	House Dem Aide: We Didn’t Even See Comey’s Let...	1
1	1	FLYNN: Hillary Clinton, Big Woman on Campus - ...	Daniel J. Flynn	Ever get the feeling your life circles the rou...	0
2	2	Why the Truth Might Get You Fired	Consortiumnews.com	Why the Truth Might Get You Fired October 29, ...	1
3	3	15 Civilians Killed In Single US Airstrike Hav...	Jessica Purkiss	Videos 15 Civilians Killed In Single US Airstr...	1
4	4	Iranian woman jailed for fictional unpublished...	Howard Portnoy	Print \nAn Iranian woman has been sentenced to...	1``````

We have 20,800 rows and five columns. Let's look at some statistics of the `text` column:

``````# Text word statistics: min, mean, max, and interquartile range

txt_length = news_d.text.str.split().str.len()
txt_length.describe()``````

Output:

``````count    20761.000000
mean       760.308126
std        869.525988
min          0.000000
25%        269.000000
50%        556.000000
75%       1052.000000
max      24234.000000
Name: text, dtype: float64``````

Statistics for the `title` column:

``````#Title statistics

title_length = news_d.title.str.split().str.len()
title_length.describe()``````

Output:

``````count    20242.000000
mean        12.420709
std          4.098735
min          1.000000
25%         10.000000
50%         13.000000
75%         15.000000
max         72.000000
Name: title, dtype: float64``````

The statistics for the training set are as follows:

• The `text` attribute has the higher word count, with an average of 760 words; the 75th percentile is roughly 1,050 words.
• The `title` attribute is a short statement, with an average of 12 words; the 75th percentile is about 15 words.

Our experiment will use both the text and the title together.

Distribution of Classes

Count plot for both labels:

``````sns.countplot(x="label", data=news_d);
print("1: Unreliable")
print("0: Reliable")
print("Distribution of labels:")
print(news_d.label.value_counts());``````

Output:

``````1: Unreliable
0: Reliable
Distribution of labels:
1    10413
0    10387
Name: label, dtype: int64``````

``print(round(news_d.label.value_counts(normalize=True),2)*100);``

Output:

``````1    50.0
0    50.0
Name: label, dtype: float64``````

The number of untrustworthy articles (fake, or 1) is 10,413, while the number of trustworthy articles (reliable, or 0) is 10,387. Almost 50% of the articles are fake, so the classes are nearly balanced and accuracy is a meaningful metric for judging how well our classifier performs.

In this section, we will clean our dataset to do some analysis:

• Drop unused rows and columns.
• Perform null value imputation.
• Remove special characters.
``````# Constants that are used to sanitize the datasets

column_n = ['id', 'title', 'author', 'text', 'label']
remove_c = ['id','author']
categorical_features = []
target_col = ['label']
text_f = ['title', 'text']``````
``````# Clean Datasets
import nltk
from nltk.corpus import stopwords
import re
from nltk.stem.porter import PorterStemmer
from collections import Counter

ps = PorterStemmer()
wnl = nltk.stem.WordNetLemmatizer()

stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)

# Remove unused columns
def remove_unused_c(df, column_n=remove_c):
    df = df.drop(column_n, axis=1)
    return df

# Impute null values with "None"
def null_process(feature_df):
    for col in text_f:
        feature_df.loc[feature_df[col].isnull(), col] = "None"
    return feature_df

def clean_dataset(df):
    # remove unused columns
    df = remove_unused_c(df)
    # impute null values
    df = null_process(df)
    return df

# Cleaning text from unused characters
def clean_text(text):
    text = re.sub(r'http[\w:/\.]+', ' ', str(text))  # remove urls
    text = re.sub(r'[^\.\w\s]', ' ', text)           # remove everything but characters and punctuation
    text = re.sub(r'[^a-zA-Z]', ' ', text)           # keep alphabetic characters only
    text = re.sub(r'\s\s+', ' ', text)               # collapse repeated whitespace
    text = text.lower().strip()
    #text = ' '.join(text)
    return text

## NLTK preprocessing includes:
# stop word removal, stemming and lemmatization
# For our project we use stop word removal together with lemmatization
def nltk_preprocess(text):
    text = clean_text(text)
    wordlist = re.sub(r'[^\w\s]', '', text).split()
    #text = ' '.join([word for word in wordlist if word not in stopwords_dict])
    #text = [ps.stem(word) for word in wordlist if not word in stopwords_dict]
    text = ' '.join([wnl.lemmatize(word) for word in wordlist if word not in stopwords_dict])
    return text``````

In the above code block:

• We imported NLTK, a well-known platform for developing Python applications that interact with human language. Next, we import `re` for regular expressions.
• We import stop words from `nltk.corpus`. When working with words, particularly when considering semantics, we sometimes need to eliminate common words that do not add any significant meaning to a statement, such as `"but"`, `"can"`, `"we"`, etc.
• `PorterStemmer` is used to perform stemming with NLTK. Stemmers strip words of their morphological affixes, leaving only the word stem.
• We import `WordNetLemmatizer()` from the NLTK library for lemmatization. Lemmatization is much more effective than stemming. It goes beyond word reduction and evaluates a language's entire lexicon to apply morphological analysis to words, with the goal of removing only inflectional endings and returning the base or dictionary form of a word, known as the lemma.
• `stopwords.words('english')` lets us look at the list of all English stop words supported by NLTK.
• The `remove_unused_c()` function is used to remove the unused columns.
• We impute null values with `None` using the `null_process()` function.
• Inside the `clean_dataset()` function, we call the `remove_unused_c()` and `null_process()` functions. This function is responsible for cleaning the data.
• To clean text of unused characters, we created the `clean_text()` function.
• For preprocessing, we use stop word removal (together with lemmatization, as in the code above). We created the `nltk_preprocess()` function for that purpose.

Preprocessing the `text` and `title`:

``````# Perform data cleaning on train and test dataset by calling clean_dataset function
df = clean_dataset(news_d)
# apply preprocessing on text through apply method by calling the function nltk_preprocess
df["text"] = df.text.apply(nltk_preprocess)
# apply preprocessing on title through apply method by calling the function nltk_preprocess
df["title"] = df.title.apply(nltk_preprocess)``````
``````# Dataset after cleaning and preprocessing step
df.head()``````

Output:

``````title	text	label
0	house dem aide didnt even see comeys letter ja...	house dem aide didnt even see comeys letter ja...	1
1	flynn hillary clinton big woman campus breitbart	ever get feeling life circle roundabout rather...	0
2	truth might get fired	truth might get fired october 29 2016 tension ...	1
3	15 civilian killed single u airstrike identified	video 15 civilian killed single u airstrike id...	1
4	iranian woman jailed fictional unpublished sto...	print iranian woman sentenced six year prison ...	1``````

In this section, we will perform:

• Univariate analysis: a statistical analysis of the text. We will use word clouds for this purpose. A word cloud is a visualization approach for text data in which the most common term is shown in the largest font size.
• Bivariate analysis: bigrams and trigrams will be used here. According to Wikipedia: "an n-gram is a contiguous sequence of n items from a given sample of text or speech. Depending on the application, the items can be phonemes, syllables, letters, words, or base pairs. The n-grams are typically collected from a text or speech corpus".

Single-Word Cloud

The most frequent words appear in bold and in a larger font in a word cloud. This section builds a word cloud of all the words in the dataset.

The `WordCloud()` class from the wordcloud library is used, and its `generate()` method produces the word cloud image:

``````from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# initialize the word cloud
wordcloud = WordCloud( background_color='black', width=800, height=600)
# generate the word cloud by passing the corpus
text_cloud = wordcloud.generate(' '.join(df['text']))
# plotting the word cloud
plt.figure(figsize=(20,30))
plt.imshow(text_cloud)
plt.axis('off')
plt.show()``````

Output:

Word cloud for reliable news only:

``````true_n = ' '.join(df[df['label']==0]['text'])
wc = wordcloud.generate(true_n)
plt.figure(figsize=(20,30))
plt.imshow(wc)
plt.axis('off')
plt.show()``````

Output:

Word cloud for fake news only:

``````fake_n = ' '.join(df[df['label']==1]['text'])
wc= wordcloud.generate(fake_n)
plt.figure(figsize=(20,30))
plt.imshow(wc)
plt.axis('off')
plt.show()``````

Output:

Most Frequent Bigram (Two-Word Combination)

An N-gram is a sequence of letters or words. A character unigram consists of a single character, while a bigram comprises a series of two characters. Similarly, word N-grams consist of a series of n words. The word "united" is a 1-gram (unigram), the combination of the words "united state" is a 2-gram (bigram), and "new york city" is a 3-gram.

Let's plot the most common bigrams in the reliable news:

``````def plot_top_ngrams(corpus, title, ylabel, xlabel="Number of Occurences", n=2):
    """Utility function to plot top n-grams"""
    true_b = (pd.Series(nltk.ngrams(corpus.split(), n)).value_counts())[:20]
    true_b.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
    plt.title(title)
    plt.ylabel(ylabel)
    plt.xlabel(xlabel)
    plt.show()``````
``plot_top_ngrams(true_n, 'Top 20 Frequently Occuring True news Bigrams', "Bigram", n=2)``

The most common bigrams in the fake news:

``plot_top_ngrams(fake_n, 'Top 20 Frequently Occuring Fake news Bigrams', "Bigram", n=2)``

Most Frequent Trigram (Three-Word Combination)

The most common trigrams in the reliable news:

``plot_top_ngrams(true_n, 'Top 20 Frequently Occuring True news Trigrams', "Trigrams", n=3)``

Now for the fake news:

``plot_top_ngrams(fake_n, 'Top 20 Frequently Occuring Fake news Trigrams', "Trigrams", n=3)``

The above plots give us some idea of what both classes look like. In the next section, we will use the Transformers library to build a fake news detector.

Building a Classifier by Fine-Tuning BERT

This section borrows code extensively from the fine-tuning BERT tutorial to make a fake news classifier using the Transformers library. For more detailed information, you can head to the original tutorial.

If you haven't installed transformers yet, you have to:

``\$ pip install transformers``

Let's import the necessary libraries:

``````import torch
from transformers.file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
from sklearn.model_selection import train_test_split

import random``````

We want to make our results reproducible even if we restart our environment:

``````def set_seed(seed: int):
    """
    Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch`` and/or ``tf`` (if
    installed).

    Args:
        seed (:obj:`int`): The seed to set.
    """
    random.seed(seed)
    np.random.seed(seed)
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # ^^ safe to call this function even if cuda is not available
    if is_tf_available():
        import tensorflow as tf

        tf.random.set_seed(seed)

set_seed(1)``````

The model we are going to use is `bert-base-uncased`:

``````# the model we gonna train, base uncased BERT
# check text classification models here: https://huggingface.co/models?filter=text-classification
model_name = "bert-base-uncased"
# max sequence length for each document/sentence sample
max_length = 512``````

``````# load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)``````

Let's now clean the `NaN` values from the `text`, `author`, and `title` columns:

``````news_df = news_d[news_d['text'].notna()]
news_df = news_df[news_df["author"].notna()]
news_df = news_df[news_df["title"].notna()]``````

Next, let's make a function that takes the dataset as a Pandas dataframe and returns the train/validation splits of texts and labels as lists:

``````def prepare_data(df, test_size=0.2, include_title=True, include_author=True):
    texts = []
    labels = []
    for i in range(len(df)):
        text = df["text"].iloc[i]
        label = df["label"].iloc[i]
        if include_title:
            text = df["title"].iloc[i] + " - " + text
        if include_author:
            text = df["author"].iloc[i] + " : " + text
        if text and label in [0, 1]:
            texts.append(text)
            labels.append(label)
    return train_test_split(texts, labels, test_size=test_size)

train_texts, valid_texts, train_labels, valid_labels = prepare_data(news_df)``````

The above function takes the dataset as a dataframe and returns it as lists split into training and validation sets. Setting `include_title` to `True` means we add the `title` column to the `text` we are going to use for training, and setting `include_author` to `True` means we also prepend the `author` column to the text.

Let's make sure the labels and texts have the same length:

``````print(len(train_texts), len(train_labels))
print(len(valid_texts), len(valid_labels))``````

Output:

``````14628 14628
3657 3657``````

Let's use the BERT tokenizer to tokenize our dataset:

``````# tokenize the dataset, truncate when passed `max_length`,
# and pad with 0's when less than `max_length`
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)``````

Converting the encodings into a PyTorch dataset:

``````class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

# convert our tokenized data into a torch Dataset
train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)``````

Loading and Fine-Tuning the Model

We will use `BertForSequenceClassification` to load our BERT transformer model:

``````# load the model
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)``````

We set `num_labels` to 2 since it is a binary classification. The function below is a callback to compute the accuracy on each validation step:

``````from sklearn.metrics import accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # calculate accuracy using sklearn's function
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
    }``````

Let's initialize the training parameters:

``````training_args = TrainingArguments(
output_dir='./results',          # output directory
num_train_epochs=1,              # total number of training epochs
per_device_train_batch_size=10,  # batch size per device during training
per_device_eval_batch_size=20,   # batch size for evaluation
warmup_steps=100,                # number of warmup steps for learning rate scheduler
logging_dir='./logs',            # directory for storing logs
load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
# but you can specify `metric_for_best_model` argument to change to accuracy or other metric
logging_steps=200,               # log & save weights each logging_steps
save_steps=200,
evaluation_strategy="steps",     # evaluate each `logging_steps`
)``````

I've set `per_device_train_batch_size` to 10, but you should set it as high as your GPU can fit. We set `logging_steps` and `save_steps` to 200, meaning we perform evaluation and save the model weights every 200 training steps.

You can check this page for more detailed information about the available training parameters.

``````trainer = Trainer(
model=model,                         # the instantiated Transformers model to be trained
args=training_args,                  # training arguments, defined above
train_dataset=train_dataset,         # training dataset
eval_dataset=valid_dataset,          # evaluation dataset
compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)``````

Training the model:

``````# train the model
trainer.train()``````

Training takes a few hours to finish, depending on your GPU. On the free version of Colab, it should take about an hour with an NVIDIA Tesla K80. Here is the output:

``````***** Running training *****
Num examples = 14628
Num Epochs = 1
Instantaneous batch size per device = 10
Total train batch size (w. parallel, distributed & accumulation) = 10
Total optimization steps = 1463
[1463/1463 41:07, Epoch 1/1]
Step	Training Loss	Validation Loss	Accuracy
200		0.250800		0.100533		0.983867
400		0.027600		0.043009		0.993437
600		0.023400		0.017812		0.997539
800		0.014900		0.030269		0.994258
1000	0.022400		0.012961		0.998086
1200	0.009800		0.010561		0.998633
1400	0.007700		0.010300		0.998633
***** Running Evaluation *****
Num examples = 3657
Batch size = 20
Saving model checkpoint to ./results/checkpoint-200
Configuration saved in ./results/checkpoint-200/config.json
Model weights saved in ./results/checkpoint-200/pytorch_model.bin
<SNIPPED>
***** Running Evaluation *****
Num examples = 3657
Batch size = 20
Saving model checkpoint to ./results/checkpoint-1400
Configuration saved in ./results/checkpoint-1400/config.json
Model weights saved in ./results/checkpoint-1400/pytorch_model.bin

Training completed. Do not forget to share your model on huggingface.co/models =)

TrainOutput(global_step=1463, training_loss=0.04888018785440506, metrics={'train_runtime': 2469.1722, 'train_samples_per_second': 5.924, 'train_steps_per_second': 0.593, 'total_flos': 3848788517806080.0, 'train_loss': 0.04888018785440506, 'epoch': 1.0})``````

Model Evaluation

Since `load_best_model_at_end` is set to `True`, the best weights will be loaded when training is finished. Let's evaluate it on our validation set:

``````# evaluate the current model after training
trainer.evaluate()``````

Output:

``````***** Running Evaluation *****
Num examples = 3657
Batch size = 20
[183/183 02:11]
{'epoch': 1.0,
'eval_accuracy': 0.998632759092152,
'eval_loss': 0.010299865156412125,
'eval_runtime': 132.0374,
'eval_samples_per_second': 27.697,
'eval_steps_per_second': 1.386}``````

Saving the model and the tokenizer:

``````# saving the fine tuned model & tokenizer
model_path = "fake-news-bert-base-uncased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)``````

A new folder containing the model configuration and weights will appear after running the cell above. If you want to perform prediction later, you simply use the `from_pretrained()` method we used when we loaded the model, and you are good to go.
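As a quick sketch of what loading it back looks like later (assuming the folder saved above and a CUDA device):

``````# reload the fine-tuned model and tokenizer from the saved folder
from transformers import BertTokenizerFast, BertForSequenceClassification

model_path = "fake-news-bert-base-uncased"
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path).to("cuda")``````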

Next, let's make a function that accepts the article text as an argument and returns whether it is fake or not:

``````def get_prediction(text, convert_to_label=False):
    # prepare our text into tokenized sequence
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to("cuda")
    # perform inference to our model
    outputs = model(**inputs)
    # get output probabilities by doing softmax
    probs = outputs[0].softmax(1)
    # executing argmax function to get the candidate label
    d = {
        0: "reliable",
        1: "fake"
    }
    if convert_to_label:
        return d[int(probs.argmax())]
    else:
        return int(probs.argmax())``````

I took an example from `test.csv`, which the model has never seen, to perform inference on. I checked it, and it is a real article from The New York Times:

``````real_news = """
Tim Tebow Will Attempt Another Comeback, This Time in Baseball - The New York Times",Daniel Victor,"If at first you don’t succeed, try a different sport. Tim Tebow, who was a Heisman   quarterback at the University of Florida but was unable to hold an N. F. L. job, is pursuing a career in Major League Baseball. <SNIPPED>
"""``````

The original text is in the Colab environment in case you want to copy it, as it is a complete article. Let's pass it to the model and see the results:

``get_prediction(real_news, convert_to_label=True)``

Output:

``reliable``

Appendix: Creating a Submission File for Kaggle

In this section, we will predict all the articles in `test.csv` to create a submission file and see our accuracy on the test set of the Kaggle competition:

``````# read the test set
test_df = pd.read_csv("test.csv")
# make a copy of the testing set
new_df = test_df.copy()
# add a new column that contains the author, title and article content
new_df["new_text"] = new_df["author"].astype(str) + " : " + new_df["title"].astype(str) + " - " + new_df["text"].astype(str)
# get the prediction of all the test set
new_df["label"] = new_df["new_text"].apply(get_prediction)
# make the submission file
final_df = new_df[["id", "label"]]
final_df.to_csv("submit_final.csv", index=False)``````

After concatenating the author, title, and article text, we apply the `get_prediction()` function to the new column to fill the `label` column, and then use the `to_csv()` method to create the submission file for Kaggle. Here is my submission score:

We got 99.78% and 100% accuracy on the private and public leaderboards, respectively. Awesome!

Conclusion

Alright, we are done with the tutorial. You can check this page to see the various training parameters you can tweak.

If you have a custom fake news dataset for fine-tuning, you just need to pass a list of samples to the tokenizer as we did; you won't need to change any other code after that.
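A minimal sketch of that, assuming hypothetical `my_texts` and `my_labels` lists of your own:

``````# hypothetical custom data: a list of article strings and their 0/1 labels
my_texts = ["first article body ...", "second article body ...", "third article body ...", "fourth article body ..."]
my_labels = [0, 1, 0, 1]

# split, tokenize and wrap exactly as we did for the Kaggle data
tr_texts, va_texts, tr_labels, va_labels = train_test_split(my_texts, my_labels, test_size=0.25)
tr_encodings = tokenizer(tr_texts, truncation=True, padding=True, max_length=max_length)
va_encodings = tokenizer(va_texts, truncation=True, padding=True, max_length=max_length)
tr_dataset = NewsGroupsDataset(tr_encodings, tr_labels)
va_dataset = NewsGroupsDataset(va_encodings, va_labels)``````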

Check the full code here, or the Colab environment here.

1646051476

Détection de fausses nouvelles en Python

Explorer l'ensemble de données de fausses nouvelles, effectuer une analyse de données telles que des nuages de mots et des ngrams, et affiner le transformateur BERT pour créer un détecteur de fausses nouvelles en Python à l'aide de la bibliothèque de transformateurs.

Les fausses nouvelles sont la diffusion intentionnelle d'allégations fausses ou trompeuses en tant que nouvelles, où les déclarations sont délibérément mensongères.

Les journaux, les tabloïds et les magazines ont été supplantés par les plateformes d'actualités numériques, les blogs, les flux de médias sociaux et une pléthore d'applications d'actualités mobiles. Les organes de presse ont profité de l'utilisation accrue des médias sociaux et des plates-formes mobiles en fournissant aux abonnés des informations de dernière minute.

Les consommateurs ont désormais un accès instantané aux dernières nouvelles. Ces plateformes de médias numériques ont gagné en importance en raison de leur connectivité facile au reste du monde et permettent aux utilisateurs de discuter et de partager des idées et de débattre de sujets tels que la démocratie, l'éducation, la santé, la recherche et l'histoire. Les fausses informations sur les plateformes numériques deviennent de plus en plus populaires et sont utilisées à des fins lucratives, telles que des gains politiques et financiers.

Quelle est la taille de ce problème ?

Parce qu'Internet, les médias sociaux et les plateformes numériques sont largement utilisés, n'importe qui peut propager des informations inexactes et biaisées. Il est presque impossible d'empêcher la diffusion de fausses nouvelles. Il y a une énorme augmentation de la diffusion de fausses nouvelles, qui ne se limite pas à un secteur comme la politique, mais comprend le sport, la santé, l'histoire, le divertissement, la science et la recherche.

La solution

Il est essentiel de reconnaître et de différencier les informations fausses des informations exactes. Une méthode consiste à demander à un expert de décider et de vérifier chaque élément d'information, mais cela prend du temps et nécessite une expertise qui ne peut être partagée. Deuxièmement, nous pouvons utiliser des outils d'apprentissage automatique et d'intelligence artificielle pour automatiser l'identification des fausses nouvelles.

Les informations d'actualité en ligne incluent diverses données de format non structuré (telles que des documents, des vidéos et de l'audio), mais nous nous concentrerons ici sur les informations au format texte. Avec les progrès de l'apprentissage automatique et du traitement automatique du langage naturel , nous pouvons désormais reconnaître le caractère trompeur et faux d'un article ou d'une déclaration.

Plusieurs études et expérimentations sont menées pour détecter les fake news sur tous les supports.

Notre objectif principal de ce tutoriel est :

• Explorez et analysez l'ensemble de données Fake News.
• Construisez un classificateur capable de distinguer les fausses nouvelles avec autant de précision que possible.

Voici la table des matières :

• introduction
• Quelle est la taille de ce problème ?
• La solution
• Exploration des données
• Répartition des cours
• Nettoyage des données pour l'analyse
• Analyse exploratoire des données
• Nuage à un seul mot
• Bigramme le plus fréquent (combinaison de deux mots)
• Trigramme le plus fréquent (combinaison de trois mots)
• Construire un classificateur en affinant le BERT
• Préparation des données
• Tokénisation de l'ensemble de données
• Chargement et réglage fin du modèle
• Évaluation du modèle
• Annexe : Création d'un fichier de soumission pour Kaggle
• Conclusion

Exploration des données

Dans ce travail, nous avons utilisé l'ensemble de données sur les fausses nouvelles de Kaggle pour classer les articles d'actualité non fiables comme fausses nouvelles. Nous disposons d'un jeu de données d'entraînement complet contenant les caractéristiques suivantes :

• `id`: identifiant unique pour un article de presse
• `title`: titre d'un article de presse
• `author`: auteur de l'article de presse
• `text`: texte de l'article ; pourrait être incomplet
• `label`: une étiquette qui marque l'article comme potentiellement non fiable, notée 1 (non fiable ou faux) ou 0 (fiable).

Il s'agit d'un problème de classification binaire dans lequel nous devons prédire si une nouvelle particulière est fiable ou non.

Si vous avez un compte Kaggle, vous pouvez simplement télécharger l'ensemble de données à partir du site Web et extraire le fichier ZIP.

J'ai également téléchargé l'ensemble de données dans Google Drive, et vous pouvez l'obtenir ici , ou utiliser la `gdown`bibliothèque pour le télécharger automatiquement dans les blocs-notes Google Colab ou Jupyter :

``\$ pip install gdown``
``````# download from Google Drive``````
``````Downloading...
To: /content/fake-news.zip
100% 48.7M/48.7M [00:00<00:00, 74.6MB/s]``````

Décompressez les fichiers :

``\$ unzip fake-news.zip``

Trois fichiers apparaîtront dans le répertoire de travail actuel : `train.csv`, `test.csv`, et `submit.csv`, que nous utiliserons `train.csv`dans la majeure partie du didacticiel.

Installation des dépendances requises :

``\$ pip install transformers nltk pandas numpy matplotlib seaborn wordcloud``

Remarque : Si vous êtes dans un environnement local, assurez-vous d'installer PyTorch pour GPU, rendez-vous sur cette page pour une installation correcte.

Importons les bibliothèques essentielles pour l'analyse :

``````import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns``````

Les corpus et modules NLTK doivent être installés à l'aide du téléchargeur NLTK standard :

``````import nltk
# download the NLTK corpora used below (stop words and WordNet for lemmatization)
nltk.download('stopwords')
nltk.download('wordnet')``````

L'ensemble de données sur les fausses nouvelles comprend les titres et le texte d'articles originaux et fictifs de divers auteurs. Importons notre jeu de données :

``````# load the dataset
news_d = pd.read_csv("train.csv")``````
``````print("Shape of News data:", news_d.shape)
print("News data columns", news_d.columns)``````

Sortir:

`````` Shape of News data: (20800, 5)
News data columns Index(['id', 'title', 'author', 'text', 'label'], dtype='object')``````

Voici à quoi ressemble l'ensemble de données :

``````# by using df.head(), we can immediately familiarize ourselves with the dataset.
news_d.head()``````

Sortir:

``````id	title	author	text	label
0	0	House Dem Aide: We Didn’t Even See Comey’s Let...	Darrell Lucus	House Dem Aide: We Didn’t Even See Comey’s Let...	1
1	1	FLYNN: Hillary Clinton, Big Woman on Campus - ...	Daniel J. Flynn	Ever get the feeling your life circles the rou...	0
2	2	Why the Truth Might Get You Fired	Consortiumnews.com	Why the Truth Might Get You Fired October 29, ...	1
3	3	15 Civilians Killed In Single US Airstrike Hav...	Jessica Purkiss	Videos 15 Civilians Killed In Single US Airstr...	1
4	4	Iranian woman jailed for fictional unpublished...	Howard Portnoy	Print \nAn Iranian woman has been sentenced to...	1``````

Nous avons 20 800 lignes, qui ont cinq colonnes. Voyons quelques statistiques de la `text`colonne :

``````# Text word statistics: min, mean, max and interquartile range

txt_length = news_d.text.str.split().str.len()
txt_length.describe()``````

Sortir:

``````count    20761.000000
mean       760.308126
std        869.525988
min          0.000000
25%        269.000000
50%        556.000000
75%       1052.000000
max      24234.000000
Name: text, dtype: float64``````

Statistiques pour la `title`colonne :

``````#Title statistics

title_length = news_d.title.str.split().str.len()
title_length.describe()``````

Sortir:

``````count    20242.000000
mean        12.420709
std          4.098735
min          1.000000
25%         10.000000
50%         13.000000
75%         15.000000
max         72.000000
Name: title, dtype: float64``````

Les statistiques pour les ensembles d'entraînement et de test sont les suivantes :

• L' `text`attribut a un nombre de mots plus élevé avec une moyenne de 760 mots et 75% ayant plus de 1000 mots.
• L' `title`attribut est une courte déclaration avec une moyenne de 12 mots, et 75% d'entre eux sont d'environ 15 mots.

Notre expérience porterait à la fois sur le texte et le titre.

Répartition des cours

Compter les parcelles pour les deux étiquettes :

``````sns.countplot(x="label", data=news_d);
print("1: Unreliable")
print("0: Reliable")
print("Distribution of labels:")
print(news_d.label.value_counts());``````

Sortir:

``````1: Unreliable
0: Reliable
Distribution of labels:
1    10413
0    10387
Name: label, dtype: int64``````

``print(round(news_d.label.value_counts(normalize=True),2)*100);``

Sortir:

``````1    50.0
0    50.0
Name: label, dtype: float64``````

Le nombre d'articles non fiables (faux ou 1) est de 10413, tandis que le nombre d'articles dignes de confiance (fiables ou 0) est de 10387. Près de 50% des articles sont faux. Par conséquent, la métrique de précision mesurera la performance de notre modèle lors de la construction d'un classificateur.

Nettoyage des données pour l'analyse

Dans cette section, nous allons nettoyer notre ensemble de données pour effectuer une analyse :

• Supprimez les lignes et les colonnes inutilisées.
• Effectuez une imputation de valeur nulle.
• Supprimer les caractères spéciaux.
• Supprimez les mots vides.
``````# Constants that are used to sanitize the datasets

column_n = ['id', 'title', 'author', 'text', 'label']
remove_c = ['id','author']
categorical_features = []
target_col = ['label']
text_f = ['title', 'text']``````
``````# Clean Datasets
import nltk
from nltk.corpus import stopwords
import re
from nltk.stem.porter import PorterStemmer
from collections import Counter

ps = PorterStemmer()
wnl = nltk.stem.WordNetLemmatizer()

stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)

# Remove unused columns
def remove_unused_c(df, column_n=remove_c):
    df = df.drop(column_n, axis=1)
    return df

# Impute null values with "None"
def null_process(feature_df):
    for col in text_f:
        feature_df.loc[feature_df[col].isnull(), col] = "None"
    return feature_df

def clean_dataset(df):
    # remove unused columns
    df = remove_unused_c(df)
    # impute null values
    df = null_process(df)
    return df

# Cleaning text from unused characters
def clean_text(text):
    text = str(text).replace(r'http[\w:/\.]+', ' ')  # removing urls
    text = str(text).replace(r'[^\.\w\s]', ' ')  # remove everything but characters and punctuation
    text = str(text).replace('[^a-zA-Z]', ' ')
    text = str(text).replace(r'\s\s+', ' ')
    text = text.lower().strip()
    #text = ' '.join(text)
    return text

## NLTK preprocessing includes:
# stop word removal, stemming, and lemmatization
# for our project we use stop word removal and lemmatization
def nltk_preprocess(text):
    text = clean_text(text)
    wordlist = re.sub(r'[^\w\s]', '', text).split()
    #text = ' '.join([word for word in wordlist if word not in stopwords_dict])
    #text = [ps.stem(word) for word in wordlist if not word in stopwords_dict]
    text = ' '.join([wnl.lemmatize(word) for word in wordlist if word not in stopwords_dict])
    return text``````

Dans le bloc de code ci-dessus :

• Nous avons importé NLTK, qui est une plate-forme célèbre pour développer des applications Python qui interagissent avec le langage humain. Ensuite, nous importons `re`pour regex.
• Nous importons des mots vides à partir de `nltk.corpus`. Lorsque nous travaillons avec des mots, en particulier lorsque nous considérons la sémantique, nous devons parfois éliminer les mots courants qui n'ajoutent aucune signification significative à une déclaration, tels que `"but"`, `"can"`, `"we"`, etc.
• `PorterStemmer`est utilisé pour effectuer des mots radicaux avec NLTK. Les radicaux dépouillent les mots de leurs affixes morphologiques, laissant uniquement le radical du mot.
• Nous importons `WordNetLemmatizer()`de la bibliothèque NLTK pour la lemmatisation. La lemmatisation est bien plus efficace que la radicalisation . Il va au-delà de la réduction des mots et évalue l'ensemble du lexique d'une langue pour appliquer une analyse morphologique aux mots, dans le but de supprimer simplement les extrémités flexionnelles et de renvoyer la forme de base ou de dictionnaire d'un mot, connue sous le nom de lemme.
• `stopwords.words('english')`permettez-nous de regarder la liste de tous les mots vides en anglais pris en charge par NLTK.
• `remove_unused_c()`La fonction est utilisée pour supprimer les colonnes inutilisées.
• Nous imputons des valeurs nulles à `None`l'aide de la `null_process()`fonction.
• A l'intérieur de la fonction `clean_dataset()`, nous appelons `remove_unused_c()`et `null_process()`fonctions. Cette fonction est responsable du nettoyage des données.
• Pour nettoyer le texte des caractères inutilisés, nous avons créé la `clean_text()`fonction.
• Pour le prétraitement, nous n'utiliserons que la suppression des mots vides. Nous avons créé la `nltk_preprocess()`fonction à cet effet.
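As a quick sanity check, you can run the pipeline on a single made-up string (a hypothetical sentence, not from the dataset) and inspect the result:

``````# run the full cleaning + stop-word removal + lemmatization pipeline on one sample string
sample = "Breaking News: The U.S. stocks were rallying on Monday, see http://example.com for details!"
print(nltk_preprocess(sample))``````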

Prétraitement de `text`et `title`:

``````# Perform data cleaning on train and test dataset by calling clean_dataset function
df = clean_dataset(news_d)
# apply preprocessing on text through apply method by calling the function nltk_preprocess
df["text"] = df.text.apply(nltk_preprocess)
# apply preprocessing on title through apply method by calling the function nltk_preprocess
df["title"] = df.title.apply(nltk_preprocess)``````
``````# Dataset after cleaning and preprocessing step
df.head()``````

Sortir:

``````title	text	label
0	house dem aide didnt even see comeys letter ja...	house dem aide didnt even see comeys letter ja...	1
1	flynn hillary clinton big woman campus breitbart	ever get feeling life circle roundabout rather...	0
2	truth might get fired	truth might get fired october 29 2016 tension ...	1
3	15 civilian killed single u airstrike identified	video 15 civilian killed single u airstrike id...	1
4	iranian woman jailed fictional unpublished sto...	print iranian woman sentenced six year prison ...	1``````

Analyse exploratoire des données

Dans cette section, nous effectuerons :

• Analyse Univariée : C'est une analyse statistique du texte. Nous utiliserons un nuage de mots à cette fin. Un nuage de mots est une approche de visualisation des données textuelles où le terme le plus courant est présenté dans la taille de police la plus importante.
• Analyse Bivariée : Bigramme et Trigramme seront utilisés ici. Selon Wikipedia : " un n-gramme est une séquence contiguë de n éléments d'un échantillon donné de texte ou de parole. Selon l'application, les éléments peuvent être des phonèmes, des syllabes, des lettres, des mots ou des paires de bases. Les n-grammes sont généralement collectées à partir d'un corpus textuel ou vocal ».

Nuage à un seul mot

Les mots les plus fréquents apparaissent en caractères gras et plus gros dans un nuage de mots. Cette section effectuera un nuage de mots pour tous les mots du jeu de données.

The `WordCloud()` function from the wordcloud library will be used, and `generate()` is used to generate the word cloud image:

``````from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# initialize the word cloud
wordcloud = WordCloud( background_color='black', width=800, height=600)
# generate the word cloud by passing the corpus
text_cloud = wordcloud.generate(' '.join(df['text']))
# plotting the word cloud
plt.figure(figsize=(20,30))
plt.imshow(text_cloud)
plt.axis('off')
plt.show()``````

Sortir:
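A small side note: the `STOPWORDS` set imported above is never actually used; if you want the cloud to also skip common English words, `WordCloud` accepts it through its `stopwords` parameter (an optional tweak, not in the original code):

``````# optional tweak: exclude wordcloud's built-in English stop words from the cloud
wordcloud = WordCloud(background_color='black', width=800, height=600, stopwords=STOPWORDS)
text_cloud = wordcloud.generate(' '.join(df['text']))``````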

Nuage de mots pour les informations fiables uniquement :

``````true_n = ' '.join(df[df['label']==0]['text'])
wc = wordcloud.generate(true_n)
plt.figure(figsize=(20,30))
plt.imshow(wc)
plt.axis('off')
plt.show()``````

Sortir:

Nuage de mots pour les fake news uniquement :

``````fake_n = ' '.join(df[df['label']==1]['text'])
wc= wordcloud.generate(fake_n)
plt.figure(figsize=(20,30))
plt.imshow(wc)
plt.axis('off')
plt.show()``````

Sortir:

Bigramme le plus fréquent (combinaison de deux mots)

Un N-gramme est une séquence de lettres ou de mots. Un unigramme de caractère est composé d'un seul caractère, tandis qu'un bigramme est composé d'une série de deux caractères. De même, les N-grammes de mots sont constitués d'une suite de n mots. Le mot "uni" est un 1-gramme (unigramme). La combinaison des mots "États-Unis" est un 2-gramme (bigramme), "new york city" est un 3-gramme.
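To make the definition concrete, here is what `nltk.ngrams` returns on a toy sentence (an illustrative example, not from the dataset):

``````# word bigrams of a toy sentence
import nltk
list(nltk.ngrams("new york city is big".split(), 2))
# [('new', 'york'), ('york', 'city'), ('city', 'is'), ('is', 'big')]``````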

Traçons le bigramme le plus courant sur les nouvelles fiables :

``````def plot_top_ngrams(corpus, title, ylabel, xlabel="Number of Occurences", n=2):
    """Utility function to plot top n-grams"""
    true_b = (pd.Series(nltk.ngrams(corpus.split(), n)).value_counts())[:20]
    true_b.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
    plt.title(title)
    plt.ylabel(ylabel)
    plt.xlabel(xlabel)
    plt.show()``````
``plot_top_ngrams(true_n, 'Top 20 Frequently Occuring True news Bigrams', "Bigram", n=2)``

Le bigramme le plus courant sur les fake news :

``plot_top_ngrams(fake_n, 'Top 20 Frequently Occuring Fake news Bigrams', "Bigram", n=2)``

Trigramme le plus fréquent (combinaison de trois mots)

Le trigramme le plus courant sur les informations fiables :

``plot_top_ngrams(true_n, 'Top 20 Frequently Occuring True news Trigrams', "Trigrams", n=3)``

Pour les fausses nouvelles maintenant :

``plot_top_ngrams(fake_n, 'Top 20 Frequently Occuring Fake news Trigrams', "Trigrams", n=3)``

Les tracés ci-dessus nous donnent quelques idées sur l'apparence des deux classes. Dans la section suivante, nous utiliserons la bibliothèque de transformateurs pour créer un détecteur de fausses nouvelles.

Construire un classificateur en affinant le BERT

Cette section récupèrera largement le code du tutoriel de réglage fin du BERT pour créer un classificateur de fausses nouvelles à l'aide de la bibliothèque de transformateurs. Ainsi, pour des informations plus détaillées, vous pouvez vous diriger vers le tutoriel d'origine .

Si vous n'avez pas installé de transformateurs, vous devez :

``\$ pip install transformers``

Importons les bibliothèques nécessaires :

``````import torch
from transformers.file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
from sklearn.model_selection import train_test_split

import random``````

Nous voulons rendre nos résultats reproductibles même si nous redémarrons notre environnement :

``````def set_seed(seed: int):
    """
    Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch`` and/or ``tf`` (if
    installed).

    Args:
        seed (:obj:`int`): The seed to set.
    """
    random.seed(seed)
    np.random.seed(seed)
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # ^^ safe to call this function even if cuda is not available
    if is_tf_available():
        import tensorflow as tf

        tf.random.set_seed(seed)

set_seed(1)``````

Le modèle que nous allons utiliser est le `bert-base-uncased`:

``````# the model we gonna train, base uncased BERT
# check text classification models here: https://huggingface.co/models?filter=text-classification
model_name = "bert-base-uncased"
# max sequence length for each document/sentence sample
max_length = 512``````

Chargement du tokenizer :

``````# load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)``````

Préparation des données

Now let's drop the `NaN` values from the `text`, `author`, and `title` columns:

``````news_df = news_d[news_d['text'].notna()]
news_df = news_df[news_df["author"].notna()]
news_df = news_df[news_df["title"].notna()]``````

Ensuite, créez une fonction qui prend l'ensemble de données en tant que dataframe Pandas et renvoie les fractionnements de train/validation des textes et des étiquettes sous forme de listes :

``````def prepare_data(df, test_size=0.2, include_title=True, include_author=True):
    texts = []
    labels = []
    for i in range(len(df)):
        text = df["text"].iloc[i]
        label = df["label"].iloc[i]
        if include_title:
            text = df["title"].iloc[i] + " - " + text
        if include_author:
            text = df["author"].iloc[i] + " : " + text
        if text and label in [0, 1]:
            texts.append(text)
            labels.append(label)
    return train_test_split(texts, labels, test_size=test_size)

train_texts, valid_texts, train_labels, valid_labels = prepare_data(news_df)``````

The function above takes the dataset as a dataframe and returns the texts and labels split into training and validation sets as lists. Setting `include_title` to `True` means we add the `title` column to the `text` we are going to use for training, and setting `include_author` to `True` means we also add the `author` to the text.

Assurons-nous que les étiquettes et les textes ont la même longueur :

``````print(len(train_texts), len(train_labels))
print(len(valid_texts), len(valid_labels))``````

Sortir:

``````14628 14628
3657 3657``````

Tokénisation de l'ensemble de données

Utilisons le tokenizer BERT pour tokeniser notre jeu de données :

``````# tokenize the dataset, truncate when passed `max_length`,
# and pad with 0's when less than `max_length`
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)``````

Conversion des encodages en un jeu de données PyTorch :

``````class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

# convert our tokenized data into a torch Dataset
train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)``````

Chargement et réglage fin du modèle

Nous utiliserons `BertForSequenceClassification`pour charger notre modèle de transformateur BERT :

``````# load the model
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)``````

Nous avons mis `num_labels`à 2 puisqu'il s'agit d'une classification binaire. La fonction ci-dessous est un rappel pour calculer la précision à chaque étape de validation :

``````from sklearn.metrics import accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # calculate accuracy using sklearn's function
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
    }``````

Initialisons les paramètres d'entraînement :

``````training_args = TrainingArguments(
output_dir='./results',          # output directory
num_train_epochs=1,              # total number of training epochs
per_device_train_batch_size=10,  # batch size per device during training
per_device_eval_batch_size=20,   # batch size for evaluation
warmup_steps=100,                # number of warmup steps for learning rate scheduler
logging_dir='./logs',            # directory for storing logs
load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
# but you can specify `metric_for_best_model` argument to change to accuracy or other metric
logging_steps=200,               # log & save weights each logging_steps
save_steps=200,
evaluation_strategy="steps",     # evaluate each `logging_steps`
)``````

I set `per_device_train_batch_size` to 10, but you should set it as high as your GPU memory can fit. Setting `logging_steps` and `save_steps` to 200 means we perform evaluation and save the model weights every 200 training steps.

Vous pouvez consulter  cette page  pour des informations plus détaillées sur les paramètres d'entraînement disponibles.

Instancions le formateur :

``````trainer = Trainer(
model=model,                         # the instantiated Transformers model to be trained
args=training_args,                  # training arguments, defined above
train_dataset=train_dataset,         # training dataset
eval_dataset=valid_dataset,          # evaluation dataset
compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)``````

Entraînement du modèle :

``````# train the model
trainer.train()``````

La formation prend quelques heures pour se terminer, en fonction de votre GPU. Si vous êtes sur la version gratuite de Colab, cela devrait prendre une heure avec NVIDIA Tesla K80. Voici la sortie :

``````***** Running training *****
Num examples = 14628
Num Epochs = 1
Instantaneous batch size per device = 10
Total train batch size (w. parallel, distributed & accumulation) = 10
Total optimization steps = 1463
[1463/1463 41:07, Epoch 1/1]
Step	Training Loss	Validation Loss	Accuracy
200		0.250800		0.100533		0.983867
400		0.027600		0.043009		0.993437
600		0.023400		0.017812		0.997539
800		0.014900		0.030269		0.994258
1000	0.022400		0.012961		0.998086
1200	0.009800		0.010561		0.998633
1400	0.007700		0.010300		0.998633
***** Running Evaluation *****
Num examples = 3657
Batch size = 20
Saving model checkpoint to ./results/checkpoint-200
Configuration saved in ./results/checkpoint-200/config.json
Model weights saved in ./results/checkpoint-200/pytorch_model.bin
<SNIPPED>
***** Running Evaluation *****
Num examples = 3657
Batch size = 20
Saving model checkpoint to ./results/checkpoint-1400
Configuration saved in ./results/checkpoint-1400/config.json
Model weights saved in ./results/checkpoint-1400/pytorch_model.bin

Training completed. Do not forget to share your model on huggingface.co/models =)

TrainOutput(global_step=1463, training_loss=0.04888018785440506, metrics={'train_runtime': 2469.1722, 'train_samples_per_second': 5.924, 'train_steps_per_second': 0.593, 'total_flos': 3848788517806080.0, 'train_loss': 0.04888018785440506, 'epoch': 1.0})``````

Évaluation du modèle

Étant donné que `load_best_model_at_end`est réglé sur `True`, les meilleurs poids seront chargés une fois l'entraînement terminé. Évaluons-le avec notre ensemble de validation :

``````# evaluate the current model after training
trainer.evaluate()``````

Sortir:

``````***** Running Evaluation *****
Num examples = 3657
Batch size = 20
[183/183 02:11]
{'epoch': 1.0,
'eval_accuracy': 0.998632759092152,
'eval_loss': 0.010299865156412125,
'eval_runtime': 132.0374,
'eval_samples_per_second': 27.697,
'eval_steps_per_second': 1.386}``````

Enregistrement du modèle et du tokenizer :

``````# saving the fine tuned model & tokenizer
model_path = "fake-news-bert-base-uncased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)``````

A new folder containing the model configuration and weights will appear after running the cell above. If you want to perform prediction later, you simply use the `from_pretrained()` method we used when we loaded the model, and you are good to go.

Ensuite, créons une fonction qui accepte le texte de l'article comme argument et retourne s'il est faux ou non :

``````def get_prediction(text, convert_to_label=False):
    # prepare our text into tokenized sequence
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to("cuda")
    # perform inference to our model
    outputs = model(**inputs)
    # get output probabilities by doing softmax
    probs = outputs[0].softmax(1)
    # executing argmax function to get the candidate label
    d = {
        0: "reliable",
        1: "fake"
    }
    if convert_to_label:
        return d[int(probs.argmax())]
    else:
        return int(probs.argmax())``````

I took an example from `test.csv`, which the model has never seen, to perform inference on. I checked it, and it is a real article from The New York Times:

``````real_news = """
Tim Tebow Will Attempt Another Comeback, This Time in Baseball - The New York Times",Daniel Victor,"If at first you don’t succeed, try a different sport. Tim Tebow, who was a Heisman   quarterback at the University of Florida but was unable to hold an N. F. L. job, is pursuing a career in Major League Baseball. <SNIPPED>
"""``````

Le texte original se trouve dans l'environnement Colab si vous souhaitez le copier, car il s'agit d'un article complet. Passons-le au modèle et voyons les résultats :

``get_prediction(real_news, convert_to_label=True)``

Sortir:

``reliable``

Annexe : Création d'un fichier de soumission pour Kaggle

Dans cette section, nous allons prédire tous les articles dans le `test.csv`pour créer un dossier de soumission pour voir notre justesse dans le jeu de test sur le concours Kaggle :

``````# read the test set
test_df = pd.read_csv("test.csv")
# make a copy of the testing set
new_df = test_df.copy()
# add a new column that contains the author, title and article content
new_df["new_text"] = new_df["author"].astype(str) + " : " + new_df["title"].astype(str) + " - " + new_df["text"].astype(str)
# get the prediction of all the test set
new_df["label"] = new_df["new_text"].apply(get_prediction)
# make the submission file
final_df = new_df[["id", "label"]]
final_df.to_csv("submit_final.csv", index=False)``````

After concatenating the author, title, and article text, we apply the `get_prediction()` function to the new column to fill the `label` column, and then use the `to_csv()` method to create the submission file for Kaggle. Here is my submission score:

Nous avons obtenu une précision de 99,78 % et 100 % sur les classements privés et publics. C'est génial!

Conclusion

Très bien, nous avons terminé avec le tutoriel. Vous pouvez consulter cette page pour voir divers paramètres d'entraînement que vous pouvez modifier.

Si vous avez un ensemble de données de fausses nouvelles personnalisé pour un réglage fin, il vous suffit de transmettre une liste d'échantillons au tokenizer comme nous l'avons fait, vous ne modifierez plus aucun autre code par la suite.

Vérifiez le code complet ici , ou l'environnement Colab ici .

1646055360

Detección de noticias falsas en Python

Explorar el conjunto de datos de noticias falsas, realizar análisis de datos como nubes de palabras y ngramas, y ajustar el transformador BERT para construir un detector de noticias falsas en Python usando la biblioteca de transformadores.

Las noticias falsas son la transmisión intencional de afirmaciones falsas o engañosas como noticias, donde las declaraciones son deliberadamente engañosas.

Los periódicos, tabloides y revistas han sido reemplazados por plataformas de noticias digitales, blogs, fuentes de redes sociales y una plétora de aplicaciones de noticias móviles. Las organizaciones de noticias se beneficiaron del mayor uso de las redes sociales y las plataformas móviles al proporcionar a los suscriptores información actualizada al minuto.

Los consumidores ahora tienen acceso instantáneo a las últimas noticias. Estas plataformas de medios digitales han aumentado en importancia debido a su fácil conexión con el resto del mundo y permiten a los usuarios discutir y compartir ideas y debatir temas como la democracia, la educación, la salud, la investigación y la historia. Las noticias falsas en las plataformas digitales son cada vez más populares y se utilizan con fines de lucro, como ganancias políticas y financieras.

¿Qué tan grande es este problema?

Debido a que Internet, las redes sociales y las plataformas digitales son ampliamente utilizadas, cualquiera puede propagar información inexacta y sesgada. Es casi imposible evitar la difusión de noticias falsas. Hay un aumento tremendo en la distribución de noticias falsas, que no se restringe a un sector como la política sino que incluye deportes, salud, historia, entretenimiento y ciencia e investigación.

La solución

Es vital reconocer y diferenciar entre noticias falsas y veraces. Un método es hacer que un experto decida y verifique cada pieza de información, pero esto lleva tiempo y requiere experiencia que no se puede compartir. En segundo lugar, podemos utilizar herramientas de aprendizaje automático e inteligencia artificial para automatizar la identificación de noticias falsas.

La información de noticias en línea incluye varios datos en formato no estructurado (como documentos, videos y audio), pero aquí nos concentraremos en las noticias en formato de texto. Con el progreso del aprendizaje automático y el procesamiento del lenguaje natural , ahora podemos reconocer el carácter engañoso y falso de un artículo o declaración.

Se están realizando varios estudios y experimentos para detectar noticias falsas en todos los medios.

Nuestro objetivo principal de este tutorial es:

• Explore y analice el conjunto de datos de noticias falsas.
• Cree un clasificador que pueda distinguir noticias falsas con la mayor precisión posible.

Aquí está la tabla de contenido:

• Introducción
• ¿Qué tan grande es este problema?
• La solución
• Exploración de datos
• Distribución de Clases
• Limpieza de datos para análisis
• Análisis exploratorio de datos
• Nube de una sola palabra
• Bigrama más frecuente (combinación de dos palabras)
• Trigrama más frecuente (combinación de tres palabras)
• Creación de un clasificador mediante el ajuste fino de BERT
• Preparación de datos
• Tokenización del conjunto de datos
• Cargar y ajustar el modelo
• Evaluación del modelo
• Apéndice: Creación de un archivo de envío para Kaggle
• Conclusión

Exploración de datos

En este trabajo, utilizamos el conjunto de datos de noticias falsas de Kaggle para clasificar artículos de noticias no confiables como noticias falsas. Disponemos de un completo dataset de entrenamiento que contiene las siguientes características:

• `id`: identificación única para un artículo de noticias
• `title`: título de un artículo periodístico
• `author`: autor de la noticia
• `text`: texto del artículo; podría estar incompleto
• `label`: una etiqueta que marca el artículo como potencialmente no confiable denotado por 1 (poco confiable o falso) o 0 (confiable).

Es un problema de clasificación binaria en el que debemos predecir si una determinada noticia es fiable o no.

Si tiene una cuenta de Kaggle, simplemente puede descargar el conjunto de datos del sitio web y extraer el archivo ZIP.

También cargué el conjunto de datos en Google Drive y puede obtenerlo aquí o usar la `gdown`biblioteca para descargarlo automáticamente en Google Colab o cuadernos de Jupyter:

``\$ pip install gdown``
``````# download from Google Drive``````
``````Downloading...
To: /content/fake-news.zip
100% 48.7M/48.7M [00:00<00:00, 74.6MB/s]``````

Descomprimiendo los archivos:

``\$ unzip fake-news.zip``

Aparecerán tres archivos en el directorio de trabajo actual: `train.csv`, `test.csv`y `submit.csv`, que usaremos `train.csv`en la mayor parte del tutorial.

Instalando las dependencias requeridas:

``\$ pip install transformers nltk pandas numpy matplotlib seaborn wordcloud``

Nota: si se encuentra en un entorno local, asegúrese de instalar PyTorch para GPU, diríjase a esta página para una instalación adecuada.

Importemos las bibliotecas esenciales para el análisis:

``````import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns``````

El corpus y los módulos NLTK deben instalarse mediante el descargador NLTK estándar:

``````import nltk
# download the NLTK corpora used below (stop words and WordNet for lemmatization)
nltk.download('stopwords')
nltk.download('wordnet')``````

El conjunto de datos de noticias falsas comprende títulos y textos de artículos originales y ficticios de varios autores. Importemos nuestro conjunto de datos:

``````# load the dataset
news_d = pd.read_csv("train.csv")``````
``````print("Shape of News data:", news_d.shape)
print("News data columns", news_d.columns)``````

Producción:

`````` Shape of News data: (20800, 5)
News data columns Index(['id', 'title', 'author', 'text', 'label'], dtype='object')``````

Así es como se ve el conjunto de datos:

``````# by using df.head(), we can immediately familiarize ourselves with the dataset.
news_d.head()``````

Producción:

``````id	title	author	text	label
0	0	House Dem Aide: We Didn’t Even See Comey’s Let...	Darrell Lucus	House Dem Aide: We Didn’t Even See Comey’s Let...	1
1	1	FLYNN: Hillary Clinton, Big Woman on Campus - ...	Daniel J. Flynn	Ever get the feeling your life circles the rou...	0
2	2	Why the Truth Might Get You Fired	Consortiumnews.com	Why the Truth Might Get You Fired October 29, ...	1
3	3	15 Civilians Killed In Single US Airstrike Hav...	Jessica Purkiss	Videos 15 Civilians Killed In Single US Airstr...	1
4	4	Iranian woman jailed for fictional unpublished...	Howard Portnoy	Print \nAn Iranian woman has been sentenced to...	1``````

Tenemos 20.800 filas, que tienen cinco columnas. Veamos algunas estadísticas de la `text`columna:

``````# Text word statistics: min, mean, max and interquartile range

txt_length = news_d.text.str.split().str.len()
txt_length.describe()``````

Producción:

``````count    20761.000000
mean       760.308126
std        869.525988
min          0.000000
25%        269.000000
50%        556.000000
75%       1052.000000
max      24234.000000
Name: text, dtype: float64``````

Estadísticas de la `title`columna:

``````#Title statistics

title_length = news_d.title.str.split().str.len()
title_length.describe()``````

Producción:

``````count    20242.000000
mean        12.420709
std          4.098735
min          1.000000
25%         10.000000
50%         13.000000
75%         15.000000
max         72.000000
Name: title, dtype: float64``````

Las estadísticas para los conjuntos de entrenamiento y prueba son las siguientes:

• El `text`atributo tiene un conteo de palabras más alto con un promedio de 760 palabras y un 75% con más de 1000 palabras.
• El `title`atributo es una declaración breve con un promedio de 12 palabras, y el 75% de ellas tiene alrededor de 15 palabras.

Nuestro experimento sería con el texto y el título juntos.

Distribución de Clases

Parcelas de conteo para ambas etiquetas:

``````sns.countplot(x="label", data=news_d);
print("1: Unreliable")
print("0: Reliable")
print("Distribution of labels:")
print(news_d.label.value_counts());``````

Producción:

``````1: Unreliable
0: Reliable
Distribution of labels:
1    10413
0    10387
Name: label, dtype: int64``````

``print(round(news_d.label.value_counts(normalize=True),2)*100);``

Producción:

``````1    50.0
0    50.0
Name: label, dtype: float64``````

La cantidad de artículos no confiables (falsos o 1) es 10413, mientras que la cantidad de artículos confiables (confiables o 0) es 10387. Casi el 50% de los artículos son falsos. Por lo tanto, la métrica de precisión medirá qué tan bien funciona nuestro modelo al construir un clasificador.
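Since the classes are almost perfectly balanced, the majority-class baseline sits right around 50%, which is what any useful classifier has to beat; a one-line sketch to compute it:

``````# accuracy of always predicting the most frequent label (the trivial baseline)
baseline_acc = news_d.label.value_counts(normalize=True).max()
print(f"Majority-class baseline accuracy: {baseline_acc:.2%}")``````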

Limpieza de datos para análisis

En esta sección, limpiaremos nuestro conjunto de datos para hacer algunos análisis:

• Elimina las filas y columnas que no uses.
• Realizar imputación de valor nulo.
• Eliminar caracteres especiales.
• Elimina las palabras vacías.
``````# Constants that are used to sanitize the datasets

column_n = ['id', 'title', 'author', 'text', 'label']
remove_c = ['id','author']
categorical_features = []
target_col = ['label']
text_f = ['title', 'text']``````
``````# Clean Datasets
import nltk
from nltk.corpus import stopwords
import re
from nltk.stem.porter import PorterStemmer
from collections import Counter

ps = PorterStemmer()
wnl = nltk.stem.WordNetLemmatizer()

stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)

# Remove unused columns
def remove_unused_c(df, column_n=remove_c):
    df = df.drop(column_n, axis=1)
    return df

# Impute null values with "None"
def null_process(feature_df):
    for col in text_f:
        feature_df.loc[feature_df[col].isnull(), col] = "None"
    return feature_df

def clean_dataset(df):
    # remove unused columns
    df = remove_unused_c(df)
    # impute null values
    df = null_process(df)
    return df

# Cleaning text from unused characters
def clean_text(text):
    text = str(text).replace(r'http[\w:/\.]+', ' ')  # removing urls
    text = str(text).replace(r'[^\.\w\s]', ' ')  # remove everything but characters and punctuation
    text = str(text).replace('[^a-zA-Z]', ' ')
    text = str(text).replace(r'\s\s+', ' ')
    text = text.lower().strip()
    #text = ' '.join(text)
    return text

## NLTK preprocessing includes:
# stop word removal, stemming, and lemmatization
# for our project we use stop word removal and lemmatization
def nltk_preprocess(text):
    text = clean_text(text)
    wordlist = re.sub(r'[^\w\s]', '', text).split()
    #text = ' '.join([word for word in wordlist if word not in stopwords_dict])
    #text = [ps.stem(word) for word in wordlist if not word in stopwords_dict]
    text = ' '.join([wnl.lemmatize(word) for word in wordlist if word not in stopwords_dict])
    return text``````

En el bloque de código de arriba:

• Hemos importado NLTK, que es una plataforma famosa para desarrollar aplicaciones de Python que interactúan con el lenguaje humano. A continuación, importamos `re`para expresiones regulares.
• Importamos palabras vacías desde `nltk.corpus`. Cuando trabajamos con palabras, particularmente cuando consideramos la semántica, a veces necesitamos eliminar palabras comunes que no agregan ningún significado significativo a una declaración, como `"but"`, `"can"`, `"we"`, etc.
• `PorterStemmer`se utiliza para realizar palabras derivadas con NLTK. Los lematizadores despojan a las palabras de sus afijos morfológicos, dejando únicamente la raíz de la palabra.
• Importamos `WordNetLemmatizer()`de la biblioteca NLTK para la lematización. La lematización es mucho más eficaz que la derivación . Va más allá de la reducción de palabras y evalúa todo el léxico de un idioma para aplicar el análisis morfológico a las palabras, con el objetivo de eliminar los extremos flexivos y devolver la forma base o de diccionario de una palabra, conocida como lema.
• `stopwords.words('english')`permítanos ver la lista de todas las palabras vacías en inglés admitidas por NLTK.
• `remove_unused_c()`La función se utiliza para eliminar las columnas no utilizadas.
• Imputamos valores nulos con `None`el uso de la `null_process()`función.
• Dentro de la función `clean_dataset()`, llamamos `remove_unused_c()`y `null_process()`funciones. Esta función es responsable de la limpieza de datos.
• Para limpiar texto de caracteres no utilizados, hemos creado la `clean_text()`función.
• Para el preprocesamiento, solo utilizaremos la eliminación de palabras vacías. Creamos la `nltk_preprocess()`función para ese propósito.

Preprocesando el `text`y `title`:

``````# Perform data cleaning on train and test dataset by calling clean_dataset function
df = clean_dataset(news_d)
# apply preprocessing on text through apply method by calling the function nltk_preprocess
df["text"] = df.text.apply(nltk_preprocess)
# apply preprocessing on title through apply method by calling the function nltk_preprocess
df["title"] = df.title.apply(nltk_preprocess)``````
``````# Dataset after cleaning and preprocessing step
df.head()``````

Producción:

``````title	text	label
0	house dem aide didnt even see comeys letter ja...	house dem aide didnt even see comeys letter ja...	1
1	flynn hillary clinton big woman campus breitbart	ever get feeling life circle roundabout rather...	0
2	truth might get fired	truth might get fired october 29 2016 tension ...	1
3	15 civilian killed single u airstrike identified	video 15 civilian killed single u airstrike id...	1
4	iranian woman jailed fictional unpublished sto...	print iranian woman sentenced six year prison ...	1``````

Análisis exploratorio de datos

En esta sección realizaremos:

• Análisis Univariante : Es un análisis estadístico del texto. Usaremos la nube de palabras para ese propósito. Una nube de palabras es un enfoque de visualización de datos de texto donde el término más común se presenta en el tamaño de fuente más considerable.
• Análisis bivariado : Bigram y Trigram se utilizarán aquí. Según Wikipedia: " un n-grama es una secuencia contigua de n elementos de una muestra determinada de texto o habla. Según la aplicación, los elementos pueden ser fonemas, sílabas, letras, palabras o pares de bases. Los n-gramas normalmente se recopilan de un corpus de texto o de voz".

Nube de una sola palabra

Las palabras más frecuentes aparecen en negrita y de mayor tamaño en una nube de palabras. Esta sección creará una nube de palabras para todas las palabras del conjunto de datos.

The `WordCloud()` function from the wordcloud library will be used, and `generate()` is used to generate the word cloud image:

``````from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# initialize the word cloud
wordcloud = WordCloud( background_color='black', width=800, height=600)
# generate the word cloud by passing the corpus
text_cloud = wordcloud.generate(' '.join(df['text']))
# plotting the word cloud
plt.figure(figsize=(20,30))
plt.imshow(text_cloud)
plt.axis('off')
plt.show()``````

Producción:

Nube de palabras solo para noticias confiables:

``````true_n = ' '.join(df[df['label']==0]['text'])
wc = wordcloud.generate(true_n)
plt.figure(figsize=(20,30))
plt.imshow(wc)
plt.axis('off')
plt.show()``````

Producción:

Nube de palabras solo para noticias falsas:

``````fake_n = ' '.join(df[df['label']==1]['text'])
wc= wordcloud.generate(fake_n)
plt.figure(figsize=(20,30))
plt.imshow(wc)
plt.axis('off')
plt.show()``````

Producción:

Bigrama más frecuente (combinación de dos palabras)

Un N-grama es una secuencia de letras o palabras. Un unigrama de carácter se compone de un solo carácter, mientras que un bigrama comprende una serie de dos caracteres. De manera similar, los N-gramas de palabras se componen de una serie de n palabras. La palabra "unidos" es un 1 gramo (unigrama). La combinación de las palabras "estado unido" es de 2 gramos (bigrama), "ciudad de nueva york" es de 3 gramos.

Grafiquemos el bigrama más común en las noticias confiables:

``````def plot_top_ngrams(corpus, title, ylabel, xlabel="Number of Occurences", n=2):
    """Utility function to plot top n-grams"""
    true_b = (pd.Series(nltk.ngrams(corpus.split(), n)).value_counts())[:20]
    true_b.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
    plt.title(title)
    plt.ylabel(ylabel)
    plt.xlabel(xlabel)
    plt.show()``````
``plot_top_ngrams(true_n, 'Top 20 Frequently Occuring True news Bigrams', "Bigram", n=2)``

El bigrama más común en las noticias falsas:

``plot_top_ngrams(fake_n, 'Top 20 Frequently Occuring Fake news Bigrams', "Bigram", n=2)``

Trigrama más frecuente (combinación de tres palabras)

El trigrama más común en noticias confiables:

``plot_top_ngrams(true_n, 'Top 20 Frequently Occuring True news Trigrams', "Trigrams", n=3)``

Para noticias falsas ahora:

``plot_top_ngrams(fake_n, 'Top 20 Frequently Occuring Fake news Trigrams', "Trigrams", n=3)``

Los gráficos anteriores nos dan algunas ideas sobre cómo se ven ambas clases. En la siguiente sección, usaremos la biblioteca de transformadores para construir un detector de noticias falsas.

Creación de un clasificador mediante el ajuste fino de BERT

Esta sección tomará código ampliamente del tutorial BERT de ajuste fino para hacer un clasificador de noticias falsas utilizando la biblioteca de transformadores. Entonces, para obtener información más detallada, puede dirigirse al tutorial original .

If you don't have transformers installed, you need to:

``\$ pip install transformers``

Importemos las bibliotecas necesarias:

``````import torch
from transformers.file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
from sklearn.model_selection import train_test_split

import random``````

Queremos que nuestros resultados sean reproducibles incluso si reiniciamos nuestro entorno:

``````def set_seed(seed: int):
    """
    Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch`` and/or ``tf`` (if
    installed).

    Args:
        seed (:obj:`int`): The seed to set.
    """
    random.seed(seed)
    np.random.seed(seed)
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # ^^ safe to call this function even if cuda is not available
    if is_tf_available():
        import tensorflow as tf

        tf.random.set_seed(seed)

set_seed(1)``````

El modelo que vamos a utilizar es el `bert-base-uncased`:

``````# the model we gonna train, base uncased BERT
# check text classification models here: https://huggingface.co/models?filter=text-classification
model_name = "bert-base-uncased"
# max sequence length for each document/sentence sample
max_length = 512``````

Loading the tokenizer:

``````# load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)``````

Preparación de datos

Now let's drop the `NaN` values from the `text`, `author`, and `title` columns:

``````news_df = news_d[news_d['text'].notna()]
news_df = news_df[news_df["author"].notna()]
news_df = news_df[news_df["title"].notna()]``````

A continuación, crear una función que tome el conjunto de datos como un marco de datos de Pandas y devuelva las divisiones de entrenamiento/validación de textos y etiquetas como listas:

``````def prepare_data(df, test_size=0.2, include_title=True, include_author=True):
    texts = []
    labels = []
    for i in range(len(df)):
        text = df["text"].iloc[i]
        label = df["label"].iloc[i]
        if include_title:
            text = df["title"].iloc[i] + " - " + text
        if include_author:
            text = df["author"].iloc[i] + " : " + text
        if text and label in [0, 1]:
            texts.append(text)
            labels.append(label)
    return train_test_split(texts, labels, test_size=test_size)

train_texts, valid_texts, train_labels, valid_labels = prepare_data(news_df)``````

The function above takes the dataset as a dataframe and returns the texts and labels split into training and validation sets as lists. Setting `include_title` to `True` means we add the `title` column to the `text` we are going to use for training, and setting `include_author` to `True` means we also add the `author` to the text.

Asegurémonos de que las etiquetas y los textos tengan la misma longitud:

``````print(len(train_texts), len(train_labels))
print(len(valid_texts), len(valid_labels))``````

Producción:

``````14628 14628
3657 3657``````

Tokenización del conjunto de datos

Usemos el tokenizador BERT para tokenizar nuestro conjunto de datos:

``````# tokenize the dataset, truncate when passed `max_length`,
# and pad with 0's when less than `max_length`
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)``````

Convertir las codificaciones en un conjunto de datos de PyTorch:

``````class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

# convert our tokenized data into a torch Dataset
train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)``````

Cargar y ajustar el modelo

Usaremos `BertForSequenceClassification`para cargar nuestro modelo de transformador BERT:

``````# load the model
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)``````

We set `num_labels` to 2 since this is a binary classification task. The function below is a callback that computes accuracy at each validation step:

``````from sklearn.metrics import accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # calculate accuracy using sklearn's function
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
    }``````

Let's initialize the training arguments:

``````training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=10,  # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=100,                # number of warmup steps for learning rate scheduler
    logging_dir='./logs',            # directory for storing logs
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
    logging_steps=200,               # log & save weights each logging_steps
    save_steps=200,
    evaluation_strategy="steps",     # evaluate each `logging_steps`
)``````

I set `per_device_train_batch_size` to 10, but you should set it as high as your GPU can fit. I set `logging_steps` and `save_steps` to 200, which means we run an evaluation and save the model weights every 200 training steps.

You can check this page for more detailed information about the available training arguments.
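
If you would rather have the best model chosen by accuracy instead of validation loss, the comment in the block above hints at `metric_for_best_model`. A minimal sketch of that variant (same arguments as before, only the selection metric changes; this is an alternative, not what was used for the results below):

``````# sketch: select the best checkpoint by the "accuracy" key returned by compute_metrics
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=20,
    warmup_steps=100,
    logging_dir='./logs',
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",  # must match a key in the dict returned by compute_metrics
    greater_is_better=True,            # higher accuracy is better
    logging_steps=200,
    save_steps=200,
    evaluation_strategy="steps",
)``````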

``````trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)``````

Training the model:

``````# train the model
trainer.train()``````

Training takes a few hours to complete, depending on your GPU. On the free version of Colab, it should take around an hour with an NVIDIA Tesla K80. Here is the output:

``````***** Running training *****
Num examples = 14628
Num Epochs = 1
Instantaneous batch size per device = 10
Total train batch size (w. parallel, distributed & accumulation) = 10
Total optimization steps = 1463
[1463/1463 41:07, Epoch 1/1]
Step	Training Loss	Validation Loss	Accuracy
200		0.250800		0.100533		0.983867
400		0.027600		0.043009		0.993437
600		0.023400		0.017812		0.997539
800		0.014900		0.030269		0.994258
1000	0.022400		0.012961		0.998086
1200	0.009800		0.010561		0.998633
1400	0.007700		0.010300		0.998633
***** Running Evaluation *****
Num examples = 3657
Batch size = 20
Saving model checkpoint to ./results/checkpoint-200
Configuration saved in ./results/checkpoint-200/config.json
Model weights saved in ./results/checkpoint-200/pytorch_model.bin
<SNIPPED>
***** Running Evaluation *****
Num examples = 3657
Batch size = 20
Saving model checkpoint to ./results/checkpoint-1400
Configuration saved in ./results/checkpoint-1400/config.json
Model weights saved in ./results/checkpoint-1400/pytorch_model.bin

Training completed. Do not forget to share your model on huggingface.co/models =)

TrainOutput(global_step=1463, training_loss=0.04888018785440506, metrics={'train_runtime': 2469.1722, 'train_samples_per_second': 5.924, 'train_steps_per_second': 0.593, 'total_flos': 3848788517806080.0, 'train_loss': 0.04888018785440506, 'epoch': 1.0})``````

Model Evaluation

Since `load_best_model_at_end` is set to `True`, the best weights are loaded when training completes. Let's evaluate the model on our validation set:

``````# evaluate the current model after training
trainer.evaluate()``````

Output:

``````***** Running Evaluation *****
Num examples = 3657
Batch size = 20
[183/183 02:11]
{'epoch': 1.0,
'eval_accuracy': 0.998632759092152,
'eval_loss': 0.010299865156412125,
'eval_runtime': 132.0374,
'eval_samples_per_second': 27.697,
'eval_steps_per_second': 1.386}``````

Saving the model and the tokenizer:

``````# saving the fine tuned model & tokenizer
model_path = "fake-news-bert-base-uncased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)``````

A new folder containing the model configuration and weights will appear after running the cell above. If you want to make predictions later, simply use the `from_pretrained()` method we used when loading the model, and you're good to go.
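
For example, a minimal sketch of loading the fine-tuned model and tokenizer back from that folder in a fresh session (the `"cuda"` device assumes a GPU is available, matching the inference code below):

``````from transformers import BertTokenizerFast, BertForSequenceClassification

model_path = "fake-news-bert-base-uncased"
# reload the fine-tuned weights and the matching tokenizer from the saved folder
model = BertForSequenceClassification.from_pretrained(model_path).to("cuda")
tokenizer = BertTokenizerFast.from_pretrained(model_path)``````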

Next, let's make a function that accepts an article's text as an argument and returns whether it is fake or not:

``````def get_prediction(text, convert_to_label=False):
    # prepare our text into tokenized sequence
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to("cuda")
    # perform inference to our model
    outputs = model(**inputs)
    # get output probabilities by doing softmax
    probs = outputs[0].softmax(1)
    # executing argmax function to get the candidate label
    d = {
        0: "reliable",
        1: "fake"
    }
    if convert_to_label:
        return d[int(probs.argmax())]
    else:
        return int(probs.argmax())``````

I took an example from `test.csv` that the model has never seen to run inference on; I checked it, and it is a real article from The New York Times:

``````real_news = """
Tim Tebow Will Attempt Another Comeback, This Time in Baseball - The New York Times",Daniel Victor,"If at first you don’t succeed, try a different sport. Tim Tebow, who was a Heisman   quarterback at the University of Florida but was unable to hold an N. F. L. job, is pursuing a career in Major League Baseball. <SNIPPED>
"""``````

The original text is in the Colab environment if you want to copy it, as it is a full article. Let's pass it to the model and see the results:

``get_prediction(real_news, convert_to_label=True)``

Output:

``reliable``

Appendix: Creating a Submission File for Kaggle

In this section, we'll predict every article in `test.csv` to create a submission file and see our accuracy on the test set of the Kaggle competition:

``````# read the test set
test_df = pd.read_csv("test.csv")
# make a copy of the testing set
new_df = test_df.copy()
# add a new column that contains the author, title and article content
new_df["new_text"] = new_df["author"].astype(str) + " : " + new_df["title"].astype(str) + " - " + new_df["text"].astype(str)
# get the prediction of all the test set
new_df["label"] = new_df["new_text"].apply(get_prediction)
# make the submission file
final_df = new_df[["id", "label"]]
final_df.to_csv("submit_final.csv", index=False)``````

After concatenating the author, title, and article text, we apply the `get_prediction()` function to the new column to fill the `label` column, then use the `to_csv()` method to create the submission file for Kaggle. Here is my submission score:

We got 99.78% and 100% accuracy on the private and public leaderboards, respectively. That's great!

Conclusion

Alright, we're done with the tutorial. You can check this page to see the various training arguments you can tweak.

If you have your own fake news dataset to fine-tune on, you simply have to pass a list of samples to the tokenizer as we did; no other code needs to change after that.
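
As a rough sketch of that idea, assuming a hypothetical `my_fake_news.csv` with `text` and `label` columns (0 for reliable, 1 for fake), only the part that builds the text/label lists changes:

``````import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical custom dataset with "text" and "label" columns
custom_df = pd.read_csv("my_fake_news.csv")
custom_df = custom_df[custom_df["text"].notna()]
texts = custom_df["text"].tolist()
labels = custom_df["label"].tolist()
train_texts, valid_texts, train_labels, valid_labels = train_test_split(texts, labels, test_size=0.2)

# the rest of the pipeline (tokenizer, NewsGroupsDataset, Trainer) stays the same
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)``````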

Check the full code here, or the Colab environment here.