Lawson Wehner

Html5gum: A WHATWG-compliant HTML5 tokenizer & Tag Soup Parser

Html5gum 

html5gum is a WHATWG-compliant HTML tokenizer.

use std::fmt::Write;
use html5gum::{Tokenizer, Token};

let html = "<title   >hello world</title>";
let mut new_html = String::new();

for token in Tokenizer::new(html).infallible() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", String::from_utf8_lossy(&tag.name)).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", String::from_utf8_lossy(&hello_world)).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", String::from_utf8_lossy(&tag.name)).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");

What a tokenizer does and what it does not do

html5gum fully implements section 13.2.5 of the WHATWG HTML spec, i.e. it is able to tokenize HTML documents and passes html5lib's tokenizer test suite. Since it is just a tokenizer, this means:

  • html5gum does not implement charset detection. This implementation takes and returns bytes, but assumes UTF-8. It recovers gracefully from invalid UTF-8.
  • html5gum does not correct mis-nested tags.
  • html5gum does not recognize implicitly self-closing elements like <img>; as a tokenizer it simply emits a start token. It does, however, emit a self-closing tag for <img .. />.
  • html5gum doesn't implement the DOM, and unfortunately in the HTML spec, constructing the DOM ("tree construction") influences how tokenization is done. For an example of the problems this causes, see this example code.
  • html5gum does not generally qualify as a browser-grade HTML parser as per the WHATWG spec. This may change in the future; see issue 21.

With those caveats in mind, html5gum can tokenize pretty much anything that browsers can.

The Emitter trait

A distinguishing feature of html5gum is that you can bring your own token data structure and hook into token creation by implementing the Emitter trait. This allows you to:

Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.

Efficiently filter out uninteresting categories of data without ever allocating for them. For example, if any plaintext between tokens is not of interest to you, you can implement the respective trait methods as no-ops and therefore avoid any overhead from creating plaintext tokens.

Other features

  • No unsafe Rust
  • The only dependency is jetscii, and it can be disabled via crate features (see Cargo.toml)

Alternative HTML parsers

html5gum was created out of a need to parse HTML tag soup efficiently. Previous options were to:

use quick-xml or xmlparser with some hacks to make either one not choke on bad HTML. For a (rather large) subset of HTML input this works well (quick-xml in particular can be configured to be very lenient about parsing errors), and parsing speed is stellar. But neither can parse all HTML.

For my own use case, html5gum is about 2x slower than quick-xml.

use html5ever's own tokenizer to avoid as much tree-building overhead as possible. This was functional but had poor performance for my own use case (10-15x slower than quick-xml).

use lol-html, which would probably perform at least as well as html5gum, but comes with a closure-based API that I didn't manage to get working for my use case.

Etymology

Why is this library called html5gum?

G.U.M: Giant Unreadable Match-statement

<insert "how it feels to chew 5 gum parse HTML" meme here>


Download Details:

Author: Untitaker
Source Code: https://github.com/untitaker/html5gum 
License: MIT license

#html5 #html #xml #tokenize 

Rupert Beatty

Mustard: A Swift Library for Tokenizing Strings When Splitting by Whitespace Doesn't Cut It

Mustard 🌭

Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.

Quick start using character sets

Foundation includes the String method components(separatedBy:) that allows us to get substrings divided up by certain characters:

let sentence = "hello 2017 year"
let words = sentence.components(separatedBy: .whitespaces)
// words.count -> 3
// words = ["hello", "2017", "year"]

Mustard provides a similar feature, but with the opposite approach, where instead of matching by separators you can match by one or more character sets, which is useful if separators simply don't exist:

import Mustard

let sentence = "hello2017year"
let words = sentence.components(matchedWith: .letters, .decimalDigits)
// words.count -> 3
// words = ["hello", "2017", "year"]

If you want more than just the substrings, you can use the tokens(matchedWith: CharacterSet...) method which will return an array of TokenType.

As a minimum, TokenType requires properties for text (the substring matched), and range (the range of the substring in the original string). When using CharacterSets as a tokenizer, the more specific type CharacterSetToken is returned, which includes the property set which contains the instance of CharacterSet that was used to create the match.

import Mustard

let tokens = "123Hello world&^45.67".tokens(matchedWith: .decimalDigits, .letters)
// tokens: [CharacterSet.Token]
// tokens.count -> 5 (characters '&', '^', and '.' are ignored)
//
// second token..
// tokens[1].text -> "Hello"
// tokens[1].range -> Range<String.Index>(3..<8)
// tokens[1].set -> CharacterSet.letters
//
// last token..
// tokens[4].text -> "67"
// tokens[4].range -> Range<String.Index>(19..<21)
// tokens[4].set -> CharacterSet.decimalDigits

Advanced matching with custom tokenizers

Mustard can do more than match from character sets. You can create your own tokenizers with more sophisticated matching behavior by implementing the TokenizerType and TokenType protocols.

Here's an example of using DateTokenizer (see example for implementation) that finds substrings that match a MM/dd/yy format.

DateTokenizer returns tokens with the type DateToken. Along with the substring text and range, DateToken includes a Date object corresponding to the date in the substring:

import Mustard

let text = "Serial: #YF 1942-b 12/01/17 (Scanned) 12/03/17 (Arrived) ref: 99/99/99"

let tokens = text.tokens(matchedWith: DateTokenizer())
// tokens: [DateTokenizer.Token]
// tokens.count -> 2
// ('99/99/99' is *not* matched by `DateTokenizer` because it's not a valid date)
//
// first date
// tokens[0].text -> "12/01/17"
// tokens[0].date -> Date(2017-12-01 05:00:00 +0000)
//
// last date
// tokens[1].text -> "12/03/17"
// tokens[1].date -> Date(2017-12-03 05:00:00 +0000)

Documentation & Examples

Roadmap

  •  Include detailed examples and documentation
  •  Ability to skip/ignore characters within match
  •  Include more advanced pattern matching for matching tokens
  •  Make project logo 🌭
  •  Performance testing / benchmarking against Scanner
  •  Include interface for working with Character tokenizers

Requirements

  • Swift 4.1

Contributing

Feedback and contributions, whether bug fixes or improvements, are welcome. Feel free to submit a pull request or open an issue.

Download Details:

Author: Mathewsanders
Source Code: https://github.com/mathewsanders/Mustard 
License: MIT license

#swift #tokenize #sub #strings 


Chatbot in Python from Scratch with Source Code

A chatbot is AI-based software designed to interact with humans in their natural languages. These chatbots usually communicate through auditory or textual methods, and they can effortlessly mimic human languages to communicate with human beings in a human-like manner. A chatbot is arguably one of the best applications of natural language processing.

In the last few years, chatbots in Python have become very popular in the technology and business sectors. These intelligent bots are so adept at imitating natural human languages and conversing with humans that companies across many industries are adopting them. From e-commerce firms to healthcare institutions, everyone seems to be leveraging this nifty tool to drive business benefits. In this article, we will learn about chatbots using Python and how to make a chatbot in Python.

To create a chatbot in Python from scratch, we follow these steps.

  • Step 1: Import and load the data file
  • Step 2: Preprocess the data
  • Step 3: Create training and test data
  • Step 4: Build the model
  • Step 5: Predict the response

Step 1: Import and load the data file

First, you need to create a file called train_chatbot.py. We import the packages our chatbot needs and set up the variables we will use in our Python project.

The data file is in JSON format, so we use the json package to read the JSON file in Python.

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import json
import pickle
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.optimizers import SGD
import random
words=[]
classes = []
documents = []
ignore_words = ['?', '!']
data_file = open('intents.json').read()
intents = json.loads(data_file)
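
The tutorial assumes an intents.json file but never shows its contents. Based on the keys the code in this article accesses (intents, tag, patterns, responses), a minimal file might look like the following sketch; the tags and phrases here are only illustrative placeholders:

{
  "intents": [
    {
      "tag": "greeting",
      "patterns": ["Hi", "Hello", "How are you?"],
      "responses": ["Hello!", "Hi there, how can I help?"]
    },
    {
      "tag": "goodbye",
      "patterns": ["Bye", "See you later"],
      "responses": ["Goodbye!", "Talk to you soon."]
    }
  ]
}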

Step 2: Preprocess the data

Before we can build a machine learning or deep learning model from text data, we have to process the data in different ways. Depending on the requirements, we have to use different operations to preprocess the data.

Tokenizing text data is the first and most basic thing you can do with it. Tokenizing is the process of breaking a text into small pieces, such as words.

Here, we go through the patterns, use the nltk.word_tokenize() function to split each sentence into words, and add each word to the words list. We also build a list of the classes our tags belong to.
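
Note: nltk.word_tokenize() and WordNetLemmatizer rely on NLTK data packages that are not installed with the library itself. If you have never downloaded them, a one-time download is needed, for example:

import nltk
nltk.download('punkt')    # tokenizer data used by nltk.word_tokenize()
nltk.download('wordnet')  # lexical database used by WordNetLemmatizer
# (recent NLTK releases may additionally ask for 'punkt_tab')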

for intent in intents['intents']:
    for pattern in intent['patterns']:
        #tokenize each word
        w = nltk.word_tokenize(pattern)
        words.extend(w)
        #add documents in the corpus
        documents.append((w, intent['tag']))
        # add to our classes list
        if intent['tag'] not in classes:
            classes.append(intent['tag'])
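
To make the loop above concrete, this is roughly what the lists hold for a single hypothetical intent with tag "greeting" and pattern "How are you?" (using the illustrative intents.json sketched earlier):

# words     -> ['How', 'are', 'you', '?']                 (raw tokens, not yet lemmatized)
# documents -> [(['How', 'are', 'you', '?'], 'greeting')]
# classes   -> ['greeting']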

Now we will lemmatize each word and remove duplicates from the list. Lemmatization is the process of reducing a word to its lemma form. We then create pickle files to store the Python objects we will use when predicting.

# lemmatize, lower each word and remove duplicates
words = [lemmatizer.lemmatize(w.lower()) for w in words if w not in ignore_words]
words = sorted(list(set(words)))
# sort classes
classes = sorted(list(set(classes)))
# documents = combination between patterns and intents
print (len(documents), "documents")
# classes = intents
print (len(classes), "classes", classes)
# words = all words, vocabulary
print (len(words), "unique lemmatized words", words)
pickle.dump(words,open('words.pkl','wb'))
pickle.dump(classes,open('classes.pkl','wb'))

Step 3: Create training and test data

Now we will create the training data, which will include both the inputs and the outputs. The pattern will be our input, and the class that pattern belongs to will be our output. But the computer cannot read words, so we will convert the words into numbers.

# create our training data
training = []
# create an empty array for our output
output_empty = [0] * len(classes)
# training set, bag of words for each sentence
for doc in documents:
    # initialize our bag of words
    bag = []
    # list of tokenized words for the pattern
    pattern_words = doc[0]
    # lemmatize each word - create base word, in attempt to represent related words
    pattern_words = [lemmatizer.lemmatize(word.lower()) for word in pattern_words]
    # create our bag of words array with 1, if word match found in current pattern
    for w in words:
        bag.append(1) if w in pattern_words else bag.append(0)
    # output is a '0' for each tag and '1' for current tag (for each pattern)
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1
    training.append([bag, output_row])
# shuffle our features and turn into np.array
random.shuffle(training)
training = np.array(training, dtype=object)  # dtype=object: bag and output_row have different lengths
# create train and test lists. X - patterns, Y - intents
train_x = list(training[:,0])
train_y = list(training[:,1])
print("Training data created")

Step 4: Build the model

Now that our training data is ready, we will build a 3-layer deep neural network. We do this with the Keras Sequential API. After training the model for 200 epochs, it was 100% accurate on the training data. We will name the file "chatbot_model.h5" and save it.

# Create model - 3 layers. First layer 128 neurons, second layer 64 neurons and 3rd output layer contains number of neurons
# equal to number of intents to predict output intent with softmax
model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))
# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
#fitting and saving the model
hist = model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)
model.save('chatbot_model.h5')
print("model created")

Step 5: Predict the response

To predict sentences and get a response for the user, let's create a new file named chatapp.py.

We will load the trained model and then use a graphical user interface to predict the bot's response. The model will only tell us which class the input belongs to, so we will write some functions that identify the class and then pick a random response from the list of responses.

Again, we load the 'words.pkl' and 'classes.pkl' pickle files that we created when we trained our model:

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import pickle
import numpy as np
from tensorflow.keras.models import load_model
model = load_model('chatbot_model.h5')
import json
import random
intents = json.loads(open('intents.json').read())
words = pickle.load(open('words.pkl','rb'))
classes = pickle.load(open('classes.pkl','rb'))

To predict the class, we will have to provide input in the same way we did during training. So we will write some functions that preprocess the text and then guess the class.

def clean_up_sentence(sentence):
    # tokenize the pattern - split words into array
    sentence_words = nltk.word_tokenize(sentence)
    # stem each word - create short form for word
    sentence_words = [lemmatizer.lemmatize(word.lower()) for word in sentence_words]
    return sentence_words
# return bag of words array: 0 or 1 for each word in the bag that exists in the sentence
def bow(sentence, words, show_details=True):
    # tokenize the pattern
    sentence_words = clean_up_sentence(sentence)
    # bag of words - matrix of N words, vocabulary matrix
    bag = [0]*len(words)
    for s in sentence_words:
        for i,w in enumerate(words):
            if w == s:
                # assign 1 if current word is in the vocabulary position
                bag[i] = 1
                if show_details:
                    print ("found in bag: %s" % w)
    return(np.array(bag))
def predict_class(sentence, model):
    # filter out predictions below a threshold
    p = bow(sentence, words,show_details=False)
    res = model.predict(np.array([p]))[0]
    ERROR_THRESHOLD = 0.25
    results = [[i,r] for i,r in enumerate(res) if r>ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append({"intent": classes[r[0]], "probability": str(r[1])})
    return return_list

After predicting the class, we get a random response from that intent's list of responses.

def getResponse(ints, intents_json):
    tag = ints[0]['intent']
    list_of_intents = intents_json['intents']
    for i in list_of_intents:
        if(i['tag']== tag):
            result = random.choice(i['responses'])
            break
    return result
def chatbot_response(text):
    ints = predict_class(text, model)
    res = getResponse(ints, intents)
    return res
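
With these helpers in place, a quick way to try the pipeline from a Python shell (assuming the model has been trained and an intent such as "greeting" exists in your intents.json) is:

print(predict_class("Hello, how are you?", model))
# e.g. [{'intent': 'greeting', 'probability': '0.97'}]   (actual values depend on your data)
print(chatbot_response("Hello, how are you?"))
# e.g. "Hi there, how can I help?"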

Now we will create a graphical user interface (GUI). Let's use the Tkinter library, which comes bundled with many useful GUI widgets.

We will take the user's message and use the helper functions we created to get the bot's answer and display it in the GUI. Here is the full source code for the GUI.

#Creating GUI with tkinter
import tkinter
from tkinter import *
def send():
    msg = EntryBox.get("1.0",'end-1c').strip()
    EntryBox.delete("0.0",END)
    if msg != '':
        ChatLog.config(state=NORMAL)
        ChatLog.insert(END, "You: " + msg + '\n\n')
        ChatLog.config(foreground="#442265", font=("Verdana", 12 ))
        res = chatbot_response(msg)
        ChatLog.insert(END, "Bot: " + res + '\n\n')
        ChatLog.config(state=DISABLED)
        ChatLog.yview(END)
base = Tk()
base.title("Hello")
base.geometry("400x500")
base.resizable(width=FALSE, height=FALSE)
#Create Chat window
ChatLog = Text(base, bd=0, bg="white", height="8", width="50", font="Arial",)
ChatLog.config(state=DISABLED)
#Bind scrollbar to Chat window
scrollbar = Scrollbar(base, command=ChatLog.yview, cursor="heart")
ChatLog['yscrollcommand'] = scrollbar.set
#Create Button to send message
SendButton = Button(base, font=("Verdana",12,'bold'), text="Send", width="12", height=5,
                    bd=0, bg="#32de97", activebackground="#3c9d9b",fg='#ffffff',
                    command= send )
#Create the box to enter message
EntryBox = Text(base, bd=0, bg="white",width="29", height="5", font="Arial")
#EntryBox.bind("<Return>", send)
#Place all components on the screen
scrollbar.place(x=376,y=6, height=386)
ChatLog.place(x=6,y=6, height=386, width=370)
EntryBox.place(x=128, y=401, height=90, width=265)
SendButton.place(x=6, y=401, height=90)
base.mainloop()

Run the Python Chatbot

To run the chatbot, we have two main files: train_chatbot.py and chatapp.py.

First, we train the model using this command in the terminal:

python train_chatbot.py
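
If training finishes without errors and chatbot_model.h5 has been written, the GUI can then be started from the same directory (this second step is implied by the two-file layout described above):

python chatapp.py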


DOWNLOAD THE FULL SOURCE CODE!

Chatbot en Python From Scratch avec code source

Un chatbot est un logiciel basé sur l'IA conçu pour interagir avec les humains dans leur langue naturelle. Ces chatbots sont généralement conversés via des méthodes auditives ou textuelles, et ils peuvent imiter sans effort les langues humaines pour communiquer avec les êtres humains d'une manière humaine. Un chatbot est sans doute l'une des meilleures applications de traitement du langage naturel.

Au cours des dernières années, les chatbots en Python sont devenus très populaires dans les secteurs de la technologie et des affaires. Ces robots intelligents sont si aptes à imiter les langages humains naturels et à converser avec les humains que des entreprises de divers secteurs industriels les adoptent. Des entreprises de commerce électronique aux établissements de santé, tout le monde semble tirer parti de cet outil astucieux pour générer des avantages commerciaux. Dans cet article, nous allons découvrir le chatbot utilisant Python et comment créer un chatbot en python . 

Pour créer un Chatbot en Python à partir de zéro, nous suivons ces étapes.

  • Étape 1 : Importer et charger le fichier de données
  • Étape 2 : prétraiter les données
  • Étape 3 : Créer des données d'entraînement et de test
  • Étape 4 : Construire le modèle
  • Étape 5 : prédire la réponse

Étape 1 : Importer et charger le fichier de données

Tout d'abord, vous devez créer un fichier appelé train_chatbot.py. Nous apportons les packages dont notre chatbot a besoin et configurons les variables que nous utiliserons dans notre projet Python.

Le fichier de données est au JSONformat , nous avons donc utilisé le json packagepour lire le JSONfichier en Python.

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import json
import pickle
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
from tensorflow.keras.optimizers import SGD
import random
words=[]
classes = []
documents = []
ignore_words = ['?', '!']
data_file = open('intents.json').read()
intents = json.loads(data_file)

Étape 2 : prétraiter les données

Avant de pouvoir créer un modèle d'apprentissage automatique ou d'apprentissage en profondeur à partir de données textuelles, nous devons traiter les données de différentes manières. Selon les besoins, nous devons utiliser différentes opérations pour prétraiter les données.

La tokenisation des données textuelles est la première et la plus élémentaire chose que vous puissiez faire avec. La tokenisation est le processus qui consiste à diviser un texte en petits morceaux, comme des mots.

Ici, nous passons en revue les modèles, utilisons la nltk.word_tokenize()fonction pour diviser la phrase en mots et ajoutons chaque mot à la liste de mots. Nous dressons également une liste des classes auxquelles appartiennent nos balises.

for intent in intents['intents']:
    for pattern in intent['patterns']:
        #tokenize each word
        w = nltk.word_tokenize(pattern)
        words.extend(w)
        #add documents in the corpus
        documents.append((w, intent['tag']))
        # add to our classes list
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

Maintenant, nous allons comprendre ce que signifie chaque mot et nous débarrasser de tous les mots qui sont déjà sur la liste. La lemmatisation est le processus consistant à transformer un mot en sa forme lemmaire, puis à créer un fichier pickle pour stocker les objets Python que nous utiliserons lors de la prédiction.

# lemmatize, lower each word and remove duplicates
words = [lemmatizer.lemmatize(w.lower()) for w in words if w not in ignore_words]
words = sorted(list(set(words)))
# sort classes
classes = sorted(list(set(classes)))
# documents = combination between patterns and intents
print (len(documents), "documents")
# classes = intents
print (len(classes), "classes", classes)
# words = all words, vocabulary
print (len(words), "unique lemmatized words", words)
pickle.dump(words,open('words.pkl','wb'))
pickle.dump(classes,open('classes.pkl','wb'))

Étape 3 : Créer des données d'entraînement et de test

Maintenant, nous allons créer les données d'apprentissage, qui incluront à la fois les entrées et les sorties. Le motif sera notre entrée et la classe à laquelle appartient le motif sera notre sortie. Mais l'ordinateur ne peut pas lire les mots, nous allons donc transformer les mots en nombres.

# create our training data
training = []
# create an empty array for our output
output_empty = [0] * len(classes)
# training set, bag of words for each sentence
for doc in documents:
    # initialize our bag of words
    bag = []
    # list of tokenized words for the pattern
    pattern_words = doc[0]
    # lemmatize each word - create base word, in attempt to represent related words
    pattern_words = [lemmatizer.lemmatize(word.lower()) for word in pattern_words]
    # create our bag of words array with 1, if word match found in current pattern
    for w in words:
        bag.append(1) if w in pattern_words else bag.append(0)
    # output is a '0' for each tag and '1' for current tag (for each pattern)
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1
    training.append([bag, output_row])
# shuffle our features and turn into np.array
random.shuffle(training)
training = np.array(training)
# create train and test lists. X - patterns, Y - intents
train_x = list(training[:,0])
train_y = list(training[:,1])
print("Training data created")

Étape 4 : Construire le modèle

Maintenant que nos données d'entraînement sont prêtes, nous allons construire un réseau neuronal profond à 3 couches. Nous le faisons avec l' KerasAPI séquentielle. Après avoir entraîné le modèle pendant 200 itérations, il était précis à 100 %. Nommons le fichier « chatbot model.h5» et sauvegardons-le.

# Create model - 3 layers. First layer 128 neurons, second layer 64 neurons and 3rd output layer contains number of neurons
# equal to number of intents to predict output intent with softmax
model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))
# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
#fitting and saving the model
hist = model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)
model.save('chatbot_model.h5', hist)
print("model created")

Étape 5 : prédire la réponse

Pour prédire les phrases et obtenir une réponse de l'utilisateur, laissez-nous créer un nouveau fichier nommé " chatapp.py."

Nous allons charger le modèle formé, puis utiliser une interface utilisateur graphique pour prédire la réponse du bot. Le modèle ne nous dira qu'à quelle classe il appartient, nous allons donc créer des fonctions qui détermineront la classe, puis choisirons une réponse aléatoire dans la liste des réponses.

Encore une fois, nous chargeons les fichiers pickle ' words.pkl' et ' classes.pkl' que nous avons créés lorsque nous avons entraîné notre modèle :

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import pickle
import numpy as np
from keras.models import load_model
model = load_model('chatbot_model.h5')
import json
import random
intents = json.loads(open('intents.json').read())
words = pickle.load(open('words.pkl','rb'))
classes = pickle.load(open('classes.pkl','rb'))

Pour prédire la classe, nous devrons donner des informations de la même manière que nous l'avons fait lors de la formation. Nous allons donc créer des fonctions qui effectueront un prétraitement sur le texte, puis devinerons la classe.

def clean_up_sentence(sentence):
    # tokenize the pattern - split words into array
    sentence_words = nltk.word_tokenize(sentence)
    # stem each word - create short form for word
    sentence_words = [lemmatizer.lemmatize(word.lower()) for word in sentence_words]
    return sentence_words
# return bag of words array: 0 or 1 for each word in the bag that exists in the sentence
def bow(sentence, words, show_details=True):
    # tokenize the pattern
    sentence_words = clean_up_sentence(sentence)
    # bag of words - matrix of N words, vocabulary matrix
    bag = [0]*len(words)
    for s in sentence_words:
        for i,w in enumerate(words):
            if w == s:
                # assign 1 if current word is in the vocabulary position
                bag[i] = 1
                if show_details:
                    print ("found in bag: %s" % w)
    return(np.array(bag))
def predict_class(sentence, model):
    # filter out predictions below a threshold
    p = bow(sentence, words,show_details=False)
    res = model.predict(np.array([p]))[0]
    ERROR_THRESHOLD = 0.25
    results = [[i,r] for i,r in enumerate(res) if r>ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append({"intent": classes[r[0]], "probability": str(r[1])})
    return return_list

Après avoir prédit la classe, nous obtiendrons une réponse aléatoire à partir de la liste des intentions.

def getResponse(ints, intents_json):
    tag = ints[0]['intent']
    list_of_intents = intents_json['intents']
    for i in list_of_intents:
        if(i['tag']== tag):
            result = random.choice(i['responses'])
            break
    return result
def chatbot_response(text):
    ints = predict_class(text, model)
    res = getResponse(ints, intents)
    return res

Maintenant, nous allons créer une interface utilisateur graphique (GUI). Utilisons la bibliothèque Tkinter, qui est fournie avec de nombreuses autres bibliothèques d'interface graphique utiles.

Nous prendrons le message de l'utilisateur et utiliserons les fonctions d'assistance que nous avons créées pour obtenir la réponse du bot et l'afficher sur l'interface graphique. Voici le code source complet de l'interface graphique.

#Creating GUI with tkinter
import tkinter
from tkinter import *
def send():
    msg = EntryBox.get("1.0",'end-1c').strip()
    EntryBox.delete("0.0",END)
    if msg != '':
        ChatLog.config(state=NORMAL)
        ChatLog.insert(END, "You: " + msg + '\n\n')
        ChatLog.config(foreground="#442265", font=("Verdana", 12 ))
        res = chatbot_response(msg)
        ChatLog.insert(END, "Bot: " + res + '\n\n')
        ChatLog.config(state=DISABLED)
        ChatLog.yview(END)
base = Tk()
base.title("Hello")
base.geometry("400x500")
base.resizable(width=FALSE, height=FALSE)
#Create Chat window
ChatLog = Text(base, bd=0, bg="white", height="8", width="50", font="Arial",)
ChatLog.config(state=DISABLED)
#Bind scrollbar to Chat window
scrollbar = Scrollbar(base, command=ChatLog.yview, cursor="heart")
ChatLog['yscrollcommand'] = scrollbar.set
#Create Button to send message
SendButton = Button(base, font=("Verdana",12,'bold'), text="Send", width="12", height=5,
                    bd=0, bg="#32de97", activebackground="#3c9d9b",fg='#ffffff',
                    command= send )
#Create the box to enter message
EntryBox = Text(base, bd=0, bg="white",width="29", height="5", font="Arial")
#EntryBox.bind("<Return>", send)
#Place all components on the screen
scrollbar.place(x=376,y=6, height=386)
ChatLog.place(x=6,y=6, height=386, width=370)
EntryBox.place(x=128, y=401, height=90, width=265)
SendButton.place(x=6, y=401, height=90)
base.mainloop()

Exécutez le chatbot Python

Pour exécuter le chatbot, nous avons deux fichiers principaux ; train_chatbot.py et chatapp.py.

Tout d'abord, nous formons le modèle à l'aide de la commande dans le terminal :

python train_chatbot.py


TÉLÉCHARGEZ LE CODE SOURCE COMPLET !

 

中條 美冬

1665826260

ソースコードを使用したゼロからの Python のチャットボット

チャットボットは、自然言語で人間と対話するように設計された AI ベースのソフトウェアです。これらのチャットボットは通常、聴覚またはテキストの方法で会話し、人間の言語を簡単に模倣して、人間のような方法で人間と通信できます。チャットボットは、間違いなく自然言語処理の最良のアプリケーションの 1 つです。

ここ数年、Python のチャットボットは、テクノロジーおよびビジネスの分野で非常に人気が高まっています。これらのインテリジェントなボットは、自然な人間の言語を模倣し、人間と会話することに長けているため、さまざまな産業部門の企業が採用しています。e コマース企業から医療機関まで、誰もがこの気の利いたツールを活用してビジネス上の利益を上げているようです。この記事では、 Python を使用したチャットボットと、Pythonでチャットボットを作成する方法について説明します。 

Python でゼロからチャットボットを作成するには、次の手順に従います。

  • ステップ 1: データ ファイルをインポートしてロードする
  • ステップ 2: データの前処理
  • ステップ 3: トレーニング データとテスト データを作成する
  • ステップ 4: モデルを構築する
  • ステップ 5: 応答を予測する

ステップ 1: データ ファイルをインポートしてロードする

まず、というファイルを作成する必要がありますtrain_chatbot.py。チャットボットに必要なパッケージを取り込み、Python プロジェクトで使用する変数を設定します。

データ ファイルは のJSON形式であるため、 を使用してファイルを Pythonjson packageに読み込みました。JSON

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import json
import pickle
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
from tensorflow.keras.optimizers import SGD
import random
words=[]
classes = []
documents = []
ignore_words = ['?', '!']
data_file = open('intents.json').read()
intents = json.loads(data_file)

ステップ 2: データの前処理

テキスト データから機械学習またはディープ ラーニング モデルを作成する前に、さまざまな方法でデータを処理する必要があります。必要に応じて、さまざまな操作を使用してデータを前処理する必要があります。

テキスト データのトークン化は、テキスト データを使ってできる最初の最も基本的なことです。トークン化とは、テキストを単語などの小さな断片に分割するプロセスです。

ここでは、パターンを調べ、nltk.word_tokenize()関数を使用して文を単語に分割し、各単語を単語リストに追加します。タグが属するクラスのリストも作成します。

for intent in intents['intents']:
    for pattern in intent['patterns']:
        #tokenize each word
        w = nltk.word_tokenize(pattern)
        words.extend(w)
        #add documents in the corpus
        documents.append((w, intent['tag']))
        # add to our classes list
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

次に、各単語の意味を把握し、既にリストにある単語を削除します。レマタイズとは、単語をそのレンマ形式に変更し、予測時に使用する Python オブジェクトを格納するための pickle ファイルを作成するプロセスです。

# lemmatize, lower each word and remove duplicates
words = [lemmatizer.lemmatize(w.lower()) for w in words if w not in ignore_words]
words = sorted(list(set(words)))
# sort classes
classes = sorted(list(set(classes)))
# documents = combination between patterns and intents
print (len(documents), "documents")
# classes = intents
print (len(classes), "classes", classes)
# words = all words, vocabulary
print (len(words), "unique lemmatized words", words)
pickle.dump(words,open('words.pkl','wb'))
pickle.dump(classes,open('classes.pkl','wb'))

ステップ 3: トレーニング データとテスト データを作成する

次に、入力と出力の両方を含むトレーニング データを作成します。パターンが入力になり、パターンが属するクラスが出力になります。でも、コンピューターは単語を読めないので、単語を数字に変換します。

# create our training data
training = []
# create an empty array for our output
output_empty = [0] * len(classes)
# training set, bag of words for each sentence
for doc in documents:
    # initialize our bag of words
    bag = []
    # list of tokenized words for the pattern
    pattern_words = doc[0]
    # lemmatize each word - create base word, in attempt to represent related words
    pattern_words = [lemmatizer.lemmatize(word.lower()) for word in pattern_words]
    # create our bag of words array with 1, if word match found in current pattern
    for w in words:
        bag.append(1) if w in pattern_words else bag.append(0)
    # output is a '0' for each tag and '1' for current tag (for each pattern)
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1
    training.append([bag, output_row])
# shuffle our features and turn into np.array
random.shuffle(training)
training = np.array(training)
# create train and test lists. X - patterns, Y - intents
train_x = list(training[:,0])
train_y = list(training[:,1])
print("Training data created")

ステップ 4: モデルを構築する

トレーニング データの準備ができたので、3 層のディープ ニューラル ネットワークを構築します。これはKerasシーケンシャル API で行います。モデルを 200 回反復してトレーニングした後、100% 正確になりました。ファイルに「 」という名前を付けて保存しましょうchatbot model.h5

# Create model - 3 layers. First layer 128 neurons, second layer 64 neurons and 3rd output layer contains number of neurons
# equal to number of intents to predict output intent with softmax
model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))
# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
#fitting and saving the model
hist = model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)
model.save('chatbot_model.h5', hist)
print("model created")

ステップ 5: 応答を予測する

文を予測し、ユーザーからの応答を取得して、「chatapp.py.」という名前の新しいファイルを作成できるようにします。

トレーニング済みのモデルを読み込み、グラフィカル ユーザー インターフェイスを使用してボットの応答を予測します。モデルはそれがどのクラスに属しているかだけを教えてくれるので、クラスを特定し、応答のリストからランダムな応答を選択する関数をいくつか作成します。

ここでも、モデルをトレーニングしたときに作成した ' words.pkl' と ' classes.pkl' の pickle ファイルを読み込みます。

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import pickle
import numpy as np
from keras.models import load_model
model = load_model('chatbot_model.h5')
import json
import random
intents = json.loads(open('intents.json').read())
words = pickle.load(open('words.pkl','rb'))
classes = pickle.load(open('classes.pkl','rb'))

クラスを予測するには、トレーニング中に行ったのと同じ方法で入力を行う必要があります。そこで、テキストを前処理してからクラスを推測する関数をいくつか作成します。

def clean_up_sentence(sentence):
    # tokenize the pattern - split words into array
    sentence_words = nltk.word_tokenize(sentence)
    # stem each word - create short form for word
    sentence_words = [lemmatizer.lemmatize(word.lower()) for word in sentence_words]
    return sentence_words
# return bag of words array: 0 or 1 for each word in the bag that exists in the sentence
def bow(sentence, words, show_details=True):
    # tokenize the pattern
    sentence_words = clean_up_sentence(sentence)
    # bag of words - matrix of N words, vocabulary matrix
    bag = [0]*len(words)
    for s in sentence_words:
        for i,w in enumerate(words):
            if w == s:
                # assign 1 if current word is in the vocabulary position
                bag[i] = 1
                if show_details:
                    print ("found in bag: %s" % w)
    return(np.array(bag))
def predict_class(sentence, model):
    # filter out predictions below a threshold
    p = bow(sentence, words,show_details=False)
    res = model.predict(np.array([p]))[0]
    ERROR_THRESHOLD = 0.25
    results = [[i,r] for i,r in enumerate(res) if r>ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append({"intent": classes[r[0]], "probability": str(r[1])})
    return return_list

クラスを予測した後、インテントのリストからランダムな応答を取得します。

def getResponse(ints, intents_json):
    tag = ints[0]['intent']
    list_of_intents = intents_json['intents']
    for i in list_of_intents:
        if(i['tag']== tag):
            result = random.choice(i['responses'])
            break
    return result
def chatbot_response(text):
    ints = predict_class(text, model)
    res = getResponse(ints, intents)
    return res

次に、グラフィカル ユーザー インターフェイス (GUI) を作成します。他にも多くの便利な GUI ライブラリが付属している Tkinter ライブラリを使用してみましょう。

ユーザーのメッセージを取得し、作成したヘルパー関数を使用してボットから回答を取得し、GUI に表示します。GUI の完全なソース コードは次のとおりです。

#Creating GUI with tkinter
import tkinter
from tkinter import *
def send():
    msg = EntryBox.get("1.0",'end-1c').strip()
    EntryBox.delete("0.0",END)
    if msg != '':
        ChatLog.config(state=NORMAL)
        ChatLog.insert(END, "You: " + msg + '\n\n')
        ChatLog.config(foreground="#442265", font=("Verdana", 12 ))
        res = chatbot_response(msg)
        ChatLog.insert(END, "Bot: " + res + '\n\n')
        ChatLog.config(state=DISABLED)
        ChatLog.yview(END)
base = Tk()
base.title("Hello")
base.geometry("400x500")
base.resizable(width=FALSE, height=FALSE)
#Create Chat window
ChatLog = Text(base, bd=0, bg="white", height="8", width="50", font="Arial",)
ChatLog.config(state=DISABLED)
#Bind scrollbar to Chat window
scrollbar = Scrollbar(base, command=ChatLog.yview, cursor="heart")
ChatLog['yscrollcommand'] = scrollbar.set
#Create Button to send message
SendButton = Button(base, font=("Verdana",12,'bold'), text="Send", width="12", height=5,
                    bd=0, bg="#32de97", activebackground="#3c9d9b",fg='#ffffff',
                    command= send )
#Create the box to enter message
EntryBox = Text(base, bd=0, bg="white",width="29", height="5", font="Arial")
#EntryBox.bind("<Return>", send)
#Place all components on the screen
scrollbar.place(x=376,y=6, height=386)
ChatLog.place(x=6,y=6, height=386, width=370)
EntryBox.place(x=128, y=401, height=90, width=265)
SendButton.place(x=6, y=401, height=90)
base.mainloop()

Python チャットボットを実行する

チャットボットを実行するために、2 つの主要なファイルがあります。train_chatbot.py と chatapp.py。

まず、ターミナルで次のコマンドを使用してモデルをトレーニングします。

python train_chatbot.py


完全なソースコードをダウンロードしてください!
 

Чат-бот на Python с нуля с исходным кодом

Чат-бот — это программное обеспечение на основе искусственного интеллекта, предназначенное для общения с людьми на их естественных языках. Эти чат-боты обычно общаются с помощью слуховых или текстовых методов, и они могут легко имитировать человеческие языки, чтобы общаться с людьми по-человечески. Чат-бот, возможно, является одним из лучших приложений для обработки естественного языка.

За последние несколько лет чат-боты на Python стали очень популярны в технологическом и бизнес-секторе. Эти интеллектуальные боты настолько искусны в имитации естественного человеческого языка и общении с людьми, что компании в различных отраслях промышленности перенимают их. Кажется, что все, от фирм электронной коммерции до медицинских учреждений, используют этот отличный инструмент для получения преимуществ для бизнеса. В этой статье мы узнаем о чат- боте с использованием Python и о том, как создать чат-бота на python . 

Чтобы создать чат-бота на Python с нуля, выполните следующие действия.

  • Шаг 1: Импортируйте и загрузите файл данных
  • Шаг 2. Предварительная обработка данных
  • Шаг 3. Создайте данные для обучения и тестирования
  • Шаг 4: Постройте модель
  • Шаг 5: Предскажите реакцию

Шаг 1: Импортируйте и загрузите файл данных

Во-первых, вам нужно создать файл с именем train_chatbot.py. Мы загружаем пакеты, которые нужны нашему чат-боту, и настраиваем переменные, которые мы будем использовать в нашем проекте Python.

Файл данных находится в JSONформате, поэтому мы использовали json packageдля чтения JSONфайла в Python.

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import json
import pickle
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
from tensorflow.keras.optimizers import SGD
import random
words=[]
classes = []
documents = []
ignore_words = ['?', '!']
data_file = open('intents.json').read()
intents = json.loads(data_file)

Шаг 2. Предварительная обработка данных

Прежде чем мы сможем создать модель машинного обучения или глубокого обучения из текстовых данных, мы должны обработать данные разными способами. В зависимости от потребностей мы должны использовать различные операции для предварительной обработки данных.

Токенизация текстовых данных — это первое и самое основное, что вы можете с ними сделать. Токенизация — это процесс разбиения текста на мелкие части, такие как слова.

Здесь мы просматриваем шаблоны, используем nltk.word_tokenize()функцию, чтобы разбить предложение на слова и добавить каждое слово в список слов. Мы также составляем список классов, к которым принадлежат наши теги.

for intent in intents['intents']:
    for pattern in intent['patterns']:
        #tokenize each word
        w = nltk.word_tokenize(pattern)
        words.extend(w)
        #add documents in the corpus
        documents.append((w, intent['tag']))
        # add to our classes list
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

Теперь мы выясним, что означает каждое слово, и избавимся от всех слов, которые уже есть в списке. Лемматизация — это процесс преобразования слова в его форму леммы, а затем создание файла рассола для хранения объектов Python, которые мы будем использовать при прогнозировании.

# lemmatize, lower each word and remove duplicates
words = [lemmatizer.lemmatize(w.lower()) for w in words if w not in ignore_words]
words = sorted(list(set(words)))
# sort classes
classes = sorted(list(set(classes)))
# documents = combination between patterns and intents
print (len(documents), "documents")
# classes = intents
print (len(classes), "classes", classes)
# words = all words, vocabulary
print (len(words), "unique lemmatized words", words)
pickle.dump(words,open('words.pkl','wb'))
pickle.dump(classes,open('classes.pkl','wb'))

Шаг 3. Создайте данные для обучения и тестирования

Теперь мы создадим обучающие данные, которые будут включать как входные, так и выходные данные. Шаблон будет нашим входом, а класс, к которому принадлежит шаблон, будет нашим выходом. Но компьютер не может читать слова, поэтому мы превратим слова в числа.

# create our training data
training = []
# create an empty array for our output
output_empty = [0] * len(classes)
# training set, bag of words for each sentence
for doc in documents:
    # initialize our bag of words
    bag = []
    # list of tokenized words for the pattern
    pattern_words = doc[0]
    # lemmatize each word - create base word, in attempt to represent related words
    pattern_words = [lemmatizer.lemmatize(word.lower()) for word in pattern_words]
    # create our bag of words array with 1, if word match found in current pattern
    for w in words:
        bag.append(1) if w in pattern_words else bag.append(0)
    # output is a '0' for each tag and '1' for current tag (for each pattern)
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1
    training.append([bag, output_row])
# shuffle our features and turn into np.array
random.shuffle(training)
training = np.array(training)
# create train and test lists. X - patterns, Y - intents
train_x = list(training[:,0])
train_y = list(training[:,1])
print("Training data created")

Шаг 4: Постройте модель

Теперь, когда наши обучающие данные готовы, мы построим трехслойную глубокую нейронную сеть. Мы делаем это с помощью Kerasпоследовательного API. После обучения модели на 200 итераций она была точной на 100%. Назовем файл « chatbot model.h5» и сохраним его.

# Create model - 3 layers. First layer 128 neurons, second layer 64 neurons and 3rd output layer contains number of neurons
# equal to number of intents to predict output intent with softmax
model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))
# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
#fitting and saving the model
hist = model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)
model.save('chatbot_model.h5', hist)
print("model created")

Шаг 5: Предскажите реакцию

Чтобы предсказать предложения и получить ответ от пользователя, давайте создадим новый файл с именем « chatapp.py

Мы загрузим обученную модель, а затем с помощью графического пользовательского интерфейса предскажем реакцию бота. Модель только скажет нам, к какому классу она принадлежит, поэтому мы создадим несколько функций, которые определят класс, а затем выберут случайный ответ из списка ответов.

Опять же, мы загружаем файлы pickle ' words.pkl' и ' classes.pkl', которые мы создали при обучении нашей модели:

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import pickle
import numpy as np
from keras.models import load_model
model = load_model('chatbot_model.h5')
import json
import random
intents = json.loads(open('intents.json').read())
words = pickle.load(open('words.pkl','rb'))
classes = pickle.load(open('classes.pkl','rb'))

Чтобы предсказать класс, нам нужно будет вводить данные так же, как мы это делали во время обучения. Итак, мы создадим несколько функций, которые будут выполнять предварительную обработку текста, а затем угадывать класс.

def clean_up_sentence(sentence):
    # tokenize the pattern - split words into array
    sentence_words = nltk.word_tokenize(sentence)
    # stem each word - create short form for word
    sentence_words = [lemmatizer.lemmatize(word.lower()) for word in sentence_words]
    return sentence_words
# return bag of words array: 0 or 1 for each word in the bag that exists in the sentence
def bow(sentence, words, show_details=True):
    # tokenize the pattern
    sentence_words = clean_up_sentence(sentence)
    # bag of words - matrix of N words, vocabulary matrix
    bag = [0]*len(words)
    for s in sentence_words:
        for i,w in enumerate(words):
            if w == s:
                # assign 1 if current word is in the vocabulary position
                bag[i] = 1
                if show_details:
                    print ("found in bag: %s" % w)
    return(np.array(bag))
def predict_class(sentence, model):
    # filter out predictions below a threshold
    p = bow(sentence, words,show_details=False)
    res = model.predict(np.array([p]))[0]
    ERROR_THRESHOLD = 0.25
    results = [[i,r] for i,r in enumerate(res) if r>ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append({"intent": classes[r[0]], "probability": str(r[1])})
    return return_list

После предсказания класса мы получим случайный ответ из списка намерений.

def getResponse(ints, intents_json):
    tag = ints[0]['intent']
    list_of_intents = intents_json['intents']
    for i in list_of_intents:
        if(i['tag']== tag):
            result = random.choice(i['responses'])
            break
    return result
def chatbot_response(text):
    ints = predict_class(text, model)
    res = getResponse(ints, intents)
    return res

Теперь мы создадим графический интерфейс пользователя (GUI). Давайте воспользуемся библиотекой Tkinter, которая поставляется с множеством других полезных библиотек графического интерфейса.

Мы возьмем сообщение пользователя и используем созданные нами вспомогательные функции, чтобы получить ответ от бота и показать его в графическом интерфейсе. Вот полный исходный код GUI.

#Creating GUI with tkinter
import tkinter
from tkinter import *
def send():
    msg = EntryBox.get("1.0",'end-1c').strip()
    EntryBox.delete("0.0",END)
    if msg != '':
        ChatLog.config(state=NORMAL)
        ChatLog.insert(END, "You: " + msg + '\n\n')
        ChatLog.config(foreground="#442265", font=("Verdana", 12 ))
        res = chatbot_response(msg)
        ChatLog.insert(END, "Bot: " + res + '\n\n')
        ChatLog.config(state=DISABLED)
        ChatLog.yview(END)
base = Tk()
base.title("Hello")
base.geometry("400x500")
base.resizable(width=FALSE, height=FALSE)
#Create Chat window
ChatLog = Text(base, bd=0, bg="white", height="8", width="50", font="Arial",)
ChatLog.config(state=DISABLED)
#Bind scrollbar to Chat window
scrollbar = Scrollbar(base, command=ChatLog.yview, cursor="heart")
ChatLog['yscrollcommand'] = scrollbar.set
#Create Button to send message
SendButton = Button(base, font=("Verdana",12,'bold'), text="Send", width="12", height=5,
                    bd=0, bg="#32de97", activebackground="#3c9d9b",fg='#ffffff',
                    command= send )
#Create the box to enter message
EntryBox = Text(base, bd=0, bg="white",width="29", height="5", font="Arial")
#EntryBox.bind("<Return>", send)
#Place all components on the screen
scrollbar.place(x=376,y=6, height=386)
ChatLog.place(x=6,y=6, height=386, width=370)
EntryBox.place(x=128, y=401, height=90, width=265)
SendButton.place(x=6, y=401, height=90)
base.mainloop()

Запустите чат-бот Python

Для запуска чат-бота у нас есть два основных файла; train_chatbot.py и chatapp.py.

Сначала обучаем модель с помощью команды в терминале:

python train_chatbot.py


СКАЧАТЬ ПОЛНЫЙ ИСХОДНЫЙ КОД!

Duck Hwan

1665815400

소스 코드를 사용하여 처음부터 Python의 챗봇

챗봇은 자연어로 인간과 상호 작용하도록 설계된 AI 기반 소프트웨어입니다. 이러한 챗봇은 일반적으로 청각 또는 텍스트 방식으로 대화하며 인간의 언어를 쉽게 모방하여 인간과 같은 방식으로 인간과 의사 소통할 수 있습니다. 챗봇은 틀림없이 자연어 처리의 최고의 응용 프로그램 중 하나입니다.

지난 몇 년 동안 Python의 챗봇은 기술 및 비즈니스 부문에서 큰 인기를 얻었습니다. 이 지능형 봇은 인간의 자연스러운 언어를 모방하고 인간과 대화하는 데 매우 능숙하여 다양한 산업 분야의 기업에서 이를 채택하고 있습니다. 전자 상거래 회사에서 의료 기관에 이르기까지 모든 사람이 이 멋진 도구를 활용하여 비즈니스 이점을 얻고 있는 것 같습니다. 이 기사에서는 Python을 사용 하는 챗봇 과 Python 에서 챗봇을 만드는 방법에 대해 알아봅니다 . 

Scratch에서 Python으로 챗봇을 만들려면 다음 단계를 따릅니다.

  • 1단계: 데이터 파일 가져오기 및 로드
  • 2단계: 데이터 전처리
  • 3단계: 학습 및 테스트 데이터 생성
  • 4단계: 모델 구축
  • 5단계: 응답 예측

1단계: 데이터 파일 가져오기 및 로드

먼저 라는 파일을 만들어야 합니다 train_chatbot.py. 챗봇에 필요한 패키지를 가져오고 Python 프로젝트에서 사용할 변수를 설정합니다.

데이터 파일은 JSON형식이므로 파일을 Python json package으로 읽는 데 사용했습니다.JSON

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import json
import pickle
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
from tensorflow.keras.optimizers import SGD
import random
words=[]
classes = []
documents = []
ignore_words = ['?', '!']
data_file = open('intents.json').read()
intents = json.loads(data_file)

2단계: 데이터 전처리

텍스트 데이터에서 머신 러닝 또는 딥 러닝 모델을 만들려면 먼저 데이터를 다양한 방식으로 처리해야 합니다. 필요에 따라 다른 작업을 사용하여 데이터를 사전 처리해야 합니다.

텍스트 데이터를 토큰화하는 것은 텍스트 데이터로 할 수 있는 첫 번째이자 가장 기본적인 작업입니다. 토큰화는 텍스트를 단어와 같이 작은 조각으로 나누는 과정입니다.

여기에서는 패턴을 살펴보고 nltk.word_tokenize()함수를 사용하여 문장을 단어로 나누고 각 단어를 단어 목록에 추가합니다. 또한 태그가 속한 클래스 목록도 만듭니다.

for intent in intents['intents']:
    for pattern in intent['patterns']:
        #tokenize each word
        w = nltk.word_tokenize(pattern)
        words.extend(w)
        #add documents in the corpus
        documents.append((w, intent['tag']))
        # add to our classes list
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

이제 각 단어의 의미를 파악하고 이미 목록에 있는 단어를 제거합니다. 보조 정리는 단어를 보조 정리 형식으로 변경한 다음 예측할 때 사용할 Python 개체를 저장하기 위해 피클 파일을 만드는 프로세스입니다.

# lemmatize, lower each word and remove duplicates
words = [lemmatizer.lemmatize(w.lower()) for w in words if w not in ignore_words]
words = sorted(list(set(words)))
# sort classes
classes = sorted(list(set(classes)))
# documents = combination between patterns and intents
print (len(documents), "documents")
# classes = intents
print (len(classes), "classes", classes)
# words = all words, vocabulary
print (len(words), "unique lemmatized words", words)
pickle.dump(words,open('words.pkl','wb'))
pickle.dump(classes,open('classes.pkl','wb'))

3단계: 학습 및 테스트 데이터 생성

이제 입력과 출력을 모두 포함하는 훈련 데이터를 만들 것입니다. 패턴이 입력이 되고 패턴이 속한 클래스가 출력이 됩니다. 하지만 컴퓨터는 단어를 읽을 수 없기 때문에 우리는 단어를 숫자로 바꿀 것입니다.

# create our training data
training = []
# create an empty array for our output
output_empty = [0] * len(classes)
# training set, bag of words for each sentence
for doc in documents:
    # initialize our bag of words
    bag = []
    # list of tokenized words for the pattern
    pattern_words = doc[0]
    # lemmatize each word - create base word, in attempt to represent related words
    pattern_words = [lemmatizer.lemmatize(word.lower()) for word in pattern_words]
    # create our bag of words array with 1, if word match found in current pattern
    for w in words:
        bag.append(1) if w in pattern_words else bag.append(0)
    # output is a '0' for each tag and '1' for current tag (for each pattern)
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1
    training.append([bag, output_row])
# shuffle our features and turn into np.array
random.shuffle(training)
training = np.array(training)
# create train and test lists. X - patterns, Y - intents
train_x = list(training[:,0])
train_y = list(training[:,1])
print("Training data created")

4단계: 모델 구축

이제 훈련 데이터가 준비되었으므로 3계층 심층 신경망을 구축합니다. Keras순차 API 를 사용하여 이 작업을 수행합니다 . 200번의 반복을 위해 모델을 훈련시킨 후 100% 정확했습니다. 파일 이름을 " chatbot model.h5"로 지정하고 저장해 보겠습니다.

# Create model - 3 layers. First layer 128 neurons, second layer 64 neurons and 3rd output layer contains number of neurons
# equal to number of intents to predict output intent with softmax
model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))
# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
#fitting and saving the model
hist = model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)
model.save('chatbot_model.h5', hist)
print("model created")

5단계: 응답 예측

문장을 예측하고 사용자로부터 응답을 받아 " chatapp.py." 라는 새 파일을 만들도록 합니다.

훈련된 모델을 로드한 다음 그래픽 사용자 인터페이스를 사용하여 봇의 응답을 예측합니다. 모델은 그것이 속한 클래스만 알려주므로 클래스를 파악한 다음 응답 목록에서 임의의 응답을 선택하는 몇 가지 함수를 만들 것입니다.

다시, 우리는 모델을 훈련할 때 만든 ' words.pkl' 및 ' ' 피클 파일을 로드합니다.classes.pkl

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import pickle
import numpy as np
from keras.models import load_model
model = load_model('chatbot_model.h5')
import json
import random
intents = json.loads(open('intents.json').read())
words = pickle.load(open('words.pkl','rb'))
classes = pickle.load(open('classes.pkl','rb'))

To predict the class, we have to prepare the input the same way we did during training. So we'll write some functions that preprocess the text and then guess the class.

def clean_up_sentence(sentence):
    # tokenize the pattern - split words into array
    sentence_words = nltk.word_tokenize(sentence)
    # lemmatize each word - reduce it to its base form
    sentence_words = [lemmatizer.lemmatize(word.lower()) for word in sentence_words]
    return sentence_words
# return bag of words array: 0 or 1 for each word in the bag that exists in the sentence
def bow(sentence, words, show_details=True):
    # tokenize the pattern
    sentence_words = clean_up_sentence(sentence)
    # bag of words - matrix of N words, vocabulary matrix
    bag = [0]*len(words)
    for s in sentence_words:
        for i,w in enumerate(words):
            if w == s:
                # assign 1 if current word is in the vocabulary position
                bag[i] = 1
                if show_details:
                    print ("found in bag: %s" % w)
    return(np.array(bag))
def predict_class(sentence, model):
    # filter out predictions below a threshold
    p = bow(sentence, words,show_details=False)
    res = model.predict(np.array([p]))[0]
    ERROR_THRESHOLD = 0.25
    results = [[i,r] for i,r in enumerate(res) if r>ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append({"intent": classes[r[0]], "probability": str(r[1])})
    return return_list

After predicting the class, we get a random response from that intent's list of responses.

def getResponse(ints, intents_json):
    # guard against the case where no prediction cleared the threshold
    result = "Sorry, I don't understand."
    if not ints:
        return result
    tag = ints[0]['intent']
    list_of_intents = intents_json['intents']
    for i in list_of_intents:
        if i['tag'] == tag:
            result = random.choice(i['responses'])
            break
    return result
def chatbot_response(text):
    ints = predict_class(text, model)
    res = getResponse(ints, intents)
    return res
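
Before wiring up the GUI, it helps to sanity-check the pipeline in a plain console loop. The sketch below is only illustrative: it assumes the functions defined above are in scope and that intents.json contains at least one intent whose patterns resemble what you type.

# quick console test of the prediction pipeline (type "quit" to stop)
if __name__ == "__main__":
    while True:
        message = input("You: ")
        if message.strip().lower() == "quit":
            break
        print("Bot:", chatbot_response(message))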

Now, let's make a graphical user interface (GUI) with the Tkinter library, which ships with Python's standard library.

We take the user's message and use the helper functions we wrote to get the answer from the bot and show it in the GUI. Here is the GUI's full source code.

#Creating GUI with tkinter
import tkinter
from tkinter import *
def send():
    msg = EntryBox.get("1.0",'end-1c').strip()
    EntryBox.delete("0.0",END)
    if msg != '':
        ChatLog.config(state=NORMAL)
        ChatLog.insert(END, "You: " + msg + '\n\n')
        ChatLog.config(foreground="#442265", font=("Verdana", 12 ))
        res = chatbot_response(msg)
        ChatLog.insert(END, "Bot: " + res + '\n\n')
        ChatLog.config(state=DISABLED)
        ChatLog.yview(END)
base = Tk()
base.title("Hello")
base.geometry("400x500")
base.resizable(width=FALSE, height=FALSE)
#Create Chat window
ChatLog = Text(base, bd=0, bg="white", height="8", width="50", font="Arial",)
ChatLog.config(state=DISABLED)
#Bind scrollbar to Chat window
scrollbar = Scrollbar(base, command=ChatLog.yview, cursor="heart")
ChatLog['yscrollcommand'] = scrollbar.set
#Create Button to send message
SendButton = Button(base, font=("Verdana",12,'bold'), text="Send", width="12", height=5,
                    bd=0, bg="#32de97", activebackground="#3c9d9b",fg='#ffffff',
                    command= send )
#Create the box to enter message
EntryBox = Text(base, bd=0, bg="white",width="29", height="5", font="Arial")
#EntryBox.bind("<Return>", send)
#Place all components on the screen
scrollbar.place(x=376,y=6, height=386)
ChatLog.place(x=6,y=6, height=386, width=370)
EntryBox.place(x=128, y=401, height=90, width=265)
SendButton.place(x=6, y=401, height=90)
base.mainloop()

Run the Python chatbot

To run the chatbot, we have two main files: train_chatbot.py and chatapp.py.

First, we train the model using the following command in the terminal:

python train_chatbot.py
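
Once training finishes and chatbot_model.h5 has been written, start the GUI from the same directory:

python chatapp.py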


Download the complete source code!




Michio JP

Michio JP

1665799396

Chatbot in Python From Scratch with Source Code

A chatbot is an AI-based software designed to interact with humans in their natural languages. These chatbots usually converse via auditory or textual methods, and they can effortlessly mimic human language to communicate with human beings in a human-like manner. A chatbot is arguably one of the best applications of natural language processing.

In the past few years, chatbots in Python have become wildly popular in the tech and business sectors. These intelligent bots are so adept at imitating natural human languages and conversing with humans that companies across various industrial sectors are adopting them. From e-commerce firms to healthcare institutions, everyone seems to be leveraging this nifty tool to drive business benefits. In this article, we will learn about chatbots in Python and how to make a chatbot in Python.

To create a chatbot in Python from scratch, we follow these steps.

  • Step 1: Import and load the data file
  • Step 2: Preprocess data
  • Step 3: Create training and testing data
  • Step 4: Build the model
  • Step 5: Predict the response

Step 1: Import and load the data file

First, you need to make a file called train_chatbot.py. We bring in the packages our chatbot needs and set up the variables we will use in our Python project.

The data file is in the JSON format, so we used the json package to read the JSON file into Python.

import json
import pickle
import random

import nltk
import numpy as np
from nltk.stem import WordNetLemmatizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import SGD

# download the NLTK data used below (no-op if it is already present)
nltk.download('punkt')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
words = []
classes = []
documents = []
ignore_words = ['?', '!']
data_file = open('intents.json').read()
intents = json.loads(data_file)
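
The tutorial never shows intents.json itself, but the code above and below expects a top-level "intents" list whose entries carry "tag", "patterns" and "responses" keys. A minimal, made-up example of that structure (the tags, patterns and responses are purely illustrative) looks like this:

{"intents": [
    {"tag": "greeting",
     "patterns": ["Hi", "Hello", "How are you?"],
     "responses": ["Hello!", "Hi there, how can I help?"]},
    {"tag": "goodbye",
     "patterns": ["Bye", "See you later"],
     "responses": ["Goodbye!", "Talk to you soon."]}
]}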

Step 2: Preprocess data

Before we can make a machine learning or deep learning model from text data, we have to process the data in different ways. Depending on the needs, we have to use different operations to preprocess the data.

Tokenizing text data is the first and most basic thing you can do with it. Tokenizing is the process of breaking a text into small pieces, like words.

Here, we go through the patterns, use the nltk.word_tokenize() function to break the sentence into words, and add each word to the words list. We also make a list of the classes our tags belong to.

for intent in intents['intents']:
    for pattern in intent['patterns']:
        #tokenize each word
        w = nltk.word_tokenize(pattern)
        words.extend(w)
        #add documents in the corpus
        documents.append((w, intent['tag']))
        # add to our classes list
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

Now we lemmatize each word (reduce it to its base, dictionary form), lowercase it, and drop duplicates and the ignored characters. We then create pickle files to store the Python objects (the vocabulary and the class list) that we will need again at prediction time.

# lemmatize, lower each word and remove duplicates
words = [lemmatizer.lemmatize(w.lower()) for w in words if w not in ignore_words]
words = sorted(list(set(words)))
# sort classes
classes = sorted(list(set(classes)))
# documents = combination between patterns and intents
print (len(documents), "documents")
# classes = intents
print (len(classes), "classes", classes)
# words = all words, vocabulary
print (len(words), "unique lemmatized words", words)
pickle.dump(words,open('words.pkl','wb'))
pickle.dump(classes,open('classes.pkl','wb'))

Step 3: Create training and testing data

Now, we’ll make the training data, which includes both the inputs and the outputs. The pattern is our input, and the class that pattern belongs to is our output. Since a neural network cannot work with raw words, we encode each pattern as a bag-of-words vector of 0s and 1s.

# create our training data
training = []
# create an empty array for our output
output_empty = [0] * len(classes)
# training set, bag of words for each sentence
for doc in documents:
    # initialize our bag of words
    bag = []
    # list of tokenized words for the pattern
    pattern_words = doc[0]
    # lemmatize each word - create base word, in attempt to represent related words
    pattern_words = [lemmatizer.lemmatize(word.lower()) for word in pattern_words]
    # create our bag of words array with 1, if word match found in current pattern
    for w in words:
        bag.append(1 if w in pattern_words else 0)
    # output is a '0' for each tag and '1' for current tag (for each pattern)
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1
    training.append([bag, output_row])
# shuffle our features and turn into np.array
random.shuffle(training)
training = np.array(training, dtype=object)  # dtype=object because bag and output_row have different lengths
# create train and test lists. X - patterns, Y - intents
train_x = list(training[:,0])
train_y = list(training[:,1])
print("Training data created")

Step 4: Build the model

Now that our training data is ready, we will build a 3-layer deep neural network with the Keras sequential API. After training for 200 epochs, the model reached 100% accuracy on the training data. Let’s name the file “chatbot_model.h5” and save it.

# Create model - 3 layers. First layer 128 neurons, second layer 64 neurons and 3rd output layer contains number of neurons
# equal to number of intents to predict output intent with softmax
model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))
# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model.
# (Older Keras versions spelled this SGD(lr=0.01, decay=1e-6, ...); recent releases use learning_rate= and drop decay=.)
sgd = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
# fit and save the model; fit() returns the training History
hist = model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)
model.save('chatbot_model.h5')
print("model created")

Step 5: Predict the response

To predict sentences and generate responses to the user, let us create a new file named “chatapp.py”.

We will load the trained model and then use a graphical user interface to predict the bot’s response. The model will only tell us what class it belongs to, so we will make some functions that will figure out the class and then pick a random response from the list of responses.

Again, we load the 'words.pkl' and 'classes.pkl' pickle files that we made when we trained our model:

import json
import pickle
import random

import nltk
import numpy as np
from nltk.stem import WordNetLemmatizer
from tensorflow.keras.models import load_model

lemmatizer = WordNetLemmatizer()
model = load_model('chatbot_model.h5')
intents = json.loads(open('intents.json').read())
words = pickle.load(open('words.pkl','rb'))
classes = pickle.load(open('classes.pkl','rb'))

To predict the class, we will have to give input the same way we did during training. So, we’ll make some functions that will do preprocessing on the text and then guess the class.

def clean_up_sentence(sentence):
    # tokenize the pattern - split words into array
    sentence_words = nltk.word_tokenize(sentence)
    # lemmatize each word - reduce it to its base form
    sentence_words = [lemmatizer.lemmatize(word.lower()) for word in sentence_words]
    return sentence_words
# return bag of words array: 0 or 1 for each word in the bag that exists in the sentence
def bow(sentence, words, show_details=True):
    # tokenize the pattern
    sentence_words = clean_up_sentence(sentence)
    # bag of words - matrix of N words, vocabulary matrix
    bag = [0]*len(words)
    for s in sentence_words:
        for i,w in enumerate(words):
            if w == s:
                # assign 1 if current word is in the vocabulary position
                bag[i] = 1
                if show_details:
                    print ("found in bag: %s" % w)
    return np.array(bag)
def predict_class(sentence, model):
    # filter out predictions below a threshold
    p = bow(sentence, words,show_details=False)
    res = model.predict(np.array([p]))[0]
    ERROR_THRESHOLD = 0.25
    results = [[i,r] for i,r in enumerate(res) if r>ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append({"intent": classes[r[0]], "probability": str(r[1])})
    return return_list

After predicting the class, we’ll get a random response from the list of intents.

def getResponse(ints, intents_json):
    # fall back to a default reply if no prediction cleared the threshold
    result = "Sorry, I don't understand."
    if not ints:
        return result
    tag = ints[0]['intent']
    list_of_intents = intents_json['intents']
    for i in list_of_intents:
        if i['tag'] == tag:
            result = random.choice(i['responses'])
            break
    return result
def chatbot_response(text):
    ints = predict_class(text, model)
    res = getResponse(ints, intents)
    return res

Now, we’ll make a graphical user interface (GUI). Let’s use the Tkinter library, which ships with Python’s standard library.

We’ll take the user’s message and use the helper functions we’ve made to get the answer from the bot and show it on the GUI. Here is the GUI’s full source code.

#Creating GUI with tkinter
from tkinter import *

def send():
    msg = EntryBox.get("1.0", 'end-1c').strip()
    # Text-widget indices start at "1.0"
    EntryBox.delete("1.0", END)
    if msg != '':
        ChatLog.config(state=NORMAL)
        ChatLog.insert(END, "You: " + msg + '\n\n')
        ChatLog.config(foreground="#442265", font=("Verdana", 12 ))
        res = chatbot_response(msg)
        ChatLog.insert(END, "Bot: " + res + '\n\n')
        ChatLog.config(state=DISABLED)
        ChatLog.yview(END)
base = Tk()
base.title("Hello")
base.geometry("400x500")
base.resizable(width=FALSE, height=FALSE)
#Create Chat window
ChatLog = Text(base, bd=0, bg="white", height="8", width="50", font="Arial",)
ChatLog.config(state=DISABLED)
#Bind scrollbar to Chat window
scrollbar = Scrollbar(base, command=ChatLog.yview, cursor="heart")
ChatLog['yscrollcommand'] = scrollbar.set
#Create Button to send message
SendButton = Button(base, font=("Verdana",12,'bold'), text="Send", width="12", height=5,
                    bd=0, bg="#32de97", activebackground="#3c9d9b",fg='#ffffff',
                    command= send )
#Create the box to enter message
EntryBox = Text(base, bd=0, bg="white",width="29", height="5", font="Arial")
#EntryBox.bind("<Return>", send)
#Place all components on the screen
scrollbar.place(x=376,y=6, height=386)
ChatLog.place(x=6,y=6, height=386, width=370)
EntryBox.place(x=128, y=401, height=90, width=265)
SendButton.place(x=6, y=401, height=90)
base.mainloop()

Run Python Chatbot

To run the chatbot, we have two main files: train_chatbot.py and chatapp.py.

First, we train the model using the command in the terminal:

python train_chatbot.py
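
Once training completes and chatbot_model.h5 has been saved, launch the GUI with:

python chatapp.py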


DOWNLOAD COMPLETE SOURCE CODE!

#python 


A Small Library for Converting tokenized PHP Source Code into XML

Tokenizer

A small library for converting tokenized PHP source code into XML.

Installation

You can add this library as a local, per-project dependency to your project using Composer:

composer require theseer/tokenizer

If you only need this library during development, for instance to run your project's test suite, then you should add it as a development-time dependency:

composer require --dev theseer/tokenizer

Usage examples

$tokenizer = new TheSeer\Tokenizer\Tokenizer();
$tokens = $tokenizer->parse(file_get_contents(__DIR__ . '/src/XMLSerializer.php'));

$serializer = new TheSeer\Tokenizer\XMLSerializer();
$xml = $serializer->toXML($tokens);

echo $xml;

The generated XML structure looks something like this:

<?xml version="1.0"?>
<source xmlns="https://github.com/theseer/tokenizer">
 <line no="1">
  <token name="T_OPEN_TAG">&lt;?php </token>
  <token name="T_DECLARE">declare</token>
  <token name="T_OPEN_BRACKET">(</token>
  <token name="T_STRING">strict_types</token>
  <token name="T_WHITESPACE"> </token>
  <token name="T_EQUAL">=</token>
  <token name="T_WHITESPACE"> </token>
  <token name="T_LNUMBER">1</token>
  <token name="T_CLOSE_BRACKET">)</token>
  <token name="T_SEMICOLON">;</token>
 </line>
</source>

Download Details:

Author: Theseer
Source Code: https://github.com/theseer/tokenizer 
License: View license

#php #xml #tokenize 


Tokenize.jl: Tokenization for Julia Source Code

Tokenize

Tokenize is a Julia package that serves a similar purpose to, and has a similar API as, the tokenize module in Python, but for Julia: it takes a string or buffer containing Julia code, performs lexical analysis, and returns a stream of tokens.

The goals of this package are to be:

  • Fast: it currently lexes all Julia source files in ~0.25 seconds (580 files, 2 million tokens).
  • Round-trippable: from a stream of tokens, the original string should be recoverable exactly.
  • Non-error-throwing: instead of throwing errors, a special error token is returned.

API

Tokenization

The function tokenize is the main entrypoint for generating Tokens. It takes a string or a buffer and creates an iterator that will sequentially return the next Token until the end of string or buffer. The argument to tokenize can either be a String, IOBuffer or an IOStream.

julia> collect(tokenize("function f(x) end"))
 1,1-1,8          KEYWORD        "function"
 1,9-1,9          WHITESPACE     " "
 1,10-1,10        IDENTIFIER     "f"
 1,11-1,11        LPAREN         "("
 1,12-1,12        IDENTIFIER     "x"
 1,13-1,13        RPAREN         ")"
 1,14-1,14        WHITESPACE     " "
 1,15-1,17        KEYWORD        "end"
 1,18-1,17        ENDMARKER      ""

Tokens

Each Token is represented by where it starts and ends, what string it contains and what type it is.

The API for a Token (not exported from the Tokenize.Tokens module) is:

startpos(t)::Tuple{Int, Int} # row and column where the token starts
endpos(t)::Tuple{Int, Int}   # row and column where the token ends
startbyte(t)::Int            # byte offset where the token starts
endbyte(t)::Int              # byte offset where the token ends
untokenize(t)::String        # string representation of the token
kind(t)::Token.Kind          # kind of the token
exactkind(t)::Token.Kind     # exact kind of the token

The difference between kind and exactkind is that kind returns OP for all operators and KEYWORD for all keywords, while exactkind returns a unique kind for each different operator and keyword, e.g.:

julia> tok = collect(tokenize("⇒"))[1];

julia> Tokens.kind(tok)
OP::Tokenize.Tokens.Kind = 90

julia> Tokens.exactkind(tok)
RIGHTWARDS_DOUBLE_ARROW::Tokenize.Tokens.Kind = 128

All the different Token.Kind can be seen in the token_kinds.jl file

Download Details:

Author: JuliaLang
Source Code: https://github.com/JuliaLang/Tokenize.jl/ 
License: View license

#julia #tokenize

Royce  Reinger

Royce Reinger

1659368242

A Multilingual tokenizer To Split A String into Tokens

Pragmatic Tokenizer

Pragmatic Tokenizer is a multilingual tokenizer to split a string into tokens.

Installation

Ruby

gem install pragmatic_tokenizer

Ruby on Rails

Add this line to your application's Gemfile:

gem 'pragmatic_tokenizer'

Usage

  • If no language is specified, the library will default to English.
  • To specify a language use its two character ISO 639-1 code.
  • Pragmatic Tokenizer will unescape any HTML entities.

Example Usage

text = "\"I said, 'what're you? Crazy?'\" said Sandowsky. \"I can't afford to do that.\""

PragmaticTokenizer::Tokenizer.new.tokenize(text)
# => ["\"", "i", "said", ",", "'", "what're", "you", "?", "crazy", "?", "'", "\"", "said", "sandowsky", ".", "\"", "i", "can't", "afford", "to", "do", "that", ".", "\""]

# You can pass many different options to #initialize:
options = {
  language:            :en, # the language of the string you are tokenizing
  abbreviations:       ['a.b', 'a'], # a user-supplied array of abbreviations (downcased with ending period removed)
  stop_words:          ['is', 'the'], # a user-supplied array of stop words (downcased)
  remove_stop_words:   true, # remove stop words
  contractions:        { "i'm" => "i am" }, # a user-supplied hash of contractions (key is the contracted form; value is the expanded form - both the key and value should be downcased)
  expand_contractions: true, # (i.e. ["isn't"] will change to two tokens ["is", "not"])
  filter_languages:    [:en, :de], # process abbreviations, contractions and stop words for this array of languages
  punctuation:         :none, # see below for more details
  numbers:             :none, # see below for more details
  remove_emoji:        :true, # remove any emoji tokens
  remove_urls:         :true, # remove any urls
  remove_emails:       :true, # remove any emails
  remove_domains:      :true, # remove any domains
  hashtags:            :keep_and_clean, # remove the hashtag prefix
  mentions:            :keep_and_clean, # remove the @ prefix
  clean:               true, # remove some special characters
  classic_filter:      true, # removes dots from acronyms and 's from the end of tokens
  downcase:            false, # do not downcase tokens
  minimum_length:      3, # remove any tokens less than 3 characters
  long_word_split:     10 # split tokens longer than 10 characters at hyphens or underscores
}

Options

language

default = 'en'

  • To specify a language use its two character ISO 639-1 code as a symbol (i.e. :en) or string (i.e. 'en')

abbreviations

default = nil

  • You can pass an array of abbreviations to override or complement the abbreviations that come stored in this gem. Each element of the array should be a downcased String with the ending period removed.

stop_words

default = nil

  • You can pass an array of stop words to override or complement the stop words that come stored in this gem. Each element of the array should be a downcased String.

contractions

default = nil

  • You can pass a hash of contractions to override or complement the contractions that come stored in this gem. Each key is the contracted form downcased and each value is the expanded form downcased.

remove_stop_words

default = false

  • true
    Removes all stop words.
  • false
    Does not remove stop words.

expand_contractions

default = false

  • true
    Expands contractions (i.e. i'll -> i will).
  • false
    Leaves contractions as is.

filter_languages

default = nil

  • You can pass an array of languages for which abbreviations, stop words and contractions should be processed. These languages can be independent of the language of the string you are tokenizing (for example, your text might be German but contain some English stop words that you want removed). If you supply your own abbreviations, stop words or contractions, they will be merged with those of any languages you add in this option. You can pass an array of symbols or strings (e.g. [:en, :de] or ['en', 'de']); see the sketch below.
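
A hedged illustration of the German-text-with-English-stop-words scenario mentioned above (the option values and sample sentence are illustrative):

PragmaticTokenizer::Tokenizer.new(
  language:          :de,
  filter_languages:  [:en, :de],
  remove_stop_words: true
).tokenize("Das Meeting war the best")
# expected: stop words from both the German and English lists are removed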

punctuation

default = 'all'

  • :all
    Does not remove any punctuation from the result.
  • :semi
    Removes full stops (i.e. periods) ['。', '.', '.'].
  • :none
    Removes all punctuation from the result.
  • :only
    Removes everything except punctuation. The returned result is an array of only the punctuation.

numbers

default = 'all'

  • :all
    Does not remove any numbers from the result
  • :semi
    Removes tokens that include only digits
  • :none
    Removes all tokens that include a number from the result (including Roman numerals)
  • :only
    Removes everything except tokens that include a number
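
A short sketch of the :punctuation and :numbers settings described above (the expected outputs are indicative, not verbatim):

text = "Call me at 5, ok?"

PragmaticTokenizer::Tokenizer.new(punctuation: :only).tokenize(text)
# expected: only the punctuation tokens, e.g. [",", "?"]

PragmaticTokenizer::Tokenizer.new(numbers: :none).tokenize(text)
# expected: any token containing a number ("5") is removed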

remove_emoji

default = false

  • true
    Removes any token that contains an emoji.
  • false
    Leaves tokens as is.

remove_urls

default = false

  • true
    Removes any token that contains a URL.
  • false
    Leaves tokens as is.

remove_emails

default = false

  • true
    Removes any token that contains an email address.
  • false
    Leaves tokens as is.

remove_domains

default = false

  • true
    Removes any token that contains a domain.
  • false
    Leaves tokens as is.

clean

default = false

  • true
    Removes tokens consisting of only hyphens, underscores, or periods as well as some special characters (®, ©, ™). Also removes long tokens or tokens with a backslash.
  • false
    Leaves tokens as is.

hashtags

default = :keep_original

  • :keep_original
    Does not alter the token at all.
  • :keep_and_clean
    Removes the hashtag (#) prefix from the token.
  • :remove
    Removes the token completely.

mentions

default = :keep_original

  • :keep_original
    Does not alter the token at all.
  • :keep_and_clean
    Removes the mention (@) prefix from the token.
  • :remove
    Removes the token completely.
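
A sketch of the :hashtags and :mentions settings described above (output shown is indicative):

PragmaticTokenizer::Tokenizer.new(
  hashtags: :keep_and_clean,
  mentions: :remove
).tokenize("Thanks @alice for the #RubyTips!")
# expected: the "#" prefix is stripped from "rubytips" and the "@alice" token is dropped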

classic_filter

default = false

  • true
    Removes dots from acronyms and 's from the end of tokens.
  • false
    Leaves tokens as is.

downcase

default = true

  • true
    Downcases all tokens.
  • false
    Leaves tokens as is.

minimum_length

default = 0
The minimum number of characters a token must have; shorter tokens are removed.


long_word_split

default = nil
The number of characters after which a token is split at hyphens or underscores.
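
A sketch of :minimum_length and :long_word_split, per the descriptions above (outputs indicative):

PragmaticTokenizer::Tokenizer.new(minimum_length: 3).tokenize("I am on my way")
# expected: tokens shorter than 3 characters ("i", "am", "on", "my") are removed

PragmaticTokenizer::Tokenizer.new(long_word_split: 5).tokenize("state-of-the-art design")
# expected: hyphenated tokens longer than 5 characters are split at the hyphens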

Language Support

The following lists the current level of support for different languages. Pull requests or help for any languages that are not fully supported would be greatly appreciated.

N.B. - contractions might not be applicable for all languages below - in that case the CONTRACTIONS hash should stay empty.

English

Specs: Yes
Abbreviations: Yes
Stop Words: Yes
Contractions: Yes

Arabic

Specs: No
Abbreviations: Yes
Stop Words: Yes
Contractions: No

Bulgarian

Specs: More needed
Abbreviations: Yes
Stop Words: Yes
Contractions: No

Catalan

Specs: No
Abbreviations: No
Stop Words: Yes
Contractions: No

Czech

Specs: No
Abbreviations: No
Stop Words: Yes
Contractions: No

Danish

Specs: No
Abbreviations: No
Stop Words: Yes
Contractions: No

Deutsch

Specs: More needed
Abbreviations: Yes
Stop Words: Yes
Contractions: No

Finnish

Specs: No
Abbreviations: No
Stop Words: Yes
Contractions: No

French

Specs: More needed
Abbreviations: Yes
Stop Words: Yes
Contractions: No

Greek

Specs: No
Abbreviations: No
Stop Words: Yes
Contractions: No

Indonesian

Specs: No
Abbreviations: No
Stop Words: Yes
Contractions: No

Italian

Specs: No
Abbreviations: Yes
Stop Words: Yes
Contractions: No

Latvian

Specs: No
Abbreviations: No
Stop Words: Yes
Contractions: No

Norwegian

Specs: No
Abbreviations: No
Stop Words: Yes
Contractions: No

Persian

Specs: No
Abbreviations: No
Stop Words: Yes
Contractions: No

Polish

Specs: No
Abbreviations: Yes
Stop Words: Yes
Contractions: No

Portuguese

Specs: No
Abbreviations: No
Stop Words: Yes
Contractions: No

Romanian

Specs: No
Abbreviations: No
Stop Words: Yes
Contractions: No

Russian

Specs: No
Abbreviations: Yes
Stop Words: Yes
Contractions: No

Slovak

Specs: No
Abbreviations: No
Stop Words: Yes
Contractions: No

Spanish

Specs: No
Abbreviations: Yes
Stop Words: Yes
Contractions: Yes

Swedish

Specs: No
Abbreviations: No
Stop Words: Yes
Contractions: No

Turkish

Specs: No
Abbreviations: No
Stop Words: Yes
Contractions: No


Contributing

  1. Fork it ( https://github.com/diasks2/pragmatic_tokenizer/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

Download Details: 

Author: Diasks2
Source Code: https://github.com/diasks2/pragmatic_tokenizer 
License: MIT license

#ruby #tokenize 

Royce  Reinger

Royce Reinger

1658763196

Naive Bayes Text Classification Implementation As OmniCat

OmniCat Bayes 

A Naive Bayes text classification implementation as an OmniCat classifier strategy.

Installation

Add this line to your application's Gemfile:

gem 'omnicat-bayes'

And then execute:

$ bundle

Or install it yourself as:

$ gem install omnicat-bayes

Usage

See rdoc for detailed usage.

Configurations

Optional configuration sample:

OmniCat.configure do |config|
  # you can enable auto train mode by :unique or :continues
  # unique: only uniq docs will be added to training docs on prediction
  # continues: always add docs to training docs on prediction
  config.auto_train = :off
  config.exclude_tokens = ['something', 'anything'] # exclude token list
  config.token_patterns = {
    # exclude tokens with Regex patterns
    minus: [/[\s\t\n\r]+/, /(@[\w\d]+)/],
    # include tokens with Regex patterns
    plus: [/[\p{L}\-0-9]{2,}/, /[\!\?]/, /[\:\)\(\;\-\|]{2,3}/]
  }
end

Bayes classifier

Create a classifier object with Bayes strategy.

# If you need to change the strategy at runtime, prefer this initialization
bayes = OmniCat::Classifier.new(OmniCat::Classifiers::Bayes.new)

or

# If you only need Bayes classification, you can use
bayes = OmniCat::Classifiers::Bayes.new

Create categories

Create a classification category.

bayes.add_category('positive')
bayes.add_category('negative')

Train

Train category with a document.

bayes.train('positive', 'great if you are in a slap happy mood .')
bayes.train('negative', 'bad tracking issue')

Untrain

Untrain category with a document.

bayes.untrain('positive', 'great if you are in a slap happy mood .')
bayes.untrain('negative', 'bad tracking issue')

Train batch

Train category with multiple documents.

bayes.train_batch('positive', [
  'a feel-good picture in the best sense of the term...',
  'it is a feel-good movie about which you can actually feel good.',
  'love and money both of them are good choises'
])
bayes.train_batch('negative', [
  'simplistic , silly and tedious .',
  'interesting , but not compelling . ',
  'seems clever but not especially compelling'
])

Untrain batch

Untrain category with multiple documents.

bayes.untrain_batch('positive', [
  'a feel-good picture in the best sense of the term...',
  'it is a feel-good movie about which you can actually feel good.',
  'love and money both of them are good choises'
])
bayes.untrain_batch('negative', [
  'simplistic , silly and tedious .',
  'interesting , but not compelling . ',
  'seems clever but not especially compelling'
])

Classify

Classify a document.

result = bayes.classify('I feel so good and happy')
=> #<OmniCat::Result:0x007febb152af68 @top_score_key="positive", @scores={"positive"=>#<OmniCat::Score:0x007febb152add8 @key="positive", @value=6.813226744186048e-09, @percentage=58>, "negative"=>#<OmniCat::Score:0x007febb152ac70 @key="negative", @value=4.875003449064939e-09, @percentage=42>}, @total_score=1.1688230193250986e-08>
result.to_hash
=> {:top_score_key=>"positive", :scores=>{"positive"=>{:key=>"positive", :value=>6.813226744186048e-09, :percentage=>58}, "negative"=>{:key=>"negative", :value=>4.875003449064939e-09, :percentage=>42}}, :total_score=>1.1688230193250986e-08}
result.top_score
=> #<OmniCat::Score:0x007febb152add8 @key="positive", @value=6.813226744186048e-09, @percentage=58>
result.top_score.to_hash
=> {:key=>"positive", :value=>6.813226744186048e-09, :percentage=>58}

Classify batch

Classify multiple documents at a time.

results = bayes.classify_batch(
  [
    'the movie is silly so not compelling enough',
    'a good piece of work'
  ]
)
=> [#<OmniCat::Result:0x007febb14f3680 @top_score_key="negative", @scores={"positive"=>#<OmniCat::Score:0x007febb14f34a0 @key="positive", @value=7.971480930520432e-14, @percentage=22>, "negative"=>#<OmniCat::Score:0x007febb14f32c0 @key="negative", @value=2.834304330851709e-13, @percentage=78>}, @total_score=3.6314524239037524e-13>, #<OmniCat::Result:0x007febb14f2aa0 @top_score_key="positive", @scores={"positive"=>#<OmniCat::Score:0x007febb14f2960 @key="positive", @value=3.802731206057328e-07, @percentage=72>, "negative"=>#<OmniCat::Score:0x007febb14f2820 @key="negative", @value=1.4625010347194818e-07, @percentage=28>}, @total_score=5.26523224077681e-07>]

Convert to hash

Convert full Bayes object to hash.

# For storing and restoring model data
bayes_hash = bayes.to_hash
=> {:categories=>{"positive"=>{:doc_count=>4, :docs=>{"28fd29bbf840c86db65e510ff3cd07a9"=>{:content=>"great if you are in a slap happy mood .", :content_md5=>"28fd29bbf840c86db65e510ff3cd07a9", :count=>1, :tokens=>{"great"=>1, "if"=>1, "you"=>1, "are"=>1, "in"=>1, "slap"=>1, "happy"=>1, "mood"=>1}}, "82b4cd9513f448dea0024f2d0e2ccd44"=>{:content=>"a feel-good picture in the best sense of the term...", :content_md5=>"82b4cd9513f448dea0024f2d0e2ccd44", :count=>1, :tokens=>{"feel-good"=>1, "picture"=>1, "in"=>1, "the"=>2, "best"=>1, "sense"=>1, "of"=>1, "term"=>1}}, "f917bf1cf1256c78c5436d850dab3104"=>{:content=>"it is a feel-good movie about which you can actually feel good.", :content_md5=>"f917bf1cf1256c78c5436d850dab3104", :count=>1, :tokens=>{"it"=>1, "is"=>1, "feel-good"=>1, "movie"=>1, "about"=>1, "which"=>1, "you"=>1, "can"=>1, "actually"=>1, "feel"=>1, "good"=>1}}, "4343bbe84c035733708c3f58136f321e"=>{:content=>"love and money both of them are good choises", :content_md5=>"4343bbe84c035733708c3f58136f321e", :count=>1, :tokens=>{"love"=>1, "and"=>1, "money"=>1, "both"=>1, "of"=>1, "them"=>1, "are"=>1, "good"=>1, "choises"=>1}}}, :name=>"positive", :tokens=>{"great"=>1, "if"=>1, "you"=>2, "are"=>2, "in"=>2, "slap"=>1, "happy"=>1, "mood"=>1, "feel-good"=>2, "picture"=>1, "the"=>2, "best"=>1, "sense"=>1, "of"=>2, "term"=>1, "it"=>1, "is"=>1, "movie"=>1, "about"=>1, "which"=>1, "can"=>1, "actually"=>1, "feel"=>1, "good"=>2, "love"=>1, "and"=>1, "money"=>1, "both"=>1, "them"=>1, "choises"=>1}, :token_count=>37, :prior=>0.5}, "negative"=>{:doc_count=>4, :docs=>{"89b36e774579662591ea21b3283d9b35"=>{:content=>"bad tracking issue", :content_md5=>"89b36e774579662591ea21b3283d9b35", :count=>1, :tokens=>{"bad"=>1, "tracking"=>1, "issue"=>1}}, "b0ec48bc87527e285b26d6cce8e278e7"=>{:content=>"simplistic , silly and tedious .", :content_md5=>"b0ec48bc87527e285b26d6cce8e278e7", :count=>1, :tokens=>{"simplistic"=>1, "silly"=>1, "and"=>1, "tedious"=>1}}, "ae9d4fbaf40906614ca712a888648c5f"=>{:content=>"interesting , but not compelling . ", :content_md5=>"ae9d4fbaf40906614ca712a888648c5f", :count=>1, :tokens=>{"interesting"=>1, "but"=>1, "not"=>1, "compelling"=>1}}, "0e495f5d88d8049746a1b6961bf3cc90"=>{:content=>"seems clever but not especially compelling", :content_md5=>"0e495f5d88d8049746a1b6961bf3cc90", :count=>1, :tokens=>{"seems"=>1, "clever"=>1, "but"=>1, "not"=>1, "especially"=>1, "compelling"=>1}}}, :name=>"negative", :tokens=>{"bad"=>1, "tracking"=>1, "issue"=>1, "simplistic"=>1, "silly"=>1, "and"=>1, "tedious"=>1, "interesting"=>1, "but"=>2, "not"=>2, "compelling"=>2, "seems"=>1, "clever"=>1, "especially"=>1}, :token_count=>17, :prior=>0.5}}, :category_count=>2, :category_size_limit=>0, :doc_count=>8, :token_count=>54, :unique_token_count=>43, :k_value=>1.0}

Load from hash

Load full Bayes object from hash.

another_bayes_obj = OmniCat::Classifiers::Bayes.new(bayes_hash)
=> #<OmniCat::Classifiers::Bayes:0x007febb14d15a8 @categories={"positive"=>#<OmniCat::Classifiers::BayesInternals::Category:0x007febb14d1530 @doc_count=4, @docs={"28fd29bbf840c86db65e510ff3cd07a9"=>{:content=>"great if you are in a slap happy mood .", :content_md5=>"28fd29bbf840c86db65e510ff3cd07a9", :count=>1, :tokens=>{"great"=>1, "if"=>1, "you"=>1, "are"=>1, "in"=>1, "slap"=>1, "happy"=>1, "mood"=>1}}, "82b4cd9513f448dea0024f2d0e2ccd44"=>{:content=>"a feel-good picture in the best sense of the term...", :content_md5=>"82b4cd9513f448dea0024f2d0e2ccd44", :count=>1, :tokens=>{"feel-good"=>1, "picture"=>1, "in"=>1, "the"=>2, "best"=>1, "sense"=>1, "of"=>1, "term"=>1}}, "f917bf1cf1256c78c5436d850dab3104"=>{:content=>"it is a feel-good movie about which you can actually feel good.", :content_md5=>"f917bf1cf1256c78c5436d850dab3104", :count=>1, :tokens=>{"it"=>1, "is"=>1, "feel-good"=>1, "movie"=>1, "about"=>1, "which"=>1, "you"=>1, "can"=>1, "actually"=>1, "feel"=>1, "good"=>1}}, "4343bbe84c035733708c3f58136f321e"=>{:content=>"love and money both of them are good choises", :content_md5=>"4343bbe84c035733708c3f58136f321e", :count=>1, :tokens=>{"love"=>1, "and"=>1, "money"=>1, "both"=>1, "of"=>1, "them"=>1, "are"=>1, "good"=>1, "choises"=>1}}}, @name="positive", @tokens={"great"=>1, "if"=>1, "you"=>2, "are"=>2, "in"=>2, "slap"=>1, "happy"=>1, "mood"=>1, "feel-good"=>2, "picture"=>1, "the"=>2, "best"=>1, "sense"=>1, "of"=>2, "term"=>1, "it"=>1, "is"=>1, "movie"=>1, "about"=>1, "which"=>1, "can"=>1, "actually"=>1, "feel"=>1, "good"=>2, "love"=>1, "and"=>1, "money"=>1, "both"=>1, "them"=>1, "choises"=>1}, @token_count=37, @prior=0.5>, "negative"=>#<OmniCat::Classifiers::BayesInternals::Category:0x007febb14d14e0 @doc_count=4, @docs={"89b36e774579662591ea21b3283d9b35"=>{:content=>"bad tracking issue", :content_md5=>"89b36e774579662591ea21b3283d9b35", :count=>1, :tokens=>{"bad"=>1, "tracking"=>1, "issue"=>1}}, "b0ec48bc87527e285b26d6cce8e278e7"=>{:content=>"simplistic , silly and tedious .", :content_md5=>"b0ec48bc87527e285b26d6cce8e278e7", :count=>1, :tokens=>{"simplistic"=>1, "silly"=>1, "and"=>1, "tedious"=>1}}, "ae9d4fbaf40906614ca712a888648c5f"=>{:content=>"interesting , but not compelling . ", :content_md5=>"ae9d4fbaf40906614ca712a888648c5f", :count=>1, :tokens=>{"interesting"=>1, "but"=>1, "not"=>1, "compelling"=>1}}, "0e495f5d88d8049746a1b6961bf3cc90"=>{:content=>"seems clever but not especially compelling", :content_md5=>"0e495f5d88d8049746a1b6961bf3cc90", :count=>1, :tokens=>{"seems"=>1, "clever"=>1, "but"=>1, "not"=>1, "especially"=>1, "compelling"=>1}}}, @name="negative", @tokens={"bad"=>1, "tracking"=>1, "issue"=>1, "simplistic"=>1, "silly"=>1, "and"=>1, "tedious"=>1, "interesting"=>1, "but"=>2, "not"=>2, "compelling"=>2, "seems"=>1, "clever"=>1, "especially"=>1}, @token_count=17, @prior=0.5>}, @category_count=2, @category_size_limit=0, @doc_count=8, @token_count=54, @unique_token_count=43, @k_value=1.0>
another_bayes_obj.classify('best senses')
=> #<OmniCat::Result:0x007febb14c0fc8 @top_score_key="positive", @scores={"positive"=>#<OmniCat::Score:0x007febb14c0ed8 @key="positive", @value=0.00029069767441860465, @percentage=52>, "negative"=>#<OmniCat::Score:0x007febb14c0de8 @key="negative", @value=0.0002704164413196322, @percentage=48>}, @total_score=0.0005611141157382368>
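
Since #to_hash returns a plain Ruby hash, the model data can be persisted and restored; the file name below and the use of Marshal are illustrative, not prescribed by the gem:

# Store the model data, then rebuild a classifier from it later.
File.binwrite('bayes_model.dump', Marshal.dump(bayes.to_hash))

restored_hash  = Marshal.load(File.binread('bayes_model.dump'))
restored_bayes = OmniCat::Classifiers::Bayes.new(restored_hash)
restored_bayes.classify('best senses')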

Best practices

For Bayes classification, always try to train roughly the same number of documents for each category; do not enable auto-training mode, because it skews the balance of trained documents between categories and degrades the classifier. To get the best results, apply cleaning steps such as spell checking, stemming, and stop-word removal before both training and prediction.
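
As a hedged sketch of that advice, the pragmatic_tokenizer gem covered earlier in this post could serve as a simple cleaning step before training and classification (the clean helper below is illustrative, not part of either gem):

require 'pragmatic_tokenizer'

# Downcase, drop punctuation, and remove stop words before handing text to the classifier.
CLEANER = PragmaticTokenizer::Tokenizer.new(
  downcase:          true,
  punctuation:       :none,
  remove_stop_words: true
)

def clean(text)
  CLEANER.tokenize(text).join(' ')
end

bayes.train('positive', clean('a feel-good picture in the best sense of the term...'))
result = bayes.classify(clean('I feel so good and happy'))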

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

Download Details:

Author: Mustafaturan
Source Code: https://github.com/mustafaturan/omnicat-bayes 
License: MIT license

#ruby #tokenize #text #classification 
