Text Similarity : Python-sklearn on MongoDB Collection

Text Similarity : Python-sklearn on MongoDB Collection

Check out some Python code that can calculate the similarity of an indexed field between all the documents of a MongoDB collection.

Check out some Python code that can calculate the similarity of an indexed field between all the documents of a MongoDB collection.

Overview

In this article, I set up a Python script that allows us to calculate the similarity of an indexed field between all the documents of a MongoDB collection. In the process I parallelized the executions on four threads to improve performance.

The script is detailed below, I hope it will be useful.

Python Script

import multiprocessing
import threading
import json, sys
import pymongo
import nltk, string

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
class SimilarityThread (threading.Thread):
   def __init__(self, threadID, data_array, totalSize, similarity_collection,startIndex):
   threading.Thread.__init__(self)
   self.threadID = threadID
   self.data_array = data_array
   self.totalSize = totalSize
   self.similarity_collection = similarity_collection
   self.startIndex = startIndex

   def run(self):
      clacluateSimilarity( self.data_array, self.totalSize, self.similarity_collection,self.startIndex)

def clacluateDistance(txt1,txt2):
return euclidean_distances(txt1,txt2)[0][0]

def clacluateSimilarity( data_array, totalSize, similarity_collection, startIndex):
vectorizer = CountVectorizer()
for idx in range(startIndex,totalSize):
h = data_array[idx]
for idx1 in range((idx+1),totalSize):
h1 = data_array[idx1]
hSimilarity = {}
hSimilarity['idOrigin']=h['id']
hSimilarity['idTarget']=h1['id']
corpus = []
corpus.append(h['text'])
corpus.append(h1['text'])
features = vectorizer.fit_transform(corpus).todense()
distance = clacluateDistance(features[0],features[1])
hSimilarity['distance'] = distance
print(hSimilarity)
if distance < 4:
print("Distance ====> %d " % distance)
similarity_collection.insert_one(hSimilarity)

def processTextSimilarity(totalSize, data_array,similarity_collection):

num_cores = multiprocessing.cpu_count()
print(":::num cores ==> %d " % num_cores)
threadList = ["Thread-1", "Thread-2", "Thread-3", "Thread-4"]
threadID = 1;
threads=[]
rootIndex = round(totalSize/4)
startIndex = 0
for tName in threadList:
thread = SimilarityThread(threadID, data_array, startIndex+rootIndex, similarity_collection,startIndex)
thread.start()
startIndex+=rootIndex
threads.append(thread)
threadID += 1


# Wait for all threads to complete
for t in threads:
t.join()



def main():
print('****** Text Similarity::start ******')
connection = pymongo.MongoClient("mongodb://localhost")
db = connection.kalamokomnoor
article = db.article
article_similarity = db.article_similarity

data_array = article.find({}).sort("id",pymongo.ASCENDING)
totalSize =  article.count_documents({}) 


print('###### :: totalSize : %d ' % totalSize)


processTextSimilarity(totalSize,data_array,article_similarity)

print('****** Text Similarity::Ending ******')




if __name__ == '__main__':
main()

python mongodb data-science

Bootstrap 5 Complete Course with Examples

Bootstrap 5 Tutorial - Bootstrap 5 Crash Course for Beginners

Nest.JS Tutorial for Beginners

Hello Vue 3: A First Look at Vue 3 and the Composition API

Building a simple Applications with Vue 3

Deno Crash Course: Explore Deno and Create a full REST API with Deno

How to Build a Real-time Chat App with Deno and WebSockets

Convert HTML to Markdown Online

HTML entity encoder decoder Online

Applied Data Science with Python Certification Training Course -IgmGuru

Master Applied Data Science with Python and get noticed by the top Hiring Companies with IgmGuru's Data Science with Python Certification Program. Enroll Now

50 Data Science Jobs That Opened Just Last Week

Data Science and Analytics market evolves to adapt to the constantly changing economic and business environments. Our latest survey report suggests that as the overall Data Science and Analytics market evolves to adapt to the constantly changing economic and business environments, data scientists and AI practitioners should be aware of the skills and tools that the broader community is working on. A good grip in these skills will further help data science enthusiasts to get the best jobs that various industries in their data science functions are offering.

Basic Data Types in Python | Python Web Development For Beginners

In the programming world, Data types play an important role. Each Variable is stored in different data types and responsible for various functions. Python had two different objects, and They are mutable and immutable objects.

Data Science With Python | Python For Data Science | Data Science For Beginners

This Data Science with Python Tutorial will help you understand what is Data Science, basics of Python for data analysis, why learn Python, how to install Python, Python libraries for data analysis, exploratory analysis using Pandas, introduction to series and dataframe, loan prediction problem, data wrangling using Pandas, building a predictive model using Scikit-Learn and implementing logistic regression model using Python.

Data Science Course in Dallas

Become a data analysis expert using the R programming language in this [data science](https://360digitmg.com/usa/data-science-using-python-and-r-programming-in-dallas "data science") certification training in Dallas, TX. You will master data...