Python+MongoDB = Rapid & scalable app development

Accessing MongoDB from Python applications is easy and familiar to many Python developers. PyMongo uses Python's rich dictionary support to create an API similar to MongoDB's native JavaScript query syntax. Even so, some understanding of how queries execute and perform is still needed. There is also a second API built upon MongoDB's atomic operators ($set, $push, etc.) which truly leverages the full power of MongoDB and its aggregate-level atomicity.
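As an illustration of that operator-based API (the collection, field names, and values here are hypothetical), an atomic update is expressed as plain dictionaries and handed to update_one():

```python
# Hypothetical filter and update documents using MongoDB's atomic operators.
# "$set" overwrites a single field; "$push" appends one element to an array.
# MongoDB applies the whole update atomically to the matched document.
filter_spec = {"name": "My Document"}
update_spec = {
    "$set": {"status": "published"},
    "$push": {"ids": 4},
}

# With PyMongo this would run as:
#   collection.update_one(filter_spec, update_spec)
```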

MongoDB from Python:
Python and PyMongo allow direct coding against MongoDB from Python. This is most appropriately compared to programming at the level of raw SQL for RDBMSes. That level is a necessary building block, but for most applications working at a higher level and building upon custom classes is more appropriate. This module explores one of the most popular Object-Data Mappers for Python and MongoDB: MongoEngine.
MongoDB from Python:
Designing entities in MongoDB and document databases more generally is very different than 3rd-normal-form from SQL tables. To be successful with MongoDB, as a developer you will need to master this skill. Getting your entity design correct is key to high performance and flexible applications.
MongoDB from Python:
One thing that’s nice about the pymongo connection is that it’s automatically pooled. What this means is that pymongo maintains a pool of connections to the mongodb server that it reuses over the lifetime of your application. This is good for performance, since it means pymongo doesn’t need to go through the overhead of establishing a connection each time it does an operation. Mostly, this happens automatically. You do, however, need to be aware of the connection pooling, since you may need to manually notify pymongo that you’re “done” with a connection in the pool so it can be reused.

The easiest way to connect to a MongoDB database from Python is shown below:

In: import pymongo

In: conn = pymongo.Connection()

Inserting documents begins by selecting a database. To create a database, you do… well, nothing, actually. The first time you refer to a database, the MongoDB server creates it for you automatically. So once you have your database, you need to decide which “collection” in which to store your documents. To create a collection, you do… right – nothing. 

In: db = conn.tutorial

In: db.test

Out: Collection(Database(Connection('localhost', 27017), u'tutorial'), u'test')

In: db.test.insert({'name': 'My Document', 'ids': [1,2,3], 'subdocument': {'a':2}})  

Out: ObjectId('4f25bcffeb033049af000000')

 

Here the insert command returned an ObjectId value. This is the value that pymongo generated for the _id property, the “primary key” of a MongoDB document. We can also specify the _id manually if we want; we don’t have to use ObjectIds:

In: db.test.insert({'_id': 42, 'name': 'My Document', 'ids': [1,2,3], 'subdocument': {'a':2}})

Out: 42

MongoDB from Python:
Simply put, indexes are the single biggest contributor to extremely high performance MongoDB deployments and applications. Make sure your applications use indexes to full advantage. Finding the queries that need to be optimized can be tricky, especially when there is a translation layer in the middle, such as an ODM like MongoEngine.

MongoDB has an extremely fast query that it can use in some cases where it doesn’t have to scan any objects, only the index entries. This happens when the only data you’re returning from a query is part of the index:

In: db.test.find({'a':2}, {'a':1, '_id':0}).explain()

Out: 

...

u'indexBounds': {u'a': [[2, 2]]},

u'indexOnly': True,

u'isMultiKey': False,

...

Here the indexOnly field is true, indicating that MongoDB only had to inspect the index (and not the actual collection data) to satisfy the query.

MongoDB from Python:
MongoDB has a facility, GridFS, to store, classify, and query files of virtually unlimited size, whether binary data, text data, etc. Here we introduce GridFS and show you how to work with it from Python. You can upload, download, and list files in GridFS. You can also create custom classes and store them within your GridFS files, which can then be used for rich reporting and querying that does not exist in standard file systems.

Creating a GridFS instance to use:

>>> from pymongo import MongoClient

>>> import gridfs

>>>

>>> db = MongoClient().gridfs_example

>>> fs = gridfs.GridFS(db)

Every GridFS instance is created with, and will operate on, a specific database instance.

MongoDB from Python:
The simplest way to work with GridFS is to use its key/value interface. To write data to GridFS, use put():

>>> a = fs.put("hello world")

put() creates a new file in GridFS, and returns the value of the file document’s "_id" key. Given that "_id", we can use get() to get back the contents of the file:

>>> fs.get(a).read()

'hello world'

 

In addition to putting a str as a GridFS file, we can also put any file-like object (an object with a read() method). GridFS will handle reading the file in chunk-sized segments automatically. We can also add additional attributes to the file as keyword arguments:

>>> b = fs.put(fs.get(a), filename="foo", bar="baz")

>>> out = fs.get(b)

>>> out.read()

'hello world'

>>> out.filename

u'foo'

>>> out.bar

u'baz'

>>> out.upload_date

datetime.datetime(...)

The attributes we set in put() are stored in the file document, and are retrievable after calling get(). Some attributes (like "filename") are special and are defined in the GridFS specification.

MongoDB from Python:
The aggregation framework in MongoDB allows you to execute rich queries and transformations on the server. While normal queries return documents in their stored structure, aggregation, much like map-reduce, is far more flexible: it can transform, group, and query data, acting as a data pipeline on the server.
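A pipeline is just a list of stage dictionaries. A minimal sketch (the collection and field names are hypothetical): group documents by an author field, count them, then sort by the count descending:

```python
# Hypothetical aggregation pipeline: one $group stage, one $sort stage.
pipeline = [
    {"$group": {"_id": "$author", "n_docs": {"$sum": 1}}},
    {"$sort": {"n_docs": -1}},
]

# With PyMongo the pipeline executes server-side:
#   for row in db.test.aggregate(pipeline):
#       print(row)
```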
MongoDB from Python:
Replication is key to MongoDB’s fault tolerance. It can also be used for data locality across data centers, scaled-out reads, offsite backups, reporting without performance degradation, and more. PyMongo makes working with replica sets easy. Here we’ll launch a new replica set and show how to handle both initialization and normal connections with PyMongo.
MongoDB from Python:
MongoDB is a high performance database even in single-server mode. However, to truly leverage MongoDB’s performance potential, you will need to use sharding. This technique allows you to run a cluster of MongoDB servers working in concert to each hold some portion of the data and share some portion of the queries. It is sharding that gives MongoDB the ability to scale horizontally on commodity hardware.

To actually add the shards to the cluster, go through the query routers, which are now configured to act as our interface with the cluster. You can do this by connecting to any of the query routers like this:

mongo --host query0.example.com --port 27017

This will connect to the appropriate query router and open a mongo prompt. You will add all of the shard servers from this prompt.

To add the first shard, type:

sh.addShard( "shard0.example.com:27017" )

You can then add your remaining shard droplets in this same interface. You do not need to log into each shard server individually. 

sh.addShard( "shard1.example.com:27017" )

sh.addShard( "shard2.example.com:27017" )

sh.addShard( "shard3.example.com:27017" )

If you are configuring a production cluster, complete with replica sets, you have to instead specify the replica set name and a replica set member to establish each set as a distinct shard. The syntax would look something like this:

sh.addShard( "rep_set_name/rep_set_member:27017" )

How to integrate MongoDB with Python Applications

In this article, you'll learn how to integrate MongoDB with your Python applications.

MongoDB is a leading open-source NoSQL database that is written in C++. This tutorial will give the reader a better understanding of the MongoDB concepts needed to integrate MongoDB into your Python applications.

The SQL vs. NoSQL Difference

SQL databases use Structured Query Language (SQL) for defining and manipulating data. When using SQL, we need a Relational Database Management System (RDBMS) server such as SQL Server, MySQL Server, or MS Access. Data in an RDBMS is stored in database objects called tables. A table is a collection of related data entries, and it consists of columns and rows.

A NoSQL database has a dynamic schema for unstructured data. In NoSQL, data is stored in several ways: it can be column-oriented, document-oriented, graph-based or organized as a key-value store. A NoSQL database has the following advantages:

  • Documents can be created without having to first define their structure
  • Each document can have its own unique structure
  • The syntax can vary from database to database
  • It can store large volumes of structured, semi-structured, and unstructured data
  • It supports object-oriented programming that is easy to use and flexible
  • It is horizontally scalable
NoSQL Database Types

The following are the different types of NoSQL databases:

  • Document databases pair each key with a complex data structure known as a document. A document is a set of key-value pairs. MongoDB is an example of a document store database. A group of MongoDB documents is known as a collection. This is the equivalent of an RDBMS table.
  • Graph stores are used to store information about networks of data, for instance, social connections. Graph stores include Neo4J and Giraph.
  • Key-value stores databases store every single item in the database as a key together with its value. Examples of key-value stores are Riak and Berkeley DB. Some key-value stores, such as Redis, allow each value to have a type, such as an integer, which adds functionality.
  • Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.
Comparing MongoDB to RDBMS

In order to get a thorough understanding of the terms used in MongoDB, we'll compare them with the equivalent in RDBMS.

MongoDB and Python

In order to start using MongoDB, we first have to install it. Installation instructions are found at the official MongoDB documentation. To run a quick install on Ubuntu run the commands below:

sudo apt update
sudo apt install -y mongodb

Once this is done we'll check the service and database by running this command on the terminal:

sudo systemctl status mongodb

● mongodb.service - An object/document-oriented database
   Loaded: loaded (/lib/systemd/system/mongodb.service; enabled; vendor preset:
   Active: active (running) since Thu 2018-09-20 13:14:02 EAT; 23h ago
     Docs: man:mongod(1)
 Main PID: 11446 (mongod)
    Tasks: 27 (limit: 4915)
   CGroup: /system.slice/mongodb.service
           └─11446 /usr/bin/mongod --unixSocketPrefix=/run/mongodb --config /etc

Sep 20 13:14:02 derrick systemd[1]: Started An object/document-oriented database

The message above means that all is well and that we are set to start using MongoDB.

Now that we have MongoDB installed we need a way to interact with it in our Python code. The official Python MongoDB driver is called PyMongo. We can install it using pip as shown below:

pip install pymongo

It's possible for us to interact with MongoDB from the terminal; however, for the purposes of this tutorial, we'll run all our code in a Jupyter Notebook.

Making a Connection with MongoClient

The first thing we need to do is import pymongo. The import should run without any errors to signify that we've done our installation well.

import pymongo

Establishing a connection in MongoDB requires us to create a MongoClient to the running MongoDB instance.

from pymongo import MongoClient
client = MongoClient()

The above code will connect to the default host and port, but we can specify the host and port as shown below:

client = MongoClient("localhost", 27017)

MongoDB also has a URI format for doing this.

client = MongoClient('mongodb://localhost:27017/')

Creating a Database

To create a database in MongoDB, we use the MongoClient instance and specify a database name. MongoDB will create a database if it doesn't exist and connect to it.

db = client['datacampdb']

It is important to note that databases and collections are created lazily in MongoDB. This means that the collections and databases are created when the first document is inserted into them.

Data in MongoDB

Data in MongoDB is represented and stored using JSON-Style documents. In PyMongo we use dictionaries to represent documents. Let's show an example of a PyMongo document below:

 article = {"author": "Derrick Mwiti",
            "about": "Introduction to MongoDB and Python",
            "tags":
                ["mongodb", "python", "pymongo"]}

Inserting a Document

To insert a document into a collection, we use the insert_one() method. As we saw earlier, a collection is similar to a table in RDBMS while a document is similar to a row.

articles = db.articles
result = articles.insert_one(article)

When the document is inserted, a special key _id is generated, and it's unique to this document. We can print the document ID as shown below:

print("First article key is: {}".format(result.inserted_id))
First article key is: 5ba5c05e2e8ca029163417f8

The articles collection is created after inserting the first document. We can confirm this using the list_collection_names method.

db.list_collection_names()
['articles', 'user']

We can insert multiple documents to a collection using the insert_many() method as shown below.

article1 = {"author": "Emmanuel Kens",
            "about": "Knn and Python",
            "tags":
                ["Knn","pymongo"]}
article2 = {"author": "Daniel Kimeli",
            "about": "Web Development and Python",
            "tags":
                ["web", "design", "HTML"]}
new_articles = articles.insert_many([article1, article2])
print("The new article IDs are {}".format(new_articles.inserted_ids))
The new article IDs are [ObjectId('5ba5c0c52e8ca029163417fa'), ObjectId('5ba5c0c52e8ca029163417fb')]
Retrieving a Single Document with find_one()

find_one() returns a single document matching the query or none if it doesn't exist. This method returns the first match that it comes across. When we call the method below, we get the first article we inserted into our collection.

print(articles.find_one())
{'_id': ObjectId('5ba5c0b52e8ca029163417f9'), 'author': 'Derrick Mwiti', 'about': 'Introduction to MongoDB and Python', 'tags': ['mongodb', 'python', 'pymongo']}
Finding all Documents in a Collection

MongoDB also allows us to retrieve all documents in a collection using the find method.

for article in articles.find():
  print(article)
{'_id': ObjectId('5ba5c0b52e8ca029163417f9'), 'author': 'Derrick Mwiti', 'about': 'Introduction to MongoDB and Python', 'tags': ['mongodb', 'python', 'pymongo']}
{'_id': ObjectId('5ba5c0c52e8ca029163417fa'), 'author': 'Emmanuel Kens', 'about': 'Knn and Python', 'tags': ['Knn', 'pymongo']}
{'_id': ObjectId('5ba5c0c52e8ca029163417fb'), 'author': 'Daniel Kimeli', 'about': 'Web Development and Python', 'tags': ['web', 'design', 'HTML']}

When building web applications, we usually get document IDs from the URL and try to retrieve them from our MongoDB collection. In order to achieve this, we first have to convert the obtained string ID into an ObjectId.

from bson.objectid import ObjectId
def get(post_id):
    document = client.db.collection.find_one({'_id': ObjectId(post_id)})
    return document

Return Some Fields Only

Sometimes we might not want to return all the fields from our documents. Let's show how we'd fetch specific fields. In our case, we use 0 to specify that _id should not be fetched, and 1 to specify that author and about should be fetched. MongoDB doesn't allow us to mix these: setting tags to 0 in the projection below, for example, would generate an error, since we are not allowed to specify both 0 and 1 values in the same projection (unless one of the fields is the _id field). When we only exclude fields with the value 0, all other fields are returned.

for article in articles.find({},{ "_id": 0, "author": 1, "about": 1}):
  print(article)
{'author': 'Derrick Mwiti', 'about': 'Introduction to MongoDB and Python'}
{'author': 'Emmanuel Kens', 'about': 'Knn and Python'}
{'author': 'Daniel Kimeli', 'about': 'Web Development and Python'}
Sorting the Results

We can use the sort() method to sort the results in ascending or descending order. The default order is ascending. We use 1 to signify ascending and -1 to signify descending.

doc = articles.find().sort("author", -1)

for x in doc:
  print(x)
{'_id': ObjectId('5ba5c0c52e8ca029163417fa'), 'author': 'Emmanuel Kens', 'about': 'Knn and Python', 'tags': ['Knn', 'pymongo']}
{'_id': ObjectId('5ba5c0b52e8ca029163417f9'), 'author': 'Derrick Mwiti', 'about': 'Introduction to MongoDB and Python', 'tags': ['mongodb', 'python', 'pymongo']}
{'_id': ObjectId('5ba5c0c52e8ca029163417fb'), 'author': 'Daniel Kimeli', 'about': 'Web Development and Python', 'tags': ['web', 'design', 'HTML']}
Updating a Document

We update a document using the update_one() method. The first parameter taken by this function is a query object defining the document to be updated. If the method finds more than one document, it will only update the first one. Let's update the name of the author in the article written by Derrick.

query = { "author": "Derrick Mwiti" }
new_author = { "$set": { "author": "John David" } }

articles.update_one(query, new_author)

for article in articles.find():
  print(article)
{'_id': ObjectId('5ba5c0b52e8ca029163417f9'), 'author': 'John David', 'about': 'Introduction to MongoDB and Python', 'tags': ['mongodb', 'python', 'pymongo']}
{'_id': ObjectId('5ba5c0c52e8ca029163417fa'), 'author': 'Emmanuel Kens', 'about': 'Knn and Python', 'tags': ['Knn', 'pymongo']}
{'_id': ObjectId('5ba5c0c52e8ca029163417fb'), 'author': 'Daniel Kimeli', 'about': 'Web Development and Python', 'tags': ['web', 'design', 'HTML']}
Limiting the Result

MongoDB enables us to limit the result of our query using the limit method. In our query below we'll limit the result to one record.

limited_result = articles.find().limit(1)
for x in limited_result:
    print(x)
{'_id': ObjectId('5ba5c0b52e8ca029163417f9'), 'author': 'John David', 'about': 'Introduction to MongoDB and Python', 'tags': ['mongodb', 'python', 'pymongo']}
MongoDB Delete Document

We use the delete_one() method to delete a document in MongoDB. The first parameter for this method is the query object of the document we want to delete. If this method finds more than one document, it deletes only the first one found. Let's delete the article with the id 5ba4d00e2e8ca029163417d4.

db.articles.delete_one({"_id":ObjectId("5ba4d00e2e8ca029163417d4")})

Deleting Many Documents

In order to delete many documents, we use the delete_many() method. Passing an empty query object will delete all the documents.

delete_articles = articles.delete_many({})
print(delete_articles.deleted_count, " articles deleted.")
3  articles deleted.
Dropping a Collection

In MongoDB, we can delete a collection using the drop() method.

articles.drop()

We can confirm that the collection has been deleted: when we call list_collection_names, we get an empty list.

db.list_collection_names()
[]

It is impossible for us to go through all the MongoDB methods in this tutorial. I would recommend that the reader visits the official documentation of PyMongo and MongoDB to learn more.

MongoDB object document mapper (ODM)

In SQL we have object-relational mappers (ORMs) that provide an abstraction when working with SQL. MongoDB has something similar, known as an object-document mapper (ODM). MongoEngine is a library that provides a high-level abstraction on top of PyMongo. Run the command below to install it using pip.

pip install mongoengine

There are quite a number of other MongoDB ODMs that we can experiment with to choose the best option for our use case. Examples of other MongoDB ODMs include ming, minimongo, and mongokit.

After we have imported mongoengine, we use the connect function and specify the database, port, and the host in order to establish a connection with the MongoDB instance.

from mongoengine import *
connect('datacampdb', host='localhost', port=27017)
MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True, read_preference=Primary())
Defining our Documents

Let's assume that we are developing a social site that will allow users to post messages. This means that we need a users and a posts document. Just as if we were using a relational database with an ORM, we define the fields a user will have and their data types. We create the document by subclassing the Document class from mongoengine. required=True means that we have to specify this field when creating a user; otherwise, an exception will be thrown.

class User(Document):
    email = StringField(required=True)
    first_name = StringField(max_length=30)
    last_name = StringField(max_length=30)

Now let's show how we'd create a posts document and reference the users document. The ReferenceField enables us to make reference from one document to another in mongoengine.

class Post(Document):
    title = StringField(max_length=120, required=True)
    author = ReferenceField(User)
Saving Documents

To save the document to the database, call the save() method. If the document does not exist in the database, it will be created. If it does already exist, then changes will be updated atomically.

user = User(email="[email protected]", first_name="Derrick", last_name="Mwiti")
user.save()

Accessing the just-created user is very similar to other ORMs:

print(user.id, user.email, user.first_name, user.last_name)
5ba5c3bf2e8ca029163417fc [email protected] Derrick Mwiti
Conclusion

In this tutorial, we have learned how we can use MongoDB in Python. We've also introduced mongoengine, an Object Document Mapper that makes it easier for us to interact with MongoDB in Python.

PyMongo Tutorial: Testing MongoDB Failover in Your Python App

Python is a powerful and flexible programming language used by millions of developers around the world to build their applications. It comes as no surprise that Python developers commonly leverage MongoDB hosting, the most popular NoSQL database, for their deployments due to its flexible nature and lack of schema requirements.

So, what's the best way to use MongoDB with Python? PyMongo is a Python distribution containing tools for working with MongoDB, and the recommended Python MongoDB driver. It is a fairly mature driver that supports most of the common operations with the database, and you can check out this tutorial for an introduction to the PyMongo driver.

When deploying in production, it's highly recommended to set up a MongoDB replica set configuration so your data is geographically distributed for high availability. It is also recommended that SSL connections be enabled to encrypt the client-database traffic. We often test the failover characteristics of various MongoDB drivers to qualify them for production use cases, or when our customers ask us for advice. In this post, we show you how to connect to an SSL-enabled MongoDB replica set configured with self-signed certificates using PyMongo, and how to test MongoDB failover behavior in your code.

Connecting to MongoDB SSL Using Self-Signed Certificates

The first step is to ensure that the right versions of PyMongo and its dependencies are installed. This guide helps you in sorting out the dependencies, and the driver compatibility matrix can be found here.

The mongo_client.MongoClient parameters that are of interest to us are ssl and ssl_ca_certs. In order to connect to an SSL-enabled MongoDB endpoint that uses a self-signed certificate, ssl must be set to True and ssl_ca_certs must point to the CA certificate file.

If you are a ScaleGrid customer, you can download the CA certificate file for your MongoDB clusters from the ScaleGrid console as shown here:

So, a connection snippet would look like:

>>> import pymongo
>>> MONGO_URI = 'mongodb://rwuser:@SG-example-0.servers.mongodirector.com:27017,SG-example-1.servers.mongodirector.com:27017,SG-example-2.servers.mongodirector.com:27017/admin?replicaSet=RS-example&ssl=true'
>>> client = pymongo.MongoClient(MONGO_URI, ssl = True, ssl_ca_certs = '')
>>> print("Databases - " + str(client.list_database_names()))
Databases - ['admin', 'local', 'test']
>>> client.close()
>>>

If you are using your own self-signed certificates where hostname verification might fail, you will also have to set the ssl_match_hostname parameter to False. As the driver documentation says, this is not recommended as it makes the connection susceptible to man-in-the-middle attacks.

Testing Failover Behavior

With MongoDB deployments, failovers aren't considered major events as they were with traditional database management systems. Although most MongoDB drivers try to abstract this event, developers should understand and design their applications for such behavior, as applications should expect transient network errors and retry before percolating errors up.

You can test the resilience of your applications by inducing failovers while your workload runs. The easiest way to induce failover is to run the rs.stepDown() command:

RS-example-0:PRIMARY> rs.stepDown()
2019-04-18T19:44:42.257+0530 E QUERY [thread1] Error: error doing query: failed: network error while attempting to run command 'replSetStepDown' on host 'SG-example-1.servers.mongodirector.com:27017' :
[email protected]/mongo/shell/db.js:168:1
[email protected]/mongo/shell/db.js:185:1
[email protected]/mongo/shell/utils.js:1305:12
@(shell):1:1
2019-04-18T19:44:42.261+0530 I NETWORK [thread1] trying reconnect to SG-example-1.servers.mongodirector.com:27017 (X.X.X.X) failed
2019-04-18T19:44:43.267+0530 I NETWORK [thread1] reconnect SG-example-1.servers.mongodirector.com:27017 (X.X.X.X) ok
RS-example-0:SECONDARY>

One of the ways I like to test the behavior of drivers is by writing a simple 'perpetual' writer app. This would be simple code that keeps writing to the database unless interrupted by the user, and would print all exceptions it encounters to help us understand the driver and database behavior. I also keep track of the data it writes to ensure that there's no unreported data loss in the test. Here's the relevant part of test code we will use to test our MongoDB failover behavior:

import logging
import random
import string
import sys
import traceback
from datetime import datetime
from time import sleep

import pymongo

logger = logging.getLogger("test")

MONGO_URI = 'mongodb://rwuser:@SG-example-0.servers.mongodirector.com:48273,SG-example-1.servers.mongodirector.com:27017,SG-example-2.servers.mongodirector.com:27017/admin?replicaSet=RS-example-0&ssl=true'

try:
    logger.info("Attempting to connect...")
    client = pymongo.MongoClient(MONGO_URI, ssl = True, ssl_ca_certs = 'path-to-cacert.pem')
    db = client['test']
    collection = db['test']
    i = 0
    while True:
        try:
            text = ''.join(random.choices(string.ascii_uppercase + string.digits, k = 3))
            doc = { "idx": i, "date" : datetime.utcnow(), "text" : text }
            i += 1
            id = collection.insert_one(doc).inserted_id
            logger.info("Record inserted - id: " + str(id))
            sleep(3)
        except pymongo.errors.ConnectionFailure as e:
            logger.error("ConnectionFailure seen: " + str(e))
            traceback.print_exc(file = sys.stdout)
            logger.info("Retrying...")

    logger.info("Done...")

except Exception as e:
    logger.error("Exception seen: " + str(e))
    traceback.print_exc(file = sys.stdout)
finally:
    client.close()

The sort of entries that this writes look like:

RS-example-0:PRIMARY> db.test.find()
{ "_id" : ObjectId("5cb6d6269ece140f18d05438"), "idx" : 0, "date" : ISODate("2019-04-17T07:30:46.533Z"), "text" : "400" }
{ "_id" : ObjectId("5cb6d6299ece140f18d05439"), "idx" : 1, "date" : ISODate("2019-04-17T07:30:49.755Z"), "text" : "X63" }
{ "_id" : ObjectId("5cb6d62c9ece140f18d0543a"), "idx" : 2, "date" : ISODate("2019-04-17T07:30:52.976Z"), "text" : "5BX" }
{ "_id" : ObjectId("5cb6d6329ece140f18d0543c"), "idx" : 4, "date" : ISODate("2019-04-17T07:30:58.001Z"), "text" : "TGQ" }
{ "_id" : ObjectId("5cb6d63f9ece140f18d0543d"), "idx" : 5, "date" : ISODate("2019-04-17T07:31:11.417Z"), "text" : "ZWA" }
{ "_id" : ObjectId("5cb6d6429ece140f18d0543e"), "idx" : 6, "date" : ISODate("2019-04-17T07:31:14.654Z"), "text" : "WSR" }
..

Handling the ConnectionFailure Exception

Notice that we catch the ConnectionFailure exception to deal with all network-related issues we may encounter due to failovers - we print the exception and continue to attempt to write to the database. The driver documentation recommends that:

If an operation fails because of a network error, ConnectionFailure is raised and the client reconnects in the background. Application code should handle this exception (recognizing that the operation failed) and then continue to execute.

Let's run this and do a database failover while it executes. Here's what happens:

04/17/2019 12:49:17 PM INFO Attempting to connect...
04/17/2019 12:49:20 PM INFO Record inserted - id: 5cb6d3789ece145a2408cbc7
04/17/2019 12:49:23 PM INFO Record inserted - id: 5cb6d37b9ece145a2408cbc8
04/17/2019 12:49:27 PM INFO Record inserted - id: 5cb6d37e9ece145a2408cbc9
04/17/2019 12:49:30 PM ERROR PyMongoError seen: connection closed
Traceback (most recent call last):
id = collection.insert_one(doc).inserted_id
File "C:\Users\Random\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pymongo\collection.py", line 693, in insert_one
session=session),
...
File "C:\Users\Random\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pymongo\network.py", line 173, in receive_message
_receive_data_on_socket(sock, 16))
File "C:\Users\Random\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pymongo\network.py", line 238, in _receive_data_on_socket
raise AutoReconnect("connection closed")
pymongo.errors.AutoReconnect: connection closed
04/17/2019 12:49:30 PM INFO Retrying...
04/17/2019 12:49:42 PM INFO Record inserted - id: 5cb6d3829ece145a2408cbcb
04/17/2019 12:49:45 PM INFO Record inserted - id: 5cb6d3919ece145a2408cbcc
04/17/2019 12:49:49 PM INFO Record inserted - id: 5cb6d3949ece145a2408cbcd
04/17/2019 12:49:52 PM INFO Record inserted - id: 5cb6d3989ece145a2408cbce

Notice that the driver takes about 12 seconds to understand the new topology, connect to the new primary, and continue writing. The exception raised is errors.AutoReconnect, which is a subclass of ConnectionFailure.

You could do a few more runs to see what other exceptions are seen. For example, here's another exception trace I encountered:

    id = collection.insert_one(doc).inserted_id
File "C:\Users\Random\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pymongo\collection.py", line 693, in insert_one
session=session),
...
File "C:\Users\Randome\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pymongo\network.py", line 150, in command
parse_write_concern_error=parse_write_concern_error)
File "C:\Users\Random\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pymongo\helpers.py", line 132, in _check_command_response
raise NotMasterError(errmsg, response)
pymongo.errors.NotMasterError: not master

This exception is also a subclass of ConnectionFailure.

'retryWrites' Parameter

Another way to test MongoDB failover behavior is to see how other parameter variations affect the results. One relevant parameter is 'retryWrites':

retryWrites: (boolean) Whether supported write operations executed within this MongoClient will be retried once after a network error on MongoDB 3.6+. Defaults to False.

Let's see how this parameter works with a failover. The only change made to the code is:

client = pymongo.MongoClient(MONGO_URI, ssl=True, ssl_ca_certs='path-to-cacert.pem', retryWrites=True)

Let's run it now, and then do a database system failover:

04/18/2019 08:49:30 PM INFO Attempting to connect...
04/18/2019 08:49:35 PM INFO Record inserted - id: 5cb895869ece146554010c77
04/18/2019 08:49:38 PM INFO Record inserted - id: 5cb8958a9ece146554010c78
04/18/2019 08:49:41 PM INFO Record inserted - id: 5cb8958d9ece146554010c79
04/18/2019 08:49:44 PM INFO Record inserted - id: 5cb895909ece146554010c7a
04/18/2019 08:49:48 PM INFO Record inserted - id: 5cb895939ece146554010c7b <<< Failover around this time
04/18/2019 08:50:04 PM INFO Record inserted - id: 5cb895979ece146554010c7c
04/18/2019 08:50:07 PM INFO Record inserted - id: 5cb895a79ece146554010c7d
04/18/2019 08:50:10 PM INFO Record inserted - id: 5cb895aa9ece146554010c7e
04/18/2019 08:50:14 PM INFO Record inserted - id: 5cb895ad9ece146554010c7f
...

Notice how the insert after the failover takes about 12 seconds but goes through successfully, as the retryWrites parameter ensures the failed write is retried. Keep in mind that setting this parameter doesn't absolve you from handling the ConnectionFailure exception: reads and other operations are unaffected by it, and even supported writes are retried only once, so when a failover takes longer to complete, retryWrites alone will not be enough.

Configuring the Network Timeout Values

rs.stepDown() induces a rather quick failover, as the replica set primary is instructed to become a secondary and the secondaries hold an election to determine the new primary. In production deployments, network load, partitions, and other such issues delay the detection of an unavailable primary, prolonging your failover time. You would also often run into PyMongo errors like errors.ServerSelectionTimeoutError and errors.NetworkTimeout during network issues and failovers.

If this occurs often, look at tweaking the timeout parameters. The relevant MongoClient timeout parameters are serverSelectionTimeoutMS, connectTimeoutMS, and socketTimeoutMS. Of these, choosing a larger value for serverSelectionTimeoutMS most often helps in dealing with errors during failovers:

serverSelectionTimeoutMS: (integer) Controls how long (in milliseconds) the driver will wait to find an available, appropriate server to carry out a database operation; while it is waiting, multiple server monitoring operations may be carried out, each controlled by connectTimeoutMS. Defaults to 30000 (30 seconds).

Ready to use MongoDB in your Python application? Check out our Getting Started with Python and MongoDB article to see how you can get up and running in just 5 easy steps. ScaleGrid is the only MongoDB DBaaS provider that gives you full SSH access to your instances, so you can run your Python server on the same machine as your MongoDB server. Automate your MongoDB cloud deployments on AWS, Azure, or DigitalOcean with dedicated servers, high availability, and disaster recovery so you can focus on developing your Python application.

Originally published by ScaleGrid at dev.to


Text Similarity : Python-sklearn on MongoDB Collection

Check out some Python code that can calculate the similarity of an indexed field between all the documents of a MongoDB collection.

Originally published by Anis Hajri at dzone.com

Overview

In this article, I set up a Python script that calculates the similarity of an indexed field across all the documents of a MongoDB collection. To improve performance, the work is parallelized across four threads.

The script is detailed below; I hope you find it useful.

Python Script

import multiprocessing
import threading

import pymongo

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances


class SimilarityThread(threading.Thread):
    def __init__(self, threadID, data_array, totalSize, similarity_collection, startIndex):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.data_array = data_array
        self.totalSize = totalSize
        self.similarity_collection = similarity_collection
        self.startIndex = startIndex

    def run(self):
        calculateSimilarity(self.data_array, self.totalSize, self.similarity_collection, self.startIndex)


def calculateDistance(txt1, txt2):
    return euclidean_distances(txt1, txt2)[0][0]


def calculateSimilarity(data_array, totalSize, similarity_collection, startIndex):
    vectorizer = CountVectorizer()
    for idx in range(startIndex, totalSize):
        h = data_array[idx]
        for idx1 in range(idx + 1, totalSize):
            h1 = data_array[idx1]
            hSimilarity = {}
            hSimilarity['idOrigin'] = h['id']
            hSimilarity['idTarget'] = h1['id']
            corpus = [h['text'], h1['text']]
            # Keep the vectorized documents sparse; slicing a row yields the
            # (1, n) shape euclidean_distances expects
            features = vectorizer.fit_transform(corpus)
            distance = calculateDistance(features[0], features[1])
            # Cast the numpy float to a native float so BSON can encode it
            hSimilarity['distance'] = float(distance)
            print(hSimilarity)
            if distance < 4:
                print("Distance ====> %d " % distance)
                similarity_collection.insert_one(hSimilarity)


def processTextSimilarity(totalSize, data_array, similarity_collection):
    num_cores = multiprocessing.cpu_count()
    print(":::num cores ==> %d " % num_cores)
    threadList = ["Thread-1", "Thread-2", "Thread-3", "Thread-4"]
    threadID = 1
    threads = []
    rootIndex = round(totalSize / 4)
    startIndex = 0
    for tName in threadList:
        thread = SimilarityThread(threadID, data_array, startIndex + rootIndex, similarity_collection, startIndex)
        thread.start()
        startIndex += rootIndex
        threads.append(thread)
        threadID += 1

    # Wait for all threads to complete
    for t in threads:
        t.join()


def main():
    print('****** Text Similarity::start ******')
    connection = pymongo.MongoClient("mongodb://localhost")
    db = connection.kalamokomnoor
    article = db.article
    article_similarity = db.article_similarity

    # Materialize the cursor into a list so each thread can index into it
    data_array = list(article.find({}).sort("id", pymongo.ASCENDING))
    totalSize = article.count_documents({})

    print('###### :: totalSize : %d ' % totalSize)

    processTextSimilarity(totalSize, data_array, article_similarity)

    print('****** Text Similarity::Ending ******')


if __name__ == '__main__':
    main()

