What you see is what you guess

Browsing through the menu of services that Microsoft’s cloud services platform offers, I came across Computer Vision (CV). I decided to play with it a bit to get a feel for what the service has to offer.

So I created a free account using my GitHub credentials and was successfully onboarded (link for the US) after the usual verification steps. Then I followed the instructions to create a CV instance and get my API key.

The service analyzes content in images and video, either supplied by the user or publicly available on the internet. It is free for the first 12 months, with an upper limit of 5,000 transactions per month at a maximum rate of 20 per minute, which is more than enough to carry out some tests.
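
If you plan to send a batch of images, the 20-per-minute cap means leaving about 3 seconds between calls. Below is a minimal sketch of one way to pace requests on the client side. The class name `MinIntervalThrottle` and its parameters are my own invention for illustration and are not part of the service or any SDK.

```python
import time


class MinIntervalThrottle:
    """Keeps consecutive calls at least 60 / max_per_minute seconds apart."""

    def __init__(self, max_per_minute=20, clock=time.monotonic, sleep=time.sleep):
        # 20 calls per minute works out to one call every 3 seconds
        self.min_interval = 60.0 / max_per_minute
        self.clock = clock      # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last_call = None   # time of the previous call, None before the first one

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
```

You would call `throttle.wait()` right before each request. This only spaces out calls made by one process; it does not track the monthly quota.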

Technically, the service is delivered as a REST web service, which made me think that I could start testing it pretty fast. And I did.

What it delivers

Of the various services within CV, I will focus on image analysis. In this case, CV provides you with the following information:

Tags: From a universe of thousands of tags, CV lists the identified object types in your image, such as dog, tree or car.
Objects: Whenever possible, CV also provides a list of objects bounded by rectangles in your picture. So if there are three identified bicycles, it will give the coordinates of the bounding box for each of the three.
Brands: It can detect logos from a set of thousands of well-known brands.
Category: From a fixed list of 86 predefined categories, CV assigns your picture the category that fits best (for example, food_grilled or outdoor_street).
Description: It gives you a description of the whole image in the language you select. Actually this is the feature I was interested in, so I will stop the enumeration here. For the full list see this.
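
To make the list above concrete, here is a rough sketch of how these pieces might be read out of an already parsed response. The keys `description`, `captions`, `objects` and `object` are the ones used by the client later in this post; the other keys (`tags`, `brands`, `categories`, `name`, `rectangle`) reflect my understanding of the v3.1 response and should be checked against a real reply. The helper `summarize_analysis` and the sample dictionary are hypothetical and hand-written, not real service output.

```python
def summarize_analysis(content):
    """Pull the fields discussed above out of a parsed analyze response."""
    captions = content.get("description", {}).get("captions", [])
    return {
        # First caption, or None when the service returned no description
        "caption": captions[0]["text"] if captions else None,
        "tags": [t["name"] for t in content.get("tags", [])],
        # Each object comes with the bounding box it was found in
        "objects": [(o["object"], o.get("rectangle")) for o in content.get("objects", [])],
        "brands": [b["name"] for b in content.get("brands", [])],
        "categories": [c["name"] for c in content.get("categories", [])],
    }


# Hand-written sample for illustration only (assumed shape, invented values)
sample = {
    "categories": [{"name": "outdoor_street", "score": 0.9}],
    "tags": [{"name": "bicycle", "confidence": 0.99},
             {"name": "tree", "confidence": 0.8}],
    "objects": [{"object": "bicycle", "confidence": 0.9,
                 "rectangle": {"x": 10, "y": 20, "w": 100, "h": 60}}],
    "description": {"captions": [{"text": "a bicycle parked on a street",
                                  "confidence": 0.85}]},
}

summary = summarize_analysis(sample)
print(summary["caption"])
print(summary["tags"])
```

Features you did not ask for simply come back as empty lists here, because every lookup falls back to a default.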

Enough literature. Let us write some Python code and throw some Unsplash images at CV to start the fun.

A simplified Computer Vision client

REST web services can be consumed from basically any general-purpose programming language. We’ll be using Python here with the image above these lines. Have a look and read the comments:

import requests
import json

## connection details
## Replace this silly pun with your API key
azure_cv_api_key = "MyAPI Heat"
## same here
azure_cv_endpoint = "somesubdomain.cognitiveservices.azure.com"
azure_cv_resource = "vision/v3.1/analyze"
language = "en"
## We just ask for some features
visual_features = "Objects,Categories,Description"
image_path = "c:/work/images/tobias-adam-Twm64rH8wdc-unsplash.jpg"

azure_cv_url = "https://{}/{}".format(azure_cv_endpoint,
                                      azure_cv_resource)
headers = {'Ocp-Apim-Subscription-Key': azure_cv_api_key,
           'Content-Type': 'application/octet-stream'}

params = {"visualFeatures": visual_features, "language": language}

## We need to read the image as a byte stream
with open(image_path, "rb") as image_file:
    image_data = image_file.read()

response = requests.post(azure_cv_url, params=params, data=image_data, headers=headers)

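## Raise an error on a non-2xx status instead of assuming a 200
response.raise_for_status()
## json.loads accepts the raw bytes and detects the UTF encoding itself
content = json.loads(response.content)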

## This is where the picture description can be found
print("Description\n{}".format(content["description"]["captions"][0]["text"]))
## Which objects have you found?
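for o in content["objects"]:
    ## "parent" may be missing, so fall back to an empty dict
    parent = o.get("parent", {})
    grandparent = parent.get("parent", {})
    print("Object {} Parent {} Grandparent {}".format(
        o["object"], parent.get("object"), grandparent.get("object")))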

We run it and get:

Description

a baby elephant walks next to its mother

Object African elephant Parent elephant Grandparent mammal

Wow — I know, the bigger elephant could be the father or the aunt, but it sounds really good.
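
The client reads a fixed two levels of the parent chain. As far as I can tell, the depth of that chain can differ from one object to the next, so a small helper that walks it to the end is handy. This is only a sketch: `object_lineage` is a hypothetical name of mine, and the sample dictionary is typed in by hand to mirror the output shown above.

```python
def object_lineage(detected):
    """Return the object's name followed by its ancestors, nearest first."""
    names = []
    node = detected
    while node is not None:
        names.append(node["object"])
        # Stop when there is no "parent" key left to follow
        node = node.get("parent")
    return names


# Hand-written sample mirroring the elephant result (not a full response)
sample_object = {
    "object": "African elephant",
    "parent": {"object": "elephant", "parent": {"object": "mammal"}},
}

print(" > ".join(object_lineage(sample_object)))
```

For the elephant this gives the same three names as before, and it also works for an object that has no parent at all.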


towardsdatascience.com
