Philian Mateo

TensorFlow 2.0 Case Study

Imagine being the moderator of an online news forum who is responsible for determining the source (publisher) of each news article. Doing this manually is tedious: you would have to read every article and then derive its source. So, what if you could automate this task? Distilled down, the problem statement becomes: can we predict the publisher's name from a given article?

The problem can now be modeled as a text classification problem. In the rest of the article, you will be building a machine learning model to solve this. The summary of the steps looks like so:

  • Gather data
  • Preprocess the dataset
  • Get the data ready for feeding to a sequence model
  • Build, train and evaluate the model

System setup

You will be using Google Cloud Platform (GCP) as the infrastructure. It's easy to configure the system you need for this project, from the data to the libraries for building the model(s). You will begin by spinning up a JupyterLab instance, which comes as part of GCP's AI Platform. To spin up a JupyterLab instance on the AI Platform, you will need a billing-enabled GCP project. You can navigate to the Notebooks section of the AI Platform very easily:

After clicking on the Notebooks, a dashboard like the following appears:

You will be using TensorFlow 2.0 for this project, so choose accordingly:

After clicking on With 1 NVIDIA Tesla K80, you will be shown a basic configuration window. Keep the defaults, check the GPU driver installation box, and then click CREATE.

It will take some time to create the instance (~5 minutes). Once it is ready, just click OPEN JUPYTERLAB to access the notebook instance.

You will also be using BigQuery in this project, via the notebooks. So, as soon as you get the notebook instance, open up a terminal to install the BigQuery notebook extension:

pip3 install --upgrade google-cloud-bigquery

That's it for the system setup part.

BigQuery is a serverless, highly-scalable, and cost-effective cloud data warehouse with an in-memory BI Engine and machine learning built-in.

Where do we get the data?

It may not always be the case that the data will be readily available for the problem you're trying to solve. Fortunately, in this case, there is already a dataset, which is good enough to start with.

The dataset you are going to use is already available as a BigQuery public dataset, but it needs to be shaped a bit with respect to the problem statement. You'll come to this later.

This dataset contains all stories and comments from Hacker News from its launch in 2006 to present. Each story contains a story ID, the author that made the post, when it was written, and the number of points the story received.

To get the data into your notebook instance, you'll need to configure the GCP Project within the notebook's environment:

# Set your Project ID
import os
PROJECT = 'your-project-name'
os.environ['PROJECT'] = PROJECT

Replace your-project-name with the name of your GCP project. You are now ready to run a query against the BigQuery dataset:

%%bigquery --project $PROJECT data_preview
SELECT
  url, title, score
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  LENGTH(title) > 10
  AND score > 10
  AND LENGTH(url) > 0

Let's break a few things down here:

  • %%bigquery --project $PROJECT data_preview: %%bigquery is a magic command which lets you run SQL-like queries (compatible with BigQuery) from your notebook. --project $PROJECT tells BigQuery which GCP Project you're using. data_preview is the name of the Pandas DataFrame to which the results of the query are saved (isn't this very useful?).

  • hacker_news is the name of the BigQuery public dataset and stories is the name of the table residing inside it.
  • Three columns only: url of the article, title of the article, and score of the article. You'll be using the article titles to determine their sources.

You chose to include only those entries where the title is longer than 10 characters, the score is greater than 10, and the URL is non-empty. The query returned 402 MB of data.

Here are the first five rows from the DataFrame data_preview:

The data collection part is now done for the project. At this stage, we are good to proceed to the next steps: cleaning and preprocessing!

Beginning data wrangling

The problem with the current data is that in place of the url, you need the source of the URL. For example, an article hosted on github.com should appear as github. You would also want to rename the url column to source. But before doing that, let's figure out how the titles are distributed across the different sources.

%%bigquery --project $PROJECT source_num_articles
SELECT
  ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
  COUNT(title) AS num_articles
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
  AND LENGTH(title) > 10
GROUP BY source
ORDER BY num_articles DESC

Preview the source_num_articles DataFrame:


BigQuery provides several functions like ARRAY_REVERSE(), REGEXP_EXTRACT(), and so on for tasks like this. The above query first extracts the host from each URL (the part between :// and the next /), splits it on ., and takes the second-to-last label as the domain.
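To make the SQL concrete, here is the same extraction logic sketched in plain Python (illustrative only; the article itself does this in BigQuery):

```python
import re

# Mirror of the SQL: REGEXP_EXTRACT pulls the host out of the URL, SPLIT breaks it
# on '.', and ARRAY_REVERSE(...)[OFFSET(1)] picks the second-to-last label (the domain)
def extract_source(url):
    host = re.search(r'.*://(.[^/]+)/', url).group(1)
    return host.split('.')[::-1][1]

print(extract_source('https://github.com/tensorflow/tensorflow/'))  # github
```

The same call on a nytimes.com URL would yield nytimes, and so on.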

But the project needs different data: a dataset that contains the articles along with their sources. The stories table contains many article sources other than the ones shown above. So, to keep things lightweight, let's go with these five: blogspot, github, techcrunch, youtube, and nytimes.

%%bigquery --project $PROJECT full_data
SELECT source, LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title FROM
  (SELECT
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
    title
  FROM
    `bigquery-public-data.hacker_news.stories`
  WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
    AND LENGTH(title) > 10)
WHERE (source = 'github' OR source = 'nytimes' OR
       source = 'techcrunch' OR source = 'blogspot' OR
       source = 'youtube')

Previewing the full_data DataFrame, you get:


Data understanding is vital for machine learning modeling to work well and to be understood. Let's take some time out and perform some basic EDA.

Data understanding

You will start the process of EDA by investigating the dimensions of the dataset. In this case, the dataset prepared in the above step has 168437 rows and 2 columns, as can be seen in the preview.

The following is the class distribution of the articles:
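The per-source counts can be produced with value_counts(); the snippet below is a sketch using a tiny stand-in DataFrame (in the notebook, full_data comes from the BigQuery query above):

```python
import pandas as pd

# Tiny stand-in for the full_data DataFrame built by the BigQuery query above
full_data = pd.DataFrame({
    'source': ['github', 'nytimes', 'github', 'youtube', 'techcrunch'],
    'title':  ['t1', 't2', 't3', 't4', 't5'],
})

# Number of article titles per source
class_dist = full_data['source'].value_counts()
print(class_dist)
```

On the real dataset, the same call (or a bar plot of it) gives the distribution shown here.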

Fortunately, there are no missing values in the dataset, and the following little snippet confirms it:

# Missing value inspection
full_data.isna().sum()

source 0 
title 0 
dtype: int64

A common question that arises while dealing with text data like this is - how is the length of the titles distributed?

Fortunately, Pandas provides a lot of useful functions to answer questions like this:

full_data['title'].apply(len).describe()


count  168437.000000
mean     46.663174
std     17.080766
min     11.000000
25%     34.000000
50%     46.000000
75%     59.000000
max     138.000000
Name: title, dtype: float64

You have a minimum length of 11 and a maximum length of 138. We will come to this again in a moment.

EDA is incomplete without plots! In this case, a very useful plot could be Count vs. Title lengths:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
%matplotlib inline

text_lens = full_data['title'].apply(len).values
g = sns.distplot(text_lens, kde=False, hist_kws={'rwidth': 1})
g.set_xlabel('Title length')


Almost a bell, isn't it? From the plot, it is evident that the counts are skewed for title lengths < 20 and > 80, so you may have to be careful in tackling them. Let's perform some manual inspections to figure out:

  • How many titles are at the minimum title length (11)?

  • How many titles have the maximum length (138)?

Let’s find out.

(text_lens <= 11).sum(), (text_lens == 138).sum()

(513, 1)

You should be getting 513 and 1, respectively. You will now remove the entry with the maximum title length from the dataset, since it's just one entry:

full_data = full_data[text_lens < 138].reset_index(drop=True)

The last thing you'll do in this step is split the dataset into train/validation/test sets in an 80:10:10 ratio.

# 80% for train
train = full_data.sample(frac=0.8)
full_data.drop(train.index, axis=0, inplace=True)

# 10% for validation

valid = full_data.sample(frac=0.5)
full_data.drop(valid.index, axis=0, inplace=True)

# 10% for test

test = full_data
train.shape, valid.shape, test.shape

((134749, 2), (16844, 2), (16843, 2))

The data dimensions are consistent with the 80:10:10 split. To be a little more certain about the class distribution, you will now verify it across the three sets:
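One way to compare class balance across the splits is with normalized value_counts(); the snippet below is a sketch with tiny stand-in DataFrames in place of the real train/valid/test splits built above:

```python
import pandas as pd

# Stand-ins for the train/valid/test splits created above (illustrative data)
train = pd.DataFrame({'source': ['github', 'github', 'nytimes', 'youtube']})
valid = pd.DataFrame({'source': ['github', 'nytimes']})
test  = pd.DataFrame({'source': ['github', 'youtube']})

# Relative class frequencies per split; roughly matching proportions
# indicate the random split preserved the class balance
dist = {name: df['source'].value_counts(normalize=True)
        for name, df in [('train', train), ('valid', valid), ('test', test)]}
print(dist['train'].round(2))
```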

The distributions are relatively the same across the three sets. Let's serialize these three sets as CSV files.

train.to_csv('data/train.csv', index=False)
valid.to_csv('data/valid.csv', index=False)
test.to_csv('data/test.csv', index=False)

There's still some data preprocessing needed. Since computers only understand numbers, you'll have to prepare the data accordingly to stream it to the machine learning model:

  • Encoding the classes to some numbers (label encoding/one-hot encoding)

  • Creating a vocabulary from the training corpus - tokenization
  • Numericalizing the titles and padding them to a fixed length
  • Preparing the embedding matrix with respect to pre-trained embeddings

Let’s proceed accordingly.

Additional data preprocessing

First, you’ll define the constants that would be necessary here:

# Label encode
CLASSES = {'blogspot': 0, 'github': 1, 'techcrunch': 2, 'nytimes': 3, 'youtube': 4}

# Maximum vocabulary size used for tokenization
TOP_K = 20000

# Sentences will be truncated/padded to this length
MAX_SEQUENCE_LENGTH = 50


Now, you'll define a tiny helper function which takes a Pandas DataFrame and

  • prepares a list of titles from the DataFrame (needed for further preprocessing)
  • takes the sources from the DataFrame, maps them to integers, and returns them as a NumPy array

def return_data(df):
  return list(df['title']), np.array(df['source'].map(CLASSES))

# Apply it to the three splits

train_text, train_labels = return_data(train)
valid_text, valid_labels = return_data(valid)
test_text, test_labels = return_data(test)

print(train_text[0], train_labels[0])

the davos question. what one thing must be done to make the world a better place in 2008 4

The result is as expected.

You’ll use the text and sequence modules provided by tensorflow.keras.preprocessing to tokenize and pad the titles. You’ll start with tokenization:

# TensorFlow imports
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence, text
from tensorflow.keras import models
from tensorflow.keras.layers import Dense, Dropout, Embedding, Conv1D, MaxPooling1D, GlobalAveragePooling1D

# Create a vocabulary from training corpus
tokenizer = text.Tokenizer(num_words=TOP_K)
tokenizer.fit_on_texts(train_text)
word_index = tokenizer.word_index

You'll be using the GloVe embeddings to represent the words in the titles with a dense representation. The embeddings file is more than 650 MB, and the GCP team has it stored in a Google Storage Bucket. This is incredibly helpful since it allows you to use the file directly in the notebook at a very fast speed. You'll use the gsutil command (available in the Notebooks) for this.

!gsutil cp gs://cloud-training-demos/courses/machine_learning/deepdive/09_sequence/text_classification/glove.6B.200d.txt glove.6B.200d.txt

You will need a helper function to map the words in the titles to the GloVe embeddings.

def get_embedding_matrix(word_index, embedding_path, embedding_dim):
  embedding_matrix_all = {}
  with open(embedding_path) as f:
    for line in f:  # Every line contains a word followed by its vector values
      values = line.split()
      word = values[0]
      coefs = np.asarray(values[1:], dtype='float32')
      embedding_matrix_all[word] = coefs

  # Prepare embedding matrix with just the words in our word_index dictionary
  num_words = min(len(word_index) + 1, TOP_K)
  embedding_matrix = np.zeros((num_words, embedding_dim))
  for word, i in word_index.items():
    if i >= TOP_K:
      continue
    embedding_vector = embedding_matrix_all.get(word)
    if embedding_vector is not None:
      # Words not found in the embedding index will remain all-zeros.
      embedding_matrix[i] = embedding_vector

  return embedding_matrix

This is all you will need to stream the text data to the yet-to-be-built machine learning model.

Building the Horcrux: A sequential language model

Let’s specify a couple of hyperparameter values towards the very beginning of the modeling process.

# Specify the hyperparameters
embedding_path = 'glove.6B.200d.txt'
embedding_dim = 200  # matches the 200d GloVe file above
filters = 64         # base number of convolution filters (a representative choice)

You'll be using a Convolutional Neural Network based model which starts by convolving over the embeddings fed to it. Locality is important in sequential data, and CNNs allow you to capture it effectively. The trick is to do all the fundamental CNN operations (convolution, pooling) in 1D.
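To see what "convolving in 1D" means, here is a minimal NumPy sketch (one filter over one feature dimension; real Conv1D layers apply many filters across all embedding dimensions):

```python
import numpy as np

# A 1D convolution slides a length-k window over the sequence and emits one
# value per position ('valid' positions only in this sketch)
def conv1d(seq, kernel):
    k = len(kernel)
    return np.array([np.dot(seq[i:i + k], kernel)
                     for i in range(len(seq) - k + 1)])

seq = np.array([1., 2., 3., 4., 5.])   # a toy 1D signal
kernel = np.array([1., 0., -1.])       # one filter with kernel_size 3
print(conv1d(seq, kernel))             # [-2. -2. -2.]
```

Each output value summarizes a local window of the input, which is how the model picks up short phrases in titles.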

You’ll be following the typical Keras paradigm - you’ll first instantiate the model, then you’ll define the topology and compile the model accordingly.

# Create model instance
model = models.Sequential()
num_features = min(len(word_index) + 1, TOP_K)

# Add embedding layer - GloVe embeddings
model.add(Embedding(num_features, embedding_dim,
                    weights=[get_embedding_matrix(word_index,
                             embedding_path, embedding_dim)],
                    trainable=True))

# Two 1D convolution blocks, global pooling, then the classification head
model.add(Conv1D(filters=filters, kernel_size=3, activation='relu', padding='same'))
model.add(MaxPooling1D(pool_size=3))
model.add(Conv1D(filters=filters * 2, kernel_size=3, activation='relu', padding='same'))
model.add(GlobalAveragePooling1D())
model.add(Dropout(0.2))
model.add(Dense(len(CLASSES), activation='softmax'))

(The kernel sizes, pooling size, and dropout rate above are representative choices; tune them to taste.)

Compile model with learning parameters.

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['acc'])

The architecture looks like so:

One more step remains at this point: numericalizing the titles and padding them to a fixed length.

# Preprocess the train, validation and test sets

# Tokenize and pad sentences

preproc_train = tokenizer.texts_to_sequences(train_text)
preproc_train = sequence.pad_sequences(preproc_train, maxlen=MAX_SEQUENCE_LENGTH)

preproc_valid = tokenizer.texts_to_sequences(valid_text)
preproc_valid = sequence.pad_sequences(preproc_valid, maxlen=MAX_SEQUENCE_LENGTH)

preproc_test = tokenizer.texts_to_sequences(test_text)
preproc_test = sequence.pad_sequences(preproc_test, maxlen=MAX_SEQUENCE_LENGTH)
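For intuition, pad_sequences (with its default 'pre' padding and truncation) behaves like this small sketch (a simplified stand-in, not the Keras implementation):

```python
# Simplified stand-in for keras pad_sequences with padding='pre', truncating='pre'
def pad(seq, maxlen, value=0):
    seq = seq[-maxlen:]                         # 'pre' truncation drops ids from the front
    return [value] * (maxlen - len(seq)) + seq  # 'pre' padding adds zeros at the front

print(pad([1, 2], 3), pad([1, 2, 3, 4], 3))     # [0, 1, 2] [2, 3, 4]
```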

And finally, you’re prepared to kickstart the training process!

H = model.fit(preproc_train, train_labels,
              validation_data=(preproc_valid, valid_labels),
              epochs=10, batch_size=128)  # epochs/batch size assumed; tune as needed

Here’s a snap of the training log:

The network does overfit and the training graph also confirms it:
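Since the graph itself isn't reproduced here, this is roughly how you would plot it from the History object H; the numbers below are made-up stand-ins shaped like H.history:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt

# Stand-in for H.history with illustrative (made-up) values
history = {'acc': [0.55, 0.68, 0.75, 0.81],
           'val_acc': [0.58, 0.63, 0.65, 0.66]}

epochs = range(1, len(history['acc']) + 1)
plt.plot(epochs, history['acc'], label='train acc')
plt.plot(epochs, history['val_acc'], label='val acc')
plt.xlabel('epoch'); plt.ylabel('accuracy'); plt.legend()
plt.savefig('training_curve.png')
```

A widening gap between the training and validation curves is the overfitting signature.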

Overall, the model yields an accuracy of ~66%, which is not up to the mark given current state-of-the-art results, but it is a good start. Let's now write a little function that uses the network to predict on individual samples:

# Helper function to test on single samples
def test_on_single_sample(text):
  category = None
  text_tokenized = tokenizer.texts_to_sequences(text)
  text_tokenized = sequence.pad_sequences(text_tokenized, maxlen=50)
  prediction = int(model.predict_classes(text_tokenized))
  for key, value in CLASSES.items():
    if value == prediction:
      category = key

  return category

Prepare the samples accordingly:

# Prepare the samples
github = ['Invaders game in 512 bytes']
nytimes = ['Michael Bloomberg Promises $500M to Help End Coal']
techcrunch = ['Facebook plans June 18th cryptocurrency debut']
blogspot = ['Android Security: A walk-through of SELinux']

Finally, call test_on_single_sample() on the above samples:

for sample in [github, nytimes, techcrunch, blogspot]:
  print(test_on_single_sample(sample))


That was it for this project. In the next section, you'll find comments on future directions for this project along with some references.

Future directions and references

Just like in the computer vision domain, we expect models that understand the domain to be robust against certain transformations like rotation and translation. In the sequence domain, it's important that models be robust to changes in the length of the pattern. Keeping that in mind, here's a list of what I would try in the near future:

  • Try other sequence models
  • A bit of hyperparameter tuning
  • Learn the embeddings from scratch
  • Try different embeddings like universal sentence encoder, nnlm-128 and so on

After I have a decent model (with at least ~80% accuracy), I plan to serve the model as a REST API and deploy it on AppEngine.

Thank you for taking the time to read this article. I hope this tutorial helps you!

Originally published on

#python #tensorflow #machine-learning #database #cloud

최 호민


Free Python Coding Course (Practical Series 6) - Image Processing: Overlaying Characters Using Face Recognition

This is a free Python course (Practical Series 6 - Image Processing).
It covers a variety of image processing techniques using OpenCV, along with fun projects.
It was made to be easy and enjoyable for anyone to follow. ^^

(0:00:00) 0. Intro
(0:00:31) 1. Introduction
(0:02:18) 2. Overview of Practical Series 6: Image Processing

[OpenCV First Half]
(0:04:36) 3. Environment setup
(0:08:41) 4. Displaying images
(0:21:51) 5. Playing video #1: from a file
(0:29:58) 6. Playing video #2: from a camera
(0:34:23) 7. Drawing shapes #1: blank canvas
(0:39:49) 8. Drawing shapes #2: filling areas
(0:42:26) 9. Drawing shapes #3: straight lines
(0:51:23) 10. Drawing shapes #4: circles
(0:55:09) 11. Drawing shapes #5: rectangles
(0:58:32) 12. Drawing shapes #6: polygons
(1:09:23) 13. Text #1: basics
(1:17:45) 14. Text #2: workaround for Korean text
(1:24:14) 15. Saving files #1: images
(1:29:27) 16. Saving files #2: video
(1:39:29) 17. Resizing
(1:50:16) 18. Cropping images
(1:57:03) 19. Flipping images
(2:01:46) 20. Rotating images
(2:06:07) 21. Image transformations: grayscale
(2:11:25) 22. Image transformations: blur
(2:18:03) 23. Image transformations: perspective #1
(2:27:45) 24. Image transformations: perspective #2

[Semi-Automatic Document Scanner Project]
(2:32:50) 25. Mini project 1 - #1: registering mouse events
(2:42:06) 26. Mini project 1 - #2: completing the basic code
(2:49:54) 27. Mini project 1 - #3: drawing lines between points
(2:55:24) 28. Mini project 1 - #4: drawing lines in real time

[OpenCV Second Half]
(3:01:52) 29. Image transformations: binarization #1: Trackbar
(3:14:37) 30. Image transformations: binarization #2: threshold values
(3:20:26) 31. Image transformations: binarization #3: Adaptive Threshold
(3:28:34) 32. Image transformations: binarization #4: Otsu's algorithm
(3:32:22) 33. Image transformations: dilation
(3:41:10) 34. Image transformations: erosion
(3:45:56) 35. Image transformations: opening & closing
(3:54:10) 36. Image detection: edges
(4:05:08) 37. Image detection: contours #1: basics
(4:15:26) 38. Image detection: contours #2: retrieval modes
(4:20:46) 39. Image detection: contours #3: area

[Card Detection & Classifier Project]
(4:27:42) 40. Mini project 2

(4:31:57) 41. Quiz

[Face Recognition Project]
(4:41:25) 42. Environment setup and basic code
(4:54:48) 43. Detecting eyes and nose and drawing shapes on them
(5:10:42) 44. Overlaying a drawn image
(5:20:52) 45. Overlaying a character image
(5:33:10) 46. Additional explanations
(5:40:53) 47. Wrapping up (further learning resources)
(5:42:18) 48. Outro

Links to the images and videos used in the lessons:

Cat image:
Size: 640 x 390
Filename: img.jpg

Cat video:
Size: SD (360 x 640)
Filename: video.mp4

Newspaper image:
Size: 1280 x 853
Filename: newspaper.jpg

Card image 1:
Size: 1280 x 1019
Filename: poker.jpg

Book image:
Size: Small (640 x 853)
Filename: book.jpg

Snowman image:
Size: 1280 x 904
Filename: snowman.png

Card image 2:
Size: 640 x 408
Filename: card.png

Quiz video:
Size: HD (1280 x 720)
Filename: city.mp4

Project video:
Size: Full HD (1920 x 1080)
Filename: face_video.mp4

Project character images:
Filenames: right_eye.png (100 x 100), left_eye.png (100 x 100), nose.png (300 x 100)

Free image editing tool:
(Pixlr E - Advanced Editor)

#python #opencv 

Semantic Similarity Framework for Knowledge Graph


Sematch is an integrated framework for the development, evaluation, and application of semantic similarity for Knowledge Graphs (KGs). It is easy to use Sematch to compute semantic similarity scores of concepts, words, and entities. Sematch focuses on knowledge-based semantic similarity metrics that rely on structural knowledge in a taxonomy (e.g. depth, path length, least common subsumer) and statistical information content (corpus-IC and graph-IC). Knowledge-based approaches differ from their corpus-based counterparts, which rely on co-occurrence (e.g. Pointwise Mutual Information) or distributional similarity (Latent Semantic Analysis, Word2Vec, GloVe, etc.). Knowledge-based approaches are usually used for structural KGs, while corpus-based approaches are normally applied to textual corpora.

In text analysis applications, a common pipeline is adopted in using semantic similarity from the concept level to the word and sentence levels. For example, word similarity is first computed based on similarity scores of WordNet concepts, and sentence similarity is computed by composing word similarity scores. Finally, document similarity can be computed by identifying important sentences, e.g. with TextRank.
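The composition step can be sketched as follows (a toy best-alignment average; the word_sim function here is a hypothetical stand-in, not Sematch's API):

```python
# Compose word-level similarity into sentence-level similarity by aligning each
# word of s1 with its most similar word in s2 and averaging the scores
def sentence_similarity(s1, s2, word_sim):
    scores = [max(word_sim(w1, w2) for w2 in s2) for w1 in s1]
    return sum(scores) / len(scores)

toy_sim = lambda a, b: 1.0 if a == b else 0.5  # hypothetical word similarity
print(sentence_similarity(['dog', 'runs'], ['cat', 'runs'], toy_sim))  # 0.75
```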


KG-based applications follow a similar pipeline in using semantic similarity, going from concept similarity to entity similarity. Furthermore, in computing document similarity, entities are extracted and document similarity is computed by composing entity similarity scores.


In KGs, concepts usually denote ontology classes while entities refer to ontology instances. Moreover, concepts are usually organized into hierarchical taxonomies, such as the DBpedia ontology classes, so quantifying concept similarity in a KG relies on similar semantic information (e.g. path length, depth, least common subsumer, information content) and semantic similarity metrics (e.g. Path, Wu & Palmer, Li, Resnik, Lin, Jiang & Conrath, and WPath). In consequence, Sematch provides an integrated framework to develop and evaluate semantic similarity metrics for concepts, words, entities, and their applications.
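As a rough illustration of two of the structural metrics named above (toy depths and path lengths, not Sematch's implementation):

```python
# Path similarity: inverse of (1 + shortest path length) between two concepts
def path_similarity(shortest_path_len):
    return 1.0 / (1.0 + shortest_path_len)

# Wu & Palmer: scaled depth of the least common subsumer (LCS)
def wup_similarity(depth_lcs, depth_a, depth_b):
    return 2.0 * depth_lcs / (depth_a + depth_b)

# Toy taxonomy numbers: two sibling concepts two edges apart whose LCS
# sits at depth 12 while both concepts sit at depth 13
print(round(path_similarity(2), 3))          # 0.333
print(round(wup_similarity(12, 13, 13), 3))  # 0.923
```

Deeper, closer concept pairs score higher under both metrics, which is the intuition the taxonomy-based metrics share.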

Getting started: 20 minutes to Sematch

Install Sematch

You need to install scientific computing libraries numpy and scipy first. An example of installing them with pip is shown below.

pip install numpy scipy

Depending on your OS, you can use different ways to install them. After successful installation of numpy and scipy, you can install Sematch with the following commands.

pip install sematch
python -m

Alternatively, you can clone and install the development version of Sematch with setuptools. We recommend updating your pip and setuptools first.

git clone
cd sematch
python install

We also provide a Sematch-Demo Server. You can use it for experimenting with main functionalities or take it as an example for using Sematch to develop applications. Please check our Documentation for more details.

Computing Word Similarity

The core module of Sematch measures semantic similarity between concepts that are represented in concept taxonomies. Word similarity is computed based on the maximum semantic similarity of the corresponding WordNet concepts. You can use Sematch to compute multilingual word similarity based on WordNet with a variety of semantic similarity metrics.

from sematch.semantic.similarity import WordNetSimilarity
wns = WordNetSimilarity()

# Computing English word similarity using Li method
wns.word_similarity('dog', 'cat', 'li') # 0.449327301063
# Computing Spanish word similarity using Lin method
wns.monol_word_similarity('perro', 'gato', 'spa', 'lin') #0.876800984373
# Computing Chinese word similarity using  Wu & Palmer method
wns.monol_word_similarity('狗', '猫', 'cmn', 'wup') # 0.857142857143
# Computing Spanish and English word similarity using Resnik method
wns.crossl_word_similarity('perro', 'cat', 'spa', 'eng', 'res') #7.91166650904
# Computing Spanish and Chinese word similarity using Jiang & Conrath method
wns.crossl_word_similarity('perro', '猫', 'spa', 'cmn', 'jcn') #0.31023804699
# Computing Chinese and English word similarity using WPath method
wns.crossl_word_similarity('狗', 'cat', 'cmn', 'eng', 'wpath')#0.593666388463

Computing semantic similarity of YAGO concepts.

from sematch.semantic.similarity import YagoTypeSimilarity
sim = YagoTypeSimilarity()

#Measuring YAGO concept similarity through WordNet taxonomy and corpus based information content
sim.yago_similarity('','', 'wpath') #0.642
sim.yago_similarity('','', 'wpath') #0.544
#Measuring YAGO concept similarity based on graph-based IC
sim.yago_similarity('','', 'wpath_graph') #0.423
sim.yago_similarity('','', 'wpath_graph') #0.328

Computing semantic similarity of DBpedia concepts.

from sematch.semantic.graph import DBpediaDataTransform, Taxonomy
from sematch.semantic.similarity import ConceptSimilarity
concept = ConceptSimilarity(Taxonomy(DBpediaDataTransform()),'models/dbpedia_type_ic.txt')
concept.similarity('','', 'path')
concept.similarity('','', 'wup')
concept.similarity('','', 'li')
concept.similarity('','', 'res')
concept.similarity('','', 'lin')
concept.similarity('','', 'jcn')
concept.similarity('','', 'wpath')

Computing semantic similarity of DBpedia entities.

from sematch.semantic.similarity import EntitySimilarity
sim = EntitySimilarity()
sim.similarity('','') #0.409923677282

Evaluate semantic similarity metrics with word similarity datasets

from sematch.evaluation import WordSimEvaluation
from sematch.semantic.similarity import WordNetSimilarity
evaluation = WordSimEvaluation()
wns = WordNetSimilarity()
# define similarity metrics
wpath = lambda x, y: wns.word_similarity_wpath(x, y, 0.8)
# evaluate similarity metrics with SimLex dataset
evaluation.evaluate_metric('wpath', wpath, 'noun_simlex')
# perform Steiger's Z significance test
evaluation.statistical_test('wpath', 'path', 'noun_simlex')
# define similarity metrics for Spanish words
wpath_es = lambda x, y: wns.monol_word_similarity(x, y, 'spa', 'path')
# define cross-lingual similarity metrics for English-Spanish
wpath_en_es = lambda x, y: wns.crossl_word_similarity(x, y, 'eng', 'spa', 'wpath')
# evaluate metrics in multilingual word similarity datasets
evaluation.evaluate_metric('wpath_es', wpath_es, 'rg65_spanish')
evaluation.evaluate_metric('wpath_en_es', wpath_en_es, 'rg65_EN-ES')

Evaluate semantic similarity metrics with category classification

Although word similarity correlation is the standard way to evaluate semantic similarity metrics, it relies on human judgements over word pairs, which may not reflect performance in real applications. Therefore, apart from word similarity evaluation, the Sematch evaluation framework also includes a simple aspect category classification task. The task classifies noun concepts such as pasta, noodle, steak, and tea into their ontological parent concepts FOOD and DRINKS.

from sematch.evaluation import AspectEvaluation
from sematch.application import SimClassifier, SimSVMClassifier
from sematch.semantic.similarity import WordNetSimilarity

# create aspect classification evaluation
evaluation = AspectEvaluation()
# load the dataset
X, y = evaluation.load_dataset()
# define word similarity function
wns = WordNetSimilarity()
word_sim = lambda x, y: wns.word_similarity(x, y)
# Train and evaluate metrics with unsupervised classification model
simclassifier = SimClassifier.train(zip(X,y), word_sim)
evaluation.evaluate(X,y, simclassifier)

macro average:  (0.65319812882333839, 0.7101245049198579, 0.66317566364913016, None)
micro average:  (0.79210167952791644, 0.79210167952791644, 0.79210167952791644, None)
weighted average:  (0.80842645056024054, 0.79210167952791644, 0.79639496616636352, None)
accuracy:  0.792101679528
             precision    recall  f1-score   support

    SERVICE       0.50      0.43      0.46       519
 RESTAURANT       0.81      0.66      0.73       228
       FOOD       0.95      0.87      0.91      2256
   LOCATION       0.26      0.67      0.37        54
   AMBIENCE       0.60      0.70      0.65       597
     DRINKS       0.81      0.93      0.87       752

avg / total       0.81      0.79      0.80      4406

Matching Entities with type using SPARQL queries

You can use Sematch to download a list of entities having a specific type, using different languages. Sematch will generate SPARQL queries and execute them on the DBpedia SPARQL endpoint.

from sematch.application import Matcher
matcher = Matcher()
# matching scientist entities from DBpedia
matcher.match_type('científico', 'spa')
matcher.match_type('科学家', 'cmn')
matcher.match_entity_type('movies with Tom Cruise')

Example of automatically generated SPARQL query.

SELECT DISTINCT ?s, ?label, ?abstract WHERE {
    ?s <> <> . }
 UNION {  
    ?s <> <> . }
 UNION {  
    ?s <> <> . }
 UNION {  
    ?s <> <> . }
 UNION {  
    ?s <> <> . } 
    ?s <> <> . 
    ?s <> ?label . 
    FILTER( lang(?label) = "en") . 
    ?s <> ?abstract . 
    FILTER( lang(?abstract) = "en") .
} LIMIT 5000

Entity feature extraction with Similarity Graph

Apart from semantic matching of entities from DBpedia, you can also use Sematch to extract features of entities and apply semantic similarity analysis using graph-based ranking algorithms. Given a list of objects (concepts, words, entities), Sematch computes their pairwise semantic similarity and generates a similarity graph where nodes denote objects and edges denote similarity scores. Below is an example of using a similarity graph to extract important words from an entity description.

from sematch.semantic.graph import SimGraph
from sematch.semantic.similarity import WordNetSimilarity
from sematch.nlp import Extraction, word_process
from sematch.semantic.sparql import EntityFeatures
from collections import Counter
tom = EntityFeatures().features('')
words = Extraction().extract_nouns(tom['abstract'])
words = word_process(words)
wns = WordNetSimilarity()
word_graph = SimGraph(words, wns.word_similarity)
word_scores = word_graph.page_rank()
words, scores =zip(*Counter(word_scores).most_common(10))
print(words)
(u'picture', u'action', u'number', u'film', u'post', u'sport', 
u'program', u'men', u'performance', u'motion')


Publications

Ganggao Zhu, and Carlos A. Iglesias. "Computing Semantic Similarity of Concepts in Knowledge Graphs." IEEE Transactions on Knowledge and Data Engineering 29.1 (2017): 72-85.

Oscar Araque, Ganggao Zhu, Manuel Garcia-Amado and Carlos A. Iglesias Mining the Opinionated Web: Classification and Detection of Aspect Contexts for Aspect Based Sentiment Analysis, ICDM sentire, 2016.

Ganggao Zhu, and Carlos Angel Iglesias. "Sematch: Semantic Entity Search from Knowledge Graph." SumPre-HSWI@ ESWC. 2015.


You can post bug reports and feature requests in GitHub issues. Make sure to read our guidelines first. This project is still under active development, approaching its goals. The project is mainly maintained by Ganggao Zhu. You can contact him via gzhu [at]

Why this name, Sematch and Logo?

The name Sematch is composed of the Spanish "se" and the English "match". It is also an abbreviation of semantic matching, because semantic similarity metrics help to determine the semantic distance between concepts, words, and entities, instead of exact matching.

The logo of Sematch is based on the Chinese Yin and Yang symbol, which appears in the I Ching. Somehow, it correlates to 0 and 1 in computer science.

Author: Gsi-upm
Source Code: 
License: View license

#python #jupyternotebook #graph 

Shardul Bhatt


Python for Freight Forwarding: Proven Case Study for Logistics Company

Python is a popular web development language for enterprise and customer-centric applications. It is one of the top programming languages according to the TIOBE index. It has applications in web development, Machine Learning, Data Science, and other domains. The versatility of Python makes it a strong choice for applications in almost every project.

Amidst the hundreds of languages for web application development, Python stands out. It is powerful, scalable, and easy to learn. Python's capabilities are useful in every sector: technology, FinTech, HealthTech, the freight forwarding industry, and more. The core functionality of Python takes care of the programming tasks for every feature that needs to be added.

In this article, we will focus on the major aspects of Python that make it suitable for web applications of all kinds. We will then highlight the proficiency of Python using a proven case study that Python developers at BoTree have built. It is a freight forwarding software for international logistics service provider that uses Python in the main technology stack.

Checkout Top 10 real-world Python Use Cases and Applications

Let’s look at the case study and capabilities of Python in detail.

Why choose Python for Web Development

Python is now a first choice for web development. Unlike Ruby on Rails, it offers more flexibility in the process. Here are a few reasons why companies should choose Python for web development -

  • Readable: Python has an easily readable syntax, similar to the English language. Python developers admire the language because it is easy to read, write, and understand: you don't have to write additional code to express concepts. This emphasis on code readability makes the code easier to maintain and update.
  • Multi-programming paradigms: Like other object-oriented and open-source programming languages, Python supports multiple programming paradigms. There's a dynamic type system and automatic memory management, which simplifies building large and complex enterprise-scale applications.
  • Scalable: Python is highly scalable. Because of its built-in capabilities to minimize errors during development, it is well suited to freight forwarding software that must process bills at a huge scale. It is also suitable for enterprise dashboards and other applications that need to handle massive numbers of server requests at once.
  • Versatile: Python is a heavily versatile programming language, with applications in domains including statistical analysis, numerical computation, data analytics, and more. Companies can use it for web development or Machine Learning applications. Today, Python plays a crucial role in building data science models and intelligent algorithms.
  • Libraries: One of the biggest reasons to choose Python is its library ecosystem. Python has libraries for almost everything: TensorFlow, Selenium, Apache Spark, Requests, Theano, PyTorch, and many more. The libraries make it simple to add functionality and build high-quality web applications.

Check out Top Python Libraries for Data Science to use in 2020

As Python grows in popularity, its community grows too. Python has one of the largest developer communities of any programming language, providing help with development problems, support, and training across projects.

Let’s look at a proven case study by BoTree Technologies that showcases Python’s capabilities in web development.

Python: Proven Case Study of a Logistics Company

At BoTree, we use Python development services to build dynamic web applications. Today we will discuss a case study from the freight forwarding services industry, which we developed using Python and other technologies. Let's understand it better.

About the Case Study

We designed the freight forwarding software for a leading international logistics services provider. The system we created collects information from different freight forwarding websites using the bill of lading or the container number. The information is then entered into the centralized system automatically for better management of the freight.

The main challenge was the manual processing of bills of lading. The information had to be gathered from a large number of websites, each with thousands of bills. The manual process was lengthy and time-consuming. Because the freight forwarding companies were based in different geographical locations, the client also faced language barriers while processing the B/L.

Our Technology Stack

The technology stack to add freight forwarding features was simple and powerful. We used Python, PostgreSQL, AWS SQS, EC2, Puppeteer, and a Virtual Private Cloud. We offered web development, software testing, and continuous support and maintenance.

The technology stack we used was focused on simplifying the complications in the freight forwarding system. Because the solution had to be scalable, Python was the natural choice for building the web application.

Our Solution

We built a fully serverless architecture. It maps the websites and analyzes the different fields to extract the details required for freight forwarding.

The solution parses data from different websites and matches the fields with the required information. It also takes into account previously parsed data for making the decision.

The collected information is arranged into a structured format. The data is then pushed back to a centralized ERP system. All the data is accumulated in a single place, making it easier to process the B/L without any hassle.
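As a rough illustration of this parsing step (a hypothetical sketch; the field names and the `normalize_bl_record` helper below are illustrative, not from BoTree's actual system), matching scraped fields against a canonical ERP schema might look like:

```python
# Hypothetical field-matching step: map raw scraped fields from a
# carrier website onto canonical ERP field names. Alias lists and
# names are invented for illustration.
FIELD_ALIASES = {
    "bl_number": ["b/l no", "bill of lading", "bl number"],
    "container_number": ["container no", "cntr no"],
    "port_of_loading": ["pol", "port of loading"],
}

def normalize_bl_record(raw: dict) -> dict:
    """Match raw website fields to canonical ERP field names."""
    normalized = {}
    for canonical, aliases in FIELD_ALIASES.items():
        for key, value in raw.items():
            # compare case-insensitively, ignoring surrounding spaces
            if key.strip().lower() in aliases:
                normalized[canonical] = value.strip()
    return normalized

record = normalize_bl_record({"B/L No": "MSCU1234567", "POL": "Shanghai"})
print(record)
```

A real pipeline would add per-carrier alias tables and validation, but the idea is the same: every site's layout is reduced to one shared schema before the data reaches the ERP.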

The freight forwarding solution consisted of the following features built using Python -

Core Features

  • B/L Processing: The system could easily parse 15,000 B/L in a single day.
  • Efficient delivery: B/L processing became 30% more efficient.
  • Activity log maintenance: There's a proper record of all the activities that take place in the system.
  • Multiple languages: The freight forwarding software could easily parse B/L in different languages.


Python is a powerful programming language for enterprise-grade applications. Logistics companies heavily benefit from investing in freight forwarding solutions. Shipping systems are essential for managing the timely delivery of products and services. An internal system for B/L processing can enable you to reap the benefits of swift deliveries.

BoTree Technologies is a custom software development company whose Python experts build quality applications for enterprises. We have experience in logistics, healthcare, fintech, education, and multiple other industries.

Connect with us today for a FREE CONSULTATION in the next 24 hours!

Originally published at on May 11, 2021.

#python case study for logistics company #b/l processing system #freight forwarding case study #logistics case study #case study for logistics company #python web development

Ashish parmar



Case study on mobile app; DreamG

The Dream-G application allows users to chat and make voice and video calls to random people through the mobile application. Users can create a profile and perform all these actions, in addition to searching for a person by name.

Client Requirement
The client came with the requirement of developing a unique mobile application that lets users chat with others and make voice and video calls. Furthermore, users should be able to subscribe to a plan by paying a certain amount.

App Features and Functionalities
The user can see the list of people, view the profile of a particular person, and chat, voice call, or video call them.
The user can see the list of entertainers and chat, voice call, or video call them.
The user can search for any person by entering their name.
Through the chat option, the user can see the past history of chats with all users. The user can also open any chat and send messages again.
The user can see their profile details and edit or modify the profile photo, name, and other details. The user can see the call log details.
The user can see the number of coins available to them; through these coins, the user can make voice and video calls.
The user can purchase a plan listed in the application according to their requirements and will then be able to chat with people.
The user can refer the mobile application to other people and earn reward coins.

The goal was to create a unique user experience for chat, voice, and video calls.

Technical Specification & Implementation
Integration with the payment gateway
Android: Android Studio with Java
We successfully developed and implemented the Dream-G mobile application, through which users can chat, voice call, and video call other people. Users can also purchase a subscription plan and refer the application to other people.

Read more:

#case #study #case-study-on-mobile-app #mobile-app-case-study

伊藤 直子


[For Beginners] An Overview of Multithreading in C#

Working in New York and talking with programmers all over Wall Street, I noticed a common thread of knowledge expected in most real-time programming applications. That knowledge is known as multithreading. Having moved through the programming world and interviewed potential programming candidates, I am never surprised at how little is known about multithreading, or about why and how threading is applied. MSDN tried to address this issue in a series of excellent articles written by Vance Morrison (see "What Every Developer Must Know About Multithreaded Apps" in the August issue of MSDN, and "Understanding the Impact of Low-Lock Techniques in Multithreaded Apps" in the October issue).

When beginner programmers first learn about threading, they can become fascinated by the possibilities of using threads in their programs. They may, in fact, become thread-happy. Let me explain in detail:

Day 1) The programmer learns that threads can be spawned, and starts creating a new thread in the program. Cool!

Sound familiar? Most people who have attempted to design a multithreaded program for the first time, even those with good knowledge of thread design, have probably experienced at least one or two of these daily bullet points. I am not implying that threading is a bad thing; just be very careful in the process of making threading efficient in your program. Unlike a single-threaded program, you are handling many processes at the same time, and keeping track of multiple processes with multiple dependent variables can be very difficult. Think of multithreading like juggling. Juggling one ball in your hand is fairly easy (if a bit boring). But if you are challenged to keep two of those balls in the air, the task gets a little harder. With three, four, and five balls, it becomes progressively more difficult. As the number of balls increases, you are actually more likely to drop one. Juggling many balls at once requires knowledge, skill, and precise timing. So does multithreading.

// shared memory variable between the two threads
// used to indicate which thread we are in
private string _threadOutput = "";
// flag used to stop both display loops
private bool _stopThreads = false;

/// <summary>
/// Thread 1: Loop continuously,
/// Thread 1: Displays that we are in thread 1
/// </summary>
void DisplayThread1()
{
      while (_stopThreads == false)
      {
            Console.WriteLine("Display Thread 1");
            // Assign the shared memory to a message about thread #1
            _threadOutput = "Hello Thread1";
            Thread.Sleep(1000);  // simulate a lot of processing
            // tell the user we are in thread #1, and display shared memory
            Console.WriteLine("Thread 1 Output --> {0}", _threadOutput);
      }
}

/// <summary>
/// Thread 2: Loop continuously,
/// Thread 2: Displays that we are in thread 2
/// </summary>
void DisplayThread2()
{
      while (_stopThreads == false)
      {
            Console.WriteLine("Display Thread 2");
            // Assign the shared memory to a message about thread #2
            _threadOutput = "Hello Thread2";
            Thread.Sleep(1000);  // simulate a lot of processing
            // tell the user we are in thread #2
            Console.WriteLine("Thread 2 Output --> {0}", _threadOutput);
      }
}

void StartThreads()
{
      // construct two threads for our demonstration
      Thread thread1 = new Thread(new ThreadStart(DisplayThread1));
      Thread thread2 = new Thread(new ThreadStart(DisplayThread2));
      // start them
      thread1.Start();
      thread2.Start();
}

Sometimes you will see Thread 2 Output --> Hello Thread1, and Thread 1 Output --> Hello Thread2. The thread output does not match the code. You look at the code and trace it with your eyes: _threadOutput = "Hello Thread2", Sleep, Write "Thread 2 Output --> Hello Thread2"; yet this sequence does not necessarily produce that final result.


This happens because, in a multithreaded program like this, the code is in theory running the two methods DisplayThread1 and DisplayThread2 simultaneously. Each method shares the variable _threadOutput. So although _threadOutput is assigned the value "Hello Thread1" in thread #1 and displayed on the console two lines later, somewhere between the time thread #1 assigns it and displays it, thread #2 may assign _threadOutput the value "Hello Thread2". Not only can these strange results occur, they occur very frequently, as seen in the output shown in Figure 2. This painful threading problem is a very common bug in thread programming known as a race condition. This example is a very simple instance of a well-known threading problem. The problem could be hidden much more indirectly from the programmer, through referenced variables, collections pointing at variables that are not thread-safe, and so on. In Figure 2 the symptom is blatant, but a race condition can also appear only rarely, showing up intermittently once a minute, once an hour, or after three days. Races are perhaps a programmer's worst nightmare because of their low frequency and because they are so hard to reproduce.
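The unlucky interleaving described above can be reproduced deterministically in Python (a sketch, not from the original article: events stand in for the scheduler, so the overwrite always happens between thread 1's write and its read):

```python
import threading

# Force the racy interleaving: thread 1 writes the shared variable,
# thread 2 overwrites it before thread 1 reads it back.
_thread_output = ""
t1_wrote = threading.Event()
t2_wrote = threading.Event()
observed = []

def thread1():
    global _thread_output
    _thread_output = "Hello Thread1"
    t1_wrote.set()          # let thread 2 sneak in here
    t2_wrote.wait()         # wait until thread 2 has overwritten
    observed.append("Thread 1 Output --> " + _thread_output)

def thread2():
    global _thread_output
    t1_wrote.wait()
    _thread_output = "Hello Thread2"   # overwrites thread 1's value
    t2_wrote.set()

a = threading.Thread(target=thread1)
b = threading.Thread(target=thread2)
a.start(); b.start()
a.join(); b.join()

print(observed[0])  # Thread 1 Output --> Hello Thread2
```

In real code no events force this schedule; the operating system's scheduler occasionally produces it on its own, which is exactly why the bug appears intermittently.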


The best way to avoid race conditions is to write thread-safe code. If your code is thread-safe, you can prevent several nasty threading problems from occurring. There are several defenses for writing thread-safe code. One is to share memory as little as possible. If you create an instance of a class and it runs on one thread, then create another instance of the same class and it runs on another thread, the classes are thread-safe as long as they contain no static variables: each instance creates its own memory for its own fields, so there is no shared memory. If the class has static variables, or if an instance of the class is shared by multiple other threads, you must find a way to keep one thread from using that memory until the other thread is finished with it. That mechanism is a lock. C# lets you lock code using either the Monitor class or the lock {} construct. (The lock construct actually implements the Monitor class internally via a try-finally block, but hides those details from the programmer.) In the example in Listing 1, we can lock the section of code from the point where we set the shared _threadOutput variable through the actual output to the console. We lock the critical section in both threads so that neither can race against the other. The quickest and dirtiest way to lock inside a method is to lock on the this pointer. Locking on the this pointer locks the whole class instance, so any thread trying to modify a field of the class inside the lock will be blocked. Blocking means that the thread trying to modify the variable waits until the lock on the locking thread is released. A thread releases the lock when it reaches the closing bracket of the lock {} construct.


/// <summary>
/// Thread 1, Displays that we are in thread 1 (locked)
/// </summary>
void DisplayThread1()
{
      while (_stopThreads == false)
      {
            // lock on the current instance of the class for thread #1
            lock (this)
            {
                  Console.WriteLine("Display Thread 1");
                  _threadOutput = "Hello Thread1";
                  Thread.Sleep(1000);  // simulate a lot of processing
                  // tell the user we are in thread #1
                  Console.WriteLine("Thread 1 Output --> {0}", _threadOutput);
            } // lock released for thread #1 here
      }
}

/// <summary>
/// Thread 2, Displays that we are in thread 2 (locked)
/// </summary>
void DisplayThread2()
{
      while (_stopThreads == false)
      {
            // lock on the current instance of the class for thread #2
            lock (this)
            {
                  Console.WriteLine("Display Thread 2");
                  _threadOutput = "Hello Thread2";
                  Thread.Sleep(1000);  // simulate a lot of processing
                  // tell the user we are in thread #2
                  Console.WriteLine("Thread 2 Output --> {0}", _threadOutput);
            } // lock released for thread #2 here
      }
}
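The same critical-section idea can be sketched in Python (a rough analogue, not part of the original listing), with threading.Lock playing the role of C#'s lock (this):

```python
import threading

# A shared variable protected by a lock: each thread's write and
# subsequent read happen atomically inside the critical section.
_thread_output = ""
_lock = threading.Lock()
results = []

def display(name):
    global _thread_output
    for _ in range(100):
        with _lock:  # critical section, like C#'s lock (this)
            _thread_output = "Hello " + name
            # with the lock held, no other thread can overwrite it here
            results.append((name, _thread_output))

t1 = threading.Thread(target=display, args=("Thread1",))
t2 = threading.Thread(target=display, args=("Thread2",))
t1.start(); t2.start()
t1.join(); t2.join()

# every recorded output matches the thread that wrote it
print(all(out == "Hello " + name for name, out in results))  # True
```

Because the write and the read are inside one locked block, the mismatched output from the earlier example can no longer occur.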

.NET provides many mechanisms to help control threads. Another way to keep a thread blocked while another thread works on a piece of shared memory is to use an AutoResetEvent. The AutoResetEvent class has two methods, Set and WaitOne, which can be used together to control thread blocking. When an AutoResetEvent is initialized to false, the program stops at the line of code that calls WaitOne until Set is called on the AutoResetEvent. Once Set executes, the thread is unblocked and allowed to continue past WaitOne. The next time WaitOne is called, the event has automatically reset, so the program again waits (blocks) at the line where WaitOne executes. You can use this "stop and trigger" mechanism to block one thread until another thread is ready to release it by calling Set. Listing 3 shows the same two threads using AutoResetEvents to block each other while the unblocked thread executes and displays _threadOutput on the console. Initially, _blockThread1 is initialized to signal false and _blockThread2 to signal true. This means that on the first pass through the loop in DisplayThread_2, _blockThread2 will be allowed to continue past its WaitOne call, while _blockThread1 will block at the WaitOne call in DisplayThread_1. When thread 2 reaches the end of its loop, it signals _blockThread1 by calling Set to release thread 1 from its block. Thread 2 then waits at its WaitOne call until thread 1 reaches the end of its loop and calls Set on _blockThread2. The Set called in thread 1 releases the block on thread 2, and the process begins again. If we had initially set both AutoResetEvents (_blockThread1 and _blockThread2) to signal false, both threads would wait in their loops with no chance to trigger each other, and we would have a deadlock.


AutoResetEvent _blockThread1 = new AutoResetEvent(false);
AutoResetEvent _blockThread2 = new AutoResetEvent(true);

/// <summary>
/// Thread 1, Displays that we are in thread 1
/// </summary>
void DisplayThread_1()
{
      while (_stopThreads == false)
      {
            // block thread 1 while thread 2 is executing
            _blockThread1.WaitOne();
            // Set was called to free the block on thread 1, continue executing the code
            Console.WriteLine("Display Thread 1");
            _threadOutput = "Hello Thread 1";
            Thread.Sleep(1000);  // simulate a lot of processing
            // tell the user we are in thread #1
            Console.WriteLine("Thread 1 Output --> {0}", _threadOutput);
            // finished executing the code in thread 1, so unblock thread 2
            _blockThread2.Set();
      }
}

/// <summary>
/// Thread 2, Displays that we are in thread 2
/// </summary>
void DisplayThread_2()
{
      while (_stopThreads == false)
      {
            // block thread 2 while thread 1 is executing
            _blockThread2.WaitOne();
            // Set was called to free the block on thread 2, continue executing the code
            Console.WriteLine("Display Thread 2");
            _threadOutput = "Hello Thread 2";
            Thread.Sleep(1000);  // simulate a lot of processing
            // tell the user we are in thread #2
            Console.WriteLine("Thread 2 Output --> {0}", _threadOutput);
            // finished executing the code in thread 2, so unblock thread 1
            _blockThread1.Set();
      }
}
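For comparison, the same ping-pong can be sketched in Python (a rough analogue, not from the original article; Python's threading.Event does not auto-reset, so each thread clears its own event after waking to mimic AutoResetEvent):

```python
import threading

# Two threads alternate strictly, each blocking on its own event and
# signaling the other's, like the AutoResetEvent pair in Listing 3.
block_thread1 = threading.Event()          # starts unsignaled (false)
block_thread2 = threading.Event()
block_thread2.set()                        # thread 2 runs first (true)
order = []

def worker(name, my_event, other_event):
    for _ in range(3):
        my_event.wait()    # block until the other thread signals us
        my_event.clear()   # manual "auto-reset"
        order.append(name)
        other_event.set()  # unblock the other thread

t1 = threading.Thread(target=worker, args=("T1", block_thread1, block_thread2))
t2 = threading.Thread(target=worker, args=("T2", block_thread2, block_thread1))
t1.start(); t2.start()
t1.join(); t2.join()

print(order)  # strict alternation: ['T2', 'T1', 'T2', 'T1', 'T2', 'T1']
```

As in the C# listing, starting with both events unsignaled would leave each thread waiting for the other forever: a deadlock.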