Jackson Crist

A Hybrid Apache Arrow/Numpy DataFrame with Vaex Version 4.0

The Vaex DataFrame has always been very fast. Built from the ground up to be out of core (the size of your disk is the limit), it pushes the limits of what single machines can do in the context of big data analysis.
Starting from version 2, we added better support for string data, giving an almost 1000x speedup compared to Pandas at the time. To support this seemingly trivial datatype, we had to choose a disk and memory format, and we did not want to reinvent the wheel. Apache Arrow was an obvious choice but did not meet the requirements at that time. We still added string support in Vaex, but in a future-compatible way, so that when the time arrived (now!), we could adopt Apache Arrow without rendering data from the past obsolete or requiring data conversions. For compatibility with Apache Arrow, we developed the vaex-arrow package, which made interoperability with Vaex smooth, at the cost of a possible memory copy here and there.

#apache-arrow #dataframes #data-engineering #data-science #python


Semantic Similarity Framework for Knowledge Graph

Introduction

Sematch is an integrated framework for the development, evaluation, and application of semantic similarity for Knowledge Graphs (KGs). It is easy to use Sematch to compute semantic similarity scores of concepts, words, and entities. Sematch focuses on knowledge-based semantic similarity metrics that rely on structural knowledge in a taxonomy (e.g. depth, path length, least common subsumer) and statistical information content (corpus-IC and graph-IC). Knowledge-based approaches differ from their corpus-based counterparts, which rely on co-occurrence (e.g. Pointwise Mutual Information) or distributional similarity (e.g. Latent Semantic Analysis, Word2Vec, GloVe). Knowledge-based approaches are usually used for structural KGs, while corpus-based approaches are normally applied to textual corpora.

In text analysis applications, a common pipeline applies semantic similarity from the concept level up to the word and sentence levels. For example, word similarity is first computed based on similarity scores of WordNet concepts, and sentence similarity is computed by composing word similarity scores. Finally, document similarity can be computed by identifying important sentences, e.g. with TextRank.
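As a minimal sketch of this composition (the greedy max-alignment below is one simple choice, not Sematch's own definition), sentence similarity can be built from the word similarity function shown later in this README:

from sematch.semantic.similarity import WordNetSimilarity

wns = WordNetSimilarity()

def sentence_similarity(words_a, words_b):
    # For each word in the first sentence, take its best match in the
    # second sentence, then average the scores.
    scores = [max(wns.word_similarity(a, b) for b in words_b) for a in words_a]
    return sum(scores) / len(scores)

sentence_similarity(['dog', 'bark'], ['cat', 'meow'])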


KG-based applications follow a similar pipeline in using semantic similarity, from concept similarity (e.g. http://dbpedia.org/class/yago/Actor109765278) to entity similarity (e.g. http://dbpedia.org/resource/Madrid). Furthermore, in computing document similarity, entities are extracted, and document similarity is computed by composing entity similarity scores.


In KGs, concepts usually denote ontology classes, while entities refer to ontology instances. Moreover, those concepts are usually organized into hierarchical taxonomies, such as the DBpedia ontology classes, so quantifying concept similarity in a KG relies on similar semantic information (e.g. path length, depth, least common subsumer, information content) and semantic similarity metrics (e.g. Path, Wu & Palmer, Li, Resnik, Lin, Jiang & Conrath, and WPath). Consequently, Sematch provides an integrated framework to develop and evaluate semantic similarity metrics for concepts, words, entities, and their applications.


Getting started: 20 minutes to Sematch

Install Sematch

You need to install scientific computing libraries numpy and scipy first. An example of installing them with pip is shown below.

pip install numpy scipy

Depending on your OS, you can use different ways to install them. After successful installation of numpy and scipy, you can install sematch with the following commands.

pip install sematch
python -m sematch.download

Alternatively, you can clone the repository and install the development version of Sematch with setuptools. We recommend updating your pip and setuptools first.

git clone https://github.com/gsi-upm/sematch.git
cd sematch
python setup.py install

We also provide a Sematch-Demo Server. You can use it to experiment with the main functionalities, or take it as an example of using Sematch to develop applications. Please check our Documentation for more details.

Computing Word Similarity

The core module of Sematch measures semantic similarity between concepts that are represented in concept taxonomies. Word similarity is computed as the maximum semantic similarity between the words' WordNet concepts. You can use Sematch to compute multilingual word similarity based on WordNet with various semantic similarity metrics.

from sematch.semantic.similarity import WordNetSimilarity
wns = WordNetSimilarity()

# Computing English word similarity using Li method
wns.word_similarity('dog', 'cat', 'li') # 0.449327301063
# Computing Spanish word similarity using Lin method
wns.monol_word_similarity('perro', 'gato', 'spa', 'lin') #0.876800984373
# Computing Chinese word similarity using Wu & Palmer method
wns.monol_word_similarity('狗', '猫', 'cmn', 'wup') # 0.857142857143
# Computing Spanish and English word similarity using Resnik method
wns.crossl_word_similarity('perro', 'cat', 'spa', 'eng', 'res') #7.91166650904
# Computing Spanish and Chinese word similarity using Jiang & Conrath method
wns.crossl_word_similarity('perro', '猫', 'spa', 'cmn', 'jcn') #0.31023804699
# Computing Chinese and English word similarity using WPath method
wns.crossl_word_similarity('狗', 'cat', 'cmn', 'eng', 'wpath')#0.593666388463

Computing semantic similarity of YAGO concepts.

from sematch.semantic.similarity import YagoTypeSimilarity
sim = YagoTypeSimilarity()

#Measuring YAGO concept similarity through WordNet taxonomy and corpus based information content
sim.yago_similarity('http://dbpedia.org/class/yago/Dancer109989502','http://dbpedia.org/class/yago/Actor109765278', 'wpath') #0.642
sim.yago_similarity('http://dbpedia.org/class/yago/Dancer109989502','http://dbpedia.org/class/yago/Singer110599806', 'wpath') #0.544
#Measuring YAGO concept similarity based on graph-based IC
sim.yago_similarity('http://dbpedia.org/class/yago/Dancer109989502','http://dbpedia.org/class/yago/Actor109765278', 'wpath_graph') #0.423
sim.yago_similarity('http://dbpedia.org/class/yago/Dancer109989502','http://dbpedia.org/class/yago/Singer110599806', 'wpath_graph') #0.328

Computing semantic similarity of DBpedia concepts.

from sematch.semantic.graph import DBpediaDataTransform, Taxonomy
from sematch.semantic.similarity import ConceptSimilarity
concept = ConceptSimilarity(Taxonomy(DBpediaDataTransform()),'models/dbpedia_type_ic.txt')
concept.name2concept('actor')
concept.similarity('http://dbpedia.org/ontology/Actor','http://dbpedia.org/ontology/Film', 'path')
concept.similarity('http://dbpedia.org/ontology/Actor','http://dbpedia.org/ontology/Film', 'wup')
concept.similarity('http://dbpedia.org/ontology/Actor','http://dbpedia.org/ontology/Film', 'li')
concept.similarity('http://dbpedia.org/ontology/Actor','http://dbpedia.org/ontology/Film', 'res')
concept.similarity('http://dbpedia.org/ontology/Actor','http://dbpedia.org/ontology/Film', 'lin')
concept.similarity('http://dbpedia.org/ontology/Actor','http://dbpedia.org/ontology/Film', 'jcn')
concept.similarity('http://dbpedia.org/ontology/Actor','http://dbpedia.org/ontology/Film', 'wpath')

Computing semantic similarity of DBpedia entities.

from sematch.semantic.similarity import EntitySimilarity
sim = EntitySimilarity()
sim.similarity('http://dbpedia.org/resource/Madrid','http://dbpedia.org/resource/Barcelona') #0.409923677282
sim.similarity('http://dbpedia.org/resource/Apple_Inc.','http://dbpedia.org/resource/Steve_Jobs')#0.0904545454545
sim.relatedness('http://dbpedia.org/resource/Madrid','http://dbpedia.org/resource/Barcelona')#0.457984139871
sim.relatedness('http://dbpedia.org/resource/Apple_Inc.','http://dbpedia.org/resource/Steve_Jobs')#0.465991132787

Evaluate semantic similarity metrics with word similarity datasets

from sematch.evaluation import WordSimEvaluation
from sematch.semantic.similarity import WordNetSimilarity
evaluation = WordSimEvaluation()
evaluation.dataset_names()
wns = WordNetSimilarity()
# define similarity metrics
wpath = lambda x, y: wns.word_similarity_wpath(x, y, 0.8)
# evaluate similarity metrics with SimLex dataset
evaluation.evaluate_metric('wpath', wpath, 'noun_simlex')
# perform Steiger's Z significance test
evaluation.statistical_test('wpath', 'path', 'noun_simlex')
# define similarity metrics for Spanish words
wpath_es = lambda x, y: wns.monol_word_similarity(x, y, 'spa', 'path')
# define cross-lingual similarity metrics for English-Spanish
wpath_en_es = lambda x, y: wns.crossl_word_similarity(x, y, 'eng', 'spa', 'wpath')
# evaluate metrics in multilingual word similarity datasets
evaluation.evaluate_metric('wpath_es', wpath_es, 'rg65_spanish')
evaluation.evaluate_metric('wpath_en_es', wpath_en_es, 'rg65_EN-ES')

Evaluate semantic similarity metrics with category classification

Although correlation with human judgements on word pairs is the standard way to evaluate semantic similarity metrics, performance on word similarity datasets may not carry over to real applications. Therefore, apart from word similarity evaluation, the Sematch evaluation framework also includes a simple aspect category classification task. The task classifies noun concepts such as pasta, noodle, steak, and tea into their ontological parent concepts FOOD and DRINKS.

from sematch.evaluation import AspectEvaluation
from sematch.application import SimClassifier, SimSVMClassifier
from sematch.semantic.similarity import WordNetSimilarity

# create aspect classification evaluation
evaluation = AspectEvaluation()
# load the dataset
X, y = evaluation.load_dataset()
# define word similarity function
wns = WordNetSimilarity()
word_sim = lambda x, y: wns.word_similarity(x, y)
# Train and evaluate metrics with unsupervised classification model
simclassifier = SimClassifier.train(zip(X,y), word_sim)
evaluation.evaluate(X,y, simclassifier)

macro average:  (0.65319812882333839, 0.7101245049198579, 0.66317566364913016, None)
micro average:  (0.79210167952791644, 0.79210167952791644, 0.79210167952791644, None)
weighted average:  (0.80842645056024054, 0.79210167952791644, 0.79639496616636352, None)
accuracy:  0.792101679528
             precision    recall  f1-score   support

    SERVICE       0.50      0.43      0.46       519
 RESTAURANT       0.81      0.66      0.73       228
       FOOD       0.95      0.87      0.91      2256
   LOCATION       0.26      0.67      0.37        54
   AMBIENCE       0.60      0.70      0.65       597
     DRINKS       0.81      0.93      0.87       752

avg / total       0.81      0.79      0.80      4406

Matching Entities with type using SPARQL queries

You can use Sematch to download a list of entities having a specific type, using different languages. Sematch will generate SPARQL queries and execute them against the DBpedia SPARQL endpoint.

from sematch.application import Matcher
matcher = Matcher()
# matching scientist entities from DBpedia
matcher.match_type('scientist')
matcher.match_type('científico', 'spa')
matcher.match_type('科学家', 'cmn')
matcher.match_entity_type('movies with Tom Cruise')

An example of an automatically generated SPARQL query:

SELECT DISTINCT ?s, ?label, ?abstract WHERE {
    {  
    ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/NuclearPhysicist110364643> . }
 UNION {  
    ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/Econometrician110043491> . }
 UNION {  
    ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/Sociologist110620758> . }
 UNION {  
    ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/Archeologist109804806> . }
 UNION {  
    ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/Neurolinguist110354053> . } 
    ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> . 
    ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label . 
    FILTER( lang(?label) = "en") . 
    ?s <http://dbpedia.org/ontology/abstract> ?abstract . 
    FILTER( lang(?abstract) = "en") .
} LIMIT 5000

Entity feature extraction with Similarity Graph

Apart from semantic matching of entities from DBpedia, you can also use Sematch to extract features of entities and apply semantic similarity analysis using graph-based ranking algorithms. Given a list of objects (concepts, words, entities), Sematch computes their pairwise semantic similarity and generates a similarity graph, where nodes denote objects and edges carry similarity scores. Below is an example of using a similarity graph to extract important words from an entity description.

from sematch.semantic.graph import SimGraph
from sematch.semantic.similarity import WordNetSimilarity
from sematch.nlp import Extraction, word_process
from sematch.semantic.sparql import EntityFeatures
from collections import Counter
tom = EntityFeatures().features('http://dbpedia.org/resource/Tom_Cruise')
words = Extraction().extract_nouns(tom['abstract'])
words = word_process(words)
wns = WordNetSimilarity()
word_graph = SimGraph(words, wns.word_similarity)
word_scores = word_graph.page_rank()
words, scores = zip(*Counter(word_scores).most_common(10))
print words
(u'picture', u'action', u'number', u'film', u'post', u'sport', 
u'program', u'men', u'performance', u'motion')

Publications

Ganggao Zhu, and Carlos A. Iglesias. "Computing Semantic Similarity of Concepts in Knowledge Graphs." IEEE Transactions on Knowledge and Data Engineering 29.1 (2017): 72-85.

Oscar Araque, Ganggao Zhu, Manuel Garcia-Amado, and Carlos A. Iglesias. "Mining the Opinionated Web: Classification and Detection of Aspect Contexts for Aspect Based Sentiment Analysis." ICDM Sentire, 2016.

Ganggao Zhu, and Carlos A. Iglesias. "Sematch: Semantic Entity Search from Knowledge Graph." SumPre-HSWI@ESWC, 2015.


Support

You can post bug reports and feature requests in GitHub issues. Make sure to read our guidelines first. This project is still under active development, approaching its goals. The project is mainly maintained by Ganggao Zhu. You can contact him via gzhu [at] dit.upm.es


Why this name, Sematch and Logo?

The name Sematch combines the Spanish "se" and the English "match". It is also an abbreviation of "semantic matching", since semantic similarity metrics help to determine the semantic distance between concepts, words, and entities, instead of matching them exactly.

The logo of Sematch is based on the Chinese Yin and Yang, which is described in the I Ching. In a way, it correlates with 0 and 1 in computer science.

Author: Gsi-upm
Source Code: https://github.com/gsi-upm/sematch 
License: View license

#python #jupyternotebook #graph 


NumPy Applications - Uses of Numpy

In this Numpy tutorial, we will learn Numpy applications.

NumPy is a basic-level external library in Python used for complex mathematical operations. NumPy overcomes slow execution with the use of multi-dimensional array objects. It has built-in functions for manipulating arrays, and we can convert algorithms into functions that apply directly to arrays. NumPy's applications are not limited to itself: it is a very diverse library and has a wide range of applications in other sectors. NumPy can be put to use in Data Science, Data Analysis, and Machine Learning. It is also a base for other Python libraries, which use the functionality in NumPy to increase their own capabilities.


Numpy Applications

1. An alternative for lists and arrays in Python

Arrays in NumPy play the role that lists play in Python, with one key difference: NumPy arrays are homogeneous sets of elements, while Python lists may mix types. This homogeneity maintains the uniformity needed for mathematical operations, which would not be possible with heterogeneous elements. Another benefit is that a large number of functions are applicable to NumPy arrays that could not be applied to Python lists, due to the lists' heterogeneous nature.
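A minimal illustration of this homogeneity (using only core NumPy):

import numpy as np

a = np.array([1, 2.5, 3])   # the ints are upcast so every element is float64
print(a.dtype)              # float64 -- one data type for the whole array
print([1, 2.5, 'three'])    # a Python list, by contrast, happily mixes types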

2. NumPy maintains minimal memory

Arrays in NumPy are objects, and Python deletes and creates these objects continually, as per the requirements. The memory allocated is small compared to Python lists. NumPy also has features to avoid memory wastage in the data buffer: mechanisms like copies, views, and indexing help save a lot of memory. Indexing can return a view of the original array, which reuses the existing data. Arrays also have a fixed element data type, which leads to code optimization.
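A small sketch of views in action (core NumPy behaviour only):

import numpy as np

a = np.arange(10, dtype=np.int32)
v = a[2:5]          # basic slicing returns a view, not a copy
v[0] = 99           # writing through the view changes the original
print(a[2])         # 99 -- the data buffer is shared
print(a.nbytes)     # 40 -- 10 elements x 4 bytes each (int32)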

3. Using NumPy for multi-dimensional arrays

We can also create multi-dimensional arrays in NumPy. These arrays have multiple rows and columns, which is what makes them multi-dimensional, and they enable the creation of matrices. These matrices are easy to work with, and using them also makes the code memory efficient. NumPy has a matrix module to perform various operations on these matrices.

4. Mathematical operations with NumPy

Working with NumPy also means easy-to-use functions for mathematical computations on array data. NumPy has many modules for performing basic and special mathematical functions: there are functions for linear algebra, bitwise operations, Fourier transforms, arithmetic operations, string operations, etc.
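A short sample of these modules (a sketch using standard NumPy functions):

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
print(a @ a)                              # matrix multiplication
print(np.linalg.det(a))                   # linear algebra: determinant, -2.0
print(np.fft.fft([1, 0, 0, 0]))           # discrete Fourier transform
print(np.bitwise_and([12, 10], [10, 6]))  # element-wise bitwise ops, [8 2]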

#numpy tutorials #applications of numpy #numpy applications #uses of numpy #numpy

NumPy Features - Why we should use Numpy?

Welcome to DataFlair!!! In this tutorial, we will learn Numpy Features and its importance.

NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

NumPy (Numerical Python) is an open-source core Python library for scientific computations. It is a general-purpose array- and matrix-processing package. Python is slower than Fortran and other languages at performing loops; to overcome this, NumPy replaces explicit Python loops with vectorized operations that execute in compiled code.
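A quick sketch of the difference (a pure Python loop versus NumPy's vectorized sum):

import numpy as np

data = np.arange(1_000_000)

total = 0
for x in data:      # slow: one interpreted iteration per element
    total += x

print(total == data.sum())  # True -- same result, but the vectorized
                            # sum runs in compiled code and is far faster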


NumPy Features

These are the important features of NumPy:

1. High-performance N-dimensional array object

This is the most important feature of the NumPy library: the homogeneous N-dimensional array object. We perform all operations on the array elements. Arrays in NumPy can be one-dimensional or multidimensional.

a. One dimensional array

The one-dimensional array is an array consisting of a single row or column. The elements of the array are of homogeneous nature.

b. Multidimensional array

In this case, we have multiple rows and columns; the structure is similar to an Excel sheet, with each additional axis adding a dimension. The elements are homogeneous.
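A minimal example of both shapes (using core NumPy only):

import numpy as np

one_d = np.array([1, 2, 3])       # a single row of elements
two_d = np.array([[1, 2, 3],
                  [4, 5, 6]])     # rows and columns, like a sheet
print(one_d.shape)                # (3,)
print(two_d.shape)                # (2, 3) -- 2 rows, 3 columns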

2. It contains tools for integrating code from C/C++ and Fortran

We can use NumPy to work with code written in other languages, integrating functionality available across programming languages. This makes it easy to wrap fast native code and reuse it from Python.

#numpy tutorials #features of numpy #numpy features #why use numpy #numpy

Monty Boehm

Automatically Tag A Branch with The Next Semantic Version Tag

Auto-Tag


Automatically tag a branch with the next semantic version tag.

This is useful if you want to generate tags every time something is merged. Microservice and GitOps repositories are good candidates for this type of action.


How to install

~ $ pip install auto-tag

To see if it works, you can try

~ $ auto-tag  -h
usage: auto-tag [-h] [-b BRANCH] [-r REPO]
                [-u [UPSTREAM_REMOTE [UPSTREAM_REMOTE ...]]]
                [-l {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET}]
                [--name NAME] [--email EMAIL] [-c CONFIG]
                [--skip-tag-if-one-already-present] [--append-v-to-tag]
                [--tag-search-strategy {biggest-tag-in-repo,biggest-tag-in-branch,latest-tag-in-repo,latest-tag-in-branch}]

.....

How it Works

The flow is as follows (a minimal sketch of the bump step follows the list):

  • figure out the repository based on the arguments
  • load detectors from a file if specified (-c option); if none is specified, load the default ones (see Detectors)
  • check for the last tag (depending on the search strategy; see Search Strategy)
  • look at all commits made after that tag on a specific branch (or from the start of the repository if no tag is found)
  • apply the detectors (see Detectors) to each commit and save the highest change detected (PATCH, MINOR, MAJOR)
  • bump the last tag with the appropriate change and apply it using the default git author on the system or a specific one (see Git Author)
  • if an upstream was specified, push the tag to that upstream
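A minimal sketch of that bump step in Python (illustrative only, not Auto-Tag's actual code):

def bump(tag, change):
    # change is the highest level detected: 'PATCH', 'MINOR' or 'MAJOR'
    major, minor, patch = (int(part) for part in tag.split('.'))
    if change == 'MAJOR':
        return '{}.0.0'.format(major + 1)
    if change == 'MINOR':
        return '{}.{}.0'.format(major, minor + 1)
    return '{}.{}.{}'.format(major, minor, patch + 1)

print(bump('0.2.1', 'MINOR'))  # 0.3.0, as in the first example below
print(bump('0.2.1', 'MAJOR'))  # 1.0.0, as in the second example below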

Examples

Here we can see that commit 2245d5d starts with feature(, so the latest known tag (0.2.1) was bumped to 0.3.0:

~ $ git log --oneline
2245d5d (HEAD -> master) feature(component) commit #4
939322f commit #3
9ef3be6 (tag: 0.2.1) commit #2
0ee81b0 commit #1
~ $ auto-tag
2019-08-31 14:10:24,626: Start tagging <git.Repo "/Users/matei/git/test-auto-tag-branch/.git">
2019-08-31 14:10:24,649: Bumping tag 0.2.1 -> 0.3.0
2019-08-31 14:10:24,658: No push remote was specified
~ $ git log --oneline
2245d5d (HEAD -> master, tag: 0.3.0) feature(component) commit #4
939322f commit #3
9ef3be6 (tag: 0.2.1) commit #2
0ee81b0 commit #1

In this example we can see commit 2245d5deb5d97d288b7926be62d051b7eed35c98 introducing a feature that triggers a MINOR change, but we can also see 0de444695e3208b74d0b3ed7fd20fd0be4b2992e carrying a BREAKING_CHANGE that triggers a MAJOR bump; this is why the tag moved from 0.2.1 to 1.0.0:

~ $ git log
commit 0de444695e3208b74d0b3ed7fd20fd0be4b2992e (HEAD -> master)
Author: Matei-Marius Micu <micumatei@gmail.com>
Date:   Fri Aug 30 21:58:01 2019 +0300

    fix(something) ....

    BREAKING_CHANGE: this must trigger major version bump

commit 65bf4b17669ea52f84fd1dfa4e4feadbc299a80e
Author: Matei-Marius Micu <micumatei@gmail.com>
Date:   Fri Aug 30 21:57:47 2019 +0300

    fix(something) ....

commit 2245d5deb5d97d288b7926be62d051b7eed35c98
Author: Matei-Marius Micu <micumatei@gmail.com>
Date:   Fri Aug 30 19:52:10 2019 +0300

    feature(component) commit #4

commit 939322f1efaa1c07b7ed33f2923526f327975cfc
Author: Matei-Marius Micu <micumatei@gmail.com>
Date:   Fri Aug 30 19:51:24 2019 +0300

    commit #3

commit 9ef3be64c803d7d8d3b80596485eac18e80cb89d (tag: 0.2.1)
Author: Matei-Marius Micu <micumatei@gmail.com>
Date:   Fri Aug 30 19:51:18 2019 +0300

    commit #2

commit 0ee81b0bed209941720ee602f76341bcb115b87d
Author: Matei-Marius Micu <micumatei@gmail.com>
Date:   Fri Aug 30 19:50:25 2019 +0300

    commit #1
~ $ auto-tag
2019-08-31 14:10:24,626: Start tagging <git.Repo "/Users/matei/git/test-auto-tag-branch/.git">
2019-08-31 14:10:24,649: Bumping tag 0.2.1 -> 1.0.0
2019-08-31 14:10:24,658: No push remote was specified
~ $ git log
commit 0de444695e3208b74d0b3ed7fd20fd0be4b2992e (HEAD -> master, tag: 1.0.0)
Author: Matei-Marius Micu <micumatei@gmail.com>
Date:   Fri Aug 30 21:58:01 2019 +0300

    fix(something) ....

    BREAKING_CHANGE: this must trigger major version bump

commit 65bf4b17669ea52f84fd1dfa4e4feadbc299a80e
Author: Matei-Marius Micu <micumatei@gmail.com>
Date:   Fri Aug 30 21:57:47 2019 +0300

    fix(something) ....

commit 2245d5deb5d97d288b7926be62d051b7eed35c98
Author: Matei-Marius Micu <micumatei@gmail.com>
Date:   Fri Aug 30 19:52:10 2019 +0300

    feature(component) commit #4

commit 939322f1efaa1c07b7ed33f2923526f327975cfc
Author: Matei-Marius Micu <micumatei@gmail.com>
Date:   Fri Aug 30 19:51:24 2019 +0300

    commit #3

commit 9ef3be64c803d7d8d3b80596485eac18e80cb89d (tag: 0.2.1)
Author: Matei-Marius Micu <micumatei@gmail.com>
Date:   Fri Aug 30 19:51:18 2019 +0300

    commit #2

commit 0ee81b0bed209941720ee602f76341bcb115b87d
Author: Matei-Marius Micu <micumatei@gmail.com>
Date:   Fri Aug 30 19:50:25 2019 +0300

    commit #1

Detectors

If you want to control which commits enforce a specific tag bump (PATCH, MINOR, MAJOR), you can configure detectors. They are configured in a yaml file that looks like this:

detectors:

  check_for_feature_heading:
    type: CommitMessageHeadStartsWithDetector
    produce_type_change: MINOR
    params:
      pattern: 'feature'


  check_for_breaking_change:
    type: CommitMessageContainsDetector
    produce_type_change: MAJOR
    params:
      pattern: 'BREAKING_CHANGE'
      case_sensitive: false

This is also the default configuration used when no file is specified. We can see two detectors, check_for_feature_heading and check_for_breaking_change, each with a type, the change it will trigger, and detector-specific parameters. This configuration does the following:

  • if the commit message starts with feature( a MINOR change will be triggered
  • if the commit message contains BREAKING_CHANGE a MAJOR change will be triggered

The bump applied to the tag is based on the highest-priority change found.

The type and produce_type_change parameters are required; params is specific to each detector.

To pass the file to the process, just use the -c CLI parameter.
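For example, assuming the configuration above is saved as detectors.yaml (the file name here is just an illustration):

~ $ auto-tag -c detectors.yaml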

Currently we support the following detectors:

  • CommitMessageHeadStartsWithDetector
    • Parameters:
      • case_sensitive of type bool, whether the comparison is case sensitive
      • strip of type bool, whether to strip the spaces from the commit message
      • pattern of type string, the pattern searched for at the start of the commit message
  • CommitMessageContainsDetector
    • Parameters:
      • case_sensitive of type bool, whether the comparison is case sensitive
      • strip of type bool, whether to strip the spaces from the commit message
      • pattern of type string, the pattern searched for in the body of the commit message
  • CommitMessageMatchesRegexDetector
    • Parameters:
      • strip of type bool, whether to strip the spaces from the commit message
      • pattern of type string, the regex pattern matched against the commit message

The regex detector is the most powerful one.
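For example, a hypothetical configuration (the detector name and pattern below are illustrative, not shipped defaults) that triggers a MAJOR bump when the commit subject carries a ! marker:

detectors:

  check_for_bang_marker:
    type: CommitMessageMatchesRegexDetector
    produce_type_change: MAJOR
    params:
      strip: true
      pattern: '^\w+(\(\w+\))?!:'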

Git Author

When creating a tag, we need to specify a git author. If a global one is not set (or if we want to make the tag with a specific user), we have the option to specify one. The following options add a temporary config to the repository (a local config). After the tag is created, the existing config (if any was present) is restored.

  --name NAME           User name used for creating git objects.If not
                        specified the system one will be used.
  --email EMAIL         Email name used for creating git objects.If not
                        specified the system one will be used.

If another user interacts with git while this process is taking place, they will pick up the temporary config; however, we assume the tool runs in a CI pipeline where it is the only process interacting with git.

Search Strategy

To bump a tag you first need to find the last one. We have a few implementations for finding the last tag, configured with the --tag-search-strategy CLI option (a small sketch of the difference follows the list):

  • biggest-tag-in-repo consider all tags in the repository as semantic versions and pick the biggest one
  • biggest-tag-in-branch consider all tags on the specified branch as semantic versions and pick the biggest one
  • latest-tag-in-repo compare commit date for each commit that has a tag in the repository and take the latest
  • latest-tag-in-branch compare the commit date for each commit that has a tag on the specified branch and take the latest
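The "biggest" and "latest" families can disagree; a small Python sketch with hypothetical tags (not Auto-Tag's code) shows why:

# Tags as (name, commit date) pairs.
tags = [('0.10.0', '2019-08-29'), ('0.2.1', '2019-08-30')]

def semver(tag):
    return tuple(int(part) for part in tag.split('.'))

biggest = max(tags, key=lambda t: semver(t[0]))[0]  # '0.10.0' -- highest version
latest = max(tags, key=lambda t: t[1])[0]           # '0.2.1' -- most recent commit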

Download Details: 
Author: Mateimicu
Source Code: https://github.com/mateimicu/auto-tag 
License: View license

#git #github