Alisha  Larkin


Tools For Manipulating and Evaluating The HOCR Format



hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML. By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and OCR-related information co-exist in the same file and survive editing and manipulation. hOCR markup is independent of the presentation.

There is a Public Specification for the hOCR Format.
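To make the idea of embedding OCR data in HTML more concrete, here is a minimal sketch in Python 3 (the tools themselves target Python 2.7) that reads bounding boxes from the title attribute of hOCR elements. The sample fragment, the parse_title helper, and the BBoxCollector class are hypothetical names made up for this illustration; they are not part of hocr-tools.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment, invented for this illustration.
SAMPLE = (
    "<div class='ocr_page' title='bbox 0 0 800 600'>"
    "<span class='ocr_line' title='bbox 10 20 300 60'>Hello hOCR</span>"
    "</div>"
)


def parse_title(title):
    """Split an hOCR title attribute into a {property: [values]} dict."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


class BBoxCollector(HTMLParser):
    """Collect (class, bbox) pairs for every element that carries a bbox."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        props = parse_title(attrs.get("title") or "")
        if "bbox" in props:
            bbox = tuple(int(v) for v in props["bbox"])
            self.boxes.append((attrs.get("class"), bbox))


collector = BBoxCollector()
collector.feed(SAMPLE)
print(collector.boxes)
# [('ocr_page', (0, 0, 800, 600)), ('ocr_line', (10, 20, 300, 60))]
```

Because the OCR data lives in ordinary attributes, a browser renders the fragment as plain text while a parser can still recover the layout.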

About the code

Each command line program is self-contained; if you have Python 2.7 with the required packages installed, it should just work. (Unfortunately, that means some code duplication; we may revisit this issue in later revisions.)


System-wide with pip

You can install hocr-tools along with its dependencies from PyPI:

sudo pip install hocr-tools

System-wide from source

On a Debian/Ubuntu system, install the dependencies from packages:

sudo apt-get install python-lxml python-reportlab python-pil \
  python-beautifulsoup python-numpy python-scipy python-matplotlib python-setuptools

Or, to fetch dependencies from the cheese shop:

sudo pip install -r requirements.txt  # basic

Then install the dist:

sudo python install



virtualenv venv
source venv/bin/activate
pip install -r requirements.txt


source venv/bin/activate

Available Programs

Included command line programs:


hocr-check file.html

Perform consistency checks on the hOCR file.


hocr-combine file1.html [file2.html ...]

Combine the OCR pages contained in each HTML file into a single document. The document metadata is taken from the first file.


hocr-cut [-h] [-d] [file.html]

Cut a page (horizontally) into two pages in the middle such that most of the bounding boxes are separated nicely, e.g. for cutting double pages or double columns.


hocr-eval-lines [-v] true-lines.txt hocr-actual.html

Evaluate hOCR output against ASCII ground truth. This evaluation method requires that the line breaks in true-lines.txt and the ocr_line elements in hocr-actual.html agree (most ASCII output from OCR systems satisfies this requirement).


hocr-eval-geom [-e element-name] [-o overlap-threshold] hocr-truth hocr-actual

Compare the segmentations at the level of the element name (default: ocr_line). Computes undersegmentation, oversegmentation, and missegmentation.
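The exact definitions used by hocr-eval-geom are not given here, so the following is only a rough sketch of how an overlap threshold can be used to spot split and merged boxes. The function names, the (x0, y0, x1, y1) box layout, and the 0.5 threshold are all assumptions made for this example.

```python
def area(box):
    """Area of an (x0, y0, x1, y1) box; empty boxes have area 0."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def overlap_fraction(a, b):
    """Fraction of box a that is covered by box b."""
    if area(a) == 0:
        return 0.0
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(inter) / area(a)


def count_multi_covered(containers, parts, threshold=0.5):
    """Count containers that mostly cover more than one part."""
    count = 0
    for container in containers:
        inside = [p for p in parts if overlap_fraction(p, container) >= threshold]
        if len(inside) > 1:
            count += 1
    return count


# Hypothetical data: one true line that the OCR system split in two.
truth = [(0, 0, 100, 20)]
actual = [(0, 0, 50, 20), (50, 0, 100, 20)]

# A true box covering several actual boxes is treated here as oversegmentation,
# an actual box covering several true boxes as undersegmentation.
oversegmented = count_multi_covered(truth, actual)
undersegmented = count_multi_covered(actual, truth)
print(oversegmented, undersegmented)  # 1 0
```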


hocr-eval hocr-true.html hocr-actual.html

Evaluate the actual OCR with respect to the ground truth. This outputs the number of OCR errors due to incorrect segmentation and the number of OCR errors due to character recognition errors.

It works by aligning segmentation components geometrically, and for each segmentation component that can be aligned, computing the string edit distance of the text the segmentation component contains.
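The string edit distance mentioned above can be sketched as the classic Levenshtein distance. This is a generic textbook implementation in Python 3, not the code used by hocr-eval, and the sample strings are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


# Hypothetical ground truth and OCR output for one aligned line.
print(edit_distance("hOCR format", "h0CR forrnat"))  # 3
```

Summing this distance over all aligned components gives a count of character-level errors, separate from the errors attributed to components that could not be aligned.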


Extract lines from Google 1000 book sample


hocr-extract-images [-b BASENAME] [-p PATTERN] [-e ELEMENT] [-P PADDING] [file]

Extract the images and texts within all the ocr_line elements within the hOCR file. The BASENAME is the image directory, the default pattern is line-%03d.png, the default element is ocr_line and there is no extra padding by default.


hocr-lines [FILE]

Extract the text within all the ocr_line elements within the hOCR file given by FILE. If called without any file, hocr-lines reads hOCR data from stdin.


hocr-merge-dc dc.xml hocr.html > hocr-new.html

Merges the Dublin Core metadata into the hOCR file by encoding the data in its header.


hocr-pdf <imgdir> > out.pdf
hocr-pdf --savefile out.pdf <imgdir>

Create a searchable PDF from a pile of hOCR and JPEG files. The corresponding JPEG and hOCR files must have the same name, differing only in their file extension. All of these files should lie in one directory, which is passed as an argument to the command. For example, use hocr-pdf . > out.pdf to run the command in the current directory and save the output as out.pdf, or use hocr-pdf . --savefile out.pdf, which avoids routing the output through the terminal.


hocr-split file.html pattern

Split a multipage hOCR file into hOCR files containing one page each. The pattern should be something like "base-%03d.html".
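A pattern such as "base-%03d.html" is a printf-style format string in which %03d stands for a zero-padded, three-digit number. The short Python sketch below shows how such a pattern expands; the page_filenames helper is hypothetical, and starting the numbering at 1 is an assumption, not documented behaviour of hocr-split.

```python
def page_filenames(pattern, count, start=1):
    """Expand a printf-style pattern into one file name per page."""
    return [pattern % number for number in range(start, start + count)]


print(page_filenames("base-%03d.html", 3))
# ['base-001.html', 'base-002.html', 'base-003.html']
```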


hocr-wordfreq [-h] [-i] [-n MAX] [-s] [-y] [file.html]

Outputs a list of the most frequent words in an hOCR file with their number of occurrences. If called without any file, hocr-wordfreq reads hOCR data (for example from hocr-combine) from stdin.

By default, the first 10 words are shown, but any number can be requested with -n. Use -i to ignore upper and lower case and -s to split on spaces only, which leads to words that also contain punctuation. The -y option tries to dehyphenate the text (rejoining words that were split with a hyphen at a line break) before analysis.
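The effect of these options can be illustrated with a small Python 3 sketch that works on plain text. It is not the implementation of hocr-wordfreq; the function name, its parameters, and the sample text are assumptions chosen to mirror the options described above.

```python
import re
from collections import Counter


def word_frequencies(text, max_words=10, ignore_case=False,
                     spaces_only=False, dehyphenate=False):
    """Return the most frequent words as (word, count) pairs."""
    if dehyphenate:
        # Rejoin words split by a hyphen at a line break: "exam-\nple" -> "example"
        text = re.sub(r"-\s*\n\s*", "", text)
    if ignore_case:
        text = text.lower()
    if spaces_only:
        # Split on whitespace only, so punctuation stays attached to words
        words = text.split()
    else:
        words = re.findall(r"\w+", text)
    return Counter(words).most_common(max_words)


# Hypothetical sample text.
sample = "The cat saw the other cat.\nAn exam-\nple: the cat."

for word, count in word_frequencies(sample, max_words=3,
                                    ignore_case=True, dehyphenate=True):
    print(word, count)
```

With spaces_only=True the same text yields entries such as "cat." with the trailing period, which is the punctuation effect described for -s.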

Unit tests

The unit tests are written using the tsht framework.

Running the full test suite:


Running a single test

./test/tsht <path-to/unit-test.tsht>


./test/tsht test/hocr-pdf/test-hocr-pdf.tsht

Writing a test

Please see the documentation in the tsht repository and take a look at the existing unit tests.

  1. Create a new directory under ./test
  2. Copy any test assets (images, hOCR files...) to this directory
  3. Create a file <name-of-your-test>.tsht starting from this template:
#!/usr/bin/env tsht

# adjust to the number of your tests
plan 1

# write your tests here
exec_ok "hocr-foo" "-x" "foo"

# remove any temporary files
# rm some-generated-file

Author: ocropus
Source Code:
License: View license


Oyente | An Analysis Tool for Smart Contracts


An Analysis Tool for Smart Contracts

This repository is currently maintained by Xiao Liang Yu (@yxliang01). If you encounter any bugs or usage issues, please feel free to create an issue on our issue tracker.

Quick Start

A container with the required dependencies configured can be found here. The image is, however, outdated. We are working on pushing the latest image to Docker Hub for your convenience. If you experience any issue with this image, please try to build a new Docker image by pulling this codebase before opening an issue.

To open the container, install docker and run:

docker pull luongnguyen/oyente && docker run -i -t luongnguyen/oyente

To evaluate the greeter contract inside the container, run:

cd /oyente/oyente && python -s greeter.sol

and you are done!

Note - If you need the version of Oyente referred to in the paper, run the container from here

To run the web interface, execute docker run -w /oyente/web -p 3000:3000 oyente:latest ./bin/rails server

Custom Docker image build

docker build -t oyente .
docker run -it -p 3000:3000 -e "OYENTE=/oyente/oyente" oyente:latest

Open a web browser to http://localhost:3000 for the graphical interface.


Execute a python virtualenv

python -m virtualenv env
source env/bin/activate

Install Oyente via pip:

$ pip2 install oyente


The following require a Linux system to fulfill. macOS instructions are forthcoming.

solc evm

Full installation

Install the following dependencies


$ sudo add-apt-repository ppa:ethereum/ethereum
$ sudo apt-get update
$ sudo apt-get install solc

evm from go-ethereum

  1. or
  2. From the PPA, if you are using Ubuntu

z3 Theorem Prover version 4.5.0.

Download the source code of version z3-4.5.0

Install z3 using Python bindings

$ python scripts/ --python
$ cd build
$ make
$ sudo make install

Requests library

pip install requests

web3 library

pip install web3

Evaluating Ethereum Contracts

#evaluate a local solidity contract
python -s <contract filename>

#evaluate a local solidity with option -a to verify assertions in the contract
python -a -s <contract filename>

#evaluate a local evm contract
python -s <contract filename> -b

#evaluate a remote contract
python -ru

And that's it! Run python --help for a list of options.


The accompanying paper explaining the bugs detected by the tool can be found here.

Miscellaneous Utilities

A collection of the utilities that were developed for the paper are in misc_utils. Use them at your own risk - they have mostly been disposable.

  1. - Contains a number of functions to get statistics from contracts.
  2. - The get_contract_code function can be used to retrieve contract source from EtherScan
  3. - Contains functions to retrieve up-to-date transaction information for a particular contract.


Note: This is an improved version of the tool used for the paper. Benchmarks are not for direct comparison.

To run the benchmarks, it is best to use the docker container as it includes the blockchain snapshot necessary. In the container, run after activating the virtualenv. Results are in results.json once the benchmark completes.

The benchmarks take a long time and a lot of RAM in any but the largest of clusters, beware.

Some analytics, such as the number of contracts tested and the number of contracts analysed, are collected when running this benchmark.


Check out our contribution guide and the code structure here.

$ sudo apt-get install software-properties-common
$ sudo add-apt-repository -y ppa:ethereum/ethereum
$ sudo apt-get update
$ sudo apt-get install ethereum

Download Details:
Author: enzymefinance
Source Code:
License: GPL-3.0 license


50+ Useful DevOps Tools


What Is DevOps?

The DevOps methodology, a software and team management approach defined by the portmanteau of Development and Operations, was first coined in 2009 and has since become a buzzword concept in the IT field.

DevOps has come to mean many things to each individual who uses the term as DevOps is not a singularly defined standard, software, or process but more of a culture. Gartner defines DevOps as:

“DevOps represents a change in IT culture, focusing on rapid IT service delivery through the adoption of agile, lean practices in the context of a system-oriented approach. DevOps emphasizes people (and culture), and seeks to improve collaboration between operations and development teams. DevOps implementations utilize technology — especially automation tools that can leverage an increasingly programmable and dynamic infrastructure from a life cycle perspective.”

As you can see from the above definition, DevOps is a multi-faceted approach to the Software Development Life Cycle (SDLC), but its main underlying strength is how it leverages technology and software to streamline this process. So with the right approach to DevOps, notably adopting its philosophies of co-operation and implementing the right tools, your business can increase deployment frequency by a factor of 30 and shorten lead times by a factor of 8000 compared with traditional methods, according to a CapGemini survey.

The Right Tools for the Job

This list is designed to be as comprehensive as possible. The article comprises both very well established tools for those who are new to the DevOps methodology and those tools that are more recent releases to the market — either way, there is bound to be a tool on here that can be an asset for you and your business. For those who already live and breathe DevOps, we hope you find something that will assist you in your growing enterprise.

With such a litany of tools to choose from, there is no “right” answer to what tools you should adopt. No single tool will cover all your needs, and whatever you pick will be deployed across a variety of development and operations teams, so let’s break down what you need to consider before choosing a tool.

  • Plan and collaborate: Before you even begin the SDLC, your business needs to have a cohesive idea of what tools it will need to implement across your teams. There are even DevOps tools that can assist you with this first crucial step.
  • Build: Here you want tools that create identically provisioned environments. The last thing you need is to hear “But it works for me on my computer”.
  • Automation: This has quickly become a given in DevOps, but automation will always drastically increase production over manual methods.
  • Continuous Integration: Tools need to provide constant and immediate feedback, several times a day. Not all integrations are implemented equally, though, so ask whether the tool you select is right for the job.
  • Deployment: Deployments need to be kept predictable, smooth, and reliable, with minimal risk. Automation will also play a big part in this process.

With all that in mind, I hope this selection of tools will aid you as your business continues to expand into the DevOps lifestyle.

Tools Categories List:

Infrastructure As Code

Continuous Integration and Delivery

Development Automation

Usability Testing

Database and Big Data




Helpful CLI Tools



Infrastructure As Code


1. AWS CloudFormation

AWS CloudFormation is an absolute must if you are currently working, or planning to work, in the AWS Cloud. CloudFormation allows you to model your AWS infrastructure and provision all your AWS resources swiftly and easily. All of this is done within a JSON or YAML template file and the service comes with a variety of automation features ensuring your deployments will be predictable, reliable, and manageable.


2. Azure Resource Manager

Azure Resource Manager (ARM) is Microsoft’s answer to an all-encompassing IAC tool. With its ARM templates, described within JSON files, Azure Resource Manager will provision your infrastructure, handle dependencies, and declare multiple resources via a single template.


3. Google Cloud Deployment Manager

Much like the tools mentioned above, Google Cloud Deployment Manager is Google’s IaC tool for the Google Cloud Platform. This tool uses YAML for its config files and Jinja2 or Python for its templates. Some of its notable features are synchronistic deployment and ‘preview’, which gives you an overview of changes before they are committed.


4. Terraform

Terraform is brought to you by HashiCorp, the makers of Vault and Nomad. Terraform is vastly different from the above-mentioned tools in that it is not restricted to a specific cloud environment. This makes it well suited to complex distributed applications, since you are not tied to a single platform. And much like Google Cloud Deployment Manager, Terraform also has a preview feature.



5. Chef

Chef is an ideal choice for those who favor CI/CD. At its heart, Chef utilizes self-described recipes, templates, and cookbooks; a collection of ready-made templates. Cookbooks allow for consistent configuration even as your infrastructure rapidly scales. All of this is wrapped up in a beautiful Ruby-based DSL pie.




August  Larson


String Format() Function in Python

To control and handle complex string formatting more efficiently

What is formatting, why is it used?

In Python, there are several ways to present output. String formatting is one such method: it lets the user control and handle complex string formatting more efficiently than simply printing space-separated values. There are many types of string formatting, such as padding and alignment, using dictionaries, etc. Formatting techniques are not limited to strings; they also apply to dates, numbers, signed digits, etc.

Structure of format() method

Let us look at the basic structure of how to write in string format method.

Syntax: 'String {} value'.format(value)

Let us look at an example:
'Welcome to the {} world.'.format("python")

Here, we have defined a string ('') with a placeholder ({}) and passed "python" as the argument. On executing the program, the value is substituted for the placeholder, showing the output as: Welcome to the python world.
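The snippet below is a small runnable sketch of the same call, together with a few of the formatting types mentioned earlier (padding and alignment, dictionaries, numbers, and dates). The variable names and sample values are made up for illustration.

```python
import datetime

# Basic placeholder substitution
greeting = "Welcome to the {} world.".format("python")
print(greeting)  # Welcome to the python world.

# Padding and alignment within a fixed width
left = "{:<10}|".format("left")    # pad on the right
right = "{:>10}|".format("right")  # pad on the left
print(left)
print(right)

# Filling named placeholders from a dictionary
person = {"name": "Ada", "age": 36}
intro = "{name} is {age} years old.".format(**person)
print(intro)  # Ada is 36 years old.

# Numbers: fixed decimals and an explicit sign
pi_text = "{:.2f}".format(3.14159)
signed = "{:+d}".format(42)
print(pi_text, signed)  # 3.14 +42

# Dates accept strftime-style codes inside the placeholder
day = "{:%Y-%m-%d}".format(datetime.date(2020, 1, 31))
print(day)  # 2020-01-31
```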


Sunny  Kunde


Top 12 Most Used Tools By Developers In 2020

Frameworks and libraries are the fundamental building blocks developers use when they build software or applications. These tools help eliminate repetitive tasks and reduce the amount of code that developers need to write for a particular piece of software.

Recently, the Stack Overflow Developer Survey 2020 surveyed nearly 65,000 developers, who voted for their go-to tools and libraries. Here, we list the top 12 frameworks and libraries from the survey that are most used by developers around the globe in 2020.

(The libraries are listed according to their number of Stars in GitHub)

1| TensorFlow

GitHub Stars: 147k

Rank: 5

About: Originally developed by researchers on the Google Brain team, TensorFlow is an end-to-end open-source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state of the art in ML research. It allows developers to easily build and deploy ML-powered applications.

Know more here.

2| Flutter

GitHub Stars: 98.3k

Rank: 9

About: Created by Google, Flutter is a free and open-source software development kit (SDK) which enables fast user experiences for mobile, web and desktop from a single codebase. The SDK works with existing code and is used by developers and organisations around the world.


Amara  Legros


8 Fun AI Tools Available Online

“AI for fun” is a phrase that we don’t commonly hear in the industry. Artificial intelligence has always been considered a revolutionary technology that has emerged to solve complex real-world problems like high-level computation, omitting manual labour, or data-driven optimisation. However, with its endless possibilities, there are many applications of AI that make this technology more accessible to the average layperson or kids at home.

To help people get their heads around this sophisticated technology, developers all around the world are continuously building fun AI tools that can be easily accessed online for hands-on experience. Not only are these AI tools fun, but they also give users a good understanding of the technology.

Here is a list of 10 exciting artificial intelligence tools that are available online for anyone to have fun with.
