Complete, flexible, extensible, and easy-to-use page transition library for your server-side rendered website.
Swup is a library that helps you add page transitions to server-side rendered websites. It handles the complete lifecycle of a page visit by intercepting link clicks, loading the new page in the background, replacing the content and transitioning between the old and the new page.
Its goal is to make adding transitions to a site as simple as possible, while providing lots of other quality-of-life improvements.
Take a look at Sites using swup for more examples.
If you're having trouble implementing swup, check out the Common Issues section of the docs, look at closed issues or create a new discussion.
We're looking for maintainers! 👀
Become a sponsor on Open Collective or support development through GitHub sponsors.
Author: swup
Source Code: https://github.com/swup/swup
License: MIT license
Divi is a powerful and easy-to-use WordPress page builder that helps you create beautiful and engaging landing pages with minimal effort. It comes with a variety of features and functionalities to help you create stunning designs that will capture the attention of your target audience and help you promote your business.
Divi makes it easy to customize your landing pages with its drag-and-drop functionality, allowing you to create pages quickly and without the need for coding. With Divi, you can also add cool elements like forms, images, and videos to create an attractive landing page.
In this article, we will discuss how to use Divi to create eye-catching landing pages that will help you market your business.
With Divi, you can create a beautiful landing page quickly and easily. First, you will need to purchase the Divi theme and install it on your website.
Once it is ready, there are two ways you can go: either you can build everything from scratch, or you can select from pre-designed templates. That’s what we are going to do in this example.
The first step is to choose a template that best suits your needs and the sector you are targeting. Divi offers a variety of layouts, some of which are specifically tailored for creating a landing page.
When selecting a template, you will want to consider factors such as the overall design, the number of sections, and the elements included in the layout. You can search for “Landing Page” in the Divi Layout Library, and you will find a variety of layouts to choose from.
Once you’ve selected a template, you can begin customizing your landing page. Divi offers an intuitive drag-and-drop interface, which makes it easy to add and rearrange sections, text, images, and other elements. You can also customize the fonts, colors, and other styling elements of the page to create a unique and professional look.
Divi Leads is an advanced lead generation and conversion optimization suite developed by Elegant Themes, the creators of the popular Divi WordPress theme. It includes features such as A/B testing, personalized CTAs, and lead generation forms to help you create effective landing pages that drive conversions.
With Divi Leads, you can easily create targeted campaigns to capture leads from multiple sources and customize the user experience for each visitor. With its intuitive drag and drop page builder, you can quickly build a high-converting landing page without any coding knowledge.
Plus, its powerful analytics dashboard allows you to track and monitor the performance of your campaigns in real time. Divi Leads is the perfect tool to help you build your landing pages quickly and efficiently, so don’t forget to add it to your landing page.
After you have the design in place, it’s time to insert your content. This is where you can add graphics, sales copy, and a good call-to-action to encourage visitors to take the desired action.
Divi offers a wide range of modules that you can use to insert text, images, buttons, sliders, and more. You can also use the visual builder to edit the content and customize it to match the look and feel of your landing page.
Lastly, it’s important to ensure that your landing page is optimized for search engines and mobile devices. Divi provides a “Mobile Optimized” option, which displays the page in a mobile-friendly format. Additionally, you can add the necessary meta tags, keywords, and descriptions to ensure that it is properly indexed by search engines.
Once you have finished creating and optimizing your landing page, you can publish it and start driving traffic to it. It’s important to integrate an analytics tool, such as Google Analytics, to gather information about your landing page visitors and adjust your marketing actions according to your audience.
Divi is a powerful website builder that can be used to create amazing landing pages. It is easy to use and it offers a wide range of customization options to make your landing page stand out. Divi includes many tools and features that can be used to optimize your landing page design. Here are a few tips for optimizing your landing page with Divi.
1. Use Visual Cues: Visual cues are a great way to draw the visitor’s attention to the most important parts of your landing page. You can use arrows, lines, and other visual elements to point the visitor in the right direction and to guide them toward the desired action.
2. Utilize White Space: White space is an important design element as it helps to keep the page from looking cluttered and overwhelming. Utilizing white space on your landing page will make it easier for visitors to navigate and find the information they’re looking for.
3. Add Animation Effects: Animation effects can help to add life to your landing page. You can use animation effects to draw attention to certain elements or to add a bit of playfulness to the design.
4. Create Compelling Headlines: Your headlines should be attention-grabbing and should clearly communicate the main message of your landing page. Make sure your headlines are clear and concise and that they grab the visitor’s attention.
5. Incorporate Quality Images: Images can help to draw attention to certain elements and can add a visual appeal to your page. Make sure to only use high-quality images and be sure to optimize them for the web.
By following these tips, you can create an effective landing page with Divi. With its powerful tools and features, Divi makes it easy to customize and optimize the design of your landing page.
When creating a landing page with Divi, there are a few SEO tips to keep in mind to make sure your page is optimized and ranked well in search engine results.
1. Use keywords in your page: This means including relevant keywords throughout the content, titles, and headings of your page. This helps search engines understand what your page is about and can help improve your ranking.
2. Optimize your meta tags: This includes adding titles and descriptions to the page, which appear in search engine results. In addition to helping search engines better understand your page, the meta tags also give readers a better idea of what to expect when they click on the link.
3. Include internal links: This means linking to other pages within your website. This helps search engines understand the context of your page and can help improve the overall ranking of your entire website.
4. Use alt tags for images: Alt tags provide a text description for images, which helps search engines understand what the image is about and can help improve your page’s ranking.
5. Make sure your page is optimized for mobile: With more and more people using their phones to search the web, it’s essential to make sure your page is optimized for mobile. Divi provides tools and features that make it easy to optimize your page for mobile devices.
By following these SEO tips, you can create a great landing page with Divi that is optimized for search engine results and ensures that your page is ranked well.
In conclusion, creating a landing page with Divi is a great way to make a professional-looking website quickly and easily. With its drag-and-drop builder, you can create pages with a wide variety of layouts and content, and you have access to a library of pre-made layouts, modules, and elements that make designing a page a breeze.
With Divi, you can create a beautiful landing page that will impress your visitors and convert them.
You can buy Divi on the Elegant Themes website.
Original article source at: https://www.blog.duomly.com/
In this article, we will see how to create a custom error page in Laravel 9. Here we will create a custom 404 error page in Laravel 7, Laravel 8, and Laravel 9. By default, Laravel provides a simple design for the 404 error page, but you can create a custom error page that matches your theme.
Some exceptions describe HTTP error codes from the server. For example, this may be a "page not found" error (404), an "unauthorized" error (401), or even a developer-generated 500 error.
Laravel makes it easy to display custom error pages for various HTTP status codes. For example, if you wish to customize the error page for 404 HTTP status codes, create a resources/views/errors/404.blade.php view template.
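Laravel resolves these views by status code, so the same convention extends to other errors. As a hedged sketch (the file naming follows Laravel's documented convention; the markup itself is illustrative), you may also define a generic fallback page such as 4xx.blade.php that covers any client error without its own view:

```html
{{-- resources/views/errors/4xx.blade.php: a minimal fallback sketch;
     the $exception variable is provided by Laravel --}}
<h1>{{ $exception->getStatusCode() }}</h1>
<p>{{ $exception->getMessage() }}</p>
```

A 5xx.blade.php fallback works the same way for server errors.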
So, let's see the laravel 9 custom error page, laravel 404 page template, laravel custom 404 error page, and custom error page in laravel 8.
Step 1: Install Laravel 9
In this step, we will install laravel 9 using the following command.
composer create-project laravel/laravel laravel9-error-page-example
Step 2: Publish Default Error Page
Now, we will publish Laravel's default error page templates using the vendor:publish Artisan command.
php artisan vendor:publish --tag=laravel-errors
Step 3: Create 404 Error Page Design
In this step, we will create the 404 page design. Add the following HTML code to this file:
resources/views/errors/404.blade.php
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>How To Create Custom Error Page In Laravel 9 - Websolutionstuff</title>
<style>
@import url('https://fonts.googleapis.com/css?family=Dosis:300,400,500');
@-moz-keyframes rocket-movement {
100% {
-moz-transform: translate(1200px, -600px);
}
}
@-webkit-keyframes rocket-movement {
100% {
-webkit-transform: translate(1200px, -600px);
}
}
@keyframes rocket-movement {
100% {
transform: translate(1200px, -600px);
}
}
@-moz-keyframes spin-earth {
100% {
-moz-transform: rotate(-360deg);
transition: transform 20s;
}
}
@-webkit-keyframes spin-earth {
100% {
-webkit-transform: rotate(-360deg);
transition: transform 20s;
}
}
@keyframes spin-earth {
100% {
-webkit-transform: rotate(-360deg);
transform: rotate(-360deg);
transition: transform 20s;
}
}
@-moz-keyframes move-astronaut {
100% {
-moz-transform: translate(-160px, -160px);
}
}
@-webkit-keyframes move-astronaut {
100% {
-webkit-transform: translate(-160px, -160px);
}
}
@keyframes move-astronaut {
100% {
-webkit-transform: translate(-160px, -160px);
transform: translate(-160px, -160px);
}
}
@-moz-keyframes rotate-astronaut {
100% {
-moz-transform: rotate(-720deg);
}
}
@-webkit-keyframes rotate-astronaut {
100% {
-webkit-transform: rotate(-720deg);
}
}
@keyframes rotate-astronaut {
100% {
-webkit-transform: rotate(-720deg);
transform: rotate(-720deg);
}
}
@-moz-keyframes glow-star {
40% {
-moz-opacity: 0.3;
}
90%,
100% {
-moz-opacity: 1;
-moz-transform: scale(1.2);
}
}
@-webkit-keyframes glow-star {
40% {
-webkit-opacity: 0.3;
}
90%,
100% {
-webkit-opacity: 1;
-webkit-transform: scale(1.2);
}
}
@keyframes glow-star {
40% {
-webkit-opacity: 0.3;
opacity: 0.3;
}
90%,
100% {
-webkit-opacity: 1;
opacity: 1;
-webkit-transform: scale(1.2);
transform: scale(1.2);
border-radius: 999999px;
}
}
.spin-earth-on-hover {
transition: ease 200s !important;
transform: rotate(-3600deg) !important;
}
html,
body {
margin: 0;
width: 100%;
height: 100%;
font-family: 'Dosis', sans-serif;
font-weight: 300;
-webkit-user-select: none;
/* Safari 3.1+ */
-moz-user-select: none;
/* Firefox 2+ */
-ms-user-select: none;
/* IE 10+ */
user-select: none;
/* Standard syntax */
}
.bg-purple {
background: url(http://salehriaz.com/404Page/img/bg_purple.png);
background-repeat: repeat-x;
background-size: cover;
background-position: left top;
height: 100%;
overflow: hidden;
}
.custom-navbar {
padding-top: 15px;
}
.brand-logo {
margin-left: 25px;
margin-top: 5px;
display: inline-block;
}
.navbar-links {
display: inline-block;
float: right;
margin-right: 15px;
text-transform: uppercase;
}
ul {
list-style-type: none;
margin: 0;
padding: 0;
/* overflow: hidden;*/
display: flex;
align-items: center;
}
li {
float: left;
padding: 0px 15px;
}
li a {
display: block;
color: white;
text-align: center;
text-decoration: none;
letter-spacing: 2px;
font-size: 12px;
-webkit-transition: all 0.3s ease-in;
-moz-transition: all 0.3s ease-in;
-ms-transition: all 0.3s ease-in;
-o-transition: all 0.3s ease-in;
transition: all 0.3s ease-in;
}
li a:hover {
color: #ffcb39;
}
.btn-request {
padding: 10px 25px;
border: 1px solid #FFCB39;
border-radius: 100px;
font-weight: 400;
}
.btn-request:hover {
background-color: #FFCB39;
color: #fff;
transform: scale(1.05);
box-shadow: 0px 20px 20px rgba(0, 0, 0, 0.1);
}
.btn-go-home {
position: relative;
z-index: 200;
margin: 15px auto;
width: 100px;
padding: 10px 15px;
border: 1px solid #FFCB39;
border-radius: 100px;
font-weight: 400;
display: block;
color: white;
text-align: center;
text-decoration: none;
letter-spacing: 2px;
font-size: 11px;
-webkit-transition: all 0.3s ease-in;
-moz-transition: all 0.3s ease-in;
-ms-transition: all 0.3s ease-in;
-o-transition: all 0.3s ease-in;
transition: all 0.3s ease-in;
}
.btn-go-home:hover {
background-color: #FFCB39;
color: #fff;
transform: scale(1.05);
box-shadow: 0px 20px 20px rgba(0, 0, 0, 0.1);
}
.central-body {
/* width: 100%;*/
padding: 17% 5% 10% 5%;
text-align: center;
}
.objects img {
z-index: 90;
pointer-events: none;
}
.object_rocket {
z-index: 95;
position: absolute;
transform: translateX(-50px);
top: 75%;
pointer-events: none;
animation: rocket-movement 200s linear infinite both running;
}
.object_earth {
position: absolute;
top: 20%;
left: 15%;
z-index: 90;
/* animation: spin-earth 100s infinite linear both;*/
}
.object_moon {
position: absolute;
top: 12%;
left: 25%;
/*
transform: rotate(0deg);
transition: transform ease-in 99999999999s;
*/
}
.earth-moon {}
.object_astronaut {
animation: rotate-astronaut 200s infinite linear both alternate;
}
.box_astronaut {
z-index: 110 !important;
position: absolute;
top: 60%;
right: 20%;
will-change: transform;
animation: move-astronaut 50s infinite linear both alternate;
}
.image-404 {
position: relative;
z-index: 100;
pointer-events: none;
}
.stars {
background: url(http://salehriaz.com/404Page/img/overlay_stars.svg);
background-repeat: repeat;
background-size: contain;
background-position: left top;
}
.glowing_stars .star {
position: absolute;
border-radius: 100%;
background-color: #fff;
width: 3px;
height: 3px;
opacity: 0.3;
will-change: opacity;
}
.glowing_stars .star:nth-child(1) {
top: 80%;
left: 25%;
animation: glow-star 2s infinite ease-in-out alternate 1s;
}
.glowing_stars .star:nth-child(2) {
top: 20%;
left: 40%;
animation: glow-star 2s infinite ease-in-out alternate 3s;
}
.glowing_stars .star:nth-child(3) {
top: 25%;
left: 25%;
animation: glow-star 2s infinite ease-in-out alternate 5s;
}
.glowing_stars .star:nth-child(4) {
top: 75%;
left: 80%;
animation: glow-star 2s infinite ease-in-out alternate 7s;
}
.glowing_stars .star:nth-child(5) {
top: 90%;
left: 50%;
animation: glow-star 2s infinite ease-in-out alternate 9s;
}
@media only screen and (max-width: 600px) {
.navbar-links {
display: none;
}
.custom-navbar {
text-align: center;
}
.brand-logo img {
width: 120px;
}
.box_astronaut {
top: 70%;
}
.central-body {
padding-top: 25%;
}
}
.error_text{
font-size: 32px;
color: white;
}
</style>
</head>
<body class="bg-purple">
<div class="stars">
<div class="custom-navbar">
<div class="navbar-links">
<ul>
<li><a href="#" target="_blank">Home</a></li>
<li><a href="#" target="_blank">About</a></li>
<li><a href="#" target="_blank">Features</a></li>
<li><a href="#" class="btn-request" target="_blank">Request A Demo</a></li>
</ul>
</div>
</div>
<div class="central-body">
<img class="image-404" src="http://salehriaz.com/404Page/img/404.svg" width="300px">
<p class="error_text">Page Not Found - Websolutionstuff</p>
<a href="#" class="btn-go-home" target="_blank">GO BACK HOME</a>
</div>
<div class="objects">
<img class="object_rocket" src="http://salehriaz.com/404Page/img/rocket.svg" width="40px">
<div class="earth-moon">
<img class="object_earth" src="http://salehriaz.com/404Page/img/earth.svg" width="100px">
<img class="object_moon" src="http://salehriaz.com/404Page/img/moon.svg" width="80px">
</div>
<div class="box_astronaut">
<img class="object_astronaut" src="http://salehriaz.com/404Page/img/astronaut.svg" width="140px">
</div>
</div>
<div class="glowing_stars">
<div class="star"></div>
<div class="star"></div>
<div class="star"></div>
<div class="star"></div>
<div class="star"></div>
</div>
</div>
</body>
</html>
Step 4: Run Laravel 9 Application
Now, we will run the Laravel 9 application using the following command.
php artisan serve
Original article source at: https://websolutionstuff.com/
A login page is a basic requirement for membership-based websites.
A user logs in to the website with a username (or email) and password to access it. If the user has not logged out but the SESSION is destroyed, they need to log in to the website again.
By adding remember me to the login form, the user checks the remember-me checkbox when logging in. Now, if the user has not logged out but the SESSION is destroyed, they do not need to log in again on the next visit: the SESSION is re-initiated automatically.
In this tutorial, I show how you create a login page with remember me functionality with PDO and PHP.
I am using a users table in the example:
CREATE TABLE `users` (
  `id` int(11) NOT NULL PRIMARY KEY AUTO_INCREMENT,
  `username` varchar(80) NOT NULL,
  `name` varchar(80) NOT NULL,
  `password` varchar(80) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Create a config.php file for the database connection.
Completed Code
<?php
// Start the session here so $_SESSION is available on every page that includes this file
session_start();

$server = "localhost";
$username = "root";
$password = "";
$dbname = "tutorial";

// Create connection
try{
  $conn = new PDO("mysql:host=$server;dbname=$dbname", $username, $password);
  $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
}catch(PDOException $e){
  die('Unable to connect with the database');
}
Create an index.php file.
HTML
Create a <form>. Inside it, create a text element for the username, a password element, a checkbox for remember me, and a submit button.
<!doctype html>
<html>
<head>
<title>Login page with Remember me using PDO and PHP</title>
<link href="style.css" rel="stylesheet" type="text/css">
</head>
<body>
<div class="container">
<form method="post" action="">
<div id="div_login">
<h1>Login</h1>
<div>
<input type="text" class="textbox" name="txt_uname" value="" placeholder="Username" />
</div>
<div>
<input type="password" class="textbox" name="txt_pwd" value="" placeholder="Password"/>
</div>
<div>
<input type="checkbox" name="rememberme" value="1" /> Remember Me
</div>
<div>
<input type="submit" value="Submit" name="but_submit" id="but_submit" />
</div>
</div>
</form>
</div>
</body>
</html>
PHP
Created two functions to encrypt and decrypt the user id. I am using OpenSSL with the 'aes-256-cbc' cipher (other methods are listed in the PHP OpenSSL documentation).

encryptCookie() – Generate a random $key and $iv, then encrypt the value with openssl_encrypt($value, $cipher, $key, 0, $iv). Append $iv and $key to the $ciphertext, separated by '::', encode the result in Base64 format, and return it.

decryptCookie() – Base64-decode the $ciphertext, explode it by '::', and assign the parts to variables. Pass the values to the openssl_decrypt() function and return the result.
Login <form> submit and set remember me COOKIE –
If but_submit is POST, read the username and password. If they are not empty, check whether that username and password combination exists in the users table.
If it exists, read the user id.
If 'rememberme' is POST, set the remember-me COOKIE: encrypt the user id by calling the encryptCookie() function and set $_COOKIE['rememberme'] for 30 days.
Initialize $_SESSION['userid'] with $userid and redirect to home.php.

Check remember me COOKIE –
Check whether $_SESSION['userid'] is set. If it is, redirect to home.php; otherwise, check whether $_COOKIE['rememberme'] is set.
If it is set, decrypt $_COOKIE['rememberme'] by passing it to the decryptCookie() function and get the user id. Check whether that user id exists. If it does, set $_SESSION['userid'] and redirect to home.php.
<?php
include "config.php";

// Encrypt cookie
function encryptCookie( $value ) {
  $cipher = "aes-256-cbc";
  $key = openssl_random_pseudo_bytes(32); // AES-256 needs a 32-byte key
  $ivlen = openssl_cipher_iv_length($cipher);
  $iv = openssl_random_pseudo_bytes($ivlen);
  $ciphertext = openssl_encrypt($value, $cipher, $key, 0, $iv);
  // The random $iv and $key are stored inside the cookie itself, separated by '::'
  return( base64_encode($ciphertext . '::' . $iv . '::' . $key) );
}

// Decrypt cookie
function decryptCookie( $ciphertext ) {
  $cipher = "aes-256-cbc";
  list($encrypted_data, $iv, $key) = explode('::', base64_decode($ciphertext));
  return openssl_decrypt($encrypted_data, $cipher, $key, 0, $iv);
}

// Check if $_SESSION or $_COOKIE already set
if( isset($_SESSION['userid']) ){
  header('Location: home.php');
  exit;
}else if( isset($_COOKIE['rememberme']) ){
  // Decrypt cookie variable value
  $userid = decryptCookie($_COOKIE['rememberme']);

  // Fetch records
  $stmt = $conn->prepare("SELECT count(*) as cntUser FROM users WHERE id=:id");
  $stmt->bindValue(':id', (int)$userid, PDO::PARAM_INT);
  $stmt->execute();
  $count = $stmt->fetchColumn();

  if( $count > 0 ){
    $_SESSION['userid'] = $userid;
    header('Location: home.php');
    exit;
  }
}

// On submit
if(isset($_POST['but_submit'])){
  $username = $_POST['txt_uname'];
  $password = $_POST['txt_pwd'];

  if ($username != "" && $password != ""){
    // Fetch records (note: in production, store hashed passwords and check them with password_verify())
    $stmt = $conn->prepare("SELECT count(*) as cntUser,id FROM users WHERE username=:username and password=:password");
    $stmt->bindValue(':username', $username, PDO::PARAM_STR);
    $stmt->bindValue(':password', $password, PDO::PARAM_STR);
    $stmt->execute();
    $record = $stmt->fetch();
    $count = $record['cntUser'];

    if($count > 0){
      $userid = $record['id'];

      if( isset($_POST['rememberme']) ){
        // Set cookie for 30 days (setcookie() expects an expiry in seconds, not milliseconds)
        $days = 30;
        $value = encryptCookie($userid);
        setcookie("rememberme", $value, time() + ($days * 24 * 60 * 60));
      }

      $_SESSION['userid'] = $userid;
      header('Location: home.php');
      exit;
    }else{
      echo "Invalid username and password";
    }
  }
}
?>
Create a home.php file.
Check whether $_SESSION['userid'] is set. If not, redirect to the index.php file.
On the page, create a <form> with a submit button for logout.
On logout button click, destroy the SESSION and remove the 'rememberme' COOKIE by setting its expiry time in the past.
Redirect to the index.php page.
Completed Code
<?php
include "config.php";
?>
<!doctype html>
<html>
<head>
<title>Login page with Remember me using PDO and PHP</title>
</head>
<body>
<?php
// Check user login or not
if(!isset($_SESSION['userid'])){
header('Location: index.php');
}
// logout
if(isset($_POST['but_logout'])){
session_destroy();
// Remove cookie variables
$days = 30;
setcookie ("rememberme", "", time() - ($days * 24 * 60 * 60) );
header('Location: index.php');
}
?>
<h1>Homepage</h1>
<form method='post' action="">
<input type="submit" value="Logout" name="but_logout">
</form>
</body>
</html>
To check whether remember me is working, comment out the setcookie() call in the logout handler of the home.php file, or reduce the expiration time of the COOKIE.
If it is working, the SESSION is created when you run the index.php file and the page is redirected to home.php.
You can view the MySQLi version of this tutorial here.
You can also view the Registration form creation with MySQLi and PHP tutorial here.
If you found this tutorial helpful then don't forget to share.
Original article source at: https://makitweb.com/
Page speed is an important ranking attribute for search engines, making performance optimization a prerequisite for successful sites. Here’s how Google PageSpeed Insights can help identify and rectify performance issues.
If you’re a business owner, you’re interested in getting better search rankings for your website. If you’re a developer, you’ll need to cater to the client’s needs and create a site capable of ranking well. Google considers hundreds of characteristics when it determines the order of websites on its search engine results page (SERP).
Page speed was officially cited as an important ranking attribute in mid-2018. In this article, we will explain performance scores that business owners should pay attention to: PageSpeed Insights. We will be going deeper into some technical details that will help software developers make improvements in complicated cases, like those related to single-page applications.
When Google introduced PageSpeed Tools in 2010, most website owners became acquainted with it. Those who haven’t should open PageSpeed Insights to check their sites.
The service provides details on how a website performs both on desktop and mobile browsers. It’s easy to miss the fact that you can switch between them using the Mobile and Desktop tabs at the top of the analysis:
Because mobile devices are compact and aim to preserve battery life, their web browsers tend to exhibit lower performance than devices running desktop operating systems, so expect the desktop score to be higher.
Big tech companies won’t score in the red in any area, but smaller sites running on tighter budgets may. Business owners can also run PageSpeed Insights on their competitors’ sites and compare the results with their own to see if they need to invest in improving performance.
PageSpeed uses metrics from Core Web Vitals to provide a pass/fail assessment.
This tool has three scores:
PageSpeed prominently displays the Performance score in a colored circle in the Diagnose Performance Issues section. It’s calculated using PageSpeed’s built-in virtual machines, with characteristics matching the average mobile or desktop device. Bear in mind that this value reflects page loading as measured on PageSpeed’s virtual machine only, and is not itself considered by the Google search engine.
This figure is useful when developers implement changes to a website, as it allows them to check the effect of the changes on performance. However, Google’s search engine considers only the detailed scores.
Detailed scores for a specific page—and for those that PageSpeed considers similar to the page analyzed—are calculated from statistics that Chrome browsers collect on real computers and send to Google. This means performance on Firefox, Safari, and other non-Chromium browsers is not taken into account.
The summary for all pages of the website is obtained the same way as the single-page score. To access it, select the Origin tab instead of the This URL tab. The URL listed under the tabs bar will be different, as Origin will display the main page of the site (domain only).
Google constantly updates the list of metrics considered by PageSpeed, so the best source of what is important is the Experience / Core Web Vitals section in Google Search Console, assuming you already added your website there.
To pass the Core Web Vitals Assessment, all the scores (Largest Contentful Paint, First Input Delay, and Cumulative Layout Shift) must be green:
For a value to be green, at least 75% of real-user visits must experience the "good" threshold or better; in other words, the 75th percentile of the field data has to pass. The threshold differs for each score, and it is significantly stricter for FID.
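The 75th-percentile rule can be sketched in a few lines (a simplification: Chrome's field data actually arrives as bucketed histograms rather than raw samples, and the 2.5 s LCP boundary below is the documented "good" threshold):

```javascript
// A metric passes when its 75th percentile is within the "good" threshold.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// Hypothetical LCP field samples, in seconds:
const lcpSamples = [1.2, 1.8, 2.1, 2.4, 4.0];
const p75 = percentile(lcpSamples, 75); // 2.4
console.log(p75 <= 2.5 ? 'passes' : 'fails'); // a single slow outlier does not fail the page
```

This is why one unlucky visit does not sink the assessment: the slowest quarter of experiences is outside the measured percentile.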
To better understand the values, click the score title:
This links to a blog post explaining the thresholds for the given category in more detail.
The data is accumulated for 28 days, and there are two other major differences from what real users might be experiencing:
If many of a site’s users live in regions with slow internet access and use outdated or underperforming devices, the difference can be surprising. This isn’t one of PageSpeed Insights’ improvement recommendations. At first glance, it’s not obvious how to deal with this issue, but we will try to explain later.
The main part of the rating comes from how most users open the page. Not every visitor spends a long time on the site, and many visit only occasionally, but all of them are counted in the ratings, so improving page load speed, which affects everyone, is a good place to start.
We can find recommendations in the Opportunities section below the assessment results.
We can expand each item and get detailed recommendations for improvements. There is a lot of information, but here are the most basic and important tips:
Now let’s have a look at more complicated factors, where an experienced programmer can help.
As mentioned, Google Search Console considers average scores obtained from Chromium-based browsers for the last 28 days and also includes values for the entire lifetime of the page.
The inability to see what happens during the page’s lifetime is a problem. PageSpeed’s virtual machine can’t account for how the page performs once it’s loaded and users are interacting with it, which means site developers won’t have access to recommendations for improvements.
The solution is to include the Google Chrome Web Vitals library in the developer version of a site project to see what’s happening while a user interacts with the page.
Various options on how to include this library are in its README.md file on GitHub. The simplest way is to add the following script, tweaked to display values over the page lifetime, in the main template's <head>:
<script>
(function() {
var script = document.createElement('script');
script.src = 'https://unpkg.com/web-vitals/dist/web-vitals.iife.js';
script.onload = function() {
// When loading `web-vitals` using a classic script, all the public
// methods can be found on the `webVitals` global namespace.
webVitals.getCLS(console.log, true); // CLS supported only in Chromium.
webVitals.getLCP(console.log, true); // LCP supported only in Chromium.
webVitals.getFID(console.log, true);
webVitals.getFCP(console.log, true);
webVitals.getTTFB(console.log, true);
}
document.head.appendChild(script);
}())
</script>
Note that Cumulative Layout Shift (CLS) and Largest Contentful Paint (LCP) calculation is available only for Chromium-based browsers, including Chrome, Opera, Brave (disable Brave Shields to make the library work), and most other modern browsers, except Firefox, which is based on a Mozilla engine, and Apple’s Safari browser.
After adding the script and reloading the page, open the browser’s developer tools and switch to the Console tab.
Values Provided By the Chrome Web Vitals Library in Chrome’s Console Tab
To see how those values are calculated for the mobile version, switch to the mobile device using the Device toolbar. To access it, click the Toggle Device Toolbar button in your browser’s Developer tools.
This will help pinpoint problems. Expanding the row in the console will show details on what triggered the score change.
Most of the time, the automatic advice for other scores is sufficient to get an idea on how to improve them. However, CLS changes after the page is loaded with user interactions, and there simply may not be any recommendations, especially for single-page applications. You may see a perfect 100 score in the Diagnose Performance Issues section, even as your page fails to pass the assessment for factors considered by the search engine.
For those of us struggling with CLS, this will be helpful. Expand the log record, then entries, a specific entry, sources, a specific source, and compare currentRect with previousRect:
Now that we can see what changed, we can identify some ways to fix it.
Of all the scores, CLS is the hardest to grasp, but it’s crucial for user experience. Layout shift occurs when an element is added to the document object model (DOM) or the size or position of an existing element is changed. It causes elements below that element to shift, and the user feels like they can’t control what’s going on, causing them to leave the website.
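CLS is not a plain sum of every shift, either: shifts are grouped into session windows (a window closes after a 1-second gap between shifts, or once it spans 5 seconds), and the reported value is the largest window. A minimal sketch of that aggregation, assuming entries simplified to {time, value} pairs rather than real PerformanceObserver entries:

```javascript
// Session-window aggregation behind the CLS score (simplified sketch).
// shifts: array of { time: ms since load, value: layout-shift score }
function computeCLS(shifts) {
  let maxWindow = 0;   // largest session window seen so far
  let windowSum = 0;   // sum of shifts in the current window
  let windowStart = 0; // when the current window opened
  let prevTime = -Infinity;
  for (const { time, value } of shifts) {
    // Open a new window after a >1s gap or once the window spans >5s
    if (time - prevTime > 1000 || time - windowStart > 5000) {
      windowSum = value;
      windowStart = time;
    } else {
      windowSum += value;
    }
    prevTime = time;
    maxWindow = Math.max(maxWindow, windowSum);
  }
  return maxWindow;
}
```

So two small shifts close together count as one window, while a shift long after page load starts a fresh window instead of inflating the old one.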
It’s relatively easy to handle this on a simple HTML page. Set width and height attributes for images so the text below them is not shifted while they load. This will likely solve the problem.
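A minimal sketch of that fix; the file name and dimensions below are placeholders:

```html
<!-- Declaring the image's intrinsic size reserves its box before the file loads,
     so the text below it does not shift -->
<img src="hero.jpg" width="640" height="360" alt="Product photo">
```

Modern browsers derive the image's aspect ratio from these attributes, so the reserved box stays correct even when CSS later scales the image to the layout's width.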
If your page is dynamic and users work with it like an application, CLS issues take more deliberate steps to address: reserve space for content that arrives late, and avoid inserting new elements above existing content. More detailed recommendations are available on the Google Developers Optimize CLS page.
To illustrate how to use the 500-millisecond threshold, we will use an example involving an image upload.
Normally, when a user uploads a file, the script adds an <img> element to the DOM, and the client browser then downloads the image from the server. Fetching an image from a server can take more than 500 milliseconds and may cause a layout shift. But since the image is already on the client's computer, there is a way to display it faster and create the <img> element before the 500-millisecond deadline is up.
Here is a universal example on pure ECMAScript without libraries that will work on most modern browsers:
<!DOCTYPE html>
<html>
<head></head>
<body>
  <input type="file" id="input">
  <img id="image">
  <script>
    document.getElementById('input').addEventListener('change', function () {
      var imageInput = document.getElementById('input');
      if (imageInput.files && imageInput.files[0]) {
        // Read the selected file directly from the client's disk
        var fileReader = new FileReader();
        fileReader.onload = function (event) {
          // Set the image as a data URL, with no round trip to the server
          var imageElement = document.getElementById('image');
          imageElement.setAttribute('src', event.target.result);
        }
        fileReader.readAsDataURL(imageInput.files[0]);
      }
    });
  </script>
</body>
</html>
As we saw earlier, fixing these kinds of issues can require mental agility. On mobile devices, especially cheap ones with slow mobile internet, the '90s art of performance optimization becomes useful again, and old-school web programming approaches can inspire our techniques. Modern browser debugging tools will help with that.
After finding and eliminating issues, Google’s search engine may take some time to register the changes. To update the results a bit faster, let Google Search Console know that you’ve fixed the problems.
Select the page you’re working on using the Search property box in the top left corner. Then navigate to Core Web Vitals in the left hamburger menu:
Click the Open Report button on the top right of the mobile or desktop report. (If you experienced problems with both, remember to repeat the same actions for the second report later.)
Next, go to the Details section under the chart and click on the row with the failed validation warning.
Then click the See Details button for this issue.
And finally click Start New Validation.
Do not expect immediate results. Validation may take up to 28 days.
SEO optimization is a continuous process, and the same is true of performance optimization. As your audience grows, servers receive more requests and responses get slower. Increasing demand usually means new features are added to your site, and they may affect performance.
When it comes to the cost/benefit aspect of performance optimization, it is necessary to strike the right balance. Developers don’t need to achieve the best values on all sites, all the time. Concentrate on what causes the most significant performance problems; you’ll get results faster and with less effort. Major corporations can afford to invest a lot of resources and ace all the scores, but this is not the case for small and midsize businesses. In reality, a small business most likely only needs to match or surpass the performance of their competitors, not industry heavyweights like Amazon.
Business owners should understand why site optimization is critical, what aspects of the work are most important, and which skills to seek out in the people they hire to do it. Developers, for their part, should keep performance in mind at all times, helping their clients create sites that not only feel fast for end users, but also score well in PageSpeed Insights.
Original article source at: https://www.toptal.com/
1648904358
Highlighting or annotating text in a PDF file is a great strategy for reading and retaining important information. This technique helps bring key information immediately to the reader's attention; there is no doubt that a text highlighted in yellow would probably catch your eye first.
Redacting a PDF file lets you hide confidential information while keeping the document's formatting. This preserves private and sensitive information before the file is shared, and it further strengthens an organization's integrity and credibility in handling such information.
In this tutorial, you will learn how to redact, frame, or highlight text in PDF files using Python.
We will use the PyMuPDF library, a highly versatile and customizable PDF, XPS, and EBook interpreter solution that can be used in a wide range of applications as a PDF renderer, viewer, or toolkit.
The goal of this tutorial is to develop a lightweight command-line utility for redacting, framing, or highlighting text contained in a single PDF file or in a folder holding a collection of PDF files. It will also let you remove highlights from a PDF file or a collection of PDF files.
Let's install the requirements:
$ pip install PyMuPDF==1.18.9
Open a new Python file and let's get started:
# Import Libraries
from typing import Tuple
from io import BytesIO
import os
import argparse
import re
import fitz
def extract_info(input_file: str):
    """
    Extracts file info
    """
    # Open the PDF
    pdfDoc = fitz.open(input_file)
    output = {
        "File": input_file, "Encrypted": ("True" if pdfDoc.isEncrypted else "False")
    }
    # If PDF is encrypted the file metadata cannot be extracted
    if not pdfDoc.isEncrypted:
        for key, value in pdfDoc.metadata.items():
            output[key] = value
    # To Display File Info
    print("## File Information ##################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in output.items()))
    print("######################################################################")
    return True, output
The extract_info() function collects the metadata of a PDF file; the attributes that can be extracted are format, title, author, subject, keywords, creator, producer, creation date, modification date, trapped, encryption, and the number of pages. Note that these attributes cannot be extracted when you target an encrypted PDF file.
def search_for_text(lines, search_str):
    """
    Search for the search string within the document lines
    """
    for line in lines:
        # Find all matches within one line
        results = re.findall(search_str, line, re.IGNORECASE)
        # In case multiple matches within one line
        for result in results:
            yield result
This function searches for the given string within the document's lines using the re.findall() function; the re.IGNORECASE flag makes the search case-insensitive.
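A quick sketch of this generator in isolation (the sample lines are made up for illustration, and the function is restated so the snippet runs on its own). Note that because the search string is passed straight to re.findall(), regular-expression syntax works too:

```python
import re

def search_for_text(lines, search_str):
    # Yield every match of search_str found in each line, ignoring case
    for line in lines:
        for result in re.findall(search_str, line, re.IGNORECASE):
            yield result

lines = ["Please organise the files.", "We will organize them later."]
print(list(search_for_text(lines, "organi[sz]e")))  # ['organise', 'organize']
```

The character class [sz] matches either spelling in a single pass, which is handy for documents with mixed British and American spellings.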
def redact_matching_data(page, matched_values):
    """
    Redacts matching values
    """
    matches_found = 0
    # Loop throughout matching values
    for val in matched_values:
        matches_found += 1
        matching_val_area = page.searchFor(val)
        # Redact matching values
        [page.addRedactAnnot(area, text=" ", fill=(0, 0, 0))
         for area in matching_val_area]
    # Apply the redaction
    page.apply_redactions()
    return matches_found
This function performs the following: it loops through the matched values while counting them, locates each value's areas on the page with page.searchFor(), covers every matching area with a redaction annotation, and finally applies the redactions to the page before returning the number of matches found.
You can change the redaction color using the fill argument of the page.addRedactAnnot() method; setting it to (0, 0, 0) results in a black redaction. These are RGB values ranging from 0 to 1: for example, (1, 0, 0) results in a red redaction, and so on.
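Since PyMuPDF expects each RGB component in the 0-1 range, a small helper of our own (not part of PyMuPDF) can convert the more familiar 0-255 values:

```python
def rgb255_to_unit(r, g, b):
    # Scale 0-255 channel values down to the 0-1 range PyMuPDF expects
    return (r / 255, g / 255, b / 255)

print(rgb255_to_unit(255, 0, 0))  # (1.0, 0.0, 0.0), i.e. red
```

The result can be passed directly as the fill argument, e.g. fill=rgb255_to_unit(128, 0, 128) for a purple redaction.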
def frame_matching_data(page, matched_values):
    """
    frames matching values
    """
    matches_found = 0
    # Loop throughout matching values
    for val in matched_values:
        matches_found += 1
        matching_val_area = page.searchFor(val)
        for area in matching_val_area:
            if isinstance(area, fitz.fitz.Rect):
                # Draw a rectangle around matched values
                annot = page.addRectAnnot(area)
                # , fill = fitz.utils.getColor('black')
                annot.setColors(stroke=fitz.utils.getColor('red'))
                # If you want to remove matched data
                # page.addFreetextAnnot(area, ' ')
                annot.update()
    return matches_found
The frame_matching_data() function draws a red rectangle (a frame) around the matched values.
Next, let's define a function to highlight text:
def highlight_matching_data(page, matched_values, type):
    """
    Highlight matching values
    """
    matches_found = 0
    # Loop throughout matching values
    for val in matched_values:
        matches_found += 1
        matching_val_area = page.searchFor(val)
        highlight = None
        if type == 'Highlight':
            highlight = page.addHighlightAnnot(matching_val_area)
        elif type == 'Squiggly':
            highlight = page.addSquigglyAnnot(matching_val_area)
        elif type == 'Underline':
            highlight = page.addUnderlineAnnot(matching_val_area)
        elif type == 'Strikeout':
            highlight = page.addStrikeoutAnnot(matching_val_area)
        else:
            highlight = page.addHighlightAnnot(matching_val_area)
        # To change the highlight color:
        # highlight.setColors({"stroke": (0, 0, 1), "fill": (0.75, 0.8, 0.95)})
        # highlight.setColors(stroke=fitz.utils.getColor('white'), fill=fitz.utils.getColor('red'))
        highlight.update()
    return matches_found
The function above applies the appropriate highlight mode to the matched values, depending on the highlight type passed as a parameter. You can always change the highlight color using the highlight.setColors() method, as shown in the comments.
def process_data(input_file: str, output_file: str, search_str: str, pages: Tuple = None, action: str = 'Highlight'):
    """
    Process the pages of the PDF File
    """
    # Open the PDF
    pdfDoc = fitz.open(input_file)
    # Save the generated PDF to memory buffer
    output_buffer = BytesIO()
    total_matches = 0
    # Iterate through pages
    for pg in range(pdfDoc.pageCount):
        # If required for specific pages
        if pages:
            if str(pg) not in pages:
                continue
        # Select the page
        page = pdfDoc[pg]
        # Get matching data by splitting the page text into lines
        page_lines = page.getText("text").split('\n')
        matched_values = search_for_text(page_lines, search_str)
        if matched_values:
            if action == 'Redact':
                matches_found = redact_matching_data(page, matched_values)
            elif action == 'Frame':
                matches_found = frame_matching_data(page, matched_values)
            elif action in ('Highlight', 'Squiggly', 'Underline', 'Strikeout'):
                matches_found = highlight_matching_data(page, matched_values, action)
            else:
                matches_found = highlight_matching_data(page, matched_values, 'Highlight')
            total_matches += matches_found
    print(f"{total_matches} Match(es) Found of Search String {search_str} In Input File: {input_file}")
    # Save to output
    pdfDoc.save(output_buffer)
    pdfDoc.close()
    # Save the output buffer to the output file
    with open(output_file, mode='wb') as f:
        f.write(output_buffer.getbuffer())
The main purpose of the process_data() function is the following: it opens the input PDF file, iterates through its pages (or only the requested ones), collects the values matching the search string on each page, applies the requested action to them ("Redact", "Frame", "Highlight", etc.), and saves the result.
It accepts several parameters:
input_file: the path of the PDF file to process.
output_file: the path of the PDF file to generate after processing.
search_str: the string to search for.
pages: the pages to consider while processing the PDF file.
action: the action to perform on the PDF file.
Next, let's write a function to remove the highlighting in case we want to:
def remove_highlght(input_file: str, output_file: str, pages: Tuple = None):
    # Open the PDF
    pdfDoc = fitz.open(input_file)
    # Save the generated PDF to memory buffer
    output_buffer = BytesIO()
    # Initialize a counter for annotations
    annot_found = 0
    # Iterate through pages
    for pg in range(pdfDoc.pageCount):
        # If required for specific pages
        if pages:
            if str(pg) not in pages:
                continue
        # Select the page
        page = pdfDoc[pg]
        # Walk the page's annotations and delete them one by one
        annot = page.firstAnnot
        while annot:
            annot_found += 1
            page.deleteAnnot(annot)
            annot = annot.next
    if annot_found >= 1:
        print(f"{annot_found} Annotation(s) Found In The Input File: {input_file}")
    # Save to output
    pdfDoc.save(output_buffer)
    pdfDoc.close()
    # Save the output buffer to the output file
    with open(output_file, mode='wb') as f:
        f.write(output_buffer.getbuffer())
The purpose of the remove_highlght() function is to remove the highlights (not the redactions) from a PDF file. It opens the input file, iterates through its pages (honoring the pages parameter), deletes every annotation it finds while counting them, and saves the result to the output file.
Now let's make a wrapper function that uses the previous functions to call the appropriate one depending on the action:
def process_file(**kwargs):
    """
    To process one single file
    Redact, Frame, Highlight... one PDF File
    Remove Highlights from a single PDF File
    """
    input_file = kwargs.get('input_file')
    output_file = kwargs.get('output_file')
    if output_file is None:
        output_file = input_file
    search_str = kwargs.get('search_str')
    pages = kwargs.get('pages')
    # Redact, Frame, Highlight, Squiggly, Underline, Strikeout, Remove
    action = kwargs.get('action')
    if action == "Remove":
        # Remove the Highlights except Redactions
        remove_highlght(input_file=input_file,
                        output_file=output_file, pages=pages)
    else:
        process_data(input_file=input_file, output_file=output_file,
                     search_str=search_str, pages=pages, action=action)
The action can be "Redact", "Frame", "Highlight", "Squiggly", "Underline", "Strikeout", or "Remove".
Let's define the same function, but for folders that contain multiple PDF files:
def process_folder(**kwargs):
    """
    Redact, Frame, Highlight... all PDF Files within a specified path
    Remove Highlights from all PDF Files within a specified path
    """
    input_folder = kwargs.get('input_folder')
    search_str = kwargs.get('search_str')
    # Run in recursive mode
    recursive = kwargs.get('recursive')
    # Redact, Frame, Highlight, Squiggly, Underline, Strikeout, Remove
    action = kwargs.get('action')
    pages = kwargs.get('pages')
    # Loop through the files within the input folder
    for foldername, dirs, filenames in os.walk(input_folder):
        for filename in filenames:
            # Process PDF files only
            if not filename.endswith('.pdf'):
                continue
            inp_pdf_file = os.path.join(foldername, filename)
            print("Processing file =", inp_pdf_file)
            process_file(input_file=inp_pdf_file, output_file=None,
                         search_str=search_str, action=action, pages=pages)
        if not recursive:
            break
This function processes the PDF files contained in a specified folder. It loops over the folder's files, either recursively or not depending on the value of the recursive parameter, and processes them one by one.
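The non-recursive mode relies on os.walk() yielding the top folder first, so breaking out of the loop after the first iteration confines processing to that folder. A self-contained sketch of that traversal logic, using a temporary directory with made-up file names:

```python
import os
import tempfile

def list_pdfs(input_folder, recursive):
    # Collect .pdf paths, descending into subfolders only when recursive is True
    found = []
    for foldername, dirs, filenames in os.walk(input_folder):
        for filename in filenames:
            if filename.endswith('.pdf'):
                found.append(os.path.join(foldername, filename))
        if not recursive:
            break  # os.walk yields the top folder first, so stop here
    return found

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for name in ("a.pdf", os.path.join("sub", "b.pdf"), "notes.txt"):
        open(os.path.join(root, name), "w").close()
    print(len(list_pdfs(root, recursive=False)))  # 1 (only a.pdf)
    print(len(list_pdfs(root, recursive=True)))   # 2 (a.pdf and sub/b.pdf)
```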
It accepts the following parameters:
input_folder: the path of the folder containing the PDF files to process.
search_str: the text to search for and manipulate.
recursive: whether to run this process recursively by looping through the subfolders or not.
action: the action to perform, from the list mentioned earlier.
pages: the pages to consider.
Before writing our main code, let's create a function for parsing command-line arguments:
def is_valid_path(path):
    """
    Validates the path inputted and checks whether it is a file path or a folder path
    """
    if not path:
        raise ValueError("Invalid Path")
    if os.path.isfile(path):
        return path
    elif os.path.isdir(path):
        return path
    else:
        raise ValueError(f"Invalid Path {path}")


def parse_args():
    """Get user command line parameters"""
    parser = argparse.ArgumentParser(description="Available Options")
    parser.add_argument('-i', '--input_path', dest='input_path', type=is_valid_path,
                        required=True, help="Enter the path of the file or the folder to process")
    parser.add_argument('-a', '--action', dest='action',
                        choices=['Redact', 'Frame', 'Highlight', 'Squiggly', 'Underline', 'Strikeout', 'Remove'],
                        type=str, default='Highlight',
                        help="Choose whether to Redact, Frame, Highlight, Squiggly, Underline, Strikeout or Remove")
    parser.add_argument('-p', '--pages', dest='pages', type=tuple,
                        help="Enter the pages to consider e.g.: [2,4]")
    action = parser.parse_known_args()[0].action
    if action != 'Remove':
        parser.add_argument('-s', '--search_str', dest='search_str',
                            type=str, required=True, help="Enter a valid search string")
    path = parser.parse_known_args()[0].input_path
    if os.path.isfile(path):
        parser.add_argument('-o', '--output_file', dest='output_file', type=str,
                            help="Enter a valid output file")
    if os.path.isdir(path):
        parser.add_argument('-r', '--recursive', dest='recursive', default=False,
                            type=lambda x: (str(x).lower() in ['true', '1', 'yes']),
                            help="Process Recursively or Non-Recursively")
    args = vars(parser.parse_args())
    # To Display The Command Line Arguments
    print("## Command Arguments #################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in args.items()))
    print("######################################################################")
    return args
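The conditional registration above relies on argparse's parse_known_args(), which parses the options it already knows about without failing on the ones it doesn't. A minimal sketch of that two-stage pattern (the argument values are made up):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-a', '--action', default='Highlight')
# First pass: peek at the action without failing on not-yet-registered options
action = parser.parse_known_args(['-a', 'Redact', '-s', 'BERT'])[0].action
if action != 'Remove':
    # Only register -s when the chosen action actually needs a search string
    parser.add_argument('-s', '--search_str', required=True)
# Second pass: a full parse now that all relevant options exist
args = parser.parse_args(['-a', 'Redact', '-s', 'BERT'])
print(args.search_str)  # BERT
```

This is why the utility can make -s required for most actions yet omit it entirely for Remove.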
Finally, let's write the main code:
if __name__ == '__main__':
    # Parsing command line arguments entered by user
    args = parse_args()
    # If File Path
    if os.path.isfile(args['input_path']):
        # Extracting File Info
        extract_info(input_file=args['input_path'])
        # Process a file
        process_file(
            input_file=args['input_path'], output_file=args['output_file'],
            search_str=args['search_str'] if 'search_str' in args.keys() else None,
            pages=args['pages'], action=args['action']
        )
    # If Folder Path
    elif os.path.isdir(args['input_path']):
        # Process a folder
        process_folder(
            input_folder=args['input_path'],
            search_str=args['search_str'] if 'search_str' in args.keys() else None,
            action=args['action'], pages=args['pages'], recursive=args['recursive']
        )
Now let's test our program:
$ python pdf_highlighter.py --help
Output:
usage: pdf_highlighter.py [-h] -i INPUT_PATH [-a {Redact,Frame,Highlight,Squiggly,Underline,Strikeout,Remove}] [-p PAGES]
Available Options
optional arguments:
-h, --help show this help message and exit
-i INPUT_PATH, --input_path INPUT_PATH
Enter the path of the file or the folder to process
-a {Redact,Frame,Highlight,Squiggly,Underline,Strikeout,Remove}, --action {Redact,Frame,Highlight,Squiggly,Underline,Strikeout,Remove}
Choose whether to Redact or to Frame or to Highlight or to Squiggly or to Underline or to Strikeout or to Remove
-p PAGES, --pages PAGES
Enter the pages to consider e.g.: [2,4]
Before exploring our test scenarios, let me clarify a couple of points: if you get a PermissionError, close the input PDF file before running this utility; and since the search string is compiled as a regular expression, a pattern such as "organi[sz]e" will match both "organise" and "organize".
As a demonstration, let's highlight the word "BERT" in the BERT paper:
$ python pdf_highlighter.py -i bert-paper.pdf -a Highlight -s "BERT"
Output:
## Command Arguments #################################################
input_path:bert-paper.pdf
action:Highlight
pages:None
search_str:BERT
output_file:None
######################################################################
## File Information ##################################################
File:bert-paper.pdf
Encrypted:False
format:PDF 1.5
title:
author:
subject:
keywords:
creator:LaTeX with hyperref package
producer:pdfTeX-1.40.17
creationDate:D:20190528000751Z
modDate:D:20190528000751Z
trapped:
encryption:None
######################################################################
121 Match(es) Found of Search String BERT In Input File: bert-paper.pdf
As you can see, 121 matches were highlighted; you can use the other options, such as underline, frame, and the rest. Here is the resulting PDF:
Now let's remove the highlighting:
$ python pdf_highlighter.py -i bert-paper.pdf -a Remove
The resulting PDF will have the highlights removed.
I invite you to play with the other actions, as it's quite interesting to do this automatically with Python.
If you want to highlight the text of multiple PDF files, you can pass a folder to the -i parameter, or merge the PDF files and run the code once to get a single PDF with all the text you want highlighted.
1648900689
Das Hervorheben oder Kommentieren eines Textes in einer PDF-Datei ist eine großartige Strategie, um wichtige Informationen zu lesen und zu behalten. Diese Technik kann dabei helfen, dem Leser wichtige Informationen sofort zur Kenntnis zu bringen. Es besteht kein Zweifel, dass Ihnen ein gelb hervorgehobener Text wahrscheinlich zuerst ins Auge fallen würde.
Durch das Schwärzen einer PDF-Datei können Sie vertrauliche Informationen ausblenden und gleichzeitig die Formatierung Ihres Dokuments beibehalten. Dies bewahrt private und vertrauliche Informationen vor der Weitergabe. Darüber hinaus wird die Integrität und Glaubwürdigkeit der Organisation im Umgang mit sensiblen Informationen weiter gestärkt.
In diesem Tutorial erfahren Sie, wie Sie mit Python einen Text in PDF-Dateien redigieren, einrahmen oder hervorheben.
In diesem Handbuch verwenden wir die PyMuPDF-Bibliothek , eine äußerst vielseitige, anpassbare PDF-, XPS- und EBook-Interpreterlösung, die in einer Vielzahl von Anwendungen als PDF-Renderer, Viewer oder Toolkit verwendet werden kann.
Das Ziel dieses Lernprogramms ist die Entwicklung eines einfachen befehlszeilenbasierten Dienstprogramms zum Schwärzen, Rahmen oder Hervorheben eines Textes, der in einer PDF-Datei oder in einem Ordner enthalten ist, der eine Sammlung von PDF-Dateien enthält. Darüber hinaus können Sie die Markierungen aus einer PDF-Datei oder einer Sammlung von PDF-Dateien entfernen.
Lassen Sie uns die Anforderungen installieren:
$ pip install PyMuPDF==1.18.9
Öffnen Sie eine neue Python-Datei und legen Sie los:
# Import Libraries
from typing import Tuple
from io import BytesIO
import os
import argparse
import re
import fitz
def extract_info(input_file: str):
"""
Extracts file info
"""
# Open the PDF
pdfDoc = fitz.open(input_file)
output = {
"File": input_file, "Encrypted": ("True" if pdfDoc.isEncrypted else "False")
}
# If PDF is encrypted the file metadata cannot be extracted
if not pdfDoc.isEncrypted:
for key, value in pdfDoc.metadata.items():
output[key] = value
# To Display File Info
print("## File Information ##################################################")
print("\n".join("{}:{}".format(i, j) for i, j in output.items()))
print("######################################################################")
return True, output
extract_info()
Funktion sammelt die Metadaten einer PDF-Datei, die Attribute, die extrahiert werden können, sind format
, title
, author
, subject
, keywords
, creator
, producer
, creation date
, modification date
, trapped
, encryption
, und die Anzahl der Seiten. Beachten Sie, dass diese Attribute nicht extrahiert werden können, wenn Sie auf eine verschlüsselte PDF-Datei abzielen.
def search_for_text(lines, search_str):
"""
Search for the search string within the document lines
"""
for line in lines:
# Find all matches within one line
results = re.findall(search_str, line, re.IGNORECASE)
# In case multiple matches within one line
for result in results:
yield result
Diese Funktion sucht mit der re.findall()
Funktion nach einer Zeichenkette innerhalb der Dokumentzeilen, re.IGNORECASE
soll die Groß-/Kleinschreibung bei der Suche ignorieren.
def redact_matching_data(page, matched_values):
"""
Redacts matching values
"""
matches_found = 0
# Loop throughout matching values
for val in matched_values:
matches_found += 1
matching_val_area = page.searchFor(val)
# Redact matching values
[page.addRedactAnnot(area, text=" ", fill=(0, 0, 0))
for area in matching_val_area]
# Apply the redaction
page.apply_redactions()
return matches_found
Diese Funktion führt Folgendes aus:
Sie können die Farbe der Schwärzung mit dem fill
Argument der page.addRedactAnnot()
Methode ändern, wenn Sie es auf setzen, (0, 0, 0)
wird eine schwarze Schwärzung resultieren. Dies sind RGB-Werte im Bereich von 0 bis 1. Beispielsweise (1, 0, 0)
führt dies zu einer roten Schwärzung und so weiter.
def frame_matching_data(page, matched_values):
"""
frames matching values
"""
matches_found = 0
# Loop throughout matching values
for val in matched_values:
matches_found += 1
matching_val_area = page.searchFor(val)
for area in matching_val_area:
if isinstance(area, fitz.fitz.Rect):
# Draw a rectangle around matched values
annot = page.addRectAnnot(area)
# , fill = fitz.utils.getColor('black')
annot.setColors(stroke=fitz.utils.getColor('red'))
# If you want to remove matched data
#page.addFreetextAnnot(area, ' ')
annot.update()
return matches_found
Die frame_matching_data()
Funktion zeichnet ein rotes Rechteck (Rahmen) um die übereinstimmenden Werte.
Als Nächstes definieren wir eine Funktion zum Hervorheben von Text:
def highlight_matching_data(page, matched_values, type):
"""
Highlight matching values
"""
matches_found = 0
# Loop throughout matching values
for val in matched_values:
matches_found += 1
matching_val_area = page.searchFor(val)
# print("matching_val_area",matching_val_area)
highlight = None
if type == 'Highlight':
highlight = page.addHighlightAnnot(matching_val_area)
elif type == 'Squiggly':
highlight = page.addSquigglyAnnot(matching_val_area)
elif type == 'Underline':
highlight = page.addUnderlineAnnot(matching_val_area)
elif type == 'Strikeout':
highlight = page.addStrikeoutAnnot(matching_val_area)
else:
highlight = page.addHighlightAnnot(matching_val_area)
# To change the highlight colar
# highlight.setColors({"stroke":(0,0,1),"fill":(0.75,0.8,0.95) })
# highlight.setColors(stroke = fitz.utils.getColor('white'), fill = fitz.utils.getColor('red'))
# highlight.setColors(colors= fitz.utils.getColor('red'))
highlight.update()
return matches_found
Die obige Funktion wendet den geeigneten Hervorhebungsmodus auf die übereinstimmenden Werte an, abhängig von der Art der als Parameter eingegebenen Hervorhebung.
Sie können die Farbe der Hervorhebung jederzeit mit der highlight.setColors()
in den Kommentaren gezeigten Methode ändern.
def process_data(input_file: str, output_file: str, search_str: str, pages: Tuple = None, action: str = 'Highlight'):
"""
Process the pages of the PDF File
"""
# Open the PDF
pdfDoc = fitz.open(input_file)
# Save the generated PDF to memory buffer
output_buffer = BytesIO()
total_matches = 0
# Iterate through pages
for pg in range(pdfDoc.pageCount):
# If required for specific pages
if pages:
if str(pg) not in pages:
continue
# Select the page
page = pdfDoc[pg]
# Get Matching Data
# Split page by lines
page_lines = page.getText("text").split('\n')
matched_values = search_for_text(page_lines, search_str)
if matched_values:
if action == 'Redact':
matches_found = redact_matching_data(page, matched_values)
elif action == 'Frame':
matches_found = frame_matching_data(page, matched_values)
elif action in ('Highlight', 'Squiggly', 'Underline', 'Strikeout'):
matches_found = highlight_matching_data(
page, matched_values, action)
else:
matches_found = highlight_matching_data(
page, matched_values, 'Highlight')
total_matches += matches_found
print(f"{total_matches} Match(es) Found of Search String {search_str} In Input File: {input_file}")
# Save to output
pdfDoc.save(output_buffer)
pdfDoc.close()
# Save the output buffer to the output file
with open(output_file, mode='wb') as f:
f.write(output_buffer.getbuffer())
Der Hauptzweck der process_data()
Funktion ist folgender:
"Redact"
, "Frame"
, "Highlight"
, usw.)Es akzeptiert mehrere Parameter:
input_file
: Der Pfad der zu verarbeitenden PDF-Datei.output_file
: Der Pfad der PDF-Datei, die nach der Verarbeitung generiert werden soll.search_str
: Die Zeichenfolge, nach der gesucht werden soll.pages
: Die bei der Verarbeitung der PDF-Datei zu berücksichtigenden Seiten.action
: Die Aktion, die für die PDF-Datei ausgeführt werden soll.Als Nächstes schreiben wir eine Funktion, um die Hervorhebung zu entfernen, falls wir dies möchten:
def remove_highlght(input_file: str, output_file: str, pages: Tuple = None):
# Open the PDF
pdfDoc = fitz.open(input_file)
# Save the generated PDF to memory buffer
output_buffer = BytesIO()
# Initialize a counter for annotations
annot_found = 0
# Iterate through pages
for pg in range(pdfDoc.pageCount):
# If required for specific pages
if pages:
if str(pg) not in pages:
continue
# Select the page
page = pdfDoc[pg]
annot = page.firstAnnot
while annot:
annot_found += 1
page.deleteAnnot(annot)
annot = annot.next
if annot_found >= 0:
print(f"Annotation(s) Found In The Input File: {input_file}")
# Save to output
pdfDoc.save(output_buffer)
pdfDoc.close()
# Save the output buffer to the output file
with open(output_file, mode='wb') as f:
f.write(output_buffer.getbuffer())
Der Zweck der remove_highlight()
Funktion besteht darin, die Hervorhebungen (nicht die Schwärzungen) aus einer PDF-Datei zu entfernen. Es führt Folgendes aus:
Lassen Sie uns nun eine Wrapper-Funktion erstellen, die vorherige Funktionen verwendet, um die entsprechende Funktion abhängig von der Aktion aufzurufen:
def process_file(**kwargs):
"""
To process one single file
Redact, Frame, Highlight... one PDF File
Remove Highlights from a single PDF File
"""
input_file = kwargs.get('input_file')
output_file = kwargs.get('output_file')
if output_file is None:
output_file = input_file
search_str = kwargs.get('search_str')
pages = kwargs.get('pages')
# Redact, Frame, Highlight, Squiggly, Underline, Strikeout, Remove
action = kwargs.get('action')
if action == "Remove":
# Remove the Highlights except Redactions
remove_highlght(input_file=input_file,
output_file=output_file, pages=pages)
else:
process_data(input_file=input_file, output_file=output_file,
search_str=search_str, pages=pages, action=action)
Die Aktion kann "Redact"
, "Frame"
, "Highlight"
, "Squiggly"
, "Underline"
, "Strikeout"
, und sein "Remove"
.
Lassen Sie uns dieselbe Funktion definieren, aber mit Ordnern, die mehrere PDF-Dateien enthalten:
def process_folder(**kwargs):
"""
Redact, Frame, Highlight... all PDF Files within a specified path
Remove Highlights from all PDF Files within a specified path
"""
input_folder = kwargs.get('input_folder')
search_str = kwargs.get('search_str')
# Run in recursive mode
recursive = kwargs.get('recursive')
#Redact, Frame, Highlight, Squiggly, Underline, Strikeout, Remove
action = kwargs.get('action')
pages = kwargs.get('pages')
# Loop though the files within the input folder.
for foldername, dirs, filenames in os.walk(input_folder):
for filename in filenames:
# Check if pdf file
if not filename.endswith('.pdf'):
continue
# PDF File found
inp_pdf_file = os.path.join(foldername, filename)
print("Processing file =", inp_pdf_file)
process_file(input_file=inp_pdf_file, output_file=None,
search_str=search_str, action=action, pages=pages)
if not recursive:
break
Diese Funktion dient dazu, die in einem bestimmten Ordner enthaltenen PDF-Dateien zu verarbeiten.
Es durchläuft die Dateien des angegebenen Ordners entweder rekursiv oder nicht, abhängig vom Wert des Parameters recursive, und verarbeitet diese Dateien nacheinander.
Es akzeptiert die folgenden Parameter:
input_folder
: Der Pfad des Ordners, der die zu verarbeitenden PDF-Dateien enthält.search_str
: Der Text, nach dem gesucht werden soll, um ihn zu manipulieren.recursive
: ob dieser Prozess rekursiv ausgeführt werden soll, indem die Unterordner durchlaufen werden oder nicht.action
: die auszuführende Aktion aus der zuvor erwähnten Liste.pages
: die zu berücksichtigenden Seiten.Bevor wir unseren Hauptcode erstellen, erstellen wir eine Funktion zum Analysieren von Befehlszeilenargumenten:
def is_valid_path(path):
"""
Validates the path inputted and checks whether it is a file path or a folder path
"""
if not path:
raise ValueError(f"Invalid Path")
if os.path.isfile(path):
return path
elif os.path.isdir(path):
return path
else:
raise ValueError(f"Invalid Path {path}")
def parse_args():
"""Get user command line parameters"""
parser = argparse.ArgumentParser(description="Available Options")
parser.add_argument('-i', '--input_path', dest='input_path', type=is_valid_path,
required=True, help="Enter the path of the file or the folder to process")
parser.add_argument('-a', '--action', dest='action', choices=['Redact', 'Frame', 'Highlight', 'Squiggly', 'Underline', 'Strikeout', 'Remove'], type=str,
default='Highlight', help="Choose whether to Redact or to Frame or to Highlight or to Squiggly or to Underline or to Strikeout or to Remove")
parser.add_argument('-p', '--pages', dest='pages', type=tuple,
help="Enter the pages to consider e.g.: [2,4]")
action = parser.parse_known_args()[0].action
if action != 'Remove':
parser.add_argument('-s', '--search_str', dest='search_str' # lambda x: os.path.has_valid_dir_syntax(x)
, type=str, required=True, help="Enter a valid search string")
path = parser.parse_known_args()[0].input_path
if os.path.isfile(path):
parser.add_argument('-o', '--output_file', dest='output_file', type=str # lambda x: os.path.has_valid_dir_syntax(x)
, help="Enter a valid output file")
if os.path.isdir(path):
parser.add_argument('-r', '--recursive', dest='recursive', default=False, type=lambda x: (
str(x).lower() in ['true', '1', 'yes']), help="Process Recursively or Non-Recursively")
args = vars(parser.parse_args())
# To Display The Command Line Arguments
print("## Command Arguments #################################################")
print("\n".join("{}:{}".format(i, j) for i, j in args.items()))
print("######################################################################")
return args
Finally, let's write the main code:
if __name__ == '__main__':
# Parsing command line arguments entered by user
args = parse_args()
# If File Path
if os.path.isfile(args['input_path']):
# Extracting File Info
extract_info(input_file=args['input_path'])
# Process a file
process_file(
input_file=args['input_path'], output_file=args['output_file'],
search_str=args['search_str'] if 'search_str' in (args.keys()) else None,
pages=args['pages'], action=args['action']
)
# If Folder Path
elif os.path.isdir(args['input_path']):
# Process a folder
process_folder(
input_folder=args['input_path'],
search_str=args['search_str'] if 'search_str' in (args.keys()) else None,
action=args['action'], pages=args['pages'], recursive=args['recursive']
)
Now let's test our program:
$ python pdf_highlighter.py --help
Output:
usage: pdf_highlighter.py [-h] -i INPUT_PATH [-a {Redact,Frame,Highlight,Squiggly,Underline,Strikeout,Remove}] [-p PAGES]
Available Options
optional arguments:
-h, --help show this help message and exit
-i INPUT_PATH, --input_path INPUT_PATH
Enter the path of the file or the folder to process
-a {Redact,Frame,Highlight,Squiggly,Underline,Strikeout,Remove}, --action {Redact,Frame,Highlight,Squiggly,Underline,Strikeout,Remove}
Choose whether to Redact or to Frame or to Highlight or to Squiggly or to Underline or to Strikeout or to Remove
-p PAGES, --pages PAGES
Enter the pages to consider e.g.: [2,4]
Before exploring our test scenarios, let me clarify a few points:
To avoid a PermissionError, please close the input PDF file before running this utility.
The search string is treated as a regular expression, so e.g. "organi[sz]e" matches both "organise" and "organize".
As a demonstration example, let's highlight the word "BERT" in the BERT paper:
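Since the search string is handed directly to re.findall, any regular expression syntax works. A standalone sketch, independent of the utility above:

```python
import re

line = "Please organise the files; we organize them weekly."
# The character class [sz] makes the pattern match both spellings
matches = re.findall(r"organi[sz]e", line, re.IGNORECASE)
print(matches)  # ['organise', 'organize']
```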
$ python pdf_highlighter.py -i bert-paper.pdf -a Highlight -s "BERT"
Output:
## Command Arguments #################################################
input_path:bert-paper.pdf
action:Highlight
pages:None
search_str:BERT
output_file:None
######################################################################
## File Information ##################################################
File:bert-paper.pdf
Encrypted:False
format:PDF 1.5
title:
author:
subject:
keywords:
creator:LaTeX with hyperref package
producer:pdfTeX-1.40.17
creationDate:D:20190528000751Z
modDate:D:20190528000751Z
trapped:
encryption:None
######################################################################
121 Match(es) Found of Search String BERT In Input File: bert-paper.pdf
As you can see, 121 matches were highlighted. You can use the other annotation options, such as underlining, framing, and so on. Here is the resulting PDF:
Now let's remove the highlighting:
$ python pdf_highlighter.py -i bert-paper.pdf -a Remove
The resulting PDF has the highlights removed.
I invite you to play around with the other actions, as I find it quite interesting to do this automatically with Python.
If you want to highlight text across several PDF files, you can either pass a folder to the -i
parameter, or merge the PDF files and run the code on the resulting single PDF containing all the text you want to highlight.
1648893387
Highlighting or annotating text in a PDF file is a great strategy for reading and retaining key information. The technique helps bring important information to the reader's attention immediately. There is no doubt that text highlighted in yellow is probably the first thing to catch your eye.
Redacting a PDF file lets you hide sensitive information while preserving the document's formatting. This protects private and confidential information before it is shared, and it further strengthens an organization's integrity and credibility in handling sensitive data.
In this tutorial, you will learn how to redact, frame, or highlight text in PDF files using Python.
We will use the PyMuPDF library, a highly versatile, customizable PDF, XPS, and e-book interpreter solution that can be used across a wide range of applications as a PDF renderer, viewer, or toolkit.
The goal of this tutorial is to develop a lightweight command-line utility for redacting, framing, or highlighting text contained in a single PDF file or in a folder containing a collection of PDF files. It will also let you remove highlights from a PDF file or a collection of PDF files.
Let's install the requirements:
$ pip install PyMuPDF==1.18.9
Open a new Python file and let's get started:
# Import Libraries
from typing import Tuple
from io import BytesIO
import os
import argparse
import re
import fitz
def extract_info(input_file: str):
"""
Extracts file info
"""
# Open the PDF
pdfDoc = fitz.open(input_file)
output = {
"File": input_file, "Encrypted": ("True" if pdfDoc.isEncrypted else "False")
}
# If PDF is encrypted the file metadata cannot be extracted
if not pdfDoc.isEncrypted:
for key, value in pdfDoc.metadata.items():
output[key] = value
# To Display File Info
print("## File Information ##################################################")
print("\n".join("{}:{}".format(i, j) for i, j in output.items()))
print("######################################################################")
return True, output
The extract_info() function collects the metadata of a PDF file. The attributes that can be extracted are format, title, author, subject, keywords, creator, producer, creation date, modification date, trapped, encryption, and the number of pages. Note that these attributes cannot be extracted when targeting an encrypted PDF file.
def search_for_text(lines, search_str):
"""
Search for the search string within the document lines
"""
for line in lines:
# Find all matches within one line
results = re.findall(search_str, line, re.IGNORECASE)
# In case multiple matches within one line
for result in results:
yield result
This function searches for a string within the document's lines using re.findall(), passing re.IGNORECASE to make the search case-insensitive.
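Since search_for_text is a generator, matches are produced lazily and consumed as you iterate. A self-contained sketch of the same function (note that a generator object is always truthy, so the later if matched_values check in process_data does not actually detect empty results):

```python
import re

def search_for_text(lines, search_str):
    """Yield every match of search_str within the given lines."""
    for line in lines:
        for result in re.findall(search_str, line, re.IGNORECASE):
            yield result

lines = ["BERT is a language model.", "We fine-tune bert here."]
print(list(search_for_text(lines, "BERT")))  # ['BERT', 'bert']
```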
def redact_matching_data(page, matched_values):
"""
Redacts matching values
"""
matches_found = 0
# Loop through the matched values
for val in matched_values:
matches_found += 1
matching_val_area = page.searchFor(val)
# Redact matching values
[page.addRedactAnnot(area, text=" ", fill=(0, 0, 0))
for area in matching_val_area]
# Apply the redaction
page.apply_redactions()
return matches_found
This function does the following: it loops over the matched values, locates each one on the page with page.searchFor(), adds a redaction annotation over every matching area, and finally applies the redactions.
You can change the redaction color using the fill argument of the page.addRedactAnnot() method; setting it to (0, 0, 0) results in a black redaction. These are RGB values ranging from 0 to 1. For example, (1, 0, 0) results in a red redaction, and so on.
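These fill values are fractions of full intensity rather than the usual 0-255 channels; converting a familiar 8-bit color is straightforward:

```python
def to_unit_rgb(r, g, b):
    """Convert 0-255 RGB channels into the 0-1 floats PyMuPDF expects."""
    return (r / 255, g / 255, b / 255)

print(to_unit_rgb(0, 0, 0))    # (0.0, 0.0, 0.0) -> black redaction
print(to_unit_rgb(255, 0, 0))  # (1.0, 0.0, 0.0) -> red redaction
```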
def frame_matching_data(page, matched_values):
"""
frames matching values
"""
matches_found = 0
# Loop through the matched values
for val in matched_values:
matches_found += 1
matching_val_area = page.searchFor(val)
for area in matching_val_area:
if isinstance(area, fitz.fitz.Rect):
# Draw a rectangle around matched values
annot = page.addRectAnnot(area)
# , fill = fitz.utils.getColor('black')
annot.setColors(stroke=fitz.utils.getColor('red'))
# If you want to remove matched data
#page.addFreetextAnnot(area, ' ')
annot.update()
return matches_found
The frame_matching_data() function draws a red rectangle (frame) around the matched values.
Next, let's define a function for highlighting text:
def highlight_matching_data(page, matched_values, type):
"""
Highlight matching values
"""
matches_found = 0
# Loop through the matched values
for val in matched_values:
matches_found += 1
matching_val_area = page.searchFor(val)
# print("matching_val_area",matching_val_area)
highlight = None
if type == 'Highlight':
highlight = page.addHighlightAnnot(matching_val_area)
elif type == 'Squiggly':
highlight = page.addSquigglyAnnot(matching_val_area)
elif type == 'Underline':
highlight = page.addUnderlineAnnot(matching_val_area)
elif type == 'Strikeout':
highlight = page.addStrikeoutAnnot(matching_val_area)
else:
highlight = page.addHighlightAnnot(matching_val_area)
# To change the highlight color
# highlight.setColors({"stroke":(0,0,1),"fill":(0.75,0.8,0.95) })
# highlight.setColors(stroke = fitz.utils.getColor('white'), fill = fitz.utils.getColor('red'))
# highlight.setColors(colors= fitz.utils.getColor('red'))
highlight.update()
return matches_found
The function above applies the appropriate highlight mode to the matched values depending on the highlight type passed as a parameter.
You can always change the highlight color using the highlight.setColors() method shown in the comments.
def process_data(input_file: str, output_file: str, search_str: str, pages: Tuple = None, action: str = 'Highlight'):
"""
Process the pages of the PDF File
"""
# Open the PDF
pdfDoc = fitz.open(input_file)
# Save the generated PDF to memory buffer
output_buffer = BytesIO()
total_matches = 0
# Iterate through pages
for pg in range(pdfDoc.pageCount):
# If required for specific pages
if pages:
if str(pg) not in pages:
continue
# Select the page
page = pdfDoc[pg]
# Get Matching Data
# Split page by lines
page_lines = page.getText("text").split('\n')
matched_values = search_for_text(page_lines, search_str)
if matched_values:
if action == 'Redact':
matches_found = redact_matching_data(page, matched_values)
elif action == 'Frame':
matches_found = frame_matching_data(page, matched_values)
elif action in ('Highlight', 'Squiggly', 'Underline', 'Strikeout'):
matches_found = highlight_matching_data(
page, matched_values, action)
else:
matches_found = highlight_matching_data(
page, matched_values, 'Highlight')
total_matches += matches_found
print(f"{total_matches} Match(es) Found of Search String {search_str} In Input File: {input_file}")
# Save to output
pdfDoc.save(output_buffer)
pdfDoc.close()
# Save the output buffer to the output file
with open(output_file, mode='wb') as f:
f.write(output_buffer.getbuffer())
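One subtlety of the loop above: page indices are 0-based and are compared as strings against the pages parameter. A standalone sketch of just that filtering step:

```python
def pages_to_process(page_count, pages=None):
    """Replicate the page filter from process_data: 0-based, compared as strings."""
    selected = []
    for pg in range(page_count):
        # Skip pages not listed, exactly as process_data does
        if pages and str(pg) not in pages:
            continue
        selected.append(pg)
    return selected

print(pages_to_process(5, ("2", "4")))  # [2, 4]
print(pages_to_process(3))              # [0, 1, 2]
```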
The main purpose of the process_data() function is to open the input PDF, iterate over its pages, search each page for the given string, apply the requested action to the matches ("Redact", "Frame", "Highlight", etc.), and save the result.
It accepts several parameters:
input_file: the path of the PDF file to process.
output_file: the path of the PDF file to generate after processing.
search_str: the string to search for.
pages: the pages to consider while processing the PDF file.
action: the action to perform on the PDF file.
Next, let's write a function for removing highlights in case we want to:
def remove_highlght(input_file: str, output_file: str, pages: Tuple = None):
# Open the PDF
pdfDoc = fitz.open(input_file)
# Save the generated PDF to memory buffer
output_buffer = BytesIO()
# Initialize a counter for annotations
annot_found = 0
# Iterate through pages
for pg in range(pdfDoc.pageCount):
# If required for specific pages
if pages:
if str(pg) not in pages:
continue
# Select the page
page = pdfDoc[pg]
annot = page.firstAnnot
while annot:
annot_found += 1
page.deleteAnnot(annot)
annot = annot.next
if annot_found > 0:
print(f"{annot_found} Annotation(s) Found In The Input File: {input_file}")
# Save to output
pdfDoc.save(output_buffer)
pdfDoc.close()
# Save the output buffer to the output file
with open(output_file, mode='wb') as f:
f.write(output_buffer.getbuffer())
The purpose of the remove_highlight() function is to remove highlights (not redactions) from a PDF file. It opens the input file, walks every annotation on each of the selected pages, deletes them one by one, and saves the result to the output file.
Now let's create a wrapper function that uses the previous functions to call the appropriate one depending on the action:
def process_file(**kwargs):
"""
To process one single file
Redact, Frame, Highlight... one PDF File
Remove Highlights from a single PDF File
"""
input_file = kwargs.get('input_file')
output_file = kwargs.get('output_file')
if output_file is None:
output_file = input_file
search_str = kwargs.get('search_str')
pages = kwargs.get('pages')
# Redact, Frame, Highlight, Squiggly, Underline, Strikeout, Remove
action = kwargs.get('action')
if action == "Remove":
# Remove the Highlights except Redactions
remove_highlght(input_file=input_file,
output_file=output_file, pages=pages)
else:
process_data(input_file=input_file, output_file=output_file,
search_str=search_str, pages=pages, action=action)
The action can be "Redact", "Frame", "Highlight", "Squiggly", "Underline", "Strikeout", or "Remove".
Let's define the same functionality, but for folders containing multiple PDF files:
def process_folder(**kwargs):
"""
Redact, Frame, Highlight... all PDF Files within a specified path
Remove Highlights from all PDF Files within a specified path
"""
input_folder = kwargs.get('input_folder')
search_str = kwargs.get('search_str')
# Run in recursive mode
recursive = kwargs.get('recursive')
#Redact, Frame, Highlight, Squiggly, Underline, Strikeout, Remove
action = kwargs.get('action')
pages = kwargs.get('pages')
# Loop through the files within the input folder.
for foldername, dirs, filenames in os.walk(input_folder):
for filename in filenames:
# Check if pdf file
if not filename.endswith('.pdf'):
continue
# PDF File found
inp_pdf_file = os.path.join(foldername, filename)
print("Processing file =", inp_pdf_file)
process_file(input_file=inp_pdf_file, output_file=None,
search_str=search_str, action=action, pages=pages)
if not recursive:
break
This function is intended to process the PDF files contained in a specific folder.
It iterates over the files of the specified folder, either recursively or not depending on the value of the recursive parameter, and processes them one by one.
It accepts the following parameters:
input_folder: the path of the folder containing the PDF files to process.
search_str: the text to search for and act on.
recursive: whether to run the process recursively by iterating over subfolders or not.
action: the action to perform, from the previously mentioned list.
pages: the pages to consider.
Before writing our main code, let's create a function for parsing command-line arguments:
def is_valid_path(path):
"""
Validates the path inputted and checks whether it is a file path or a folder path
"""
if not path:
raise ValueError("Invalid Path")
if os.path.isfile(path):
return path
elif os.path.isdir(path):
return path
else:
raise ValueError(f"Invalid Path {path}")
def parse_args():
"""Get user command line parameters"""
parser = argparse.ArgumentParser(description="Available Options")
parser.add_argument('-i', '--input_path', dest='input_path', type=is_valid_path,
required=True, help="Enter the path of the file or the folder to process")
parser.add_argument('-a', '--action', dest='action', choices=['Redact', 'Frame', 'Highlight', 'Squiggly', 'Underline', 'Strikeout', 'Remove'], type=str,
default='Highlight', help="Choose whether to Redact or to Frame or to Highlight or to Squiggly or to Underline or to Strikeout or to Remove")
parser.add_argument('-p', '--pages', dest='pages', type=tuple,
help="Enter the pages to consider e.g.: [2,4]")
action = parser.parse_known_args()[0].action
if action != 'Remove':
parser.add_argument('-s', '--search_str', dest='search_str' # lambda x: os.path.has_valid_dir_syntax(x)
, type=str, required=True, help="Enter a valid search string")
path = parser.parse_known_args()[0].input_path
if os.path.isfile(path):
parser.add_argument('-o', '--output_file', dest='output_file', type=str # lambda x: os.path.has_valid_dir_syntax(x)
, help="Enter a valid output file")
if os.path.isdir(path):
parser.add_argument('-r', '--recursive', dest='recursive', default=False, type=lambda x: (
str(x).lower() in ['true', '1', 'yes']), help="Process Recursively or Non-Recursively")
args = vars(parser.parse_args())
# To Display The Command Line Arguments
print("## Command Arguments #################################################")
print("\n".join("{}:{}".format(i, j) for i, j in args.items()))
print("######################################################################")
return args
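Two details of parse_args are easy to miss. Passing type=tuple makes argparse iterate the raw string character by character, which only lines up with the str(pg) comparison in process_data for single-digit page numbers, and the -r flag is parsed by a small truthiness lambda. Both behaviors in isolation:

```python
# What type=tuple does to the -p argument: the string is split into characters
print(tuple("24"))  # ('2', '4') -> selects pages 2 and 4; multi-digit pages break

# The boolean parser behind the -r/--recursive flag
to_bool = lambda x: str(x).lower() in ['true', '1', 'yes']
print(to_bool("Yes"), to_bool("0"))  # True False
```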
Finally, let's write the main code:
if __name__ == '__main__':
# Parsing command line arguments entered by user
args = parse_args()
# If File Path
if os.path.isfile(args['input_path']):
# Extracting File Info
extract_info(input_file=args['input_path'])
# Process a file
process_file(
input_file=args['input_path'], output_file=args['output_file'],
search_str=args['search_str'] if 'search_str' in (args.keys()) else None,
pages=args['pages'], action=args['action']
)
# If Folder Path
elif os.path.isdir(args['input_path']):
# Process a folder
process_folder(
input_folder=args['input_path'],
search_str=args['search_str'] if 'search_str' in (args.keys()) else None,
action=args['action'], pages=args['pages'], recursive=args['recursive']
)
Now let's test our program:
$ python pdf_highlighter.py --help
Output:
usage: pdf_highlighter.py [-h] -i INPUT_PATH [-a {Redact,Frame,Highlight,Squiggly,Underline,Strikeout,Remove}] [-p PAGES]
Available Options
optional arguments:
-h, --help show this help message and exit
-i INPUT_PATH, --input_path INPUT_PATH
Enter the path of the file or the folder to process
-a {Redact,Frame,Highlight,Squiggly,Underline,Strikeout,Remove}, --action {Redact,Frame,Highlight,Squiggly,Underline,Strikeout,Remove}
Choose whether to Redact or to Frame or to Highlight or to Squiggly or to Underline or to Strikeout or to Remove
-p PAGES, --pages PAGES
Enter the pages to consider e.g.: [2,4]
Before exploring our test scenarios, let me clarify a few points:
To avoid a PermissionError, please close the input PDF file before running this utility.
The search string is treated as a regular expression, so e.g. "organi[sz]e" matches both "organise" and "organize".
As a demonstration example, let's highlight the word "BERT" in the BERT paper:
$ python pdf_highlighter.py -i bert-paper.pdf -a Highlight -s "BERT"
Output:
## Command Arguments #################################################
input_path:bert-paper.pdf
action:Highlight
pages:None
search_str:BERT
output_file:None
######################################################################
## File Information ##################################################
File:bert-paper.pdf
Encrypted:False
format:PDF 1.5
title:
author:
subject:
keywords:
creator:LaTeX with hyperref package
producer:pdfTeX-1.40.17
creationDate:D:20190528000751Z
modDate:D:20190528000751Z
trapped:
encryption:None
######################################################################
121 Match(es) Found of Search String BERT In Input File: bert-paper.pdf
As you can see, 121 matches were highlighted. You can use the other annotation options, such as underlining, framing, and others. Here is the resulting PDF:
Now let's remove the highlighting:
$ python pdf_highlighter.py -i bert-paper.pdf -a Remove
The resulting PDF has the highlights removed.
I invite you to play around with the other actions, as I find it quite interesting to do this automatically with Python.
If you want to highlight text across several PDF files, you can either pass a folder to the -i
parameter, or merge the PDF files and run the code to get a single PDF containing all the text you want to highlight.
1648889718
Highlighting or annotating text in a PDF file is a great strategy for reading and retaining key information. The technique helps bring important information to the reader's attention immediately. There is no doubt that text highlighted in yellow will probably catch your eye first.
Redacting a PDF file lets you hide sensitive information while keeping the document's formatting. This protects private and confidential information before it is shared, and it further strengthens an organization's integrity and credibility in handling sensitive data.
In this tutorial, you will learn how to redact, frame, or highlight text in PDF files using Python.
We will use the PyMuPDF library, a highly versatile, customizable PDF, XPS, and e-book interpreter solution that can be used across a wide range of applications as a PDF renderer, viewer, or toolkit.
The goal of this tutorial is to develop a lightweight command-line utility for redacting, framing, or highlighting text contained in a single PDF file or in a folder containing a collection of PDF files. It will also let you remove highlights from a PDF file or a collection of PDF files.
Let's install the requirements:
$ pip install PyMuPDF==1.18.9
Open a new Python file and let's get started:
# Import Libraries
from typing import Tuple
from io import BytesIO
import os
import argparse
import re
import fitz
def extract_info(input_file: str):
"""
Extracts file info
"""
# Open the PDF
pdfDoc = fitz.open(input_file)
output = {
"File": input_file, "Encrypted": ("True" if pdfDoc.isEncrypted else "False")
}
# If PDF is encrypted the file metadata cannot be extracted
if not pdfDoc.isEncrypted:
for key, value in pdfDoc.metadata.items():
output[key] = value
# To Display File Info
print("## File Information ##################################################")
print("\n".join("{}:{}".format(i, j) for i, j in output.items()))
print("######################################################################")
return True, output
The extract_info() function collects the metadata of a PDF file. The attributes that can be extracted are format, title, author, subject, keywords, creator, producer, creation date, modification date, trapped, encryption, and the number of pages. Note that these attributes cannot be extracted when targeting an encrypted PDF file.
def search_for_text(lines, search_str):
"""
Search for the search string within the document lines
"""
for line in lines:
# Find all matches within one line
results = re.findall(search_str, line, re.IGNORECASE)
# In case multiple matches within one line
for result in results:
yield result
This function searches for a string within the document's lines using re.findall(), passing re.IGNORECASE to make the search case-insensitive.
def redact_matching_data(page, matched_values):
"""
Redacts matching values
"""
matches_found = 0
# Loop through the matched values
for val in matched_values:
matches_found += 1
matching_val_area = page.searchFor(val)
# Redact matching values
[page.addRedactAnnot(area, text=" ", fill=(0, 0, 0))
for area in matching_val_area]
# Apply the redaction
page.apply_redactions()
return matches_found
This function does the following: it loops over the matched values, locates each one on the page with page.searchFor(), adds a redaction annotation over every matching area, and applies the redactions.
You can change the redaction color using the fill argument of the page.addRedactAnnot() method; setting it to (0, 0, 0) results in a black redaction. These are RGB values ranging from 0 to 1. For example, (1, 0, 0) results in a red redaction.
def frame_matching_data(page, matched_values):
"""
frames matching values
"""
matches_found = 0
# Loop through the matched values
for val in matched_values:
matches_found += 1
matching_val_area = page.searchFor(val)
for area in matching_val_area:
if isinstance(area, fitz.fitz.Rect):
# Draw a rectangle around matched values
annot = page.addRectAnnot(area)
# , fill = fitz.utils.getColor('black')
annot.setColors(stroke=fitz.utils.getColor('red'))
# If you want to remove matched data
#page.addFreetextAnnot(area, ' ')
annot.update()
return matches_found
The frame_matching_data() function draws a red rectangle (frame) around the matched values.
Next, let's define a function for highlighting text:
def highlight_matching_data(page, matched_values, type):
"""
Highlight matching values
"""
matches_found = 0
# Loop through the matched values
for val in matched_values:
matches_found += 1
matching_val_area = page.searchFor(val)
# print("matching_val_area",matching_val_area)
highlight = None
if type == 'Highlight':
highlight = page.addHighlightAnnot(matching_val_area)
elif type == 'Squiggly':
highlight = page.addSquigglyAnnot(matching_val_area)
elif type == 'Underline':
highlight = page.addUnderlineAnnot(matching_val_area)
elif type == 'Strikeout':
highlight = page.addStrikeoutAnnot(matching_val_area)
else:
highlight = page.addHighlightAnnot(matching_val_area)
# To change the highlight color
# highlight.setColors({"stroke":(0,0,1),"fill":(0.75,0.8,0.95) })
# highlight.setColors(stroke = fitz.utils.getColor('white'), fill = fitz.utils.getColor('red'))
# highlight.setColors(colors= fitz.utils.getColor('red'))
highlight.update()
return matches_found
The function above applies the appropriate highlight mode to the matched values depending on the highlight type passed as a parameter.
You can always change the highlight color using the highlight.setColors() method shown in the comments.
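The if/elif ladder above simply selects a page method by annotation type; the same choice can be expressed as a dictionary lookup with a default. A sketch using the method names from the code (the function itself is illustrative only):

```python
def pick_annot_method(type_):
    """Map a highlight type to the PyMuPDF page method selected above."""
    handlers = {
        'Highlight': 'addHighlightAnnot',
        'Squiggly': 'addSquigglyAnnot',
        'Underline': 'addUnderlineAnnot',
        'Strikeout': 'addStrikeoutAnnot',
    }
    # Unknown types fall back to a plain highlight, as in the original code
    return handlers.get(type_, 'addHighlightAnnot')

print(pick_annot_method('Underline'))  # addUnderlineAnnot
print(pick_annot_method('Oops'))       # addHighlightAnnot
```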
def process_data(input_file: str, output_file: str, search_str: str, pages: Tuple = None, action: str = 'Highlight'):
"""
Process the pages of the PDF File
"""
# Open the PDF
pdfDoc = fitz.open(input_file)
# Save the generated PDF to memory buffer
output_buffer = BytesIO()
total_matches = 0
# Iterate through pages
for pg in range(pdfDoc.pageCount):
# If required for specific pages
if pages:
if str(pg) not in pages:
continue
# Select the page
page = pdfDoc[pg]
# Get Matching Data
# Split page by lines
page_lines = page.getText("text").split('\n')
matched_values = search_for_text(page_lines, search_str)
if matched_values:
if action == 'Redact':
matches_found = redact_matching_data(page, matched_values)
elif action == 'Frame':
matches_found = frame_matching_data(page, matched_values)
elif action in ('Highlight', 'Squiggly', 'Underline', 'Strikeout'):
matches_found = highlight_matching_data(
page, matched_values, action)
else:
matches_found = highlight_matching_data(
page, matched_values, 'Highlight')
total_matches += matches_found
print(f"{total_matches} Match(es) Found of Search String {search_str} In Input File: {input_file}")
# Save to output
pdfDoc.save(output_buffer)
pdfDoc.close()
# Save the output buffer to the output file
with open(output_file, mode='wb') as f:
f.write(output_buffer.getbuffer())
The main purpose of the process_data() function is to open the input PDF, iterate over its pages, search each page for the given string, apply the requested action to the matches ("Redact", "Frame", "Highlight", etc.), and save the result.
It accepts several parameters:
input_file: the path of the PDF file to process.
output_file: the path of the PDF file to generate after processing.
search_str: the string to search for.
pages: the pages to consider while processing the PDF file.
action: the action to perform on the PDF file.
Next, let's write a function for removing highlights in case we want to:
def remove_highlght(input_file: str, output_file: str, pages: Tuple = None):
# Open the PDF
pdfDoc = fitz.open(input_file)
# Save the generated PDF to memory buffer
output_buffer = BytesIO()
# Initialize a counter for annotations
annot_found = 0
# Iterate through pages
for pg in range(pdfDoc.pageCount):
# If required for specific pages
if pages:
if str(pg) not in pages:
continue
# Select the page
page = pdfDoc[pg]
annot = page.firstAnnot
while annot:
annot_found += 1
page.deleteAnnot(annot)
annot = annot.next
if annot_found > 0:
print(f"{annot_found} Annotation(s) Found In The Input File: {input_file}")
# Save to output
pdfDoc.save(output_buffer)
pdfDoc.close()
# Save the output buffer to the output file
with open(output_file, mode='wb') as f:
f.write(output_buffer.getbuffer())
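The annotation removal above walks a linked list via the next pointer. Stripped of fitz, the traversal pattern looks like this (Annot here is a minimal stand-in, not a PyMuPDF class):

```python
class Annot:
    """Minimal stand-in for an annotation node in a linked list."""
    def __init__(self, next_=None):
        self.next = next_

def count_annots(first):
    """Walk the chain the way the removal loop does, counting nodes."""
    found = 0
    annot = first
    while annot:
        found += 1
        annot = annot.next
    return found

chain = Annot(Annot(Annot()))  # three chained annotations
print(count_annots(chain))  # 3
```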
The purpose of the remove_highlight() function is to remove highlights (not redactions) from a PDF file. It walks every annotation on each of the selected pages, deletes them one by one, and saves the result to the output file.
Next, let's create a wrapper function that uses the previous functions to call the appropriate one depending on the action:
def process_file(**kwargs):
"""
To process one single file
Redact, Frame, Highlight... one PDF File
Remove Highlights from a single PDF File
"""
input_file = kwargs.get('input_file')
output_file = kwargs.get('output_file')
if output_file is None:
output_file = input_file
search_str = kwargs.get('search_str')
pages = kwargs.get('pages')
# Redact, Frame, Highlight, Squiggly, Underline, Strikeout, Remove
action = kwargs.get('action')
if action == "Remove":
# Remove the Highlights except Redactions
remove_highlght(input_file=input_file,
output_file=output_file, pages=pages)
else:
process_data(input_file=input_file, output_file=output_file,
search_str=search_str, pages=pages, action=action)
The action can be "Redact", "Frame", "Highlight", "Squiggly", "Underline", "Strikeout", or "Remove".
Let's define the same functionality, but for folders containing multiple PDF files:
def process_folder(**kwargs):
"""
Redact, Frame, Highlight... all PDF Files within a specified path
Remove Highlights from all PDF Files within a specified path
"""
input_folder = kwargs.get('input_folder')
search_str = kwargs.get('search_str')
# Run in recursive mode
recursive = kwargs.get('recursive')
#Redact, Frame, Highlight, Squiggly, Underline, Strikeout, Remove
action = kwargs.get('action')
pages = kwargs.get('pages')
# Loop through the files within the input folder.
for foldername, dirs, filenames in os.walk(input_folder):
for filename in filenames:
# Check if pdf file
if not filename.endswith('.pdf'):
continue
# PDF File found
inp_pdf_file = os.path.join(foldername, filename)
print("Processing file =", inp_pdf_file)
process_file(input_file=inp_pdf_file, output_file=None,
search_str=search_str, action=action, pages=pages)
if not recursive:
break
This function is intended to process the PDF files contained in a specific folder.
It iterates over the files of the specified folder, either recursively or not depending on the value of the recursive parameter, and processes them one by one.
It accepts the following parameters:
input_folder: the path of the folder containing the PDF files to process.
search_str: the text to search for and act on.
recursive: whether to run the process recursively by iterating over subfolders or not.
action: the action to perform, from the previously mentioned list.
pages: the pages to consider.
Before writing the main code, let's create a function for parsing command-line arguments:
def is_valid_path(path):
"""
Validates the path inputted and checks whether it is a file path or a folder path
"""
if not path:
raise ValueError("Invalid Path")
if os.path.isfile(path):
return path
elif os.path.isdir(path):
return path
else:
raise ValueError(f"Invalid Path {path}")
def parse_args():
"""Get user command line parameters"""
parser = argparse.ArgumentParser(description="Available Options")
parser.add_argument('-i', '--input_path', dest='input_path', type=is_valid_path,
required=True, help="Enter the path of the file or the folder to process")
parser.add_argument('-a', '--action', dest='action', choices=['Redact', 'Frame', 'Highlight', 'Squiggly', 'Underline', 'Strikeout', 'Remove'], type=str,
default='Highlight', help="Choose whether to Redact or to Frame or to Highlight or to Squiggly or to Underline or to Strikeout or to Remove")
parser.add_argument('-p', '--pages', dest='pages', type=tuple,
help="Enter the pages to consider e.g.: [2,4]")
action = parser.parse_known_args()[0].action
if action != 'Remove':
parser.add_argument('-s', '--search_str', dest='search_str' # lambda x: os.path.has_valid_dir_syntax(x)
, type=str, required=True, help="Enter a valid search string")
path = parser.parse_known_args()[0].input_path
if os.path.isfile(path):
parser.add_argument('-o', '--output_file', dest='output_file', type=str # lambda x: os.path.has_valid_dir_syntax(x)
, help="Enter a valid output file")
if os.path.isdir(path):
parser.add_argument('-r', '--recursive', dest='recursive', default=False, type=lambda x: (
str(x).lower() in ['true', '1', 'yes']), help="Process Recursively or Non-Recursively")
args = vars(parser.parse_args())
# To Display The Command Line Arguments
print("## Command Arguments #################################################")
print("\n".join("{}:{}".format(i, j) for i, j in args.items()))
print("######################################################################")
return args
Finally, let's write the main code:
if __name__ == '__main__':
# Parsing command line arguments entered by user
args = parse_args()
# If File Path
if os.path.isfile(args['input_path']):
# Extracting File Info
extract_info(input_file=args['input_path'])
# Process a file
process_file(
input_file=args['input_path'], output_file=args['output_file'],
search_str=args['search_str'] if 'search_str' in (args.keys()) else None,
pages=args['pages'], action=args['action']
)
# If Folder Path
elif os.path.isdir(args['input_path']):
# Process a folder
process_folder(
input_folder=args['input_path'],
search_str=args['search_str'] if 'search_str' in (args.keys()) else None,
action=args['action'], pages=args['pages'], recursive=args['recursive']
)
Now let's test our program:
$ python pdf_highlighter.py --help
Output:
usage: pdf_highlighter.py [-h] -i INPUT_PATH [-a {Redact,Frame,Highlight,Squiggly,Underline,Strikeout,Remove}] [-p PAGES]
Available Options
optional arguments:
-h, --help show this help message and exit
-i INPUT_PATH, --input_path INPUT_PATH
Enter the path of the file or the folder to process
-a {Redact,Frame,Highlight,Squiggly,Underline,Strikeout,Remove}, --action {Redact,Frame,Highlight,Squiggly,Underline,Strikeout,Remove}
Choose whether to Redact or to Frame or to Highlight or to Squiggly or to Underline or to Strikeout or to Remove
-p PAGES, --pages PAGES
Enter the pages to consider e.g.: [2,4]
Before exploring our test scenarios, let me clarify a couple of points:
- To avoid a PermissionError, please close the input PDF file before running this utility.
- You can pass a regular expression as the search string; for instance, "organi[sz]e" matches both "organise" and "organize".
As a demonstration, let's highlight the word "BERT" in the BERT paper:
$ python pdf_highlighter.py -i bert-paper.pdf -a Highlight -s "BERT"
Output:
## Command Arguments #################################################
input_path:bert-paper.pdf
action:Highlight
pages:None
search_str:BERT
output_file:None
######################################################################
## File Information ##################################################
File:bert-paper.pdf
Encrypted:False
format:PDF 1.5
title:
author:
subject:
keywords:
creator:LaTeX with hyperref package
producer:pdfTeX-1.40.17
creationDate:D:20190528000751Z
modDate:D:20190528000751Z
trapped:
encryption:None
######################################################################
121 Match(es) Found of Search String BERT In Input File: bert-paper.pdf
As you can see, 121 matches were highlighted. You can use the other highlighting options, such as underlining or framing. Here is the resulting PDF:
Now let's remove the highlights:
$ python pdf_highlighter.py -i bert-paper.pdf -a Remove
The resulting PDF has the highlights removed.
I invite you to experiment with the other actions, as it is quite interesting to perform them automatically with Python.
If you want to highlight text in multiple PDF files, you can either pass a folder to the -i parameter, or merge the PDF files and run the code on the resulting single PDF containing all the text you want to highlight.
1648886001
Highlighting or annotating text in a PDF file is a great strategy for reading and retaining key information. This technique can immediately draw the reader's attention to important information; there is no doubt that text highlighted in yellow will probably catch your eye first.
Redacting a PDF file lets you hide sensitive information while keeping your document's formatting. It preserves private and confidential information before sharing, and further reinforces an organization's integrity and credibility in handling sensitive information.
In this tutorial, you will learn how to redact, frame, or highlight text in PDF files using Python.
In this guide, we will use the PyMuPDF library, a highly versatile and customizable PDF, XPS, and EBook interpreter solution that can be used across a wide range of applications as a PDF renderer, viewer, or toolkit.
The goal of this tutorial is to develop a lightweight command-line utility to redact, frame, or highlight text included in a PDF file or in a folder containing a collection of PDF files. It will also let you remove highlights from a PDF file or a collection of PDF files.
Let's install the requirements:
$ pip install PyMuPDF==1.18.9
Open a new Python file and let's get started:
# Import Libraries
from typing import Tuple
from io import BytesIO
import os
import argparse
import re
import fitz
def extract_info(input_file: str):
"""
Extracts file info
"""
# Open the PDF
pdfDoc = fitz.open(input_file)
output = {
"File": input_file, "Encrypted": ("True" if pdfDoc.isEncrypted else "False")
}
# If PDF is encrypted the file metadata cannot be extracted
if not pdfDoc.isEncrypted:
for key, value in pdfDoc.metadata.items():
output[key] = value
# To Display File Info
print("## File Information ##################################################")
print("\n".join("{}:{}".format(i, j) for i, j in output.items()))
print("######################################################################")
return True, output
The extract_info() function collects the metadata of a PDF file. The attributes that can be extracted are format, title, author, subject, keywords, creator, producer, creation date, modification date, trapped, encryption, and the number of pages. Note that these attributes cannot be extracted when you target an encrypted PDF file.
def search_for_text(lines, search_str):
"""
Search for the search string within the document lines
"""
for line in lines:
# Find all matches within one line
results = re.findall(search_str, line, re.IGNORECASE)
# In case multiple matches within one line
for result in results:
yield result
This function searches for a string within the document lines using re.findall(); the re.IGNORECASE flag makes the search case-insensitive.
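As a quick sanity check, here is the same generator logic run on a few toy lines (invented for illustration), including a character-class pattern like the "organi[sz]e" example used later:

```python
import re

def search_for_text(lines, search_str):
    # Yield every case-insensitive regex match, line by line
    for line in lines:
        for result in re.findall(search_str, line, re.IGNORECASE):
            yield result

lines = ["BERT is a model.", "We fine-tune bert here."]
print(list(search_for_text(lines, "BERT")))  # ['BERT', 'bert']
print(list(search_for_text(
    ["They organise and organize."], "organi[sz]e")))  # ['organise', 'organize']
```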
def redact_matching_data(page, matched_values):
"""
Redacts matching values
"""
matches_found = 0
# Loop throughout matching values
for val in matched_values:
matches_found += 1
matching_val_area = page.searchFor(val)
# Redact matching values
# Redact each matching area
for area in matching_val_area:
    page.addRedactAnnot(area, text=" ", fill=(0, 0, 0))
# Apply the redaction
page.apply_redactions()
return matches_found
This function redacts the matching values found on a page.
You can change the redaction color using the fill argument of the page.addRedactAnnot() method; setting it to (0, 0, 0) results in a black redaction. These are RGB values ranging from 0 to 1, so (1, 0, 0), for instance, results in a red redaction, and so on.
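Since PyMuPDF expects each RGB component as a float between 0 and 1, a tiny helper (the rgb255_to_unit name is ours, not part of PyMuPDF) can convert the more familiar 0-255 values:

```python
def rgb255_to_unit(r, g, b):
    # Map 0-255 RGB components to the 0-1 floats PyMuPDF expects
    return (r / 255, g / 255, b / 255)

print(rgb255_to_unit(0, 0, 0))    # (0.0, 0.0, 0.0) -> black redaction
print(rgb255_to_unit(255, 0, 0))  # (1.0, 0.0, 0.0) -> red redaction
```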
def frame_matching_data(page, matched_values):
"""
frames matching values
"""
matches_found = 0
# Loop throughout matching values
for val in matched_values:
matches_found += 1
matching_val_area = page.searchFor(val)
for area in matching_val_area:
if isinstance(area, fitz.fitz.Rect):
# Draw a rectangle around matched values
annot = page.addRectAnnot(area)
# , fill = fitz.utils.getColor('black')
annot.setColors(stroke=fitz.utils.getColor('red'))
# If you want to remove matched data
#page.addFreetextAnnot(area, ' ')
annot.update()
return matches_found
The frame_matching_data() function draws a red rectangle (frame) around the matched values.
Next, let's define a function to highlight text:
def highlight_matching_data(page, matched_values, type):
"""
Highlight matching values
"""
matches_found = 0
# Loop throughout matching values
for val in matched_values:
matches_found += 1
matching_val_area = page.searchFor(val)
# print("matching_val_area",matching_val_area)
highlight = None
if type == 'Highlight':
highlight = page.addHighlightAnnot(matching_val_area)
elif type == 'Squiggly':
highlight = page.addSquigglyAnnot(matching_val_area)
elif type == 'Underline':
highlight = page.addUnderlineAnnot(matching_val_area)
elif type == 'Strikeout':
highlight = page.addStrikeoutAnnot(matching_val_area)
else:
highlight = page.addHighlightAnnot(matching_val_area)
# To change the highlight color
# highlight.setColors({"stroke":(0,0,1),"fill":(0.75,0.8,0.95) })
# highlight.setColors(stroke = fitz.utils.getColor('white'), fill = fitz.utils.getColor('red'))
# highlight.setColors(colors= fitz.utils.getColor('red'))
highlight.update()
return matches_found
The function above applies the appropriate highlight mode to the matched values, depending on the highlight type passed as a parameter.
You can always change the highlight color using the highlight.setColors() method, as shown in the comments.
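The if/elif chain in highlight_matching_data() can also be expressed as a dictionary dispatch; the sketch below uses stand-in callables rather than the real PyMuPDF annotation methods:

```python
# Stand-in annotators (hypothetical; real code would call page.add*Annot)
annotators = {
    'Highlight': lambda area: f"highlight:{area}",
    'Squiggly': lambda area: f"squiggly:{area}",
    'Underline': lambda area: f"underline:{area}",
    'Strikeout': lambda area: f"strikeout:{area}",
}

def annotate(action, area):
    # Fall back to Highlight for unknown actions, as the function above does
    return annotators.get(action, annotators['Highlight'])(area)

print(annotate('Underline', 'rect1'))  # underline:rect1
print(annotate('Unknown', 'rect2'))    # highlight:rect2
```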
def process_data(input_file: str, output_file: str, search_str: str, pages: Tuple = None, action: str = 'Highlight'):
"""
Process the pages of the PDF File
"""
# Open the PDF
pdfDoc = fitz.open(input_file)
# Save the generated PDF to memory buffer
output_buffer = BytesIO()
total_matches = 0
# Iterate through pages
for pg in range(pdfDoc.pageCount):
# If required for specific pages
if pages:
if str(pg) not in pages:
continue
# Select the page
page = pdfDoc[pg]
# Get Matching Data
# Split page by lines
page_lines = page.getText("text").split('\n')
matched_values = search_for_text(page_lines, search_str)
if matched_values:
if action == 'Redact':
matches_found = redact_matching_data(page, matched_values)
elif action == 'Frame':
matches_found = frame_matching_data(page, matched_values)
elif action in ('Highlight', 'Squiggly', 'Underline', 'Strikeout'):
matches_found = highlight_matching_data(
page, matched_values, action)
else:
matches_found = highlight_matching_data(
page, matched_values, 'Highlight')
total_matches += matches_found
print(f"{total_matches} Match(es) Found of Search String {search_str} In Input File: {input_file}")
# Save to output
pdfDoc.save(output_buffer)
pdfDoc.close()
# Save the output buffer to the output file
with open(output_file, mode='wb') as f:
f.write(output_buffer.getbuffer())
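The save step at the end of process_data() first writes the PDF into an in-memory buffer, then flushes that buffer to disk. Here is the same pattern in isolation, with fake bytes standing in for pdfDoc.save():

```python
import os
import tempfile
from io import BytesIO

# Write the "generated" bytes into memory first, as pdfDoc.save(output_buffer) does
output_buffer = BytesIO()
output_buffer.write(b"%PDF-1.5 fake content")

# Then dump the whole buffer to the output file in one go
out_path = os.path.join(tempfile.gettempdir(), "buffer_demo.bin")
with open(out_path, mode='wb') as f:
    f.write(output_buffer.getbuffer())

with open(out_path, mode='rb') as f:
    data = f.read()
print(data == b"%PDF-1.5 fake content")  # True
```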
The main purpose of the process_data() function is to open the input PDF, search for the given string on the requested pages, and apply the selected action ("Redact", "Frame", "Highlight", etc.) to the matches before saving the result.
It accepts several parameters:
- input_file: the path of the PDF file to process.
- output_file: the path of the PDF file to generate after processing.
- search_str: the string to search for.
- pages: the pages to consider when processing the PDF file.
- action: the action to perform on the PDF file.
Next, let's write a function to remove highlights in case we want to:
def remove_highlght(input_file: str, output_file: str, pages: Tuple = None):
# Open the PDF
pdfDoc = fitz.open(input_file)
# Save the generated PDF to memory buffer
output_buffer = BytesIO()
# Initialize a counter for annotations
annot_found = 0
# Iterate through pages
for pg in range(pdfDoc.pageCount):
# If required for specific pages
if pages:
if str(pg) not in pages:
continue
# Select the page
page = pdfDoc[pg]
annot = page.firstAnnot
while annot:
annot_found += 1
page.deleteAnnot(annot)
annot = annot.next
if annot_found > 0:
print(f"{annot_found} Annotation(s) Found In The Input File: {input_file}")
# Save to output
pdfDoc.save(output_buffer)
pdfDoc.close()
# Save the output buffer to the output file
with open(output_file, mode='wb') as f:
f.write(output_buffer.getbuffer())
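The annot = page.firstAnnot / annot = annot.next loop above is a plain linked-list walk; the same pattern looks like this with a toy Annot class (our invention, not a PyMuPDF type):

```python
class Annot:
    """Minimal stand-in for a PyMuPDF annotation node."""
    def __init__(self, name, next_annot=None):
        self.name = name
        self.next = next_annot

# Chain of three annotations, like page.firstAnnot -> .next -> .next
first = Annot("highlight", Annot("underline", Annot("squiggly")))

annot_found = 0
annot = first
while annot:
    annot_found += 1
    annot = annot.next
print(annot_found)  # 3
```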
The purpose of the remove_highlght() function is to remove highlights (but not redactions) from a PDF file.
Now let's create a wrapper that uses the previous functions to call the appropriate one depending on the action:
def process_file(**kwargs):
"""
To process one single file
Redact, Frame, Highlight... one PDF File
Remove Highlights from a single PDF File
"""
input_file = kwargs.get('input_file')
output_file = kwargs.get('output_file')
if output_file is None:
output_file = input_file
search_str = kwargs.get('search_str')
pages = kwargs.get('pages')
# Redact, Frame, Highlight, Squiggly, Underline, Strikeout, Remove
action = kwargs.get('action')
if action == "Remove":
# Remove the Highlights except Redactions
remove_highlght(input_file=input_file,
output_file=output_file, pages=pages)
else:
process_data(input_file=input_file, output_file=output_file,
search_str=search_str, pages=pages, action=action)
The action can be "Redact", "Frame", "Highlight", "Squiggly", "Underline", "Strikeout", or "Remove".
Now let's define the same functionality for folders containing multiple PDF files:
def process_folder(**kwargs):
"""
Redact, Frame, Highlight... all PDF Files within a specified path
Remove Highlights from all PDF Files within a specified path
"""
input_folder = kwargs.get('input_folder')
search_str = kwargs.get('search_str')
# Run in recursive mode
recursive = kwargs.get('recursive')
#Redact, Frame, Highlight, Squiggly, Underline, Strikeout, Remove
action = kwargs.get('action')
pages = kwargs.get('pages')
# Loop though the files within the input folder.
for foldername, dirs, filenames in os.walk(input_folder):
for filename in filenames:
# Check if pdf file
if not filename.endswith('.pdf'):
continue
# PDF File found
inp_pdf_file = os.path.join(foldername, filename)
print("Processing file =", inp_pdf_file)
process_file(input_file=inp_pdf_file, output_file=None,
search_str=search_str, action=action, pages=pages)
if not recursive:
break
This function is intended to process the PDF files included in a specific folder.
It loops through the files of the specified folder, either recursively or not depending on the value of the recursive parameter, and processes these files one by one.
It accepts the following parameters:
- input_folder: the path of the folder containing the PDF files to process.
- search_str: the text to search for, in order to manipulate it.
- recursive: whether to run this process recursively by looping through the subfolders, or not.
- action: the action to perform, from the previously mentioned list.
- pages: the pages to consider.
Before writing our main code, let's create a function for parsing command-line arguments:
def is_valid_path(path):
"""
Validates the path inputted and checks whether it is a file path or a folder path
"""
if not path:
raise ValueError(f"Invalid Path")
if os.path.isfile(path):
return path
elif os.path.isdir(path):
return path
else:
raise ValueError(f"Invalid Path {path}")
def parse_args():
"""Get user command line parameters"""
parser = argparse.ArgumentParser(description="Available Options")
parser.add_argument('-i', '--input_path', dest='input_path', type=is_valid_path,
required=True, help="Enter the path of the file or the folder to process")
parser.add_argument('-a', '--action', dest='action', choices=['Redact', 'Frame', 'Highlight', 'Squiggly', 'Underline', 'Strikeout', 'Remove'], type=str,
default='Highlight', help="Choose whether to Redact or to Frame or to Highlight or to Squiggly or to Underline or to Strikeout or to Remove")
parser.add_argument('-p', '--pages', dest='pages', type=tuple,
help="Enter the pages to consider e.g.: [2,4]")
action = parser.parse_known_args()[0].action
if action != 'Remove':
parser.add_argument('-s', '--search_str', dest='search_str' # lambda x: os.path.has_valid_dir_syntax(x)
, type=str, required=True, help="Enter a valid search string")
path = parser.parse_known_args()[0].input_path
if os.path.isfile(path):
parser.add_argument('-o', '--output_file', dest='output_file', type=str # lambda x: os.path.has_valid_dir_syntax(x)
, help="Enter a valid output file")
if os.path.isdir(path):
parser.add_argument('-r', '--recursive', dest='recursive', default=False, type=lambda x: (
str(x).lower() in ['true', '1', 'yes']), help="Process Recursively or Non-Recursively")
args = vars(parser.parse_args())
# To Display The Command Line Arguments
print("## Command Arguments #################################################")
print("\n".join("{}:{}".format(i, j) for i, j in args.items()))
print("######################################################################")
return args
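parse_args() uses a two-stage trick: it registers a first set of arguments, peeks at their values with parse_known_args(), and only then registers the arguments that depend on them. A minimal, self-contained sketch of the same idea (with a shortened argument set):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-a', '--action', default='Highlight')

# Peek at --action; unknown flags like -s are tolerated at this stage
action = parser.parse_known_args(['-a', 'Redact', '-s', 'secret'])[0].action
if action != 'Remove':
    # Only require a search string when we are not removing highlights
    parser.add_argument('-s', '--search_str', required=True)

args = vars(parser.parse_args(['-a', 'Redact', '-s', 'secret']))
print(args)  # {'action': 'Redact', 'search_str': 'secret'}
```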
Finally, let's write the main code:
if __name__ == '__main__':
# Parsing command line arguments entered by user
args = parse_args()
# If File Path
if os.path.isfile(args['input_path']):
# Extracting File Info
extract_info(input_file=args['input_path'])
# Process a file
process_file(
input_file=args['input_path'], output_file=args['output_file'],
search_str=args['search_str'] if 'search_str' in (args.keys()) else None,
pages=args['pages'], action=args['action']
)
# If Folder Path
elif os.path.isdir(args['input_path']):
# Process a folder
process_folder(
input_folder=args['input_path'],
search_str=args['search_str'] if 'search_str' in (args.keys()) else None,
action=args['action'], pages=args['pages'], recursive=args['recursive']
)
Now let's test our program:
$ python pdf_highlighter.py --help
Output:
usage: pdf_highlighter.py [-h] -i INPUT_PATH [-a {Redact,Frame,Highlight,Squiggly,Underline,Strikeout,Remove}] [-p PAGES]
Available Options
optional arguments:
-h, --help show this help message and exit
-i INPUT_PATH, --input_path INPUT_PATH
Enter the path of the file or the folder to process
-a {Redact,Frame,Highlight,Squiggly,Underline,Strikeout,Remove}, --action {Redact,Frame,Highlight,Squiggly,Underline,Strikeout,Remove}
Choose whether to Redact or to Frame or to Highlight or to Squiggly or to Underline or to Strikeout or to Remove
-p PAGES, --pages PAGES
Enter the pages to consider e.g.: [2,4]
Before exploring our test scenarios, let me clarify a couple of points:
- To avoid a PermissionError, please close the input PDF file before running this utility.
- You can pass a regular expression as the search string; for instance, "organi[sz]e" matches both "organise" and "organize".
As a demonstration, let's highlight the word "BERT" in the BERT paper:
$ python pdf_highlighter.py -i bert-paper.pdf -a Highlight -s "BERT"
Output:
## Command Arguments #################################################
input_path:bert-paper.pdf
action:Highlight
pages:None
search_str:BERT
output_file:None
######################################################################
## File Information ##################################################
File:bert-paper.pdf
Encrypted:False
format:PDF 1.5
title:
author:
subject:
keywords:
creator:LaTeX with hyperref package
producer:pdfTeX-1.40.17
creationDate:D:20190528000751Z
modDate:D:20190528000751Z
trapped:
encryption:None
######################################################################
121 Match(es) Found of Search String BERT In Input File: bert-paper.pdf
As you can see, 121 matches were highlighted; you can use the other highlighting options, such as underlining or framing. Here is the resulting PDF:
Now let's remove the highlights:
$ python pdf_highlighter.py -i bert-paper.pdf -a Remove
The resulting PDF has the highlights removed.
I invite you to experiment with the other actions, as I find it quite interesting to do this automatically with Python.
If you want to highlight text from multiple PDF files, you can either pass a folder to the -i parameter, or merge the PDF files and run the code on the resulting single PDF containing all the text you want to highlight.
1648855380
Nowadays, medium and large companies deal with massive amounts of printed documents on a daily basis, among them invoices, receipts, corporate documents, reports, and press releases.
For these companies, using an OCR scanner can save a considerable amount of time while improving efficiency and accuracy.
Optical character recognition (OCR) algorithms allow computers to analyze printed or handwritten documents automatically and prepare the text data in editable formats, so that computers can process them efficiently. OCR systems transform a two-dimensional image of text, which can contain machine-printed or handwritten text, from its image representation into machine-readable text.
Generally, an OCR engine involves several steps required to train a machine-learning algorithm for efficient problem solving with optical character recognition.
The exact steps, which can differ from one engine to another, are roughly what is needed to approach automatic character recognition.
In this tutorial, I will show you how to build such a utility in Python. Please note that this tutorial is about extracting text from images within PDF documents.
To get started, we need to use the following libraries:
Tesseract OCR: an open-source text recognition engine available under the Apache 2.0 license, whose development has been sponsored by Google since 2006. As of 2006, Tesseract was considered one of the most accurate open-source OCR engines. You can use it directly, or use its API to extract printed text from images. The best part is that it supports a wide variety of languages.
Installing the Tesseract engine is outside the scope of this article. However, you should follow the official Tesseract installation guide to install it on your operating system.
To validate the Tesseract setup, run the tesseract command from a terminal and check the generated output.
Python-tesseract: a Python wrapper for Google's Tesseract-OCR engine. It is also useful as a standalone invocation script for tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others.
OpenCV: an open-source Python library for computer vision, machine learning, and image processing. OpenCV supports a wide variety of programming languages such as Python, C++, Java, and more. It can process images and videos to identify objects, faces, or even human handwriting.
PyMuPDF: MuPDF is a highly versatile and customizable PDF, XPS, and eBook interpreter solution that can be used across a wide range of applications as a PDF renderer, viewer, or toolkit. PyMuPDF is a Python binding for MuPDF; it is a lightweight PDF and XPS viewer.
Numpy: a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays. It is the fundamental package for scientific computing with Python. In addition, Numpy can also be used as an efficient multidimensional container of generic data.
Pillow: built on top of PIL (Python Imaging Library). It is an essential module for image processing in Python.
Pandas: an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Filetype: a small, dependency-free Python package to infer file types and MIME types.
This tutorial aims to develop a lightweight command-line utility for extracting, redacting, or highlighting text included in an image or a scanned PDF file, or in a folder containing a collection of PDF files.
To get started, let's install the requirements:
$ pip install Filetype==1.0.7 numpy==1.19.4 opencv-python==4.4.0.46 pandas==1.1.4 Pillow==8.0.1 PyMuPDF==1.18.9 pytesseract==0.3.7
Let's start by importing the necessary libraries:
import os
import re
import argparse
import pytesseract
from pytesseract import Output
import cv2
import numpy as np
import fitz
from io import BytesIO
from PIL import Image
import pandas as pd
import filetype
# Path Of The Tesseract OCR engine
TESSERACT_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"  # raw string: avoids "\t" being read as a tab
# Include tesseract executable
pytesseract.pytesseract.tesseract_cmd = TESSERACT_PATH
TESSERACT_PATH is where the Tesseract executable is located; obviously, you need to change it for your case.
def pix2np(pix):
"""
Converts a pixmap buffer into a numpy array
"""
# pix.samples = sequence of bytes of the image pixels like RGBA
#pix.h = height in pixels
#pix.w = width in pixels
# pix.n = number of components per pixel (depends on the colorspace and alpha)
im = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
pix.h, pix.w, pix.n)
try:
im = np.ascontiguousarray(im[..., [2, 1, 0]]) # RGB To BGR
except IndexError:
# Convert Gray to RGB
im = cv2.cvtColor(im, cv2.COLOR_GRAY2RGB)
im = np.ascontiguousarray(im[..., [2, 1, 0]]) # RGB To BGR
return im
This function converts a pixmap buffer, representing a screenshot taken with the PyMuPDF library, into a NumPy array.
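To see what pix2np() does with the raw buffer, here is the same frombuffer/reshape/channel-swap sequence applied to a hand-made 2x2 RGB buffer (the pixel values are invented):

```python
import numpy as np

# Flat byte sequence, as pix.samples delivers it: h=2, w=2, n=3 (RGB)
samples = bytes([255, 0, 0,   0, 255, 0,
                 0, 0, 255,   10, 20, 30])
h, w, n = 2, 2, 3
im = np.frombuffer(samples, dtype=np.uint8).reshape(h, w, n)
# Reorder RGB to BGR, the channel order OpenCV expects
im_bgr = np.ascontiguousarray(im[..., [2, 1, 0]])
print(im_bgr[0, 0].tolist())  # [0, 0, 255] -> the red pixel, now in BGR
```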
To improve Tesseract's accuracy, let's define some preprocessing functions using OpenCV:
# Image pre-processing functions to improve output accuracy
# Convert to grayscale
def grayscale(img):
return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Remove noise
def remove_noise(img):
return cv2.medianBlur(img, 5)
# Thresholding
def threshold(img):
# return cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
return cv2.threshold(img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
# dilation
def dilate(img):
kernel = np.ones((5, 5), np.uint8)
return cv2.dilate(img, kernel, iterations=1)
# erosion
def erode(img):
kernel = np.ones((5, 5), np.uint8)
return cv2.erode(img, kernel, iterations=1)
# opening -- erosion followed by a dilation
def opening(img):
kernel = np.ones((5, 5), np.uint8)
return cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)
# canny edge detection
def canny(img):
return cv2.Canny(img, 100, 200)
# skew correction
def deskew(img):
coords = np.column_stack(np.where(img > 0))
angle = cv2.minAreaRect(coords)[-1]
if angle < -45:
angle = -(90 + angle)
else:
angle = -angle
(h, w) = img.shape[:2]
center = (w//2, h//2)
M = cv2.getRotationMatrix2D(center, angle, 1.0)
rotated = cv2.warpAffine(
img, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
return rotated
# template matching
def match_template(img, template):
return cv2.matchTemplate(img, template, cv2.TM_CCOEFF_NORMED)
def convert_img2bin(img):
"""
Pre-processes the image and generates a binary output
"""
# Convert the image into a grayscale image
output_img = grayscale(img)
# Invert the grayscale image by flipping pixel values:
# each pixel value v becomes 255 - v, so dark text becomes light and vice versa
output_img = cv2.bitwise_not(output_img)
# Convert the image to binary by thresholding, to get a clear separation between white and black pixels
output_img = threshold(output_img)
return output_img
We have defined functions for many preprocessing tasks, including converting images to grayscale, flipping pixel values, separating white and black pixels, and much more.
Next, let's define a function to display an image:
def display_img(title, img):
"""Displays an image on screen and maintains the output until the user presses a key"""
cv2.namedWindow('img', cv2.WINDOW_NORMAL)
cv2.setWindowTitle('img', title)
cv2.resizeWindow('img', 1200, 900)
# Display Image on screen
cv2.imshow('img', img)
# Maintain output until user presses a key
cv2.waitKey(0)
# Destroy windows when user presses a key
cv2.destroyAllWindows()
The display_img() function displays an image on screen in a window whose title is the title parameter, and keeps this window open until the user presses a keyboard key.
def generate_ss_text(ss_details):
"""Loops through the captured text of an image and arranges this text line by line.
This function depends on the image layout."""
# Arrange the captured text after scanning the page
parse_text = []
word_list = []
last_word = ''
# Loop through the captured text of the entire page
for word in ss_details['text']:
# If the word captured is not empty
if word != '':
# Add it to the line word list
word_list.append(word)
last_word = word
if (last_word != '' and word == '') or (word == ss_details['text'][-1]):
parse_text.append(word_list)
word_list = []
return parse_text
The function above loops through the captured text of an image and arranges the grabbed text line by line. It depends on the image layout and may need tweaking for some image formats.
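A quick way to convince yourself of the grouping behavior is to run the same loop on a fake Tesseract output dict (the 'text' list below is invented): empty strings act as line separators.

```python
# Fake Tesseract detail dict: '' marks the end of a line
ss_details = {'text': ['Hello', 'world', '', 'second', 'line']}

parse_text, word_list, last_word = [], [], ''
for word in ss_details['text']:
    if word != '':
        word_list.append(word)
        last_word = word
    # Close the current line on a separator or on the very last word
    if (last_word != '' and word == '') or (word == ss_details['text'][-1]):
        parse_text.append(word_list)
        word_list = []
print(parse_text)  # [['Hello', 'world'], ['second', 'line']]
```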
Next, let's define a function to search for text using regular expressions:
def search_for_text(ss_details, search_str):
"""Search for the search string within the image content"""
# Find all matches within one page (join the grabbed word list into one string first)
results = re.findall(search_str, ' '.join(ss_details['text']), re.IGNORECASE)
# In case multiple matches within one page
for result in results:
yield result
We will use this function to search for specific text within the grabbed content of an image. It returns a generator of the matches found.
def save_page_content(pdfContent, page_id, page_data):
"""Appends the content of a scanned page, line by line, to a pandas DataFrame."""
if page_data:
for idx, line in enumerate(page_data, 1):
line = ' '.join(line)
pdfContent = pdfContent.append(
{'page': page_id, 'line_id': idx, 'line': line}, ignore_index=True
)
return pdfContent
The save_page_content() function appends the grabbed content of an image, line by line after scanning it, to the pdfContent pandas DataFrame.
Now let's create a function to save the resulting DataFrame to a CSV file:
def save_file_content(pdfContent, input_file):
"""Outputs the content of the pandas DataFrame to a CSV file having the same path as the input_file
but with different extension (.csv)"""
content_file = os.path.join(os.path.dirname(input_file), os.path.splitext(
os.path.basename(input_file))[0] + ".csv")
pdfContent.to_csv(content_file, sep=',', index=False)
return content_file
Next, let's write a function that calculates the confidence score of the text grabbed from the scanned image:
def calculate_ss_confidence(ss_details: dict):
"""Calculate the confidence score of the text grabbed from the scanned image."""
# page_num --> Page number of the detected text or item
# block_num --> Block number of the detected text or item
# par_num --> Paragraph number of the detected text or item
# line_num --> Line number of the detected text or item
# Convert the dict to dataFrame
df = pd.DataFrame.from_dict(ss_details)
# Convert the field conf (confidence) to numeric
df['conf'] = pd.to_numeric(df['conf'], errors='coerce')
# Eliminate records with negative confidence
df = df[df.conf != -1]
# Calculate the mean confidence by page
conf = df.groupby(['page_num'])['conf'].mean().tolist()
return conf[0]
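The confidence computation can be checked on a tiny fake detail dict (values invented); the -1 entry, which Tesseract uses for non-text items, is dropped before averaging:

```python
import pandas as pd

ss_details = {
    'page_num': [1, 1, 1],
    'conf': ['-1', '80', '90'],  # strings, as Tesseract returns them
}
df = pd.DataFrame.from_dict(ss_details)
# Convert confidence to numeric and drop non-text items (conf == -1)
df['conf'] = pd.to_numeric(df['conf'], errors='coerce')
df = df[df.conf != -1]
# Mean confidence per page
conf = df.groupby(['page_num'])['conf'].mean().tolist()
print(conf[0])  # 85.0
```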
Now to the main function: scanning the image:
def ocr_img(
img: np.array, input_file: str, search_str: str,
highlight_readable_text: bool = False, action: str = 'Highlight',
show_comparison: bool = False, generate_output: bool = True):
"""Scans an image buffer or an image file.
Pre-processes the image.
Calls the Tesseract engine with pre-defined parameters.
Calculates the confidence score of the image grabbed content.
Draws a green rectangle around readable text items having a confidence score > 30.
Searches for a specific text.
Highlight or redact found matches of the searched text.
Displays a window showing readable text fields or the highlighted or redacted text.
Generates the text content of the image.
Prints a summary to the console."""
# If image source file is inputted as a parameter
if input_file:
# Reading image using opencv
img = cv2.imread(input_file)
# Preserve a copy of this image for comparison purposes
initial_img = img.copy()
highlighted_img = img.copy()
# Convert image to binary
bin_img = convert_img2bin(img)
# Calling Tesseract
# Tesseract Configuration parameters
# oem --> OCR Engine Mode = 3 >> default, based on what is available (the LSTM neural net mode works best)
# psm --> page segmentation mode = 6 >> Assume as single uniform block of text (How a page of text can be analyzed)
config_param = r'--oem 3 --psm 6'
# Feeding image to tesseract
details = pytesseract.image_to_data(
bin_img, output_type=Output.DICT, config=config_param, lang='eng')
# The details dictionary contains the information of the input image
# such as detected text, region, position, information, height, width, confidence score.
ss_confidence = calculate_ss_confidence(details)
boxed_img = None
# Total readable items
ss_readable_items = 0
# Total matches found
ss_matches = 0
for seq in range(len(details['text'])):
# Consider only text fields with confidence score > 30 (text is readable)
if float(details['conf'][seq]) > 30.0:
ss_readable_items += 1
# Draws a green rectangle around readable text items having a confidence score > 30
if highlight_readable_text:
(x, y, w, h) = (details['left'][seq], details['top']
[seq], details['width'][seq], details['height'][seq])
boxed_img = cv2.rectangle(
img, (x, y), (x+w, y+h), (0, 255, 0), 2)
# Searches for the string
if search_str:
results = re.findall(
search_str, details['text'][seq], re.IGNORECASE)
for result in results:
ss_matches += 1
if action:
# Draw a red rectangle around the searchable text
(x, y, w, h) = (details['left'][seq], details['top']
[seq], details['width'][seq], details['height'][seq])
# Details of the rectangle
# Starting coordinate representing the top left corner of the rectangle
start_point = (x, y)
# Ending coordinate representing the bottom right corner of the rectangle
end_point = (x + w, y + h)
#Color in BGR -- Blue, Green, Red
if action == "Highlight":
color = (0, 255, 255) # Yellow
elif action == "Redact":
color = (0, 0, 0) # Black
# Thickness in px (-1 will fill the entire shape)
thickness = -1
boxed_img = cv2.rectangle(
img, start_point, end_point, color, thickness)
if ss_readable_items > 0 and highlight_readable_text and not (ss_matches > 0 and action in ("Highlight", "Redact")):
highlighted_img = boxed_img.copy()
# Highlight found matches of the search string
if ss_matches > 0 and action == "Highlight":
cv2.addWeighted(boxed_img, 0.4, highlighted_img,
1 - 0.4, 0, highlighted_img)
# Redact found matches of the search string
elif ss_matches > 0 and action == "Redact":
highlighted_img = boxed_img.copy()
#cv2.addWeighted(boxed_img, 1, highlighted_img, 0, 0, highlighted_img)
# save the image
cv2.imwrite("highlighted-text-image.jpg", highlighted_img)
# Displays window showing readable text fields or the highlighted or redacted data
if show_comparison and (highlight_readable_text or action):
title = input_file if input_file else 'Compare'
conc_img = cv2.hconcat([initial_img, highlighted_img])
display_img(title, conc_img)
# Generates the text content of the image
output_data = None
if generate_output and details:
output_data = generate_ss_text(details)
# Prints a summary to the console
if input_file:
summary = {
"File": input_file, "Total readable words": ss_readable_items, "Total matches": ss_matches, "Confidence score": ss_confidence
}
# Printing Summary
print("## Summary ########################################################")
print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
print("###################################################################")
return highlighted_img, ss_readable_items, ss_matches, ss_confidence, output_data
# pass image into pytesseract module
# pytesseract is trained in many languages
#config_param = r'--oem 3 --psm 6'
#details = pytesseract.image_to_data(img,config=config_param,lang='eng')
# print(details)
# return details
The code above does the following:
- Scans an image buffer or an image file and pre-processes it.
- Calls the Tesseract engine with pre-defined parameters.
- Calculates the confidence score of the grabbed content.
- Draws a green rectangle around readable text items having a confidence score greater than 30.
- Searches for a specific text and highlights or redacts the found matches.
- Displays a window showing the readable text fields or the highlighted or redacted text.
- Generates the text content of the image and prints a summary to the console.
def image_to_byte_array(image: Image):
"""
Converts an image into a byte array
"""
imgByteArr = BytesIO()
image.save(imgByteArr, format=image.format if image.format else 'JPEG')
imgByteArr = imgByteArr.getvalue()
return imgByteArr
def ocr_file(**kwargs):
"""Opens the input PDF File.
Opens a memory buffer for storing the output PDF file.
Creates a DataFrame for storing pages statistics
Iterates throughout the chosen pages of the input PDF file
Grabs a screen-shot of the selected PDF page.
Converts the screen-shot pix to a numpy array
Scans the grabbed screen-shot.
Collects the statistics of the screen-shot(page).
Saves the content of the screen-shot(page).
Adds the updated screen-shot (Highlighted, Redacted) to the output file.
Saves the whole content of the PDF file.
Saves the output PDF file if required.
Prints a summary to the console."""
input_file = kwargs.get('input_file')
output_file = kwargs.get('output_file')
search_str = kwargs.get('search_str')
pages = kwargs.get('pages')
highlight_readable_text = kwargs.get('highlight_readable_text')
action = kwargs.get('action')
show_comparison = kwargs.get('show_comparison')
generate_output = kwargs.get('generate_output')
# Opens the input PDF file
pdfIn = fitz.open(input_file)
# Opens a memory buffer for storing the output PDF file.
pdfOut = fitz.open()
# Creates an empty DataFrame for storing pages statistics
dfResult = pd.DataFrame(
columns=['page', 'page_readable_items', 'page_matches', 'page_total_confidence'])
# Creates an empty DataFrame for storing file content
if generate_output:
pdfContent = pd.DataFrame(columns=['page', 'line_id', 'line'])
# Iterate throughout the pages of the input file
for pg in range(pdfIn.pageCount):
if str(pages) != str(None):
if str(pg) not in str(pages):
continue
# Select a page
page = pdfIn[pg]
# Rotation angle
rotate = int(0)
# PDF Page is converted into a whole picture 1056*816 and then for each picture a screenshot is taken.
# zoom = 1.33333333 -----> Image size = 1056*816
# zoom = 2 ---> 2 * Default Resolution (text is clear, image text is hard to read) = filesize small / Image size = 1584*1224
# zoom = 4 ---> 4 * Default Resolution (text is clear, image text is barely readable) = filesize large
# zoom = 8 ---> 8 * Default Resolution (text is clear, image text is readable) = filesize large
zoom_x = 2
zoom_y = 2
# The zoom factor is equal to 2 in order to make text clear
# Pre-rotate is to rotate if needed.
mat = fitz.Matrix(zoom_x, zoom_y).preRotate(rotate)
# To capture a specific part of the PDF page
# rect = page.rect #page size
# mp = rect.tl + (rect.bl - (0.75)/zoom_x) #rectangular area 56 = 75/1.3333
# clip = fitz.Rect(mp,rect.br) #The area to capture
# pix = page.getPixmap(matrix=mat, alpha=False,clip=clip)
# Get a screen-shot of the PDF page
# Colorspace -> represents the color space of the pixmap (csRGB, csGRAY, csCMYK)
# alpha -> Transparency indicator
pix = page.getPixmap(matrix=mat, alpha=False, colorspace="csGRAY")
# convert the screen-shot pix to numpy array
img = pix2np(pix)
# Erode image to omit or thin the boundaries of the bright area of the image
# We apply Erosion on binary images.
#kernel = np.ones((2,2) , np.uint8)
#img = cv2.erode(img,kernel,iterations=1)
upd_np_array, pg_readable_items, pg_matches, pg_total_confidence, pg_output_data \
= ocr_img(img=img, input_file=None, search_str=search_str, highlight_readable_text=highlight_readable_text # False
, action=action # 'Redact'
, show_comparison=show_comparison # True
, generate_output=generate_output # False
)
# Collects the statistics of the page
dfResult = dfResult.append({'page': (pg+1), 'page_readable_items': pg_readable_items,
'page_matches': pg_matches, 'page_total_confidence': pg_total_confidence}, ignore_index=True)
if generate_output:
pdfContent = save_page_content(
pdfContent=pdfContent, page_id=(pg+1), page_data=pg_output_data)
# Convert the numpy array to image object with mode = RGB
#upd_img = Image.fromarray(np.uint8(upd_np_array)).convert('RGB')
upd_img = Image.fromarray(upd_np_array[..., ::-1])
# Convert the image to byte array
upd_array = image_to_byte_array(upd_img)
# Get Page Size
"""
#To check whether initial page is portrait or landscape
if page.rect.width > page.rect.height:
fmt = fitz.PaperRect("a4-1")
else:
fmt = fitz.PaperRect("a4")
#pno = -1 -> Insert after last page
pageo = pdfOut.newPage(pno = -1, width = fmt.width, height = fmt.height)
"""
pageo = pdfOut.newPage(
pno=-1, width=page.rect.width, height=page.rect.height)
pageo.insertImage(page.rect, stream=upd_array)
#pageo.insertImage(page.rect, stream=upd_img.tobytes())
#pageo.showPDFpage(pageo.rect, pdfDoc, page.number)
content_file = None
if generate_output:
content_file = save_file_content(
pdfContent=pdfContent, input_file=input_file)
summary = {
"File": input_file, "Total pages": pdfIn.pageCount,
"Processed pages": dfResult['page'].count(), "Total readable words": dfResult['page_readable_items'].sum(),
"Total matches": dfResult['page_matches'].sum(), "Confidence score": dfResult['page_total_confidence'].mean(),
"Output file": output_file, "Content file": content_file
}
# Printing Summary
print("## Summary ########################################################")
print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
print("\nPages Statistics:")
print(dfResult, sep='\n')
print("###################################################################")
pdfIn.close()
if output_file:
pdfOut.save(output_file)
pdfOut.close()
The image_to_byte_array() function converts an image into a byte array.
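The BytesIO round-trip at the heart of image_to_byte_array() can be shown with plain bytes, no Pillow involved; the buffer_to_bytes helper and the sample PNG-like bytes below are illustrative only:

```python
from io import BytesIO

def buffer_to_bytes(write_fn):
    """Let a writer function fill an in-memory buffer, then return its raw bytes."""
    buf = BytesIO()
    write_fn(buf)          # e.g. image.save(buf, format='JPEG') in the real code
    return buf.getvalue()

data = buffer_to_bytes(lambda b: b.write(b"\x89PNG..."))
print(data)  # b'\x89PNG...'
```

The real function does exactly this, with `image.save()` as the writer.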
The ocr_file() function does the following:
- Opens the input PDF file and a memory buffer for storing the output PDF file.
- Creates a DataFrame for storing page statistics.
- Iterates through the chosen pages of the input PDF file, grabs a screenshot (pixmap) of each page, and converts it to a NumPy array.
- Scans the grabbed screenshot and collects its statistics and content.
- Adds the updated (highlighted or redacted) screenshot to the output file.
- Saves the output PDF file if required and prints a summary to the console.
Let's add another function for processing a folder that contains multiple PDF files:
def ocr_folder(**kwargs):
"""Scans all PDF Files within a specified path"""
input_folder = kwargs.get('input_folder')
# Run in recursive mode
recursive = kwargs.get('recursive')
search_str = kwargs.get('search_str')
pages = kwargs.get('pages')
action = kwargs.get('action')
generate_output = kwargs.get('generate_output')
# Loop through the files within the input folder.
for foldername, dirs, filenames in os.walk(input_folder):
for filename in filenames:
# Check if pdf file
if not filename.endswith('.pdf'):
continue
# PDF File found
inp_pdf_file = os.path.join(foldername, filename)
print("Processing file =", inp_pdf_file)
output_file = None
if search_str:
# Generate an output file
output_file = os.path.join(os.path.dirname(
inp_pdf_file), 'ocr_' + os.path.basename(inp_pdf_file))
ocr_file(
input_file=inp_pdf_file, output_file=output_file, search_str=search_str, pages=pages, highlight_readable_text=False, action=action, show_comparison=False, generate_output=generate_output
)
if not recursive:
break
This function is meant to scan the PDF files included within a specific folder. It loops through the files of the specified folder, either recursively or not depending on the value of the recursive parameter, and processes these files one by one.
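The walk-and-break pattern ocr_folder() relies on can be sketched on its own; the list_pdfs helper and the throwaway directory layout below are illustrative, not part of the utility:

```python
import os
import tempfile

def list_pdfs(input_folder, recursive=False):
    """Collect .pdf paths under input_folder; stop after the top level unless recursive."""
    found = []
    for foldername, dirs, filenames in os.walk(input_folder):
        for filename in filenames:
            if filename.endswith('.pdf'):
                found.append(os.path.join(foldername, filename))
        if not recursive:
            break  # same early exit ocr_folder uses after the first (top-level) iteration
    return found

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for p in ("a.pdf", "b.txt", os.path.join("sub", "c.pdf")):
        open(os.path.join(root, p), "w").close()
    print(len(list_pdfs(root)))                  # 1 -- only a.pdf
    print(len(list_pdfs(root, recursive=True)))  # 2 -- a.pdf and sub/c.pdf
```

Because os.walk yields the top-level folder first, breaking after one iteration is all it takes to make the scan non-recursive.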
It accepts the following parameters:
- input_folder: The path of the folder containing the PDF files to process.
- search_str: The text to search for, in order to manipulate it.
- recursive: Whether to run this process recursively, looping over the subfolders, or not.
- action: The action to perform among the following: Highlight, Redact.
- pages: The pages to consider.
- generate_output: Whether to save the content of the input PDF file to a CSV file or not.

Before wrapping up, let's define some useful functions for parsing command-line arguments:
def is_valid_path(path):
"""Validates the path inputted and checks whether it is a file path or a folder path"""
if not path:
raise ValueError("Invalid Path")
if os.path.isfile(path):
return path
elif os.path.isdir(path):
return path
else:
raise ValueError(f"Invalid Path {path}")
def parse_args():
"""Get user command line parameters"""
parser = argparse.ArgumentParser(description="Available Options")
parser.add_argument('-i', '--input-path', type=is_valid_path,
required=True, help="Enter the path of the file or the folder to process")
parser.add_argument('-a', '--action', choices=[
'Highlight', 'Redact'], type=str, help="Choose to highlight or to redact")
parser.add_argument('-s', '--search-str', dest='search_str',
type=str, help="Enter a valid search string")
parser.add_argument('-p', '--pages', dest='pages', type=tuple,
help="Enter the pages to consider in the PDF file, e.g. (0,1)")
parser.add_argument("-g", "--generate-output", action="store_true", help="Generate text content in a CSV file")
path = parser.parse_known_args()[0].input_path
if os.path.isfile(path):
parser.add_argument('-o', '--output_file', dest='output_file',
type=str, help="Enter a valid output file")
parser.add_argument("-t", "--highlight-readable-text", action="store_true", help="Highlight readable text in the generated image")
parser.add_argument("-c", "--show-comparison", action="store_true", help="Show comparison between captured image and the generated image")
if os.path.isdir(path):
parser.add_argument("-r", "--recursive", action="store_true", help="Whether to process the directory recursively")
# Parse the command-line arguments
args = vars(parser.parse_args())
# To Display The Command Line Arguments
print("## Command Arguments #################################################")
print("\n".join("{}:{}".format(i, j) for i, j in args.items()))
print("######################################################################")
return args
The is_valid_path() function validates a path entered as a parameter and checks whether it is a file path or a directory path.
The parse_args() function defines and sets the appropriate constraints for the user's command-line arguments when running this utility.
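One notable trick in parse_args() is that it registers extra flags only after peeking at the input path with parse_known_args(); that pattern can be shown in isolation (build_args and its trimmed-down flag set are illustrative, not the utility's full interface):

```python
import argparse
import os
import tempfile

def build_args(argv):
    """Register file-only or folder-only flags depending on the input path."""
    parser = argparse.ArgumentParser(description="Available Options")
    parser.add_argument('-i', '--input-path', required=True)
    # Peek at the path before the full parse to decide which flags should exist
    path = parser.parse_known_args(argv)[0].input_path
    if os.path.isfile(path):
        parser.add_argument('-o', '--output-file')
    if os.path.isdir(path):
        parser.add_argument('-r', '--recursive', action='store_true')
    return vars(parser.parse_args(argv))

with tempfile.TemporaryDirectory() as d:
    args = build_args(['-i', d, '-r'])
    print(args['recursive'])      # True
    print('output_file' in args)  # False -- the file-only flag was never registered
```

parse_known_args() ignores flags it does not know yet, so the early peek at --input-path does not fail on -r.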
Below are explanations for all the parameters:
- input_path: A required parameter for entering the path of the file or the folder to process; this parameter is tied to the previously defined is_valid_path() function.
- action: The action to perform, chosen from a predefined list of options to avoid any erroneous selection.
- search_str: The text to search for, in order to manipulate it.
- pages: The pages to consider when processing a PDF file.
- generate_output: Whether to save the grabbed content of the input file, be it an image or a PDF, to a CSV file or not.
- output_file: The path of the output file. Filling in this argument is only allowed when a file, not a directory, is selected as input.
- highlight_readable_text: Draws green rectangles around readable text fields having a confidence score greater than 30.
- show_comparison: Displays a window showing a comparison between the original image and the processed image.
- recursive: Whether to process a folder recursively or not. Filling in this argument is only allowed when a directory is selected as input.

Finally, let's write the main code that uses the previously defined functions:
if __name__ == '__main__':
# Parsing command line arguments entered by user
args = parse_args()
# If File Path
if os.path.isfile(args['input_path']):
# Process a file
if filetype.is_image(args['input_path']):
ocr_img(
# if 'search_str' in (args.keys()) else None
img=None, input_file=args['input_path'], search_str=args['search_str'], highlight_readable_text=args['highlight_readable_text'], action=args['action'], show_comparison=args['show_comparison'], generate_output=args['generate_output']
)
else:
ocr_file(
input_file=args['input_path'], output_file=args['output_file'], search_str=args['search_str'] if 'search_str' in (args.keys()) else None, pages=args['pages'], highlight_readable_text=args['highlight_readable_text'], action=args['action'], show_comparison=args['show_comparison'], generate_output=args['generate_output']
)
# If Folder Path
elif os.path.isdir(args['input_path']):
# Process a folder
ocr_folder(
input_folder=args['input_path'], recursive=args['recursive'], search_str=args['search_str'] if 'search_str' in (args.keys()) else None, pages=args['pages'], action=args['action'], generate_output=args['generate_output']
)
Let's test our program:
$ python pdf_ocr.py
Output:
usage: pdf_ocr.py [-h] -i INPUT_PATH [-a {Highlight,Redact}] [-s SEARCH_STR] [-p PAGES] [-g]
Available Options
optional arguments:
-h, --help show this help message and exit
-i INPUT_PATH, --input-path INPUT_PATH
Enter the path of the file or the folder to process
-a {Highlight,Redact}, --action {Highlight,Redact}
Choose to highlight or to redact
-s SEARCH_STR, --search-str SEARCH_STR
Enter a valid search string
-p PAGES, --pages PAGES
Enter the pages to consider e.g.: (0,1)
-g, --generate-output
Generate text content in a CSV file
Before exploring our test scenarios, beware of the following point:
- To avoid a PermissionError, close the input file before running this utility.

Let's first try inputting an image (you can get it here if you want the same output), with no PDF file involved:
$ python pdf_ocr.py -s "BERT" -a Highlight -i example-image-containing-text.jpg
Here is the output:
## Command Arguments #################################################
input_path:example-image-containing-text.jpg
action:Highlight
search_str:BERT
pages:None
generate_output:False
output_file:None
highlight_readable_text:False
show_comparison:False
######################################################################
## Summary ########################################################
File:example-image-containing-text.jpg
Total readable words:192
Total matches:3
Confidence score:89.89337547979804
###################################################################
And a new image appeared in the current directory:
You can pass -t or --highlight-readable-text to highlight all the detected text (with a different format, so the searched string can be distinguished from the rest).
You can also pass -c or --show-comparison to display the original image and the edited image in the same window.
Now that this works for images, let's try it on PDF files:
$ python pdf_ocr.py -s "BERT" -i image.pdf -o output.pdf --generate-output -a "Highlight"
image.pdf is a simple PDF file containing the image from the previous example (again, you can get it here).
This time we passed a PDF file to the -i argument, and output.pdf as the resulting PDF file (where all the highlighting happens). The above command generates the following output:
## Command Arguments #################################################
input_path:image.pdf
action:Highlight
search_str:BERT
pages:None
generate_output:True
output_file:output.pdf
highlight_readable_text:False
show_comparison:False
######################################################################
## Summary ########################################################
File:image.pdf
Total pages:1
Processed pages:1
Total readable words:192.0
Total matches:3.0
Confidence score:83.1775128855722
Output file:output.pdf
Content file:image.csv
Pages Statistics:
page page_readable_items page_matches page_total_confidence
0 1.0 192.0 3.0 83.177513
###################################################################
The output.pdf file is produced after execution; it contains the same original PDF, but with the matched text highlighted. We also now have statistics about our PDF file: 192 words in total were detected, and 3 were matched by our search with a confidence of about 83.2%.
A CSV file is also generated; it includes the detected text of the image, one line per row.
There are other parameters we did not use in our examples; feel free to explore them. You can also pass an entire folder to the -i argument to scan a collection of PDF files.
Tesseract is perfect for scanning clean, clear documents. A poor-quality scan may produce poor OCR results; normally, it does not give accurate results for images affected by artifacts such as partial occlusion, distorted perspective, or a complex background.
Original article source: https://www.thepythoncode.com
1648848120
Nowadays, medium and large companies use massive amounts of printed documents on a daily basis. Among them are invoices, receipts, corporate documents, reports, and media releases.
For these companies, using an OCR scanner can save a considerable amount of time while improving efficiency and accuracy.
Optical character recognition (OCR) algorithms allow computers to analyze printed or handwritten documents automatically and prepare the text data into editable formats so computers can process it efficiently. OCR systems transform a two-dimensional image of text, which could contain machine-printed or handwritten text, from its image representation into machine-readable text.
Generally, an OCR engine involves multiple steps required to train a machine-learning algorithm for efficient problem solving with optical character recognition.
The following steps, which may differ from one engine to another, are roughly needed to approach automatic character recognition:
In this tutorial, we will show you the following.
Note that this tutorial is about extracting text from images within PDF documents.
To get started, we need to use the following libraries:
Tesseract OCR: An open-source text-recognition engine available under the Apache 2.0 license; its development has been sponsored by Google since 2006. In 2006, Tesseract was considered one of the most accurate open-source OCR engines. You can use it directly, or use its API to extract printed text from images. The best part is that it supports a wide variety of languages.
Installing the Tesseract engine is outside the scope of this article. However, you need to follow Tesseract's official installation guide to install it on your operating system.
To validate your Tesseract setup, run the following command and check the generated output.
Python-tesseract: A Python wrapper for Google's Tesseract-OCR engine. It is also useful as a standalone invocation script for Tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others.
OpenCV: A Python open-source library for computer vision, machine learning, and image processing. OpenCV supports various programming languages such as Python, C++, and Java. It can process images and videos to identify objects, faces, and even human handwriting.
PyMuPDF: MuPDF is a highly versatile, customizable PDF, XPS, and eBook interpreter solution that can be used across a wide range of applications as a PDF renderer, viewer, or toolkit. PyMuPDF is the Python binding for MuPDF; it is a lightweight PDF and XPS viewer.
NumPy: A general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays. It is the fundamental package for scientific computing with Python. Moreover, NumPy can also be used as an efficient multidimensional container of generic data.
Pillow: Built on top of PIL (Python Imaging Library). It is an essential module for image processing in Python.
Pandas: An open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data-analysis tools for the Python programming language.
Filetype: A small and dependency-free Python package for inferring file and MIME types.
This tutorial aims to develop a lightweight command-line-based utility to extract, redact, or highlight text included within an image, a scanned PDF file, or a folder containing a collection of PDF files.
To get started, let's install the requirements:
$ pip install Filetype==1.0.7 numpy==1.19.4 opencv-python==4.4.0.46 pandas==1.1.4 Pillow==8.0.1 PyMuPDF==1.18.9 pytesseract==0.3.7
Let's start by importing the necessary libraries:
import os
import re
import argparse
import pytesseract
from pytesseract import Output
import cv2
import numpy as np
import fitz
from io import BytesIO
from PIL import Image
import pandas as pd
import filetype
# Path Of The Tesseract OCR engine
TESSERACT_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Include tesseract executable
pytesseract.pytesseract.tesseract_cmd = TESSERACT_PATH
TESSERACT_PATH is where the Tesseract executable is located; obviously, you need to change it for your case.
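On systems where Tesseract is already on the PATH, you can avoid hard-coding the location; the find_binary helper below is our own illustrative sketch, not part of pytesseract:

```python
import shutil

def find_binary(name, fallback):
    """Return the absolute path of `name` if it is on PATH, else the fallback path."""
    found = shutil.which(name)
    return found if found else fallback

# Falls back to the hard-coded Windows location when tesseract is not on PATH
TESSERACT_PATH = find_binary("tesseract", r"C:\Program Files\Tesseract-OCR\tesseract.exe")
print(TESSERACT_PATH)
```

You would then assign the result to pytesseract.pytesseract.tesseract_cmd exactly as before.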
def pix2np(pix):
"""
Converts a pixmap buffer into a numpy array
"""
# pix.samples = sequence of bytes of the image pixels like RGBA
#pix.h = height in pixels
#pix.w = width in pixels
# pix.n = number of components per pixel (depends on the colorspace and alpha)
im = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
pix.h, pix.w, pix.n)
try:
im = np.ascontiguousarray(im[..., [2, 1, 0]]) # RGB To BGR
except IndexError:
# Convert Gray to RGB
im = cv2.cvtColor(im, cv2.COLOR_GRAY2RGB)
im = np.ascontiguousarray(im[..., [2, 1, 0]]) # RGB To BGR
return im
This function converts a pixmap buffer, which represents a screenshot taken using the PyMuPDF library, into a NumPy array.
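The buffer-to-array reshaping (and the RGB-to-BGR swap) can be sketched without PyMuPDF or NumPy, using a stand-in pixmap; the FakePix class and its field names simply mirror the attributes pix2np reads and are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class FakePix:
    # Minimal stand-in for a fitz.Pixmap: raw sample bytes plus dimensions.
    samples: bytes
    w: int
    h: int
    n: int  # components per pixel (3 = RGB)

def pix_to_rows(pix):
    """Reshape the flat sample buffer into rows of (B, G, R) pixel tuples."""
    rows = []
    stride = pix.w * pix.n
    for y in range(pix.h):
        row = []
        for x in range(pix.w):
            off = y * stride + x * pix.n
            r, g, b = pix.samples[off:off + pix.n]
            row.append((b, g, r))  # swap RGB -> BGR, as pix2np does for OpenCV
        rows.append(row)
    return rows

# A 1x2 RGB "image": one red pixel, one green pixel
pix = FakePix(samples=bytes([255, 0, 0, 0, 255, 0]), w=2, h=1, n=3)
print(pix_to_rows(pix))  # [[(0, 0, 255), (0, 255, 0)]]
```

The real function does the same reshape in one np.frombuffer(...).reshape(h, w, n) call.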
To improve Tesseract's accuracy, let's define some pre-processing functions using OpenCV:
# Image pre-processing functions to improve output accuracy
# Convert to grayscale
def grayscale(img):
return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Remove noise
def remove_noise(img):
return cv2.medianBlur(img, 5)
# Thresholding
def threshold(img):
# return cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
return cv2.threshold(img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
# dilation
def dilate(img):
kernel = np.ones((5, 5), np.uint8)
return cv2.dilate(img, kernel, iterations=1)
# erosion
def erode(img):
kernel = np.ones((5, 5), np.uint8)
return cv2.erode(img, kernel, iterations=1)
# opening -- erosion followed by a dilation
def opening(img):
kernel = np.ones((5, 5), np.uint8)
return cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)
# canny edge detection
def canny(img):
return cv2.Canny(img, 100, 200)
# skew correction
def deskew(img):
coords = np.column_stack(np.where(img > 0))
angle = cv2.minAreaRect(coords)[-1]
if angle < -45:
angle = -(90 + angle)
else:
angle = -angle
(h, w) = img.shape[:2]
center = (w//2, h//2)
M = cv2.getRotationMatrix2D(center, angle, 1.0)
rotated = cv2.warpAffine(
img, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
return rotated
# template matching
def match_template(img, template):
return cv2.matchTemplate(img, template, cv2.TM_CCOEFF_NORMED)
def convert_img2bin(img):
"""
Pre-processes the image and generates a binary output
"""
# Convert the image into a grayscale image
output_img = grayscale(img)
# Invert the grayscale image by flipping pixel values.
# Each pixel value x becomes 255 - x
output_img = cv2.bitwise_not(output_img)
# Convert the image to binary by thresholding in order to show a clear separation between white and black pixels.
output_img = threshold(output_img)
return output_img
We have defined functions for many pre-processing tasks, including converting the image to grayscale, flipping pixel values, and separating white and black pixels.
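The grayscale-invert-threshold pipeline of convert_img2bin() can be sketched in pure Python on a tiny pixel grid; a fixed threshold stands in for Otsu's method here, which picks the threshold automatically:

```python
def to_binary(bgr_rows, thresh=128):
    """Grayscale each BGR pixel, invert it, then binarize against a fixed threshold."""
    out = []
    for row in bgr_rows:
        out_row = []
        for (b, g, r) in row:
            gray = round(0.114 * b + 0.587 * g + 0.299 * r)  # OpenCV's BGR->gray weights
            inverted = 255 - gray                            # like cv2.bitwise_not
            out_row.append(255 if inverted > thresh else 0)  # like cv2.threshold(..., THRESH_BINARY)
        out.append(out_row)
    return out

# Dark text pixel on a white background: text becomes white (255), background black (0)
print(to_binary([[(20, 20, 20), (250, 250, 250)]]))  # [[255, 0]]
```

After this step, the text Tesseract has to read stands out as bright foreground on a dark background.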
Next, let's define a function to display an image:
def display_img(title, img):
"""Displays an image on screen and maintains the output until the user presses a key"""
cv2.namedWindow('img', cv2.WINDOW_NORMAL)
cv2.setWindowTitle('img', title)
cv2.resizeWindow('img', 1200, 900)
# Display Image on screen
cv2.imshow('img', img)
# Maintain output until user presses a key
cv2.waitKey(0)
# Destroy windows when user presses a key
cv2.destroyAllWindows()
The display_img() function displays an image on screen in a window whose title is set to the title parameter, and keeps this window open until the user presses a key on the keyboard.
def generate_ss_text(ss_details):
"""Loops through the captured text of an image and arranges this text line by line.
This function depends on the image layout."""
# Arrange the captured text after scanning the page
parse_text = []
word_list = []
last_word = ''
# Loop through the captured text of the entire page
for word in ss_details['text']:
# If the word captured is not empty
if word != '':
# Add it to the line word list
word_list.append(word)
last_word = word
if (last_word != '' and word == '') or (word == ss_details['text'][-1]):
parse_text.append(word_list)
word_list = []
return parse_text
The function above iterates through the entire captured text of an image and arranges the grabbed text line by line. It depends on the image layout, and some image formats may need tweaking.
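The line-grouping logic can be exercised on a hand-made Tesseract-style word list (the sample words below are made up); group_lines is an illustrative reimplementation of the same idea:

```python
def group_lines(text_words):
    """Group a Tesseract-style word list into lines, splitting on empty entries."""
    lines, current = [], []
    for i, word in enumerate(text_words):
        if word != '':
            current.append(word)
        # An empty word (or the end of the list) closes the current line
        if (word == '' or i == len(text_words) - 1) and current:
            lines.append(current)
            current = []
    return lines

words = ['Hello', 'world', '', 'second', 'line']
print(group_lines(words))  # [['Hello', 'world'], ['second', 'line']]
```

Tesseract emits an empty string between lines of the image, which is what makes this splitting rule work.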
Next, let's define a function to search for text using regular expressions:
def search_for_text(ss_details, search_str):
"""Search for the search string within the image content"""
# ss_details['text'] is a list of words, so join it into one string before matching
results = re.findall(search_str, ' '.join(ss_details['text']), re.IGNORECASE)
# In case of multiple matches within one page
for result in results:
yield result
We will use this function to search for specific text within the grabbed content of an image. It returns a generator of the found matches.
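The generator behaviour can be demonstrated on plain text (the sample sentence and search string below are made up):

```python
import re

def find_matches(text, search_str):
    """Yield every case-insensitive match of search_str found in text."""
    for result in re.findall(search_str, text, re.IGNORECASE):
        yield result

matches = list(find_matches("BERT is a model; bert is popular.", "BERT"))
print(matches)  # ['BERT', 'bert']
```

re.IGNORECASE is what lets a single search string catch both upper- and lower-case occurrences on the scanned page.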
def save_page_content(pdfContent, page_id, page_data):
"""Appends the content of a scanned page, line by line, to a pandas DataFrame."""
if page_data:
for idx, line in enumerate(page_data, 1):
line = ' '.join(line)
pdfContent = pdfContent.append(
{'page': page_id, 'line_id': idx, 'line': line}, ignore_index=True
)
return pdfContent
The save_page_content() function appends the grabbed content of an image, line by line after scanning it, to the pdfContent pandas DataFrame.
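The same bookkeeping can be sketched without pandas, accumulating plain dict rows whose keys mirror the DataFrame columns above; save_page_rows is illustrative only:

```python
def save_page_rows(rows, page_id, page_data):
    """Append one dict per scanned line, mirroring the pandas DataFrame columns."""
    if page_data:
        for idx, line_words in enumerate(page_data, 1):
            rows.append({'page': page_id, 'line_id': idx, 'line': ' '.join(line_words)})
    return rows

rows = save_page_rows([], 1, [['Hello', 'world'], ['second', 'line']])
print(rows)
# [{'page': 1, 'line_id': 1, 'line': 'Hello world'},
#  {'page': 1, 'line_id': 2, 'line': 'second line'}]
```
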
Next, let's create a function to save the resulting DataFrame to a CSV file:
def save_file_content(pdfContent, input_file):
"""Outputs the content of the pandas DataFrame to a CSV file having the same path as the input_file
but with different extension (.csv)"""
content_file = os.path.join(os.path.dirname(input_file), os.path.splitext(
os.path.basename(input_file))[0] + ".csv")
pdfContent.to_csv(content_file, sep=',', index=False)
return content_file
Next, let's create a function to calculate the confidence score of the text grabbed from the scanned image:
def calculate_ss_confidence(ss_details: dict):
"""Calculate the confidence score of the text grabbed from the scanned image."""
# page_num --> Page number of the detected text or item
# block_num --> Block number of the detected text or item
# par_num --> Paragraph number of the detected text or item
# line_num --> Line number of the detected text or item
# Convert the dict to dataFrame
df = pd.DataFrame.from_dict(ss_details)
# Convert the field conf (confidence) to numeric
df['conf'] = pd.to_numeric(df['conf'], errors='coerce')
# Eliminate records with negative confidence
df = df[df.conf != -1]
# Calculate the mean confidence by page
conf = df.groupby(['page_num'])['conf'].mean().tolist()
return conf[0]
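Stripped of the pandas machinery, the score is just the mean of the valid per-word confidences; a pandas-free sketch (the sample values are made up, and -1 is the placeholder Tesseract emits for non-word items):

```python
def mean_confidence(conf_values):
    """Average the confidence values, ignoring the -1 placeholders Tesseract emits."""
    valid = [float(c) for c in conf_values if float(c) != -1]
    return sum(valid) / len(valid) if valid else 0.0

print(mean_confidence(['-1', '95', '85', '-1', '90']))  # 90.0
```
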
Moving on to the main function: scanning an image:
def ocr_img(
img: np.array, input_file: str, search_str: str,
highlight_readable_text: bool = False, action: str = 'Highlight',
show_comparison: bool = False, generate_output: bool = True):
"""Scans an image buffer or an image file.
Pre-processes the image.
Calls the Tesseract engine with pre-defined parameters.
Calculates the confidence score of the image grabbed content.
Draws a green rectangle around readable text items having a confidence score > 30.
Searches for a specific text.
Highlight or redact found matches of the searched text.
Displays a window showing readable text fields or the highlighted or redacted text.
Generates the text content of the image.
Prints a summary to the console."""
# If image source file is inputted as a parameter
if input_file:
# Reading image using opencv
img = cv2.imread(input_file)
# Preserve a copy of this image for comparison purposes
initial_img = img.copy()
highlighted_img = img.copy()
# Convert image to binary
bin_img = convert_img2bin(img)
# Calling Tesseract
# Tesseract Configuration parameters
# oem --> OCR engine mode = 3 >> Legacy + LSTM engines (the LSTM neural net mode works best)
# psm --> page segmentation mode = 6 >> Assume as single uniform block of text (How a page of text can be analyzed)
config_param = r'--oem 3 --psm 6'
# Feeding image to tesseract
details = pytesseract.image_to_data(
bin_img, output_type=Output.DICT, config=config_param, lang='eng')
# The details dictionary contains the information of the input image
# such as detected text, region, position, information, height, width, confidence score.
ss_confidence = calculate_ss_confidence(details)
boxed_img = None
# Total readable items
ss_readable_items = 0
# Total matches found
ss_matches = 0
for seq in range(len(details['text'])):
# Consider only text fields with confidence score > 30 (text is readable)
if float(details['conf'][seq]) > 30.0:
ss_readable_items += 1
# Draws a green rectangle around readable text items having a confidence score > 30
if highlight_readable_text:
(x, y, w, h) = (details['left'][seq], details['top']
[seq], details['width'][seq], details['height'][seq])
boxed_img = cv2.rectangle(
img, (x, y), (x+w, y+h), (0, 255, 0), 2)
# Searches for the string
if search_str:
results = re.findall(
search_str, details['text'][seq], re.IGNORECASE)
for result in results:
ss_matches += 1
if action:
# Draw a red rectangle around the searchable text
(x, y, w, h) = (details['left'][seq], details['top']
[seq], details['width'][seq], details['height'][seq])
# Details of the rectangle
# Starting coordinate representing the top left corner of the rectangle
start_point = (x, y)
# Ending coordinate representing the bottom right corner of the rectangle
end_point = (x + w, y + h)
#Color in BGR -- Blue, Green, Red
if action == "Highlight":
color = (0, 255, 255) # Yellow
elif action == "Redact":
color = (0, 0, 0) # Black
# Thickness in px (-1 will fill the entire shape)
thickness = -1
boxed_img = cv2.rectangle(
img, start_point, end_point, color, thickness)
if ss_readable_items > 0 and highlight_readable_text and not (ss_matches > 0 and action in ("Highlight", "Redact")):
highlighted_img = boxed_img.copy()
# Highlight found matches of the search string
if ss_matches > 0 and action == "Highlight":
cv2.addWeighted(boxed_img, 0.4, highlighted_img,
1 - 0.4, 0, highlighted_img)
# Redact found matches of the search string
elif ss_matches > 0 and action == "Redact":
highlighted_img = boxed_img.copy()
#cv2.addWeighted(boxed_img, 1, highlighted_img, 0, 0, highlighted_img)
# save the image
cv2.imwrite("highlighted-text-image.jpg", highlighted_img)
# Displays window showing readable text fields or the highlighted or redacted data
if show_comparison and (highlight_readable_text or action):
title = input_file if input_file else 'Compare'
conc_img = cv2.hconcat([initial_img, highlighted_img])
display_img(title, conc_img)
# Generates the text content of the image
output_data = None
if generate_output and details:
output_data = generate_ss_text(details)
# Prints a summary to the console
if input_file:
summary = {
"File": input_file, "Total readable words": ss_readable_items, "Total matches": ss_matches, "Confidence score": ss_confidence
}
# Printing Summary
print("## Summary ########################################################")
print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
print("###################################################################")
return highlighted_img, ss_readable_items, ss_matches, ss_confidence, output_data
# pass image into pytesseract module
# pytesseract is trained in many languages
#config_param = r'--oem 3 --psm 6'
#details = pytesseract.image_to_data(img,config=config_param,lang='eng')
# print(details)
# return details
The code above does the following:
- Scans an image buffer or an image file and pre-processes it.
- Calls the Tesseract engine with pre-defined parameters.
- Calculates the confidence score of the grabbed content.
- Draws a green rectangle around readable text items having a confidence score greater than 30.
- Searches for a specific text and highlights or redacts the found matches.
- Displays a window showing the readable text fields or the highlighted or redacted text.
- Generates the text content of the image and prints a summary to the console.
def image_to_byte_array(image: Image):
"""
Converts an image into a byte array
"""
imgByteArr = BytesIO()
image.save(imgByteArr, format=image.format if image.format else 'JPEG')
imgByteArr = imgByteArr.getvalue()
return imgByteArr
def ocr_file(**kwargs):
"""Opens the input PDF File.
Opens a memory buffer for storing the output PDF file.
Creates a DataFrame for storing pages statistics
Iterates throughout the chosen pages of the input PDF file
Grabs a screen-shot of the selected PDF page.
Converts the screen-shot pix to a numpy array
Scans the grabbed screen-shot.
Collects the statistics of the screen-shot(page).
Saves the content of the screen-shot(page).
Adds the updated screen-shot (Highlighted, Redacted) to the output file.
Saves the whole content of the PDF file.
Saves the output PDF file if required.
Prints a summary to the console."""
input_file = kwargs.get('input_file')
output_file = kwargs.get('output_file')
search_str = kwargs.get('search_str')
pages = kwargs.get('pages')
highlight_readable_text = kwargs.get('highlight_readable_text')
action = kwargs.get('action')
show_comparison = kwargs.get('show_comparison')
generate_output = kwargs.get('generate_output')
# Opens the input PDF file
pdfIn = fitz.open(input_file)
# Opens a memory buffer for storing the output PDF file.
pdfOut = fitz.open()
# Creates an empty DataFrame for storing pages statistics
dfResult = pd.DataFrame(
columns=['page', 'page_readable_items', 'page_matches', 'page_total_confidence'])
# Creates an empty DataFrame for storing file content
if generate_output:
pdfContent = pd.DataFrame(columns=['page', 'line_id', 'line'])
# Iterate throughout the pages of the input file
for pg in range(pdfIn.pageCount):
if str(pages) != str(None):
if str(pg) not in str(pages):
continue
# Select a page
page = pdfIn[pg]
# Rotation angle
rotate = int(0)
# PDF Page is converted into a whole picture 1056*816 and then for each picture a screenshot is taken.
# zoom = 1.33333333 -----> Image size = 1056*816
# zoom = 2 ---> 2 * Default Resolution (text is clear, image text is hard to read) = filesize small / Image size = 1584*1224
# zoom = 4 ---> 4 * Default Resolution (text is clear, image text is barely readable) = filesize large
# zoom = 8 ---> 8 * Default Resolution (text is clear, image text is readable) = filesize large
zoom_x = 2
zoom_y = 2
# The zoom factor is equal to 2 in order to make text clear
# Pre-rotate is to rotate if needed.
mat = fitz.Matrix(zoom_x, zoom_y).preRotate(rotate)
# To capture a specific part of the PDF page
# rect = page.rect #page size
# mp = rect.tl + (rect.bl - (0.75)/zoom_x) #rectangular area 56 = 75/1.3333
# clip = fitz.Rect(mp,rect.br) #The area to capture
# pix = page.getPixmap(matrix=mat, alpha=False,clip=clip)
# Get a screen-shot of the PDF page
# Colorspace -> represents the color space of the pixmap (csRGB, csGRAY, csCMYK)
# alpha -> Transparency indicator
pix = page.getPixmap(matrix=mat, alpha=False, colorspace="csGRAY")
# convert the screen-shot pix to numpy array
img = pix2np(pix)
# Erode image to omit or thin the boundaries of the bright area of the image
# We apply Erosion on binary images.
#kernel = np.ones((2,2) , np.uint8)
#img = cv2.erode(img,kernel,iterations=1)
upd_np_array, pg_readable_items, pg_matches, pg_total_confidence, pg_output_data \
= ocr_img(img=img, input_file=None, search_str=search_str, highlight_readable_text=highlight_readable_text # False
, action=action # 'Redact'
, show_comparison=show_comparison # True
, generate_output=generate_output # False
)
# Collects the statistics of the page
dfResult = dfResult.append({'page': (pg+1), 'page_readable_items': pg_readable_items,
'page_matches': pg_matches, 'page_total_confidence': pg_total_confidence}, ignore_index=True)
if generate_output:
pdfContent = save_page_content(
pdfContent=pdfContent, page_id=(pg+1), page_data=pg_output_data)
# Convert the numpy array to image object with mode = RGB
#upd_img = Image.fromarray(np.uint8(upd_np_array)).convert('RGB')
upd_img = Image.fromarray(upd_np_array[..., ::-1])
# Convert the image to byte array
upd_array = image_to_byte_array(upd_img)
# Get Page Size
"""
#To check whether initial page is portrait or landscape
if page.rect.width > page.rect.height:
fmt = fitz.PaperRect("a4-1")
else:
fmt = fitz.PaperRect("a4")
#pno = -1 -> Insert after last page
pageo = pdfOut.newPage(pno = -1, width = fmt.width, height = fmt.height)
"""
pageo = pdfOut.newPage(
pno=-1, width=page.rect.width, height=page.rect.height)
pageo.insertImage(page.rect, stream=upd_array)
#pageo.insertImage(page.rect, stream=upd_img.tobytes())
#pageo.showPDFpage(pageo.rect, pdfDoc, page.number)
content_file = None
if generate_output:
content_file = save_file_content(
pdfContent=pdfContent, input_file=input_file)
summary = {
"File": input_file, "Total pages": pdfIn.pageCount,
"Processed pages": dfResult['page'].count(), "Total readable words": dfResult['page_readable_items'].sum(),
"Total matches": dfResult['page_matches'].sum(), "Confidence score": dfResult['page_total_confidence'].mean(),
"Output file": output_file, "Content file": content_file
}
# Printing Summary
print("## Summary ########################################################")
print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
print("\nPages Statistics:")
print(dfResult, sep='\n')
print("###################################################################")
pdfIn.close()
if output_file:
pdfOut.save(output_file)
pdfOut.close()
The image_to_byte_array() function converts an image into a byte array.
The ocr_file() function performs the steps described in its docstring.
Let's add another function for processing a folder containing multiple PDF files:
def ocr_folder(**kwargs):
"""Scans all PDF Files within a specified path"""
input_folder = kwargs.get('input_folder')
# Run in recursive mode
recursive = kwargs.get('recursive')
search_str = kwargs.get('search_str')
pages = kwargs.get('pages')
action = kwargs.get('action')
generate_output = kwargs.get('generate_output')
# Loop though the files within the input folder.
for foldername, dirs, filenames in os.walk(input_folder):
for filename in filenames:
# Check if pdf file
if not filename.endswith('.pdf'):
continue
# PDF File found
inp_pdf_file = os.path.join(foldername, filename)
print("Processing file =", inp_pdf_file)
output_file = None
if search_str:
# Generate an output file
output_file = os.path.join(os.path.dirname(
inp_pdf_file), 'ocr_' + os.path.basename(inp_pdf_file))
ocr_file(
input_file=inp_pdf_file, output_file=output_file, search_str=search_str, pages=pages, highlight_readable_text=False, action=action, show_comparison=False, generate_output=generate_output
)
if not recursive:
break
This function is intended to scan the PDF files contained in a specific folder. Depending on the value of the recursive parameter, it loops through the folder's files either recursively (including subfolders) or non-recursively, and processes these files one by one.
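The break-after-the-first-iteration pattern that ocr_folder uses to toggle recursion can be seen in isolation in this minimal, self-contained sketch (the helper name and the demo file names are made up for illustration):

```python
import os
import tempfile

def list_pdfs(input_folder, recursive=False):
    # Collect .pdf paths under input_folder; stop after the top
    # level unless recursive is True -- the same break-based
    # pattern used by ocr_folder above.
    found = []
    for foldername, dirs, filenames in os.walk(input_folder):
        for filename in filenames:
            if filename.endswith('.pdf'):
                found.append(os.path.join(foldername, filename))
        if not recursive:
            break
    return found

# Demo on a throwaway directory tree
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
for name in ["a.pdf", "b.txt", os.path.join("sub", "c.pdf")]:
    open(os.path.join(root, name), "w").close()

top_only = list_pdfs(root)            # finds only a.pdf
everything = list_pdfs(root, True)    # finds a.pdf and sub/c.pdf
```

Because os.walk() yields the top directory first, breaking after the first iteration is enough to disable recursion.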
It accepts the following parameters:
- input_folder: the path of the folder containing the PDF files to process.
- search_str: the text to search for in order to manipulate it.
- recursive: whether to run this process recursively, looping through the subfolders.
- action: the action to perform among the following: Highlight, Redact.
- pages: the pages to consider.
- generate_output: whether to save the content of the input PDF file to a CSV file.

Before finishing, let's define a handy function for parsing command-line arguments:
def is_valid_path(path):
"""Validates the path inputted and checks whether it is a file path or a folder path"""
if not path:
raise ValueError(f"Invalid Path")
if os.path.isfile(path):
return path
elif os.path.isdir(path):
return path
else:
raise ValueError(f"Invalid Path {path}")
def parse_args():
"""Get user command line parameters"""
parser = argparse.ArgumentParser(description="Available Options")
parser.add_argument('-i', '--input-path', type=is_valid_path,
required=True, help="Enter the path of the file or the folder to process")
parser.add_argument('-a', '--action', choices=[
'Highlight', 'Redact'], type=str, help="Choose to highlight or to redact")
parser.add_argument('-s', '--search-str', dest='search_str',
type=str, help="Enter a valid search string")
parser.add_argument('-p', '--pages', dest='pages', type=tuple,
help="Enter the pages to consider in the PDF file, e.g. (0,1)")
parser.add_argument("-g", "--generate-output", action="store_true", help="Generate text content in a CSV file")
path = parser.parse_known_args()[0].input_path
if os.path.isfile(path):
parser.add_argument('-o', '--output_file', dest='output_file',
type=str, help="Enter a valid output file")
parser.add_argument("-t", "--highlight-readable-text", action="store_true", help="Highlight readable text in the generated image")
parser.add_argument("-c", "--show-comparison", action="store_true", help="Show comparison between captured image and the generated image")
if os.path.isdir(path):
parser.add_argument("-r", "--recursive", action="store_true", help="Whether to process the directory recursively")
# To Parse The Command Line Arguments
args = vars(parser.parse_args())
# To Display The Command Line Arguments
print("## Command Arguments #################################################")
print("\n".join("{}:{}".format(i, j) for i, j in args.items()))
print("######################################################################")
return args
The is_valid_path() function validates the path passed as a parameter and checks whether it is a file path or a directory path.
The parse_args() function defines and sets the appropriate constraints for the user's command-line arguments when running this utility.
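The conditional registration of arguments, based on a first pass over the input path with parse_known_args(), can be sketched in a stripped-down form; the build_parser() helper below is hypothetical and keeps only a couple of the real options:

```python
import argparse
import os

def build_parser(path):
    # Register file-only or folder-only options depending on the
    # input path, mirroring the two-pass trick used by parse_args()
    # above (simplified sketch).
    parser = argparse.ArgumentParser(description="Available Options")
    parser.add_argument('-i', '--input-path', required=True)
    if os.path.isfile(path):
        parser.add_argument('-o', '--output_file', type=str)
    if os.path.isdir(path):
        parser.add_argument('-r', '--recursive', action="store_true")
    return parser

# '.' is a directory, so -r is registered and -o is not
args = vars(build_parser('.').parse_args(['-i', '.', '-r']))
```

With a file path instead, -o would be available and -r would be rejected as an unknown argument.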
Below is a description of all the parameters:
- input_path: a required parameter for entering the path of the file or folder to process; it is tied to the previously defined is_valid_path() function.
- action: the action to perform, chosen from a list of predefined options to avoid any wrong selection.
- search_str: the text to search for in order to manipulate it.
- pages: the pages to consider when processing a PDF file.
- generate_output: whether to save the grabbed content of the input file, an image or a PDF, to a CSV file.
- output_file: the path of the output file. Filling in this argument is constrained by selecting a file as input, not a directory.
- highlight_readable_text: draws green rectangles around readable text fields having a confidence score greater than 30.
- show_comparison: displays a window showing a comparison between the original image and the processed one.
- recursive: whether to process a folder recursively. Filling in this argument is constrained by selecting a directory.

Finally, let's write the main code that uses the previously defined functions:
if __name__ == '__main__':
# Parsing command line arguments entered by user
args = parse_args()
# If File Path
if os.path.isfile(args['input_path']):
# Process a file
if filetype.is_image(args['input_path']):
ocr_img(
# if 'search_str' in (args.keys()) else None
img=None, input_file=args['input_path'], search_str=args['search_str'], highlight_readable_text=args['highlight_readable_text'], action=args['action'], show_comparison=args['show_comparison'], generate_output=args['generate_output']
)
else:
ocr_file(
input_file=args['input_path'], output_file=args['output_file'], search_str=args['search_str'] if 'search_str' in (args.keys()) else None, pages=args['pages'], highlight_readable_text=args['highlight_readable_text'], action=args['action'], show_comparison=args['show_comparison'], generate_output=args['generate_output']
)
# If Folder Path
elif os.path.isdir(args['input_path']):
# Process a folder
ocr_folder(
input_folder=args['input_path'], recursive=args['recursive'], search_str=args['search_str'] if 'search_str' in (args.keys()) else None, pages=args['pages'], action=args['action'], generate_output=args['generate_output']
)
Let's test the program:
$ python pdf_ocr.py
Output:
usage: pdf_ocr.py [-h] -i INPUT_PATH [-a {Highlight,Redact}] [-s SEARCH_STR] [-p PAGES] [-g]
Available Options
optional arguments:
-h, --help show this help message and exit
-i INPUT_PATH, --input-path INPUT_PATH
Enter the path of the file or the folder to process
-a {Highlight,Redact}, --action {Highlight,Redact}
Choose to highlight or to redact
-s SEARCH_STR, --search-str SEARCH_STR
Enter a valid search string
-p PAGES, --pages PAGES
Enter the pages to consider in the PDF file, e.g. (0,1)
-g, --generate-output
Generate text content in a CSV file
Before exploring our test scenarios, note the following:
To avoid a PermissionError, close the input file before running this utility.
First, let's try to input an image without using any PDF file (you can get it here if you want to obtain the same output):
$ python pdf_ocr.py -s "BERT" -a Highlight -i example-image-containing-text.jpg
The following will be the output:
## Command Arguments #################################################
input_path:example-image-containing-text.jpg
action:Highlight
search_str:BERT
pages:None
generate_output:False
output_file:None
highlight_readable_text:False
show_comparison:False
######################################################################
## Summary ########################################################
File:example-image-containing-text.jpg
Total readable words:192
Total matches:3
Confidence score:89.89337547979804
###################################################################
A new image will also appear in the current directory.
You can pass -t or --highlight-readable-text to highlight all the detected text (in a different format, to distinguish the search string from the rest).
You can also pass -c or --show-comparison to display the original image and the edited image in the same window.
Now that this works for images, let's try it on PDF files:
$ python pdf_ocr.py -s "BERT" -i image.pdf -o output.pdf --generate-output -a "Highlight"
image.pdf
is a simple PDF file containing the image from the previous example (again, you can get it here).
This time we've passed a PDF file to the -i
argument, and output.pdf
as the resulting PDF file (where all the highlighting occurs). The above command generates the following output:
## Command Arguments #################################################
input_path:image.pdf
action:Highlight
search_str:BERT
pages:None
generate_output:True
output_file:output.pdf
highlight_readable_text:False
show_comparison:False
######################################################################
## Summary ########################################################
File:image.pdf
Total pages:1
Processed pages:1
Total readable words:192.0
Total matches:3.0
Confidence score:83.1775128855722
Output file:output.pdf
Content file:image.csv
Pages Statistics:
page page_readable_items page_matches page_total_confidence
0 1.0 192.0 3.0 83.177513
###################################################################
The output.pdf
file is produced after the execution; it contains the same original PDF, but with the matched text highlighted. Additionally, we now have statistics about our PDF file: 192 total words were detected, and 3 were matched by our search with a confidence score of about 83.2%.
A CSV file is also generated that includes the detected text from the image on each line.
There are other parameters we didn't use in our examples; feel free to explore them. You can also pass an entire folder to the -i
argument to scan a collection of PDF files.
Tesseract works best when scanning clean, clear documents. A poor-quality scan may produce poor OCR results. Normally it does not give accurate results for images affected by artifacts such as partial occlusion, distorted perspective, or a complex background.
Source of the original article at https://www.thepythoncode.com
1648837200
Nowadays, medium and large-scale companies use huge amounts of printed documents on a daily basis. Among them are invoices, receipts, corporate documents, reports, and press releases.
For these companies, using an OCR scanner can save a considerable amount of time while improving efficiency and accuracy.
Optical character recognition (OCR) algorithms allow computers to analyze printed or handwritten documents automatically and convert text data into editable formats so that computers can process them efficiently. OCR systems transform a two-dimensional image of text, which may contain machine-printed or handwritten text, from its graphical representation into machine-readable text.
Generally, an OCR engine involves several steps required to train a machine learning algorithm for efficient problem solving with optical character recognition.
The following steps, which may differ from one engine to another, are roughly what is needed for automatic character recognition:
In this tutorial, I am going to show you the following:
Note that this tutorial is about extracting text from images within PDF documents.
To get started, we need the following libraries:
Tesseract OCR: an open-source text recognition engine that is available under the Apache 2.0 license, and its development has been sponsored by Google since 2006. In 2006, Tesseract was considered one of the most accurate open-source OCR engines. You can use it directly or use its API to extract printed text from images. The best part is that it supports a large number of languages.
Installing the Tesseract engine is beyond the scope of this article. However, you need to follow the official Tesseract installation guide to install it on your operating system.
To validate the Tesseract installation, run the following command and check the generated output:
Python-tesseract: a Python wrapper for Google's Tesseract-OCR Engine. It is also useful as a standalone invocation script for Tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others.
OpenCV: an open-source Python library for computer vision, machine learning, and image processing. OpenCV supports a wide variety of programming languages such as Python, C++, Java, and more. It can process images and videos to identify objects, faces, or even human handwriting.
PyMuPDF: MuPDF is a highly versatile, customizable PDF, XPS, and e-book interpreter solution that can be used across a wide range of applications as a PDF renderer, viewer, or toolkit. PyMuPDF is a Python binding for MuPDF. It is a lightweight PDF and XPS viewer.
NumPy: a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays. It is the fundamental package for scientific computing with Python. In addition, NumPy can also be used as an efficient multidimensional container of generic data.
Pillow: built on top of PIL (Python Imaging Library). It is an essential module for image processing in Python.
Pandas: an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data-analysis tools for the Python programming language.
Filetype: a small, dependency-free Python package for inferring file type and MIME type.
The goal of this tutorial is to develop a lightweight command-line utility for extracting, redacting, or highlighting text included in an image or a scanned PDF file, or within a folder containing a collection of PDF files.
To get started, let's install the requirements:
$ pip install Filetype==1.0.7 numpy==1.19.4 opencv-python==4.4.0.46 pandas==1.1.4 Pillow==8.0.1 PyMuPDF==1.18.9 pytesseract==0.3.7
Let's start by importing the necessary libraries:
import os
import re
import argparse
import pytesseract
from pytesseract import Output
import cv2
import numpy as np
import fitz
from io import BytesIO
from PIL import Image
import pandas as pd
import filetype
# Path Of The Tesseract OCR engine
TESSERACT_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"  # raw string so backslashes are not treated as escapes
# Include tesseract executable
pytesseract.pytesseract.tesseract_cmd = TESSERACT_PATH
TESSERACT_PATH
is where the Tesseract executable is located. Obviously, you need to change it for your case.
def pix2np(pix):
"""
Converts a pixmap buffer into a numpy array
"""
# pix.samples = sequence of bytes of the image pixels like RGBA
#pix.h = height in pixels
#pix.w = width in pixels
# pix.n = number of components per pixel (depends on the colorspace and alpha)
im = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
pix.h, pix.w, pix.n)
try:
im = np.ascontiguousarray(im[..., [2, 1, 0]]) # RGB To BGR
except IndexError:
# Convert Gray to RGB
im = cv2.cvtColor(im, cv2.COLOR_GRAY2RGB)
im = np.ascontiguousarray(im[..., [2, 1, 0]]) # RGB To BGR
return im
This function converts a pixmap buffer representing a screenshot taken with the PyMuPDF library into a NumPy array.
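The reshape-and-channel-swap step can be demonstrated without PyMuPDF; FakePix below is a made-up stand-in for a Pixmap, assuming only that it exposes samples, w, h, and n like the real object:

```python
import numpy as np

class FakePix:
    # Minimal stand-in for a PyMuPDF Pixmap (hypothetical, for
    # illustration only): raw bytes plus width/height/components.
    def __init__(self, samples, w, h, n):
        self.samples, self.w, self.h, self.n = samples, w, h, n

def pix_to_bgr(pix):
    # Same reshape-and-swap logic as pix2np(): interpret the raw
    # bytes as an (h, w, n) array, then reorder the RGB channels
    # into OpenCV's BGR convention.
    im = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.h, pix.w, pix.n)
    return np.ascontiguousarray(im[..., [2, 1, 0]])

# A single red pixel: R=255, G=0, B=0 becomes B=0, G=0, R=255
pix = FakePix(bytes([255, 0, 0]), w=1, h=1, n=3)
bgr = pix_to_bgr(pix)
```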
To increase Tesseract's accuracy, let's define some preprocessing functions using OpenCV:
# Image pre-processing functions to improve output accuracy
# Convert to grayscale
def grayscale(img):
return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Remove noise
def remove_noise(img):
return cv2.medianBlur(img, 5)
# Thresholding
def threshold(img):
# return cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
return cv2.threshold(img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
# dilation
def dilate(img):
kernel = np.ones((5, 5), np.uint8)
return cv2.dilate(img, kernel, iterations=1)
# erosion
def erode(img):
kernel = np.ones((5, 5), np.uint8)
return cv2.erode(img, kernel, iterations=1)
# opening -- erosion followed by a dilation
def opening(img):
kernel = np.ones((5, 5), np.uint8)
return cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)
# canny edge detection
def canny(img):
return cv2.Canny(img, 100, 200)
# skew correction
def deskew(img):
coords = np.column_stack(np.where(img > 0))
angle = cv2.minAreaRect(coords)[-1]
if angle < -45:
angle = -(90 + angle)
else:
angle = -angle
(h, w) = img.shape[:2]
center = (w//2, h//2)
M = cv2.getRotationMatrix2D(center, angle, 1.0)
rotated = cv2.warpAffine(
img, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
return rotated
# template matching
def match_template(img, template):
return cv2.matchTemplate(img, template, cv2.TM_CCOEFF_NORMED)
def convert_img2bin(img):
"""
Pre-processes the image and generates a binary output
"""
# Convert the image into a grayscale image
output_img = grayscale(img)
# Invert the grayscale image by flipping pixel values:
# each pixel value v becomes 255 - v
output_img = cv2.bitwise_not(output_img)
# Convert the image to binary by thresholding to show a clear separation between white and black pixels.
output_img = threshold(output_img)
return output_img
We have defined functions for many preprocessing tasks, including converting images to grayscale, flipping pixel values, separating white and black pixels, and more.
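The core of the binarization pipeline (invert, then threshold) can be sketched in pure Python, without OpenCV; note that the real convert_img2bin() uses Otsu's method, while the fixed threshold assumed here just keeps the example short:

```python
def binarize(gray, thresh=128):
    # Pure-Python sketch of the convert_img2bin() pipeline above:
    # invert the grayscale values so dark ink becomes bright, then
    # threshold into a strictly black-and-white image.
    inverted = [[255 - px for px in row] for row in gray]
    return [[255 if px > thresh else 0 for px in row] for row in inverted]

# Bright background (250) with dark "ink" pixels (30, 40)
page = [[250, 250, 30],
        [250, 40, 250]]
binary = binarize(page)
```

After binarization, the former ink pixels are the only 255 values left, which is the separation Tesseract benefits from.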
Next, let's define a function for displaying an image:
def display_img(title, img):
"""Displays an image on screen and maintains the output until the user presses a key"""
cv2.namedWindow('img', cv2.WINDOW_NORMAL)
cv2.setWindowTitle('img', title)
cv2.resizeWindow('img', 1200, 900)
# Display Image on screen
cv2.imshow('img', img)
# Maintain output until user presses a key
cv2.waitKey(0)
# Destroy windows when user presses a key
cv2.destroyAllWindows()
The display_img() function displays an image on screen in a window whose title matches the title parameter, and keeps this window open until the user presses a key on the keyboard.
def generate_ss_text(ss_details):
"""Loops through the captured text of an image and arranges this text line by line.
This function depends on the image layout."""
# Arrange the captured text after scanning the page
parse_text = []
word_list = []
last_word = ''
# Loop through the captured text of the entire page
for word in ss_details['text']:
# If the word captured is not empty
if word != '':
# Add it to the line word list
word_list.append(word)
last_word = word
if (last_word != '' and word == '') or (word == ss_details['text'][-1]):
parse_text.append(word_list)
word_list = []
return parse_text
The function above iterates through the captured text of an image and arranges the grabbed text line by line. It depends on the image layout and may require tweaking for some image formats.
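The line-splitting idea can be isolated in a small, self-contained sketch; group_lines() below is a hypothetical helper that treats empty strings in a Tesseract-style word list as line separators:

```python
def group_lines(words):
    # Group Tesseract's flat 'text' word list into lines, using
    # empty strings as separators -- the same idea as
    # generate_ss_text() above, with trailing words handled too.
    lines, current = [], []
    for word in words:
        if word:
            current.append(word)
        elif current:
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines

tess_text = ['Hello', 'world', '', 'Second', 'line']
grouped = group_lines(tess_text)
```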
Next, let's define a function for searching text using regular expressions:
def search_for_text(ss_details, search_str):
"""Search for the search string within the image content"""
# Find all matches within one page ('text' comes as a list of words, so join it first)
results = re.findall(search_str, ' '.join(ss_details['text']), re.IGNORECASE)
# In case multiple matches within one page
for result in results:
yield result
We will use this function to search for specific text within the grabbed content of an image. It returns a generator of the found matches.
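A generator-based search like this can be exercised on a plain string; find_matches() below is a hypothetical variant of the same idea that simply scans the text it is given:

```python
import re

def find_matches(text, search_str):
    # Yield case-insensitive matches of search_str one at a time,
    # mirroring the generator behavior of search_for_text() above.
    for result in re.findall(search_str, text, re.IGNORECASE):
        yield result

hits = list(find_matches("BERT is a language model; bert again.", "BERT"))
```

Because it is a generator, callers can count matches lazily instead of materializing them all at once.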
def save_page_content(pdfContent, page_id, page_data):
"""Appends the content of a scanned page, line by line, to a pandas DataFrame."""
if page_data:
for idx, line in enumerate(page_data, 1):
line = ' '.join(line)
pdfContent = pdfContent.append(
{'page': page_id, 'line_id': idx, 'line': line}, ignore_index=True
)
return pdfContent
The save_page_content() function appends the grabbed content of an image line by line, after scanning it, to the pdfContent pandas DataFrame.
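The same bookkeeping can be sketched without pandas at all; collect_page_rows() below is a hypothetical helper, and it sidesteps DataFrame.append(), which was removed in pandas 2.0 (the tutorial pins pandas 1.1.4, where it still works):

```python
def collect_page_rows(page_id, page_data):
    # Pandas-free sketch of save_page_content() above: turn each
    # captured line (a list of words) into one row dict. Collecting
    # plain rows and building the DataFrame once at the end is the
    # pattern recommended now that DataFrame.append() is gone.
    rows = []
    for idx, line in enumerate(page_data, 1):
        rows.append({'page': page_id, 'line_id': idx, 'line': ' '.join(line)})
    return rows

rows = collect_page_rows(1, [['Hello', 'world'], ['Second', 'line']])
```

With modern pandas you would finish with pd.DataFrame(rows) instead of appending row by row.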
Now let's make a function for saving the resulting DataFrame to a CSV file:
def save_file_content(pdfContent, input_file):
"""Outputs the content of the pandas DataFrame to a CSV file having the same path as the input_file
but with different extension (.csv)"""
content_file = os.path.join(os.path.dirname(input_file), os.path.splitext(
os.path.basename(input_file))[0] + ".csv")
pdfContent.to_csv(content_file, sep=',', index=False)
return content_file
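The path derivation used here is easy to verify in isolation; derive_csv_path() below is a hypothetical helper that repeats the same os.path logic:

```python
import os

def derive_csv_path(input_file):
    # Same directory and basename as the input file, but with a
    # .csv extension -- the naming used by save_file_content() above.
    base = os.path.splitext(os.path.basename(input_file))[0]
    return os.path.join(os.path.dirname(input_file), base + ".csv")

csv_path = derive_csv_path(os.path.join("docs", "image.pdf"))
```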
Next, let's write a function that calculates the confidence score of the text grabbed from the scanned image:
def calculate_ss_confidence(ss_details: dict):
"""Calculate the confidence score of the text grabbed from the scanned image."""
# page_num --> Page number of the detected text or item
# block_num --> Block number of the detected text or item
# par_num --> Paragraph number of the detected text or item
# line_num --> Line number of the detected text or item
# Convert the dict to dataFrame
df = pd.DataFrame.from_dict(ss_details)
# Convert the field conf (confidence) to numeric
df['conf'] = pd.to_numeric(df['conf'], errors='coerce')
# Eliminate records with negative confidence
df = df[df.conf != -1]
# Calculate the mean confidence by page
conf = df.groupby(['page_num'])['conf'].mean().tolist()
return conf[0]
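The filtering and averaging can be reproduced without pandas for a single page; mean_confidence() below is a hypothetical, DataFrame-free sketch of the same idea:

```python
def mean_confidence(confs):
    # Average the word-level confidence values, skipping the -1
    # entries Tesseract emits for non-text items -- the same
    # filtering calculate_ss_confidence() performs with pandas,
    # but for one page and without a DataFrame.
    valid = [c for c in confs if c != -1]
    return sum(valid) / len(valid) if valid else 0.0

score = mean_confidence([-1, 90, 80, -1, 70])
```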
Let's move on to the main function: scanning the image:
def ocr_img(
img: np.array, input_file: str, search_str: str,
highlight_readable_text: bool = False, action: str = 'Highlight',
show_comparison: bool = False, generate_output: bool = True):
"""Scans an image buffer or an image file.
Pre-processes the image.
Calls the Tesseract engine with pre-defined parameters.
Calculates the confidence score of the image grabbed content.
Draws a green rectangle around readable text items having a confidence score > 30.
Searches for a specific text.
Highlight or redact found matches of the searched text.
Displays a window showing readable text fields or the highlighted or redacted text.
Generates the text content of the image.
Prints a summary to the console."""
# If image source file is inputted as a parameter
if input_file:
# Reading image using opencv
img = cv2.imread(input_file)
# Preserve a copy of this image for comparison purposes
initial_img = img.copy()
highlighted_img = img.copy()
# Convert image to binary
bin_img = convert_img2bin(img)
# Calling Tesseract
# Tesseract Configuration parameters
# oem --> OCR Engine mode = 3 >> Default, based on what is available (the LSTM neural net mode works best)
# psm --> page segmentation mode = 6 >> Assume as single uniform block of text (How a page of text can be analyzed)
config_param = r'--oem 3 --psm 6'
# Feeding image to tesseract
details = pytesseract.image_to_data(
bin_img, output_type=Output.DICT, config=config_param, lang='eng')
# The details dictionary contains the information of the input image
# such as detected text, region, position, information, height, width, confidence score.
ss_confidence = calculate_ss_confidence(details)
boxed_img = None
# Total readable items
ss_readable_items = 0
# Total matches found
ss_matches = 0
for seq in range(len(details['text'])):
# Consider only text fields with confidence score > 30 (text is readable)
if float(details['conf'][seq]) > 30.0:
ss_readable_items += 1
# Draws a green rectangle around readable text items having a confidence score > 30
if highlight_readable_text:
(x, y, w, h) = (details['left'][seq], details['top']
[seq], details['width'][seq], details['height'][seq])
boxed_img = cv2.rectangle(
img, (x, y), (x+w, y+h), (0, 255, 0), 2)
# Searches for the string
if search_str:
results = re.findall(
search_str, details['text'][seq], re.IGNORECASE)
for result in results:
ss_matches += 1
if action:
# Draw a red rectangle around the searchable text
(x, y, w, h) = (details['left'][seq], details['top']
[seq], details['width'][seq], details['height'][seq])
# Details of the rectangle
# Starting coordinate representing the top left corner of the rectangle
start_point = (x, y)
# Ending coordinate representing the bottom right corner of the rectangle
end_point = (x + w, y + h)
#Color in BGR -- Blue, Green, Red
if action == "Highlight":
color = (0, 255, 255) # Yellow
elif action == "Redact":
color = (0, 0, 0) # Black
# Thickness in px (-1 will fill the entire shape)
thickness = -1
boxed_img = cv2.rectangle(
img, start_point, end_point, color, thickness)
if ss_readable_items > 0 and highlight_readable_text and not (ss_matches > 0 and action in ("Highlight", "Redact")):
highlighted_img = boxed_img.copy()
# Highlight found matches of the search string
if ss_matches > 0 and action == "Highlight":
cv2.addWeighted(boxed_img, 0.4, highlighted_img,
1 - 0.4, 0, highlighted_img)
# Redact found matches of the search string
elif ss_matches > 0 and action == "Redact":
highlighted_img = boxed_img.copy()
#cv2.addWeighted(boxed_img, 1, highlighted_img, 0, 0, highlighted_img)
# save the image
cv2.imwrite("highlighted-text-image.jpg", highlighted_img)
# Displays window showing readable text fields or the highlighted or redacted data
if show_comparison and (highlight_readable_text or action):
title = input_file if input_file else 'Compare'
conc_img = cv2.hconcat([initial_img, highlighted_img])
display_img(title, conc_img)
# Generates the text content of the image
output_data = None
if generate_output and details:
output_data = generate_ss_text(details)
# Prints a summary to the console
if input_file:
summary = {
"File": input_file, "Total readable words": ss_readable_items, "Total matches": ss_matches, "Confidence score": ss_confidence
}
# Printing Summary
print("## Summary ########################################################")
print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
print("###################################################################")
return highlighted_img, ss_readable_items, ss_matches, ss_confidence, output_data
# pass image into pytesseract module
# pytesseract is trained in many languages
#config_param = r'--oem 3 --psm 6'
#details = pytesseract.image_to_data(img,config=config_param,lang='eng')
# print(details)
# return details
The ocr_img() function above performs the steps described in its docstring.
def image_to_byte_array(image: Image):
"""
Converts an image into a byte array
"""
imgByteArr = BytesIO()
image.save(imgByteArr, format=image.format if image.format else 'JPEG')
imgByteArr = imgByteArr.getvalue()
return imgByteArr
def ocr_file(**kwargs):
"""Opens the input PDF File.
Opens a memory buffer for storing the output PDF file.
Creates a DataFrame for storing pages statistics
Iterates throughout the chosen pages of the input PDF file
Grabs a screen-shot of the selected PDF page.
Converts the screen-shot pix to a numpy array
Scans the grabbed screen-shot.
Collects the statistics of the screen-shot(page).
Saves the content of the screen-shot(page).
Adds the updated screen-shot (Highlighted, Redacted) to the output file.
Saves the whole content of the PDF file.
Saves the output PDF file if required.
Prints a summary to the console."""
input_file = kwargs.get('input_file')
output_file = kwargs.get('output_file')
search_str = kwargs.get('search_str')
pages = kwargs.get('pages')
highlight_readable_text = kwargs.get('highlight_readable_text')
action = kwargs.get('action')
show_comparison = kwargs.get('show_comparison')
generate_output = kwargs.get('generate_output')
# Opens the input PDF file
pdfIn = fitz.open(input_file)
# Opens a memory buffer for storing the output PDF file.
pdfOut = fitz.open()
# Creates an empty DataFrame for storing pages statistics
dfResult = pd.DataFrame(
columns=['page', 'page_readable_items', 'page_matches', 'page_total_confidence'])
# Creates an empty DataFrame for storing file content
if generate_output:
pdfContent = pd.DataFrame(columns=['page', 'line_id', 'line'])
# Iterate throughout the pages of the input file
for pg in range(pdfIn.pageCount):
if str(pages) != str(None):
if str(pg) not in str(pages):
continue
# Select a page
page = pdfIn[pg]
# Rotation angle
rotate = int(0)
# PDF Page is converted into a whole picture 1056*816 and then for each picture a screenshot is taken.
# zoom = 1.33333333 -----> Image size = 1056*816
# zoom = 2 ---> 2 * Default Resolution (text is clear, image text is hard to read) = filesize small / Image size = 1584*1224
# zoom = 4 ---> 4 * Default Resolution (text is clear, image text is barely readable) = filesize large
# zoom = 8 ---> 8 * Default Resolution (text is clear, image text is readable) = filesize large
zoom_x = 2
zoom_y = 2
# The zoom factor is equal to 2 in order to make text clear
# Pre-rotate is to rotate if needed.
mat = fitz.Matrix(zoom_x, zoom_y).preRotate(rotate)
# To capture a specific part of the PDF page
# rect = page.rect #page size
# mp = rect.tl + (rect.bl - (0.75)/zoom_x) #rectangular area 56 = 75/1.3333
# clip = fitz.Rect(mp,rect.br) #The area to capture
# pix = page.getPixmap(matrix=mat, alpha=False,clip=clip)
# Get a screen-shot of the PDF page
# Colorspace -> represents the color space of the pixmap (csRGB, csGRAY, csCMYK)
# alpha -> transparency indicator
pix = page.getPixmap(matrix=mat, alpha=False, colorspace="csGRAY")
# convert the screen-shot pix to numpy array
img = pix2np(pix)
# Erode image to omit or thin the boundaries of the bright area of the image
# We apply Erosion on binary images.
#kernel = np.ones((2,2) , np.uint8)
#img = cv2.erode(img,kernel,iterations=1)
upd_np_array, pg_readable_items, pg_matches, pg_total_confidence, pg_output_data \
= ocr_img(img=img, input_file=None, search_str=search_str, highlight_readable_text=highlight_readable_text # False
, action=action # 'Redact'
, show_comparison=show_comparison # True
, generate_output=generate_output # False
)
# Collects the statistics of the page
dfResult = dfResult.append({'page': (pg+1), 'page_readable_items': pg_readable_items,
'page_matches': pg_matches, 'page_total_confidence': pg_total_confidence}, ignore_index=True)
if generate_output:
pdfContent = save_page_content(
pdfContent=pdfContent, page_id=(pg+1), page_data=pg_output_data)
# Convert the numpy array to image object with mode = RGB
#upd_img = Image.fromarray(np.uint8(upd_np_array)).convert('RGB')
upd_img = Image.fromarray(upd_np_array[..., ::-1])
# Convert the image to byte array
upd_array = image_to_byte_array(upd_img)
# Get Page Size
"""
#To check whether initial page is portrait or landscape
if page.rect.width > page.rect.height:
fmt = fitz.PaperRect("a4-1")
else:
fmt = fitz.PaperRect("a4")
#pno = -1 -> Insert after last page
pageo = pdfOut.newPage(pno = -1, width = fmt.width, height = fmt.height)
"""
pageo = pdfOut.newPage(
pno=-1, width=page.rect.width, height=page.rect.height)
pageo.insertImage(page.rect, stream=upd_array)
#pageo.insertImage(page.rect, stream=upd_img.tobytes())
#pageo.showPDFpage(pageo.rect, pdfDoc, page.number)
content_file = None
if generate_output:
content_file = save_file_content(
pdfContent=pdfContent, input_file=input_file)
summary = {
"File": input_file, "Total pages": pdfIn.pageCount,
"Processed pages": dfResult['page'].count(), "Total readable words": dfResult['page_readable_items'].sum(),
"Total matches": dfResult['page_matches'].sum(), "Confidence score": dfResult['page_total_confidence'].mean(),
"Output file": output_file, "Content file": content_file
}
# Printing Summary
print("## Summary ########################################################")
print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
print("\nPages Statistics:")
print(dfResult, sep='\n')
print("###################################################################")
pdfIn.close()
if output_file:
pdfOut.save(output_file)
pdfOut.close()
The image_to_byte_array() function converts an image into a byte array.
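As a quick standalone check (assuming Pillow is installed; the in-memory image below is purely illustrative), the helper can be exercised like this:

```python
from io import BytesIO
from PIL import Image

def image_to_byte_array(image: Image.Image) -> bytes:
    """Converts a PIL image into a byte array (same helper as above)."""
    buf = BytesIO()
    # Fall back to JPEG when the image has no source format (e.g. created in memory)
    image.save(buf, format=image.format if image.format else 'JPEG')
    return buf.getvalue()

# A small in-memory RGB image has no .format, so the helper encodes it as JPEG
img = Image.new('RGB', (8, 8), color=(255, 0, 0))
data = image_to_byte_array(img)
print(data[:2])  # JPEG data starts with the SOI marker b'\xff\xd8'
```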
The ocr_file() function performs the steps enumerated in its docstring.
Let's add another function for processing a folder that contains multiple PDF files:
def ocr_folder(**kwargs):
"""Scans all PDF Files within a specified path"""
input_folder = kwargs.get('input_folder')
# Run in recursive mode
recursive = kwargs.get('recursive')
search_str = kwargs.get('search_str')
pages = kwargs.get('pages')
action = kwargs.get('action')
generate_output = kwargs.get('generate_output')
# Loop through the files within the input folder.
for foldername, dirs, filenames in os.walk(input_folder):
for filename in filenames:
# Check if pdf file
if not filename.endswith('.pdf'):
continue
# PDF File found
inp_pdf_file = os.path.join(foldername, filename)
print("Processing file =", inp_pdf_file)
output_file = None
if search_str:
# Generate an output file
output_file = os.path.join(os.path.dirname(
inp_pdf_file), 'ocr_' + os.path.basename(inp_pdf_file))
ocr_file(
input_file=inp_pdf_file, output_file=output_file, search_str=search_str, pages=pages, highlight_readable_text=False, action=action, show_comparison=False, generate_output=generate_output
)
if not recursive:
break
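The non-recursive mode relies on breaking out of os.walk() after the first, top-level directory has been processed. A stdlib-only sketch of that traversal pattern, using a throwaway directory tree:

```python
import os
import tempfile

def list_pdfs(input_folder: str, recursive: bool):
    """Collects PDF paths under input_folder, descending into subfolders only if recursive."""
    found = []
    for foldername, dirs, filenames in os.walk(input_folder):
        for filename in filenames:
            if filename.endswith('.pdf'):
                found.append(os.path.join(foldername, filename))
        if not recursive:
            break  # stop after the top-level folder, exactly like ocr_folder()
    return found

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'sub'))
    for name in ('a.pdf', 'b.txt', os.path.join('sub', 'c.pdf')):
        open(os.path.join(root, name), 'w').close()
    top_only = list_pdfs(root, recursive=False)
    everything = list_pdfs(root, recursive=True)
print(len(top_only), len(everything))  # 1 2
```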
This function is designed to scan the PDF files contained within a specific folder. It loops through the files of the specified folder, either recursively or not depending on the value of the recursive parameter, and processes these files one by one.
It accepts the following parameters:
- input_folder: the path of the folder containing the PDF files to process.
- search_str: the text to search for.
- recursive: whether to run this process recursively, traversing the subfolders or not.
- action: the action to perform, one of: Highlight, Redact.
- pages: the pages to consider.
- generate_output: whether or not to save the content of the input PDF file to a CSV file.
Before we finish, let's define some helper functions for parsing command-line arguments:
def is_valid_path(path):
"""Validates the path inputted and checks whether it is a file path or a folder path"""
if not path:
raise ValueError(f"Invalid Path")
if os.path.isfile(path):
return path
elif os.path.isdir(path):
return path
else:
raise ValueError(f"Invalid Path {path}")
def parse_args():
"""Get user command line parameters"""
parser = argparse.ArgumentParser(description="Available Options")
parser.add_argument('-i', '--input-path', type=is_valid_path,
required=True, help="Enter the path of the file or the folder to process")
parser.add_argument('-a', '--action', choices=[
'Highlight', 'Redact'], type=str, help="Choose to highlight or to redact")
parser.add_argument('-s', '--search-str', dest='search_str',
type=str, help="Enter a valid search string")
parser.add_argument('-p', '--pages', dest='pages', type=tuple,
help="Enter the pages to consider in the PDF file, e.g. (0,1)")
parser.add_argument("-g", "--generate-output", action="store_true", help="Generate text content in a CSV file")
path = parser.parse_known_args()[0].input_path
if os.path.isfile(path):
parser.add_argument('-o', '--output_file', dest='output_file',
type=str, help="Enter a valid output file")
parser.add_argument("-t", "--highlight-readable-text", action="store_true", help="Highlight readable text in the generated image")
parser.add_argument("-c", "--show-comparison", action="store_true", help="Show comparison between captured image and the generated image")
if os.path.isdir(path):
parser.add_argument("-r", "--recursive", action="store_true", help="Whether to process the directory recursively")
# Parse the command-line arguments
args = vars(parser.parse_args())
# To Display The Command Line Arguments
print("## Command Arguments #################################################")
print("\n".join("{}:{}".format(i, j) for i, j in args.items()))
print("######################################################################")
return args
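The trick above, peeking at the input path with parse_known_args() and only then registering the file-only or folder-only options, can be exercised in isolation. This sketch uses simplified option names and a temporary directory, so it is illustrative rather than a drop-in replacement:

```python
import argparse
import os
import tempfile

def build_args(argv):
    """Registers file-only or folder-only options after inspecting the input path."""
    parser = argparse.ArgumentParser(description="Available Options")
    parser.add_argument('-i', '--input-path', required=True)
    # Peek at the input path without failing on options not yet registered
    path = parser.parse_known_args(argv)[0].input_path
    if os.path.isfile(path):
        parser.add_argument('-o', '--output-file', dest='output_file')
    if os.path.isdir(path):
        parser.add_argument('-r', '--recursive', action='store_true')
    return vars(parser.parse_args(argv))

with tempfile.TemporaryDirectory() as d:
    folder_args = build_args(['-i', d, '-r'])
    f = os.path.join(d, 'x.pdf')
    open(f, 'w').close()
    file_args = build_args(['-i', f, '-o', 'out.pdf'])
print(folder_args['recursive'], file_args['output_file'])  # True out.pdf
```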
The is_valid_path() function validates the path entered as a parameter and checks whether it is a file path or a directory path.
The parse_args() function defines and sets the appropriate constraints for the user's command-line arguments when running this utility.
Below is an explanation of all the parameters:
- input_path: a required parameter for entering the path of the file or folder to process; this parameter is tied to the previously defined is_valid_path() function.
- action: the action to perform, chosen from a list of predefined options to avoid erroneous input.
- search_str: the text to search for.
- pages: the pages to consider when processing a PDF file.
- generate_output: whether or not to save the captured content of the input file, be it an image or a PDF, to a CSV file.
- output_file: the path of the output file. Filling in this argument is restricted to selecting a file as input, not a directory.
- highlight_readable_text: draws green rectangles around readable text fields having a confidence score greater than 30.
- show_comparison: displays a window showing a comparison between the original image and the processed image.
- recursive: whether or not to process the folder recursively. Filling in this argument is restricted to selecting a directory.
Finally, let's write the main code that uses the previously defined functions:
if __name__ == '__main__':
# Parsing command line arguments entered by user
args = parse_args()
# If File Path
if os.path.isfile(args['input_path']):
# Process a file
if filetype.is_image(args['input_path']):
ocr_img(
# if 'search_str' in (args.keys()) else None
img=None, input_file=args['input_path'], search_str=args['search_str'], highlight_readable_text=args['highlight_readable_text'], action=args['action'], show_comparison=args['show_comparison'], generate_output=args['generate_output']
)
else:
ocr_file(
input_file=args['input_path'], output_file=args['output_file'], search_str=args['search_str'] if 'search_str' in (args.keys()) else None, pages=args['pages'], highlight_readable_text=args['highlight_readable_text'], action=args['action'], show_comparison=args['show_comparison'], generate_output=args['generate_output']
)
# If Folder Path
elif os.path.isdir(args['input_path']):
# Process a folder
ocr_folder(
input_folder=args['input_path'], recursive=args['recursive'], search_str=args['search_str'] if 'search_str' in (args.keys()) else None, pages=args['pages'], action=args['action'], generate_output=args['generate_output']
)
Let's test our program:
$ python pdf_ocr.py
Output:
usage: pdf_ocr.py [-h] -i INPUT_PATH [-a {Highlight,Redact}] [-s SEARCH_STR] [-p PAGES] [-g GENERATE_OUTPUT]
Available Options
optional arguments:
-h, --help show this help message and exit
-i INPUT_PATH, --input_path INPUT_PATH
Enter the path of the file or the folder to process
-a {Highlight,Redact}, --action {Highlight,Redact}
Choose to highlight or to redact
-s SEARCH_STR, --search_str SEARCH_STR
Enter a valid search string
-p PAGES, --pages PAGES
Enter the pages to consider e.g.: (0,1)
-g GENERATE_OUTPUT, --generate_output GENERATE_OUTPUT
Generate content in a CSV file
Before exploring our test scenarios, note the following:
- To avoid PermissionError errors, close the input file before running this utility.
First, let's try inputting an image (you can get it here if you want to get the same result), without any PDF file involved:
$ python pdf_ocr.py -s "BERT" -a Highlight -i example-image-containing-text.jpg
The output will be as follows:
## Command Arguments #################################################
input_path:example-image-containing-text.jpg
action:Highlight
search_str:BERT
pages:None
generate_output:False
output_file:None
highlight_readable_text:False
show_comparison:False
######################################################################
## Summary ########################################################
File:example-image-containing-text.jpg
Total readable words:192
Total matches:3
Confidence score:89.89337547979804
###################################################################
And a new image appeared in the current directory:
You can pass -t or --highlight-readable-text to highlight all the detected text (in a different format, to distinguish the searched string from the rest).
You can also pass -c or --show-comparison to display the original image and the edited image in the same window.
Now that this works for images, let's try it with PDF files:
$ python pdf_ocr.py -s "BERT" -i image.pdf -o output.pdf --generate-output -a "Highlight"
image.pdf is simply a PDF file containing the image from the previous example (again, you can get it here).
This time we passed a PDF file to the -i argument, and output.pdf as the resulting PDF file (where the highlighting happens). The above command generates the following output:
## Command Arguments #################################################
input_path:image.pdf
action:Highlight
search_str:BERT
pages:None
generate_output:True
output_file:output.pdf
highlight_readable_text:False
show_comparison:False
######################################################################
## Summary ########################################################
File:image.pdf
Total pages:1
Processed pages:1
Total readable words:192.0
Total matches:3.0
Confidence score:83.1775128855722
Output file:output.pdf
Content file:image.csv
Pages Statistics:
page page_readable_items page_matches page_total_confidence
0 1.0 192.0 3.0 83.177513
###################################################################
The output.pdf file is produced after execution; it includes the same original PDF, but with the matched text highlighted. In addition, we now have statistics about our PDF file: a total of 192 words were detected, and 3 of them were matched by our search, with a confidence score of about 83.2%.
A CSV file is also generated; it contains the text detected in the image, one line per row.
There are other parameters we did not use in our examples; feel free to explore them. You can also pass an entire folder to the -i argument to scan a collection of PDF files.
Tesseract is well suited for scanning clean, clear documents. A poor-quality scan will produce poor OCR results. Normally, it does not give accurate results for images affected by artifacts such as partial occlusion, distorted perspective, and complex backgrounds.
Original article source: https://www.thepythoncode.com
Nowadays, medium and large companies have huge amounts of printed documents in daily use. Among them are invoices, receipts, corporate documents, reports, and media releases.
For these companies, using an OCR scanner can save a considerable amount of time while improving efficiency as well as accuracy.
Using optical character recognition (OCR) algorithms, computers can automatically analyze printed or handwritten documents and prepare text data in editable formats so that computers can process them efficiently. OCR systems transform a two-dimensional image of text, which could contain machine-printed or handwritten text, from its image representation into machine-readable text.
Generally, an OCR engine involves several steps required to train a machine learning algorithm for efficient problem solving with the help of optical character recognition.
The following steps, which may differ from one engine to another, are roughly what is needed to approach automatic character recognition:
In this tutorial, I am going to show you the following:
Please note that this tutorial is about extracting text from images inside PDF documents.
To get started, we need to use the following libraries:
Tesseract OCR: an open-source text-recognition engine available under the Apache 2.0 license; its development has been sponsored by Google since 2006. In 2006, Tesseract was considered one of the most accurate open-source OCR engines. You can use it directly, or use its API to extract printed text from images. The best part is that it supports a wide range of languages.
Installing the Tesseract engine is beyond the scope of this article. However, you need to follow the official Tesseract installation guide to install it on your operating system.
To validate the Tesseract setup, please run the following command and check the generated output:
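The validation command itself is missing from this excerpt; presumably it is the version check:

```shell
tesseract --version
```

If the installation succeeded, this prints the Tesseract version together with the imaging libraries it was built against.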
Python-tesseract: a Python wrapper for Google's Tesseract OCR engine. It is also useful as a standalone invocation script for Tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others.
OpenCV: an open-source Python library for computer vision, machine learning, and image processing. OpenCV supports a wide variety of programming languages such as Python, C++, Java, etc. It can process images and videos to identify objects, faces, or even a person's handwriting.
PyMuPDF: MuPDF is a highly versatile, customizable PDF, XPS, and eBook interpreter solution that can be used in a wide range of applications as a PDF renderer, viewer, or toolkit. PyMuPDF is a Python binding for MuPDF. It is a lightweight PDF and XPS viewer.
NumPy: a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays. It is the fundamental package for scientific computing with Python. In addition, NumPy can also be used as an efficient multidimensional container for generic data.
Pillow: builds on top of PIL (Python Image Library). It is an essential module for image processing in Python.
Pandas: an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data-analysis tools for the Python programming language.
filetype: a small, dependency-free Python package to infer file type and MIME type.
This tutorial aims to develop a lightweight, command-line-based utility to extract, redact, or highlight text contained in an image, in a scanned PDF file, or in a folder containing a collection of PDF files.
First, let's install the requirements:
$ pip install Filetype==1.0.7 numpy==1.19.4 opencv-python==4.4.0.46 pandas==1.1.4 Pillow==8.0.1 PyMuPDF==1.18.9 pytesseract==0.3.7
Let's start by importing the necessary libraries:
import os
import re
import argparse
import pytesseract
from pytesseract import Output
import cv2
import numpy as np
import fitz
from io import BytesIO
from PIL import Image
import pandas as pd
import filetype
# Path Of The Tesseract OCR engine
TESSERACT_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"  # raw string: "\t" would otherwise become a tab
# Include tesseract executable
pytesseract.pytesseract.tesseract_cmd = TESSERACT_PATH
TESSERACT_PATH is where the Tesseract executable is located. Obviously, you need to change it for your case.
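If Tesseract is on your PATH (the usual case on Linux and macOS), the executable can be located with the standard library instead of hard-coding it; a small sketch:

```python
import shutil

# shutil.which returns the full path of the executable if it is on PATH, else None
tesseract_cmd = shutil.which('tesseract')
if tesseract_cmd is None:
    # Fall back to a hard-coded location, as in the snippet above (adjust for your machine)
    tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
print(type(tesseract_cmd).__name__)  # str
```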
def pix2np(pix):
"""
Converts a pixmap buffer into a numpy array
"""
# pix.samples = sequence of bytes of the image pixels like RGBA
#pix.h = height in pixels
#pix.w = width in pixels
# pix.n = number of components per pixel (depends on the colorspace and alpha)
im = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
pix.h, pix.w, pix.n)
try:
im = np.ascontiguousarray(im[..., [2, 1, 0]]) # RGB To BGR
except IndexError:
# Convert Gray to RGB
im = cv2.cvtColor(im, cv2.COLOR_GRAY2RGB)
im = np.ascontiguousarray(im[..., [2, 1, 0]]) # RGB To BGR
return im
This function converts a pixmap buffer representing a screenshot taken with the PyMuPDF library into a NumPy array.
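The reshape step can be verified without PyMuPDF by faking an object with the pixmap attributes the function reads; the 2x2 RGB buffer below is illustrative:

```python
import numpy as np

class FakePix:
    """Mimics the PyMuPDF Pixmap attributes that pix2np() reads."""
    def __init__(self, samples, h, w, n):
        self.samples, self.h, self.w, self.n = samples, h, w, n

# 2x2 image, 3 components per pixel (RGB), 12 bytes total
pix = FakePix(bytes(range(12)), h=2, w=2, n=3)
im = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.h, pix.w, pix.n)
im = np.ascontiguousarray(im[..., [2, 1, 0]])  # swap RGB -> BGR for OpenCV
print(im.shape, im[0, 0].tolist())  # (2, 2, 3) [2, 1, 0]
```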
To improve Tesseract's accuracy, let's define some preprocessing functions using OpenCV:
# Image pre-processing functions to improve output accuracy
# Convert to grayscale
def grayscale(img):
return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Remove noise
def remove_noise(img):
return cv2.medianBlur(img, 5)
# Thresholding
def threshold(img):
# return cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
return cv2.threshold(img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
# dilation
def dilate(img):
kernel = np.ones((5, 5), np.uint8)
return cv2.dilate(img, kernel, iterations=1)
# erosion
def erode(img):
kernel = np.ones((5, 5), np.uint8)
return cv2.erode(img, kernel, iterations=1)
# opening -- erosion followed by a dilation
def opening(img):
kernel = np.ones((5, 5), np.uint8)
return cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)
# canny edge detection
def canny(img):
return cv2.Canny(img, 100, 200)
# skew correction
def deskew(img):
coords = np.column_stack(np.where(img > 0))
angle = cv2.minAreaRect(coords)[-1]
if angle < -45:
angle = -(90 + angle)
else:
angle = -angle
(h, w) = img.shape[:2]
center = (w//2, h//2)
M = cv2.getRotationMatrix2D(center, angle, 1.0)
rotated = cv2.warpAffine(
img, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
return rotated
# template matching
def match_template(img, template):
return cv2.matchTemplate(img, template, cv2.TM_CCOEFF_NORMED)
def convert_img2bin(img):
"""
Pre-processes the image and generates a binary output
"""
# Convert the image into a grayscale image
output_img = grayscale(img)
# Invert the grayscale image by flipping pixel values.
# All pixels that are greater than 0 are set to 0 and all pixels that are equal to 0 are set to 255
output_img = cv2.bitwise_not(output_img)
# Convert the image to binary by thresholding, to show a clear separation between white and black pixels.
output_img = threshold(output_img)
return output_img
We have defined functions for many preprocessing tasks, including converting images to grayscale, flipping pixel values, separating white and black pixels, and much more.
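The convert_img2bin() pipeline (grayscale, invert, threshold) relies on OpenCV; the same idea can be sketched with NumPy alone. Note that this uses a fixed threshold of 128 instead of Otsu's method, so it is an approximation of the real pipeline:

```python
import numpy as np

def to_binary(img: np.ndarray, thresh: int = 128) -> np.ndarray:
    """Grayscale -> invert -> fixed threshold; a NumPy-only stand-in for convert_img2bin()."""
    # Luminance-weighted grayscale (img is H x W x 3, BGR order as OpenCV loads it)
    gray = 0.114 * img[..., 0] + 0.587 * img[..., 1] + 0.299 * img[..., 2]
    inverted = 255 - gray  # dark text becomes bright
    return np.where(inverted > thresh, 255, 0).astype(np.uint8)

# White background with one "dark text" pixel
img = np.full((2, 2, 3), 255, dtype=np.uint8)
img[0, 0] = (10, 10, 10)
bin_img = to_binary(img)
print(sorted(np.unique(bin_img).tolist()))  # [0, 255]
```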
Next, let's define a function for displaying an image:
def display_img(title, img):
"""Displays an image on screen and maintains the output until the user presses a key"""
cv2.namedWindow('img', cv2.WINDOW_NORMAL)
cv2.setWindowTitle('img', title)
cv2.resizeWindow('img', 1200, 900)
# Display Image on screen
cv2.imshow('img', img)
# Maintain the output until the user presses a key
cv2.waitKey(0)
# Destroy windows when user presses a key
cv2.destroyAllWindows()
The display_img() function displays an image on screen in a window whose title is set to the title parameter, and keeps this window open until the user presses a key on the keyboard.
def generate_ss_text(ss_details):
"""Loops through the captured text of an image and arranges this text line by line.
This function depends on the image layout."""
# Arrange the captured text after scanning the page
parse_text = []
word_list = []
last_word = ''
# Loop through the captured text of the entire page
for word in ss_details['text']:
# If the word captured is not empty
if word != '':
# Add it to the line word list
word_list.append(word)
last_word = word
if (last_word != '' and word == '') or (word == ss_details['text'][-1]):
parse_text.append(word_list)
word_list = []
return parse_text
The function above loops through the captured text of an image and arranges the grabbed text line by line. This depends on the image layout and may require tweaking for some image formats.
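The grouping behavior is easy to check with a hand-made dict shaped like Tesseract's Output.DICT (only the 'text' key matters for this function):

```python
def generate_ss_text(ss_details):
    """Groups the captured words into lines, using empty strings as line breaks
    (same logic as the function above)."""
    parse_text, word_list, last_word = [], [], ''
    for word in ss_details['text']:
        if word != '':
            word_list.append(word)
            last_word = word
        if (last_word != '' and word == '') or (word == ss_details['text'][-1]):
            parse_text.append(word_list)
            word_list = []
    return parse_text

details = {'text': ['Hello', 'world', '', 'second', 'line']}
print(generate_ss_text(details))  # [['Hello', 'world'], ['second', 'line']]
```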
Next, let's define a function for searching for text using regular expressions:
def search_for_text(ss_details, search_str):
"""Search for the search string within the image content"""
# Find all matches within one page
results = re.findall(search_str, ss_details['text'], re.IGNORECASE)
# In case multiple matches within one page
for result in results:
yield result
We will use this function to search for specific text within the grabbed content of an image. It returns a generator of the found matches.
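A quick check of the generator behavior, using a minimal details-like dict whose 'text' entry is a single string:

```python
import re

def search_for_text(ss_details, search_str):
    """Yields every case-insensitive match of search_str in the captured text
    (same logic as the function above)."""
    for result in re.findall(search_str, ss_details['text'], re.IGNORECASE):
        yield result

details = {'text': 'BERT is a model. We fine-tune bert on our data.'}
matches = list(search_for_text(details, 'BERT'))
print(matches)  # ['BERT', 'bert']
```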
def save_page_content(pdfContent, page_id, page_data):
"""Appends the content of a scanned page, line by line, to a pandas DataFrame."""
if page_data:
for idx, line in enumerate(page_data, 1):
line = ' '.join(line)
pdfContent = pdfContent.append(
{'page': page_id, 'line_id': idx, 'line': line}, ignore_index=True
)
return pdfContent
The save_page_content() function appends the grabbed content of an image, line by line, to the pdfContent pandas DataFrame after scanning it.
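With the pinned pandas 1.1.4 the DataFrame.append() call above works, but append() was removed in pandas 2.0. An equivalent sketch using pd.concat(), should you run on a newer pandas:

```python
import pandas as pd

def save_page_content(pdfContent, page_id, page_data):
    """Appends scanned lines to the DataFrame using pd.concat (pandas 2.x safe)."""
    if page_data:
        rows = [{'page': page_id, 'line_id': idx, 'line': ' '.join(line)}
                for idx, line in enumerate(page_data, 1)]
        pdfContent = pd.concat([pdfContent, pd.DataFrame(rows)], ignore_index=True)
    return pdfContent

df = pd.DataFrame(columns=['page', 'line_id', 'line'])
df = save_page_content(df, 1, [['Hello', 'world'], ['second', 'line']])
print(len(df), df.loc[0, 'line'])  # 2 Hello world
```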
Now, let's create a function to save the resulting DataFrame to a CSV file:
def save_file_content(pdfContent, input_file):
"""Outputs the content of the pandas DataFrame to a CSV file having the same path as the input_file
but with different extension (.csv)"""
content_file = os.path.join(os.path.dirname(input_file), os.path.splitext(
os.path.basename(input_file))[0] + ".csv")
pdfContent.to_csv(content_file, sep=',', index=False)
return content_file
Next, let's write a function that calculates the confidence score of the text grabbed from the scanned image:
def calculate_ss_confidence(ss_details: dict):
"""Calculate the confidence score of the text grabbed from the scanned image."""
# page_num --> Page number of the detected text or item
# block_num --> Block number of the detected text or item
# par_num --> Paragraph number of the detected text or item
# line_num --> Line number of the detected text or item
# Convert the dict to dataFrame
df = pd.DataFrame.from_dict(ss_details)
# Convert the field conf (confidence) to numeric
df['conf'] = pd.to_numeric(df['conf'], errors='coerce')
# Eliminate records with negative confidence
df = df[df.conf != -1]
# Calculate the mean confidence by page
conf = df.groupby(['page_num'])['conf'].mean().tolist()
return conf[0]
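The computation can be checked with a minimal details dict; note how the -1 confidence entries (non-text regions reported by Tesseract) are dropped before averaging:

```python
import pandas as pd

details = {
    'page_num': [1, 1, 1, 1],
    'conf': ['-1', '95', '85', '90'],  # Tesseract reports conf as strings; -1 means no text
}
df = pd.DataFrame.from_dict(details)
df['conf'] = pd.to_numeric(df['conf'], errors='coerce')
df = df[df.conf != -1]  # drop non-text records
conf = df.groupby(['page_num'])['conf'].mean().tolist()
print(conf[0])  # 90.0
```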
Moving on to the main function: scanning an image:
def ocr_img(
img: np.array, input_file: str, search_str: str,
highlight_readable_text: bool = False, action: str = 'Highlight',
show_comparison: bool = False, generate_output: bool = True):
"""Scans an image buffer or an image file.
Pre-processes the image.
Calls the Tesseract engine with pre-defined parameters.
Calculates the confidence score of the image grabbed content.
Draws a green rectangle around readable text items having a confidence score > 30.
Searches for a specific text.
Highlight or redact found matches of the searched text.
Displays a window showing readable text fields or the highlighted or redacted text.
Generates the text content of the image.
Prints a summary to the console."""
# If image source file is inputted as a parameter
if input_file:
# Reading image using opencv
img = cv2.imread(input_file)
# Preserve a copy of this image for comparison purposes
initial_img = img.copy()
highlighted_img = img.copy()
# Convert image to binary
bin_img = convert_img2bin(img)
# Calling Tesseract
# Tesseract Configuration parameters
# oem --> OCR engine mode = 3 >> Legacy + LSTM mode only (LSTM neural net mode works the best)
# psm --> page segmentation mode = 6 >> Assume as single uniform block of text (How a page of text can be analyzed)
config_param = r'--oem 3 --psm 6'
# Feeding image to tesseract
details = pytesseract.image_to_data(
bin_img, output_type=Output.DICT, config=config_param, lang='eng')
# The details dictionary contains the information of the input image
# such as detected text, region, position, information, height, width, confidence score.
ss_confidence = calculate_ss_confidence(details)
boxed_img = None
# Total readable items
ss_readable_items = 0
# Total matches found
ss_matches = 0
for seq in range(len(details['text'])):
# Consider only text fields with confidence score > 30 (text is readable)
if float(details['conf'][seq]) > 30.0:
ss_readable_items += 1
# Draws a green rectangle around readable text items having a confidence score > 30
if highlight_readable_text:
(x, y, w, h) = (details['left'][seq], details['top']
[seq], details['width'][seq], details['height'][seq])
boxed_img = cv2.rectangle(
img, (x, y), (x+w, y+h), (0, 255, 0), 2)
# Searches for the string
if search_str:
results = re.findall(
search_str, details['text'][seq], re.IGNORECASE)
for result in results:
ss_matches += 1
if action:
# Draw a red rectangle around the searchable text
(x, y, w, h) = (details['left'][seq], details['top']
[seq], details['width'][seq], details['height'][seq])
# Details of the rectangle
# Starting coordinate representing the top left corner of the rectangle
start_point = (x, y)
# Ending coordinate representing the bottom right corner of the rectangle
end_point = (x + w, y + h)
#Color in BGR -- Blue, Green, Red
if action == "Highlight":
color = (0, 255, 255) # Yellow
elif action == "Redact":
color = (0, 0, 0) # Black
# Thickness in px (-1 will fill the entire shape)
thickness = -1
boxed_img = cv2.rectangle(
img, start_point, end_point, color, thickness)
if ss_readable_items > 0 and highlight_readable_text and not (ss_matches > 0 and action in ("Highlight", "Redact")):
highlighted_img = boxed_img.copy()
# Highlight found matches of the search string
if ss_matches > 0 and action == "Highlight":
cv2.addWeighted(boxed_img, 0.4, highlighted_img,
1 - 0.4, 0, highlighted_img)
# Redact found matches of the search string
elif ss_matches > 0 and action == "Redact":
highlighted_img = boxed_img.copy()
#cv2.addWeighted(boxed_img, 1, highlighted_img, 0, 0, highlighted_img)
# save the image
cv2.imwrite("highlighted-text-image.jpg", highlighted_img)
# Displays window showing readable text fields or the highlighted or redacted data
if show_comparison and (highlight_readable_text or action):
title = input_file if input_file else 'Compare'
conc_img = cv2.hconcat([initial_img, highlighted_img])
display_img(title, conc_img)
# Generates the text content of the image
output_data = None
if generate_output and details:
output_data = generate_ss_text(details)
# Prints a summary to the console
if input_file:
summary = {
"File": input_file, "Total readable words": ss_readable_items, "Total matches": ss_matches, "Confidence score": ss_confidence
}
# Printing Summary
print("## Summary ########################################################")
print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
print("###################################################################")
return highlighted_img, ss_readable_items, ss_matches, ss_confidence, output_data
# pass image into pytesseract module
# pytesseract is trained in many languages
#config_param = r'--oem 3 --psm 6'
#details = pytesseract.image_to_data(img,config=config_param,lang='eng')
# print(details)
# return details
The code above performs the sequence of steps outlined in its docstring.
def image_to_byte_array(image: Image):
"""
Converts an image into a byte array
"""
imgByteArr = BytesIO()
image.save(imgByteArr, format=image.format if image.format else 'JPEG')
imgByteArr = imgByteArr.getvalue()
return imgByteArr
def ocr_file(**kwargs):
"""Opens the input PDF File.
Opens a memory buffer for storing the output PDF file.
Creates a DataFrame for storing pages statistics
Iterates throughout the chosen pages of the input PDF file
Grabs a screen-shot of the selected PDF page.
Converts the screen-shot pix to a numpy array
Scans the grabbed screen-shot.
Collects the statistics of the screen-shot(page).
Saves the content of the screen-shot(page).
Adds the updated screen-shot (Highlighted, Redacted) to the output file.
Saves the whole content of the PDF file.
Saves the output PDF file if required.
Prints a summary to the console."""
input_file = kwargs.get('input_file')
output_file = kwargs.get('output_file')
search_str = kwargs.get('search_str')
pages = kwargs.get('pages')
highlight_readable_text = kwargs.get('highlight_readable_text')
action = kwargs.get('action')
show_comparison = kwargs.get('show_comparison')
generate_output = kwargs.get('generate_output')
# Opens the input PDF file
pdfIn = fitz.open(input_file)
# Opens a memory buffer for storing the output PDF file.
pdfOut = fitz.open()
# Creates an empty DataFrame for storing pages statistics
dfResult = pd.DataFrame(
columns=['page', 'page_readable_items', 'page_matches', 'page_total_confidence'])
# Creates an empty DataFrame for storing file content
if generate_output:
pdfContent = pd.DataFrame(columns=['page', 'line_id', 'line'])
# Iterate throughout the pages of the input file
for pg in range(pdfIn.pageCount):
if str(pages) != str(None):
if str(pg) not in str(pages):
continue
# Select a page
page = pdfIn[pg]
# Rotation angle
rotate = int(0)
# PDF Page is converted into a whole picture 1056*816 and then for each picture a screenshot is taken.
# zoom = 1.33333333 -----> Image size = 1056*816
# zoom = 2 ---> 2 * Default Resolution (text is clear, image text is hard to read) = filesize small / Image size = 1584*1224
# zoom = 4 ---> 4 * Default Resolution (text is clear, image text is barely readable) = filesize large
# zoom = 8 ---> 8 * Default Resolution (text is clear, image text is readable) = filesize large
zoom_x = 2
zoom_y = 2
# The zoom factor is equal to 2 in order to make text clear
# Pre-rotate is to rotate if needed.
mat = fitz.Matrix(zoom_x, zoom_y).preRotate(rotate)
# To capture a specific part of the PDF page
# rect = page.rect #page size
# mp = rect.tl + (rect.bl - (0.75)/zoom_x) #rectangular area 56 = 75/1.3333
# clip = fitz.Rect(mp,rect.br) #The area to capture
# pix = page.getPixmap(matrix=mat, alpha=False,clip=clip)
# Get a screen-shot of the PDF page
# Colorspace -> represents the color space of the pixmap (csRGB, csGRAY, csCMYK)
# alpha -> Transparency indicator
pix = page.getPixmap(matrix=mat, alpha=False, colorspace="csGRAY")
# convert the screen-shot pix to numpy array
img = pix2np(pix)
# Erode image to omit or thin the boundaries of the bright area of the image
# We apply Erosion on binary images.
#kernel = np.ones((2,2) , np.uint8)
#img = cv2.erode(img,kernel,iterations=1)
upd_np_array, pg_readable_items, pg_matches, pg_total_confidence, pg_output_data \
= ocr_img(img=img, input_file=None, search_str=search_str, highlight_readable_text=highlight_readable_text # False
, action=action # 'Redact'
, show_comparison=show_comparison # True
, generate_output=generate_output # False
)
# Collects the statistics of the page
dfResult = dfResult.append({'page': (pg+1), 'page_readable_items': pg_readable_items,
'page_matches': pg_matches, 'page_total_confidence': pg_total_confidence}, ignore_index=True)
if generate_output:
pdfContent = save_page_content(
pdfContent=pdfContent, page_id=(pg+1), page_data=pg_output_data)
# Convert the numpy array to image object with mode = RGB
#upd_img = Image.fromarray(np.uint8(upd_np_array)).convert('RGB')
upd_img = Image.fromarray(upd_np_array[..., ::-1])
# Convert the image to byte array
upd_array = image_to_byte_array(upd_img)
# Get Page Size
"""
#To check whether initial page is portrait or landscape
if page.rect.width > page.rect.height:
fmt = fitz.PaperRect("a4-1")
else:
fmt = fitz.PaperRect("a4")
#pno = -1 -> Insert after last page
pageo = pdfOut.newPage(pno = -1, width = fmt.width, height = fmt.height)
"""
pageo = pdfOut.newPage(
pno=-1, width=page.rect.width, height=page.rect.height)
pageo.insertImage(page.rect, stream=upd_array)
#pageo.insertImage(page.rect, stream=upd_img.tobytes())
#pageo.showPDFpage(pageo.rect, pdfDoc, page.number)
content_file = None
if generate_output:
content_file = save_file_content(
pdfContent=pdfContent, input_file=input_file)
summary = {
"File": input_file, "Total pages": pdfIn.pageCount,
"Processed pages": dfResult['page'].count(), "Total readable words": dfResult['page_readable_items'].sum(),
"Total matches": dfResult['page_matches'].sum(), "Confidence score": dfResult['page_total_confidence'].mean(),
"Output file": output_file, "Content file": content_file
}
# Printing Summary
print("## Summary ########################################################")
print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
print("\nPages Statistics:")
print(dfResult, sep='\n')
print("###################################################################")
pdfIn.close()
if output_file:
pdfOut.save(output_file)
pdfOut.close()
The image_to_byte_array() function converts an image into a byte array.
The ocr_file() function follows the steps listed in its docstring: it opens the input PDF, grabs a screenshot of each selected page, scans it, collects per-page statistics, and adds the updated (highlighted or redacted) page to the output PDF, which it saves if required.
Let's add another function for processing a folder that contains multiple PDF files:
def ocr_folder(**kwargs):
"""Scans all PDF Files within a specified path"""
input_folder = kwargs.get('input_folder')
# Run in recursive mode
recursive = kwargs.get('recursive')
search_str = kwargs.get('search_str')
pages = kwargs.get('pages')
action = kwargs.get('action')
generate_output = kwargs.get('generate_output')
# Loop though the files within the input folder.
for foldername, dirs, filenames in os.walk(input_folder):
for filename in filenames:
# Check if pdf file
if not filename.endswith('.pdf'):
continue
# PDF File found
inp_pdf_file = os.path.join(foldername, filename)
print("Processing file =", inp_pdf_file)
output_file = None
if search_str:
# Generate an output file
output_file = os.path.join(os.path.dirname(
inp_pdf_file), 'ocr_' + os.path.basename(inp_pdf_file))
ocr_file(
input_file=inp_pdf_file, output_file=output_file, search_str=search_str, pages=pages, highlight_readable_text=False, action=action, show_comparison=False, generate_output=generate_output
)
if not recursive:
break
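The traversal above can be sketched in isolation: walk the tree with os.walk(), collect .pdf files, and break after the first level unless recursive is set. A minimal stand-in for ocr_folder(), exercised on a throwaway temp directory:

```python
import os
import tempfile

def list_pdfs(input_folder, recursive=False):
    """Collect .pdf paths the same way ocr_folder() walks a tree."""
    found = []
    for foldername, dirs, filenames in os.walk(input_folder):
        for filename in filenames:
            # Check if pdf file
            if filename.endswith('.pdf'):
                found.append(os.path.join(foldername, filename))
        # Stop after the top-level folder unless running recursively
        if not recursive:
            break
    return found

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for name in ("a.pdf", "b.txt", os.path.join("sub", "c.pdf")):
        open(os.path.join(root, name), "w").close()
    top_level = list_pdfs(root)                   # a.pdf only
    all_levels = list_pdfs(root, recursive=True)  # a.pdf and sub/c.pdf
```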
This function is meant to scan the PDF files contained within a specific folder. Depending on the value of the recursive parameter, it loops through the files of the given folder either recursively or not, and processes these files one by one.
It accepts the following parameters:
input_folder: the path of the folder containing the PDF files to process.
search_str: the text to search for in order to manipulate it.
recursive: whether to run this process recursively, looping through the subfolders, or not.
action: the action to perform among the following: Highlight, Redact.
pages: the pages to consider.
generate_output: whether to save the content of the input PDF file to a CSV file or not.
Before we finish, let's define useful functions for parsing command-line arguments:
def is_valid_path(path):
"""Validates the path inputted and checks whether it is a file path or a folder path"""
if not path:
raise ValueError("Invalid Path")
if os.path.isfile(path):
return path
elif os.path.isdir(path):
return path
else:
raise ValueError(f"Invalid Path {path}")
def parse_args():
"""Get user command line parameters"""
parser = argparse.ArgumentParser(description="Available Options")
parser.add_argument('-i', '--input-path', type=is_valid_path,
required=True, help="Enter the path of the file or the folder to process")
parser.add_argument('-a', '--action', choices=[
'Highlight', 'Redact'], type=str, help="Choose to highlight or to redact")
parser.add_argument('-s', '--search-str', dest='search_str',
type=str, help="Enter a valid search string")
parser.add_argument('-p', '--pages', dest='pages', type=tuple,
help="Enter the pages to consider in the PDF file, e.g. (0,1)")
parser.add_argument("-g", "--generate-output", action="store_true", help="Generate text content in a CSV file")
path = parser.parse_known_args()[0].input_path
if os.path.isfile(path):
parser.add_argument('-o', '--output_file', dest='output_file',
type=str, help="Enter a valid output file")
parser.add_argument("-t", "--highlight-readable-text", action="store_true", help="Highlight readable text in the generated image")
parser.add_argument("-c", "--show-comparison", action="store_true", help="Show comparison between captured image and the generated image")
if os.path.isdir(path):
parser.add_argument("-r", "--recursive", action="store_true", help="Whether to process the directory recursively")
# To Parse The Command Line Arguments
args = vars(parser.parse_args())
# To Display The Command Line Arguments
print("## Command Arguments #################################################")
print("\n".join("{}:{}".format(i, j) for i, j in args.items()))
print("######################################################################")
return args
The is_valid_path() function validates a path entered as a parameter and checks whether it is a file path or a directory path.
The parse_args() function defines and sets the appropriate constraints for the user's command-line arguments when running this utility.
Below are explanations of all the parameters:
input_path: a required parameter for entering the path of the file or folder to process; it is bound to the is_valid_path() function defined earlier.
action: the action to perform, chosen from a list of predefined options to avoid an invalid selection.
search_str: the text to search for in order to manipulate it.
pages: the pages to consider when processing a PDF file.
generate_output: whether to export the grabbed content of the input file, whether an image or a PDF, to a CSV file or not.
output_file: the path of the output file. Filling in this argument is constrained to selecting a file as input, not a directory.
highlight_readable_text: draw green rectangles around readable text fields having a confidence score greater than 30.
show_comparison: display a window showing a comparison between the original image and the processed image.
recursive: whether to process a folder recursively or not. Filling in this argument is constrained to selecting a directory.
Finally, let's write the main code that uses the previously defined functions:
if __name__ == '__main__':
# Parsing command line arguments entered by user
args = parse_args()
# If File Path
if os.path.isfile(args['input_path']):
# Process a file
if filetype.is_image(args['input_path']):
ocr_img(
# if 'search_str' in (args.keys()) else None
img=None, input_file=args['input_path'], search_str=args['search_str'], highlight_readable_text=args['highlight_readable_text'], action=args['action'], show_comparison=args['show_comparison'], generate_output=args['generate_output']
)
else:
ocr_file(
input_file=args['input_path'], output_file=args['output_file'], search_str=args['search_str'] if 'search_str' in (args.keys()) else None, pages=args['pages'], highlight_readable_text=args['highlight_readable_text'], action=args['action'], show_comparison=args['show_comparison'], generate_output=args['generate_output']
)
# If Folder Path
elif os.path.isdir(args['input_path']):
# Process a folder
ocr_folder(
input_folder=args['input_path'], recursive=args['recursive'], search_str=args['search_str'] if 'search_str' in (args.keys()) else None, pages=args['pages'], action=args['action'], generate_output=args['generate_output']
)
Let's test our program:
$ python pdf_ocr.py
Output:
usage: pdf_ocr.py [-h] -i INPUT_PATH [-a {Highlight,Redact}] [-s SEARCH_STR] [-p PAGES] [-g]
Available Options
optional arguments:
-h, --help show this help message and exit
-i INPUT_PATH, --input-path INPUT_PATH
Enter the path of the file or the folder to process
-a {Highlight,Redact}, --action {Highlight,Redact}
Choose to highlight or to redact
-s SEARCH_STR, --search-str SEARCH_STR
Enter a valid search string
-p PAGES, --pages PAGES
Enter the pages to consider in the PDF file, e.g. (0,1)
-g, --generate-output
Generate text content in a CSV file
Note the following before exploring our test scenarios: to avoid a PermissionError, close the input file before running this utility.
First, let's try inputting an image (you can grab it here if you want to get the same output), with no PDF file involved:
$ python pdf_ocr.py -s "BERT" -a Highlight -i example-image-containing-text.jpg
The following will be the output:
## Command Arguments #################################################
input_path:example-image-containing-text.jpg
action:Highlight
search_str:BERT
pages:None
generate_output:False
output_file:None
highlight_readable_text:False
show_comparison:False
######################################################################
## Summary ########################################################
File:example-image-containing-text.jpg
Total readable words:192
Total matches:3
Confidence score:89.89337547979804
###################################################################
And a new image has appeared in the current directory:
You can pass -t or --highlight-readable-text to highlight all the detected text (with a different format, to distinguish the search string from the rest).
You can also pass -c or --show-comparison to display the original image and the edited image in the same window.
That works for images; now let's try it with PDF files:
$ python pdf_ocr.py -s "BERT" -i image.pdf -o output.pdf --generate-output -a "Highlight"
image.pdf is a simple PDF file containing the image from the previous example (again, you can grab it here).
This time we passed a PDF file to the -i argument, and output.pdf as the resulting PDF file (where all the highlighting happens). The command above generates the following output:
## Command Arguments #################################################
input_path:image.pdf
action:Highlight
search_str:BERT
pages:None
generate_output:True
output_file:output.pdf
highlight_readable_text:False
show_comparison:False
######################################################################
## Summary ########################################################
File:image.pdf
Total pages:1
Processed pages:1
Total readable words:192.0
Total matches:3.0
Confidence score:83.1775128855722
Output file:output.pdf
Content file:image.csv
Pages Statistics:
page page_readable_items page_matches page_total_confidence
0 1.0 192.0 3.0 83.177513
###################################################################
The output.pdf file is produced after execution; it contains the same original PDF, but with the searched text highlighted. In addition, we now have statistics about our PDF file: 192 words in total were detected, and 3 were matched by our search with a confidence of about 83.2%.
A CSV file is also generated that contains the recognized text from the image, one line per row.
There are other parameters we didn't use in our examples; feel free to explore them. You can also pass an entire folder to the -i argument to scan a collection of PDF files.
Tesseract is perfect for scanning clean and clear documents. A poor-quality scan may produce poor OCR results. Normally, it does not deliver accurate results for images affected by artifacts such as partial occlusion, distorted perspective, and complex backgrounds.
Source of the original article: https://www.thepythoncode.com
1648829847
Nowadays, medium and large companies have huge amounts of printed documents in daily use. Among them are invoices, receipts, corporate documents, reports, and press releases.
For these companies, using an OCR scanner can save a considerable amount of time while improving efficiency and accuracy.
Optical character recognition (OCR) algorithms allow computers to analyze printed or handwritten documents automatically and prepare text data into editable formats, so that computers can process them efficiently. OCR systems transform a two-dimensional image of text, which may contain machine-printed or handwritten text, from its image representation into machine-readable text.
Generally, an OCR engine involves several steps required to train a machine learning algorithm for efficient problem solving with the help of optical character recognition.
The steps required to approach automatic character recognition are roughly the same, although they may differ from one engine to another.
In this tutorial, I will show you how to build such a tool step by step.
Note that this tutorial is about extracting text from images within PDF documents.
To get started, we need to use the following libraries:
Tesseract OCR: an open-source text recognition engine available under the Apache 2.0 license whose development has been sponsored by Google since 2006. In 2006, Tesseract was considered one of the most accurate open-source OCR engines. You can use it directly, or use its API to extract the printed text from images. The best part is that it supports an extensive variety of languages.
Installing the Tesseract engine is outside the scope of this article. However, you need to follow the official Tesseract installation guide to install it on your operating system.
To validate the Tesseract setup, run the following command and check the generated output:
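The command itself was lost in this copy of the article; with a standard install, checking the version banner is typically enough:

```shell
$ tesseract --version
```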
Python-tesseract: a Python wrapper for Google's Tesseract-OCR engine. It is also useful as a standalone invocation script for tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others.
OpenCV: an open-source Python library for computer vision, machine learning, and image processing. OpenCV supports a wide variety of programming languages such as Python, C++, Java, etc. It can process images and videos to identify objects, faces, or even a human's handwriting.
PyMuPDF: MuPDF is a highly versatile, customizable PDF, XPS, and eBook interpreter solution that can be used across a wide range of applications as a PDF renderer, viewer, or toolkit. PyMuPDF is a Python binding for MuPDF. It is a lightweight PDF and XPS viewer.
NumPy: a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays. It is the fundamental package for scientific computing with Python. Besides that, NumPy can also be used as an efficient multidimensional container of generic data.
Pillow: built on top of PIL (Python Imaging Library). It is an essential module for image processing in Python.
Pandas: an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Filetype: a small, dependency-free Python package to infer file and MIME types.
This tutorial aims to develop a lightweight command-line utility to extract, redact, or highlight text contained in an image or a scanned PDF file, or in a folder containing a collection of PDF files.
To get started, let's install the requirements:
$ pip install Filetype==1.0.7 numpy==1.19.4 opencv-python==4.4.0.46 pandas==1.1.4 Pillow==8.0.1 PyMuPDF==1.18.9 pytesseract==0.3.7
Let's start by importing the necessary libraries:
import os
import re
import argparse
import pytesseract
from pytesseract import Output
import cv2
import numpy as np
import fitz
from io import BytesIO
from PIL import Image
import pandas as pd
import filetype
# Path of the Tesseract OCR engine (raw string, so "\t" is not read as a tab)
TESSERACT_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Include tesseract executable
pytesseract.pytesseract.tesseract_cmd = TESSERACT_PATH
TESSERACT_PATH is where the Tesseract executable is located. Obviously, you need to change it for your case.
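If you'd rather not hard-code a Windows-only location, a small cross-platform sketch can pick the path; the Windows path below is just the assumed default install directory:

```python
import platform
import shutil

def find_tesseract():
    """Return a best-guess path to the tesseract executable."""
    if platform.system() == "Windows":
        # Assumed default install location on Windows
        return r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    # On Linux/macOS, look it up on PATH, falling back to a common location
    return shutil.which("tesseract") or "/usr/local/bin/tesseract"

tesseract_cmd = find_tesseract()
```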
def pix2np(pix):
"""
Converts a pixmap buffer into a numpy array
"""
# pix.samples = sequence of bytes of the image pixels like RGBA
#pix.h = height in pixels
#pix.w = width in pixels
# pix.n = number of components per pixel (depends on the colorspace and alpha)
im = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
pix.h, pix.w, pix.n)
try:
im = np.ascontiguousarray(im[..., [2, 1, 0]]) # RGB To BGR
except IndexError:
# Convert Gray to RGB
im = cv2.cvtColor(im, cv2.COLOR_GRAY2RGB)
im = np.ascontiguousarray(im[..., [2, 1, 0]]) # RGB To BGR
return im
This function converts a pixmap buffer representing a screenshot taken with the PyMuPDF library into a NumPy array.
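A toy buffer makes the reshape and the RGB-to-BGR swap easy to see; the 18 bytes below just stand in for pix.samples:

```python
import numpy as np

# Fake 2x3 pixmap with 3 components per pixel (h=2, w=3, n=3)
h, w, n = 2, 3, 3
samples = bytes(range(h * w * n))
# Same reshape as pix2np()
im = np.frombuffer(samples, dtype=np.uint8).reshape(h, w, n)
# Swap channel order RGB -> BGR for OpenCV, as pix2np() does
bgr = np.ascontiguousarray(im[..., [2, 1, 0]])
```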
To improve Tesseract's accuracy, let's define some preprocessing functions using OpenCV:
# Image pre-processing functions to improve output accuracy
# Convert to grayscale
def grayscale(img):
return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Remove noise
def remove_noise(img):
return cv2.medianBlur(img, 5)
# Thresholding
def threshold(img):
# return cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
return cv2.threshold(img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
# dilation
def dilate(img):
kernel = np.ones((5, 5), np.uint8)
return cv2.dilate(img, kernel, iterations=1)
# erosion
def erode(img):
kernel = np.ones((5, 5), np.uint8)
return cv2.erode(img, kernel, iterations=1)
# opening -- erosion followed by a dilation
def opening(img):
kernel = np.ones((5, 5), np.uint8)
return cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)
# canny edge detection
def canny(img):
return cv2.Canny(img, 100, 200)
# skew correction
def deskew(img):
coords = np.column_stack(np.where(img > 0))
angle = cv2.minAreaRect(coords)[-1]
if angle < -45:
angle = -(90 + angle)
else:
angle = -angle
(h, w) = img.shape[:2]
center = (w//2, h//2)
M = cv2.getRotationMatrix2D(center, angle, 1.0)
rotated = cv2.warpAffine(
img, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
return rotated
# template matching
def match_template(img, template):
return cv2.matchTemplate(img, template, cv2.TM_CCOEFF_NORMED)
def convert_img2bin(img):
"""
Pre-processes the image and generates a binary output
"""
# Convert the image into a grayscale image
output_img = grayscale(img)
# Invert the grayscale image by flipping pixel values (each pixel x becomes 255 - x),
# so dark text on a light background becomes bright text on a dark background.
output_img = cv2.bitwise_not(output_img)
# Convert the image to binary by thresholding, to show a clear separation between white and black pixels.
output_img = threshold(output_img)
return output_img
We defined functions for many preprocessing tasks, including converting images to grayscale, flipping pixel values, separating white and black pixels, and much more.
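The invert-then-threshold idea in convert_img2bin() can be sketched with plain NumPy; a fixed cutoff of 128 stands in for Otsu's method here:

```python
import numpy as np

# Tiny grayscale "image": dark text (low values) on light paper (high values)
gray = np.array([[250, 240, 10],
                 [245, 5, 235]], dtype=np.uint8)
# Invert like cv2.bitwise_not: each pixel x becomes 255 - x
inverted = 255 - gray
# Threshold: bright (formerly dark text) pixels -> 255, the rest -> 0
binary = np.where(inverted > 128, 255, 0).astype(np.uint8)
```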
Next, let's define a function to display an image:
def display_img(title, img):
"""Displays an image on screen and maintains the output until the user presses a key"""
cv2.namedWindow('img', cv2.WINDOW_NORMAL)
cv2.setWindowTitle('img', title)
cv2.resizeWindow('img', 1200, 900)
# Display Image on screen
cv2.imshow('img', img)
# Maintain output until user presses a key
cv2.waitKey(0)
# Destroy windows when user presses a key
cv2.destroyAllWindows()
The display_img() function displays an image on screen in a window whose title is set to the title parameter, and keeps this window open until the user presses a keyboard key.
def generate_ss_text(ss_details):
"""Loops through the captured text of an image and arranges this text line by line.
This function depends on the image layout."""
# Arrange the captured text after scanning the page
parse_text = []
word_list = []
last_word = ''
# Loop through the captured text of the entire page
for word in ss_details['text']:
# If the word captured is not empty
if word != '':
# Add it to the line word list
word_list.append(word)
last_word = word
if (last_word != '' and word == '') or (word == ss_details['text'][-1]):
parse_text.append(word_list)
word_list = []
return parse_text
The function above iterates through the captured text of an image and arranges the grabbed text line by line. It depends on the image layout and may require tweaking for some image formats.
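A toy run of the same logic shows how rows split on the empty strings that pytesseract emits at line boundaries (the function is repeated here so the snippet runs on its own):

```python
def generate_ss_text(ss_details):
    """Group the flat pytesseract word list into lines, splitting on ''."""
    parse_text = []
    word_list = []
    last_word = ''
    for word in ss_details['text']:
        if word != '':
            word_list.append(word)
            last_word = word
        if (last_word != '' and word == '') or (word == ss_details['text'][-1]):
            parse_text.append(word_list)
            word_list = []
    return parse_text

# '' marks a line break in the pytesseract output
details = {'text': ['Hello', 'world', '', 'second', 'line']}
lines = generate_ss_text(details)
```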
Next, let's define a function for searching text using regular expressions:
def search_for_text(ss_details, search_str):
"""Search for the search string within the image content"""
# Find all matches within one page
results = re.findall(search_str, ss_details['text'], re.IGNORECASE)
# In case multiple matches within one page
for result in results:
yield result
We will be using this function to search for specific text within the grabbed content of an image. It returns a generator of the matches found.
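The same re.findall() call, run on a plain string instead of the OCR dict, shows the case-insensitive matching and the generator behaviour (the sample sentence is made up):

```python
import re

def search_for_text(text, search_str):
    """Yield every case-insensitive match of search_str within text."""
    for result in re.findall(search_str, text, re.IGNORECASE):
        yield result

matches = list(search_for_text("BERT and bert are both matched, as is Bert.", "bert"))
```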
def save_page_content(pdfContent, page_id, page_data):
"""Appends the content of a scanned page, line by line, to a pandas DataFrame."""
if page_data:
for idx, line in enumerate(page_data, 1):
line = ' '.join(line)
pdfContent = pdfContent.append(
{'page': page_id, 'line_id': idx, 'line': line}, ignore_index=True
)
return pdfContent
The save_page_content() function appends the grabbed content of an image line by line, after scanning it, to the pdfContent pandas DataFrame.
Now let's make a function to save the resulting DataFrame to a CSV file:
def save_file_content(pdfContent, input_file):
"""Outputs the content of the pandas DataFrame to a CSV file having the same path as the input_file
but with different extension (.csv)"""
content_file = os.path.join(os.path.dirname(input_file), os.path.splitext(
os.path.basename(input_file))[0] + ".csv")
pdfContent.to_csv(content_file, sep=',', index=False)
return content_file
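The path arithmetic in save_file_content() just swaps the extension, keeping the directory and base name; for example (the path is hypothetical):

```python
import os

input_file = "/tmp/reports/invoice.pdf"
# Same construction as save_file_content(): keep the directory and base name,
# replace the extension with .csv
content_file = os.path.join(
    os.path.dirname(input_file),
    os.path.splitext(os.path.basename(input_file))[0] + ".csv")
```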
Next, let's write a function that calculates the confidence score of the text grabbed from the scanned image:
def calculate_ss_confidence(ss_details: dict):
"""Calculate the confidence score of the text grabbed from the scanned image."""
# page_num --> Page number of the detected text or item
# block_num --> Block number of the detected text or item
# par_num --> Paragraph number of the detected text or item
# line_num --> Line number of the detected text or item
# Convert the dict to dataFrame
df = pd.DataFrame.from_dict(ss_details)
# Convert the field conf (confidence) to numeric
df['conf'] = pd.to_numeric(df['conf'], errors='coerce')
# Eliminate records with negative confidence
df = df[df.conf != -1]
# Calculate the mean confidence by page
conf = df.groupby(['page_num'])['conf'].mean().tolist()
return conf[0]
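Stripped of pandas, the calculation is a mean over the non-negative confidences of one page; a stdlib sketch with made-up values:

```python
# Toy slice of a pytesseract details dict: conf == -1 marks non-text items
ss_details = {'page_num': [1, 1, 1, 1], 'conf': ['96', '-1', '88', '91']}
# Drop the -1 records, then average what remains (single-page case)
confs = [float(c) for c in ss_details['conf'] if float(c) != -1]
mean_conf = sum(confs) / len(confs)
```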
Moving on to the main function: scanning the image:
def ocr_img(
img: np.array, input_file: str, search_str: str,
highlight_readable_text: bool = False, action: str = 'Highlight',
show_comparison: bool = False, generate_output: bool = True):
"""Scans an image buffer or an image file.
Pre-processes the image.
Calls the Tesseract engine with pre-defined parameters.
Calculates the confidence score of the image grabbed content.
Draws a green rectangle around readable text items having a confidence score > 30.
Searches for a specific text.
Highlight or redact found matches of the searched text.
Displays a window showing readable text fields or the highlighted or redacted text.
Generates the text content of the image.
Prints a summary to the console."""
# If image source file is inputted as a parameter
if input_file:
# Reading image using opencv
img = cv2.imread(input_file)
# Preserve a copy of this image for comparison purposes
initial_img = img.copy()
highlighted_img = img.copy()
# Convert image to binary
bin_img = convert_img2bin(img)
# Calling Tesseract
# Tesseract Configuration parameters
# oem --> OCR engine mode = 3 >> Legacy + LSTM mode only (the LSTM neural net mode works best)
# psm --> page segmentation mode = 6 >> Assume as single uniform block of text (How a page of text can be analyzed)
config_param = r'--oem 3 --psm 6'
# Feeding image to tesseract
details = pytesseract.image_to_data(
bin_img, output_type=Output.DICT, config=config_param, lang='eng')
# The details dictionary contains the information of the input image
# such as detected text, region, position, information, height, width, confidence score.
ss_confidence = calculate_ss_confidence(details)
boxed_img = None
# Total readable items
ss_readable_items = 0
# Total matches found
ss_matches = 0
for seq in range(len(details['text'])):
# Consider only text fields with confidence score > 30 (text is readable)
if float(details['conf'][seq]) > 30.0:
ss_readable_items += 1
# Draws a green rectangle around readable text items having a confidence score > 30
if highlight_readable_text:
(x, y, w, h) = (details['left'][seq], details['top']
[seq], details['width'][seq], details['height'][seq])
boxed_img = cv2.rectangle(
img, (x, y), (x+w, y+h), (0, 255, 0), 2)
# Searches for the string
if search_str:
results = re.findall(
search_str, details['text'][seq], re.IGNORECASE)
for result in results:
ss_matches += 1
if action:
# Draw a red rectangle around the searchable text
(x, y, w, h) = (details['left'][seq], details['top']
[seq], details['width'][seq], details['height'][seq])
# Details of the rectangle
# Starting coordinate representing the top left corner of the rectangle
start_point = (x, y)
# Ending coordinate representing the bottom right corner of the rectangle
end_point = (x + w, y + h)
#Color in BGR -- Blue, Green, Red
if action == "Highlight":
color = (0, 255, 255) # Yellow
elif action == "Redact":
color = (0, 0, 0) # Black
# Thickness in px (-1 will fill the entire shape)
thickness = -1
boxed_img = cv2.rectangle(
img, start_point, end_point, color, thickness)
if ss_readable_items > 0 and highlight_readable_text and not (ss_matches > 0 and action in ("Highlight", "Redact")):
highlighted_img = boxed_img.copy()
# Highlight found matches of the search string
if ss_matches > 0 and action == "Highlight":
cv2.addWeighted(boxed_img, 0.4, highlighted_img,
1 - 0.4, 0, highlighted_img)
# Redact found matches of the search string
elif ss_matches > 0 and action == "Redact":
highlighted_img = boxed_img.copy()
#cv2.addWeighted(boxed_img, 1, highlighted_img, 0, 0, highlighted_img)
# save the image
cv2.imwrite("highlighted-text-image.jpg", highlighted_img)
# Displays window showing readable text fields or the highlighted or redacted data
if show_comparison and (highlight_readable_text or action):
title = input_file if input_file else 'Compare'
conc_img = cv2.hconcat([initial_img, highlighted_img])
display_img(title, conc_img)
# Generates the text content of the image
output_data = None
if generate_output and details:
output_data = generate_ss_text(details)
# Prints a summary to the console
if input_file:
summary = {
"File": input_file, "Total readable words": ss_readable_items, "Total matches": ss_matches, "Confidence score": ss_confidence
}
# Printing Summary
print("## Summary ########################################################")
print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
print("###################################################################")
return highlighted_img, ss_readable_items, ss_matches, ss_confidence, output_data
# pass image into pytesseract module
# pytesseract is trained in many languages
#config_param = r'--oem 3 --psm 6'
#details = pytesseract.image_to_data(img,config=config_param,lang='eng')
# print(details)
# return details
The code above does the following: if an image file is passed, it reads it, pre-processes it into a binary image, feeds it to Tesseract, calculates the confidence score of the grabbed content, draws rectangles around readable text or search matches, highlights or redacts those matches, optionally displays a comparison window, generates the text content of the image, and prints a summary to the console.
def image_to_byte_array(image: Image):
"""
Converts an image into a byte array
"""
imgByteArr = BytesIO()
image.save(imgByteArr, format=image.format if image.format else 'JPEG')
imgByteArr = imgByteArr.getvalue()
return imgByteArr
def ocr_file(**kwargs):
    """Opens the input PDF file.
    Opens a memory buffer for storing the output PDF file.
    Creates a DataFrame for storing page statistics.
    Iterates through the chosen pages of the input PDF file.
    Grabs a screenshot of the selected PDF page.
    Converts the screenshot pix to a numpy array.
    Scans the grabbed screenshot.
    Collects the statistics of the screenshot (page).
    Saves the content of the screenshot (page).
    Adds the updated screenshot (highlighted, redacted) to the output file.
    Saves the whole content of the PDF file.
    Saves the output PDF file if required.
    Prints a summary to the console."""
    input_file = kwargs.get('input_file')
    output_file = kwargs.get('output_file')
    search_str = kwargs.get('search_str')
    pages = kwargs.get('pages')
    highlight_readable_text = kwargs.get('highlight_readable_text')
    action = kwargs.get('action')
    show_comparison = kwargs.get('show_comparison')
    generate_output = kwargs.get('generate_output')
    # Open the input PDF file
    pdfIn = fitz.open(input_file)
    # Open a memory buffer for storing the output PDF file
    pdfOut = fitz.open()
    # Create an empty DataFrame for storing page statistics
    dfResult = pd.DataFrame(
        columns=['page', 'page_readable_items', 'page_matches', 'page_total_confidence'])
    # Create an empty DataFrame for storing the file content
    if generate_output:
        pdfContent = pd.DataFrame(columns=['page', 'line_id', 'line'])
    # Iterate through the pages of the input file
    for pg in range(pdfIn.pageCount):
        if str(pages) != str(None):
            if str(pg) not in str(pages):
                continue
        # Select a page
        page = pdfIn[pg]
        # Rotation angle
        rotate = int(0)
        # The PDF page is converted into a whole picture, and a screenshot is taken of each picture.
        # zoom = 1.33333333 ---> image size = 1056*816
        # zoom = 2 ---> 2x default resolution (text is clear, image text is hard to read) = small file size / image size = 1584*1224
        # zoom = 4 ---> 4x default resolution (text is clear, image text is barely readable) = large file size
        # zoom = 8 ---> 8x default resolution (text is clear, image text is readable) = large file size
        zoom_x = 2
        zoom_y = 2
        # The zoom factor is set to 2 to make the text clear
        # preRotate rotates the page if needed
        mat = fitz.Matrix(zoom_x, zoom_y).preRotate(rotate)
        # To capture a specific part of the PDF page:
        # rect = page.rect  # page size
        # mp = rect.tl + (rect.bl - (0.75)/zoom_x)  # rectangular area; 56 = 75/1.3333
        # clip = fitz.Rect(mp, rect.br)  # the area to capture
        # pix = page.getPixmap(matrix=mat, alpha=False, clip=clip)
        # Get a screenshot of the PDF page
        # colorspace -> the color space of the pixmap (csRGB, csGRAY, csCMYK)
        # alpha -> transparency indicator
        pix = page.getPixmap(matrix=mat, alpha=False, colorspace="csGRAY")
        # Convert the screenshot pix to a numpy array
        img = pix2np(pix)
        # Erode the image to omit or thin the boundaries of the bright areas
        # (erosion is applied on binary images)
        # kernel = np.ones((2, 2), np.uint8)
        # img = cv2.erode(img, kernel, iterations=1)
        upd_np_array, pg_readable_items, pg_matches, pg_total_confidence, pg_output_data \
            = ocr_img(img=img, input_file=None, search_str=search_str,
                      highlight_readable_text=highlight_readable_text,  # False
                      action=action,  # 'Redact'
                      show_comparison=show_comparison,  # True
                      generate_output=generate_output)  # False
        # Collect the statistics of the page
        dfResult = dfResult.append({'page': (pg+1), 'page_readable_items': pg_readable_items,
                                    'page_matches': pg_matches, 'page_total_confidence': pg_total_confidence}, ignore_index=True)
        if generate_output:
            pdfContent = save_page_content(
                pdfContent=pdfContent, page_id=(pg+1), page_data=pg_output_data)
        # Convert the numpy array to an image object with mode = RGB
        # upd_img = Image.fromarray(np.uint8(upd_np_array)).convert('RGB')
        upd_img = Image.fromarray(upd_np_array[..., ::-1])
        # Convert the image to a byte array
        upd_array = image_to_byte_array(upd_img)
        # Get the page size
        """
        # To check whether the initial page is portrait or landscape:
        if page.rect.width > page.rect.height:
            fmt = fitz.PaperRect("a4-l")
        else:
            fmt = fitz.PaperRect("a4")
        # pno = -1 -> insert after the last page
        pageo = pdfOut.newPage(pno=-1, width=fmt.width, height=fmt.height)
        """
        pageo = pdfOut.newPage(
            pno=-1, width=page.rect.width, height=page.rect.height)
        pageo.insertImage(page.rect, stream=upd_array)
        # pageo.insertImage(page.rect, stream=upd_img.tobytes())
        # pageo.showPDFpage(pageo.rect, pdfDoc, page.number)
    content_file = None
    if generate_output:
        content_file = save_file_content(
            pdfContent=pdfContent, input_file=input_file)
    summary = {
        "File": input_file, "Total pages": pdfIn.pageCount,
        "Processed pages": dfResult['page'].count(), "Total readable words": dfResult['page_readable_items'].sum(),
        "Total matches": dfResult['page_matches'].sum(), "Confidence score": dfResult['page_total_confidence'].mean(),
        "Output file": output_file, "Content file": content_file
    }
    # Printing summary
    print("## Summary ########################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
    print("\nPages Statistics:")
    print(dfResult, sep='\n')
    print("###################################################################")
    pdfIn.close()
    if output_file:
        pdfOut.save(output_file)
    pdfOut.close()
The image_to_byte_array() function converts an image into a byte array.
The ocr_file() function performs the steps listed in its docstring: it opens the input PDF, takes a screenshot of each selected page, runs ocr_img() on it, collects per-page statistics, and assembles the highlighted or redacted pages into the output PDF.
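Note that ocr_file() filters pages with a simple string-membership test (`str(pg) not in str(pages)`). In isolation, that check behaves as follows, assuming pages is passed as a tuple such as (0, 1), or left as None:

```python
# Sketch of the string-based page filter used in ocr_file()
# (assumption: `pages` is a tuple such as (0, 1), or None for "all pages")
def selected_pages(pages, page_count):
    return [pg for pg in range(page_count)
            if str(pages) == str(None) or str(pg) in str(pages)]

print(selected_pages((0, 1), 5))  # → [0, 1]: only pages 0 and 1 pass the filter
print(selected_pages(None, 5))    # → [0, 1, 2, 3, 4]: None means process every page
```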
Let's add another function to process a folder that contains multiple PDF files:
def ocr_folder(**kwargs):
    """Scans all PDF files within a specified path"""
    input_folder = kwargs.get('input_folder')
    # Run in recursive mode
    recursive = kwargs.get('recursive')
    search_str = kwargs.get('search_str')
    pages = kwargs.get('pages')
    action = kwargs.get('action')
    generate_output = kwargs.get('generate_output')
    # Loop through the files within the input folder
    for foldername, dirs, filenames in os.walk(input_folder):
        for filename in filenames:
            # Check if it is a PDF file
            if not filename.endswith('.pdf'):
                continue
            # PDF file found
            inp_pdf_file = os.path.join(foldername, filename)
            print("Processing file =", inp_pdf_file)
            output_file = None
            if search_str:
                # Generate an output file name
                output_file = os.path.join(os.path.dirname(
                    inp_pdf_file), 'ocr_' + os.path.basename(inp_pdf_file))
            ocr_file(
                input_file=inp_pdf_file, output_file=output_file, search_str=search_str,
                pages=pages, highlight_readable_text=False, action=action,
                show_comparison=False, generate_output=generate_output
            )
        if not recursive:
            break
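The non-recursive behavior above comes from the break at the end of the os.walk() loop: when recursive is false, only the top-level folder is visited. A minimal stdlib sketch of that pattern, using a throwaway temporary directory tree as a stand-in for a real PDF folder:

```python
import os
import tempfile

# Build a small tree: top/a.pdf and top/sub/b.pdf
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, 'sub'))
for p in ('a.pdf', os.path.join('sub', 'b.pdf')):
    open(os.path.join(top, p), 'w').close()

def list_pdfs(input_folder, recursive):
    found = []
    for foldername, dirs, filenames in os.walk(input_folder):
        for filename in filenames:
            if filename.endswith('.pdf'):
                found.append(filename)
        if not recursive:
            break  # same early exit ocr_folder() uses
    return sorted(found)

print(list_pdfs(top, recursive=False))  # → ['a.pdf'] (top level only)
print(list_pdfs(top, recursive=True))   # → ['a.pdf', 'b.pdf'] (subfolders too)
```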
This function is intended to scan the PDF files contained in a specific folder. It loops through the files of the specified folder, either recursively or not depending on the value of the recursive parameter, and processes these files one by one.
It accepts the following parameters:
input_folder: the path of the folder containing the PDF files to process.
search_str: the text to search for, in order to manipulate it.
recursive: whether to run this process recursively, looping through the subfolders, or not.
action: the action to perform, one of the following: Highlight, Redact.
pages: the pages to consider.
generate_output: whether to save the content of the input PDF file to a CSV file or not.
Before we finish, let's define useful functions for parsing command-line arguments:
def is_valid_path(path):
    """Validates the path inputted and checks whether it is a file path or a folder path"""
    if not path:
        raise ValueError("Invalid Path")
    if os.path.isfile(path):
        return path
    elif os.path.isdir(path):
        return path
    else:
        raise ValueError(f"Invalid Path {path}")


def parse_args():
    """Get user command-line parameters"""
    parser = argparse.ArgumentParser(description="Available Options")
    parser.add_argument('-i', '--input-path', type=is_valid_path,
                        required=True, help="Enter the path of the file or the folder to process")
    parser.add_argument('-a', '--action', choices=['Highlight', 'Redact'],
                        type=str, help="Choose to highlight or to redact")
    parser.add_argument('-s', '--search-str', dest='search_str',
                        type=str, help="Enter a valid search string")
    parser.add_argument('-p', '--pages', dest='pages', type=tuple,
                        help="Enter the pages to consider in the PDF file, e.g. (0,1)")
    parser.add_argument("-g", "--generate-output", action="store_true",
                        help="Generate text content in a CSV file")
    path = parser.parse_known_args()[0].input_path
    if os.path.isfile(path):
        parser.add_argument('-o', '--output_file', dest='output_file',
                            type=str, help="Enter a valid output file")
        parser.add_argument("-t", "--highlight-readable-text", action="store_true",
                            help="Highlight readable text in the generated image")
        parser.add_argument("-c", "--show-comparison", action="store_true",
                            help="Show comparison between captured image and the generated image")
    if os.path.isdir(path):
        parser.add_argument("-r", "--recursive", action="store_true",
                            help="Whether to process the directory recursively")
    # Parse the command-line arguments
    args = vars(parser.parse_args())
    # Display the command-line arguments
    print("## Command Arguments #################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in args.items()))
    print("######################################################################")
    return args
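The two-phase parsing above (parse_known_args() first, then conditional add_argument() calls, then the final parse_args()) lets the set of accepted options depend on whether the input path is a file or a folder. The pattern can be sketched in isolation; the --highlight-readable-text flag and the hard-coded argument list here are illustrative stand-ins:

```python
import argparse

parser = argparse.ArgumentParser(description="Available Options")
parser.add_argument('-i', '--input-path', required=True)

# Phase 1: peek at the already-defined arguments, ignoring unknown ones
known, _ = parser.parse_known_args(['-i', 'some/file.pdf', '-t'])
path = known.input_path

# Phase 2: register options that depend on the value just peeked at
if path.endswith('.pdf'):
    parser.add_argument('-t', '--highlight-readable-text', action='store_true')

# Final parse now accepts the conditionally added flag
args = vars(parser.parse_args(['-i', 'some/file.pdf', '-t']))
print(args)  # → {'input_path': 'some/file.pdf', 'highlight_readable_text': True}
```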
The is_valid_path() function validates a path passed as a parameter and checks whether it is a file path or a directory path.
The parse_args() function defines and sets the appropriate constraints for the user's command-line arguments when running this utility.
Below are explanations for all the parameters:
input_path: a required parameter for entering the path of the file or the folder to process; this parameter is associated with the is_valid_path() function defined earlier.
action: the action to perform, chosen from a predefined list of options to avoid any wrong selection.
search_str: the text to search for, in order to manipulate it.
pages: the pages to consider when processing a PDF file.
generate_output: specifies whether to save the captured content of the input file, be it an image or a PDF, to a CSV file or not.
output_file: the path of the output file. Filling in this argument is only possible when a file, not a directory, is selected as the input.
highlight_readable_text: draws green rectangles around readable text fields with a confidence score greater than 30.
show_comparison: displays a window showing a comparison between the original image and the processed image.
recursive: whether to process a folder recursively or not. Filling in this argument is only possible when a directory is selected.
Finally, let's write the main code that uses the previously defined functions:
if __name__ == '__main__':
    # Parse the command-line arguments entered by the user
    args = parse_args()
    # If it is a file path
    if os.path.isfile(args['input_path']):
        # Process a file
        if filetype.is_image(args['input_path']):
            ocr_img(
                img=None, input_file=args['input_path'], search_str=args['search_str'],
                highlight_readable_text=args['highlight_readable_text'], action=args['action'],
                show_comparison=args['show_comparison'], generate_output=args['generate_output']
            )
        else:
            ocr_file(
                input_file=args['input_path'], output_file=args['output_file'],
                search_str=args['search_str'] if 'search_str' in (args.keys()) else None,
                pages=args['pages'], highlight_readable_text=args['highlight_readable_text'],
                action=args['action'], show_comparison=args['show_comparison'],
                generate_output=args['generate_output']
            )
    # If it is a folder path
    elif os.path.isdir(args['input_path']):
        # Process a folder
        ocr_folder(
            input_folder=args['input_path'], recursive=args['recursive'],
            search_str=args['search_str'] if 'search_str' in (args.keys()) else None,
            pages=args['pages'], action=args['action'], generate_output=args['generate_output']
        )
Let's test our program:
$ python pdf_ocr.py
Output:
usage: pdf_ocr.py [-h] -i INPUT_PATH [-a {Highlight,Redact}] [-s SEARCH_STR] [-p PAGES] [-g GENERATE_OUTPUT]

Available Options

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_PATH, --input_path INPUT_PATH
                        Enter the path of the file or the folder to process
  -a {Highlight,Redact}, --action {Highlight,Redact}
                        Choose to highlight or to redact
  -s SEARCH_STR, --search_str SEARCH_STR
                        Enter a valid search string
  -p PAGES, --pages PAGES
                        Enter the pages to consider e.g.: (0,1)
  -g GENERATE_OUTPUT, --generate_output GENERATE_OUTPUT
                        Generate content in a CSV file
Before exploring our test scenarios, be aware of the following: to avoid a PermissionError, close the input file before running this utility.
First, let's try passing an image (you can get it here if you want to reproduce the same output), without any PDF file involved:
$ python pdf_ocr.py -s "BERT" -a Highlight -i example-image-containing-text.jpg
The output will be the following:
## Command Arguments #################################################
input_path:example-image-containing-text.jpg
action:Highlight
search_str:BERT
pages:None
generate_output:False
output_file:None
highlight_readable_text:False
show_comparison:False
######################################################################
## Summary ########################################################
File:example-image-containing-text.jpg
Total readable words:192
Total matches:3
Confidence score:89.89337547979804
###################################################################
And a new image has appeared in the current directory:
You can pass -t or --highlight-readable-text to highlight all the detected text (in a different format, to distinguish the search string from the rest).
You can also pass -c or --show-comparison to display the original image and the edited image in the same window.
Now that it works for images, let's try PDF files:
$ python pdf_ocr.py -s "BERT" -i image.pdf -o output.pdf --generate-output -a "Highlight"
image.pdf is a simple PDF file containing the image from the previous example (again, you can get it here).
This time we passed a PDF file to the -i argument, and output.pdf as the resulting PDF file (where all the highlighting happens). The above command generates the following output:
## Command Arguments #################################################
input_path:image.pdf
action:Highlight
search_str:BERT
pages:None
generate_output:True
output_file:output.pdf
highlight_readable_text:False
show_comparison:False
######################################################################
## Summary ########################################################
File:image.pdf
Total pages:1
Processed pages:1
Total readable words:192.0
Total matches:3.0
Confidence score:83.1775128855722
Output file:output.pdf
Content file:image.csv
Pages Statistics:
page page_readable_items page_matches page_total_confidence
0 1.0 192.0 3.0 83.177513
###################################################################
The output.pdf file is produced after execution; it contains the same original PDF, but with the matched text highlighted. We also now have statistics about our PDF file: 192 words in total were detected, and 3 were matched by our search, with a confidence score of about 83.2%.
A CSV file is also generated, containing the text detected in the image, one line per row.
There are other parameters we didn't use in our examples; feel free to explore them. You can also pass an entire folder to the -i argument to scan a collection of PDF files.
Tesseract is perfect for scanning clean, clear documents. A poor-quality scan may produce poor OCR results. Normally, it does not give accurate results for images affected by artifacts, including partial occlusion, distorted perspective, and complex backgrounds.
Original article source: https://www.thepythoncode.com