Cómo extraer metadatos PDF en Python

Aprenda a usar la biblioteca pikepdf para extraer información útil de archivos PDF en Python.

Los metadatos en PDF son información útil sobre el documento PDF, incluye el título del documento, el autor, la fecha de la última modificación, la fecha de creación, el tema y mucho más. Algunos archivos PDF tienen más información que otros y, en este tutorial, aprenderá a extraer metadatos PDF en Python.

Hay muchas bibliotecas y utilidades en Python para lograr lo mismo, pero me gusta usar pikepdf , ya que es una biblioteca activa y mantenida. Vamos a instalarlo:

$ pip install pikepdf

Pikepdf es un contenedor Pythonic de la biblioteca C ++ QPDF . Importémoslo en nuestro script:

import pikepdf
import sys

También usaremos el módulo sys para obtener el nombre de archivo de los argumentos de la línea de comandos:

# get the target pdf file from the command-line arguments
pdf_filename = sys.argv[1]

Carguemos el archivo PDF usando la biblioteca y obtengamos los metadatos:

# read the pdf file
pdf = pikepdf.Pdf.open(pdf_filename)
docinfo = pdf.docinfo
for key, value in docinfo.items():
    print(key, ":", value)

El docinfoatributo contiene un diccionario de los metadatos del documento. Aquí hay una ejecución de ejemplo:

$ python extract_pdf_metadata_simple.py bert-paper.pdf


/Author : 
/CreationDate : D:20190528000751Z
/Creator : LaTeX with hyperref package
/Keywords :
/ModDate : D:20190528000751Z
/PTEX.Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2
/Producer : pdfTeX-1.40.17
/Subject :
/Title :
/Trapped : /False

Aquí hay otro archivo PDF:

$ python extract_pdf_metadata_simple.py python_cheat_sheet.pdf


/CreationDate : D:20201002181301Z
/Creator : wkhtmltopdf 0.12.5
/Producer : Qt 4.8.7
/Title : Markdown To PDF

Como puede ver, no todos los documentos tienen los mismos campos, algunos contienen mucha menos información.

Tenga en cuenta que /ModDatey /CreationDateson la fecha de la última modificación y la fecha de creación, respectivamente, en el formato de fecha y hora de PDF. Si desea convertir este formato al formato de fecha y hora de Python, entonces he copiado este código de StackOverflow y lo edito un poco para ejecutarlo en Python 3:

import pikepdf
import datetime
import re
from dateutil.tz import tzutc, tzoffset
import sys

pdf_date_pattern = re.compile(''.join([

def transform_date(date_str):
    Convert a pdf date such as "D:20120321183444+07'00'" into a usable datetime
    :param date_str: pdf date string
    :return: datetime object
    global pdf_date_pattern
    match = re.match(pdf_date_pattern, date_str)
    if match:
        date_info = match.groupdict()

        for k, v in date_info.items():  # transform values
            if v is None:
            elif k == 'tz_offset':
                date_info[k] = v.lower()  # so we can treat Z as z
                date_info[k] = int(v)

        if date_info['tz_offset'] in ('z', None):  # UTC
            date_info['tzinfo'] = tzutc()
            multiplier = 1 if date_info['tz_offset'] == '+' else -1
            date_info['tzinfo'] = tzoffset(None, multiplier*(3600 * date_info['tz_hour'] + 60 * date_info['tz_minute']))

        for k in ('tz_offset', 'tz_hour', 'tz_minute'):  # no longer needed
            del date_info[k]

        return datetime.datetime(**date_info)

# get the target pdf file from the command-line arguments
pdf_filename = sys.argv[1]
# read the pdf file
pdf = pikepdf.Pdf.open(pdf_filename)
docinfo = pdf.docinfo
for key, value in docinfo.items():
    if str(value).startswith("D:"):
        # pdf datetime format, convert to python datetime
        value = transform_date(str(pdf.docinfo["/CreationDate"]))
    print(key, ":", value)

Aquí está el mismo resultado anterior, pero con formatos de fecha y hora convertidos a objetos de fecha y hora de Python:

/Author : 
/CreationDate : 2019-05-28 00:07:51+00:00
/Creator : LaTeX with hyperref package
/Keywords :
/ModDate : 2019-05-28 00:07:51+00:00
/PTEX.Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2
/Producer : pdfTeX-1.40.17
/Subject :
/Title :
/Trapped : /False

Mucho mejor. Espero que este rápido tutorial te haya ayudado a obtener los metadatos de documentos PDF con Python.

Consulta el código completo aquí .


What is GEEK

Buddha Community

Cómo extraer metadatos PDF en Python
Ray  Patel

Ray Patel


top 30 Python Tips and Tricks for Beginners

Welcome to my Blog , In this article, you are going to learn the top 10 python tips and tricks.

1) swap two numbers.

2) Reversing a string in Python.

3) Create a single string from all the elements in list.

4) Chaining Of Comparison Operators.

5) Print The File Path Of Imported Modules.

6) Return Multiple Values From Functions.

7) Find The Most Frequent Value In A List.

8) Check The Memory Usage Of An Object.

#python #python hacks tricks #python learning tips #python programming tricks #python tips #python tips and tricks #python tips and tricks advanced #python tips and tricks for beginners #python tips tricks and techniques #python tutorial #tips and tricks in python #tips to learn python #top 30 python tips and tricks for beginners

Ray  Patel

Ray Patel


Lambda, Map, Filter functions in python

Welcome to my Blog, In this article, we will learn python lambda function, Map function, and filter function.

Lambda function in python: Lambda is a one line anonymous function and lambda takes any number of arguments but can only have one expression and python lambda syntax is

Syntax: x = lambda arguments : expression

Now i will show you some python lambda function examples:

#python #anonymous function python #filter function in python #lambda #lambda python 3 #map python #python filter #python filter lambda #python lambda #python lambda examples #python map

August  Larson

August Larson


Creating PDF Invoices in Python with pText


The Portable Document Format (PDF) is not a WYSIWYG (What You See is What You Get) format. It was developed to be platform-agnostic, independent of the underlying operating system and rendering engines.

To achieve this, PDF was constructed to be interacted with via something more like a programming language, and relies on a series of instructions and operations to achieve a result. In fact, PDF is based on a scripting language - PostScript, which was the first device-independent Page Description Language.

In this guide, we’ll be using pText - a Python library dedicated to reading, manipulating and generating PDF documents. It offers both a low-level model (allowing you access to the exact coordinates and layout if you choose to use those) and a high-level model (where you can delegate the precise calculations of margins, positions, etc to a layout manager).

We’ll take a look at how to create a PDF invoice in Python using pText.

#python #pdf #creating pdf invoices in python with ptext #creating pdf invoices #pdf invoice #creating pdf invoices in python with ptext

Art  Lind

Art Lind


Python Tricks Every Developer Should Know

Python is awesome, it’s one of the easiest languages with simple and intuitive syntax but wait, have you ever thought that there might ways to write your python code simpler?

In this tutorial, you’re going to learn a variety of Python tricks that you can use to write your Python code in a more readable and efficient way like a pro.

Let’s get started

Swapping value in Python

Instead of creating a temporary variable to hold the value of the one while swapping, you can do this instead

>>> FirstName = "kalebu"
>>> LastName = "Jordan"
>>> FirstName, LastName = LastName, FirstName 
>>> print(FirstName, LastName)
('Jordan', 'kalebu')

#python #python-programming #python3 #python-tutorials #learn-python #python-tips #python-skills #python-development

Art  Lind

Art Lind


How to Remove all Duplicate Files on your Drive via Python

Today you’re going to learn how to use Python programming in a way that can ultimately save a lot of space on your drive by removing all the duplicates.


In many situations you may find yourself having duplicates files on your disk and but when it comes to tracking and checking them manually it can tedious.

Heres a solution

Instead of tracking throughout your disk to see if there is a duplicate, you can automate the process using coding, by writing a program to recursively track through the disk and remove all the found duplicates and that’s what this article is about.

But How do we do it?

If we were to read the whole file and then compare it to the rest of the files recursively through the given directory it will take a very long time, then how do we do it?

The answer is hashing, with hashing can generate a given string of letters and numbers which act as the identity of a given file and if we find any other file with the same identity we gonna delete it.

There’s a variety of hashing algorithms out there such as

  • md5
  • sha1
  • sha224, sha256, sha384 and sha512

#python-programming #python-tutorials #learn-python #python-project #python3 #python #python-skills #python-tips