PDF Metadata

The primary metadata in a PDF is stored in an XMP (Extensible Metadata Platform) Metadata stream, where XMP is a metadata specification in XML format. For full information on XMP, see Adobe’s XMP Developer Center. It supersedes the older Document Info dictionaries, which are deprecated in the PDF 2.0 specification. The XMP data entry is optional and does not appear in all PDFs.

The XMP Specification also provides useful information.

pikepdf provides an interface to simplify viewing and making minor edits to XMP. In particular, compound quantities may be read, but only scalar quantities can be modified.

For more complex changes, consider using the python-xmp-toolkit library and its libexempi dependency. Note, however, that it is not capable of synchronizing changes to the older Document Info metadata.

Accessing metadata

The XMP metadata stream is attached to the PDF’s root object, but to simplify management of this, use pikepdf.Pdf.open_metadata(). The returned pikepdf.models.PdfMetadata object may be used for reading, or entered with a with block to modify and commit changes. If you use this interface, pikepdf will synchronize changes to new and old metadata.

A PDF must still be saved after metadata is changed.

In [1]: import pikepdf
   ...: pdf = pikepdf.open('../tests/resources/sandwich.pdf')

In [2]: meta = pdf.open_metadata()

In [3]: meta['xmp:CreatorTool']

If no XMP metadata exists, an empty XMP metadata container will be created.

Open metadata in a with block to open it for editing. When the block is exited, changes are committed (updating XMP and the Document Info dictionary) and attached to the PDF object. The PDF must still be saved. If an exception occurs in the block, changes are discarded.

In [4]: with pdf.open_metadata() as meta:
   ...:     meta['dc:title'] = "Let's change the title"
   ...: 

The list of available metadata fields may be found in the XMP Specification.

Checking PDF/A conformance

The metadata interface can also test if a file claims to be conformant to the PDF/A specification.

In [5]: pdf = pikepdf.open('../tests/resources/veraPDF test suite 6-2-10-t02-pass-a.pdf')

In [6]: meta = pdf.open_metadata()

In [7]: meta.pdfa_status

Note

This property only reports whether the file claims to be conformant to the PDF/A standard. Use a tool such as veraPDF to verify conformance.

The Document Info dictionary

The Document Info block is an older, now deprecated object in which metadata may be stored. The Document Info is not attached to the /Root object. It may be accessed using the .docinfo property. If no Document Info exists, accessing .docinfo will initialize an empty one.

Here is an example of a Document Info block.

In [8]: pdf = pikepdf.open('../tests/resources/sandwich.pdf')

In [9]: pdf.docinfo

It is permitted in pikepdf to directly interact with Document Info as with other PDF dictionaries. However, it is better to use .open_metadata() because that interface will apply changes to both XMP and Document Info in a consistent manner.

You may copy data from a Document Info object in the current PDF or another PDF into XMP metadata using load_from_docinfo().