
Metadata when combining — title, author, XMP, document ID

[Figure: PDFs carry two independent metadata streams, and both need handling on combine. Left panel: /Info dictionary (since 1.0), with /Title, /Author, /Subject, /Keywords, /Creator, /Producer, /CreationDate, and /ModDate; strings, simple key-value. Right panel: XMP packet (since 1.4), with dc:title, dc:creator, and xmp:CreateDate inside an rdf:Description; RDF/XML, multi-language, extensible by namespaces.]

A PDF carries two parallel metadata systems: the Info dictionary (a simple key-value map, present since PDF 1.0) and the XMP packet (an embedded RDF/XML document, added in PDF 1.4 and required by PDF/A and most modern publishing pipelines). They mostly overlap in content but use different formats. Both need handling during a combine.

The /Info dictionary

The trailer's /Info entry points to a dictionary with a standard set of keys: /Title, /Author, /Subject, /Keywords, /Creator (the application that created the source content, e.g. "Microsoft Word"), /Producer (the library that wrote the PDF, e.g. "Adobe PDF Library"), /CreationDate, and /ModDate. These values are PDF strings; text values are encoded in PDFDocEncoding or UTF-16BE. Writers may also add their own keys alongside the standard ones.

The dates use a quirky format: D:YYYYMMDDHHmmSSOHH'mm' where O is +, -, or Z for the UTC offset. Example: D:20240320143015+02'00'.
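As an illustration of that format, here is a minimal Python sketch that turns a PDF date string into a datetime. The function name parse_pdf_date is made up for this example, and the sketch assumes that every field after the year may be missing, which is how many files in the wild are written.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSSOHH'mm' -- everything after the year is treated as optional.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})"
    r"(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<off_h>\d{2})'?(?:(?P<off_m>\d{2})'?)?)?)?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date string (hypothetical helper, not a library API)."""
    m = _PDF_DATE.match(value.strip())
    if m is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    parts = m.groupdict()
    naive = datetime(
        int(parts["year"]),
        int(parts["month"] or 1),   # missing month/day default to 1
        int(parts["day"] or 1),
        int(parts["hour"] or 0),    # missing time fields default to 0
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
    )
    sign = parts["sign"]
    if sign is None:
        return naive  # no offset in the string: return a naive datetime
    offset = timedelta(
        hours=int(parts["off_h"] or 0),
        minutes=int(parts["off_m"] or 0),
    )
    if sign == "-":
        offset = -offset
    # "Z" falls through with a zero offset, i.e. UTC.
    return naive.replace(tzinfo=timezone(offset))


print(parse_pdf_date("D:20240320143015+02'00'").isoformat())
```

For the example string above, the result is a timezone-aware datetime two hours ahead of UTC. Strings with no offset come back naive, so the caller has to decide what time zone to assume.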

The XMP packet

XMP (Extensible Metadata Platform, an Adobe-led ISO standard) lives in the catalog's /Metadata stream as embedded RDF/XML, wrapped in <?xpacket begin="..." id="..."?> markers. It uses Dublin Core (dc:title, dc:creator) and XMP Basic (xmp:CreateDate) namespaces, and any number of additional namespaces for application-specific metadata (Photoshop, Illustrator, custom workflow tools).

The XMP packet is the "modern" metadata source. It supports multi-language titles via rdf:Alt, structured author lists, version history, and extensible vocabularies. PDF/A requires an XMP packet, as do recent PDF/X versions such as PDF/X-4, and PDF/A expects any /Info entries to agree with it.
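To make the packet structure concrete, the following Python sketch reads the title alternatives, the creator list, and the creation date out of an XMP packet using only the standard library. The sample packet and the function name read_xmp_summary are illustrative assumptions. Real packets can also serialize simple properties as attributes on rdf:Description, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative packet, not taken from any real file.
SAMPLE_PACKET = """<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:xmp="http://ns.adobe.com/xap/1.0/">
   <dc:title>
    <rdf:Alt>
     <rdf:li xml:lang="x-default">Annual report</rdf:li>
     <rdf:li xml:lang="de">Jahresbericht</rdf:li>
    </rdf:Alt>
   </dc:title>
   <dc:creator>
    <rdf:Seq><rdf:li>Finance team</rdf:li></rdf:Seq>
   </dc:creator>
   <xmp:CreateDate>2024-03-15T09:00:00Z</xmp:CreateDate>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>"""


def read_xmp_summary(packet: str) -> dict:
    """Extract a few common fields from an XMP packet (hypothetical helper)."""
    root = ET.fromstring(packet)
    # dc:title is a language alternative: one rdf:li per xml:lang value.
    titles = {
        li.get(XML_LANG, "x-default"): (li.text or "")
        for li in root.findall(".//dc:title/rdf:Alt/rdf:li", NS)
    }
    # dc:creator is an ordered list of names.
    creators = [
        li.text or ""
        for li in root.findall(".//dc:creator/rdf:Seq/rdf:li", NS)
    ]
    # Only the element form of xmp:CreateDate is handled here.
    created = root.findtext(".//xmp:CreateDate", default=None, namespaces=NS)
    return {"titles": titles, "creators": creators, "created": created}


print(read_xmp_summary(SAMPLE_PACKET))
```

Note that XMP dates are ISO 8601 strings, not the D: format used in /Info, so a tool that keeps both streams in sync has to convert between the two.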

What CombinePDF writes after a merge

The only metadata CombinePDF writes explicitly is the output PDF's /Title, which it sets to "Combined PDF". Everything else in the merged file's /Info dictionary, XMP packet, and /ID array is whatever the underlying merge tool produces.

The sections below describe typical merger behavior for each of these.

Document ID — the /ID array

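The trailer carries an /ID array of two strings: the original creation ID and the current modification ID. Both are typically 16-byte values (often MD5 hashes of file content plus a timestamp). The first is meant to stay fixed for the life of the document, while the second changes each time the file is saved, so together they identify both the document and its current revision.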

The merged file gets its own fresh /ID — the merger generates a new pair of 16-byte hashes from the combined content. The IDs from inputs are not preserved — they identified those files, not this new one.

Custom metadata keys

Many enterprise document management systems write custom keys into /Info (e.g. /RFP_Number, /Department) or into XMP (custom namespaces). A naïve combine drops them along with the rest of the input metadata. If you have a workflow that depends on custom metadata, plan to re-apply it after combining.
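One way to prepare for re-applying custom keys is to collect them from the inputs before the combine. The Python sketch below assumes each input's /Info dictionary has already been read into a plain dict by whatever PDF library you use. The function name collect_custom_keys and the first-value-wins policy are illustrative choices, not something the merge tool provides.

```python
# Standard /Info keys; anything else is treated as custom.
STANDARD_INFO_KEYS = {
    "/Title", "/Author", "/Subject", "/Keywords",
    "/Creator", "/Producer", "/CreationDate", "/ModDate", "/Trapped",
}


def collect_custom_keys(infos: list[dict]) -> tuple[dict, dict]:
    """Gather non-standard /Info keys across inputs (hypothetical helper).

    Returns (merged, conflicts). `merged` keeps the first value seen for
    each custom key; `conflicts` lists every distinct value for keys on
    which the inputs disagree, so a person can decide which one to keep.
    """
    merged: dict = {}
    conflicts: dict = {}
    for info in infos:
        for key, value in info.items():
            if key in STANDARD_INFO_KEYS:
                continue
            if key not in merged:
                merged[key] = value
            elif merged[key] != value:
                seen = conflicts.setdefault(key, [merged[key]])
                if value not in seen:
                    seen.append(value)
    return merged, conflicts


inputs = [
    {"/Title": "Part A", "/RFP_Number": "RFP-17", "/Department": "Finance"},
    {"/Producer": "Some tool", "/RFP_Number": "RFP-17", "/Department": "Legal"},
]
merged, conflicts = collect_custom_keys(inputs)
print(merged)
print(conflicts)
```

In this example /RFP_Number agrees across the inputs and can be written back to the combined file as-is, while /Department is reported as a conflict rather than resolved silently.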

The /Producer string and forensic tracing

The Producer field is the closest thing a PDF has to a fingerprint. Forensic analysts use it to identify the tool chain that built a file. After a combine, the Producer reflects the merge tool — the input files' Producer strings are lost.

If you care about tracing or audit, save the original Producer values somewhere outside the PDF before combining. Some compliance pipelines verify the Producer matches an expected value; combining a verified PDF defeats that check.

Reducing metadata leakage

If you are combining PDFs containing internal author names, file paths, or workflow metadata you don't want to publish:

  1. Combine first.
  2. Open the combined PDF in your PDF editor (Adobe Acrobat, Preview, or similar) and use the "Document Properties → Metadata" or "Sanitize Document" feature to clear author, title, subject, and similar fields.
  3. Apps that focus on metadata (photo-management tools, file-property editors) typically expose a "remove all metadata" toggle that wipes both the /Info dictionary and the XMP packet at once.

A combine usually drops most input metadata already, but the merger's freshly written /Info entries (Producer, CreationDate, ModDate) and any surviving fields from inputs may still reveal more than you intend. A post-strip is a safe habit before publication.