Combine PDF — what actually happens under the hood

You cannot combine two PDFs by appending the bytes of one file to the end of another. PDFs are not text streams; they are random-access object databases with a cross-reference table at the end pointing to every object by absolute byte offset. Concatenate two of them and every offset in the second file is off by the full length of the first, so a reader that follows the final cross-reference table lands on the wrong bytes. At best a tolerant viewer repairs the file and shows one of the two documents; at worst it refuses to open it.
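The offset problem is easy to reproduce. The sketch below is illustrative only: build_minimal_pdf and last_startxref are hypothetical helpers written for this post, not part of any converter. It serializes two tiny PDFs, glues them together, and checks where the second file's startxref pointer lands.

```python
def build_minimal_pdf(label: bytes) -> bytes:
    """Serialize a tiny one-page PDF and record where each object starts."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    ]
    out = bytearray(b"%PDF-1.4\n% " + label + b"\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # absolute offset from the start of the file
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


def last_startxref(data: bytes) -> int:
    """Return the offset written after the final 'startxref' keyword."""
    marker = data.rfind(b"startxref")
    return int(data[marker + len(b"startxref"):].split()[0])


a = build_minimal_pdf(b"file A")
b = build_minimal_pdf(b"file B, with a longer comment")
glued = a + b

pointer = last_startxref(glued)        # B's pointer, still relative to B's start
print(b[pointer:pointer + 4])          # on its own, B's pointer lands on b'xref'
print(glued[pointer:pointer + 4])      # after gluing, it lands inside file A
print(glued[len(a) + pointer:][:4])    # correct only once shifted by len(a)
```

The same shift applies to every entry in the second file's cross-reference table, which is why a merge has to re-serialize rather than append.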

The real shape of a PDF

A PDF is a directed graph of indirect objects: a Catalog at the root, a Pages tree below it, Page objects holding content streams, and a constellation of resource dictionaries pointing to fonts, images, ICC profiles, and extended graphics states (ExtGState). At the very end of the file is the xref table, a numeric index mapping (object number, generation) → byte offset, followed by a trailer dictionary that points to the catalog (/Root) and, if the file has been incrementally updated, to the previous xref (/Prev).
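To make the index concrete, here is a minimal sketch of reading a classic xref section into that (object number, generation) → byte offset mapping. It assumes the plain-text table format only; files that use cross-reference streams (PDF 1.5+) need a different parser. The function name and the sample offsets are made up for illustration.

```python
def parse_xref_section(section: bytes) -> dict:
    """Map (object number, generation) -> byte offset for in-use entries."""
    lines = section.splitlines()
    if lines[0].strip() != b"xref":
        raise ValueError("not a classic xref section")
    table = {}
    i = 1
    while i < len(lines) and lines[i].strip() != b"trailer":
        # Each subsection starts with "<first object number> <entry count>".
        first, count = (int(x) for x in lines[i].split())
        for k in range(count):
            offset, generation, kind = lines[i + 1 + k].split()
            if kind == b"n":  # 'f' entries are free slots, not live objects
                table[(first + k, int(generation))] = int(offset)
        i += 1 + count
    return table


# Hypothetical sample; the offsets are illustrative, not from a real file.
sample = b"".join([
    b"xref\n",
    b"0 4\n",
    b"0000000000 65535 f \n",
    b"0000000017 00000 n \n",
    b"0000000066 00000 n \n",
    b"0000000123 00000 n \n",
    b"trailer\n",
    b"<< /Size 4 /Root 1 0 R >>\n",
])
table = parse_xref_section(sample)
print(table[(1, 0)])  # byte offset of object 1, generation 0
```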

Every reference inside the body of the PDF (3 0 R, 17 0 R) is an object number plus a generation number, and object numbers are unique only within one file. Any merge must rewrite those numbers so that A's 5 0 R and B's 5 0 R don't collide.
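One simple renumbering strategy, sketched here on a toy object model, is to shift every object number in B by the highest number used in A. The Ref class and both functions are hypothetical; a real merge tool works on fully parsed PDF objects and may pick new numbers differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Toy stand-in for an indirect reference such as '5 0 R'."""
    number: int
    generation: int = 0


def renumber(value, shift):
    """Return a copy of value with every Ref moved up by shift."""
    if isinstance(value, Ref):
        return Ref(value.number + shift, value.generation)
    if isinstance(value, dict):
        return {key: renumber(item, shift) for key, item in value.items()}
    if isinstance(value, list):
        return [renumber(item, shift) for item in value]
    return value  # names, numbers, strings are left alone


def merge_object_tables(a, b):
    """Combine two {object number: body} tables without collisions."""
    shift = max(a)  # highest object number already taken by A
    merged = dict(a)
    for number, body in b.items():
        merged[number + shift] = renumber(body, shift)
    return merged


doc = {
    1: {"Type": "Catalog", "Pages": Ref(2)},
    2: {"Type": "Pages", "Kids": [Ref(3)], "Count": 1},
    3: {"Type": "Page", "Parent": Ref(2)},
}
merged = merge_object_tables(doc, doc)
print(sorted(merged))       # object numbers after the merge
print(merged[6]["Parent"])  # B's page now points at B's renumbered Pages node
```

Working on parsed objects rather than raw bytes matters: a text search for "5 0 R" would also rewrite matching bytes inside strings and compressed streams.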

How the merge runs

From the converter's perspective, the merge is one operation. The list of input PDFs (each non-PDF input has been converted to a single-page PDF first — see "Image inputs" below) is handed to an underlying merge tool, which performs all the heavy lifting in one pass: parsing each input, renumbering objects to avoid collisions, concatenating page trees, reattaching resources, rewriting cross-references, and serializing the result. The converter's only explicit step after the merge is to set the output PDF's /Title to "Combined PDF". Everything else — outline reconciliation, metadata handling, /ID generation, version selection — is the merge tool's behavior, not something we hand-code.
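As an illustration of the "concatenating page trees" step, the sketch below shows one way a merge tool could do it: add a new Pages node whose kids are the two existing page-tree roots. This is an assumption about a possible approach, not a description of the tool the converter actually calls, which may flatten the tree instead. Objects are plain dicts, references are plain integers, and both function names are hypothetical.

```python
def concatenate_page_trees(objects, root_a, root_b):
    """Add a new Pages node that adopts both existing page-tree roots."""
    new_root = max(objects) + 1
    merged = {number: dict(body) for number, body in objects.items()}
    merged[new_root] = {
        "Type": "Pages",
        "Kids": [root_a, root_b],
        # Count is the number of leaf pages below a node, not the number of kids.
        "Count": merged[root_a]["Count"] + merged[root_b]["Count"],
    }
    merged[root_a]["Parent"] = new_root
    merged[root_b]["Parent"] = new_root
    return merged, new_root


def list_pages(objects, node):
    """Walk the tree depth-first and return leaf pages in reading order."""
    body = objects[node]
    if body["Type"] == "Page":
        return [node]
    pages = []
    for kid in body["Kids"]:
        pages.extend(list_pages(objects, kid))
    return pages


# Two already-renumbered inputs: A has pages 3 and 4, B has page 7.
objects = {
    2: {"Type": "Pages", "Kids": [3, 4], "Count": 2},
    3: {"Type": "Page", "Parent": 2},
    4: {"Type": "Page", "Parent": 2},
    6: {"Type": "Pages", "Kids": [7], "Count": 1},
    7: {"Type": "Page", "Parent": 6},
}
merged, root = concatenate_page_trees(objects, 2, 6)
print(list_pages(merged, root))  # A's pages first, then B's
print(merged[root]["Count"])
```

Nesting the old roots keeps each input's subtree intact, so inherited attributes on the original Pages nodes continue to apply to their own pages.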

What does not survive a merge

Many higher-level features are scoped to a single file's catalog: form-field name dictionaries (AcroForm), document-wide JavaScript, named destinations, embedded file attachments, and some annotations referencing pages by name rather than object. A typical merge concatenates page trees but doesn't reconcile these higher-level structures, so they tend to drop or partially break in the output. Digital signatures break too: a signature covers a fixed byte range of the file it was applied to, and a merge rewrites the whole file, so the signed bytes no longer exist in the output.

If a workflow depends on AcroForm fields, named destinations, or document-level scripts surviving, verify the result in a viewer; if anything is missing, the workaround is to combine in a desktop tool that does field-level reconciliation, then upload the result.

Image inputs

CombinePDF accepts not just PDFs but also JPG, PNG, BMP, TIFF, HEIC, WebP, AVIF, and SVG. Each non-PDF input is rendered to a single PDF page first — one image per page, fitted to A4 at 96 DPI — and then merged into the output the same way as native PDFs.
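The placement arithmetic can be sketched as follows. The A4 size in points and the 72/96 pixel-to-point ratio are standard; everything else is an assumption made for illustration: that images are never upscaled, that they are centered, and that there are no margins. The function name fit_to_a4 is hypothetical.

```python
A4_WIDTH_PT, A4_HEIGHT_PT = 595.28, 841.89  # 210 mm x 297 mm at 72 pt per inch
PT_PER_PX = 72 / 96                         # one pixel at 96 DPI is 0.75 pt


def fit_to_a4(width_px, height_px):
    """Return (x, y, width, height) in points for an image placed on A4."""
    width = width_px * PT_PER_PX
    height = height_px * PT_PER_PX
    # Assumption: shrink oversized images uniformly, never enlarge small ones.
    scale = min(1.0, A4_WIDTH_PT / width, A4_HEIGHT_PT / height)
    width, height = width * scale, height * scale
    # Assumption: center the image on the page.
    x = (A4_WIDTH_PT - width) / 2
    y = (A4_HEIGHT_PT - height) / 2
    return x, y, width, height


print(fit_to_a4(400, 300))    # small image: kept at its 96 DPI size
print(fit_to_a4(1600, 1200))  # large image: scaled down to the page width
```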

For CombinePDF specifically, the image-to-PDF step is tuned for a small final file rather than archival fidelity: JPEG inputs are re-encoded at quality 85 when the source quality is higher, and palette and grayscale PNGs are lossily quantized for size. If you need higher fidelity for image inputs, use a tool with a higher-quality target: the JPG-to-PDF and PNG-to-PDF tools in this family use quality 90 for JPEG (still lossy, but closer to the source) and lossless optimization for PNG.

Why the output can be larger than the sum

If both inputs embed the same font (a common Latin sans-serif), a naïve combine writes both copies, so the overlap saves nothing. Smart combiners deduplicate font streams by hashing, but most don't. Expect a 1.0 MB + 1.0 MB combine to produce 1.95–2.05 MB, occasionally 2.1 MB if metadata blocks duplicate.
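Deduplication by hashing can be sketched in a few lines. This is a toy model, assuming streams are available as raw bytes keyed by object number; deduplicate_streams is a hypothetical name, and a real tool would also compare each stream's dictionary (filters, lengths) before treating two objects as identical.

```python
import hashlib


def deduplicate_streams(streams):
    """Keep one copy per distinct byte content.

    Returns (kept, remap): kept maps surviving object numbers to their bytes,
    remap maps every original object number to the survivor that replaces it.
    """
    first_seen = {}  # content digest -> first object number with that content
    kept = {}
    remap = {}
    for number in sorted(streams):
        digest = hashlib.sha256(streams[number]).digest()
        survivor = first_seen.setdefault(digest, number)
        remap[number] = survivor
        if survivor == number:
            kept[number] = streams[number]
    return kept, remap


font = b"\x00\x01\x00\x00 pretend font program bytes"
streams = {4: font, 9: b"pretend image bytes", 14: font}
kept, remap = deduplicate_streams(streams)
print(sorted(kept))  # surviving object numbers
print(remap)         # references to 14 must now be rewritten to 4
```

This only helps when the embedded bytes are truly identical. Two files that each embed a different subset of the same font produce different streams and will not collapse.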

Why it can be smaller

If you re-save through a writer that does object stream compression (PDF 1.5+) or removes orphaned objects from incremental-update layers, the output can be 5–15% smaller than the inputs combined. This is why some tools advertise "compress" and "combine" as separate steps when they are essentially the same code path with one extra pass.
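Removing orphaned objects is essentially a reachability walk over the object graph. The sketch below assumes a simplified model in which each object number maps to the list of object numbers it references; reachable_objects is a hypothetical name, and a real writer would start from every trailer entry (such as /Root and /Info), not only the catalog.

```python
def reachable_objects(objects, root):
    """Return the set of object numbers reachable from root."""
    seen = set()
    stack = [root]
    while stack:
        number = stack.pop()
        if number in seen or number not in objects:
            continue  # already visited, or a dangling reference
        seen.add(number)
        stack.extend(objects[number])
    return seen


# Toy graph: 6 and 7 are leftovers from an earlier incremental update.
objects = {
    1: [2],     # catalog -> pages
    2: [3, 4],  # pages -> two page objects
    3: [2, 5],  # page -> parent, shared font
    4: [2, 5],
    5: [],      # font
    6: [7],     # superseded page, no longer referenced
    7: [],
}
live = reachable_objects(objects, 1)
orphans = set(objects) - live
print(sorted(live))
print(sorted(orphans))  # these can be dropped when the file is rewritten
```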