How PDF.js Works

Jun 2020

author
Adam Pez

PDF.js is an open-source JavaScript PDF viewer that renders PDFs using standards-compliant HTML5.

Best known as the built-in PDF viewer in Mozilla Firefox, PDF.js also serves as an easy way for developers and integrators to embed PDF viewing capabilities in a web app or server.

In this article, we walk you through how PDF.js works to render PDFs, what technologies it uses, and implications for projects using PDF.js.

Basic Architecture

The pre-built PDF.js download is structured into three layers:

Core - In charge of parsing and interpreting PDF binary instructions.

Display - An API exposing functions to render PDF pages.

Viewer - A sample user interface with features like search, rotate, zoom, a page thumbnail sidebar, different viewer modes, and so on.

Developers can choose to use the pre-built viewer -- or build a custom viewer on top of the PDF.js rendering engine by hooking into the Display API, the method we use for our React-based PDF.js Express UI.

The file structure for pre-built PDF.js is also presented in the PDF.js FAQ as follows:

Prebuilt
├── build/
│   ├── pdf.js                             - display layer
│   ├── pdf.js.map                         - display layer's source map
│   ├── pdf.worker.js                      - core layer
│   └── pdf.worker.js.map                  - core layer's source map
├── web/
│   ├── cmaps/                             - character maps (required by core)
│   ├── compressed.tracemonkey-pldi-09.pdf - PDF file for testing purposes
│   ├── debugger.js                        - helpful debugging features
│   ├── images/                            - images for the viewer and annotation icons
│   ├── locale/                            - translation files
│   ├── viewer.css                         - viewer style sheet
│   ├── viewer.html                        - viewer layout
│   ├── viewer.js                          - viewer layer
│   └── viewer.js.map                      - viewer layer's source map
└── LICENSE

Technologies Used

PDF.js leverages Asynchronous JavaScript and XML (AJAX) to download the PDF file from a web server and parse its contents. Once prepared, content is then rendered onto an HTML5 <canvas> element using canvas drawing commands.

HTML and CSS are then used to specify UI elements as well as a transparent text overlay, enabling text selection, text search, and copy/paste.

How PDF.js Renders and Processes PDFs

The canvas and PDF graphics models are similar, each supporting text, raster images, and vector graphics. Nevertheless, there are also substantial differences, with PDF supporting many advanced shadings, patterns, and transparencies not found in canvas.

In order to make PDF binary understandable to canvas, PDF.js must first process the PDF file, including translating some PDF graphics for canvas.

Processing PDFs directly in the main browser thread would be problematic for the UX, however, as a large document could block the UI, causing the page to freeze and become unresponsive for the user.

PDF.js therefore uses Web Workers to process PDFs in a background thread.
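
In the pre-built download, the worker script is the core layer file (build/pdf.worker.js). As a minimal sketch, the display layer can be told where to find it. Here we assume `pdfjsLib` is the object exported by build/pdf.js; the helper name `configureWorker` and the path in the usage note are illustrative only.

```javascript
// Hypothetical helper: tells the display layer where the core-layer worker lives.
// `pdfjsLib` is assumed to be the object exported by build/pdf.js.
function configureWorker(pdfjsLib, workerUrl) {
  // workerSrc should be set before the first document is loaded,
  // so that parsing runs in the background thread.
  pdfjsLib.GlobalWorkerOptions.workerSrc = workerUrl;
  return pdfjsLib.GlobalWorkerOptions.workerSrc;
}
```

A page would typically call something like `configureWorker(pdfjsLib, "../build/pdf.worker.js")` once at startup, adjusting the path to wherever the files are hosted.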

Processing then follows these steps, in rough order of sequence:

  • Decompressing and decrypting the PDF binary
  • Parsing information into an object tree, including:
    • Correcting issues with malformed or corrupt PDFs
    • Creating a fetch table to locate page resources
  • Extracting text, images, and vector graphics:
    • Decoding non-JPEG and CMYK image formats not supported by the browser (e.g., JBIG2, JPEG 2000, etc.)
    • Converting all fonts to scalable OpenType, dealing with non-standard fonts as well as characters without Unicode values
    • Converting CMYK color spaces to browser RGB
    • Transforming some PDF objects for the canvas graphics model
  • And more
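
To give a feel for one of these steps, here is a minimal sketch of a CMYK-to-RGB conversion. It uses the naive textbook formula, which we assume only for illustration; it is not the conversion PDF.js itself performs, and it ignores color profiles entirely.

```javascript
// Naive device CMYK -> RGB conversion (inputs in the range 0..1).
// This is the textbook formula, not the tuned approximation a real
// renderer would use, and it ignores ICC color profiles entirely.
function cmykToRgb(c, m, y, k) {
  const toByte = (ink) => Math.round(255 * (1 - ink) * (1 - k));
  return [toByte(c), toByte(m), toByte(y)];
}
```

For example, `cmykToRgb(0, 1, 1, 0)` (full magenta plus full yellow) yields pure red, `[255, 0, 0]`.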

Next, PDF.js loads pages for display via JavaScript Promises, and pages render and draw onto a canvas element within a viewport. Viewport dimensions are specified in HTML, while canvas page images are set to the same scale as the PDF page by default.
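
The Promise-based flow described above can be sketched roughly as follows. We assume `pdfjsLib` is the display-layer object from build/pdf.js and that `canvas` is an HTML canvas element; `renderPage` is a hypothetical helper name, and parameter names may differ between PDF.js versions.

```javascript
// Sketch: render one page of a PDF onto a canvas via the Display API.
// `pdfjsLib` is assumed to be the display-layer object (build/pdf.js);
// `renderPage` itself is a hypothetical helper name.
async function renderPage(pdfjsLib, url, canvas, pageNumber = 1, scale = 1.0) {
  const pdf = await pdfjsLib.getDocument(url).promise; // loading task -> document
  const page = await pdf.getPage(pageNumber);          // pages are 1-indexed
  const viewport = page.getViewport({ scale });

  // Size the canvas to the scaled page before drawing.
  canvas.width = Math.floor(viewport.width);
  canvas.height = Math.floor(viewport.height);

  // render() returns a task whose promise resolves when drawing is done.
  await page.render({
    canvasContext: canvas.getContext("2d"),
    viewport,
  }).promise;
  return viewport;
}
```

Passing a larger `scale` produces a larger viewport and canvas, which is how a custom viewer would typically implement zoom.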

Canvas also supplies the 2D transformation matrix, used to rotate and resize pages as users interact with the document via the UI (e.g., zooming in or out). This transformation matrix, in turn, relies on a page coordinate system, which PDF also uses to add form fields and annotations to a page, locate words on a page, and perform any other operations involving page geometry (e.g., measurement).
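
As a minimal sketch of the translation involved, the functions below map a point between PDF page space and canvas space. They assume the simplest case only: an unrotated page whose origin sits at (0, 0). The function names are ours; the PDF.js viewport object offers its own helpers (`convertToViewportPoint` and `convertToPdfPoint`) for the general case.

```javascript
// Map a point from PDF page space (origin bottom-left, y grows upward,
// units of 1/72 inch) to canvas space (origin top-left, y grows downward).
// Assumes an unrotated page whose origin is at (0, 0).
function pdfPointToCanvas(x, y, pageHeight, scale) {
  return [x * scale, (pageHeight - y) * scale];
}

// Inverse mapping, e.g. for turning a mouse click into PDF coordinates.
function canvasPointToPdf(cx, cy, pageHeight, scale) {
  return [cx / scale, pageHeight - cy / scale];
}
```

On a 792-point-tall page at 200% zoom, the PDF point (72, 720) lands at canvas pixel (144, 144): one inch from the left edge and one inch from the top.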

Building support for custom annotations, form filling, or signature features on top of PDF.js would require understanding this coordinate system and its PDF-to-Canvas translations -- otherwise handled automatically within a commercial UI such as PDF.js Express, or PDFTron WebViewer, a full-fledged JavaScript PDF Library.

The Interactive Text Layer

In order to enable text select, text search, and copy/paste, PDF.js also supports an interactive text layer.

This text layer is created using a transparent DOM overlay, rendered over top of the canvas, placed using CSS positioning, and scaled using CSS transform. HTML elements are then used to specify lines, with placement of spans decided by an algorithm.
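
To illustrate the positioning step, here is a simplified sketch that computes CSS placement for a single run of text. It assumes an unrotated page and a text item carrying a PDF-style `transform` matrix; `spanStyleFor` is a hypothetical name, and the real text layer handles rotation, font metrics, and scaling far more carefully.

```javascript
// Compute CSS placement for one text run on an unrotated page.
// `item.transform` is assumed to be a PDF-style matrix [a, b, c, d, e, f],
// where (e, f) is the baseline origin and d approximates the font size.
function spanStyleFor(item, pageHeight, scale) {
  const [, , c, d, e, f] = item.transform;
  const fontSize = Math.hypot(c, d) * scale;
  return {
    left: e * scale,
    // CSS positions the top edge; the PDF origin is on the baseline, so
    // shift up by the font size (a rough stand-in for the font ascent).
    top: (pageHeight - f) * scale - fontSize,
    fontSize,
  };
}
```

A viewer would then apply these values, in pixels, to an absolutely positioned span inside the transparent overlay.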

But sometimes the algorithm does not detect lines and phrases correctly, creating issues for the UX, where text select skips over a line or paragraph, or text search has difficulty locating search results including phrases.
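
The sketch below shows why run boundaries matter for phrase search. It is our own simplified illustration, not the algorithm PDF.js uses: separately positioned runs of text are joined with a space and normalized before matching.

```javascript
// Sketch: search for a phrase across separately positioned text runs.
// Joining runs with a space and collapsing whitespace lets a phrase that
// wraps across two lines still match; joining with "" would not.
function containsPhrase(runs, phrase) {
  const normalize = (s) => s.replace(/\s+/g, " ").trim().toLowerCase();
  const haystack = normalize(runs.join(" "));
  return haystack.includes(normalize(phrase));
}
```

Note the trade-off: always inserting a space breaks words that are split mid-word across runs, which is one reason line and phrase detection is hard to get right.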

There are 90+ open issues on the PDF.js community GitHub support forum for text select alone -- more than any other single issue.

Support for Long/Complex Documents

PDF.js supports some features to deal with long documents with many pages. For example, it supports document streaming via byte range requests. This enables documents previously optimized for fast web view to display for the user almost instantly when served via a URL -- without having to wait on the entire file to download first.
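
Byte range requests rely on the HTTP `Range` header. As a rough sketch, the function below splits a file into the header values a client could send to fetch it chunk by chunk. The function name and the 64 KiB chunk size are assumptions chosen for illustration.

```javascript
// Split a file of `length` bytes into HTTP Range header values.
// Range ends are inclusive, hence the "- 1".
function rangeHeaders(length, chunkSize = 65536) {
  const headers = [];
  for (let start = 0; start < length; start += chunkSize) {
    const end = Math.min(start + chunkSize, length) - 1;
    headers.push(`bytes=${start}-${end}`);
  }
  return headers;
}
```

A viewer does not have to request the chunks in order: it can fetch the ranges holding the first visible page and defer the rest, provided the server supports range requests.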

According to one benchmark, PDF.js also performs well with most common PDFs, such as reports and invoices. But it may struggle with “very large or graphics-heavy” documents such as technical drawings, maps, design, large annual reports, and so on.

As PDF.js processes each document in a single background worker using plain JavaScript, not WebAssembly, it does not spread the rendering of many pages across multiple threads to speed it up.

Bitmap/raster Rendering

PDF.js renders from vector graphics, allowing for precise, blur-free rendering and legible text on most documents at zoom levels up to approximately 400%.

Beyond 400%, however, users have reported blurriness within documents, which can make it difficult to read small text or take measurements.

These image quality issues arise because a large canvas page takes up a lot of memory. Without canvas tiling to break the page into sections, PDF.js switches to resizing the page as a raster image. Canvas tiling has been an open feature request on the PDF.js GitHub forum since September 2015.
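
A back-of-the-envelope estimate shows how quickly canvas memory grows with zoom. The sketch assumes 4 bytes per pixel (RGBA) and ignores the device pixel ratio and any browser overhead; `canvasBytes` is a hypothetical helper.

```javascript
// Estimate the backing-store size of a canvas holding one full page.
// Page dimensions are in PDF points; RGBA uses 4 bytes per pixel.
function canvasBytes(pageWidth, pageHeight, scale) {
  const w = Math.floor(pageWidth * scale);
  const h = Math.floor(pageHeight * scale);
  return w * h * 4;
}
```

For a US Letter page (612 x 792 points), this gives about 1.9 MB at 100% and about 31 MB at 400%: memory grows with the square of the zoom factor, so 4x zoom costs 16x the memory.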

Correcting Corrupt or Malformed Documents

PDFs are produced by literally thousands of different tools in many different ways, and a good portion of these PDFs are non-standard or otherwise corrupted. Users still expect documents that open in their desktop readers to open in their web viewer. PDF.js therefore contains code that tries to correct for issues with malformed PDFs on the fly so that they still open.

"[Parsing & extracting] is relatively straightforward until you get to bad PDFs. There are a lot of bad PDFs out there that don't follow the specification. A lot of our code is going back to handle these strange cases."

~Mozilla Developer Brendan Dahl

According to one analysis, however, between 1% and 3% of certain types of documents will crash or freeze the PDF.js viewer, because the file is either corrupted or complex.

Support for the PDF Spec and Rendering Accuracy

Over the years, Mozilla and the open-source contributor community have extended PDF.js to support more of the PDF spec. Today, PDF.js supports most features found in the most common PDFs online.

Certain concepts, however, are not supported or are incomplete. These include:

  • Interactive and fillable forms
  • Spot colors
  • Overprint simulation
  • ICC Color Profiles
  • Optional Content Groups
  • A few patterns & shadings
  • Transparency groups (knockout/isolation)

Check out this guide to PDF.js rendering quality to learn more about the implications for PDF.js rendering behavior.

How PDF.js Renders Non-JPEG Images

Where possible, PDF.js relies on the browser for image decoding. The browser natively supports JPEG and a few other image compression formats, so these images convert very quickly in PDF.js.

However, PDF can embed almost any image type, including non-JPEG formats such as JBIG2 and JPEG 2000, and other image compression formats supporting CMYK colors not supported by the browser.

PDF.js must therefore first convert these formats using custom image decoders, developed by the open-source contributor community.
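
The decision can be pictured as a lookup on the image stream's filter name. The filter names below come from the PDF specification, but the native/custom split and the `needsCustomDecoder` helper are illustrative assumptions, not PDF.js's actual decision logic.

```javascript
// PDF stream filter names (from the PDF specification) mapped to whether a
// browser can typically decode the payload natively. The true/false split is
// an illustrative assumption, not PDF.js's actual decision logic.
const NATIVE_DECODE = {
  DCTDecode: true,       // JPEG
  JPXDecode: false,      // JPEG 2000
  JBIG2Decode: false,    // JBIG2 (bi-level scans)
  CCITTFaxDecode: false, // fax-style bi-level compression
};

function needsCustomDecoder(filterName, isCmyk = false) {
  // CMYK data needs conversion even when the container format is JPEG.
  if (isCmyk) return true;
  return NATIVE_DECODE[filterName] !== true;
}
```

Under these assumptions, an ordinary RGB JPEG takes the fast browser path, while a JBIG2 scan or a CMYK JPEG goes through a JavaScript decoder.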

Where PDF.js has trouble converting an image, some non-JPEG images will fail to render as expected -- or at all -- leaving some scanned documents with blank pages.

There are currently 23 open issues on the PDF.js GitHub support forum for issues related to image conversion.

How PDF.js Renders Non-standard Fonts

PDF.js also has a lot of code dedicated to working with fonts -- such as code to support non-standard fonts and font subsetting to make rendering of fonts more efficient.

Where a non-standard font is not embedded in the document file or supported locally, PDF.js will attempt to simulate it. This may lead to some slowness in document rendering as well as rendering errors: e.g., incorrect spacing and kerning, or illegible text in a worst-case scenario.

The PDF.js GitHub currently has 27 open issues related to font conversion.

Support for High-fidelity Printing and Color Management

On Mozilla Firefox, PDF.js supports fast, high-quality printing through use of mozPrintCallback. This enables PDF.js to pass information to the printer as vectors and rasterize content at a later stage.

On other browsers, however, PDF.js will render every page and pass these to the printer as memory-intensive raster images, resulting in blurriness and slowness, and failure to print some large and complicated documents.

PDF.js also does not currently support color management features such as spot colors, ICC color profiles, and overprint simulation. These features are often crucial to many marketing, publishing, advertising, and other pre-print workflows, where colors including brand colors approved for production must display on screen and print exactly as expected.

Browser Support

PDF.js viewer behavior will depend on the degree to which the browser supports PDF.js’s required features.

Where a specific feature is not supported by the browser, such as dashed lines on Safari 6, PDF.js will attempt to emulate it, including by use of polyfills.
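
To show what such emulation can look like, here is a minimal sketch for dashed lines: it uses the native canvas `setLineDash` method when present and otherwise splits the line into solid sub-segments. Both function names are ours, and this is not the code PDF.js ships.

```javascript
// Emulate a dashed straight line by splitting it into solid sub-segments.
// `pattern` alternates dash and gap lengths, e.g. [3, 2]. Returns an array
// of [x0, y0, x1, y1] segments that can each be stroked as a solid line.
function dashSegments(x0, y0, x1, y1, pattern) {
  const length = Math.hypot(x1 - x0, y1 - y0);
  if (length === 0 || pattern.length === 0) return [];
  const ux = (x1 - x0) / length;
  const uy = (y1 - y0) / length;
  const segments = [];
  let pos = 0;
  let i = 0;
  while (pos < length) {
    const step = pattern[i % pattern.length];
    if (step <= 0) break; // guard against non-advancing patterns
    const end = Math.min(pos + step, length);
    if (i % 2 === 0) {
      // Even steps are dashes; odd steps are gaps and draw nothing.
      segments.push([x0 + ux * pos, y0 + uy * pos, x0 + ux * end, y0 + uy * end]);
    }
    pos = end;
    i += 1;
  }
  return segments;
}

// Prefer the native API; fall back to the emulation when it is missing.
// Returns true when the native path was used.
function strokeDashedLine(ctx, x0, y0, x1, y1, pattern) {
  const native = typeof ctx.setLineDash === "function";
  const parts = native ? [[x0, y0, x1, y1]] : dashSegments(x0, y0, x1, y1, pattern);
  if (native) ctx.setLineDash(pattern);
  ctx.beginPath();
  for (const [ax, ay, bx, by] of parts) {
    ctx.moveTo(ax, ay);
    ctx.lineTo(bx, by);
  }
  ctx.stroke();
  return native;
}
```

The fallback issues many more drawing calls than the native path, which illustrates why emulated features tend to render more slowly.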

The contributor community notes that where PDF.js is required to emulate features, rendering speed and accuracy will not be as good.

PDF.js behavior is generally considered best in Chrome and Firefox, which are fully supported by the open-source contributor community and subject to automated testing. Older browsers such as Internet Explorer 9/10 are not supported because they lack too many required browser features.

Learn more in our article on what browsers PDF.js supports.

SVG vs Canvas Backends

PDF.js also supports a Scalable Vector Graphics (SVG) backend, initially considered by the contributor community as an alternative to canvas. The SVG backend was intended to resolve canvas's issues with regard to text selection and high-fidelity printing, as SVG offered better, built-in support for both of these features.

Today, however, the SVG backend is considered by the community to be “not production-ready”, as it is less developed, slower with text-heavy documents, and supports less of the PDF spec than the canvas backend. Most developers therefore prefer to use PDF.js with canvas.

Further Resources

Those who wish to drill further into how a simple PDF.js viewer works and other aspects of PDF.js can check out the interactive demo and code samples on the PDF.js Wiki, as well as the resources provided in our PDF.js Express documentation and blog, including our main PDF.js guide -- what is PDF.js.

We hope this article was helpful! Don’t hesitate to reach out with any questions.