I slipped into the late morning session at PDF Days Europe just as Matthew Hardy from Adobe took the stage. Hardy has been in the PDF world longer than many have been in their careers—26 years with the format, 20 of those at Adobe, plus deep involvement in ISO standards and the PDF Association. He’s usually the accessibility guy, chairing PDF/UA and reuse working groups, but this talk was broader. He wanted to explore what happens when artificial intelligence collides with PDF—the document format that quietly underpins trillions of files around the world.
Hardy began with a reminder of how PDF started: a fixed-layout, page-independent carrier of visual content. The early focus was print fidelity. But from the late 1990s onward, the format began to absorb logical structure, tagged content, and metadata. In theory, PDFs could be rich and semantically aware. In practice, though, most were created the quick and dirty way: by printing to PDF. That stripped out structure, leaving just pictures of text. Adobe’s own estimates match what others here have reported—only about 20 percent of PDFs are tagged, and just a fraction of those are well-tagged.
That leaves us with a huge problem: trillions of flat PDFs that lock up data, inaccessible to screen readers, impossible to reflow on mobile devices, and resistant to reuse. A decade ago, Adobe and others began looking at AI to fix this. Hardy described experiments with object-detection models that could recognize headings, tables, and reading order. The famous “PDFs are not cats” line came from this period—models trained to spot cats and dogs struggle when the “objects” are boxes of text, footnotes, and nested tables. Still, with enough training data and money, you could produce reflowable, responsive PDFs that adapted to different screen sizes. For accessibility, this meant a giant leap forward, though Hardy admitted the process was brittle, costly, and far from perfect.
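That "only about 20 percent tagged" figure is something you can probe on your own corpus: a PDF declares itself tagged via a /MarkInfo dictionary with /Marked true in its document catalog. Here is a minimal sketch of that check, with the big caveat that it is a naive byte scan, not a real parser — in modern files the catalog often sits inside a compressed object stream, so serious code should use a proper PDF library instead:

```python
import re

def looks_tagged(pdf_bytes: bytes) -> bool:
    """Naive check: does the raw PDF declare /MarkInfo << /Marked true >>?

    Caveat: the catalog may live in a compressed object stream, so a
    byte scan can produce false negatives; use a real PDF library for
    anything beyond a quick survey.
    """
    return re.search(rb"/MarkInfo\s*<<[^>]*?/Marked\s+true", pdf_bytes) is not None

# Two hand-rolled catalog fragments, just enough to exercise the check.
tagged = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog /MarkInfo << /Marked true >> >>\nendobj\n"
untagged = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"

print(looks_tagged(tagged))    # True
print(looks_tagged(untagged))  # False
```

Even a crude scan like this makes the "messy majority" visible: run it over a folder of PDFs and the untagged ones pile up fast.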
Then came the big shift: large language models. Classic machine learning could only go so far—it needed handcrafted features, labeled data, and expensive retraining for every new use case. LLMs, on the other hand, arrived preloaded with a vast (if imperfect) knowledge base. Combine that with user documents, and suddenly the possibilities expanded. PDF could become conversational. Instead of scrolling through hundreds of pages, you could ask questions of the document. Instead of one piece of alt text struggling to describe an image, AI could interpret both the picture and its context. Instead of hunting through tables, you could ask for charts, summaries, or comparisons across multiple reports.
The Use Cases Hardy Highlighted
Hardy spent much of his talk sketching out the real-world uses he sees for AI-powered PDF:
Conversational PDFs: Ask questions of a report and get answers tied back to the source. No more guessing where the AI pulled something from—citations can link directly to document passages.
Self-explanatory documents: Drop multiple financial statements into the system and ask for trend analysis over time. The AI does the synthesis, sparing you from manual comparison.
Personalized outputs: The same PDF can adapt differently for a software engineer, a medical doctor, or a high school student—each receiving a version tailored to their knowledge level and needs.
Smarter navigation: Bookmarks can’t anticipate every use case, but AI can. Topic- or theme-based navigation lets you jump through documents in new ways.
Collaborative reading: Groups can interact with the same PDF in a chat-like environment, comparing interpretations and working from the same grounded content.
Alt text reimagined: Instead of one string per image, AI can look at the picture and the surrounding context, then allow users to interrogate the image: who’s in it, what’s happening, what the chart means.
Dynamic visualizations: Complex tables can be turned into charts, graphs, or even simplified visual summaries on demand, tailored to what a user wants to see.
Synthesis and creation: Hardy described using AI to generate a draft ISO tech note from a transcript of a working group discussion. Not perfect, but a solid starting point that saved hours of work.
Accessibility unlocks: AI can simplify language, translate content on the fly, or generate alternative formats like audio—all while still grounding in the authoritative PDF of record.
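The conversational-PDF pattern at the top of this list boils down to retrieval plus grounding: chunk the extracted text, find the passages most relevant to a question, and hand them to a model together with a citation pointing back at the source. A minimal sketch of the retrieval half, under loud assumptions (keyword overlap instead of embeddings, plain strings standing in for real PDF text extraction, and the LLM call itself omitted):

```python
import re

def _tokens(s: str) -> set[str]:
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def cite_passages(pages: list[str], question: str, top_k: int = 1) -> list[tuple[int, str]]:
    """Rank per-page texts by crude keyword overlap with the question
    and return (page_number, text) pairs to use as grounded citations."""
    terms = _tokens(question)
    scored = sorted(
        enumerate(pages, start=1),
        key=lambda p: len(terms & _tokens(p[1])),
        reverse=True,
    )
    return scored[:top_k]

# Stand-in for text extracted from a PDF, one string per page.
pages = [
    "Revenue grew 12 percent year over year, driven by subscriptions.",
    "The board approved a new accessibility roadmap for tagged PDF.",
    "Headcount remained flat; travel costs fell sharply.",
]

for page_no, text in cite_passages(pages, "What did the board say about accessibility?"):
    print(f"[p.{page_no}] {text}")  # cites page 2
```

A real system would swap the overlap score for embeddings and pass the cited passage into the model's prompt, but the shape is the same: the answer always carries a pointer back into the document of record.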
Hardy stressed that this isn’t just about convenience. For accessibility, this is transformative. For research and compliance, it means faster ways to synthesize information while still grounding answers in the source documents. For design and reuse, it means the ability to re-present content—turning static tables into visualizations, or translating technical material into plain language for new audiences. And crucially, because the original PDF remains the “document of record,” provenance is preserved even as the content is reshaped.
It wasn’t all utopian. Hardy acknowledged the risks: cost, brittleness, hallucinations. He emphasized that well-tagged PDFs remain the gold standard—UA-2 plus reuse features give AI the best starting point. But he also sees a future where AI can bridge the gap for the messy majority of PDFs already out there.
His closing line stuck with me: “The next trillion PDFs won’t just preserve knowledge—they’ll empower it.”
For those of us in archives and records, that’s a challenge as much as an opportunity. If AI makes legacy PDFs more usable, do we treat those as new records or derivative tools? How do we weigh provenance against transformation? And as the standards community, where do we draw the lines for what AI should—and shouldn’t—do with documents of record? These are questions that go far beyond the PDF Association hall in Berlin. But Hardy was right: the next trillion PDFs are already on their way.