Search Without Evidence
What the NARA Catalog API Knows But Does Not Expose
This article began as a measurement exercise: a plan to use the NARA Catalog API to audit OCR coverage across the federal archive, mapping which record groups had it, which did not, and where the fault line fell between machine-readable and effectively dark collections.
The plan collapsed on contact with the data. What emerged instead is a different finding, more precise and harder to dismiss: the catalog’s search retrieves records using text that its standard API response never returns.
The Experiment
The NARA Catalog API v2 is the programmatic interface to the National Archives’ online catalog. It returns JSON metadata for archival records: titles, scope notes, date ranges, creator information, physical descriptions, and, for digitized records, direct links to the scanned image files themselves. NARA’s own documentation describes it as providing access to “all available archival descriptions, authority records, metadata about digital objects ... extracted text (such as optical character recognition or OCR text) ... and public contributions.”
That final item, extracted text including OCR, is the subject of this investigation.
The methodology was empirical. The /records/search endpoint was queried for digitized textual records across a stratified sample: six record types (Textual Records, Still Picture Records, Moving Images, Sound Recordings, Architectural and Engineering Drawings, and Maps and Charts), five historical eras (pre-1940 through 2001 and later), and ten record groups including the CIA, the Office of Strategic Services, the State Department, the Office of War Information, and the Veterans Administration pension files. For each returned record, the full API response was inspected for any field containing extracted text, OCR content, or any indication that a document’s machine-readable content had been processed and made available.
Across 750 records in the sample (approximately 370 unique items after deduplication across strata), the results were uniform. Every record returned the same set of fields: title, ancestors, physical occurrence data, access restrictions, and digital object references. No record contained an extractedText field. No record contained hasExtractedText. The seven fields present in every digital object were objectFilename, objectUrl, objectFileSize, objectDesignator, objectId, objectDescription, and objectType. One of these, objectUrl, is a direct link to the digitized file itself, the scanned page image or PDF held in NARA’s cloud storage. The API, therefore, returns the document image to an automated user. The text printed inside that image was not present in any field of the standard /records/search response.
The initial interpretation was obvious: the API does not return OCR text. But a careful reading of the evidence suggested a more cautious conclusion was required before publication. Absence from the response does not prove absence from the index. The API could search OCR text internally for retrieval, while suppressing it from responses. Those are different things, and the distinction matters.
The Decisive Test
To determine whether OCR participates in API retrieval, a phrase search test was designed. The logic is simple: certain phrases appear routinely in the body of historical legal and administrative documents, specifically in the handwritten text of pension affidavits, pardon applications, crew manifests, and oath records, but they would almost never appear in archival metadata fields such as titles or scope notes. If searches for these phrases return records and those records have no metadata explaining the match, then OCR text must be driving the retrieval.
Ten phrases were tested:
“hereby certify” “personally appeared before me” “being duly sworn” “witnesseth” “subscribed and sworn” “aforesaid personally appeared” “deponent saith” “before the undersigned” “subscribed and sworn to before me” “in the year of our lord”
For each phrase, the /records/search endpoint was queried with availableOnline=true, and up to twenty returned records per phrase were inspected, with every exposed text field (title, scope note, and ancestor titles) checked for any occurrence of the phrase.
The results were unambiguous. Every phrase returned substantial results. On 3 June 2026, the query for “subscribed and sworn” returned 594,130 records, “being duly sworn” returned 493,036, and “personally appeared before me” returned 293,923, each retrieved as GET /api/v2/records/search?q=”“&availableOnline=true. Because a live search index changes over time, these are point-in-time counts rather than stable values. Across all ten phrases and all spot-checked records, the phrase appeared in exposed metadata in exactly one case: the scope note for a photograph of Jesse James, which happened to contain the phrase “hereby certify” in its archival description. Every other spot-checked match came from records whose titles are date ranges, file numbers, and series labels. Records like “1815 (D) - 1815 (P)” and “Inward Crew Lists, 1858-1859” and “Alphabetical File - S.” Records with empty scope notes.
These records are being retrieved based on text that does not appear in any exposed text field of their standard search responses.
The Exhibit
To close the evidentiary chain, one record from the “personally appeared before me” results was examined through both routes: NAID 57399071, “Confederate Applications for Pardon and Amnesty, Georgia: Hammock, J D.”
Via the API, records/search?naId=57399071 returned the full record: onlineResources, levelOfDescription, recordType, useRestriction, title, physicalOccurrences, accessRestriction, dataControlGroup, generalRecordsTypes, microformPublications, variantControlNumbers, digitalObjects, ancestors, naId. The fields extractedText and hasExtractedText were both absent, and each of the three digital objects carried only file-level fields such as objectUrl, objectFilename, and objectFileSize. Every objectUrl resolved to one of the three scanned pages, stored as an image. The API returned the pages as pictures, not as words.
Via the web interface, the same record displayed an 1865 pardon application in 19th-century cursive, with a sidebar panel headed “Optical Character Recognition (OCR) Extracted Text, Auto-generated by the National Archives Catalog.” The output is imperfect, as machine processing of antebellum handwriting tends to be, but on the second page, the panel reads “this day personally appeared before me,” the phrase legible in the cursive original. That text appears in the UI panel but is absent from every field of the API response, even though the response links directly to the image it was extracted from.
Together, these observations strongly indicate that OCR-derived text participates in retrieval for at least some classes of digitized records in the NARA Catalog search index, yet is absent from the standard search responses returned to API users. The most plausible explanation for the match is that text captured by NARA’s extracted-text process and placed in the search index drove the retrieval: the API returned the image for those words, but not the words themselves.
A Second Exhibit, at Scale, The Hammock pardon establishes the gap for a single record, traced end to end. A second test shows the same pattern holding across independent records and across record types.
On 17 June 2026, an exact-phrase search for “the said deponent further deposes and says,” a clause that belongs to the body of a sworn statement rather than to any cataloging field, returned four records (queried as q=”the said deponent further deposes and says”, availableOnline=true, resultTypes=item).
Each of the four was then inspected in full: every string value anywhere in its API response was checked for the phrase. In all four, the phrase was absent from every field. For the first record alone, that check covered 414 separate string values and found nothing.
The four records are substantial. They include Revolutionary War Pension and Bounty Land Warrant Application File W. 1245 for Daniel De Wolf of Connecticut (NAID 54421828, 60 page images) and pension and bounty-land file R. 306 (NAID 53850631, 95 page images). The remaining two are even more telling, because their titles contain no descriptive language at all. One is titled only “File # 17 2 of 2” (NAID 77327296, 167 images). The other is titled “763.72115-768.7515/10” (NAID 26774252, 649 images). A title that is a file number or a shelf range cannot explain why a search for a deposition clause returned it. The only field that could have produced the match is the extracted text that the response does not include.
That is the finding restated without reliance on a single example. A distinctive sentence from inside scanned documents retrieves four separate records spanning hundreds of page images, and not one of those records exposes, anywhere in its API response, the sentence that retrieved it. For one of the four, the De Wolf file, the matching text was located directly. An interior page holds the sworn statement of Hubbel Stevens of West Stockbridge, in Berkshire County, Massachusetts, attesting that he knew Daniel De Wolf at the time of De Wolf’s enlistment in the Continental Army in the spring of 1777. That statement includes the clause “And the said deponent further deposes and says”; the catalog’s Extracted Text panel reproduces it from the partner-contributed extracted text, and the API response for the same record contains it nowhere. The match derives from indexed full-text content that the standard API response does not return, thereby confirming that the catalog treats the quoted string as an exact phrase.
What This Means
NARA’s catalog presents two different archives depending on the method of access. The records are the same in both. What differs is how much of each record the access path reveals.
Through the web interface, a researcher can search the full text of millions of digitized pages. OCR output, however imperfect, participates in retrieval and is displayed alongside the document image in a dedicated panel. A genealogist searching for a name, a historian tracing a legal phrase, a journalist looking for a specific document: all of them benefit from a search layer that reaches inside the images.
Through the API, a researcher encounters a different situation. Retrieval still happens, as this testing establishes, but the text that drives it is not returned. The API user can discover that a record matched. They cannot see what matched. They cannot read the matching text. They cannot determine which of a record’s many pages contains the relevant content. They cannot verify that the match is meaningful. They cannot reproduce their search results from the data the standard API response provides, because that data does not contain the information that generated the results.
The precise shape of that gap matters because it is narrower and sharper than it first appears. The API does not return a placeholder that forces a detour to the catalog website. It returns substantial descriptive and structural metadata, and within the digitalObjects array, it returns objectUrl, a direct link to the digitized document: the same scanned page image a person views in the catalog viewer. An automated user can follow that link and download the actual document without ever loading a catalog web page. What the response omits is the single layer that would explain the result: the text the search index extracted from those pages by OCR and used to match and rank the record. The document image is handed over. The words read from it are not.
This is not primarily a coverage problem. It is a transparency and reproducibility problem.
In most search systems, retrieval and evidence travel together. A search engine returns a result and a snippet showing why it matched. The NARA Catalog’s standard API separates those functions. A record can be retrieved because of OCR-derived text, yet the text itself is absent from the response, even when the response links to the very image that text came from. That makes independent verification difficult and limits researchers’ ability to build reproducible workflows on top of the catalog.
For human researchers using the catalog’s web interface, OCR-driven retrieval is a useful feature, if imperfect and partially implemented. For computational researchers, digital humanists, AI systems, and anyone building programmatic pipelines against the catalog, the situation is more constrained. They can benefit from OCR retrieval without knowing it, finding records that metadata search alone would miss. They can also download the images from the linked page. But they cannot read those images as text without doing the work over again, they cannot know whether a result was driven by a title match or a text match, and they cannot cite text that the response never provided.
The study also documents a discrepancy between documentation and observable behavior. NARA’s documentation describes the API as providing access to “extracted text (such as optical character recognition or OCR text),” yet this testing did not locate extracted text in any standard search response examined. The /records/search endpoint returned no OCR content across a stratified sample of hundreds of records and ten targeted phrase searches. Whether other endpoints expose this content falls outside the scope of this study; the API’s Swagger documentation at catalog.archives.gov/api/v2/api-docs/ is the authoritative reference. What can be documented is the gap between what the standard search interface returns and what the catalog’s own web interface displays to users of the same records.
Implications for AI Systems The gap between retrieval and response has particular consequences for AI systems, a growing class of NARA catalog users that includes research assistants, document retrieval tools, and archival query interfaces. NARA’s Catalog API is explicitly designed to support reuse of archival data outside the Catalog, and NARA has publicly discussed AI-assisted discovery, search, and access as emerging areas of archival practice.
When a language model or retrieval pipeline queries the catalog for a phrase like “personally appeared before me,” the API returns matching records, but the text that drove the match is absent from the response. For records with sparse or empty scope notes, the response contains little beyond a title and a link to the page images. The system retrieved a record because of its contents, yet those contents are not available through the response.
This decouples retrieval from comprehension in a way that is invisible to the system. The search index reaches inside scanned images; the response links to those images but not to their text. The search succeeds and returns records, so nothing indicates that the matching evidence is missing.
The effect is most acute for Retrieval-Augmented Generation pipelines built solely on the /records/search response. RAG depends on retrieving relevant text and passing it to a model as context. A tool built on this endpoint retrieves records but has no extracted text to pass downstream, and no way to inspect or cite the text that caused the match. Confident retrieval paired with absent content can raise rather than lower the risk of a fluent but unsupported answer.
The workaround available to a human researcher, opening the record in the web interface and reading the Extracted Text panel, does not scale. A pipeline could follow each objectUrl, download the linked image, and run extraction independently, but that is expensive, brittle at archival scale, and forces every consumer to rebuild text the catalog has already produced.
The archive has already done the hard work. The extracted text exists, is indexed, and is displayed to web users. Exposing it through the standard API response, as a field alongside the image URLs, as a retrieval snippet, or through a dedicated endpoint, would make that investment available to the computational workflows the API was built to support.
A Note on Scope and Limitations
This article documents the behavior of the NARA Catalog API v2 /records/search endpoint as tested in June 2026. It does not test every API endpoint, and OCR text may be accessible through endpoints or parameters not examined here. The finding is scoped to what API users relying on the standard search interface will encounter.
The article is also not an attack on NARA. The extracted-text layer is real and significant. NARA’s own Tesseract OCR runs across millions of JPG images, and partner-contributed machine-generated text adds to it, for example, FamilySearch-supplied extracted text on digitized pension files. Together, they index handwritten affidavits, pardon applications, and crew manifests at scale. That is a genuine archival achievement. The catalog team built something that works. Human researchers benefit from it.
The gap between the UI and the API is almost certainly not a deliberate choice. It is more likely an implementation consequence: the search index serves one interface to the web frontend and another to the API layer, and the connection between retrieval and response was not built for programmatic consumers. Other cultural heritage institutions have addressed similar gaps by returning match snippets, exposing confidence scores, and providing direct access to extracted text fields. The technology exists.
What is worth naming, precisely because it is invisible, is this: a researcher can retrieve a record because the system matched text inside a scanned image. The API returns a link to that image, but not the text; the researcher cannot read it, and it leaves no trace in the standard response. For anyone attempting reproducible computational research using the catalog, that asymmetry matters.
The Methodological Contribution
This project began as an OCR coverage audit and evolved into something else during testing. The original question, how much OCR does NARA have, turned out to be unanswerable through the standard API, because the standard API does not expose the field that would answer it. The pivot to phrase-search testing, and then to manual verification against the UI, produced a more important finding than the original question would have yielded even if the data had cooperated.
That methodological evolution is worth acknowledging, because it reflects something real about empirical research with archival APIs: the systems do not always behave as documented, and the gap between documentation and behavior is itself evidence. This project set out to measure OCR coverage. Instead, it found that the question of coverage is obscured by a more fundamental question about what the standard API response exposes and what it does not.
The answer to that more fundamental question is the finding. Retrieval and evidence have been separated. The archive’s search layer exposes more information than its standard API responses reveal.
For Researchers and Developers
For research pipelines built against the NARA Catalog API, several practical implications follow from this finding:
Keyword searches are richer than they appear. Because OCR participates in retrieval, a search for a specific name or phrase will return records that a metadata search alone would miss. This is useful. It also means the result set cannot be fully reverse-engineered from the metadata the API returns.
Zero results do not prove archival absence. If a phrase search returns nothing through the API, that absence is not evidence that the archive lacks relevant documents. A document may exist, may be digitized, may have been OCR’d, and may be unretrievable through the parameter set used, for unrelated reasons.
The digitalObjects array links the file, not its text. Each objectUrl points to the scanned image or PDF, which any pipeline can download directly from the API response. This means the document image is available programmatically. The OCR text derived from that image is not included in the response. Obtaining it requires downloading the file and running OCR on it independently.
The “Extracted Text” panel in the UI is the human-facing OCR surface. For any record of interest, the catalog web interface displays OCR output in a dedicated sidebar panel. This is the text that participates in the search. It is not returned in standard /records/search responses, even though those responses link to the image the text was extracted from.
Direct record endpoints behaved differently from search during testing. Requests to records/{naId} returned an HTML page rather than a JSON representation. The records/search?naId={naId} pattern is the correct approach for retrieving a specific record programmatically, and it returns the same field set as any other search, without extracted text.
The 1865 pardon application of J.D. Hammock of Taliaferro County, Georgia, contains the words “this day personally appeared before me.” A researcher searching the NARA catalog for that phrase will find it, and the API will even return a link to the scanned page it appears on. A researcher building a tool on top of the NARA API will not know why the record matched, and will not be given the words that matched it.
That asymmetry, between what the system knows and what it tells, is the shadow of the archive.
Research conducted in June 2026. All queries run against the NARA Catalog API v2 /records/search endpoint. NAID 57399071 verified via both API query and the National Archives Catalog web interface at catalog.archives.gov. The Swagger documentation for the full API surface is available at catalog.archives.gov/api/v2/api-docs/.
MetaArchivist publishes independent research on archives, AI, and the future of the historical record.




