No Access Without Control
The UVA Protocol and the Rise of Epidata as Archival Governance
The Crisis Is Not Data. It Is Context.
Picture a major research university’s special collections library at the end of a successful digitization campaign. Hundreds of thousands of documents now live online. Finding aids are current. Metadata is rich. The collections are described, linked, and discoverable. By any traditional measure of archival accomplishment, this is a success story.
Then an AI vendor arrives with a proposal.
The vendor wants to train a foundation model on those same collections. The archivists hesitate. They ask: "Which specific items would you use? How would outputs be attributed? Could we stop you later if we change our minds?" The vendor cannot answer any of these questions with confidence. The archivists decline. The conversation ends.
This scenario plays out across archival institutions today, and it illuminates a problem that has nothing to do with data volume or technological sophistication. The problem is context. The archivists intuitively understood that moving their collections into an AI training pipeline would strip those records of everything surrounding them: the provenance, the intent behind their creation, the ethical constraints attached to their use, and the institutional authority that governs them. The data would remain. The context would be gone.
Stephen Clarke, writing in his 2025 Substack essay “Beyond Metadata: Why ‘Epidata’ is the Next Step-Change for Information Management and AI-Readiness,” names this condition with precision. We are not facing a data problem, Clarke argues; we are facing a context deficit. We possess more data and less shared understanding than at any point in history. Our metadata practices are passive, fragmented, platform-dependent, and incapable of carrying the meaning of information across systems and time.
The University of Virginia Archival AI Protocol (UVA AAIP), developed by Leo S. Lo and released in version 1.1 in January 2026, is one of the first institutional responses to this deficit. It operates as a governance document, a negotiating framework, and a statement of values. But reading it alongside Clarke’s epidata framework and the paradata scholarship of Cameron, Franks, and Hamidzadeh reveals something deeper. The UVA protocol is not just a policy. It is an attempt, imperfect and partially realized, to defend archival context against the forces that would dissolve it.
The thesis of this article is this: The UVA protocol is an early institutional response to a structural failure in how information systems handle context. Epidata, as Clarke defines it, is the missing conceptual layer that explains both the nature of that failure and what a real solution would require. Understanding the protocol through this lens illuminates not only what it achieves but also where it falls short, and where the profession needs to go next.
From Metadata Failure to Context Deficit
Clarke’s critique of contemporary metadata practice is direct: the foundational model that governs our profession is no longer adequate. The standard tripartite division of metadata into descriptive, structural, and administrative categories traces its heritage to library and information science, where it served the cataloging of discrete, relatively static resources. This model worked brilliantly for books on shelves. It fails for dynamic, distributed digital information in AI environments.
The failure is architectural. Descriptive metadata sits in a data catalog. Structural metadata is locked in a database schema. Administrative metadata, including provenance and access rights, is fragmented across dozens of platform-specific tools. The model itself fractures the holistic view that governance requires. As Clarke writes, the primary failure of traditional metadata is its passivity. It is documentation that humans consult. It is a static label that records basic lineage and attributes but cannot capture the rich, complex relationships and dependencies that give information its meaning.
Archivists know a version of this problem intimately. Traditional archival metadata works reasonably well within controlled institutional environments. It describes the record, notes its creator, and indicates where it lives. But when records leave those environments, when they are digitized, shared across platforms, ingested into AI systems, or incorporated into training pipelines, the metadata does not travel reliably. It gets stripped away, reformatted, or ignored. The record arrives at its new destination as raw content, severed from its context.
The Cameron, Franks, and Hamidzadeh paper on paradata in archival contexts, published in the ACM Journal on Computing and Cultural Heritage in 2023, identifies a related problem from a different direction. As AI tools proliferate in archival practice, the profession lacks adequate frameworks for documenting how those tools affect records. Traditional metadata schemas were designed to document human agency. They assume that the actors who create, describe, and manage records are people who can explain their decisions. AI systems are often opaque even to their designers. They produce outputs from processes that cannot be reconstructed or explained in the ways archivists and accountability frameworks require.
Paradata, as the authors define it, drawing on InterPARES Trust AI, is “information about the procedure(s) and tools used to create and process information resources, along with information about the persons carrying out those procedures.” It documents not what a record is but what has been done to it, by whom, with what tools, and under what constraints. In this sense, it is closer to context than traditional metadata. But even paradata has limits, which Clarke’s epidata framework helps expose.
The UVA protocol exists because metadata does not survive computational transformation. When an archival collection enters an AI training pipeline, the metadata describing its items, collections, and restrictions does not reach the model. It stays behind. The weights that emerge encode patterns extracted from the content while discarding the contextual infrastructure that gives that content its meaning and governs its use. This is the architectural reality the protocol is trying to counter.
Provenance and the Illusion of Sufficiency
The UVA protocol’s core rule is blunt: “Irreversible models do not get access unless item level provenance and meaningful attribution can be demonstrated in practice, and the archival organization retains contractually enforceable control to stop further use.”
This is a provenance requirement stated as a gating condition. The protocol elevates provenance from a descriptive practice to a prerequisite for access. If an AI system cannot demonstrate that it can trace its outputs back to specific archival items, it cannot use those items. This is an important and principled stance.
But Clarke’s framework helps reveal the limitation. Provenance, even when robust, is documentation that humans consult. It is a record of origin, chain of custody, and transformation history. It can be queried, audited, and referenced. What it cannot do, on its own, is act. It does not enforce constraints on how a record is used. It does not automatically travel with the record as it moves across systems. It does not embed itself into computational processes in ways that prevent misuse.
Clarke frames this distinction with clarity. Traditional provenance metadata is passive infrastructure. It provides a trail that investigators can follow after the fact. It does not prevent the stripping of context in the first place. An AI training pipeline that ingests archival materials alongside their provenance metadata does not necessarily preserve that provenance in any meaningful way. The provenance remains in the source systems while the model weights encode the content. The connection is broken at the moment of transformation.
The paradata framework offers a partial solution. Patricia Franks’ slides on paradata as AI processual documentation distinguish between what XAI (explainable AI) does and what paradata requires. XAI asks why a given tool produced a given output from a given set of inputs. Paradata asks why, how, and to what effect a given tool was used in a particular context. Paradata attempts to document the AI process, not just the AI output. Applied to archival materials, paradata would include records of which items were used for training, which model parameters were applied, which transformations occurred, and who authorized each step.
This is what the UVA protocol’s Appendix B minimum provenance standard gestures toward. The appendix requires citation at the source item level, specific fields including collection ID, item ID, holding organization, and access link, and logging of interactions, retrieved items, and citations shown to users. These are paradata requirements. They demand documentation of the AI process, not just the record.
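To make the shape of that requirement concrete, an item-level citation can be pictured as a small structured record. The four required fields below follow Appendix B’s list (collection ID, item ID, holding organization, access link); everything else, including all names and values, is a hypothetical sketch rather than the protocol’s actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ItemCitation:
    """One source-item citation carrying the Appendix B minimum fields.

    The class and method names are illustrative; only the four fields
    correspond to the protocol's stated requirements.
    """
    collection_id: str
    item_id: str
    holding_organization: str
    access_link: str

    def format(self) -> str:
        # Render a human-readable citation suitable for display to users.
        return (f"{self.holding_organization}, collection {self.collection_id}, "
                f"item {self.item_id} <{self.access_link}>")

# Example: a hypothetical item from a hypothetical collection.
citation = ItemCitation(
    collection_id="MSS-0042",
    item_id="MSS-0042-0317",
    holding_organization="Example University Special Collections",
    access_link="https://archives.example.edu/items/MSS-0042-0317",
)
print(citation.format())
```

The point of the frozen dataclass is that the citation is an immutable unit: the item identifier never travels without its holding organization and access link.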
Yet even this falls short of what Clarke’s epidata framework envisions. Paradata documents what happened. It does not specify what should have happened. It creates a record that can be audited after the fact. It does not embed constraints that travel with the record and enforce appropriate use in real time. As Clarke argues, provenance in the epidata sense includes “the who, what, where, when, and why of the information’s creation and transformation” as a core component of the information asset itself, not as a separate document consulted after the fact. Epidata makes provenance intrinsic and portable. Current provenance practice, including the UVA protocol’s requirements, treats provenance as essential but leaves it external.
Paradata: The System Tries to Explain Itself
The UVA protocol’s logging requirements represent the profession’s best current attempt to create paradata for AI systems using archival materials. Appendix B specifies that systems should capture the date and time, the system name and version, the user category, the query text, the retrieved items with their identifiers, the citations shown to users, and the output text where retention is permitted. These requirements apply to any tool that produces user-facing outputs and any internal workflow that generates descriptions, transcripts, metadata, or research summaries.
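A minimal sketch of what one such interaction record might look like, assuming a JSON-style log. The field list mirrors the requirements above; the function, system names, and values are invented for illustration and do not come from the protocol.

```python
import json
from datetime import datetime, timezone

def make_log_entry(system, version, user_category, query,
                   retrieved_ids, citations, output_text=None):
    """Build one interaction record covering the Appendix B logging fields.

    output_text stays None where retention of outputs is not permitted.
    """
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "system": system,
        "version": version,
        "user_category": user_category,
        "query": query,
        "retrieved_items": retrieved_ids,   # item identifiers, not content
        "citations_shown": citations,
        "output_text": output_text,
    }

entry = make_log_entry(
    system="reading-room-search",          # hypothetical internal tool
    version="0.3.1",
    user_category="external-researcher",
    query="letters about the 1918 influenza epidemic",
    retrieved_ids=["MSS-0042-0317", "MSS-0042-0318"],
    citations=["MSS-0042-0317"],
)
print(json.dumps(entry, indent=2))
```

Because each entry names the retrieved items by identifier, the log itself becomes paradata: a reconstructable trace from user query to archival source.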
This is paradata in practice. It captures the process, the transformation, and the interaction among the user, the system, and the archival source. Cameron and colleagues argue that paradata provides the framework archivists need to articulate accountability for AI processes, both for their own professional obligations and for the researchers who encounter records that AI has processed.
The Franks slides elaborate on what paradata must capture in an AI context. Technical paradata includes the AI model tested and selected, evaluation and performance metrics, generated logs, the model training dataset, training parameters, vendor documentation, and versioning information. Organizational paradata includes AI policy, design plans, employee training, ethical considerations, impact assessments, implementation processes, and regulatory requirements. Together, these two layers of paradata attempt to document not just the algorithm but the full human and institutional context of its deployment.
Clarke aligns with this direction by describing the shift from passive to active metadata. Gartner, he notes, formally recognized this need by abandoning its Magic Quadrant for Metadata Management Solutions and shifting focus to active metadata as operational infrastructure that drives automation. The logging requirements in the UVA protocol are a step toward active metadata. They create a record that the system produces about itself, a kind of self-documentation that makes the process legible.
But paradata answers the question of what happened. It does not answer the question of what should have happened. This is the gap that the UVA protocol’s “Right to Stop” provision attempts to address through governance rather than through technical design.
The Break: Context Cannot Be Reconstructed After the Fact
Clarke’s most important epistemological claim is that all data is theory-laden and interest-based. Metadata is not a neutral description of reality. It reflects the worldview of its creator, the purposes of the system that generated it, and the interests that governed its creation. A “customer” in a sales CRM is fundamentally different from a “customer” in a support system, even though both records might use identical field names. When AI systems treat data as neutral and context-free, they reproduce the biases and distortions built into the original descriptions while discarding the contextual signals that would allow those biases to be identified and corrected.
The UVA protocol implicitly tries to prevent this core failure mode. Archival materials carry deeply embedded interests. They reflect the collection decisions, descriptive practices, access policies, and social contexts of the institutions and communities that created them. A letter from a family collection carries the provenance of donation, the archivists’ processing decisions, the finding aid’s descriptive choices, and the donor agreement’s constraints. Model training transfers none of this context into the weights. The model learns patterns from the content while discarding the framework that gives that content meaning and governs its appropriate use.
The Cameron et al. paper identifies this as a fundamental challenge for archival accountability. AI introduces a “non-human agency” into archival processes, pushing the limits of existing practice and demanding new concepts and vocabularies. The black box problem compounds this challenge. Sophisticated AI systems produce outputs whose reasoning cannot be reconstructed, directly undermining the archival principles of transparency, accountability, and impartiality. When an AI system trained on archival materials produces an output, neither the archivist nor the researcher can reliably trace that output back to specific records or determine which aspects of the training data shaped the response.
The paradata framework addresses this problem by establishing documentation requirements that capture the process before it becomes opaque. However, paradata shares the same limitation as provenance. It produces documentation that people can consult, not constraints that travel with the data. Systems strip context before computation, not after. By the time a researcher encounters an AI output and attempts to trace its provenance, the context that would make that tracing meaningful has already disappeared.
Clarke argues that this failure cannot be repaired after the fact. In his epistemological framework, understanding is more valuable than knowledge because it integrates multiple pieces of knowledge. Current systems manage isolated facts. They do not manage the holistic, coordinated web of understanding that makes information meaningful and governable. Managing understanding requires a different kind of infrastructure, one that captures not only what information is, but also why it exists, what perspective it represents, and under what constraints it may be used.
Epidata: Context as a Governing Layer
Clarke defines epidata as “a dynamic, active, and holistic abstraction layer that sits above an organization’s fragmented data silos and platforms.” The term derives from the Greek “epi,” meaning “upon” or “above.” Epidata is not more metadata. It is a higher-level abstraction model designed to manage information as a holistic, unified, and strategic asset, portable across systems and across contexts.
The epidata framework comprises three integrated components.
The first is the semantic blueprint, consisting of enterprise entity modeling, ontologies, and metamodeling. This component defines, in Clarke’s phrase, “the business on a page.” It identifies the major areas of interest that define the enterprise, the formal descriptions of those concepts and their relationships, and the structural rules that make the model portable and scalable across different systems and use cases. For archival contexts, this would mean a formal representation of what records are, how collections relate to one another, what provenance means for different material types, and how access constraints apply across contexts.
The second component is the relational engine, implemented through knowledge graphs. A knowledge graph connects structured and unstructured data sources and enriches them with contextual meaning. It moves beyond the two-dimensional world of rows and columns and models data the way people think: as a dynamic, multi-dimensional network of relationships. Clarke argues that knowledge graphs provide the best available tool for breaking down data silos because they are structurally designed to create a unified view by connecting disparate data sources.
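As a toy illustration of that relational idea (not of any particular product), a knowledge graph can be reduced to subject-predicate-object triples, with context reachable by traversing relationships rather than stored as a flat field. All identifiers here are invented.

```python
# A toy triple store: each fact is (subject, predicate, object).
triples = [
    ("item:MSS-0042-0317", "partOf", "collection:MSS-0042"),
    ("collection:MSS-0042", "heldBy", "org:ExampleUniversity"),
    ("collection:MSS-0042", "governedBy", "agreement:Donor-1998-12"),
    ("agreement:Donor-1998-12", "restricts", "use:commercial-training"),
]

def objects(subject, predicate):
    """All objects linked from `subject` via `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def constraints_on(item):
    """Walk item -> collection -> agreement -> restriction.

    The traversal, not the schema, is the point: the item's governing
    constraints are discovered through its web of relationships.
    """
    found = []
    for coll in objects(item, "partOf"):
        for agreement in objects(coll, "governedBy"):
            found.extend(objects(agreement, "restricts"))
    return found

print(constraints_on("item:MSS-0042-0317"))  # → ['use:commercial-training']
```

A real implementation would use an RDF store or graph database, but the governance-relevant property is visible even at this scale: the restriction is attached to the donor agreement, yet remains discoverable from the item.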
The third component is the contextual wrapper, which combines epistemological encoding, provenance, and multi-entity models. Clarke identifies this as “the most critical and innovative layer of the Epidata model.” It captures the “why” of data. It formally encodes the interest-based nature of descriptions and allows the same entity to carry multiple valid classifications depending on context. It embeds universal provenance as a core component of the asset itself rather than as a separate document. It enables cross-context usability by capturing the intent and context of an asset’s creation and allowing that asset to be appropriately recontextualized for new use cases.
For archival purposes, the three components of epidata map onto familiar concerns with new technical precision. The semantic blueprint corresponds to controlled vocabularies, ontologies, and the conceptual modeling of archival entities. The relational engine corresponds to the linked data and knowledge graph work already underway in some archival institutions. The contextual wrapper corresponds to the full complex of provenance, restriction metadata, donor agreements, and ethical constraints that govern archival access, now conceived not as separate documents but as intrinsic attributes of the information asset.
The critical insight is portability. Clarke’s epidata framework insists that context must travel with the record, not remain in the source system while the content travels alone. “Compliance travels with the data,” he writes. “It becomes an intrinsic, universal attribute, independent of any single platform, enabling true governance, auditability, and trust.” This is what current metadata practice cannot achieve, and it is what the UVA protocol requires, though the technical infrastructure to realize it is not yet in place.
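One way to picture “compliance travels with the data” is a wrapper that refuses to release content without honoring its context. This is a deliberately naive sketch with invented names; in a real epidata architecture the enforcement would live in the infrastructure, not in application code.

```python
class WrappedRecord:
    """Content bundled with its context; all access goes through the wrapper.

    Illustrative only: the shape to notice is that provenance and
    permitted uses are intrinsic attributes, not a separate document.
    """
    def __init__(self, content, provenance, permitted_uses):
        self._content = content
        self.provenance = provenance            # travels with the record
        self.permitted_uses = set(permitted_uses)

    def content_for(self, use):
        if use not in self.permitted_uses:
            raise PermissionError(f"use '{use}' is not permitted for this record")
        # Content is only ever released together with its provenance.
        return self._content, self.provenance

record = WrappedRecord(
    content="[digitized letter text]",
    provenance={"collection": "MSS-0042", "donor_agreement": "Donor-1998-12"},
    permitted_uses={"research", "retrieval"},
)

text, prov = record.content_for("retrieval")     # allowed
try:
    record.content_for("model-training")         # blocked by the wrapper
except PermissionError as e:
    print(e)
```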
Reinterpreting the UVA Protocol as an Epidata System
Reading the UVA protocol through the lens of Clarke’s epidata framework reveals it as an attempt to implement epidata principles via policy rather than system design. Each major component of the protocol corresponds to what an epidata-native system would do automatically and technically.
The core rule, “No access without control,” encodes non-negotiable constraints on transformation. In epidata terms, this is an epistemic layer requirement. Before any computational transformation occurs, the system must demonstrate that it can preserve and enforce the contextual constraints attached to the source materials. The protocol cannot implement this technically, so it implements it contractually. The contract serves as a substitute for the technical architecture that would make the constraint automatic.
Pillar 1 establishes provenance and attribution requirements to preserve traceability. The protocol requires mechanisms to record which items and collections are used for training, evaluation, or retrieval, and systems to link AI outputs back to the source items. These requirements create paradata. They establish a documented chain connecting AI outputs to archival sources. However, this chain depends on the AI partner maintaining those records and on the archival organization verifying them. It remains a human-maintained record rather than a system-enforced constraint.
Pillar 3 introduces the “Right to Stop” to assert future authority over use. The archival organization claims the right to order the cessation of material ingestion and to demand the decommissioning or destruction of a model if permission is withdrawn. This provision attempts to preserve the temporal dimension of archival authority across the irreversible transformation of training. In epidata terms, it attempts to make governance constraints persist across system states, including future states that current information cannot anticipate. The protocol enforces this through a contract. An epidata system would enforce it through technical mechanisms.
The institutional benefit and mission fit requirements encode institutional intent. The protocol requires that AI uses of archives produce clear benefit to the archival organization, researchers, and the broader public, and prohibits uses that mainly benefit external commercial products. This is an epistemological constraint, a requirement that the context of use align with the context of creation and custody. In epidata terms, it is a cross-context usability requirement: the recontextualization of archival materials in AI systems must be legitimate relative to the purposes that govern those materials’ custody.
The UVA protocol, then, is epidata expressed as policy instead of system design. It identifies, with considerable sophistication, the contextual requirements that the use of AI on archival materials must satisfy. It then attempts to enforce those requirements through contract, governance, and oversight, rather than through technical infrastructure that would automate enforcement.
Retrieval vs. Training: The Epidata Survival Test
The UVA protocol’s most fundamental distinction is between retrieval-based AI systems and training-based AI systems. The protocol supports retrieval-based services that keep source materials under the archival organization’s control and tightly scoped internal models that the organization can shut down or replace. It blocks the training or fine-tuning of broad commercial, general-purpose models using archival materials and systems that absorb knowledge into model weights.
This distinction maps directly onto what Clarke’s framework would recognize as the epidata survival test: does the epidata of the archival source survive the computational transformation?
In retrieval-based systems, the source materials remain in institutional custody. At query time, the system searches the collection, retrieves relevant items, and returns outputs that cite the originating material. The epidata of each source item (its provenance, restrictions, contextual relationships, and institutional governance) remains attached to the source. The retrieval system operates on top of this infrastructure without disrupting it. When the user receives a response citing a specific item from a specific collection, the full context of that item remains accessible. The epidata survives.
Training-based systems transform data irreversibly, collapsing records, metadata, and paradata into model weights. The model retains patterns from archival content but loses the contextual infrastructure that governed it. It does not know which specific items shaped its responses. It cannot cite provenance at the item level. It cannot enforce access restrictions. Training destroys the epidata at the moment it occurs.
Clarke’s framework helps explain why this distinction is so significant. Training systems collapse meaning into statistical patterns while discarding the semantic, relational, and epistemic layers that constitute the epidata of the source materials. What survives in the weights is an approximation of content patterns divorced from the contextual framework that gave those patterns their significance. This is not just a provenance problem. It is a category error: the system treats archival materials as if they were data objects rather than context-bearing records with their own governance infrastructure.
The protocol’s preferred alternative, retrieval-augmented AI, preserves the essential architecture of archival governance. The system augments AI-generated responses with retrieved archival evidence, cited at the item level and linked back to source materials. The epidata remains in the archival system where it belongs. The AI system borrows context rather than absorbing and destroying it.
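The retrieval pattern described above can be sketched in a few lines: the sources stay in custody, and every response carries item-level citations. The corpus, the keyword matching, and all names are invented; a real system would use proper search infrastructure, but the governance shape is the same.

```python
# A tiny in-custody corpus: item id -> (text, citation fields).
CORPUS = {
    "MSS-0042-0317": ("Letter describing the influenza quarantine of 1918.",
                      {"collection": "MSS-0042", "org": "Example University"}),
    "MSS-0042-0318": ("Diary entry on wartime rationing.",
                      {"collection": "MSS-0042", "org": "Example University"}),
}

def retrieve(query):
    """Naive keyword overlap, standing in for real search."""
    words = set(query.lower().split())
    return [item_id for item_id, (text, _) in CORPUS.items()
            if words & set(text.lower().split())]

def answer(query):
    """Return evidence plus item-level citations; nothing is absorbed.

    The source items stay in CORPUS (institutional custody); the response
    only borrows them, and every borrowed item is cited by identifier.
    """
    hits = retrieve(query)
    evidence = [CORPUS[h][0] for h in hits]
    citations = [{"item_id": h, **CORPUS[h][1]} for h in hits]
    return {"evidence": evidence, "citations": citations}

result = answer("influenza quarantine")
print(result["citations"])
```

Removing an item from CORPUS removes it from every future answer, which is exactly the reversibility that training-based systems cannot offer.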
The “Right to Stop” as Epidata Enforcement
The UVA protocol’s “Right to Stop” provision is the mechanism through which the protocol attempts to extend governance authority across the irreversibility of AI training. When an AI partner trains a model on archival materials, the protocol requires contractually enforceable control allowing the archival organization to order cessation of materials ingest, stop use of archival content in evaluation pipelines, remove archives from retrieval or indexing services, and, for narrow models trained primarily on defined archival collections, demand decommissioning or destruction of the model.
Clarke’s framework treats governance as something that must be embedded rather than reactive. The epidata model achieves this by making compliance an intrinsic attribute of the information asset: it travels with the data, the system enforces it, and it does not depend on periodic audits or human intervention. The “Right to Stop” does the opposite. It remains reactive, relies on human maintenance, and depends on contractual rather than technical enforcement. It intervenes after the fact instead of preventing problematic transformations.
Yet the right to stop represents the most realistic implementation available to archival institutions today. No technical infrastructure currently exists to propagate archival governance constraints through AI training pipelines as Clarke’s epidata framework envisions. Until such infrastructure exists, contractual rights are the only mechanism available. The right to stop acknowledges the irreversibility of training while asserting authority over the future states of systems that depend on archival materials.
In epidata terms, the right to stop is epidata asserting control over future states through policy rather than through system design. It is an attempt to extend the temporal dimension of archival authority, the principle that archival governance persists across time and across transfers of custody, into the AI environment. The protocol cannot make this technically enforceable, so it makes it legally enforceable. This is a stopgap, not a solution.
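For contrast, here is a sketch of what system-level enforcement of such a right might look like: a live permission registry consulted on every use, so that withdrawal takes effect immediately rather than through contract enforcement. This is a thought experiment under invented names, not a description of any existing mechanism.

```python
class PermissionRegistry:
    """Live registry of which collections are currently authorized for use.

    Withdrawal is a registry update; every downstream use re-checks the
    registry, so revocation does not depend on audits or legal action.
    """
    def __init__(self):
        self._authorized = set()

    def grant(self, collection_id):
        self._authorized.add(collection_id)

    def withdraw(self, collection_id):
        self._authorized.discard(collection_id)

    def check(self, collection_id):
        if collection_id not in self._authorized:
            raise PermissionError(
                f"{collection_id}: permission withdrawn or never granted")

registry = PermissionRegistry()
registry.grant("MSS-0042")
registry.check("MSS-0042")        # passes while permission stands

registry.withdraw("MSS-0042")     # the organization exercises its right to stop
try:
    registry.check("MSS-0042")    # now fails immediately
except PermissionError as e:
    print(e)
```

The contrast with the contractual version is the enforcement point: here the refusal happens inside the system at use time, not in a negotiation after the fact.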
The Continuum Model Finally Becomes Technical
Clarke makes a striking claim about the relationship between epidata and archival theory. Epidata, he argues, operationalizes the Records Continuum Model. The continuum model, developed by Frank Upward in the 1990s, rejects the linear lifecycle conception of records management in favor of a multi-dimensional framework. Records possess archival value from the moment of their creation. They exist simultaneously in multiple contexts. They are “always in a process of becoming,” capable of being reinterpreted and recontextualized as they move through what Upward calls spacetime.
The continuum model’s fourth dimension, “Pluralize,” frames records as part of an all-encompassing framework that constitutes collective social, historical, and cultural memory. Clarke argues that for thirty years, this was a brilliant but largely academic theory, lacking a practical, scalable technical implementation. Epidata fills that gap. Whereas the continuum model theorizes the Pluralize dimension, the epidata knowledge graph provides the unified fabric of knowledge that makes it real. Where the continuum demands recontextualization, epidata’s epistemological layer provides the mechanism for cross-context usability. Where the continuum theorizes multiple simultaneous provenances, epidata’s multi-entity models and universal provenance layer provide the architecture to capture them.
The UVA protocol independently reaches toward continuum thinking. Its emphasis on control across time, its assertion of authority over future states of AI systems, and its insistence that institutional governance persists through computational transformation all reflect continuum principles. The right to stop is a continuum governance mechanism: it asserts that the archival organization’s authority over its materials does not end when those materials enter a computational system. The provenance requirements reflect the continuum’s insistence that records remain intelligible across contexts. The mission fit requirements reflect the continuum’s concern with the social and cultural dimensions of record creation and use.
What the protocol lacks is the technical infrastructure to operationalize these principles across the computational environments in which AI systems operate. The continuum model describes how records should work. Epidata, as Clarke envisions it, provides the technical architecture to make them work that way. The UVA protocol, caught between archival theory and current technical limitations, uses governance to approximate what a mature epidata infrastructure would achieve automatically.
This convergence of continuum theory and AI governance may represent the most significant intellectual development in archival science in a generation. The continuum model was always more technically ambitious than the tools available to implement it could support. The AI era has created both the urgency and the conceptual vocabulary to close that gap.
Why Digital Humanities Was Not Enough
Digital humanities scholarship transformed archival practice in meaningful ways. It made collections computationally accessible, enabled distant reading and pattern analysis across large corpora, and demonstrated the scholarly potential of digitized archival materials. Projects across major research institutions showed that archival collections could support new kinds of questions when made available in machine-readable formats.
But digital humanities, for all its contributions, treated context as compressible. The dominant logic of DH projects was that digitizing and describing materials was sufficient to make them intellectually useful. Metadata schemas were populated, finding aids were digitized, and collections were made discoverable. What received less attention was the architecture of contextual preservation: how the meaning of records, the relationships among them, the constraints governing their use, and the institutional frameworks surrounding their custody would survive and remain operative in computational environments.
Clarke’s critique of the traditional metadata model applies with full force to much digital humanities practice. Making collections computable requires decisions about representation, encoding, and description that are always interest-laden and theory-dependent. Digital humanities projects that made these choices without explicitly documenting their epistemological assumptions produced computational representations of archival materials that appeared neutral but were, in fact, deeply shaped by the perspectives and interests of the project teams.
The AI era has exposed this limitation sharply. When digitized archival collections enter AI training pipelines, the interpretive infrastructure that digital humanities projects built to make those collections meaningful (the ontologies, the entity extractions, the relationship modeling) does not necessarily survive. What enters the training data is often the raw text or image content, stripped of the contextual apparatus that gave it meaning.
The paradata framework offers a partial remedy. Cameron and colleagues argue that documenting AI processes in archival contexts requires capturing the full scope of application and context of use, not just the algorithm itself. Digital humanities projects that document their methodological choices as paradata create a record that allows future researchers to understand and critically assess the computational representations they encounter. But even robust paradata does not solve the fundamental problem Clarke identifies: context must be intrinsic to the information asset, not stored in a separate documentation layer.
Digital humanities enabled datafication without context preservation. It made collections computationally available while leaving their governing contextual infrastructure vulnerable to being stripped away by computational processes that treat content as data rather than as context-bearing records. Epidata represents the next step: making context travel with the data rather than remaining behind in the systems that created it.
Why Computational Archival Science Must Evolve
Computational archival science (CAS) has developed significant capacity for addressing the technical challenges of digital archives. Its emphasis on provenance documentation, paradata creation, and systems awareness positions it better than most sub-fields to grapple with AI governance. The Cameron et al. paper is itself an example of CAS working at the edge of the problem, developing conceptual frameworks for AI processual documentation in archival contexts.
Clarke’s epidata framework exposes a gap in current computational archival science thinking. The field has developed provenance documentation, paradata creation, and systems awareness, but these elements alone are insufficient. The profession still lacks a holistic abstraction layer capable of managing meaning, intent, and contextual constraints across systems. Practitioners can document individual data elements and log individual processes. However, they cannot yet make those elements travel together as a unified contextual package that persists through computational transformation.
The Cameron et al. paper identifies the core challenge: AI introduces “a less-than-transparent and frequently uninterrogable actor into administrative processes formerly enacted by human agents.” Current archival documentation frameworks assume human agency. They expect decision-makers to explain their actions, accept accountability, and be identifiable in the record. AI systems undermine all three assumptions at once. They make decisions through processes that cannot be reconstructed, accept no accountability, and leave no individual decision-maker to identify.
Paradata addresses this gap by documenting the human choices surrounding AI deployment, even when the AI’s own decisions cannot be documented. This work is valuable and necessary. But, as Clarke might argue, it still operates at the level of knowledge rather than understanding. It creates a record of what happened without building the infrastructure needed to prevent problematic outcomes in the first place.
CAS must move toward epidata as a primary design principle. This means shifting from documentation to design: building contextual requirements into AI systems from the beginning rather than attempting to reconstruct them after the fact. It means developing technical standards for what archival institutions require of AI systems that use their materials, including requirements for contextual preservation that go beyond citation and attribution. It means treating the epidata of archival materials (their meaning, intent, relationships, and constraints) as design requirements for AI systems rather than as documentation problems.
The profession has a theoretical foundation. The Records Continuum Model provides the intellectual framework. Paradata scholarship provides the documentation vocabulary. Clarke’s epidata framework provides the technical architecture. What remains is the hard work of translating these into standards, systems, and practices.
The Hidden Failure of the UVA Protocol
The UVA protocol is, by archival governance standards, a sophisticated and well-reasoned document. It correctly identifies the problem: current AI model training is effectively irreversible, and archival institutions that allow their materials to be ingested into training pipelines without adequate controls permanently surrender meaningful authority over those materials. It correctly identifies the solution in principle: archival institutions must retain real, enforceable control over AI use of their collections, not merely symbolic promises.
The protocol recognizes context loss. Its core rule exists precisely because context, including provenance, access restrictions, and institutional authority, does not survive AI training. It recognizes irreversibility. The principle of reversibility, borrowed from conservation practice, drives the protocol’s preference for retrieval-based systems over training-based systems. It recognizes the lack of technical control and attempts to compensate through contractual control.
But the protocol cannot fully solve the problem it correctly diagnoses. This is not a criticism of the protocol’s ambition or its logic. It reflects a fundamental limitation of policy-based governance in a domain where technical architecture determines what is actually possible.
Clarke envisions epidata, but AI training contexts do not yet support it in machine-readable form. No current AI training infrastructure accepts epidata as a governing constraint on model behavior. No mechanism carries contextual requirements through a training pipeline in ways that the resulting model respects. The “Right to Stop” remains contractual, not systemic. Compliance depends on the AI partner’s willingness and capacity to honor contractual commitments, verified through audit mechanisms that remain limited in scope and certainty.
The protocol simulates epidata without implementing it. It creates the governance logic that an epidata system would enforce technically, then attempts to enforce that logic through human oversight, legal agreement, and institutional authority. This is enormously better than nothing. It is the most realistic response available to archival institutions today. But it is a workaround for a technical problem that requires a technical solution.
This is not a counsel of despair. It is a precise diagnosis. Understanding what the protocol cannot achieve clarifies what the profession must build next.
What an Epidata-Native Archival System Would Look Like
Clarke provides enough technical detail in his epidata framework to sketch what a genuinely epidata-native archival AI system would require.
It would incorporate a unified context layer built from ontologies and entity models that formally represent archival concepts: records, collections, provenance chains, access constraints, donor agreements, and community protocols. These representations would be shared across systems and expressed in machine-readable formats that AI systems could process and respect.
It would integrate knowledge graphs to preserve relationships across systems. It would model relationships among records, between records and their creators, between collections and the communities they document, and between items and the restrictions governing their use as graph relationships that AI systems can query and navigate. Processing a record through an AI system would trigger queries against the knowledge graph to retrieve and apply relevant constraints.
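One way to picture this is as a graph walk: when a system touches a record, it queries the graph for constraints attached to the record directly or inherited from the collections that contain it. The node names, edge labels, and traversal below are hypothetical illustrations of the pattern, not a proposed standard.

```python
# A toy knowledge graph linking records to creators, collections, and the
# constraints that govern them. All identifiers are invented for illustration.
from collections import defaultdict

graph = defaultdict(list)  # node -> list of (relation, target) edges

def add_edge(subj, rel, obj):
    graph[subj].append((rel, obj))

add_edge("item:letter-7", "part_of", "collection:family-papers")
add_edge("item:letter-7", "created_by", "agent:j-doe")
add_edge("collection:family-papers", "governed_by", "constraint:no-ai-training")

def constraints_for(node):
    """Walk the graph upward, collecting constraints that apply to a record
    directly or through the collections that contain it."""
    found, stack, seen = [], [node], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for rel, target in graph[current]:
            if rel == "governed_by":
                found.append(target)
            elif rel == "part_of":
                stack.append(target)
    return found

print(constraints_for("item:letter-7"))  # ['constraint:no-ai-training']
```

The point of the traversal is that the item inherits the collection-level restriction without anyone having copied it onto the item; the relationship itself carries the governance.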
It would implement epistemic encoding: formal representation of the intent, perspective, and use constraints attached to archival materials. A record donated by a community with specific cultural protocols would carry those protocols as intrinsic attributes. An item under deed-of-gift restrictions would carry those restrictions in a machine-readable format that AI systems could process. Access constraints would not be external policies applied by human decision-makers after the fact, but intrinsic attributes of the information asset that systems could query and enforce.
It would achieve universal provenance: contextual tracking that travels with the record regardless of platform. Provenance would not be recorded in the source system while the content travels to a destination system. It would be an intrinsic attribute of the information asset, updated as the asset moves through different systems and processes, and available for query by any system that processes the asset.
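A minimal sketch of this "provenance travels with the asset" idea, assuming a simple event-append model: each system that touches the record appends an event to the asset itself rather than logging it locally. The actor and action names are illustrative only.

```python
# Provenance as an intrinsic, accumulating attribute of the asset: any system
# that processes the asset returns an updated copy with its event appended.
from datetime import datetime, timezone

def record_with_event(asset: dict, actor: str, action: str) -> dict:
    """Return a copy of the asset with a new provenance event appended,
    so the history remains part of the asset wherever it travels."""
    event = {
        "actor": actor,
        "action": action,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    return {**asset, "provenance": asset.get("provenance", []) + [event]}

asset = {"id": "uva:item-42", "content": "(scan)", "provenance": []}
asset = record_with_event(asset, "digitization-lab", "scanned")
asset = record_with_event(asset, "rag-service", "retrieved_for_query")

print([e["action"] for e in asset["provenance"]])
```

Any downstream system can then query the asset's own history instead of reconstructing it from scattered platform logs, which is the "universal provenance" property described above.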
The result would be records that function as context-bearing entities rather than data objects. AI systems processing such records would encounter not just content but a full contextual package specifying what the record is, how it connects to other records, why it exists, what perspective it represents, and under what constraints it may legitimately be used. The technical architecture itself would enforce the contextual requirements that the UVA protocol currently enforces through contract.
This is not a description of a system that exists today. It is a description of what the profession needs to build.
Implications
For archival institutions, the epidata framework implies a significant shift in how they understand their mission. The traditional custodial mandate, preserving and providing access to records in institutional care, is necessary but not sufficient in an AI environment. Archives must become context engineers: institutions that not only hold records but actively design and maintain the contextual infrastructure that makes those records meaningful and governable as they move across AI systems.
This means investing in ontologies and knowledge graph infrastructure. It means developing machine-readable representations of access constraints, donor agreements, and community protocols. It means treating the epidata of archival collections as a primary professional responsibility, not a secondary enhancement. It means training archivists who can work at the intersection of traditional archival practice and the technical architectures of AI systems.
For AI systems deployed in archival contexts, the epidata framework requires context-aware computation. AI systems that use archival materials must receive and process contextual constraints in machine-readable form. They must support retrieval architectures that preserve source attribution and apply governing constraints at query time. They must produce outputs that trace provenance to specific source items. The minimum provenance standard in UVA Appendix B provides a starting point for these technical requirements.
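In code, query-time constraint enforcement with source attribution might look like this toy retrieval function. The corpus, the boolean restriction flag, and the keyword matching are deliberate simplifications standing in for a real constraint model and search index.

```python
# A sketch of context-aware retrieval: governing constraints are applied at
# query time, and every returned passage carries its source identifier.
# The corpus and matching logic are illustrative assumptions, not an API.

CORPUS = [
    {"id": "item:diary-3", "text": "harvest records from 1874", "restricted": False},
    {"id": "item:oral-9", "text": "community oral history", "restricted": True},
]

def retrieve(query: str):
    """Return matching passages with source attribution, excluding items
    whose governing constraints forbid this use."""
    results = []
    for item in CORPUS:
        if item["restricted"]:
            continue  # constraint enforced at query time, not after the fact
        if any(word in item["text"] for word in query.lower().split()):
            results.append({"source": item["id"], "passage": item["text"]})
    return results

print(retrieve("harvest 1874"))
```

Because the filter runs inside the retrieval step, a restricted item can never reach the model's context window, and every passage that does reach it remains attributable to a specific item, in the spirit of the minimum provenance standard the protocol describes.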
For professional standards bodies, the epidata framework creates an opportunity that the profession has not previously had: a conceptually coherent, technically grounded framework for archival governance in AI environments that aligns with both archival theory and emerging technical infrastructure. The work of defining epidata standards for archives, including formal ontologies for archival concepts, specifications for machine-readable contextual constraints, and requirements for AI systems that use archival materials, falls squarely within the profession’s standard-setting mandate. The UVA protocol provides a policy foundation. Paradata scholarship provides a documentation framework. Clarke’s epidata concept provides the technical architecture. The synthesis of these three into professional standards would represent the profession’s most significant technical achievement in the AI era.
The Archive as Context Engine
Return to the scene at the beginning. The archivists who declined the AI vendor’s proposal were not being obstructionist. They were correctly recognizing a structural problem that they lacked the vocabulary to name but understood intuitively. They knew that archival materials derive their value not from raw content but from the contextual infrastructure that surrounds them: provenance, restrictions, community relationships, institutional authority, and the temporal continuity that keeps those materials accountable to the purposes for which they were created.
They also knew, though perhaps less explicitly, that this contextual infrastructure does not survive training. The proposal would have given the vendor access to content while stripping away the context. The archivists held the line.
The UVA protocol holds the line in the same way, with more formal language and more systematic design. It establishes governance requirements that archival institutions can deploy in negotiations with AI partners. It articulates, with considerable precision, what archival institutions need to retain: item-level provenance, meaningful attribution, and contractually enforceable control. It provides sample clauses, a decision framework, and implementation guidance that archival institutions of any size can adapt.
The protocol is not the final answer. Clarke’s epidata framework reveals why: the final answer requires technical architecture that does not yet exist. Metadata describes. Paradata explains. Epidata governs. The UVA protocol tries to hold the line with provenance and contractual control. But the real battle is upstream. If context does not travel with the record through technical design rather than merely contractual aspiration, then nothing else the profession preserves will matter.
The archive’s new mandate, understood through the lens of epidata, is to become a context engine: an institution that maintains not just the custody of records but the integrity of the contextual infrastructure that makes those records meaningful, governable, and trustworthy across whatever computational environments the future produces. The UVA protocol is an early marker on that path. The profession must now build the road.