Pre-Deployment Verification for Enterprise Agents — The FAOS Trust Certificate

Executive summary

Enterprise AI agents enter production the way senior contractors used to enter buildings — interview, references, hire, hope. The interview is a capability benchmark. The references are vendor demos. The hire is a deployment ticket. The hope is that monitoring, guardrails, and human escalation will catch what slipped past the gate.

This is not how safety-critical software has been deployed for forty years. Avionics ships under DO-178C. Medical devices ship under IEC 62304. Automotive control systems ship under ISO 26262. Each defines a structured pre-deployment evidence package binding a specific version of a system to a body of verification work. Enterprise AI has no analogue. The risk profile suggests it should.

The FAOS platform addresses this gap with three primitives. First, an operational envelope — a five-element specification of what an agent is certified to do, where it may reason, what invariants it must hold, what regulations bound its decisions, and at what level of autonomy. Second, ontology-to-scenario generation — automated derivation of regulatory, operational, and adversarial test scenarios from the industry ontology, eliminating the manual test-authoring bottleneck. Third, a trust certificate — a machine-verifiable attestation binding a specific agent version, ontology version, and model version to its verification evidence, with graduated verdicts that drive a deployment gate.

The headline findings from RA-6, our controlled pilot across four regulated industries and three model families:

Ontology-grounded scenario generation achieves 48.3% coverage against a 125-item primary-source regulatory checklist, versus 33.1% for the persona-scenario baseline that dominates current commercial practice (post-hoc $p_c$ = .0006).
Ontology-grounded scenarios score 4.77 out of 5.0 on industry specificity, the highest by a wide margin against three alternative generation strategies ($p$ = 2 × 10⁻⁶).
The pattern replicates across Claude Sonnet 4, Qwen 2.5 72B, and Gemma 4 26B, over 5,400 scenarios in total. The advantage is a property of the methodology, not of one model.

The implication for a CIO running an enterprise agent programme: pre-deployment verification, anchored in your own domain ontology, is a categorically different capability from benchmark scores plus production monitoring. It is what an audit committee, a regulator, or an internal risk function can actually act on.

The problem — why benchmark scores do not predict production-readiness

The narrative around enterprise AI has caught up with one point: leading large language models are not safe enough out of the box to autonomously execute regulated work. The narrative has not caught up on a harder point: the standard remedies — better prompts, retrieval, post-deployment monitoring, human-in-the-loop gates — close part of the gap but leave structural problems intact.

Three failure modes recur in regulated deployments.

Failure mode one — benchmark blindness. Public benchmarks evaluate model capability in the abstract: can it score on MMLU, reason about a math problem, write working code. Recent agent-specific benchmarks have begun to measure something closer to deployment safety, but the results are not reassuring. One recent study found that no tested LLM agent achieved a safety score above 60% across 349 interaction environments. Another demonstrated that leading models are surprisingly compliant with malicious agent requests across 110 harmful tasks. A model that scores in the 90th percentile on MMLU may sit at the 40th percentile on whether it correctly applies a BSA/AML threshold to a specific cash deposit. Benchmark scores do not predict the regulated-domain behaviour a CIO needs to attest to.

Failure mode two — guardrail probability. Prompt-level safety instructions, content moderators, and tool-call filters are probabilistic. They reduce the rate of unsafe outputs but cannot eliminate them, and they degrade under adversarial pressure, prompt injection, or unusual edge cases. For a regulated workflow where one bad outcome triggers a regulator inquiry, probabilistic reduction is not the same product as deterministic bounds.

Failure mode three — the compliance-evidence gap. When the internal audit committee, the FDIC examiner, or the head of risk asks "what evidence supports your decision to deploy this agent in this workflow", the typical answer is a deck, a benchmark printout, a paragraph in a SOC 2 attestation, and a description of the monitoring stack. None of this is a structured pre-deployment evidence package of the kind that DO-178C, IEC 62304, or ISO 26262 provide for safety-critical software in other industries. The gap is not the evidence — teams have evidence. The gap is the structure, the schema, and the machine-verifiability that lets a third party act on it.

What current approaches miss is not better testing in general. It is the absence of a domain-aware specification of what to test for, paired with a machine-verifiable attestation of what was tested. Pen-testing the system prompt is not the same as proving that a 9,500-dollar cash deposit scenario is in the test suite. A 99% pass rate on a generic safety battery is not the same as a 95% coverage rate against the 31 CFR §1010–1020 regulatory checklist. The structure is missing, and where the structure is missing, the regulated-industry risk concentrates.

The FAOS approach — three primitives for a verification-first paradigm

The FAOS platform addresses these failure modes with three connected primitives. Each is named so a CIO and the CIO's principal engineer can refer to the same thing.

Primitive one — the operational envelope

An operational envelope is the formally defined space within which an agent is certified to operate. RA-6 defines it as a five-element tuple — permission boundary, domain scope, safety properties, governance constraints, and autonomy level. In applied terms:

What the agent is permitted to do. The discrete set of authorised actions. A BSA/AML compliance analyst agent can flag suspicious transactions, file CTRs, file SARs, and escalate. It cannot approve loans or execute trades, regardless of what its prompt suggests.
Where the agent may reason. The ontological domains in scope. A banking agent does not have healthcare regulations available. A US-fintech agent does not have SBV circulars unless explicitly added.
What invariants must hold. Safety properties preserved across every execution — for example, "never disclose SAR filing to the subject" or "always verify identity before account access."
What regulations bound decisions. For a BSA/AML analyst, 31 CFR §1010–1020, OCC 2013-29, FinCEN guidance. For a Vietnamese banking analyst, SBV Circular 11/2021 and Decree 116/2013.
At what level of autonomy. Four levels — Suggest, Plan, Execute, Delegate — each with strictly more demanding verification requirements as autonomy rises.

The envelope is derived from the industry ontology rather than written by hand for each agent. A CIO adopting or authoring the banking ontology gets the envelope as a by-product of the same artifact that grounds the agent.
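The five-element tuple can be pictured as a small immutable record. The sketch below is illustrative only: the class and field names, the action identifiers, and the choice of the Suggest level for the example agent are assumptions made for this example, not the FAOS schema.

```python
from dataclasses import dataclass
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """The four autonomy levels named in the text, in ascending order."""
    SUGGEST = 1
    PLAN = 2
    EXECUTE = 3
    DELEGATE = 4


@dataclass(frozen=True)
class OperationalEnvelope:
    """Five-element envelope; field names are hypothetical."""
    permission_boundary: frozenset[str]       # authorised actions
    domain_scope: frozenset[str]              # ontological domains in scope
    safety_properties: tuple[str, ...]        # invariants that must always hold
    governance_constraints: tuple[str, ...]   # regulations bounding decisions
    autonomy_level: AutonomyLevel

    def permits(self, action: str) -> bool:
        """An action is allowed only if it lies inside the permission boundary."""
        return action in self.permission_boundary


# Example values taken from the BSA/AML analyst described above.
# The autonomy level chosen here is an assumption for illustration.
bsa_aml_analyst = OperationalEnvelope(
    permission_boundary=frozenset(
        {"flag_suspicious_transaction", "file_ctr", "file_sar", "escalate"}
    ),
    domain_scope=frozenset({"us_banking", "bsa_aml"}),
    safety_properties=("never disclose SAR filing to the subject",),
    governance_constraints=("31 CFR §1010–1020", "OCC 2013-29", "FinCEN guidance"),
    autonomy_level=AutonomyLevel.SUGGEST,
)
```

With this record, `bsa_aml_analyst.permits("file_sar")` is true and `bsa_aml_analyst.permits("approve_loan")` is false, whatever the prompt says.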

Primitive two — ontology-to-scenario generation

Test scenarios are derived automatically from three sources in the ontology. Regulatory scenarios come from the governance constraints — for each regulation, the generator produces a positive case (the agent should apply it), a negative case (the agent should detect a violation), and boundary cases at numeric thresholds where applicable. Operational scenarios come from the metrics — calculate the combined ratio given these loss and expense figures, assess whether this non-performing-loan trend triggers the supervisory threshold. Adversarial scenarios come from the permission boundary — prompt injection attempts, data exfiltration attempts, regulatory bypass attempts, role confusion attempts.

The structural point is that the generator traverses the ontology systematically. A regulation absent from the ontology is a blind spot in the test suite — and an explicit, visible one. The completeness of the coverage is a function of the completeness of the ontology, which is auditable. Persona-by-scenario approaches do not have this property; their completeness is bounded by the imagination of the persona-scenario matrix author.
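A minimal sketch of the regulatory branch of such a generator follows. It assumes a hypothetical `Regulation` record with an optional numeric threshold and emits templated prompts. The real generator presumably uses a language model to write the scenarios, so the names, prompt templates, and the 10,000 threshold here are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Regulation:
    """Hypothetical ontology entry for one governance constraint."""
    reg_id: str
    description: str
    threshold: Optional[float] = None  # numeric trigger, if the rule has one


@dataclass(frozen=True)
class Scenario:
    reg_id: str
    kind: str    # "positive", "negative", or "boundary"
    prompt: str


def generate_regulatory_scenarios(reg: Regulation) -> list[Scenario]:
    """One positive case, one negative case, and boundary cases around any threshold."""
    scenarios = [
        Scenario(reg.reg_id, "positive",
                 f"Case where the agent should apply: {reg.description}"),
        Scenario(reg.reg_id, "negative",
                 f"Case where the agent should detect a violation of: {reg.description}"),
    ]
    if reg.threshold is not None:
        # Probe just below, exactly at, and just above the numeric threshold.
        for value in (reg.threshold - 1, reg.threshold, reg.threshold + 1):
            scenarios.append(Scenario(
                reg.reg_id, "boundary",
                f"Amount of {value:,.0f} against a threshold of {reg.threshold:,.0f}"))
    return scenarios


def generate_suite(ontology: list[Regulation]) -> list[Scenario]:
    """Traverse every regulation; a rule absent from the ontology yields no scenarios."""
    return [s for reg in ontology for s in generate_regulatory_scenarios(reg)]


# Illustrative ontology fragment; identifiers and threshold value are assumptions.
ontology = [
    Regulation("CTR", "report large currency transactions", threshold=10_000),
    Regulation("SAR-CONF", "never disclose a SAR filing to its subject"),
]
suite = generate_suite(ontology)
```

Because the suite is a pure function of the ontology, the blind spots are visible: a regulation that is not listed produces no scenarios, and that absence can be audited.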

Primitive three — the trust certificate

A trust certificate is a machine-verifiable attestation issued at the end of verification. It contains the operational envelope under which certification was performed, the scenario set executed, the per-scenario results with judge evaluations, an overall verdict, a timestamp, and a cryptographic signature.

The verdict has three values. Approved means the agent passed at or above a high threshold (95% in our implementation) and is cleared to deploy at its certified autonomy level. Conditional means the agent passed at or above a low threshold (80%) but below the high threshold; it requires explicit operator approval before production. Rejected means the agent fell below the low threshold; deployment is blocked.
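The verdict rule reduces to two comparisons. The sketch below uses the 95% and 80% thresholds stated above. The function and constant names are assumptions, and the comparison is done in integer arithmetic so that a pass rate sitting exactly on a threshold is not affected by floating-point rounding.

```python
from enum import Enum


class Verdict(Enum):
    APPROVED = "approved"
    CONDITIONAL = "conditional"
    REJECTED = "rejected"


# Thresholds from the text, expressed as whole percentages.
HIGH_THRESHOLD_PCT = 95
LOW_THRESHOLD_PCT = 80


def verdict_for(passed: int, total: int) -> Verdict:
    """Map a scenario pass count to one of the three graduated verdicts."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need 0 <= passed <= total and total > 0")
    # passed / total >= pct / 100, rewritten without division.
    if passed * 100 >= HIGH_THRESHOLD_PCT * total:
        return Verdict.APPROVED
    if passed * 100 >= LOW_THRESHOLD_PCT * total:
        return Verdict.CONDITIONAL
    return Verdict.REJECTED
```

For a 100-scenario suite, 95 passes is Approved, 94 down to 80 is Conditional, and 79 or fewer is Rejected.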

The signature binds the certificate to a specific triple — the agent's code hash, the model version, and the ontology version. Changing any one invalidates the certificate. This is the operational consequence that often surprises CIOs: every model rev, every prompt rev, every ontology rev is a re-certification event, not a free upgrade.
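The invalidation behaviour can be illustrated with a keyed hash over the bound triple. The text says only that the certificate carries a cryptographic signature. HMAC-SHA256 is used below as a stand-in chosen for brevity, and a production system might well use an asymmetric signature instead. The function names, payload layout, version strings, and demo key are all assumptions.

```python
import hashlib
import hmac
import json


def certificate_payload(agent_code_hash: str, model_version: str,
                        ontology_version: str, verdict: str) -> bytes:
    """Serialise the bound fields canonically so equal content gives equal bytes."""
    body = {
        "agent_code_hash": agent_code_hash,
        "model_version": model_version,
        "ontology_version": ontology_version,
        "verdict": verdict,
    }
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(payload: bytes, key: bytes) -> str:
    """HMAC-SHA256 stand-in for the certificate signature (an assumption)."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, key: bytes, signature: str) -> bool:
    """Constant-time comparison of the recomputed and presented signatures."""
    return hmac.compare_digest(sign(payload, key), signature)


DEMO_KEY = b"demo-key-not-for-production"  # hypothetical key for the example only

issued = certificate_payload("sha256:ab12", "model-2025-01", "banking-ontology-1.4", "conditional")
signature = sign(issued, DEMO_KEY)

# Same triple: the certificate still verifies.
still_valid = verify(issued, DEMO_KEY, signature)

# A new ontology version changes the payload, so the old signature no longer matches.
after_ontology_rev = certificate_payload("sha256:ab12", "model-2025-01", "banking-ontology-1.5", "conditional")
invalidated = not verify(after_ontology_rev, DEMO_KEY, signature)
```

Changing any one of the three bound values produces a different payload, which is why each revision becomes a re-certification event rather than an in-place update.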

In the FAOS implementation, an architectural deployment gate consumes the certificate. The gate is enforced at the runtime infrastructure layer, not the application layer, so application code cannot bypass it. The gate is environment-aware — enforced in production and customer-VPC environments, skipped in staging and development. Higher levels of verification — runtime monitoring, probabilistic bounded model checking, formal methods — remain proposed in the current architecture; the simulation-based attestation is what the framework operationalises today.
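The gate decision described above can be sketched as a single function. The environment names, the string verdict values, and the function signature are assumptions for illustration. In the architecture described, the check lives in the runtime infrastructure rather than in application code.

```python
from typing import Optional

# Environments where the gate is skipped, per the text. Anything not listed
# here is treated as enforced, so an unrecognised environment fails closed.
SKIPPED_ENVIRONMENTS = frozenset({"staging", "development"})


def may_deploy(environment: str, verdict: Optional[str],
               operator_approved: bool = False) -> bool:
    """Return True if an agent with this certificate verdict may start."""
    if environment in SKIPPED_ENVIRONMENTS:
        return True
    if verdict == "approved":
        return True
    if verdict == "conditional":
        # Conditional certificates need explicit operator approval.
        return operator_approved
    # "rejected", a missing certificate, or an unknown verdict blocks deployment.
    return False
```

Treating unknown environments as enforced is a design choice made in this sketch: a misspelled environment name then blocks a deployment instead of silently bypassing the gate.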

What the research shows

RA-6 reports a controlled pilot designed to test one structural hypothesis: does ontology-grounded scenario generation produce test suites measurably superior to the alternatives currently in commercial practice? Four conditions, five industry-by-regulatory-regime cells, three model families, 5,400 scenarios total.

Study design

Four generation strategies, each receiving strictly more information than the last:

G1 — Baseline. The agent role and the industry name. Nothing more. This is what a naive prompt produces.
G2 — Persona-scenario matrix. Five personas crossed with six scenario categories — the approach that dominates current commercial test-generation engines. This is the strongest representative of current practice.
G3 — RAG-augmented. Eight unstructured text chunks retrieved from the industry ontology. This simulates a retrieval pipeline over the same content G4 uses.
G4 — Ontology-grounded. The full three-layer structured ontology (roles, domain concepts, interaction patterns), with 30% of regulatory constraints held out from the generation prompt to control for circularity. The held-out partition lets us distinguish generalisation from regurgitation.

Five industry cells — US fintech under BSA/AML, US insurance under NAIC and state DOI, US healthcare under HIPAA and EMTALA, Vietnamese banking under SBV circulars, Vietnamese insurance under the 2022 Insurance Business Law. Each condition × industry combination ran three times, yielding 60 independently generated test suites of 30 scenarios each, for 1,800 scenarios per model. The cross-model replication added Qwen 2.5 72B and Gemma 4 26B for an additional 3,600 scenarios.
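The scenario counts follow directly from the design and can be checked in a few lines. The variable names are ours, and the figures are the ones stated above.

```python
conditions = 4            # G1 to G4
industry_cells = 5        # industry-by-regulatory-regime cells
runs_per_combination = 3
scenarios_per_suite = 30
model_families = 3        # Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B

suites_per_model = conditions * industry_cells * runs_per_combination
scenarios_per_model = suites_per_model * scenarios_per_suite
total_scenarios = scenarios_per_model * model_families
replication_scenarios = total_scenarios - scenarios_per_model

print(suites_per_model, scenarios_per_model, replication_scenarios, total_scenarios)
# prints: 60 1800 3600 5400
```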

The ground-truth checklist was 125 regulatory requirements (25 per industry) curated from primary statutory and regulatory sources — 31 CFR for fintech, NAIC Model Acts for insurance, 45 CFR and 42 USC for healthcare, SBV circulars for Vietnamese banking, the Insurance Business Law and Circular 132 for Vietnamese insurance. The checklist was curated independently of the FAOS ontology to avoid circularity. Twenty-five faults (five per industry) were injected into the agent's system prompt and the test suites were evaluated on whether they would detect the faults.
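Regulatory coverage, as used in the results, is the share of checklist items addressed by at least one scenario in a suite. A minimal sketch of that ratio follows. How RA-6 decides that a scenario addresses a requirement is not described here, so the sketch takes the set of matched requirement IDs as given, and the IDs themselves are hypothetical.

```python
def coverage(checklist_ids: set[str], covered_ids: set[str]) -> float:
    """Fraction of checklist requirements hit by at least one scenario."""
    if not checklist_ids:
        raise ValueError("checklist must not be empty")
    # Intersect so that matches outside the checklist cannot inflate the score.
    return len(checklist_ids & covered_ids) / len(checklist_ids)


# Hypothetical 25-item checklist for one industry, with 12 items matched.
checklist = {f"REQ-{i:02d}" for i in range(1, 26)}
matched = {f"REQ-{i:02d}" for i in range(1, 13)} | {"NOT-ON-CHECKLIST"}
suite_coverage = coverage(checklist, matched)  # 12 of 25 items
```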

Headline results

The headline outcome is a sharp coverage and specificity advantage for ontology-grounded generation against the persona-scenario baseline, with cross-model replication.

Regulatory coverage. G4 achieved 48.3% mean coverage of the 125-item checklist, compared with 33.1% for G2: a 15.2 percentage-point advantage, significant after Bonferroni correction ($p_c$ = .0006).
Industry specificity. G4 scored 4.77 out of 5.0 on the LLM-as-judge specificity rubric, the highest by a clear margin and significant against every alternative ($p$ = 2 × 10⁻⁶ against G1; $p_c$ < 10⁻⁵ against G2; $p_c$ = .0008 against G3).
Cross-model replication. Across Claude Sonnet 4, Qwen 2.5 72B, and Gemma 4 26B, the ontology-versus-persona pattern replicated. The coverage advantage held across all three families ($p_c$ = .0006 on Claude, $p_c$ = .005 on Qwen, $p_c$ = .009 on Gemma). The specificity advantage held across all three. The result is a property of the methodology, not an artifact of one model family.
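For readers unfamiliar with the $p_c$ notation, a Bonferroni-corrected p-value is the raw p-value multiplied by the number of comparisons in the family, capped at 1. The sketch below shows the correction itself. The raw p-values are made-up inputs, and the size of the comparison family used in RA-6 is not stated here, so none of these numbers correspond to the study.

```python
def bonferroni(p_values: list[float]) -> list[float]:
    """Multiply each raw p-value by the family size m, capping at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values for a family of three pairwise comparisons.
raw = [0.001, 0.02, 0.4]
corrected = bonferroni(raw)

# Only comparisons that stay under alpha after correction are Bonferroni-robust.
alpha = 0.05
robust = [p_c < alpha for p_c in corrected]
```

In this made-up family only the first comparison survives: the second is significant before correction but not after, which is the pattern a "positive but not Bonferroni-robust" result describes.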

What the paper does not claim

The result has a sharp boundary, and the paper is explicit about it. The coverage advantage of G4 over G1 (plain prompt) and G3 (RAG-augmented prompt) is positive but not Bonferroni-robust. The strong claim that survives correction is G4 over G2 — ontology-grounded generation beats the persona-scenario baseline that current commercial tools use. RA-6 explicitly establishes ontology-grounded generation as a "credible complement" to persona-based test suites for regulatory-intensive domains, not as a unilateral winner over every alternative.

Three further boundaries are honestly acknowledged. The Trust Certificate as currently operationalised supports simulation-based attestation; higher verification levels (probabilistic bounded model checking, formal methods) are proposed and not yet delivered. The LLM-as-judge evaluation pipeline uses Claude Sonnet 4 as both the generator-side primary model and the fixed judge, introducing self-enhancement bias that cross-model replication only partially mitigates. The verdict thresholds (95% high, 80% low) are illustrative engineering parameters, not yet calibrated against real-world deployment incident rates, and no agent in the pilot satisfied the 95% threshold.

These caveats do not undermine the headline results. They shape what those results authorise you to conclude — and what the next phase of verification research has to deliver.

What this means for your enterprise

The practical implications fall into three categories.

Architectural implication — anchored testing replaces benchmark testing

If you are evaluating an agent platform for regulated work, the question is not "what does it score on AgentBench" or "what does it score on AILuminate". Those measure model capability against generic safety surfaces. The question is whether the platform can produce a structured pre-deployment evidence package against your regulatory checklist, your domain ontology, and your operational thresholds.

Three properties of that capability are worth checking. Does the platform let you author and version a domain ontology that the agent's verification suite is derived from? Does the verification produce a machine-verifiable certificate, not only a dashboard? Does the deployment gate enforce the certificate at the runtime layer, or only at the application layer? The third question is where most current "AI governance" offerings fall short — governance as documentation does not constrain behaviour; governance as an enforcement point in the runtime does.

Operational implication — the audit committee gets a structured artifact

The trust certificate is a pre-deployment evidence package designed to be consumed by audit committees, regulators, internal risk functions, and customer security reviews. It binds the agent version, ontology version, and model version to a specific body of verification work. It produces a verdict that drives a deployment gate. It expires when any of the bound components change.

In a regulated industry, this is the difference between answering "do you have AI governance" with a paragraph and answering it with a registry. Regulators and audit committees consume registries. They struggle to consume paragraphs.

The second operational consequence, and the one that often surprises CIOs, is the re-certification cadence. A new minor revision of the model is a re-certification event. An ontology change adding a new SBV circular is a re-certification event. A change to the system prompt of a tier-2 agent is a re-certification event. None of these is catastrophic. They are routine, but they have to be planned for. Enterprises running fleets of agents discover that the operational discipline is the same as the one already in place for software releases, only at a higher cadence, because models, prompts, and ontologies change more often than application code.

When the approach is not the right tool

Three honest counter-cases.

Discovery-phase agents. Early in a use case, before the regulatory surface and operational thresholds are clear, the cost of ontology authoring may exceed the value. Ontology-grounded verification is a discipline you institute once you know what you are verifying against. Greenfield agents on unregulated workflows do not need it.

Single-task, single-tenant, low-stakes deployments. An internal coding assistant, a marketing copy drafter, a generic customer-support agent on non-regulated workflows — these are less differentiated by ontology grounding than by generic capability and cost. The framework is built for the regulated end of the market.

Wholly-vendor SaaS agents you do not author. If you adopt a third-party SaaS agent that you cannot configure against your own ontology, you have a vendor-attestation problem, not a verification problem. The right response is to require the vendor to attest, not to retrofit your verification framework on top of someone else's black box. The framework presented here is for agents you operate, not agents you consume.

Implementation guide

At a high level, an enterprise adopts ontology-grounded verification in three phases. The phasing assumes a working agent platform — agents in development or production, retrieval in place, basic observability set up.

Phase one — instrument one regulated workflow (first 90 days)

The right starting workflow has three properties. It is regulated. It has a clear domain owner — a head of compliance, a chief actuary, a head of clinical operations — willing to invest in ontology authoring. It is small enough to complete an end-to-end pilot in 90 days.

A representative pilot scope: one agent, one operational envelope, the top 25 regulatory requirements curated from primary sources, 60 to 100 test scenarios derived from the ontology, and a fault-injection corpus that follows RA-6's structure, adapted to the workflow's specific risks. The deliverable is a first trust certificate, produced for the agent and consumed by the deployment gate, together with a measured baseline pass rate.

Success at 90 days looks like: the agent has a trust certificate, the certificate's verdict is reproducible, the domain owner can author and review the regulatory checklist without engineering help, and the certificate is consumed by a deployment-gating mechanism rather than only displayed on a dashboard.

Phase two — extend to adjacent agents and add cross-model replication (next 90 days)

Two parallel tracks.

Track A — agent extension. Add the next two or three agents in the same regulated workflow. Each should reuse the same ontology and the same scenario generator, validating that the marginal cost of certifying an additional agent decays as the ontology accumulates content.

Track B — cross-model replication. Re-run verification under at least one alternative model family. The goal is not to replace the primary model — it is to establish that the verification result is not a one-model artifact. Where the cross-model results agree, the certificate is robust; where they disagree, the disagreement is the next investigation.
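One simple way to operationalise Track B is to compare the verdict each model family produces for the same agent and envelope. The sketch below is an assumption about how that comparison might look. The model labels and verdict strings are placeholders, not pilot results.

```python
def cross_model_report(verdicts_by_model: dict[str, str]) -> dict[str, object]:
    """Summarise whether verification verdicts agree across model families."""
    if len(verdicts_by_model) < 2:
        raise ValueError("need at least two model families to compare")
    distinct = set(verdicts_by_model.values())
    return {
        "agree": len(distinct) == 1,
        # Group models by verdict so a disagreement points at what to investigate.
        "by_verdict": {
            v: sorted(m for m, mv in verdicts_by_model.items() if mv == v)
            for v in sorted(distinct)
        },
    }


# Placeholder inputs for illustration.
report = cross_model_report({
    "primary-model": "conditional",
    "alternative-model": "rejected",
})
```

Here `report["agree"]` is false, and `report["by_verdict"]` shows which model family produced which verdict, which is the starting point for the investigation.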

Phase three — operationalise the registry (next 6 months)

Two pieces of work that distinguish a research-grade adoption from a production-grade one.

Trust certificate registry. Build the registry that links specific agent versions to their certificates. This is the artifact that audit committees and regulators consume. It is also the artifact that drives the re-certification cadence — when a model version, ontology version, or agent code hash changes, the registry surfaces the affected certificates and routes them for re-verification.
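The core query of such a registry is "which certificates are bound to the component that just changed?". A minimal in-memory sketch follows. The class, field, and identifier names are hypothetical, and a production registry would sit on durable, access-controlled storage.

```python
from dataclasses import dataclass

# The three components a certificate is bound to, per the text.
BOUND_COMPONENTS = ("agent_code_hash", "model_version", "ontology_version")


@dataclass(frozen=True)
class CertificateRecord:
    certificate_id: str
    agent_code_hash: str
    model_version: str
    ontology_version: str
    verdict: str


class CertificateRegistry:
    """In-memory illustration of a trust certificate registry."""

    def __init__(self) -> None:
        self._records: list[CertificateRecord] = []

    def register(self, record: CertificateRecord) -> None:
        self._records.append(record)

    def affected_by(self, component: str, old_value: str) -> list[CertificateRecord]:
        """Certificates bound to a component value that is being replaced."""
        if component not in BOUND_COMPONENTS:
            raise ValueError(f"unknown bound component: {component}")
        return [r for r in self._records if getattr(r, component) == old_value]


registry = CertificateRegistry()
registry.register(CertificateRecord("cert-001", "sha256:ab12", "model-2025-01", "banking-1.4", "conditional"))
registry.register(CertificateRecord("cert-002", "sha256:cd34", "model-2025-01", "insurance-2.0", "conditional"))
registry.register(CertificateRecord("cert-003", "sha256:ef56", "model-2024-09", "banking-1.4", "rejected"))

# A model revision: every certificate pinned to the old version needs re-verification.
to_recertify = registry.affected_by("model_version", "model-2025-01")
```

In this example a revision of `model-2025-01` routes `cert-001` and `cert-002` for re-verification and leaves `cert-003` untouched.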

Ontology operations. Decide who owns ontology change management. Decide how regulatory changes propagate into ontology updates. Decide how findings from production monitoring feed back into the next round of verification scenarios. The operational discipline is what keeps the verification surface aligned with the evolving regulatory environment.

Twelve months in, a successful adoption is operating with certified agents in two or three regulated workflows, with a registry consumed by the audit function, with a documented re-certification cadence, and with explicit playbooks for the three most common re-certification triggers (model rev, ontology rev, prompt rev). The trust certificate is no longer a research artifact; it is an operating-rhythm artifact.

Frequently asked questions

Doesn't penetration testing solve this? Pen-testing finds vulnerabilities. Pre-deployment verification establishes coverage and evidence against a structured specification. These are different problems. Pen-test results belong inside an operational envelope's safety-property section; they do not replace the regulatory checklist, the operational thresholds, or the certificate. In practice, the two are complementary: a mature programme runs both.

What is the ontology maintenance overhead? Material but bounded, and not new work for regulated industries. Compliance functions already track regulatory change as part of their core mandate. The ontology becomes the artifact that work feeds into rather than a parallel burden. The marginal cost is in the structure — moving from documents-and-spreadsheets to a versioned, machine-readable ontology with explicit relationships between regulations, metrics, roles, and handoff patterns. Once that structure is in place, regulatory updates are routine.

What about new model releases — do I have to re-certify every agent? Yes, for production agents on regulated workflows, a model version change is a re-certification event. This is the same discipline software engineering organisations apply to dependency upgrades, with a faster cadence because models release more frequently. The practical mitigation is two-fold: pin model versions per agent rather than tracking the latest, and operate a registry that surfaces affected certificates the moment a pinned model retires. Plan re-certification into the operating rhythm rather than treating it as an exception.

How does this interact with our existing AI governance committee? Cleanly, in principle. The trust certificate is the structured artifact the governance committee was missing. The committee's role becomes setting the policy that maps autonomy levels to verdict thresholds, ratifying which workflows require which verification configurations, and signing off on registry exceptions. The certificate moves the committee from prose-based oversight to artifact-based oversight, which is what most committees are looking for when they ask for "AI governance maturity."

What about agents we didn't build — vendor SaaS, embedded LLM features? Different problem. For agents you operate, the framework here applies directly. For agents you consume, the right move is to require the vendor's attestation — ideally in the same structured form, more typically in whatever form the vendor offers — and to log the vendor attestation in your own registry. The framework does not retrofit onto opaque vendor systems; it gives you the language to ask the vendor for the right artifact.

Does this replace post-deployment monitoring? No. Pre-deployment verification establishes that the agent meets a structured bar before it goes to production. Post-deployment monitoring detects drift and anomalies once it is there. Verification reduces what monitoring has to catch by establishing the baseline; monitoring detects deviations from the baseline. Both are required; neither substitutes for the other.

What to do next

Two paths forward, depending on where you are.

Read the foundation. Pre-deployment verification rests on a grounding architecture — the three-layer enterprise ontology that defines the domain the agent is operating in. RA-3, our companion paper on neurosymbolic enterprise AI, is the architecture that this work builds on. If you are a principal engineer or research-minded architect, start there; the methodology section makes the connection between grounding and verification explicit.

Luong, T. T., and Sanyal, A. (2026). Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents. arXiv preprint 2604.00555. See also the RA-3 whitepaper.

Discuss application to your stack. If you are evaluating pre-deployment verification for a specific regulated workflow — financial services, insurance, healthcare, or a non-English jurisdiction — book a working session with the FAOS team. We are happy to walk through your domain, the operational envelope structure for the relevant agent, and a candidate 90-day pilot scope.

