We Simulated 5,400 AI Agent Scenarios. Here Is What Broke.
Your AI agent passes every unit test. It handles demo scenarios perfectly. Then it goes to production in a regulated industry and recommends a solvency ratio that violates Vietnamese insurance law, or bypasses an approval chain that exists for compliance reasons, or confidently cites a regulation that was repealed two years ago.
The problem isn't that your agent is broken. The problem is that nobody tested the scenarios that matter: the regulatory edge cases, the cross-role handoffs, the adversarial inputs that exploit domain-specific knowledge gaps.
We built a system that generates these scenarios automatically from enterprise ontologies, then ran 5,400 simulations across 3 LLM models and 5 regulated industries. Here's what we learned about verifying AI agents before they touch production.
The Verification Gap
Enterprise software has decades of mature testing practices: unit tests, integration tests, load tests, penetration tests. But AI agents introduce a category of failure that none of these catch: domain-specific behavioral failures in open-ended reasoning tasks.
Consider a banking AI agent that needs to classify a non-performing loan. The correct threshold under Vietnamese SBV Circular 11/2021 is different from Basel III defaults. The agent needs to know the right regulation, apply the right threshold, and escalate to the right role when the exposure exceeds authority limits. A unit test that checks "does the function return a number" misses all of this.
What we need is a way to systematically generate test scenarios that cover regulatory requirements, operational workflows, and adversarial attacks, specific to each industry and jurisdiction.
Ontology-Powered Scenario Generation
Our approach uses the same three-layer enterprise ontology (Role, Domain, Interaction) from our neurosymbolic grounding research, but for a different purpose. Instead of grounding agent reasoning, we use ontologies to generate test scenarios.
The ontology knows:
- What regulations apply: SBV circulars for Vietnamese banking, HIPAA for US healthcare, MoF decrees for Vietnamese insurance
- What roles exist and their authority limits: a Branch Manager can approve loans up to a threshold; above that requires CRO approval
- What handoff protocols should be followed: credit approval chains, claims escalation paths, compliance reporting workflows
From this knowledge, we automatically generate 30 test scenarios per industry suite, covering three categories:
- Regulatory scenarios: Does the agent correctly apply jurisdiction-specific rules?
- KPI scenarios: Does the agent use the right metric thresholds and calculation methods?
- Adversarial scenarios: Can the agent be tricked into bypassing compliance controls?
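As a sketch, the generation step can be thought of as crossing ontology facts (regulations, roles and their authority limits) with the three scenario categories. The class and field names below are illustrative, not our actual schema:

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical, simplified ontology entries; real entries carry far more
# structure (jurisdiction, thresholds, handoff protocols) across three layers.
@dataclass
class Regulation:
    ref: str          # e.g. "SBV Circular 11/2021"
    requirement: str  # e.g. "NPL classification thresholds"

@dataclass
class Role:
    title: str
    approval_limit: float  # authority ceiling in local currency

CATEGORIES = ["regulatory", "kpi", "adversarial"]

def generate_scenarios(regulations, roles, per_suite=30):
    """Cross ontology facts with scenario categories, capped at the suite size."""
    scenarios = []
    for reg, role, cat in product(regulations, roles, CATEGORIES):
        scenarios.append({
            "category": cat,
            "regulation": reg.ref,
            "role": role.title,
            "prompt": (f"As a {role.title} (authority limit "
                       f"{role.approval_limit:,.0f}), handle a case governed "
                       f"by {reg.ref}: {reg.requirement}."),
        })
        if len(scenarios) >= per_suite:
            break
    return scenarios

suite = generate_scenarios(
    [Regulation("SBV Circular 11/2021", "NPL classification thresholds")],
    [Role("Branch Manager", 5_000_000_000)],
)
```

In the real pipeline each generated prompt is then expanded by an LLM into a full multi-turn scenario; the cross-product above just guarantees systematic coverage of the ontology's facts.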
The Experiment
We compared four scenario generation strategies:
| Condition | Method | What It Tests |
|---|---|---|
| G1 (Baseline) | LLM generates scenarios from role + industry name only | What the model knows from training |
| G2 (Persona Matrix) | 5 personas × 6 scenario categories crossed | Structured diversity without domain knowledge |
| G3 (RAG) | 8 retrieved ontology chunks | Standard retrieval-augmented approach |
| G4 (Ontology) | Full three-layer structured ontology | Our approach |
Each condition was tested across 5 industries (FinTech, Insurance, Healthcare, Vietnamese Banking, Vietnamese Insurance), with 3 replications, generating 30 scenarios per suite.
We then evaluated every scenario against:
- 125 regulatory requirements curated from primary legal sources (USC, CFR, NAIC model laws, HIPAA, SBV circulars, MoF decrees), deliberately independent from the ontology to prevent circular validation
- 25 injected faults across 5 categories: threshold errors, missing regulations, role boundary violations, adversarial vulnerabilities, and metric calculation errors
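Under these definitions, the two headline metrics reduce to simple set arithmetic. A minimal sketch, with made-up requirement and fault identifiers:

```python
def regulatory_coverage(scenarios, requirements):
    """Fraction of curated requirements touched by at least one scenario."""
    covered = {req for s in scenarios for req in s["regulations"]
               if req in requirements}
    return len(covered) / len(requirements)

def fault_detection_rate(trigger_log, injected_faults):
    """Fraction of injected faults triggered by at least one scenario run."""
    detected = set(trigger_log) & set(injected_faults)
    return len(detected) / len(injected_faults)

# Illustrative identifiers, not the study's actual requirement/fault lists.
scenarios = [
    {"id": "s1", "regulations": {"HIPAA-164.312", "SBV-11/2021"}},
    {"id": "s2", "regulations": {"HIPAA-164.312"}},
]
requirements = {"HIPAA-164.312", "SBV-11/2021", "MoF-46/2023", "EMTALA"}

rc = regulatory_coverage(scenarios, requirements)        # 2 of 4 → 0.5
fdr = fault_detection_rate(["fault-07", "fault-07", "fault-19"],
                           ["fault-07", "fault-19", "fault-23", "fault-31"])
```

Keeping the requirement list independent of the ontology matters here: if the ontology itself supplied the denominator, G4 would trivially score well.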
Three LLM models generated scenarios independently: Claude Sonnet 4, Qwen 2.5 72B, and Gemma 4 26B.
Total: 5,400 simulated scenarios.
What We Found
1. Ontology-generated scenarios cover 46% more regulations
Regulatory Coverage (RC): The ontology condition (G4) covered 48.3% of regulatory requirements, vs. 33.1% for the persona matrix approach (G2).
Friedman test: p = .0015 (highly significant). Post-hoc G4 vs G2: p = .0006.
The gains were largest where regulatory knowledge is most specialized:
- Healthcare: +28 percentage points over baseline (HIPAA, EMTALA, Stark Law coverage)
- Vietnamese Insurance: +16 percentage points (MoF solvency margins, bancassurance rules)
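For readers who want to reproduce the statistics: the Friedman test treats each suite (industry × replication) as a block and the four generation conditions as repeated treatments on that block. A minimal example with SciPy, using illustrative coverage numbers rather than the study's data:

```python
from scipy.stats import friedmanchisquare

# Illustrative regulatory-coverage scores per block (one row per suite)
# for conditions G1..G4; these numbers are made up for the example.
g1 = [0.31, 0.35, 0.29, 0.33, 0.30]  # baseline
g2 = [0.33, 0.34, 0.32, 0.35, 0.31]  # persona matrix
g3 = [0.41, 0.44, 0.40, 0.43, 0.42]  # RAG
g4 = [0.47, 0.50, 0.46, 0.49, 0.48]  # ontology

# Friedman ranks conditions within each block, so it is robust to
# block-level differences (some industries are just harder to cover).
stat, p = friedmanchisquare(g1, g2, g3, g4)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
```

A significant Friedman result only says the four conditions differ somewhere; the pairwise claims (G4 vs G2, etc.) come from post-hoc comparisons with appropriate correction.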
2. Industry specificity was dramatically higher
Industry Specificity Score (ISS): Rated on a 1-5 scale by an LLM judge, ontology-generated scenarios scored 4.77/5.0, significantly higher than all alternatives.
Friedman test: p = 2 × 10^-6 (highly significant). All pairwise comparisons: p < .001.
This means ontology-generated scenarios didn't just mention the right industry; they tested behaviors specific to that industry's regulatory environment, organizational structure, and operational workflows.
3. The Coverage-Precision Tradeoff
Here's the finding that surprised us most: ontology-generated scenarios had the highest regulatory coverage but the lowest fault-triggering precision at execution time.
| Condition | Design-Stage FDR | Execution-Stage FDR | Gap |
|---|---|---|---|
| G4 (Ontology) | 55% | 40% | +15pp |
| G3 (RAG) | 58% | 52% | +6pp |
| G1 (Baseline) | 56% | 48% | +8pp |
Ontology-generated scenarios are regulatory-comprehensive: they cover a broad surface area of compliance requirements. But they're less effective at triggering specific fault behaviors at runtime. The scenarios are well-structured enough that agents handle them more carefully, paradoxically making faults harder to trigger.
Practical implication: Use ontology for coverage and completeness, but complement with targeted adversarial testing for fault detection. These are different testing objectives that require different strategies.
4. Adversarial coverage is prompt-dependent, not knowledge-dependent
All four conditions achieved 88-91% coverage of our 6-category adversarial taxonomy (prompt injection, data exfiltration, regulatory bypass, role confusion, boundary manipulation, social engineering). The differences were not statistically significant (p = .995).
This tells us something important: adversarial attack diversity comes from how you prompt the scenario generator, not from what domain knowledge you provide. A baseline LLM can generate diverse attack scenarios just as well as an ontology-powered one.
5. The results are model-independent
We replicated across three architecturally different models:
| Finding | Claude Sonnet 4 | Qwen 2.5 72B | Gemma 4 26B |
|---|---|---|---|
| RC advantage (G4 vs G2) | +15.2pp (p=.0006) | +14.4pp (p=.005) | +11.2pp (p=.009) |
| ISS advantage | 4.77 (p<.001) | 4.37 (p<.001) | 4.68 (p<.001) |
| Weaker model, larger RC uplift | +7.7pp | +12.0pp | +3.7pp |
The ontology advantage is not an artifact of a specific model's capabilities. It replicates across commercial and open-source LLMs with different architectures and training data.
And consistent with the Inverse Parametric Knowledge Effect, weaker models showed larger RC uplift from ontological grounding β because they have less parametric regulatory knowledge to draw on.
From Test Scenarios to Safety Certificates
Generating good test scenarios is step one. But enterprises need more than test results: they need a verifiable attestation that an agent has been validated for a specific operational scope.
We propose an Agent Safety Certificate: a machine-verifiable record that binds together:
- Operational Envelope: the bounded space of inputs, outputs, and behaviors the agent is certified for
- Scenario Set: every test scenario that was executed
- Results Matrix: per-scenario pass/fail outcomes
- Verdict: Approved (pass rate >= 95%), Conditional (80-95%, requires human review), or Rejected (< 80%)
- Cryptographic Signature: binding the certificate to a specific agent version and ontology version
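A minimal sketch of what issuing such a certificate could look like, using the verdict thresholds above. The schema and the SHA-256 digest standing in for a real signature (e.g. Ed25519 over the body) are illustrative, not a spec:

```python
import hashlib
import json

def verdict(pass_rate: float) -> str:
    """Map a suite pass rate onto the certificate verdict tiers."""
    if pass_rate >= 0.95:
        return "Approved"
    if pass_rate >= 0.80:
        return "Conditional"  # requires human review before deployment
    return "Rejected"

def issue_certificate(agent_version, ontology_version, results):
    """Bind results, verdict, and versions into one verifiable record."""
    pass_rate = sum(results.values()) / len(results)
    body = {
        "agent_version": agent_version,
        "ontology_version": ontology_version,
        "results": results,            # per-scenario pass/fail matrix
        "verdict": verdict(pass_rate),
    }
    # Stand-in for a cryptographic signature: hash of the canonical body.
    payload = json.dumps(body, sort_keys=True).encode()
    body["signature"] = hashlib.sha256(payload).hexdigest()
    return body

cert = issue_certificate("agent-1.4.2", "onto-2025.09",
                         {"s1": True, "s2": True, "s3": False})
```

Because the signature covers both version identifiers, any change to the agent or the ontology produces a certificate mismatch rather than a silent stale approval.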
The certificate is enforced by a Simulation Gate, an infrastructure-level checkpoint that blocks deployment if the agent doesn't hold a valid certificate for its target environment. This isn't an application-level flag that can be bypassed. It's enforced at the orchestration layer.
When the ontology is updated (new regulation, changed threshold, additional role), the certificate is automatically invalidated and re-verification is triggered. This creates a continuous verification loop tied to the evolving regulatory landscape.
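The gate logic itself is deliberately simple, which is part of what makes it enforceable at the orchestration layer. A sketch, assuming the certificate carries the agent and ontology versions it was issued against (field names are illustrative):

```python
def gate_allows_deploy(cert, agent_version, ontology_version):
    """Simulation Gate sketch: block deploy unless a valid, matching cert exists."""
    if cert is None:
        return False                       # never certified
    if cert["agent_version"] != agent_version:
        return False                       # agent changed since certification
    if cert["ontology_version"] != ontology_version:
        return False                       # ontology updated: cert invalidated
    return cert["verdict"] == "Approved"   # Conditional/Rejected do not deploy

cert = {"agent_version": "agent-1.4.2",
        "ontology_version": "onto-2025.09",
        "verdict": "Approved"}

gate_allows_deploy(cert, "agent-1.4.2", "onto-2025.09")   # deploy proceeds
gate_allows_deploy(cert, "agent-1.4.2", "onto-2025.10")   # blocked: re-verify
```

The ontology-version check is what turns a one-off audit into a continuous loop: publishing a new circular or threshold into the ontology automatically forces re-verification everywhere.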
What This Means for Enterprise AI Teams
If you're deploying AI agents in regulated industries:
1. Your test suite is probably incomplete. Manual scenario authoring misses the long tail of regulatory requirements. Ontology-powered generation covered 46% more regulations than the best alternative, and this was with only 30 scenarios per industry.
2. Coverage and fault detection are different problems. Don't assume that covering more regulations means catching more bugs. Use ontology for systematic coverage, and complement with adversarial red-teaming for fault detection.
3. Test across models. If you're planning to swap LLM providers, your verification results may not transfer. Our cross-model data shows the coverage-precision tradeoff direction is model-dependent.
4. Build verification into your deployment pipeline. Treat agent verification like container security scanning: it happens automatically before deployment, and failures block the release. The Safety Certificate pattern gives you an auditable record for compliance.
5. Connect your ontology to your test infrastructure. The same domain knowledge that grounds your agents can generate your test scenarios. This creates a virtuous cycle: as you improve your ontology, your tests automatically improve too.
The Paper
The full study, covering the experimental design, statistical analysis, cross-model validation, and the Safety Certificate framework, is written up in:
"Toward Verifiable Enterprise AI Agents: Ontology-Powered Simulation and Formal Safety Certification," Thanh Luong Tuan and Abhijit Sanyal (arXiv submission pending; preprint available on request)
This research directly informs the verification architecture in FAOS, where ontology-powered simulation is part of the agent deployment pipeline across 22 industry verticals.
This post is part of our research series sharing empirical findings from the FAOS platform. Previously: why ontology beats RAG for enterprise AI agents.
