We Simulated 5,400 AI Agent Scenarios. Here Is What Broke.
Your AI agent passes every unit test. It handles demo scenarios perfectly. Then it goes to production in a regulated industry and recommends a solvency ratio that violates Vietnamese insurance law, or bypasses an approval chain that exists for compliance reasons, or confidently cites a regulation that was repealed two years ago.
The problem isn't that your agent is broken. The problem is that nobody tested the scenarios that matter: the regulatory edge cases, the cross-role handoffs, the adversarial inputs that exploit domain-specific knowledge gaps.
We built a system that generates these scenarios automatically from enterprise ontologies, then ran 5,400 simulations across 3 LLM models and 5 regulated industries. Here's what we learned about verifying AI agents before they touch production.
The Verification Gap
Enterprise software has decades of mature testing practices: unit tests, integration tests, load tests, penetration tests. But AI agents introduce a category of failure that none of these catch: domain-specific behavioral failures in open-ended reasoning tasks.
Consider a banking AI agent that needs to classify a non-performing loan. The correct threshold under Vietnamese SBV Circular 11/2021 is different from Basel III defaults. The agent needs to know the right regulation, apply the right threshold, and escalate to the right role when the exposure exceeds authority limits. A unit test that checks "does the function return a number" misses all of this.
What we need is a way to systematically generate test scenarios that cover regulatory requirements, operational workflows, and adversarial attacks, specific to each industry and jurisdiction.
Ontology-Powered Scenario Generation
Our approach uses the same three-layer enterprise ontology (Role, Domain, Interaction) from our neurosymbolic grounding research, but for a different purpose. Instead of grounding agent reasoning, we use ontologies to generate test scenarios.
The ontology knows:
- What regulations apply: SBV circulars for Vietnamese banking, HIPAA for US healthcare, MoF decrees for Vietnamese insurance
- What roles exist and their authority limits: a Branch Manager can approve loans up to a threshold; above that requires CRO approval
- What handoff protocols should be followed: credit approval chains, claims escalation paths, compliance reporting workflows
From this knowledge, we automatically generate 30 test scenarios per industry suite, covering three categories:
- Regulatory scenarios: Does the agent correctly apply jurisdiction-specific rules?
- KPI scenarios: Does the agent use the right metric thresholds and calculation methods?
- Adversarial scenarios: Can the agent be tricked into bypassing compliance controls?
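As a sketch, the generation step can be thought of as crossing ontology facts (regulations, roles and their authority limits) with the three scenario categories. The class and field names below are illustrative, not our actual schema:

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical, simplified ontology entries; real entries carry far more
# structure (jurisdiction, thresholds, handoff protocols) across three layers.
@dataclass
class Regulation:
    ref: str          # e.g. "SBV Circular 11/2021"
    requirement: str  # e.g. "NPL classification thresholds"

@dataclass
class Role:
    title: str
    approval_limit: float  # authority ceiling in local currency

CATEGORIES = ["regulatory", "kpi", "adversarial"]

def generate_scenarios(regulations, roles, per_suite=30):
    """Cross ontology facts with scenario categories, capped at the suite size."""
    scenarios = []
    for reg, role, cat in product(regulations, roles, CATEGORIES):
        scenarios.append({
            "category": cat,
            "regulation": reg.ref,
            "role": role.title,
            "prompt": (f"As a {role.title} (authority limit "
                       f"{role.approval_limit:,.0f}), handle a case governed "
                       f"by {reg.ref}: {reg.requirement}."),
        })
        if len(scenarios) >= per_suite:
            break
    return scenarios

suite = generate_scenarios(
    [Regulation("SBV Circular 11/2021", "NPL classification thresholds")],
    [Role("Branch Manager", 5_000_000_000)],
)
```

In the real pipeline each generated prompt is then expanded by an LLM into a full multi-turn scenario; the cross-product above just guarantees systematic coverage of the ontology's facts.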
The Experiment
We compared four scenario generation strategies:
| Condition | Method | What It Tests |
|---|---|---|
| G1 (Baseline) | LLM generates scenarios from role + industry name only | What the model knows from training |
| G2 (Persona Matrix) | 5 personas × 6 scenario categories crossed | Structured diversity without domain knowledge |
| G3 (RAG) | 8 retrieved ontology chunks | Standard retrieval-augmented approach |
| G4 (Ontology) | Full three-layer structured ontology | Our approach |
Each condition was tested across 5 industries (FinTech, Insurance, Healthcare, Vietnamese Banking, Vietnamese Insurance), with 3 replications, generating 30 scenarios per suite.
We then evaluated every scenario against:
- 125 regulatory requirements curated from primary legal sources (USC, CFR, NAIC model laws, HIPAA, SBV circulars, MoF decrees), deliberately independent from the ontology to prevent circular validation
- 25 injected faults across 5 categories: threshold errors, missing regulations, role boundary violations, adversarial vulnerabilities, and metric calculation errors
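Under these definitions, the two headline metrics reduce to simple set arithmetic. A minimal sketch, with made-up requirement and fault identifiers:

```python
def regulatory_coverage(scenarios, requirements):
    """Fraction of curated requirements touched by at least one scenario."""
    covered = {req for s in scenarios for req in s["regulations"]
               if req in requirements}
    return len(covered) / len(requirements)

def fault_detection_rate(trigger_log, injected_faults):
    """Fraction of injected faults triggered by at least one scenario run."""
    detected = set(trigger_log) & set(injected_faults)
    return len(detected) / len(injected_faults)

# Illustrative identifiers, not the study's actual requirement/fault lists.
scenarios = [
    {"id": "s1", "regulations": {"HIPAA-164.312", "SBV-11/2021"}},
    {"id": "s2", "regulations": {"HIPAA-164.312"}},
]
requirements = {"HIPAA-164.312", "SBV-11/2021", "MoF-46/2023", "EMTALA"}

rc = regulatory_coverage(scenarios, requirements)        # 2 of 4 → 0.5
fdr = fault_detection_rate(["fault-07", "fault-07", "fault-19"],
                           ["fault-07", "fault-19", "fault-23", "fault-31"])
```

Keeping the requirement list independent of the ontology matters here: if the ontology itself supplied the denominator, G4 would trivially score well.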
Three LLM models generated scenarios independently: Claude Sonnet 4, Qwen 2.5 72B, and Gemma 4 26B.
Total: 5,400 simulated scenarios.
What We Found
1. Ontology-generated scenarios cover 46% more regulations
Regulatory Coverage (RC): The ontology condition (G4) covered 48.3% of regulatory requirements, vs. 33.1% for the persona matrix approach (G2).
Friedman test: p = .0015 (highly significant). Post-hoc G4 vs G2: p = .0006.
The gains were largest where regulatory knowledge is most specialized:
- Healthcare: +28 percentage points over baseline (HIPAA, EMTALA, Stark Law coverage)
- Vietnamese Insurance: +16 percentage points (MoF solvency margins, bancassurance rules)
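For readers who want to reproduce the statistics: the Friedman test treats each suite (industry × replication) as a block and the four generation conditions as repeated treatments on that block. A minimal example with SciPy, using illustrative coverage numbers rather than the study's data:

```python
from scipy.stats import friedmanchisquare

# Illustrative regulatory-coverage scores per block (one row per suite)
# for conditions G1..G4; these numbers are made up for the example.
g1 = [0.31, 0.35, 0.29, 0.33, 0.30]  # baseline
g2 = [0.33, 0.34, 0.32, 0.35, 0.31]  # persona matrix
g3 = [0.41, 0.44, 0.40, 0.43, 0.42]  # RAG
g4 = [0.47, 0.50, 0.46, 0.49, 0.48]  # ontology

# Friedman ranks conditions within each block, so it is robust to
# block-level differences (some industries are just harder to cover).
stat, p = friedmanchisquare(g1, g2, g3, g4)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
```

A significant Friedman result only says the four conditions differ somewhere; the pairwise claims (G4 vs G2, etc.) come from post-hoc comparisons with appropriate correction.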
2. Industry specificity was dramatically higher
Industry Specificity Score (ISS): Rated on a 1-5 scale by an LLM judge, ontology-generated scenarios scored 4.77/5.0, significantly higher than all alternatives.
Friedman test: p = 2 × 10^-6 (highly significant). All pairwise comparisons: p < .001.
This means ontology-generated scenarios didn't just mention the right industry; they tested behaviors specific to that industry's regulatory environment, organizational structure, and operational workflows.
3. The Coverage-Precision Tradeoff
Here's the finding that surprised us most: ontology-generated scenarios had the highest regulatory coverage but the lowest fault-triggering precision at execution time.
| Condition | Design-Stage FDR | Execution-Stage FDR | Gap |
|---|---|---|---|
| G4 (Ontology) | 55% | 40% | +15pp |
| G3 (RAG) | 58% | 52% | +6pp |
| G1 (Baseline) | 56% | 48% | +8pp |
Ontology-generated scenarios are regulatory-comprehensive: they cover a broad surface area of compliance requirements. But they're less effective at triggering specific fault behaviors at runtime. The scenarios are well-structured enough that agents handle them more carefully, paradoxically making faults harder to trigger.
Practical implication: Use ontology for coverage and completeness, but complement with targeted adversarial testing for fault detection. These are different testing objectives that require different strategies.
4. Adversarial coverage is prompt-dependent, not knowledge-dependent
All four conditions achieved 88-91% coverage of our 6-category adversarial taxonomy (prompt injection, data exfiltration, regulatory bypass, role confusion, boundary manipulation, social engineering). The differences were not statistically significant (p = .995).
This tells us something important: adversarial attack diversity comes from how you prompt the scenario generator, not from what domain knowledge you provide. A baseline LLM can generate diverse attack scenarios just as well as an ontology-powered one.
5. The results are model-independent
We replicated across three architecturally different models:
| Finding | Claude Sonnet 4 | Qwen 2.5 72B | Gemma 4 26B |
|---|---|---|---|
| RC advantage (G4 vs G2) | +15.2pp (p=.0006) | +14.4pp (p=.005) | +11.2pp (p=.009) |
| ISS advantage | 4.77 (p<.001) | 4.37 (p<.001) | 4.68 (p<.001) |
| Weaker model, larger RC uplift | +7.7pp | +12.0pp | +3.7pp |
The ontology advantage is not an artifact of a specific model's capabilities. It replicates across commercial and open-source LLMs with different architectures and training data.
And consistent with the Inverse Parametric Knowledge Effect, weaker models showed larger RC uplift from ontological grounding β because they have less parametric regulatory knowledge to draw on.
From Test Scenarios to Safety Certificates
Generating good test scenarios is step one. But enterprises need more than test results: they need a verifiable attestation that an agent has been validated for a specific operational scope.
We propose an Agent Safety Certificate: a machine-verifiable record that binds together:
- Operational Envelope: the bounded space of inputs, outputs, and behaviors the agent is certified for
- Scenario Set: every test scenario that was executed
- Results Matrix: per-scenario pass/fail outcomes
- Verdict: Approved (pass rate >= 95%), Conditional (80-95%, requires human review), or Rejected (< 80%)
- Cryptographic Signature: binding the certificate to a specific agent version and ontology version
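A minimal sketch of what issuing such a certificate could look like, using the verdict thresholds above. The schema and the SHA-256 digest standing in for a real signature (e.g. Ed25519 over the body) are illustrative, not a spec:

```python
import hashlib
import json

def verdict(pass_rate: float) -> str:
    """Map a suite pass rate onto the certificate verdict tiers."""
    if pass_rate >= 0.95:
        return "Approved"
    if pass_rate >= 0.80:
        return "Conditional"  # requires human review before deployment
    return "Rejected"

def issue_certificate(agent_version, ontology_version, results):
    """Bind results, verdict, and versions into one verifiable record."""
    pass_rate = sum(results.values()) / len(results)
    body = {
        "agent_version": agent_version,
        "ontology_version": ontology_version,
        "results": results,            # per-scenario pass/fail matrix
        "verdict": verdict(pass_rate),
    }
    # Stand-in for a cryptographic signature: hash of the canonical body.
    payload = json.dumps(body, sort_keys=True).encode()
    body["signature"] = hashlib.sha256(payload).hexdigest()
    return body

cert = issue_certificate("agent-1.4.2", "onto-2025.09",
                         {"s1": True, "s2": True, "s3": False})
```

Because the signature covers both version identifiers, any change to the agent or the ontology produces a certificate mismatch rather than a silent stale approval.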
The certificate is enforced by a Simulation Gate, an infrastructure-level checkpoint that blocks deployment if the agent doesn't hold a valid certificate for its target environment. This isn't an application-level flag that can be bypassed. It's enforced at the orchestration layer.
When the ontology is updated (new regulation, changed threshold, additional role), the certificate is automatically invalidated and re-verification is triggered. This creates a continuous verification loop tied to the evolving regulatory landscape.
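The gate logic itself is deliberately simple, which is part of what makes it enforceable at the orchestration layer. A sketch, assuming the certificate carries the agent and ontology versions it was issued against (field names are illustrative):

```python
def gate_allows_deploy(cert, agent_version, ontology_version):
    """Simulation Gate sketch: block deploy unless a valid, matching cert exists."""
    if cert is None:
        return False                       # never certified
    if cert["agent_version"] != agent_version:
        return False                       # agent changed since certification
    if cert["ontology_version"] != ontology_version:
        return False                       # ontology updated: cert invalidated
    return cert["verdict"] == "Approved"   # Conditional/Rejected do not deploy

cert = {"agent_version": "agent-1.4.2",
        "ontology_version": "onto-2025.09",
        "verdict": "Approved"}

gate_allows_deploy(cert, "agent-1.4.2", "onto-2025.09")   # deploy proceeds
gate_allows_deploy(cert, "agent-1.4.2", "onto-2025.10")   # blocked: re-verify
```

The ontology-version check is what turns a one-off audit into a continuous loop: publishing a new circular or threshold into the ontology automatically forces re-verification everywhere.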
What This Means for Enterprise AI Teams
If you're deploying AI agents in regulated industries:
1. Your test suite is probably incomplete. Manual scenario authoring misses the long tail of regulatory requirements. Ontology-powered generation covered 46% more regulations than the best alternative, and this was with only 30 scenarios per industry.
2. Coverage and fault detection are different problems. Don't assume that covering more regulations means catching more bugs. Use ontology for systematic coverage, and complement with adversarial red-teaming for fault detection.
3. Test across models. If you're planning to swap LLM providers, your verification results may not transfer. Our cross-model data shows the coverage-precision tradeoff direction is model-dependent.
4. Build verification into your deployment pipeline. Treat agent verification like container security scanning: it happens automatically before deployment, and failures block the release. The Safety Certificate pattern gives you an auditable record for compliance.
5. Connect your ontology to your test infrastructure. The same domain knowledge that grounds your agents can generate your test scenarios. This creates a virtuous cycle: as you improve your ontology, your tests automatically improve too.
The Paper
The full study, covering the experimental design, statistical analysis, cross-model validation, and the Safety Certificate framework, is written up in:
"Toward Verifiable Enterprise AI Agents: Ontology-Powered Simulation and Formal Safety Certification," Thanh Luong Tuan and Abhijit Sanyal (arXiv submission pending; preprint available on request)
This research directly informs the verification architecture in FAOS, where ontology-powered simulation is part of the agent deployment pipeline across 22 industry verticals.
This post is part of our research series sharing empirical findings from the FAOS platform. Previously: why ontology beats RAG for enterprise AI agents.
