We Ran 1,800 Enterprise AI Experiments. Ontology Beat RAG Every Time.

6 min read
Frank Luong
Founder & CEO, FAOSX | CIO 100 Asia 2025 | AI & Digital Transformation Leader

Most enterprise AI teams are building RAG pipelines: retrieval-augmented generation that fetches relevant documents and stuffs them into prompts. It works. But when we tested it against structured ontological grounding across 1,800 controlled experiments, 3 LLM models, and 5 regulated industries, ontology-grounded agents consistently outperformed RAG-augmented ones on the metrics that matter most in enterprise settings: metric accuracy, regulatory compliance, and role consistency.

Here's what the data shows, and why it matters for anyone building AI agents for regulated industries.


The Setup: Four Ways to Ground an AI Agent

We tested four progressively richer grounding strategies on identical enterprise tasks: things like calculating NPL ratios for Vietnamese banks, assessing insurance solvency margins, and evaluating HIPAA compliance in healthcare workflows.

| Condition | What the Agent Gets | Think of It As... |
| --- | --- | --- |
| C1 - Baseline | Task description only | "Figure it out yourself" |
| C2 - RAG | 8 retrieved knowledge chunks | Standard RAG pipeline |
| C3 - Ontology | Structured 3-layer ontology (roles + domain + interactions) | FAOS-style grounding |
| C4 - Ontology + RAG | Both ontology and RAG chunks | Kitchen sink |

Each condition was tested across 50 enterprise tasks spanning FinTech, Insurance, Healthcare, Vietnamese Banking, and Vietnamese Insurance, with 3 replications per condition and 3 different LLM models (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B).

Total: 1,800 controlled experiment runs.

The Results

Four metrics. Three models. One consistent pattern.

Metric Accuracy: +46% improvement (p < .001)

Enterprise decisions depend on precise KPIs: capital adequacy ratios, combined ratios, readmission rates. The baseline LLM gets these right about 32% of the time. With ontological grounding, that jumps to 75%.

Why? These numbers are enterprise-specific. Neither the correct NPL classification threshold for Vietnamese banks (SBV Circular 11/2021) nor the solvency margin requirement for Vietnamese insurers is in any LLM's training data. The ontology supplies exactly what the model can't know.

Role Consistency: +27% improvement (p < .001)

When you ask an AI to act as a Chief Risk Officer, does it actually reason like one? Or does it default to generic advice?

Ontology-grounded agents maintained role-appropriate decision patterns (escalation protocols, KPI priorities, approval chains) with a Kendall's W effect size of .614 (large). This improvement was language-invariant: it worked equally well for English and Vietnamese domains.
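For readers unfamiliar with the statistic: Kendall's W measures agreement among m raters ranking n items, from 0 (no agreement) to 1 (perfect agreement). A minimal pure-Python sketch of the computation (illustrative only, not the study's analysis code):

```python
# Kendall's coefficient of concordance (W), the agreement statistic
# cited above. Minimal sketch; assumes complete rankings with no ties.

def kendalls_w(rankings: list[list[int]]) -> float:
    """rankings: one list of ranks (1..n) per rater, all over the same n items."""
    m = len(rankings)          # number of raters
    n = len(rankings[0])       # number of items ranked
    # Total rank each item received across all raters.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    # W = 12S / (m^2 (n^3 - n)); 0 = no agreement, 1 = perfect agreement.
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Perfect agreement: every rater ranks the three items identically.
print(kendalls_w([[1, 2, 3], [1, 2, 3], [1, 2, 3]]))  # -> 1.0
```

An effect size of .614 on this scale indicates strong, consistent agreement between the agent's decisions and the role-appropriate reference rankings.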

Regulatory Compliance: significant (p = .003)

Ontology-grounded agents cited specific regulations (Basel III, HIPAA, SBV circulars) more accurately than any other condition. The effect was strongest in Vietnamese industries, where regulatory knowledge is severely underrepresented in LLM training data.

Terminological Fidelity: The Surprise

Here's where it gets interesting. For well-known insurance terms like "combined ratio" and "persistency rate," the ontology-grounded agent actually performed worse than the baseline. The LLM already knew these terms, and the injected ontology context displaced that existing knowledge.

We call this the Inverse Parametric Knowledge Effect.

The Inverse Parametric Knowledge Effect

This is the central finding, and it has direct implications for how you architect enterprise AI systems:

The value of ontological grounding is inversely proportional to the LLM's pre-existing knowledge of a domain.

The evidence:

  • Vietnamese industries showed 2x the improvement of English industries (Vietnamese Banking: +29%, Vietnamese Insurance: +28% vs. English average: +12%)
  • Open-source models benefited more than Claude: Qwen improved +22% vs. Claude's +15% (Wilcoxon p = .040)
  • Enterprise-specific metrics improved the most (KPI thresholds, regulatory values) while well-known terminology sometimes regressed

The practical implication: don't inject everything. Inject what the LLM doesn't already know. Enterprise-specific KPIs, regulatory thresholds, role-specific decision protocols, and domain vocabulary in underrepresented languages: that's where ontological grounding delivers 2-5x more value than generic RAG.

Three-Model Replication

We ran the full experiment on three architecturally different models to rule out Claude-specific artifacts:

| Model | Metric Accuracy (p) | Role Consistency (p) | Ontology Lift |
| --- | --- | --- | --- |
| Claude Sonnet 4 | < .001 | < .001 | +15% |
| Qwen 2.5 72B | < .001 | < .001 | +22% |
| Gemma 4 26B | < .001 | < .001 | +20% |

All three models showed statistically significant improvements on Metric Accuracy and Role Consistency. The Inverse PKE held across all three: weaker models (less parametric domain knowledge) benefited more from ontological grounding.

Why Ontology, Not Just Better RAG?

RAG retrieves text chunks. Ontology provides structured domain knowledge organized into three layers:

  1. Role Layer: Who is this agent? What KPIs does it prioritize? When does it escalate? What's its decision-making style?
  2. Domain Layer: What are the industry-specific concepts, metric ranges, regulatory thresholds, and terminology?
  3. Interaction Layer: How do agents hand off work? What approval chains exist? What are the escalation protocols?
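The three layers can be pictured as a structured schema. A hypothetical Python sketch of one ontology record (field names and example values are illustrative; the actual FAOS schema is not published in this post):

```python
# Hypothetical three-layer ontology record. All field names and values
# are illustrative examples, not the FAOS production schema.
from dataclasses import dataclass

@dataclass
class RoleLayer:
    title: str                           # e.g. "Chief Risk Officer"
    kpi_priorities: list[str]            # KPIs this role optimizes for
    escalation_triggers: list[str]       # conditions that force escalation

@dataclass
class DomainLayer:
    metric_thresholds: dict[str, float]  # enterprise-specific KPI limits
    regulations: list[str]               # applicable regulatory sources
    terminology: dict[str, str]          # domain term -> definition

@dataclass
class InteractionLayer:
    handoffs: dict[str, str]             # task -> receiving role
    approval_chain: list[str]            # ordered list of approvers

@dataclass
class Ontology:
    role: RoleLayer
    domain: DomainLayer
    interaction: InteractionLayer

cro = Ontology(
    role=RoleLayer("Chief Risk Officer",
                   ["capital adequacy ratio", "NPL ratio"],
                   ["NPL ratio above regulatory threshold"]),
    domain=DomainLayer({"npl_ratio_max": 0.03},
                       ["SBV Circular 11/2021", "Basel III"],
                       {"NPL": "non-performing loan"}),
    interaction=InteractionLayer({"credit limit override": "Board Risk Committee"},
                                 ["Risk Analyst", "CRO", "Board"]),
)
```

Serialized into the prompt, a record like this carries hierarchy and relations explicitly, which is the structural advantage over prose chunks discussed below.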

RAG can approximate some of this; our C2 condition used well-curated chunks from the same source material. But structured ontology outperformed RAG on relational reasoning (handoffs, metric ranges, role boundaries) because the format itself carries information. A structured role definition communicates hierarchy and priority in a way that a prose paragraph doesn't.

What This Means for Enterprise AI Teams

If you're building AI agents for regulated industries:

  1. Don't treat all context equally. Invest in structuring domain knowledge rather than scaling retrieval volume. Our three-layer ontology (Role, Domain, Interaction) outperformed 8 well-curated RAG chunks.

  2. Prioritize underrepresented domains. If your agents operate in non-English regulatory environments or niche industries, ontological grounding delivers outsized returns because LLM parametric knowledge is weakest there.

  3. Build adaptive injection. The Inverse PKE suggests a smarter architecture: measure what the LLM already knows, and inject only what it doesn't. This avoids the context displacement phenomenon where injected knowledge crowds out correct parametric recall.

  4. Test across models. Our three-model replication shows these effects are architectural, not model-specific. Your grounding strategy should work regardless of which LLM you deploy.
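Point 3 above, adaptive injection, can be sketched as a simple gate: probe the model's parametric knowledge first, then inject only the facts it misses. A hypothetical outline (`ask_model` is a stubbed placeholder for a real LLM call, and the recall check is deliberately crude):

```python
# Hypothetical sketch of adaptive injection (point 3 above): probe what
# the model already knows, inject only what it gets wrong.

def ask_model(question: str) -> str:
    # Stub standing in for a real LLM call; a production probe would
    # query the deployed model directly.
    known = {"What is a combined ratio?": "losses plus expenses over premiums"}
    return known.get(question, "unknown")

def adaptive_inject(facts: dict[str, str]) -> dict[str, str]:
    """Return only the facts the model cannot already recall correctly."""
    to_inject = {}
    for question, expected in facts.items():
        answer = ask_model(question)
        if expected.lower() not in answer.lower():  # crude recall check
            to_inject[question] = expected
    return to_inject

facts = {
    # Well-known term: the model recalls it, so injection would only
    # risk context displacement.
    "What is a combined ratio?": "losses plus expenses over premiums",
    # Enterprise-specific value: not in parametric knowledge, so inject.
    "NPL threshold under SBV Circular 11/2021?": "circular-specific threshold",
}
print(adaptive_inject(facts))
```

The filter keeps well-known terminology out of the prompt, which is exactly the regression the Terminological Fidelity results warned about.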

The Paper

The full study (methodology, statistical analysis, per-task breakdowns, and cross-model replication) is available on arXiv:

"Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: Empirical Evidence for Neurosymbolic Grounding," Thanh Luong Tuan and Abhijit Sanyal, arXiv:2604.00555 (cs.AI)

The three-layer ontology framework described in this study is the same architecture that powers FAOS, our agentic operating system for enterprises. The ontologies tested in the experiment are the same ontologies served to production agents across 22 industry verticals.


This post is part of our research series sharing empirical findings from the FAOS platform. Next up: how we used ontology-powered simulation to verify AI agent safety across 5,400 test scenarios before deployment.