Neurosymbolic Enterprise AI — How FAOS Closes the Precision Gap

Executive summary

Enterprise AI agents fail in regulated industries for a specific reason: large language models can speak the language of a domain, but they cannot be trusted to reason inside its rules. Pure-LLM agents hallucinate metric ranges, conflate regulatory frameworks, and drift between role perspectives. Pure rule-based pipelines stay accurate but cannot handle the language and ambiguity of real enterprise work.

The FAOS platform takes a third path. We pair the language model with a three-layer ontology — a formal, machine-readable description of the roles, domain concepts, and workflows of each industry the agent operates in — and constrain the agent's reasoning against that ontology at runtime. This whitepaper covers the architecture, what we measured, and what changes when you adopt the approach.

The headline findings from RA-3, our controlled 1,800-run study across five regulated industries and three model families:

Ontology-coupled agents outperform ungrounded agents on metric accuracy (p < .001) and role consistency (p < .001), with large effect sizes that replicate across Claude Sonnet 4, Qwen 2.5 72B, and Gemma 4 26B.
The benefit is largest where the model knows least. Vietnamese-localized industries, where the pretraining corpus is sparse, saw more than twice the improvement of English-language industries.
For well-known concepts already covered by the model's training data, structured ontology injection can actively reduce performance. Grounding is not a uniform good. It is a targeted intervention.

The implication for a CIO planning agentic deployments in regulated work: a generic agentic platform paired with retrieval is not the same product as an ontology-grounded platform, and the difference shows up in exactly the cases where compliance and audit costs are highest.

The problem — why enterprise AI agents fail in regulated work

The narrative around enterprise AI has caught up with reality on one point: language models alone are not sufficient for regulated industries. They hallucinate. They confuse adjacent regulatory regimes. They average the perspectives of a CFO and a product manager into a generic analyst voice.

The narrative has not caught up on a harder point: the standard remedies — better prompts, retrieval over a document store, fine-tuning on industry corpora — close part of the gap but leave structural problems intact.

Three failure modes recur in regulated deployments:

Failure mode one — metric drift. Enterprise reasoning depends on quantitative thresholds: a 95% combined ratio in insurance is healthy, 105% is distressed. A bank's non-performing-loan ratio crossing 3% changes the supervisory regime that applies to it. A model that learned about "combined ratio" from public corpora will recite the definition correctly but cannot reliably tell you whether 87% is a good number or a great number for a specific line of business. The gap between definitional knowledge and operating knowledge is where decisions live.

Failure mode two — regulatory conflation. Public regulatory documents are well represented in training data. But a model that has read about Basel III, HIPAA, and the EU AI Act will not reliably distinguish them at the level a compliance officer needs. Worse, in non-English jurisdictions — Vietnamese banking under SBV circulars, Vietnamese insurance under MoF decrees — the model often has not read the source material at all. It will improvise, fluently and incorrectly.

Failure mode three — role drift. A CFO reviewing a deal asks different questions than a product manager reviewing the same deal. A general counsel reviewing the same deal asks different questions still. An LLM with a system prompt saying "you are a CFO" approximates this but does not anchor it. Over a long session, role-specific reasoning patterns drift toward a generic helpful-assistant voice.

What current solutions miss is not better prose. It is a formal, machine-readable specification of the domain the agent is operating in. Retrieval-augmented generation puts the right documents in front of the model, but the model still has to interpret what counts as a metric, what counts as a regulation, and what counts as a role-appropriate response. The interpretation step is where the regulated-industry risk concentrates.

The FAOS approach — ontology as a runtime contract

The FAOS platform addresses these failure modes by giving the agent a contract to operate against. The contract is the enterprise ontology — a formal description of three layers of the domain — and the agent's reasoning is constrained against it at every stage of execution.

Neurosymbolic AI is the engineering term for this combination. Neural reasoning supplies the flexibility, the language understanding, and the generalization. Symbolic structure supplies the constraints, the relationships, and the verifiability. The point of the integration is not to replace either side. It is to put them in the right relationship.

Three layers, one schema

A FAOS enterprise ontology is a triple — Role, Domain, Interaction — instantiated per industry but sharing the same schema across all 25 verticals the platform serves.

Role ontology. Encodes how specific organizational roles think. For each role, it specifies decision patterns, KPI focus, communication style, expertise domains, and approval authority. A product manager's role definition prioritizes ARR, NPS, feature adoption, and churn rate as KPIs and frames responses in executive register. A general counsel's role definition does not.
Domain ontology. Captures industry-specific concepts, their definitions, their relationships, and the regulations that apply. Verticals are organized hierarchically — fintech.payments.card_networks inherits from fintech.payments inherits from fintech — so an agent operating in a sub-domain automatically has access to parent-domain concepts. Each metric carries its healthy range and benchmarks. Each regulatory framework names the jurisdictions it applies to.
Interaction ontology. Formalizes organizational workflows as typed handoff patterns between roles, approval chains, and escalation paths. A design-to-development handoff specifies who hands to whom, on what trigger, with what artifacts, requiring whose approval.

The schema is industry-invariant. The content is industry-specific. A new industry plugs into FAOS by supplying the content for the three layers, not by writing new code.
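The separation of schema from content can be sketched in a few lines. This is a minimal illustration, not the FAOS schema: every class name, field name, and numeric range below is a hypothetical placeholder, and the only behaviour taken from the text is that a sub-domain inherits concepts from its parent domains along a dotted path.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Metric:
    name: str
    healthy_min: float  # placeholder bounds, not real benchmarks
    healthy_max: float


@dataclass
class RoleOntology:
    role: str
    kpi_focus: list[str]
    communication_style: str
    approval_authority: list[str] = field(default_factory=list)


@dataclass
class DomainNode:
    path: str  # dotted vertical path, e.g. "fintech.payments"
    metrics: dict[str, Metric] = field(default_factory=dict)
    # regulatory framework -> jurisdictions it applies to
    regulations: dict[str, list[str]] = field(default_factory=dict)


@dataclass
class Handoff:
    from_role: str
    to_role: str
    trigger: str
    artifacts: list[str]
    approver: str


def ancestors(path: str) -> list[str]:
    """Return the dotted path and all of its parents, root first."""
    parts = path.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def resolve_metrics(path: str, nodes: dict[str, DomainNode]) -> dict[str, Metric]:
    """Merge metrics root-to-leaf so a child definition overrides its parent."""
    merged: dict[str, Metric] = {}
    for p in ancestors(path):
        node = nodes.get(p)
        if node is not None:
            merged.update(node.metrics)
    return merged


# Content is data, not code: adding a vertical means adding entries here.
nodes = {
    "fintech": DomainNode(
        "fintech", metrics={"parent_metric": Metric("parent_metric", 0.0, 1.0)}
    ),
    "fintech.payments": DomainNode(
        "fintech.payments", metrics={"payments_metric": Metric("payments_metric", 0.0, 1.0)}
    ),
    "fintech.payments.card_networks": DomainNode(
        "fintech.payments.card_networks",
        metrics={"network_metric": Metric("network_metric", 0.0, 1.0)},
    ),
}

inherited = resolve_metrics("fintech.payments.card_networks", nodes)
print(sorted(inherited))
```

An agent resolved against the card_networks node sees all three metrics, while an agent resolved against fintech sees only the parent metric.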

Where the ontology constrains the agent

FAOS uses the ontology at three coupling points in the agent's execution. RA-3 names these input-side, process-side, and output-side coupling, and characterizes the current state of the platform — and of the broader industry — as predominantly input-side.

Input-side coupling (deployed). Before the LLM reasons, the system loads the relevant ontology for the agent's tenant, role, and domain, and serializes it into priority-ordered context: role first, then domain, then interaction. A token budget caps the injection at a default 2,000 tokens. The same ontological domain hierarchy filters which tools the agent can see — a banking agent does not have healthcare tools available in its discovery surface. Governance thresholds enforce that regulated domains can only be served by skills that meet a minimum quality bar.
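A rough sketch of input-side coupling, under stated assumptions: the word-count token estimate stands in for a real tokenizer, the function names are hypothetical, and the rule that a tool is visible when it is registered at the agent's domain or one of its ancestors is one plausible reading of the hierarchy filter, not the documented FAOS behaviour.

```python
def estimate_tokens(text: str) -> int:
    # Crude placeholder: whitespace word count. A real system would use
    # the tokenizer of the target model.
    return len(text.split())


def build_context(sections: list[tuple[str, str]], budget: int = 2000) -> str:
    """Serialize (layer, text) sections in priority order under a token budget.

    Lines are added until the budget is exhausted. Once a layer is cut
    short, lower-priority layers are dropped entirely.
    """
    blocks = []
    remaining = budget
    for layer, text in sections:
        kept = []
        truncated = False
        for line in text.splitlines():
            cost = estimate_tokens(line)
            if cost > remaining:
                truncated = True
                break
            kept.append(line)
            remaining -= cost
        if kept:
            blocks.append(f"[{layer}]\n" + "\n".join(kept))
        if truncated:
            break
    return "\n\n".join(blocks)


def visible_tools(agent_domain: str, tools: dict[str, str]) -> list[str]:
    """Return tools registered at the agent's domain or any ancestor of it."""
    return sorted(
        name
        for name, domain in tools.items()
        if agent_domain == domain or agent_domain.startswith(domain + ".")
    )


# Priority order from the text: role first, then domain, then interaction.
context = build_context(
    [("role", "a b c"), ("domain", "d e f g"), ("interaction", "h i")],
    budget=5,
)
tools_seen = visible_tools(
    "banking.retail",
    {"kyc_check": "banking", "card_limits": "banking.retail", "claims_lookup": "healthcare"},
)
print(context)
print(tools_seen)
```

With a budget of five tokens only the role layer survives, and the healthcare tool never appears in the banking agent's discovery surface.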

Process-side coupling (partial). During execution, autonomy gates block sensitive operations until the appropriate role approves them. A quality-judge node scores agent outputs before they return to the user, and routes low-confidence results to escalation. These mechanisms enforce process-level constraints but do not yet validate outputs against the ontology itself.
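The two process-side mechanisms can be illustrated with a small sketch. The names, the approval model, and the 0.7 threshold are assumptions made for this example. The source describes the behaviour only at the level of "block until the appropriate role approves" and "route low-confidence results to escalation".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    name: str
    # Role whose approval is required; None means the action is not gated.
    required_authority: Optional[str] = None


def gate(action: Action, approvals: set[str]) -> bool:
    """Allow the action only if it is ungated or the required role has approved."""
    return action.required_authority is None or action.required_authority in approvals


def route(judge_score: float, threshold: float = 0.7) -> str:
    """Route a judged output: return it to the user or escalate it."""
    return "return_to_user" if judge_score >= threshold else "escalate"


wire_transfer = Action("wire_transfer", required_authority="cfo")
lookup = Action("balance_lookup")

print(gate(wire_transfer, approvals=set()))    # blocked until the CFO approves
print(gate(wire_transfer, approvals={"cfo"}))  # allowed after approval
print(gate(lookup, approvals=set()))           # never gated
print(route(0.55))                             # below threshold, so escalated
```

Note that neither function consults the domain ontology, which is the limitation the paragraph above describes.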

Output-side coupling (proposed). The next architectural layer — and a primary direction of FAOS research — is validation: checking that the agent's response references only defined domain terms, cites metric values within defined ranges, follows handoff patterns specified in the interaction ontology, and references only applicable regulatory frameworks. This is formalized in RA-3 as a definition of ontological compliance and proposed as an OntologyValidator component, but not yet implemented in the platform.
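Because the OntologyValidator is proposed rather than implemented, the following is a purely hypothetical sketch of what three of the four checks could look like. It assumes that claims have already been extracted from the agent's response into structured form, which is itself a hard problem left out here, and it omits the handoff-pattern check. The metric range is a placeholder, not a benchmark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    kind: str  # "term", "metric", or "regulation"
    name: str
    value: Optional[float] = None


@dataclass
class Ontology:
    terms: set[str]
    metric_ranges: dict[str, tuple[float, float]]
    regulations: dict[str, set[str]]  # framework -> applicable jurisdictions


def validate(claims: list[Claim], onto: Ontology, jurisdiction: str) -> list[str]:
    """Return a list of human-readable violations; empty means compliant."""
    violations = []
    for c in claims:
        if c.kind == "term":
            if c.name not in onto.terms:
                violations.append(f"undefined term: {c.name}")
        elif c.kind == "metric":
            bounds = onto.metric_ranges.get(c.name)
            if bounds is None:
                violations.append(f"undefined metric: {c.name}")
            elif c.value is None or not (bounds[0] <= c.value <= bounds[1]):
                violations.append(f"metric out of range: {c.name}={c.value}")
        elif c.kind == "regulation":
            if jurisdiction not in onto.regulations.get(c.name, set()):
                violations.append(f"regulation not applicable in {jurisdiction}: {c.name}")
    return violations


onto = Ontology(
    terms={"combined ratio"},
    metric_ranges={"combined_ratio": (0.0, 2.0)},  # placeholder bounds
    regulations={"HIPAA": {"US"}},
)
claims = [
    Claim("term", "combined ratio"),
    Claim("metric", "combined_ratio", 0.95),
    Claim("metric", "combined_ratio", 3.5),
    Claim("regulation", "HIPAA"),
]
violations = validate(claims, onto, jurisdiction="VN")
print(violations)
```

A response that passes input-side coupling can still produce both violations shown here, which is the asymmetry the validator is meant to close.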

This asymmetry, where input is constrained and output is not, is the structural critique RA-3 makes of current enterprise AI systems, including our own. An agent can receive perfect ontological context and still emit constraint-violating output. Closing that loop is the work of levels L4 and L5 in the maturity model.

What the research shows

RA-3 reports a controlled within-subject experiment designed to isolate the contribution of ontological grounding from the contribution of retrieval and from baseline LLM capability. Four conditions, five industries, three models, 1,800 runs.

Study design

Four grounding conditions corresponding to maturity levels in the coupling taxonomy:

C1 — Ungrounded. System prompt only, no domain context.
C2 — RAG-only. Unstructured text chunks extracted from the same ontology blueprints, injected as flat reference text (~2,000 tokens).
C3 — Ontology-coupled. Structured three-layer injection via the FAOS PromptInjector (~2,800–3,200 tokens, reflecting structural overhead from typed property-value pairs and metric range definitions).
C4 — Ontology plus process. C3 plus a post-generation quality judge that scores output and flags sub-threshold responses for escalation.

Fifty tasks across five industries — FinTech under BSA-AML, US Insurance, Healthcare under HIPAA/CMS, Vietnamese Banking under SBV regulations, and Vietnamese Insurance under MoF regulations. Each task ran under all four conditions with three repetitions, on each of three models: Claude Sonnet 4 as the primary, Qwen 2.5 72B and Gemma 4 26B for replication. An independent LLM judge scored every response against ontology-derived ground truth using metric-specific rubrics.
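The 1,800-run total follows directly from the design. As a quick check, the full grid can be enumerated (task and repetition labels here are arbitrary placeholders):

```python
from itertools import product

tasks = range(50)                         # fifty tasks across five industries
conditions = ["C1", "C2", "C3", "C4"]     # four grounding conditions
repetitions = range(3)                    # three repetitions per cell
models = ["Claude Sonnet 4", "Qwen 2.5 72B", "Gemma 4 26B"]

runs = list(product(tasks, conditions, repetitions, models))
runs_per_model = len(runs) // len(models)

print(len(runs))        # 50 * 4 * 3 * 3
print(runs_per_model)   # 50 * 4 * 3
```

Each model therefore contributes 600 runs, and each task contributes 12 runs per model.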

The deliberate inclusion of Vietnamese-localized industries was the methodological hinge of the study. Vietnamese banking and insurance test grounding in a domain where LLM training is sparse — SBV circulars, MoF decrees, bilingual terminology — and where ontological grounding would matter most if our framework's predictions were correct.

Headline results

The 1,800-run study delivered statistically robust evidence on two of four metrics and model-sensitive evidence on a third.

Metric accuracy. Ontology coupling produces a large, highly significant improvement over ungrounded agents on the primary model (Friedman p < .001, Kendall's W = .460). The cross-model replication is unusually clean: every model shows pairwise C1→C3 significance at p < .001.
Role consistency. A larger effect still. Friedman p < .001, W = .614 on the primary model, replicating across all three models with pairwise C1→C3 significance ranging from p = .003 (Claude) to p < .001 (Gemma).
Regulatory compliance. Omnibus significance on the primary model (p = .003, W = .318), but pairwise C1→C3 is model-dependent: significant on Qwen ($p_{corr}$ = .019), not significant on Claude ($p_{corr}$ = .324), approaching significance on Gemma (p = .051). Regulatory frameworks are well represented in pretraining, so the lift depends on how strong the model's prior coverage is.
Terminological fidelity. Omnibus significance, but pairwise C1→C3 not significant on any model. Well-established terminology is already parametrically encoded, regardless of architecture.
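For readers unfamiliar with the statistics quoted above, the sketch below shows how a Friedman statistic and Kendall's W relate, using the textbook formulas without tie correction. The scores are invented for illustration and are not RA-3 data. Each row is one task scored under conditions C1 to C4.

```python
def average_ranks(row: list[float]) -> list[float]:
    """Rank values within a row (1 = lowest), averaging ranks over ties."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for idx in order[i : j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks


def friedman_and_w(scores: list[list[float]]) -> tuple[float, float]:
    """Return (Friedman chi-square, Kendall's W) for an n-by-k score table."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    w = chi2 / (n * (k - 1))  # W is chi-square rescaled to the range [0, 1]
    return chi2, w


# Invented scores: 4 tasks (rows) under 4 conditions (columns C1..C4).
scores = [
    [0.40, 0.50, 0.70, 0.80],
    [0.30, 0.60, 0.70, 0.75],
    [0.50, 0.45, 0.80, 0.85],
    [0.20, 0.40, 0.60, 0.90],
]
chi2, w = friedman_and_w(scores)
print(round(chi2, 3), round(w, 3))
```

W near 1 means the conditions are ranked almost identically across tasks, and W near 0 means no consistent ordering. The values of .460 and .614 reported above sit in the range conventionally read as a large effect.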

The Vietnamese-localized industries finding

The result that warrants a separate paragraph is the geography of the improvement. Vietnamese Banking and Vietnamese Insurance showed C1→C3 deltas of +.29 and +.28 respectively. The English-language industry average was +.12. Vietnamese-localized domains therefore improved more than twice as much as English-language domains (a mean delta of about +.285 against +.12, a ratio of roughly 2.4), and the pattern replicated in all three model families.

RA-3 names this the Inverse Parametric Knowledge Effect (Inverse PKE): the value of ontological grounding is inversely proportional to the LLM's pre-existing parametric knowledge of the domain. The result has the structure of a falsifiable prediction. Domains under-represented in pretraining should benefit most. They did.

The same pattern shows up in a second, independent signal. Following recent work on semantic entropy, RA-3 computed the change in score distribution entropy under grounding. Ontological grounding reduced entropy on 11 of 12 metric-by-model combinations. The single exception was metric accuracy on Claude — precisely where Claude's strong parametric knowledge of industry benchmarks creates destructive interference with the injected context. Under a binomial null, 11 of 12 constructive outcomes is significant at p = .003.
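The quoted p-value can be reproduced with a short calculation. The sketch assumes a one-sided test against a fair-coin null (each combination equally likely to go up or down). The source does not state the sidedness, but the one-sided tail matches the reported .003.

```python
from math import comb


def binomial_upper_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 11 or more entropy reductions out of 12 metric-by-model combinations.
p_value = binomial_upper_tail(11, 12)
print(round(p_value, 4))  # (12 + 1) / 4096
```

The tail contains only two outcomes, exactly 11 and exactly 12 reductions, which together account for 13 of the 4,096 equally likely sign patterns.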

What the paper does not claim

The Inverse PKE result has a sharp boundary. RA-3 demonstrates a domain-level and model-level effect. It does not demonstrate that ontology uniformly beats retrieval — on the primary model's grand means, RAG matched or beat ontology on three of four metrics. The categorical advantages of ontology over RAG are structural (relationships, composability, formal verifiability), not lexical. The paper is explicit on this distinction.

RA-3 also reports a structural critique of its own scope. The experiment is single-agent and within-subject. The output-side validation framework is proposed, not implemented. Generalization beyond the five tested industries requires more domains. None of these caveats undermine the headline findings, but they shape what the findings authorize you to conclude.

What this means for your enterprise

The practical implications fall into three categories: architecture, operations, and where the approach is not the right fit.

Architectural implications

If you are evaluating an agentic platform for regulated work, the relevant question is not whether the platform supports retrieval. Retrieval is table stakes. The relevant question is whether the platform has a formal, machine-readable specification of your domain that the agent's reasoning is constrained against.

The specification has three properties worth checking. Is the schema separated from the content — that is, can you add a new industry without writing code? Is the vocabulary of the schema (roles, metrics with ranges, regulations with applicability, handoff patterns with triggers) rich enough to express the constraints you care about? Is the runtime architecture set up to enforce those constraints at multiple coupling points, not only at prompt injection?

An ontology that exists only as documentation does not constrain agent behaviour. An ontology that exists as a runtime artifact — loaded by a context resolver, consulted by a tool discovery layer, surfaced to a quality judge — is a different thing.

Operational implications

Three operational changes follow.

Ontology becomes an owned artifact, not a vendor's secret. The teams that own a domain — your insurance product team, your compliance team, your banking risk team — need to be co-authors on the ontology for their domain. The blueprint cannot stay inside a platform team. A platform that does not give you authorable, versioned, audit-tracked access to the ontology of your own business is treating the domain specification as a black box, and you will pay for that opacity downstream.

Selective injection is more valuable than blanket injection. The terminological fidelity (TF) regression result, in which well-known concepts got worse under ontology injection, has a clean operational meaning. A mature deployment estimates how much the model already knows about each concept and injects only what would add value. The token budget freed up by skipping well-known concepts is available for the enterprise-specific knowledge that actually matters.
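One way to picture selective injection is as a filter followed by a greedy fill. Everything in this sketch is an assumption: the familiarity scores are hypothetical numbers that a deployment would have to estimate (for example by probing the model without grounding), and the 0.8 cutoff and concept names are placeholders.

```python
def select_concepts(
    concepts: list[tuple[str, float, int]],
    budget: int,
    familiarity_cutoff: float = 0.8,
) -> list[str]:
    """Pick concepts to inject, least familiar first, within a token budget.

    Each concept is (name, estimated model familiarity in [0, 1], token cost).
    Concepts at or above the cutoff are skipped as already well known.
    """
    candidates = [c for c in concepts if c[1] < familiarity_cutoff]
    candidates.sort(key=lambda c: c[1])
    chosen, used = [], 0
    for name, _familiarity, cost in candidates:
        if used + cost <= budget:
            chosen.append(name)
            used += cost
    return chosen


concepts = [
    ("combined_ratio", 0.9, 100),       # well known: skipped
    ("sbv_circular_term", 0.1, 300),    # sparse in pretraining: injected first
    ("mof_decree_term", 0.2, 300),
    ("npl_ratio", 0.5, 200),            # useful, but does not fit the budget
]
selected = select_concepts(concepts, budget=700)
print(selected)
```

The well-known concept costs nothing because it is never injected, and the budget goes to the two concepts the model is least likely to know.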

Vietnamese (and other under-represented language) work has a higher floor and a higher ceiling. If your operations include non-English regulated work (Vietnamese banking, Indonesian insurance, Thai healthcare), ontological grounding is not a "nice to have." Parametric coverage is structurally insufficient, and an agent without ontology will fail predictably on the regulated concepts of those jurisdictions. Grounding becomes part of the operational requirement.

When ontology grounding is not the right tool

Three honest counter-cases.

Single-domain English-language work with strong public coverage. If your entire agentic surface is English and concentrated in a domain that the model already knows well — a generalist marketing assistant, an English-language customer-support agent, a coding agent — the cost of building and maintaining ontology content may exceed the marginal accuracy benefit. RA-3's English-only industries showed real but smaller gains.

Discovery-phase exploration. Early in a use case, before requirements are clear, an LLM-only agent is faster to iterate. Ontology adds value once you know what the constraints are and need to enforce them. Premature formalization is a real cost.

Tasks where lexical recall dominates and structure is incidental. Pure terminology-lookup tasks may be better served by RAG with curated chunks than by structured ontology injection. The format overhead — typed property-value pairs, section headers — eats token budget that flat prose does not.

A platform decision should be made with these counter-cases in mind. The framing is not "ontology good, RAG bad." It is "ontology is a categorically different capability that solves problems retrieval cannot solve, and the gap shows up in specific places."

Implementation guide

At a high level, an enterprise adopts ontology-grounded agentic AI in three phases. The phasing assumes you are starting from a working agent platform — agents in production, retrieval in place, observability set up. Greenfield deployments are a separate playbook.

Phase one — pick a starting domain (first 90 days)

The right starting domain has three properties. It is a regulated workflow where errors are expensive. It has a clear domain owner — a head of compliance, a chief actuary, a head of clinical operations — willing to invest in the ontology authoring work. It is small enough to complete an end-to-end pilot in 90 days.

A representative scope for the 90-day pilot: one industry, one to three roles, the top 20 to 40 domain concepts, the top three to five regulatory frameworks, the most-trafficked handoff patterns. The deliverable is a working ontology-grounded agent for that scope plus a baseline measurement against an ungrounded comparison agent.

Success at 90 days looks like: agent outputs score measurably higher than the ungrounded baseline on a pre-registered evaluation set, the domain owner can navigate and edit the ontology without engineering help, and the evaluation methodology is reproducible.

Phase two — extend to adjacent domains and add process-side coupling (next 90 days)

Two parallel tracks.

Track A — domain extension. Add adjacent verticals using the same schema. Each new vertical should take less time than the previous one, because the schema is reused. Track time-to-add as a leading indicator. If it is not declining, the schema or the tooling is wrong.

Track B — process-side coupling. Wire the ontology into governance: autonomy gates that respect the role's approval authority, escalation paths drawn from the interaction layer, quality judging that consults the domain layer. This is the work that takes you from L1-L2 to L3 in the maturity model.

Phase three — output-side validation and ontology operations (next 6 months)

Two pieces of work that distinguish a research-grade deployment from a production-grade one.

Output validation. Build the validator. Translate domain ontology portions into OWL or an equivalent description-logic representation. Run the validator on agent outputs. Tune the latency-quality trade-off. This is the L4 work that closes the asymmetric coupling problem.

Ontology operations. Decide who owns ontology change management. Decide how new concepts surfaced by agents during operation get reviewed and added. Decide how regulatory changes propagate into ontology updates. The operational discipline is what keeps the ontology from rotting over time.

Twelve months in, a successful deployment is operating with measurably grounded agents in three to five domains, with output validation running on the most regulated of them, and with a published internal cadence for ontology change management. The Vietnamese-localized improvement should be visible in your own data if you are running non-English domains. If it is not, the ontology content for those domains is incomplete, and the gap is the next place to invest.

Frequently asked questions

Is this just retrieval-augmented generation with extra steps? No. RAG retrieves documents to add to context. Ontology coupling supplies a runtime contract — formal relationships, type hierarchies, regulatory applicability, role-specific decision patterns — that the agent's tool discovery, governance checks, and (in future) output validation are all built against. The two are complementary, not equivalent. The RA-3 study includes a curated RAG baseline; on metrics requiring relational reasoning, structured ontology shows a categorically different profile.

Does this lock me into a particular vendor's ontology format? The schema FAOS uses is a triple of Role, Domain, and Interaction layers, instantiated in declarative configuration. We have not standardized on a single industry format because no such standard yet exists for enterprise agent ontologies — the closest industry parallels (TOVE, FIBO, the Agentic Ontology of Work) target different problems. We design the schema to be exportable and the content to be authorable in standard formats. Lock-in concerns should be raised with any vendor, including us, by reading their actual schema and asking who can author and audit it.

What is the maintenance overhead? Material but bounded. The main ongoing work is keeping regulatory frameworks current as they change, adding new concepts as the business surfaces them, and reviewing handoff patterns as workflows evolve. In practice, regulated industries already do this work — compliance teams already track regulatory change. The ontology becomes the artifact that work feeds into, rather than a new burden.

Can I build this in-house instead of adopting a platform? You can. The architecture in RA-3 is described at a level that lets a sufficiently resourced internal team replicate it. The trade-off is the same as for any platform decision. Building gives you control and independence from vendor lock-in. Buying gives you the existing ontology content across 25 industries, the runtime that has been hardened against multi-tenant production load, and the research pipeline that produces the next layer (output validation) as it matures. The right answer depends on your team's depth in formal ontology engineering and on how many domains you need to serve.

How does this interact with our existing agentic investments (CrewAI, AutoGen, LangChain, our own orchestration)? Cleanly, in principle. The ontology layer is conceptually separable from the orchestration layer. RA-3 implements both in FAOS, but the architectural ideas are portable. The realistic answer is that retrofitting ontology coupling into an existing agent framework is real work, and the value of doing it depends on whether the framework's prompt assembly, tool discovery, and quality gates are configurable enough to consume an ontology at runtime.

What to do next

Two paths forward, depending on where you are.

Read the research. RA-3 is available on arXiv under the reference below. The paper supplies the empirical rigour behind this whitepaper. If you are a principal engineer or research-minded architect, the methodology section and the cross-model replication results are worth the read.

Luong, T. T., and Sanyal, A. (2026). Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents. arXiv preprint 2604.00555.

Discuss application to your stack. If you are evaluating ontology-grounded agentic AI for a specific regulated domain — financial services, insurance, healthcare, or a non-English jurisdiction — book a working session with the FAOS team. We are happy to walk through your domain, our existing ontology content for the relevant vertical, and a candidate 90-day pilot scope.

