Entropy-Guided Ontology Design — Predicting Grounding Lift Before You Build

Executive summary

Enterprise ontology programs have a budgeting problem. The teams that own a domain — banking, insurance, healthcare, manufacturing — invest months building a structured description of their concepts, regulations, roles, and workflows, in the expectation that the resulting context layer will lift the performance of the AI agents they deploy. The grounding lift is the operational payoff. Today, the lift is discovered after the investment. By the time you can measure whether the ontology was worth building, the budget is already spent.

RA-12 reports a design-time predictor for that payoff. The structural entropy of an ontology, a measurement made on the ontology itself before any agent experiment is run, predicts the grounding lift the ontology will produce across LLM agents: on a 15-cell pilot study across five industries and three model families, interaction-layer entropy correlates with measured lift at Spearman r = 0.811. The signal is robust to leave-one-industry-out cross-validation (r = 0.786, RMSE 0.060) and is dominated by the interaction layer of the ontology, the part that describes how roles hand off, escalate, and route work, which alone is at least as strong a predictor as the composite signal.

The implication for a CIO sequencing an ontology investment program: you can rank ontology investments by predicted grounding lift before you spend the implementation budget. A multi-industry roadmap that previously could be assessed only after every vertical had been built and measured can now be sequenced from a measurement that costs minutes per ontology, not months. The signal is design-time, not operational; it tells you which ontology to build next, not which concepts to retire.

This whitepaper covers what RA-12 measures, what the study shows, what changes when you adopt the discipline, and where the signal is not the right tool.

The problem — why ontology investments often fail to deliver

The narrative around enterprise AI agents has settled on a working architecture: language models do the reasoning, a structured context layer grounds them in the facts of the domain, retrieval bridges the gap when neither covers a question. The architecture is sound. The next question — which CIOs reach within a year of adopting it — is harder. Where should the next ontology investment go?

Three symptoms recur in enterprise ontology programs that have been running long enough to produce evidence.

Symptom one — the marginal-lift surprise. A team builds an ontology for an industry they assumed would benefit, runs the agent pilot, and measures a smaller lift than expected. The investment shipped on time, the artifacts are well-formed, the domain experts validated the content, and the agent improvement is real but underwhelming. The team has no clean way to determine whether the ontology design was wrong, whether the chosen industry was already well-covered by the model's training data, or whether the pilot's evaluation methodology was too narrow.

Symptom two — the wide spread across industries. A program that has shipped ontologies for several verticals finds that lift varies by an order of magnitude across them. Some industries deliver a clean step-change in agent quality; others move marginally. The pattern is real but unexplained, and the program team has no a-priori basis on which to rank the next five candidate verticals by likely return.

Symptom three — the inability to defend the next budget request. The CIO is asked to justify a multi-year ontology roadmap across the regulated parts of the business. The team has anecdotal lift data on three or four verticals from the existing program, no controlled comparison, and no quantitative basis for ranking the remaining industries. The roadmap ends up being defended on consensus and gut feel rather than on a measurable signal.

What is missing from current practice is not better ontology authoring or richer evaluation. It is a quantitative signal, computable on the ontology artifact itself, that predicts whether the agent grounding lift will be worth the build. RA-12 introduces such a signal.

The FAOS approach — structural entropy as a design-time signal

The FAOS approach to enterprise ontologies is to organize each industry's domain description into three layers and to measure the information content of each layer before any agent runs against it. The measurement is structural entropy. The signal is the structural entropy of the ontology itself, not of the language model, not of the model's outputs, not of any runtime telemetry. It is a property of the artifact you have already built or are about to build.

What structural entropy measures

Structural entropy is an information-theoretic measure originally developed for graphs and since extended to multi-relational graphs, the kind of structure an enterprise ontology is. The intuition is simple. An ontology that organizes its concepts, roles, and relationships into a rich, balanced, well-typed structure carries more information than an ontology that is flat, lopsided, or sparse. Structural entropy quantifies that distinction in bits. The richer the organized structure, the higher the entropy.

This is not the same as ontology size. A large, deep ontology can have low structural entropy if its content is concentrated in a few sub-branches. A smaller, well-balanced ontology can have higher structural entropy. The point of using an information-theoretic measure rather than a count is to capture the organized information content — the part that a downstream agent can actually exploit — separately from the raw artifact size.
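The size-versus-entropy distinction can be illustrated with a minimal sketch. It uses plain Shannon entropy over the share of concepts in each top-level branch, which is an assumption made for illustration; the exact RA-12 formulation is not reproduced here, and the branch counts are hypothetical.

```python
import math

def shannon_entropy_bits(counts):
    """Shannon entropy, in bits, of the distribution implied by non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # p * log2(1/p) is the contribution of one branch holding share p.
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

# Hypothetical concept counts per top-level branch (not taken from any FAOS ontology).
large_lopsided = [180, 10, 5, 5]   # 200 concepts, 90% of them in one branch
small_balanced = [12, 12, 12, 12]  # 48 concepts, spread evenly

h_large = shannon_entropy_bits(large_lopsided)
h_small = shannon_entropy_bits(small_balanced)
print(f"large, lopsided: {h_large:.2f} bits; small, balanced: {h_small:.2f} bits")
```

In this toy case the 48-concept ontology scores 2.00 bits and the 200-concept ontology about 0.62 bits, which is the sense in which a count and an entropy can disagree.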

The four entropy measurements of the FAOS ontology

A FAOS enterprise ontology decomposes into four entropy measurements. Three of them correspond to the layers of the ontology schema; the fourth is a composite.

Role-layer entropy. Captures how roles are partitioned into organizational groups (C-suite, specialists, domain experts) and the heterogeneity of attribute richness across roles. It measures how informative the role structure is on its own.
Domain-layer entropy. Captures the information content of the domain concepts, regulations, and KPIs — both their type diversity and their volume contribution. It measures the richness of the concept layer.
Interaction-layer entropy. Captures the information content of the workflow patterns — handoff types, description detail, and pattern count. It measures the richness of the runbook layer that connects roles over domain concepts.
Composite entropy. A weighted combination, calibrated against measured agent lift. The calibrated composite drops the role-layer contribution at pilot scale (see below) and weights interaction-layer entropy most heavily.
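As a rough sketch of how the four numbers relate, the snippet below treats the composite as a linear weighted sum of the three layer entropies. Both the linear form and the weight values are assumptions for illustration: the weights are placeholders, not the calibrated RA-12 values, chosen only to respect the two properties stated above (no role-layer contribution, interaction layer weighted most heavily).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerEntropies:
    role: float
    domain: float
    interaction: float

# Placeholder weights (hypothetical): role dropped, interaction heaviest.
ILLUSTRATIVE_WEIGHTS = {"role": 0.0, "domain": 0.3, "interaction": 0.7}

def composite_entropy(layers, weights=ILLUSTRATIVE_WEIGHTS):
    """Linear weighted combination of the three layer entropies (assumed form)."""
    return (
        weights["role"] * layers.role
        + weights["domain"] * layers.domain
        + weights["interaction"] * layers.interaction
    )

def four_entropy_report(layers, weights=ILLUSTRATIVE_WEIGHTS):
    """Return the three layer entropies plus the composite as one record."""
    return {
        "role": layers.role,
        "domain": layers.domain,
        "interaction": layers.interaction,
        "composite": composite_entropy(layers, weights),
    }

example = LayerEntropies(role=1.5, domain=2.0, interaction=3.0)
report = four_entropy_report(example)
print(report)
```

Keeping the three layer values next to the composite in one record is what makes the diagnostic reading possible: a thin layer stays visible even when the composite looks healthy.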

Reporting four entropies per ontology rather than one scalar matters for two reasons. The first is diagnostic — you can see which layer of the ontology is the information-rich one and which is thin. The second is allocation — when authoring resources are constrained, the diagnostic tells you which layer of an ontology is worth deepening for marginal lift.

Why the interaction layer dominates the predictive work

The RA-12 study tested all three layer entropies and the composite as predictors of measured grounding lift. Interaction-layer entropy alone matches the strength of the best composite. Domain-layer entropy adds incremental predictive power on top of it. Role-layer entropy shows no measurable correlation with lift in the pilot ontologies: because they all declare a single organizational group, the role-layer measure collapses to an attribute-heterogeneity signal.

The interaction-layer finding is the most operationally significant result in the paper. The runbook of an industry — how its roles trigger, hand off, escalate, and approve — turns out to carry the structural information that agents can most reliably exploit. This is consistent with a practitioner intuition: agents in regulated work fail more often on workflow-shaped tasks than on definitional ones, because workflow knowledge is harder to recover from pretraining corpora than concept definitions are. The handoff structure that an ontology architect spends comparatively little time on is the part that most predicts lift.

A production caveat applies. The single-organizational-group structure of the pilot ontologies makes the role-layer null specific to that regime. Production ontologies have richer role-partition structure, and a non-null role-layer contribution may emerge once direct lift measurements on production ontologies are available. The current claim is scoped to the pilot regime.

What design-time means

The signal is computed on the ontology artifact, in advance of running agent experiments against it. It does not require a deployed agent. It does not require traffic. It is a measurement of the build, before the build proves itself. That is the operational property that makes it valuable for investment ranking. A CIO does not need to wait through a pilot quarter to find out whether an ontology investment was worth making.

The same property is the boundary of the signal's claim. RA-12 is not an online method. It is not measured at runtime. It does not change as the ontology is used. Operational discipline for an ontology in production — version control, change management, regulatory updates — is a separate concern, addressed by the implementation guidance below rather than by the structural entropy signal itself.

What the research shows

RA-12 reports a structural-entropy analysis across the full FAOS corpus of 23 industry ontologies, with calibrated lift measurements available for five pilot industries via the predecessor RA-3 study. The pilot industries are FinTech, Insurance, Healthcare, Vietnamese Banking, and Vietnamese Insurance. Lift data comes from three model families — Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B — yielding 15 industry-model cells of paired structural-entropy-versus-measured-lift observations.

The headline correlation

Interaction-layer structural entropy alone correlates with overall grounding lift at Spearman r = 0.811, p = 0.0002, on the 15-cell dataset. The composite measure, with weights calibrated against the lift data, achieves Spearman r = 0.753 on the same dataset; domain-layer entropy alone reaches r = 0.786. Among the six structural-entropy variants tested, five survive Holm-Bonferroni correction at the 0.01 level. Role-layer entropy is the lone null, scoped to the single-group pilot regime as noted above.
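For readers who want to see what the statistic does, here is a minimal sketch of a Spearman rank correlation: rank both series (averaging tied ranks), then take the Pearson correlation of the ranks. The five data points are made-up placeholders, not RA-12 observations.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Placeholder values for illustration only.
entropy_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
measured_lift = [0.10, 0.15, 0.05, 0.30, 0.25]
rho = spearman(entropy_scores, measured_lift)
print(f"Spearman r = {rho:.3f}")
```

Because only ranks enter the calculation, the statistic rewards getting the ordering of ontologies right and is indifferent to how far apart their lift values are.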

Cross-validation under industry holdout

The headline correlation is then put through a stricter test: leave-one-industry-out cross-validation. Each of the five pilot industries is held out in turn, the regression is refit on the twelve observations from the remaining four industries, and the three held-out observations are predicted. The pooled cross-validated Spearman correlation is r = 0.786, p = 0.0005. The pooled RMSE is 0.060 units of lift. The calibration slope is 0.81: the predictor's slope is slightly shrunk toward the mean, but the rank ordering is preserved across the held-out set.
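The holdout protocol can be sketched in a few lines. The example assumes a one-variable ordinary least squares regression of lift on an entropy score, and it takes the calibration slope to be the slope from regressing observed lift on predicted lift; the convention used in the study is not stated here. The data are synthetic placeholders, deliberately tidy, and not the RA-12 observations.

```python
import math

# Synthetic placeholder data, NOT the RA-12 observations: five hypothetical
# industries, one entropy score each, three lift measurements (one per model).
ENTROPY = {"ind_a": 1.0, "ind_b": 1.5, "ind_c": 2.0, "ind_d": 2.5, "ind_e": 3.0}
NOISE = (-0.02, 0.0, 0.02)
OBSERVATIONS = [
    (industry, h, 0.1 * h + e) for industry, h in ENTROPY.items() for e in NOISE
]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def leave_one_industry_out(observations):
    """Pooled (predicted, observed) pairs, holding out one industry per fold."""
    pooled = []
    for held_out in sorted({ind for ind, _, _ in observations}):
        train = [(x, y) for ind, x, y in observations if ind != held_out]
        test = [(x, y) for ind, x, y in observations if ind == held_out]
        slope, intercept = fit_line(*zip(*train))
        pooled.extend((slope * x + intercept, y) for x, y in test)
    return pooled

def rmse(pairs):
    return math.sqrt(sum((p - y) ** 2 for p, y in pairs) / len(pairs))

pooled = leave_one_industry_out(OBSERVATIONS)
pooled_rmse = rmse(pooled)
# Calibration slope under the assumed convention: observed regressed on predicted.
calibration_slope, _ = fit_line(*zip(*pooled))
print(f"pooled RMSE = {pooled_rmse:.3f}, calibration slope = {calibration_slope:.2f}")
```

Holding out whole industries, rather than individual cells, matters because the three cells of one industry share the same ontology; splitting them across train and test would let the model see the held-out ontology's entropy score during fitting.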

A nested cross-validation check, in which the composite weights are re-selected on each fold's training data rather than held fixed at the values fit on the full dataset, yields essentially the same result: r = 0.775, RMSE 0.058, calibration slope 0.85. The layer-priority finding — interaction layer first, domain layer second, role layer null — is robust under nested weight selection.

Replication across three model families

The pattern replicates within each generator model's subset of the data. Claude Sonnet 4 yields r = 0.900 (n = 5, p = 0.037). Gemma 4 26B yields r = 0.800 (n = 5, p = 0.104). Qwen 2.5 72B yields r = 0.700 (n = 5, p = 0.188). Within every model, structural entropy and measured lift are positively rank-correlated across the pilot industries, and the pooled n = 15 result inherits that directional consistency. With only five points per model, just the Claude subset reaches p < 0.05 on its own, so the per-model evidence is about direction: the signal is not an artifact of one model family.
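The relationship between r, n, and p in these figures can be checked with a short sketch. It assumes the p-values come from the usual t-approximation for rank correlations (t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom, two-sided); how the study computed them is not stated, so this is a consistency check rather than a reproduction. The t-distribution tail is integrated numerically with Simpson's rule to keep the example dependency-free.

```python
import math

def t_pdf(u, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + u * u / df) ** (-(df + 1) / 2)

def two_sided_p(r, n, steps=2000):
    """Two-sided p-value for a correlation r on n points (requires |r| < 1, n > 2)."""
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule for the area under the density between 0 and t (steps is even).
    h = t / steps
    area = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return max(0.0, 1 - 2 * area)

for label, r, n in [("n=5, r=0.900", 0.9, 5), ("n=5, r=0.800", 0.8, 5),
                    ("n=5, r=0.700", 0.7, 5), ("n=15, r=0.811", 0.811, 15)]:
    print(f"{label}: p = {two_sided_p(r, n):.4f}")
```

Under this assumption the function returns roughly 0.037, 0.104, and 0.188 for the three n = 5 subsets, in line with the reported values, and it makes the sample-size effect concrete: the same r of about 0.8 that is inconclusive on five points is strongly significant on fifteen.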

The Vietnamese-localized industries observation

The cross-validation produces an asymmetric residual that is worth flagging on its own. The largest fold-RMSE belongs to Insurance — the English-language pilot industry where one of the three models (Claude) shows a terminological-fidelity regression on well-known concepts. The Vietnamese-localized industries, where the same models have weak parametric coverage, are predicted accurately. The structural entropy signal works most cleanly precisely where ontological grounding has the most to offer — in domains the model knows least about. This is consistent with the Inverse Parametric Knowledge result from the predecessor RA-3 study.

What the paper does not claim

Three boundaries on the result warrant explicit note.

First, structural entropy is a design-time predictor, not a detector of concepts that should be retired. The signal scores an ontology before it is used. It does not look at an ontology that has been in production for two years and tell you which of its concepts have stopped pulling their weight. Concept retirement, longitudinal pruning, and ontology hygiene over time are operational disciplines distinct from this signal.

Second, the predictive power at pilot scale is comparable to size baselines, not categorically superior to them. On the five pilot ontologies, with their narrow range of interaction-pattern counts, the raw handoff count achieves the same rank correlation as interaction-layer entropy. The entropy formulation is retained for three reasons that pilot data does not exhaust: it provides layer-wise decomposition that size counts do not; it carries a theoretical anchor in Shannon information theory; and the size-versus-entropy distinction is expected to widen as production-scale ontology measurements become available. At pilot scale, the rank-correlation result should therefore be read as equivalent to what a simpler count-based signal achieves.

Third, the result is a cross-sectional, design-time correlation across 15 cells, not a longitudinal experiment over time and not a claim about every concept in every ontology earning or failing to earn its place. The granularity of the claim is whole-ontology and per-layer, not per-concept.

What this means for your enterprise

Three practical implications follow.

Implication one — investment ranking before implementation

The most direct application is the use case the signal was designed for: predicting which ontology investments will earn the largest grounding lift before the build budget is committed. A CIO with a roadmap of five to ten candidate industry ontologies can measure structural entropy on early drafts of each, before any agent runs, and rank them by predicted lift. The investment then sequences by predicted return.

The rank-ordering accuracy at pilot scale (interaction layer alone r = 0.811, cross-validated r = 0.786) is suitable for ranking decisions where the difference between candidates is on the order of 0.1 lift units or larger. The calibration slope of 0.81 means the absolute lift predictions are directional rather than precise. Use the signal to choose between meaningfully different candidates, not to settle small differences.
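One way to put the "0.1 lift units" guidance into practice is to rank candidates into tiers rather than a strict order, so that small differences are not over-read. The sketch below is one possible tiering rule, not a method from RA-12: a candidate joins the current tier if its predicted lift is within `min_gap` of that tier's leader. The vertical names and predicted values are hypothetical.

```python
def rank_into_tiers(predicted_lift, min_gap=0.1):
    """Group candidates into tiers; differences below min_gap are treated as ties."""
    ordered = sorted(predicted_lift.items(), key=lambda item: item[1], reverse=True)
    tiers = []
    for name, lift in ordered:
        leader_lift = tiers[-1][0][1] if tiers else None
        if leader_lift is not None and leader_lift - lift < min_gap:
            tiers[-1].append((name, lift))
        else:
            tiers.append([(name, lift)])
    return tiers

# Hypothetical predicted lift per candidate vertical.
candidates = {
    "vertical_a": 0.42,
    "vertical_b": 0.38,
    "vertical_c": 0.25,
    "vertical_d": 0.21,
    "vertical_e": 0.05,
}

tiers = rank_into_tiers(candidates)
for number, tier in enumerate(tiers, start=1):
    print(f"tier {number}: {[name for name, _ in tier]}")
```

Within a tier, the ordering is left to the other inputs to the decision, such as business value or SME capacity; the signal only separates the tiers from each other.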

Implication two — layer-allocation guidance for new ontology work

When an ontology authoring team has constrained capacity, the decomposition tells you which layer to prioritize. The empirical layer-priority ranking is interaction-layer first, domain-layer second, role-layer third — at pilot scale. Practitioners building a new industry ontology should begin with the workflow runbook — handoff patterns, escalation paths, role-to-role triggers, approval chains — before they elaborate the concept taxonomy. The handoff layer carries the structural information that downstream agents most reliably exploit.

This inverts the typical authoring order, in which teams start with concept definitions because they feel like the foundation. RA-12 suggests that, for predictive grounding lift, the concept layer is the second priority and the runbook layer is the first.

Implication three — vertical-by-vertical calibration is required

The signal does not transfer cleanly across verticals with very different parametric coverage. The Vietnamese-localized industries in the pilot dataset show simultaneously higher structural entropy and larger lift, in a way that is partially confounded with the model's underlying training-data coverage of those verticals. Cross-vertical generalization should be approached with calibration per vertical, not by applying a single threshold across the board. A high-coverage English-language vertical may not deliver the same lift at the same structural-entropy score as a low-coverage non-English one.

In practice, this means an ontology program should treat the structural-entropy signal as a within-vertical ranking tool, augmented by an external check on the model's pre-existing coverage of the vertical. The two together — structural entropy of the ontology plus an estimate of the model's parametric coverage of the domain — give a more honest prediction than either signal alone.

The honest counter-cases

Three places where this signal is not the right tool.

Single-vertical deployments with mature ontologies. If your enterprise operates in one domain and your ontology has been live for a year, the marginal benefit of computing structural entropy on the existing artifact is small. The signal is most valuable when ranking new investments. For mature single-vertical work, the dollars are better spent on per-concept value tracking, regulatory currency, and operational hygiene — disciplines the structural-entropy signal does not address.

High-coverage English-language domains. Domains where the language model's training corpus is rich — generalist software engineering, English-language customer support, US-listed financial services — are exactly the regime where the predecessor RA-3 study observed terminological-fidelity regressions: structured ontology injection can occasionally hurt rather than help on well-known concepts. Structural entropy mis-predicts in this regime. The signal works most reliably where the model knows the domain least, and is least reliable where the model already knows it well.

Taxonomy-shaped problems where workflow is incidental. If the work the agent does is primarily lookup and definitional, with little workflow structure, the interaction-layer dominance of the signal may not match the task profile. For pure terminology resolution, a curated retrieval index may serve better than an interaction-rich ontology. The structural-entropy signal still measures something, but the connection to operational lift is weaker.

Implementation guide

An enterprise adopts entropy-guided ontology design in roughly the same way it would adopt any other quantitative design discipline. The work is organizational as much as technical.

Who owns the discipline

The natural owner is the Data and AI organization, or the Ontology team if one exists. Structural entropy is computed from the ontology artifact alone, and computing it takes no domain expertise, so it is not work for a banking or healthcare SME. The domain SMEs author the ontology content; the Data and AI team computes the signal, interprets the layer decomposition, and surfaces it into the ontology-review workflow. The CIO consumes the result at the portfolio level, when ranking investments.

Tooling at a high level

The measurement is a scalar computation on a graph representation of the ontology. Loading the ontology in a structured form (OWL, RDF, or a typed YAML representation), constructing the layer-typed subgraphs, and reporting the four entropies is straightforward engineering work. The compute cost is trivial. The discipline cost is in instrumenting the ontology review workflow so the signal is consulted at the right gates.
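The shape of that engineering work can be sketched as follows. The example starts from an ontology already loaded into a typed dictionary (as a YAML parser would produce), splits it by layer, and reports one entropy per layer. Everything specific here is an assumption: the field names (`roles`, `domain`, `interactions`, `group`, `type`, `handoff`) are hypothetical, and plain Shannon entropy over type labels stands in for the RA-12 layer formulas, which also account for attribute richness, volume, and description detail.

```python
import math
from collections import Counter

# Hypothetical typed representation of a small ontology (illustrative content).
ONTOLOGY = {
    "roles": [
        {"name": "underwriter", "group": "specialists"},
        {"name": "claims_handler", "group": "specialists"},
    ],
    "domain": [
        {"name": "policy", "type": "concept"},
        {"name": "claim", "type": "concept"},
        {"name": "solvency_rule", "type": "regulation"},
        {"name": "loss_ratio", "type": "kpi"},
    ],
    "interactions": [
        {"source": "claims_handler", "target": "underwriter", "handoff": "escalation"},
        {"source": "underwriter", "target": "claims_handler", "handoff": "approval"},
        {"source": "claims_handler", "target": "underwriter", "handoff": "referral"},
        {"source": "underwriter", "target": "claims_handler", "handoff": "notification"},
    ],
}

def entropy_bits(labels):
    """Shannon entropy, in bits, of the label frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def layer_entropies(ontology):
    """One simplified entropy per layer, keyed by layer name."""
    return {
        "role": entropy_bits(r["group"] for r in ontology.get("roles", [])),
        "domain": entropy_bits(d["type"] for d in ontology.get("domain", [])),
        "interaction": entropy_bits(i["handoff"] for i in ontology.get("interactions", [])),
    }

scores = layer_entropies(ONTOLOGY)
print(scores)
```

In this simplified form, an ontology whose roles all sit in a single organizational group scores zero on the role layer, while the domain and interaction layers score 1.5 and 2.0 bits respectively. A composite would be added on top once calibrated weights are available.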

Milestones for a 90/180/365-day adoption

Day 0 to 90 — instrument the existing portfolio. Compute structural entropy across every ontology already in the catalog. Land the layer decomposition in the ontology review dashboard the team already uses, or a new lightweight one. Rank the catalog by interaction-layer entropy and by the composite. This is the first defensible cross-vertical comparison the program will have.

Day 90 to 180 — apply to new build decisions. Use the signal as one input — alongside business value, regulatory pressure, and domain-SME capacity — in the next two or three ontology build decisions. Document the predicted-versus-observed lift gap as those builds ship and measure their agent performance.

Day 180 to 365 — calibrate against per-vertical lift. As the program accumulates direct lift measurements on production ontologies, refit the regression weights against the observed data. The pilot-fit weights are the right starting point, but production-scale measurements will yield a better-calibrated formula. Use the recalibrated formula in the second-year roadmap.

The end-state at twelve months is an ontology investment program that ranks candidate verticals on a measurable signal, allocates authoring capacity by layer, and refines its calibration as production lift data accrues. The board-defensible artifact is the ranked roadmap. The operational artifact is the layer-decomposition dashboard.

Frequently asked questions

Is this just for new ontologies or for existing ones too? Both, with different uses. For new ontologies, the signal is a design-time ranking tool — score draft ontologies, compare them, prioritize the investments with the strongest predicted lift. For existing ontologies, the signal is a portfolio-comparison tool — score what you already have, see which assets carry the most organized information content, and use that to inform refresh investment. The signal is not a maintenance metric for individual concepts within an existing ontology.

What if my industry is not in the studied 23? Two paths. If you are building an ontology for a vertical we have not yet measured, the structural-entropy computation works the same way — the signal is property-of-the-artifact, not property-of-the-vertical. Run the measurement on your draft ontology and use the layer decomposition for allocation guidance. The absolute lift prediction will be less reliable until per-vertical calibration is available, so treat the within-program ranking as the actionable output. If your industry has very different model-coverage characteristics than the pilot industries, expect the predicted lift to over- or under-shoot, and weight the layer-decomposition diagnostic over the headline number.

How often do I re-measure? Re-measure when the ontology itself changes substantially — a major restructuring, the addition of a new role partition, a meaningful expansion of the handoff layer. The signal is not a runtime metric, so day-to-day re-measurement is unnecessary. A quarterly cadence on ontologies under active authoring is reasonable; an annual snapshot on stable ontologies is enough.

Does this interact with retrieval-augmented generation? Cleanly, in principle. Structural entropy measures a property of the ontology; retrieval indexes a property of a document corpus. The two are addressing different parts of the grounding problem — ontology supplies the formal contract, retrieval supplies the unstructured supporting evidence. RA-12 does not measure or predict retrieval quality. A complete enterprise agent stack typically uses both, and the structural-entropy signal informs the ontology side of that stack without making claims about the retrieval side.

Do I really need a full ontology — what about a flat taxonomy? A flat taxonomy will score low on interaction-layer entropy by construction, because there is no interaction layer to measure. The predictive power that the RA-12 study identifies is concentrated in the workflow runbook — handoffs, escalations, approval chains. A taxonomy that captures only concept hierarchy will not carry that signal. Whether you need the full three-layer ontology depends on whether your agents do workflow-shaped work. For agents that primarily classify, look up, and retrieve, a taxonomy may be sufficient. For agents that route, hand off, and escalate inside regulated workflows, the runbook layer is where the lift is.

How does this relate to the FAOS papers on agent grounding and verification? This whitepaper is the third in a trilogy. The companion whitepaper on neurosymbolic enterprise AI covers the underlying grounding architecture — how the three-layer ontology is constrained against the language model at runtime. The companion whitepaper on pre-deployment verification covers the verification step before an agent goes live. The present whitepaper covers the design-time discipline that decides which ontology to build in the first place. The three papers describe three layers of the same problem: the agents you deploy, the context they reason against, and the verification that keeps the autonomy honest. Read in any order; this whitepaper sits earliest in the build sequence.

What to do next

Two paths forward.

Read the companion whitepapers. The grounding architecture is covered in the FAOS whitepaper on neurosymbolic enterprise AI, which describes how the three-layer ontology connects to the LLM at runtime. The pre-deployment verification whitepaper describes how agents are certified before they go live. Together with the present design-time discipline, the three papers describe the FAOS enterprise-agent stack end to end.

Discuss application to your ontology portfolio. If you are running or planning a multi-industry ontology program and want a working session on applying the structural-entropy signal to your portfolio — instrumenting the measurement, interpreting layer decomposition, sequencing the build roadmap — book a discovery call with the FAOS team. We are happy to walk through your verticals, our existing measurements on the FAOS corpus of 23 industries, and a candidate first-90-days plan.

