Key Technical Challenges: Problems That Almost Broke Us
I'm going to tell you about the problems that almost killed our project. Not the polished "challenges we overcame" version—the real ones. The bugs that took weeks to find. The architectural decisions we reversed three times. The features we built, shipped, and then ripped out.
If you're building agentic systems, you'll hit these walls too. Maybe this saves you some scars.
The AI agent space is so new that there's no playbook. Every team is learning by doing. We've made progress on these challenges, but I won't pretend we've solved them. Some we're managing. Others we're still fighting. This is the honest version.
Challenge #1: Context Window Management at Scale
The Memory Problem Nobody Warns You About
Here's what the AI hype doesn't mention: LLMs have finite context windows, and agent interactions generate massive amounts of context.
A typical agent session might include:
- System prompt (500-2,000 tokens)
- Agent persona (1,000-3,000 tokens)
- User request and conversation history (variable, can grow large)
- Retrieved context (documents, code, data)
- Previous step outputs in a workflow
Add it up, and you're easily at 20,000-50,000 tokens per interaction. In a multi-step workflow, context accumulates. By step 10, you might be pushing model limits.
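To make the accumulation concrete, here's a back-of-the-envelope sketch. The numbers are illustrative, not measurements from our system, but the shape of the problem is the same: the carried-forward step outputs dominate as the workflow gets longer.

```python
# Rough token budget for a multi-step workflow (all numbers are illustrative).
SYSTEM_PROMPT = 1_500       # system prompt
PERSONA = 2_000             # agent persona
CONVERSATION = 5_000        # user request + conversation history
RETRIEVED = 8_000           # documents, code, data pulled in for the task
STEP_OUTPUT = 3_000         # average output carried forward from each prior step

def tokens_at_step(step: int) -> int:
    """Approximate prompt size when the agent starts the given step."""
    carried = STEP_OUTPUT * (step - 1)   # previous step outputs accumulate
    return SYSTEM_PROMPT + PERSONA + CONVERSATION + RETRIEVED + carried

for step in (1, 5, 10):
    print(f"step {step:2d}: ~{tokens_at_step(step):,} tokens")
# step  1: ~16,500 tokens
# step  5: ~28,500 tokens
# step 10: ~43,500 tokens
```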
What we tried and failed:
Naive truncation — Just cut the oldest content when approaching limits. This seemed reasonable until we realized that critical context often appears early in workflows. Truncating the beginning lost essential information that later steps depended on.
Aggressive summarization — Summarize everything aggressively to save space. The problem: summaries lose nuance. An agent making decisions based on a summary of a summary makes worse decisions than one with the original context. Quality degraded noticeably.
RAG for everything — Put all context in a vector store, retrieve only what's relevant. This works for some cases, but adds latency. More importantly, relevance isn't always obvious. Sometimes the "irrelevant" detail in paragraph 47 matters for the decision in step 12.
Bigger models — Just use models with larger context windows. Cost explosion. Our inference costs tripled. For enterprise workloads, this wasn't sustainable.
What we learned:
Context is not fungible. Not all tokens are equally important. A 100-token summary of a requirement might be less valuable than the original 500-token version, or it might be more valuable than 5,000 tokens of verbose discussion. The value depends on the task.
Summarization is an art, not a mechanical process. What to preserve, what to compress, what to discard—these are judgment calls that depend on what comes next. Summarizing without knowing future use cases loses important information.
Different workflow stages need different context strategies. Early stages benefit from comprehensive context. Later stages benefit from focused, relevant context. One strategy doesn't fit all.
Our current approach:
Hierarchical context management — We maintain context at multiple levels of detail. Full context is preserved but compressed. Summaries are available for space-constrained situations. The system chooses the appropriate level based on available space and task requirements.
Intelligent context compression — We use models to summarize context, but with guidance about what's important. "Summarize this, preserving all technical specifications" produces better results than "summarize this."
Working memory vs. reference memory — Active context (what the agent is working on now) is kept in full. Background context (previous steps, reference documents) is compressed or retrieved on demand.
Cost/quality trade-off controls — We expose knobs to balance context richness against cost. For high-stakes decisions, use more context. For routine tasks, optimize for efficiency.
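To show what the working-memory vs. reference-memory split looks like in code, here's a minimal sketch. The class names and the crude token estimate are invented for illustration, not lifted from our codebase; the point is the selection logic: active items stay in full, and background items degrade gracefully to their guided summaries when the budget runs out.

```python
from dataclasses import dataclass, field

def rough_tokens(text: str) -> int:
    """Crude estimate (~4 characters per token); swap in a real tokenizer."""
    return len(text) // 4

@dataclass
class ContextItem:
    full_text: str
    summary: str          # guided summary ("preserve all technical specifications")
    active: bool = False  # is the agent working on this right now?

@dataclass
class ContextManager:
    budget: int                                   # token budget for this call
    items: list[ContextItem] = field(default_factory=list)

    def build_context(self) -> str:
        """Working memory goes in full; reference memory degrades to summaries."""
        parts: list[str] = []
        used = 0
        # 1. Active items (working memory) are never compressed.
        for item in self.items:
            if item.active:
                parts.append(item.full_text)
                used += rough_tokens(item.full_text)
        # 2. Background items (reference memory) get full text only if it fits.
        for item in self.items:
            if item.active:
                continue
            cost = rough_tokens(item.full_text)
            if used + cost <= self.budget:
                parts.append(item.full_text)
                used += cost
            else:
                parts.append(item.summary)
                used += rough_tokens(item.summary)
        return "\n\n".join(parts)
```

The cost/quality knobs map naturally onto the budget parameter: a high-stakes decision gets a generous budget, a routine task gets a tight one.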
What's still hard:
Detecting when context loss causes errors is nearly impossible in real time. An agent might make a confident decision based on incomplete context, and we don't know the context was incomplete until we see the bad outcome.
Finding optimal summarization strategies for different content types remains more art than science.
Cross-agent context sharing is complex. When Agent A passes context to Agent B, how much should transfer? Too little loses information. Too much exceeds Agent B's capacity.
Challenge #2: Agent Hallucination and Decision Validation
When Your Agent Confidently Does the Wrong Thing
Hallucination isn't a bug—it's an inherent property of language models. They generate plausible-sounding content, and sometimes that content is wrong. For chatbots answering questions, this is annoying. For agents taking actions, it's dangerous.
An agent might:
- Confidently cite a policy that doesn't exist
- Reference a file at a path that was never mentioned
- Make up statistics to support a recommendation
- Claim to have completed an action it didn't actually take
The worst part: confidence doesn't correlate with accuracy. Hallucinated content sounds just as confident as accurate content. Sometimes more confident.
What we tried and failed:
"Just prompt better" — We wrote elaborate prompts instructing agents not to hallucinate. "Only state facts you're certain about." "If you're unsure, say so." This helped marginally but didn't solve the problem. Models hallucinate even when told not to.
Self-verification — Have the agent check its own work. "Review your response and correct any errors." The problem: agents verify their own hallucinations. If they were confident enough to say it, they're confident enough to confirm it.
Confidence thresholds — Ask the model for a confidence score, only proceed if confidence is high. Model confidence is uncalibrated. High confidence doesn't mean high accuracy. We saw agents report 95% confidence on completely fabricated information.
What we learned:
Hallucination is inherent to how language models work. You can reduce it, but you can't eliminate it. Accept this and design accordingly.
Structural constraints beat prompt engineering. Telling an agent "don't make things up" is less effective than making it impossible to make things up. Constrain outputs to valid options rather than hoping the model will self-constrain.
Multiple validation sources are needed. One agent checking itself doesn't work. Multiple agents cross-checking can help. External validation (actually checking if that file exists) helps more.
High-stakes decisions need human oversight. Some decisions are too important to trust to probabilistic systems alone.
Our current approach:
Structured output enforcement — We use JSON schemas to constrain agent outputs. Instead of "describe the architecture," we ask "fill in this architecture template." The model can't hallucinate fields that don't exist in the schema.
Multi-agent cross-validation — For critical decisions, multiple agents independently analyze the same situation. Disagreements trigger review. This doesn't catch hallucinations all agents share, but it catches individual agent errors.
Confidence calibration through feedback — We track outcomes over time. When an agent expresses high confidence but outcomes are poor, we adjust our interpretation of that agent's confidence scores.
Action authorization gates — Before agents take actions (especially irreversible ones), we verify the action makes sense. "You're about to delete this file. Is this file actually mentioned in the user's request?" This catches cases where agents hallucinate tasks.
Explicit "I don't know" training — We include examples of appropriate uncertainty in agent personas. We reward agents (through feedback) for acknowledging uncertainty rather than confabulating.
What's still hard:
Subtle hallucinations in plausible-looking output are the hardest to catch. Gross hallucinations (obviously wrong facts) are easy to spot. Subtle ones (slightly misremembered details, plausible but incorrect inferences) slip through.
Hallucinations about internal state are particularly pernicious. An agent might hallucinate what it previously decided, leading to inconsistent behavior across steps.
Cascading errors from early hallucinations compound. A small error in step 2 might seem fine, but by step 8 the workflow has built an elaborate structure on a flawed foundation.
Challenge #3: State Management Across Long-Running Workflows
Remembering What Happened Yesterday
Workflows can run for hours or days. A PRD (product requirements document) creation workflow might span multiple sessions across a week as stakeholders provide input. Throughout this time, the system needs to remember:
- What's been done
- What's pending
- What each step produced
- What decisions were made and why
- Where we are in the workflow
But LLMs are stateless. Each API call starts fresh. The model doesn't remember previous calls unless you tell it. State management is entirely our problem.
What we tried and failed:
In-memory state — Keep state in application memory. Simple. Lost on restart. Any system crash loses all workflow progress. Unacceptable for enterprise.
Simple key-value stores — Persist state to Redis or similar. Works for flat state. But workflow state has relationships. Step 5's output references Step 3's decision. Key-value stores don't handle relationships well without careful schema design.
Full event sourcing — Record every event, reconstruct state by replaying events. Theoretically elegant. Practically complex. Event replay got slow as workflows grew. Event schema evolution was nightmarish. Debugging by reading event logs was painful.
What we learned:
State granularity matters enormously. Too coarse (save state once per workflow) and you lose progress on failures. Too fine (save state on every token) and you're drowning in storage and I/O.
Checkpointing frequency is a key trade-off. More checkpoints mean better recovery but higher overhead. Fewer checkpoints mean better performance but more lost progress on failures.
State schema evolution is a real problem. Workflows change over time. New fields get added. Old fields become irrelevant. Migrating in-flight workflows across schema changes is harder than it sounds.
Debugging state issues is incredibly hard. When a workflow behaves unexpectedly, figuring out whether it's a state corruption, a state loading error, or a logic error requires painstaking reconstruction.
Our current approach:
Structured workflow state with checkpoints — We define explicit checkpoint moments in workflows (typically after each step). State is persisted at checkpoints in a structured format.
Event log for reconstruction — We maintain an event log alongside structured state. The structured state is primary (fast to load). The event log enables debugging and auditing (what led to this state?).
State snapshots at decision points — Major decisions capture a snapshot of relevant state. If something goes wrong later, we can examine the state at the decision point.
Clear state ownership per agent — Each agent owns its portion of state. Agents don't modify each other's state directly. This prevents race conditions and makes debugging easier.
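Here's a compact sketch of the checkpoint-plus-event-log pattern, with an in-memory store standing in for whatever persistence layer you actually use. All names are hypothetical; the structured snapshot is what you load to resume, and the append-only log is what you read when you need to know how you got there.

```python
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class WorkflowState:
    workflow_id: str
    current_step: int = 0
    step_outputs: dict = field(default_factory=dict)   # step_id -> output
    decisions: dict = field(default_factory=dict)      # decision -> rationale

class WorkflowStore:
    """Structured snapshots are primary; the event log is for auditing and debugging."""

    def __init__(self) -> None:
        self._checkpoints: dict = {}   # workflow_id -> latest serialized state
        self._events: list = []        # append-only event log

    def checkpoint(self, state: WorkflowState, event: str) -> None:
        # Persist a full snapshot at each checkpoint (typically after every step)...
        self._checkpoints[state.workflow_id] = json.dumps(asdict(state))
        # ...and record what led to it, so "how did we get here?" stays answerable.
        self._events.append({
            "ts": time.time(),
            "workflow_id": state.workflow_id,
            "step": state.current_step,
            "event": event,
        })

    def resume(self, workflow_id: str) -> WorkflowState:
        """Reload the latest checkpoint after a crash or a multi-day pause."""
        return WorkflowState(**json.loads(self._checkpoints[workflow_id]))
```

Swap the dictionaries for whatever datastore your stack already trusts and you have the essentials: resumable state plus an audit trail.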
What's still hard:
Optimal checkpoint frequency varies by workflow. Short, fast workflows need fewer checkpoints. Long, complex workflows need more. Getting this right for diverse workflows is ongoing.
State migration across versions remains painful. When we update a workflow definition, in-flight workflows may have state that doesn't match the new definition.
Debugging state-related issues still requires significant investigation. Tools help, but state bugs remain some of our hardest to diagnose.
State consistency in parallel execution adds complexity. When multiple agents work simultaneously, ensuring consistent state views requires careful coordination.
Challenge #4: Balancing Autonomy with Guardrails
The Control Paradox
Here's the central tension in agentic systems: we want agents to be autonomous (that's the point), but we also want to control them (that's enterprise reality).
Too much autonomy produces unpredictable, potentially risky behavior. Users can't trust agents that might do anything.
Too little autonomy produces sophisticated chatbots. Users ask questions, get answers, but nothing actually happens without manual intervention.
The right balance depends on context. Low-stakes tasks can tolerate more autonomy. High-stakes tasks need more control. Different users have different risk tolerances. Different organizations have different compliance requirements.
What we tried and failed:
Binary on/off autonomy — Agents are either fully autonomous or fully supervised. Too coarse. Real needs require nuance. Users wanted autonomy for routine tasks and supervision for exceptions.
Per-action approval — Every action requires human approval. Completely destroyed productivity. If users have to approve every step, they might as well do the work themselves.
Trust scoring — Assign trust scores to agents and users. High trust means more autonomy. But the scoring was opaque. Users didn't understand why certain actions required approval. Agents didn't understand their boundaries.
What we learned:
Autonomy is a spectrum, not a switch. The question isn't "autonomous or not" but "how autonomous for this specific action in this specific context."
Context determines appropriate autonomy level. The same action (sending an email) might be fine for internal updates but require approval for customer communications.
Predictable boundaries beat flexible intelligence. Users would rather know exactly what agents will and won't do autonomously, even if the rules are simple, than rely on sophisticated but opaque decision-making.
Users need to understand the autonomy model. If users can't predict when agents will act autonomously versus ask for permission, they can't trust the system.
Our current approach:
Action classification — We classify actions as reversible or irreversible. Reversible actions (creating a draft, suggesting changes) have more autonomy. Irreversible actions (sending emails, deleting files, deploying code) have more oversight.
Configurable autonomy tiers per workflow — Workflows define autonomy levels for each step. A code review workflow might give full autonomy for analysis but require approval for actually merging.
Explicit boundary documentation — Each agent's persona includes clear statements of what it will and won't do autonomously. Users can read these boundaries.
Escalation triggers for boundary cases — When actions fall near boundaries (probably fine, but maybe not), agents escalate to humans. This catches edge cases while keeping normal operation smooth.
Human-in-the-loop for high-stakes actions — Certain action categories always require human approval regardless of configuration. This prevents accidental over-delegation.
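A minimal sketch of the action-classification gate described above. The action names, tiers, and categories are illustrative, not our production configuration; what matters is the shape: reversible actions proceed, irreversible ones need sign-off, and some categories escalate to a human no matter what the workflow configuration says.

```python
from enum import Enum, auto
from typing import Callable, Optional

class Tier(Enum):
    AUTONOMOUS = auto()        # act without asking
    ESCALATE = auto()          # near a boundary: check with a human first
    REQUIRE_APPROVAL = auto()  # always needs explicit sign-off

# Illustrative classification; real deployments configure this per workflow.
IRREVERSIBLE = {"send_email", "delete_file", "deploy_code"}
ALWAYS_HUMAN = {"deploy_code"}   # high-stakes categories ignore workflow overrides

def autonomy_tier(action: str, override: Optional[Tier] = None) -> Tier:
    if action in ALWAYS_HUMAN:
        return Tier.REQUIRE_APPROVAL             # cannot be configured away
    if action in IRREVERSIBLE:
        return override or Tier.REQUIRE_APPROVAL
    return override or Tier.AUTONOMOUS

def execute(action: str, perform: Callable[[], None],
            ask_human: Callable[[str], bool], override: Optional[Tier] = None) -> None:
    tier = autonomy_tier(action, override)
    if tier is Tier.AUTONOMOUS:
        perform()
    elif tier is Tier.ESCALATE:
        if ask_human(f"Boundary case: ok to run '{action}'?"):
            perform()
    elif ask_human(f"Approval required for irreversible action '{action}'."):
        perform()
```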
What's still hard:
Predicting which actions are truly reversible is complicated. Deleting a file seems irreversible, but maybe there are backups. Sending an email is irreversible, but maybe the recipient hasn't read it yet. The lines blur.
Calibrating autonomy to user trust levels is imprecise. New users should probably have more oversight. Experienced users might deserve more autonomy. Measuring trust is subjective.
Explaining autonomy decisions to users in real time without being annoying is a UX challenge. Too much explanation feels like nagging. Too little feels opaque.
Challenge #5: Debugging Multi-Agent Interactions
When Something Goes Wrong, What Went Wrong?
Traditional debugging: set a breakpoint, step through code, inspect variables, find the bug.
Multi-agent debugging: which of 5 agents made the bad decision? During which of 15 steps? Based on which combination of context? And if you rerun it, you might get different results because agents are non-deterministic.
What we tried and failed:
Traditional logging — Log everything. We ended up with gigabytes of logs per workflow. Finding the relevant information was like searching for a needle in a haystack. Log noise made debugging harder, not easier.
Step-through debugging — Traditional debugging techniques assume deterministic systems. Set a breakpoint, reproduce the bug, step through. But agents aren't deterministic. Rerunning doesn't reproduce the same bug.
Replay mechanisms — Record inputs, replay them to reproduce the bug. But even with the same inputs, models produce different outputs. Replay doesn't reproduce the exact behavior.
What we learned:
Observability must be built in from day one. Retrofitting observability onto a system not designed for it is painful. Every component needs to emit useful telemetry from the start.
Structure logs for machine consumption. Human-readable logs don't scale. Structured logs can be queried, aggregated, and analyzed automatically.
Correlation IDs are essential. Every workflow needs a unique ID. Every action within that workflow needs to reference the workflow ID. Without correlation, piecing together what happened is nearly impossible.
The bug is often in the handoff, not the agent. When multi-agent workflows fail, the problem is frequently not in any individual agent's logic but in how context was passed between agents. Agent A's output was good. Agent B's logic was good. But A passed something B misinterpreted.
Our current approach:
Structured semantic logging — Logs include structured fields: workflow ID, step ID, agent ID, action type, outcome. Logs can be queried by any field. "Show me all actions by the architect agent in workflow X that resulted in errors."
Distributed tracing across agents — We implement distributed tracing (similar to microservices tracing) across agent interactions. We can visualize how a request flowed through multiple agents, how long each step took, and where things went wrong.
Decision tree reconstruction — For complex workflows, we can reconstruct the decision tree: what decisions were made, what alternatives were considered, why each choice was made. This helps diagnose not just "what went wrong" but "why did the agent think this was right."
Snapshot-based debugging — We capture snapshots of state and context at key points. When debugging, we can examine exactly what the agent saw when it made a decision.
"Time travel" for workflow state — We can roll back to any checkpoint and examine the workflow state at that moment. This helps pinpoint when things went wrong.
What's still hard:
Reproducing probabilistic failures is inherently difficult. If a bug occurs 5% of the time due to model non-determinism, you might need to run 20+ times to see it again.
Finding root cause vs. symptom is challenging. The step that produces visibly wrong output might not be the step that caused the error. The root cause might be 5 steps earlier.
Debugging in production without impacting users requires careful instrumentation. You can't just attach a debugger. You need to work with logs and traces from production execution.
What We Haven't Solved Yet
Honest Assessment of Open Problems
Here's what we're actively working on without satisfactory solutions:
Better context compression without information loss — We've improved, but we're not close to optimal. Too much compression loses information. Too little exceeds limits.
Reliable confidence calibration — Agent confidence scores remain poorly calibrated. High confidence doesn't reliably indicate high accuracy.
Cross-session agent memory — Agents remember within a workflow but forget across workflows. A user's preferences from last week? Gone.
Real-time collaboration protocols — Multi-agent collaboration works, but it's computationally intensive and not always smooth.
Here's what we don't have good answers for:
True agent creativity vs. recombination — Our agents combine and recombine patterns they've seen. Is that creativity? Can we get genuine novel insights? Unclear.
Long-term learning and improvement — Agents don't learn from experience in any persistent way. Each workflow starts from the same baseline.
Multi-tenant personalization at scale — How do we personalize agent behavior to individual users without creating maintenance nightmares?
Why we're sharing this:
Transparency builds trust. If we claimed to have solved everything, you shouldn't believe us.
The community might have answers. Maybe someone reading this has tackled these problems differently. We'd love to learn.
Setting realistic expectations matters. Agentic AI is powerful, but it's not magic. Understanding limitations helps you use it effectively.
The Challenges Are the Opportunity
These challenges exist because the space is genuinely new. We're not optimizing mature technology—we're figuring out what works.
Solving these challenges is what creates competitive advantage. Anyone can spin up an LLM API. The differentiation is in tackling these hard problems.
We're building in public, which means sharing the hard parts. Polished success stories are less useful than honest accounts of what's actually difficult.
If you're working on similar problems, we'd love to hear your approaches. The field advances faster when we share what we're learning.
In our next post, we shift from technical challenges to business concerns—how enterprises think about risk when AI agents make decisions, and how we build trust over time.
Join the discussion: Share your experiences with these challenges in our GitHub Discussions. What have you tried? What's worked? What hasn't?
Follow along: Subscribe to our newsletter to get Post 8: Risk Management—how we build trust when agents make decisions.
Next in the series: Post 8: Risk Management — Building Trust in Autonomous Systems
This is Post 7 of 10 in the series "Building the Agentic Enterprise: The FAOSX Journey."
Ready to see agentic AI in action? Request a Workshop and let's build the future together.
