How AI Can Get It Right
Artificial Intelligence · Software Architecture · Developer Experience

17 April 2026 · 9 min read · Written by Eban Escott

Better prompts help, but they are not the whole answer. This article explores how teams improve AI reliability with retrieval, context management, structured workflows, and stronger sources of truth.

AI does not get it right because someone wrote one magnificent prompt. It gets it right more often because the system around the model is doing more of the work.

That is the useful shift.

If you want the failure modes behind that claim, start with Why AI Gets It Wrong.

People are not just sitting around admiring the problem. They are building ways around it.

It is a less magical story. It is also a far more practical one.

So… how can AI get it right?

Not by becoming magical. By becoming better supported.

Large language models are still probabilistic systems. That part does not change. What changes is the quality of the context, tools, constraints, and workflows wrapped around them. OpenAI’s guide on conversation state, its compaction guidance, and Anthropic’s recent work on context engineering all point in the same direction: serious AI systems need active context management, not blind faith.

That is the shift.

The industry is slowly moving away from “just prompt it harder” and towards something much healthier: engineer the conditions in which the model works.

Give it better context, not just more context

One of the most common responses to unreliable AI is brute-force prompt stuffing. More notes. More history. More documentation. More everything.

That works right up until it does not.

Long threads become messy. Old decisions sit beside new ones. Tool results linger long after they are useful. Summaries flatten nuance. Eventually the model is no longer working from a clean brief. It is working from a recycling bin.

That is why the compaction guides from both OpenAI and Anthropic matter. They are practical responses to the same reality: if a conversation keeps growing, someone has to curate it.

And that gives us a much better principle than “more context is better”.

More context is only better when it is still relevant.

AI does not become wiser because you hand it a warehouse full of meeting notes.
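The mechanics of compaction can be sketched in a few lines. This is an illustrative toy, not any vendor's API: the `summarize` helper is a placeholder for a real model-generated summary, and the turn budget is arbitrary.

```python
# Minimal sketch of conversation compaction: once the transcript grows past
# a budget, older turns collapse into one summary message while the most
# recent turns are kept verbatim. In a real system the summary would come
# from a model call; here a placeholder just records what it replaced.

def summarize(turns):
    # Placeholder for a model-generated summary of the dropped turns.
    return f"[summary of {len(turns)} earlier turns]"

def compact(history, max_turns=6, keep_recent=3):
    """Replace old turns with one summary once history exceeds max_turns."""
    if len(history) <= max_turns:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [{"role": "system", "content": summarize(old)}] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compacted = compact(history)  # 1 summary message + the 3 newest turns
```

The point is the shape, not the thresholds: something outside the model decides what is still relevant, and everything else is curated down before it turns into a recycling bin.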

Ground it in facts it can actually read

The second big move is retrieval.

If the model needs accurate information about your documents, your policies, your codebase, or the live web, do not ask it to perform a séance with its training data. Give it a way to fetch the material.

That is why tools like Google’s Grounding with Google Search are explicitly aimed at reducing hallucinations by connecting the model to current sources, and why platforms like Mistral offer things like a Document Library for grounding answers in uploaded material.

This is the practical value of retrieval-augmented generation, or RAG if we want to sound like we have attended the right meetings.

The idea is not complicated. Instead of hoping the model remembers the right thing, retrieve the right thing and put it in front of the model at the moment it matters.

That is the difference between answering from vibes and answering from evidence.
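The retrieval step itself is simple enough to sketch. Production systems use vector embeddings rather than the naive word-overlap scoring below, but the pipeline has the same shape: retrieve first, then put only the winners in front of the model.

```python
# Toy retrieval step: score documents by word overlap with the question
# and hand only the best matches to the model, instead of hoping it
# remembers the right thing from training.

def score(question, doc):
    q = set(question.lower().split())
    d = set(doc.lower().split())
    return len(q & d)

def retrieve(question, docs, top_k=2):
    ranked = sorted(docs, key=lambda d: score(question, d), reverse=True)
    return ranked[:top_k]

docs = [
    "Refund policy: refunds are issued within 14 days of purchase.",
    "Shipping: orders ship within 2 business days.",
    "Privacy: we never sell customer data.",
]
context = retrieve("what is the refund policy", docs, top_k=1)
prompt = "Answer using only this context:\n" + "\n".join(context)
```

Swap the scoring function for an embedding model and the sketch becomes a real RAG pipeline; the principle does not change.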

Use tools for what tools are better at

Another quiet improvement is that people are getting more disciplined about what the model should do itself, and what it should hand off.

Need a real-time fact, a database query, a calculation, a system action, or a workflow step? Good. That should probably be a tool, not a paragraph.

Both Anthropic’s tool use overview and Google’s function calling guide frame the model less like an all-knowing oracle and more like a coordinator that knows when to ask the rest of the system for help.

That is a much healthier pattern.

A calculator is still better at arithmetic. A search system is still better at retrieval. A workflow engine is still better at deterministic execution. The model’s job is to reason over the task, not cosplay as every subsystem in the building.
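A tool-dispatch loop can be sketched without any particular vendor's SDK. In the sketch below the "model output" is hard-coded to keep it self-contained; in practice it would come from a function-calling API like the ones Anthropic and Google document.

```python
# Sketch of tool dispatch: the model proposes a tool call, the system
# executes it deterministically, and the result goes back into the
# conversation. The tool registry here is illustrative.

TOOLS = {
    # eval with empty builtins: arithmetic expressions only, for the sketch
    "calculator": lambda expr: eval(expr, {"__builtins__": {}}),
    "lookup": lambda key: {"capital_of_france": "Paris"}.get(key, "unknown"),
}

def dispatch(tool_call):
    """Run the named tool with its argument and return the result."""
    name, arg = tool_call["name"], tool_call["arg"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](arg)

# Asked "what is 17 * 23?", the model emits a call instead of guessing:
call = {"name": "calculator", "arg": "17 * 23"}
result = dispatch(call)
```

The division of labour is the point: the model decides *which* tool to ask for, and the deterministic system does the arithmetic.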

Turn repeated work into reusable skills

A lot of prompting pain is really repetition pain.

Same instructions. Same process. Same approved stack. Same formatting rules. Same “please check these files before you touch anything important”. Eventually the prompt starts to look like a compliance document stapled to a cry for help.

Reusable skills are a cleaner answer.

OpenAI’s skills guide describes skills as versioned bundles of instructions, files, and scripts that can be attached when needed rather than pasted into every single request. That matters because repeatable work should be encoded as repeatable workflow, not re-explained from scratch every Tuesday.

This is one of the broader themes in how AI systems are maturing.

The more something becomes routine, the less it should live in improvised prose.
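The structure of a skill can be sketched as data rather than prose. The class below loosely mirrors the idea in OpenAI's skills guide (versioned bundles of instructions and files); the names and fields are illustrative, not an official API.

```python
# Sketch of a reusable "skill": a versioned bundle of instructions and
# supporting files attached to a request when relevant, instead of
# re-pasting the same rules into every prompt.

from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    version: str
    instructions: str
    required_files: list = field(default_factory=list)

    def attach(self, prompt: str) -> str:
        """Prepend the skill's instructions to a task prompt."""
        header = f"[skill {self.name} v{self.version}]\n{self.instructions}\n"
        return header + prompt

review = Skill(
    name="code-review",
    version="1.2.0",
    instructions="Check the style guide, run tests, flag TODOs before approval.",
    required_files=["STYLE.md"],
)
final_prompt = review.attach("Review the payment module changes.")
```

Versioning is what makes this more than a saved prompt: when the process changes, the skill changes once, and every request that attaches it picks up the fix.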

Make the output easier to check

Another practical shift is to stop treating nice language as the ideal output for everything.

Sometimes you want prose. Sometimes you want a contract.

OpenAI’s Structured Outputs and Anthropic’s structured outputs guidance both point to the same idea: when downstream systems need predictable output, constrain the model to a known structure.

That does two useful things.

First, it makes the response easier to validate. Second, it makes nonsense easier to catch.

Free-form output is persuasive. Structured output is inspectable. Those are not the same thing, and it is usually better to know which one you are asking for.
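The "inspectable" half of that claim is easy to demonstrate. Real systems would use a JSON Schema validator; a few lines of plain Python show the idea, with an illustrative contract rather than any real API's schema.

```python
# Sketch of checking structured output: constrain the reply to a known
# shape, then validate it mechanically before anything downstream trusts
# it. The schema here is a toy stand-in for a real JSON Schema.

import json

SCHEMA = {"answer": str, "confidence": float, "sources": list}

def validate(raw: str):
    """Parse a model reply and reject anything that breaks the contract."""
    data = json.loads(raw)
    for key, typ in SCHEMA.items():
        if key not in data or not isinstance(data[key], typ):
            raise ValueError(f"bad or missing field: {key}")
    return data

ok = validate('{"answer": "42", "confidence": 0.9, "sources": ["doc1"]}')
# A free-form reply like "I think it is probably 42" fails json.loads
# and gets caught, instead of slipping through as persuasive prose.
```

Missing sources, a confidence that is not a number, an answer that is not there at all: each becomes a hard failure you can log, rather than nonsense you have to notice.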

There is a related lesson here as well: sometimes the smartest thing a model can say is “I don’t know”. Anthropic’s guide on reducing hallucinations is good on this point. Reliable systems do not just encourage answers. They also make room for uncertainty, citation, and abstention when the evidence is thin.

Give it something more durable than chat history

This is the part that matters most in the long run.

A lot of today’s mitigation techniques are really ways of compensating for one awkward fact: chat history is a flimsy source of truth.

Useful, yes. Durable, not really.

Conversation compaction helps. Retrieval helps. Tool use helps. Skills help. Structured outputs help. But if the model is working on a living system (software, processes, architecture, operations), then it still needs some stable representation of that system outside the conversation itself.

This is where models, in the model-driven sense, can play a much larger role.

A structured model gives AI something more durable than remembered chat. It gives it a maintained representation of entities, relationships, rules, workflows, and constraints. That becomes a stable reference point for generation, validation, automation, and handoff. Instead of reconstructing the world from a pile of prompts and summaries, the AI can work from a governed source of truth.

In other words:

Chat history is context.
A model is context with a backbone.

That is a much stronger place to build from.
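What "context with a backbone" means in practice can be sketched as a tiny governed domain model. The shape below is illustrative; real model-driven tooling is far richer, but the point survives at any scale.

```python
# Sketch of a structured source of truth: entities, relationships, and
# rules maintained outside any conversation. Generation, validation, and
# handoff can all consult the same reference point instead of re-deriving
# the world from chat history.

DOMAIN_MODEL = {
    "entities": {
        "Customer": ["id", "email"],
        "Order": ["id", "customer_id", "total"],
    },
    "relationships": [("Order", "belongs_to", "Customer")],
    "rules": ["Order.total must be >= 0"],
}

def fields_of(entity: str) -> list:
    """Stable answer to: what fields does this entity actually have?"""
    try:
        return DOMAIN_MODEL["entities"][entity]
    except KeyError:
        raise ValueError(f"unknown entity: {entity}")

fields = fields_of("Order")
```

Ask a chat transcript what fields `Order` has and you get whatever was last said. Ask the model-in-the-model-driven-sense and you get the governed answer, every time.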

So what does “getting it right” actually look like?

Usually, it looks less glamorous than the demos.

It looks like:

  • compacting long conversations before they drift
  • retrieving only the documents that matter
  • grounding claims in live or approved sources
  • calling tools instead of guessing
  • turning repeated instructions into reusable skills
  • constraining outputs so they can be checked
  • anchoring work to structured artefacts that survive the session

None of this is especially mystical. That is the good news.

Reliable AI is not emerging because the model suddenly became perfect. It is emerging because teams are getting better at designing the environment the model operates in.

The model still matters, but the system matters more

Models will keep improving. Context windows will grow. Tool use will get smoother. Retrieval will get tighter. The rough edges will soften.

But the basic lesson is unlikely to change.

If you want AI to get it right more often, do not just ask it nicely. Give it better conditions.

Give it the right context. Give it access to the right sources. Give it tools. Give it structure. Give it workflows that are deterministic where they need to be deterministic. And where the work actually matters, give it a source of truth that is bigger than a chat transcript.

That is how AI can get it right.

Not perfectly. Not magically. But far more often than it does when left alone with a giant prompt and your optimism.

References

Anthropic

  • Context windows: Anthropic describes the context window as a model’s working memory and notes that accuracy and recall can degrade as token counts grow, a phenomenon it calls context rot.
  • Compaction: Anthropic recommends server-side compaction for long-running conversations and agentic workflows, replacing stale history with concise summaries.
  • Reduce hallucinations: Anthropic provides guidance on reducing hallucinations through grounded quoting, clearer uncertainty handling, and auditable outputs with citations.

DashScope / Alibaba

  • Long context (Qwen-Long): Alibaba Cloud shows file-based long-document handling through file IDs so large documents do not need to be present with every request.
  • Function Calling: Alibaba Cloud explains function calling as a way to handle tasks such as real-time information retrieval and calculations.
  • Optimize RAG performance: Alibaba Cloud walks through common RAG pipeline issues across parsing, chunking, retrieval, recall, and answer generation.

Gemini / Google

  • Long context: Google explains context as a form of short-term memory and discusses older workarounds such as dropping messages, summarising, and using RAG.
  • Context caching: Google documents context caching as a way to reuse repeated context across requests.
  • Grounding with Google Search: Google presents search grounding as a way to reduce hallucinations, access real-time information, and return cited responses.

Mistral

  • Document Library: Mistral documents its built-in document library as a retrieval tool for uploaded files, with reference chunks available for citations.
  • Function Calling: Mistral explains how models select tools, generate arguments, and use tool outputs in follow-up responses.
  • Agents API: Mistral introduces its Agents API with web search and MCP-based tooling to provide up-to-date, evidence-supported context.

Ollama

  • Context length: Ollama provides a plain-English explanation of context limits, including why agents, web search, and coding tools often need larger context settings.
  • Embeddings: Ollama documents embeddings for semantic search, retrieval, and RAG workflows.
  • Tool calling: Ollama explains how local models can invoke functions and run multi-turn agent loops.

OpenAI

  • Conversation state: OpenAI explains that text generation requests are independent and stateless unless prior messages or conversation features are explicitly included.
  • Compaction: OpenAI documents server-side compaction as a way to summarise and prune long conversations once they reach a threshold.
  • Skills: OpenAI describes skills as versioned bundles of instructions, files, and scripts that package reusable procedures outside the main prompt.