AI agents can solve more complex problems, but that makes them more susceptible to errors. The more work we entrust to AI agents, the more important human oversight becomes.
AI agents are everywhere these days. An AI-related announcement hardly seems to count anymore unless the word ‘agentic’ appears somewhere. This next big step for artificial intelligence promises great things, but what exactly is an AI agent, and how do you put one to work in your company? Bert Vanhalst, research consultant at Smals Research, discusses both the possibilities and the risks that AI agents introduce.
Think first, then do
AI agents are built on the same foundation as generative chatbots like ChatGPT and Copilot, namely LLMs or large language models. The difference lies in how they tackle problems. Vanhalst: “We all know ChatGPT, which can generate and summarize texts, for example, but that’s a one-shot operation.”
“With AI agents, we can work in iterations to break down big problems into sub-steps and check the intermediate output. First, the next step is reasoned about and only then executed, in a continuous loop,” Vanhalst explains. “With a classic chatbot, you perform these iterations yourself by adjusting the output.”
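To make that loop concrete, here is a minimal sketch in Python of how such a reason-act-check cycle could look. The call_llm function is a hypothetical placeholder for a real model call and is not tied to any specific framework or vendor API.

```python
# Minimal sketch of the reason-then-act loop described above: plan the next
# sub-step, execute it, check the intermediate output, and repeat until the
# goal is reached. `call_llm` is a hypothetical placeholder, not a real API.

def call_llm(prompt: str) -> str:
    """Stand-in for a call to a large language model."""
    raise NotImplementedError("connect this to an actual model")

def run_agent(goal: str, max_iterations: int = 10) -> str:
    history: list[str] = []  # intermediate results the agent can reason over
    for _ in range(max_iterations):
        # 1. Reason: ask the model what the next sub-step should be.
        plan = call_llm(f"Goal: {goal}\nProgress so far: {history}\nWhat is the next step?")
        # 2. Act: carry out that sub-step (here simply another model call).
        result = call_llm(f"Carry out this step: {plan}")
        history.append(result)
        # 3. Check: is the goal achieved, or do we loop again?
        verdict = call_llm(f"Goal: {goal}\nLatest result: {result}\nAchieved? Answer yes or no.")
        if verdict.strip().lower().startswith("yes"):
            return result
    return history[-1] if history else ""
```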
That’s why one AI agent is not the same as another. “Today, something is quickly labeled an AI agent because it sells well,” Vanhalst notes. “But in reality it’s a whole spectrum, ranging from simple to complex systems. A system doesn’t always need to be complex. Some problems can be solved with a simple language model, or even without AI at all. I always recommend looking at the simplest solution first.”
Some problems can be solved with a simple model, or even without AI at all.
Bert Vanhalst, research consultant Smals Research
The Right Tool
AI agents have the ability to think for themselves, provided you give them clear instructions. Vanhalst: “As a user, you define a framework. This includes criteria for what the goal is and when it’s achieved, for example, the content, length, or style of a text. AI agents can be dynamic regarding the end goal.”
The builder determines which resources an AI agent may use to achieve the goal. “Which tools should be used and in what order, that decision is left to the model. Models are capable of reasoning about this autonomously,” Vanhalst adds.
In technical jargon, this process is called tool calling. He explains the concept: “Ultimately, the model produces structured output with the name of the tool and its input parameters. Models are also able to extract these from unstructured input. The output is not the actual execution of the tool, but a decision about which tool is called, and for what purpose.”
“This is how you eventually reach a result. That can be the final output, or a partial step. A partial step is returned to the language model, which then reasons again about whether more tools need to be called to reach the final result,” says Vanhalst.
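A hedged sketch of what that tool-calling loop could look like in code. The JSON format, the example tools, and the message structure are illustrative assumptions rather than a specific vendor’s API; the point is that the model only names a tool and its parameters, while the surrounding code performs the actual execution and feeds the partial result back to the model.

```python
import json

def call_llm(messages: list[dict]) -> str:
    """Stand-in for a model call that returns either a tool call as JSON,
    e.g. {"tool": "search_docs", "arguments": {"query": "..."}}, or a final answer."""
    raise NotImplementedError("connect this to an actual model")

# Tools the builder chooses to expose; the model decides which to use, and when.
TOOLS = {
    "search_docs": lambda query: f"(documents matching {query!r})",
    "word_count": lambda text: str(len(text.split())),
}

def run(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        try:
            call = json.loads(reply)      # structured output: tool name + input parameters
        except json.JSONDecodeError:
            return reply                  # no tool call, so treat it as the final answer
        result = TOOLS[call["tool"]](**call["arguments"])
        # The partial result goes back to the model, which reasons about the next step.
        messages.append({"role": "tool", "content": result})
    return "stopped after reaching max_steps"
```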
Don’t Trust Blindly
AI models are susceptible to hallucinations, and agents are no different, Vanhalst points out. “Reasoning often has to happen on incomplete data or in an uncertain context. The chance of errors is real, precisely because AI agents are deployed for complex problems. It’s therefore important to monitor the quality of the output.”
“Figuring out where things go wrong and fixing it is an intensive process,” says Vanhalst. “Large language models have a non-deterministic character: one input can produce different outputs. There are many things that can go wrong. Hence the need to thoroughly evaluate systems and maintain human oversight once the system goes into production.”
According to Vanhalst, every user bears responsibility here. “Blindly trusting AI systems is certainly a risk, even if you notice that the model performs well. Guidelines are needed to keep that human validation in place. One day we may be confident enough for less critical matters, but that won’t be tomorrow.”
Vanhalst prefers to stay far away from the discussion about ‘acceptable’ error rates for AI agents. “In certain situations, it’s less serious if a mistake is made, for example if it’s just a suggestion. But if the decision of an AI system impacts people, there are consequences. When is it ‘good enough’? That’s something we need to learn to deal with.”
Blindly trusting AI is a risk.
Bert Vanhalst, research consultant Smals Research
AI as a Junior
People are allowed to make mistakes; some might even argue that it’s occasionally necessary. So why do we expect perfection from artificial intelligence? Vanhalst looks for an explanation. “We’re used to computers giving the right output. We program them that way, so it must be correct, even though errors still creep in.”
He describes the current generation of AI agents as juniors. “In the beginning, you also monitor and guide new employees more closely. When we see that they deliver reliable work over time, we gradually loosen the reins. I think we’ll see the same happen with AI agents. First, we need to see whether they work ‘well enough’ (and I deliberately don’t say ‘perfectly’) before we trust them.”
Will we then see AI agents evolve into seniors, as OpenAI CEO Sam Altman predicts? Vanhalst is more cautious: “Every vendor is jumping on the bandwagon today and a lot is being promised. Suppliers are improving their models, but you often still have to build in a feedback mechanism yourself.”
The Work Is Just Beginning
Vanhalst advises learning to walk with AI before trying to run. “You can usually set something up quickly with the technology, but the challenge lies in monitoring quality. Getting a system ‘production-ready’ is a long road. And the work really only begins once the system goes into production, because that’s when end users actually start using it. That is what you then need to evaluate.”
“The difficulty is that you’re dealing with non-deterministic output. Evaluating that often still requires manual work,” warns Vanhalst. “You have to check output by output whether it was correct and debug where necessary. We’re looking at a way to automate that evaluation process. Paradoxically, we’re using language models again for that.”
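One common way to automate part of that evaluation, and presumably what Vanhalst alludes to, is to let a language model act as a judge over logged agent outputs. Below is a minimal sketch under that assumption; the record format and the PASS/FAIL prompt are illustrative, and since the judge is itself non-deterministic, human spot checks remain necessary.

```python
# Sketch of an LLM-as-judge evaluation over logged agent runs. The record format
# {"task": ..., "output": ...} and the PASS/FAIL prompt are assumptions made for
# illustration; the judge model is non-deterministic too, so keep human spot checks.

def call_llm(prompt: str) -> str:
    """Stand-in for a call to a language model."""
    raise NotImplementedError("connect this to an actual model")

def judge(task: str, output: str) -> bool:
    prompt = (
        "You are evaluating the output of an AI agent.\n"
        f"Task: {task}\n"
        f"Output: {output}\n"
        "Judge correctness and completeness. Answer only PASS or FAIL."
    )
    return call_llm(prompt).strip().upper().startswith("PASS")

def pass_rate(logged_runs: list[dict]) -> float:
    """Share of logged runs the judge marks as PASS."""
    if not logged_runs:
        return 0.0
    return sum(judge(r["task"], r["output"]) for r in logged_runs) / len(logged_runs)
```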
Vanhalst definitely doesn’t want to discourage companies from working with AI agents, but wants them to be aware of the risks. “It’s an exciting world. It’s useful to look for cases, without expecting AI agents to suddenly solve everything. A good cost-benefit balance is important.”
This editorial contribution was created in collaboration with our partner Smals.
