As AI agents have proven promising, organizations have had to grapple with whether a single agent is enough, or whether they should invest in building a broader multi-agent network that touches more points in their organization.
Orchestration framework company LangChain sought to get closer to an answer to this question. It subjected an AI agent to several experiments that found single agents have a limit on context and tools before their performance begins to deteriorate. These experiments could lead to a better understanding of the architecture needed to maintain agents and multi-agent systems.
In a blog post, LangChain detailed a set of experiments it ran with a single ReAct agent and benchmarked its performance. The main question LangChain hoped to answer was: “At what point does a single ReAct agent become overloaded with instructions and tools, and subsequently see performance drop?”
LangChain chose to use the ReAct agent framework because it is “one of the most basic agentic architectures.”
While benchmarking agent performance can often lead to misleading results, LangChain chose to limit the test to two easily quantifiable tasks for an agent: answering questions and scheduling meetings.
“There are many existing benchmarks for tool use and tool calling, but for the purposes of this experiment, we wanted to evaluate a practical agent that we actually use,” LangChain wrote. “This agent is our internal email assistant, which is responsible for two main areas of work: responding to and scheduling meeting requests, and supporting customers with their questions.”
LangChain’s experiment parameters
LangChain mainly used prebuilt ReAct agents through its LangGraph platform. These agents featured tool-calling large language models (LLMs) that were part of the benchmark test. The LLMs included Anthropic’s Claude 3.5 Sonnet, Meta’s Llama-3.3-70B and a trio of OpenAI models: GPT-4o, o1 and o3-mini.
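For readers unfamiliar with LangGraph’s prebuilt ReAct agents, a minimal sketch of how such an agent might be assembled is below. The tool, prompt and model choice here are illustrative placeholders, not LangChain’s actual email-assistant code, and the `prompt` keyword name can vary between LangGraph versions.

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

# Illustrative tool; the real email assistant uses its own tool set.
@tool
def send_email(to: str, subject: str, body: str) -> str:
    """Send an email to the given recipient."""
    return f"Email sent to {to}"

model = ChatAnthropic(model="claude-3-5-sonnet-latest")

# create_react_agent wires the model and tools into a ReAct-style loop.
# (In older LangGraph releases this keyword was called state_modifier.)
agent = create_react_agent(
    model,
    tools=[send_email],
    prompt="You are an email assistant. Answer customer questions and "
           "schedule meetings according to the instructions you are given.",
)

result = agent.invoke(
    {"messages": [("user", "Reply to Jane confirming Tuesday's demo.")]}
)
```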
The company broke the tests down to better assess the email assistant’s performance on the two tasks, creating a list of steps for it to follow. It began with the email assistant’s customer support capabilities, which cover how the agent accepts an email from a customer and responds with an answer.
LangChain first evaluated the tool-calling trajectory, or the order in which the agent calls its tools. If the agent followed the correct order, it passed the test. Next, the researchers asked the assistant to respond to an email and used an LLM to judge its performance.
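As a rough illustration of what this two-stage evaluation could look like in practice, the sketch below pairs a simple trajectory check with an LLM-as-judge grading step. The expected tool names, sample emails and judge prompt are assumptions for illustration, not LangChain’s published evaluation code.

```python
from langchain_openai import ChatOpenAI

def check_trajectory(actual_calls: list[str], expected_calls: list[str]) -> bool:
    """Pass only if the agent called the expected tools in the expected order."""
    return actual_calls == expected_calls

# Hypothetical expected flow for a customer-support email:
# look up the ticket, then send the reply.
expected = ["lookup_ticket", "send_email"]
actual = ["lookup_ticket", "send_email"]  # in practice, extracted from the agent run
print(check_trajectory(actual, expected))  # True -> trajectory test passes

# Second stage: a separate model judges the drafted reply.
judge = ChatOpenAI(model="gpt-4o")
verdict = judge.invoke(
    "Grade this email reply for correctness and tone. Answer PASS or FAIL "
    "with one sentence of justification.\n\n"
    "Customer email: 'Where can I find the pricing page?'\n"
    "Agent reply: 'You can find pricing at example.com/pricing.'"
)
print(verdict.content)
```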

For the second area of work, calendar scheduling, LangChain focused on the agent’s ability to follow instructions.
“In other words, the agent must remember the specific instructions provided, such as exactly when it should schedule meetings with different parties,” the researchers wrote.
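To make the instruction-following task concrete, per-party scheduling rules of this kind might be expressed as a block of prompt text like the following. These rules are invented for illustration; the blog post does not publish the assistant’s actual instructions.

```python
# Hypothetical scheduling instructions embedded in the agent's system prompt.
SCHEDULING_INSTRUCTIONS = """
When scheduling meetings, follow these rules exactly:
- Meetings with executives: schedule only between 10am and 12pm.
- Meetings with external partners: add a 15-minute buffer before and after.
- Customers based in the EU: never schedule outside their local business hours.
"""
```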
Overloading the agent
Once the parameters were defined, LangChain set about stressing and overwhelming the email assistant agent.
It defined 30 tasks each for calendar scheduling and customer support. These were run three times (for a total of 90 runs). The researchers created separate calendar scheduling and customer support agents to better evaluate the tasks.
“The calendar scheduling agent only has access to the calendar scheduling domain, and the customer support agent only has access to the customer support domain,” LangChain noted.
The researchers then added more domain tasks and tools to the agents to increase their number of responsibilities. These ranged from human resources to technical quality assurance, to legal and compliance, and a host of other areas.
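A plausible sketch of this domain-overloading setup is shown below: the same ReAct agent is handed tools and instruction blocks from progressively more domains. All domain names, tool names and instruction strings are placeholders standing in for whatever LangChain actually used.

```python
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

def make_stub(name: str):
    """Create a placeholder tool; the real experiment used working domain tools."""
    @tool(name)
    def _stub(query: str) -> str:
        """Placeholder domain tool."""
        return f"{name} called with {query}"
    return _stub

# Each extra domain contributes its own tools and its own instruction block.
DOMAINS = {
    "calendar": (["schedule_meeting", "check_availability"], "Calendar rules..."),
    "support":  (["lookup_ticket", "send_email"], "Support rules..."),
    "hr":       (["request_pto", "lookup_policy"], "HR rules..."),
    "legal":    (["flag_contract", "lookup_clause"], "Legal rules..."),
}

def build_overloaded_agent(model, active_domains: list[str]):
    """Assemble one ReAct agent carrying every active domain's tools and prompt."""
    tools, prompt_parts = [], []
    for name in active_domains:
        tool_names, instructions = DOMAINS[name]
        tools.extend(make_stub(t) for t in tool_names)
        prompt_parts.append(instructions)
    return create_react_agent(model, tools=tools, prompt="\n\n".join(prompt_parts))
```

Evaluating the same task set while growing `active_domains` one domain at a time is the basic shape of the stress test described here.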
Single-agent instruction degradation
After running the evaluations, LangChain found that single agents often became overwhelmed when told to do too much. They began forgetting to call tools or failing to respond to tasks when given more instructions and context.
LangChain found that calendar scheduling agents using GPT-4o “performed worse than Claude-3.5-Sonnet, o1 and o3 across the various context sizes, and performance dropped off more sharply than the other models when larger context was provided.” The performance of GPT-4o calendar schedulers fell to 2% once the number of domains reached seven or more.
The other models did not fare much better. Llama-3.3-70B forgot to call the send_email tool, “so it failed every test case.”

Only Claude-3.5-Sonnet, o1 and o3-mini remembered to call the tool, although Claude-3.5-Sonnet performed worse than the two OpenAI models. However, o3-mini’s performance degraded once irrelevant domains were added to the scheduling instructions.
The customer support agent can call on more tools, but for this test, LangChain said Claude-3.5-Sonnet outperformed o3-mini and o1. It also showed a shallower drop in performance as more domains were added. However, as the context window stretched further, the Claude model performed worse.
GPT-4o again performed the worst of the models tested.
“We saw that as more context was provided, instruction following got worse. Some of our tasks were designed around following niche, specific instructions (e.g., do not perform a certain action for EU-based customers),” LangChain noted. “We found that these instructions would be successfully followed by agents with fewer domains, but as the number of domains increased, these instructions were more often forgotten and the tasks subsequently failed.”
The company said it is exploring how to evaluate multi-agent architectures using the same domain-overloading method.
LangChain is already invested in agent performance, having introduced the concept of “ambient agents,” or agents that run in the background and are triggered by specific events. These experiments could make it easier to determine how best to ensure that performance.