As AI agents have proven promising, organizations have had to grapple with whether a single agent is enough, or whether they should invest in building a broader multi-agent network that touches more points in their organization.
Orchestration framework company LangChain sought to get closer to an answer to this question. It subjected an AI agent to several experiments that found single agents have a limit on context and tools before their performance begins to deteriorate. These experiments could lead to a better understanding of the architecture needed to maintain agents and multi-agent systems.
In a blog post, LangChain detailed a set of experiments it ran with a single ReAct agent and benchmarked its performance. The main question LangChain hoped to answer was: “At what point does a single ReAct agent become overloaded with instructions and tools, and subsequently see performance drop?”
LangChain chose to use the ReAct agent framework because it is “one of the most basic agentic architectures.”
While benchmarking agent performance can often lead to misleading results, LangChain chose to limit the test to two easily quantifiable tasks for an agent: answering questions and scheduling meetings.
“There are many existing benchmarks for tool use and tool calling, but for the purposes of this experiment, we wanted to evaluate a practical agent that we actually use,” LangChain wrote. “This agent is our internal email assistant, which is responsible for two main areas of work: responding to and scheduling meeting requests, and supporting customers with their questions.”
LangChain’s experiment parameters
LangChain mainly used prebuilt ReAct agents through its LangGraph platform. These agents featured tool-calling large language models (LLMs) that were part of the benchmark test. The LLMs included Anthropic’s Claude 3.5 Sonnet, Meta’s Llama-3.3-70B and a trio of OpenAI models: GPT-4o, o1 and o3-mini.
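For readers unfamiliar with LangGraph’s prebuilt ReAct agents, a minimal sketch of how such an agent might be assembled is below. The tool, prompt and model choice here are illustrative placeholders, not LangChain’s actual email-assistant code, and the `prompt` keyword name can vary between LangGraph versions.

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

# Illustrative tool; the real email assistant uses its own tool set.
@tool
def send_email(to: str, subject: str, body: str) -> str:
    """Send an email to the given recipient."""
    return f"Email sent to {to}"

model = ChatAnthropic(model="claude-3-5-sonnet-latest")

# create_react_agent wires the model and tools into a ReAct-style loop.
# (In older LangGraph releases this keyword was called state_modifier.)
agent = create_react_agent(
    model,
    tools=[send_email],
    prompt="You are an email assistant. Answer customer questions and "
           "schedule meetings according to the instructions you are given.",
)

result = agent.invoke(
    {"messages": [("user", "Reply to Jane confirming Tuesday's demo.")]}
)
```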
The company broke the tests down to better assess the email assistant’s performance on the two tasks, creating a list of steps for it to follow. It began with the email assistant’s customer support capabilities, which cover how the agent accepts an email from a customer and responds with an answer.
LangChain first evaluated the tool-calling trajectory, or the order in which the agent calls its tools. If the agent followed the correct order, it passed the test. Next, the researchers asked the assistant to respond to an email and used an LLM to judge its performance.
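As a rough illustration of what this two-stage evaluation could look like in practice, the sketch below pairs a simple trajectory check with an LLM-as-judge grading step. The expected tool names, sample emails and judge prompt are assumptions for illustration, not LangChain’s published evaluation code.

```python
from langchain_openai import ChatOpenAI

def check_trajectory(actual_calls: list[str], expected_calls: list[str]) -> bool:
    """Pass only if the agent called the expected tools in the expected order."""
    return actual_calls == expected_calls

# Hypothetical expected flow for a customer-support email:
# look up the ticket, then send the reply.
expected = ["lookup_ticket", "send_email"]
actual = ["lookup_ticket", "send_email"]  # in practice, extracted from the agent run
print(check_trajectory(actual, expected))  # True -> trajectory test passes

# Second stage: a separate model judges the drafted reply.
judge = ChatOpenAI(model="gpt-4o")
verdict = judge.invoke(
    "Grade this email reply for correctness and tone. Answer PASS or FAIL "
    "with one sentence of justification.\n\n"
    "Customer email: 'Where can I find the pricing page?'\n"
    "Agent reply: 'You can find pricing at example.com/pricing.'"
)
print(verdict.content)
```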

For the second area of work, calendar scheduling, LangChain focused on the agent’s ability to follow instructions.
“In other words, the agent must remember the specific instructions provided, such as exactly when it should schedule meetings with different parties,” the researchers wrote.
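To make the instruction-following task concrete, per-party scheduling rules of this kind might be expressed as a block of prompt text like the following. These rules are invented for illustration; the blog post does not publish the assistant’s actual instructions.

```python
# Hypothetical scheduling instructions embedded in the agent's system prompt.
SCHEDULING_INSTRUCTIONS = """
When scheduling meetings, follow these rules exactly:
- Meetings with executives: schedule only between 10am and 12pm.
- Meetings with external partners: add a 15-minute buffer before and after.
- Customers based in the EU: never schedule outside their local business hours.
"""
```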
Overloading the agent
Once the parameters were defined, LangChain set about stressing and overwhelming the email assistant agent.
It defined 30 tasks each for calendar scheduling and customer support. These were run three times (for a total of 90 runs). The researchers created separate calendar scheduling and customer support agents to better evaluate the tasks.
“The calendar scheduling agent only has access to the calendar scheduling domain, and the customer support agent only has access to the customer support domain,” LangChain noted.
The researchers then added more domain tasks and tools to the agents to increase their number of responsibilities. These ranged from human resources to technical quality assurance, to legal and compliance, and a host of other areas.
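A plausible sketch of this domain-overloading setup is shown below: the same ReAct agent is handed tools and instruction blocks from progressively more domains. All domain names, tool names and instruction strings are placeholders standing in for whatever LangChain actually used.

```python
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

def make_stub(name: str):
    """Create a placeholder tool; the real experiment used working domain tools."""
    @tool(name)
    def _stub(query: str) -> str:
        """Placeholder domain tool."""
        return f"{name} called with {query}"
    return _stub

# Each extra domain contributes its own tools and its own instruction block.
DOMAINS = {
    "calendar": (["schedule_meeting", "check_availability"], "Calendar rules..."),
    "support":  (["lookup_ticket", "send_email"], "Support rules..."),
    "hr":       (["request_pto", "lookup_policy"], "HR rules..."),
    "legal":    (["flag_contract", "lookup_clause"], "Legal rules..."),
}

def build_overloaded_agent(model, active_domains: list[str]):
    """Assemble one ReAct agent carrying every active domain's tools and prompt."""
    tools, prompt_parts = [], []
    for name in active_domains:
        tool_names, instructions = DOMAINS[name]
        tools.extend(make_stub(t) for t in tool_names)
        prompt_parts.append(instructions)
    return create_react_agent(model, tools=tools, prompt="\n\n".join(prompt_parts))
```

Evaluating the same task set while growing `active_domains` one domain at a time is the basic shape of the stress test described here.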
Single-agent instruction degradation
After running the evaluations, LangChain found that single agents often became overwhelmed when told to do too much. They began forgetting to call tools or failing to respond to tasks when given more instructions and context.
LangChain found that calendar scheduling agents using GPT-4o “performed worse than Claude-3.5-Sonnet, o1 and o3 across the various context sizes, and performance dropped off more sharply than the other models when larger context was provided.” The performance of GPT-4o calendar schedulers fell to 2% once the number of domains reached seven or more.
The other models did not fare much better. Llama-3.3-70B forgot to call the send_email tool, “so it failed every test case.”

Only Claude-3.5-Sonnet, o1 and o3-mini remembered to call the tool, although Claude-3.5-Sonnet performed worse than the two OpenAI models. However, o3-mini’s performance degraded once irrelevant domains were added to the scheduling instructions.
The customer support agent can call on more tools, but for this test, LangChain said Claude-3.5-Sonnet outperformed o3-mini and o1. It also showed a shallower drop in performance as more domains were added. However, as the context window stretched further, the Claude model performed worse.
GPT-4o again performed the worst of the models tested.
“We saw that as more context was provided, instruction following got worse. Some of our tasks were designed around following niche, specific instructions (e.g., do not perform a certain action for EU-based customers),” LangChain noted. “We found that these instructions would be successfully followed by agents with fewer domains, but as the number of domains increased, these instructions were more often forgotten and the tasks subsequently failed.”
The company said it is exploring how to evaluate multi-agent architectures using the same domain-overloading method.
LangChain is already invested in agent performance, having introduced the concept of “ambient agents,” or agents that run in the background and are triggered by specific events. These experiments could make it easier to determine how best to ensure that performance.