Putting 7 Coding LLMs to the Test: Analyzing Their Performance with a Single Prompt
Putting 7 Coding LLMs to the Test: Comparing their performance on a single prompt. Analyzing models like o3, Gemini 2.5 Pro, Sonnet 4, and more to find the best AI coding assistant.
June 1, 2025

Discover the surprising results of testing seven different coding language models with the same prompt. This blog post provides an in-depth analysis of the models' performance, highlighting their strengths, weaknesses, and cost-effectiveness. Whether you're a developer or an AI enthusiast, this information will help you make an informed decision on the best model for your coding needs.
The Latest Anthropic Models: Opus 4 and Sonnet 4
Benchmarking the Models
Pricing Comparison
Testing the Models on a Web App Task
Results and Analysis
Conclusion
The Latest Anthropic Models: Opus 4 and Sonnet 4
The latest Anthropic models, Opus 4 and Sonnet 4, have been the subject of much discussion and anticipation in the AI community. Both are pitched as the best AI coding assistants available, and my testing produced some interesting insights.
Opus 4 is the larger of the two models, with a significant advantage in memory and parallel tool execution. It appears to outperform the smaller Sonnet 4 on agentic terminal coding tasks, where the model can modify or delete files. Surprisingly, though, Sonnet 4 is currently ranked higher than Opus 4 on the Aider LLM leaderboard.
In terms of pricing, Opus 4 is significantly more expensive at $75 per million output tokens, while Sonnet 4 is priced at $15 per million output tokens. This makes Sonnet 4 the more cost-effective option, especially for users with limited budgets.
My testing of these models used a single prompt that asked them to use a web search tool and synthesize the results into a dashboard, and the outcome was mixed. Both models rendered their web pages correctly, but the quality and accuracy of the information they pulled in varied. Even the high-end Opus 4 struggled to consistently provide accurate, comprehensive information.
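If you want to reproduce this kind of comparison, the sketch below sends one fixed prompt to several models through an OpenAI-compatible chat endpoint and saves each response for side-by-side review. The base URL, API key, model identifiers, and prompt wording are placeholders, not the exact setup used in my test.

```python
# Minimal sketch of a same-prompt comparison harness.
# Endpoint, key, model names, and prompt text are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://your-llm-gateway.example/v1", api_key="YOUR_KEY")

PROMPT = (
    "Use your web search tool to find the latest coding-focused LLMs, "
    "then synthesize the findings into a single-page HTML dashboard."
)

MODELS = ["claude-opus-4", "claude-sonnet-4", "gemini-2.5-pro", "deepseek-r1"]

for model in MODELS:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    # Save each model's HTML so the rendered dashboards can be compared later.
    with open(f"dashboard_{model}.html", "w") as f:
        f.write(response.choices[0].message.content or "")
```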
My conclusion is that while these models are useful for certain tasks, they may not be reliable for complex, multi-agent systems that depend on accurate information synthesis. A combination of models, rather than a single one, is more likely to give good results.
Overall, the latest Anthropic models offer promising capabilities, but you should weigh your specific needs against how the models actually perform before making a decision.
Benchmarking the Models
The results of testing the various models on the same prompt reveal a mixed performance. Most of the models rendered their web pages correctly, but the quality and accuracy of the information they synthesized varied significantly.
The Opus 4 model, which is touted as being well-suited for long, agentic tasks, did not perform as well as expected. It was not able to retrieve the latest Claude models and made some inaccurate assumptions about model sizes and release dates.
The Gemini 2.5 Pro model, on the other hand, provided a relatively consistent and professional-looking output, though it also had some issues with release dates and model information.
The Sonnet 3.7 model performed reasonably well, but it was not able to find the latest Claude 4 models and had to rely on information about the older generation.
The Qwen 2.5 Max and DeepSeek R1 models also had their own limitations, with DeepSeek R1 failing to render its web page correctly.
Overall, the results suggest that even though the models were provided with the same prompt, their ability to synthesize information and follow instructions varied significantly. This highlights the importance of thoroughly testing and validating the performance of these models, especially when they are to be used in complex, multi-agent systems.
My bias towards the Gemini 2.5 Pro model comes from its cost-efficiency, but the Sonnet models and Opus 4 are also worth considering, depending on the specific requirements of the task at hand.
Pricing Comparison
The pricing of the different models tested is an important factor to consider. Here's a breakdown of the pricing:
- Opus 4: $75 per million output tokens
- Claude Sonnet 4: $15 per million output tokens
- o3: $40 per million output tokens
- Gemini 2.5 Pro: $15 per million output tokens (with better pricing for usage under 200,000 tokens)
The most cost-efficient option appears to be Gemini 2.5 Pro at $15 per million output tokens, significantly cheaper than Opus 4 at $75 per million output tokens.
While the Sonnet 4 model from Anthropic is priced at $15 per million output tokens, the same as Gemini 2.5 Pro, the author notes that Gemini 2.5 Pro may be a more efficient choice, especially for usage under 200,000 tokens.
The pricing information provided can help users make an informed decision on which model to use, balancing performance and cost-effectiveness.
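As a rough sanity check on these numbers, here is a small estimator based only on the output-token prices listed above. It ignores input tokens and Gemini 2.5 Pro's tiered pricing, so treat the figures as ballpark estimates rather than a billing calculator.

```python
# Rough cost estimate using the per-million output-token prices quoted above.
# Input-token costs and tiered pricing are deliberately ignored in this sketch.
PRICE_PER_M_OUTPUT = {
    "Opus 4": 75.0,
    "Sonnet 4": 15.0,
    "o3": 40.0,
    "Gemini 2.5 Pro": 15.0,
}

def output_cost(model: str, output_tokens: int) -> float:
    """Estimated USD cost of generating `output_tokens` output tokens."""
    return PRICE_PER_M_OUTPUT[model] * output_tokens / 1_000_000

# Example: a 20,000-token dashboard generation run per model.
for model in PRICE_PER_M_OUTPUT:
    print(f"{model}: ${output_cost(model, 20_000):.2f}")
```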
Testing the Models on a Web App Task
Even though the same prompt was provided to all the models, the results were hit-and-miss. All the models except DeepSeek R1 rendered their web pages correctly, but the quality and accuracy of the information they provided varied significantly.
The Opus 4 model, which is touted as being capable of running tasks for hours in an agentic framework, did not perform as well as expected. Similarly, the Gemini 2.5 Pro, which is considered a cost-effective option, also had its limitations.
The key takeaway is that when working with complex, multi-agent systems, it's crucial to recheck the performance of the models, regardless of their reputation or capabilities. A single model may not be sufficient, and a combination of models may be necessary to achieve the desired results.
The results also highlighted the importance of the models' ability to perform sequential tool calls and maintain a chain of thought. Models like Opus 4 and o3, which can update their web searches and build on the information they gather as the task progresses, performed better than models that relied solely on the initial web search results.
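The loop below is a minimal sketch of that sequential pattern using OpenAI-style function calling: the model can keep issuing `web_search` calls, read the results, and refine its next query before writing a final answer. The `web_search` stub and the model name are stand-ins for whatever search backend and model you actually wire up.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; the model name below is a stand-in

def web_search(query: str) -> str:
    """Stand-in for a real search backend; returns placeholder text."""
    return f"(search results for: {query})"

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return a text summary of the results.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "Build a dashboard of the latest coding LLMs."}]

# Loop while the model keeps requesting searches, so each new query can build
# on what the previous results revealed.
while True:
    response = client.chat.completions.create(model="o3", messages=messages, tools=tools)
    msg = response.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:
        break  # no more tool requests: this is the final answer
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": web_search(args["query"]),
        })

print(msg.content)
```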
Overall, the testing revealed that while these models can be useful for various tasks, their performance can be unpredictable, and it's essential to thoroughly evaluate their capabilities before relying on them for critical applications.
Results and Analysis
As in the sections above, the same prompt produced uneven results: every model except DeepSeek R1 rendered its web page correctly, but the information they synthesized varied significantly. Neither Opus 4, despite its reputation for running agentic tasks for hours, nor the cost-efficient Gemini 2.5 Pro was immune to these problems.
The analysis shows that the models had issues with accurately retrieving the latest information, such as the release dates of the models and their benchmark scores. Some models even listed outdated or fictional models as state-of-the-art.
The ability to use the web search tool and chain of thought effectively was also a differentiating factor among the models. The Opus 4 model demonstrated the capability to perform sequential tool calls and update its information as it progressed, while models like Gemini 2.5 Pro and DeepSeek R1 were more limited in this regard.
Overall, the results suggest that when working with complex, multi-agent systems, it may be necessary to use a combination of these models rather than relying on a single one. The performance and capabilities of the models can vary, and it's important to thoroughly test and validate their behavior before deploying them in production environments.
My bias towards Gemini 2.5 Pro comes down to its cost-efficiency, but the analysis indicates that Opus 4 and the Sonnet models may offer better performance for certain tasks, despite their higher cost.
Conclusion
Even though we provided the same prompt in all of these cases, the results are hit and miss across the models. All of them rendered the web pages correctly except for R1, so you can give them passing marks on the UIs they create. However, the information they were able to put in is lacking, even for Opus 4, which is supposed to be the model that can run tasks for hours in an agentic framework, and for Gemini 2.5 Pro as well.
If you're running any complex task that is going to be using multiple agents, you definitely want to recheck, irrespective of whatever model you're using. I didn't want to test this within Cursor or Windsurf because both of these systems use more complex agentic prompts, and they have to optimize it for each of the models that they use. But in this case, it's the default behavior of the model, so you're going to see very similar results.
In reality, you could pick any one of these models; it doesn't really make a difference if you just want to work with the UIs. But if you want to collect information and synthesize it in a multi-agent system, you probably want to use a combination of these models rather than relying on a single one.
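One lightweight way to act on that advice is a routing table that sends each subtask to a different model. The assignments below are illustrative assumptions, not rankings derived from this test.

```python
# Hypothetical task router: each subtask goes to a different model.
# The mapping is an illustrative assumption, not a measured ranking.
ROUTES = {
    "ui_generation": "claude-sonnet-4",   # UI rendering was a strength across models
    "web_research": "o3",                 # handled sequential web searches well
    "long_agentic_run": "claude-opus-4",  # marketed for long-running agentic tasks
}

def pick_model(task_type: str) -> str:
    # Fall back to the most cost-efficient general option from the comparison.
    return ROUTES.get(task_type, "gemini-2.5-pro")

print(pick_model("web_research"))   # -> o3
print(pick_model("summarization"))  # -> gemini-2.5-pro
```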
Do let me know what you think based on the results that you saw so far, and which model will be your best choice. I am biased towards Gemini 2.5 Pro because of its cost; it's probably the most cost-efficient model out of everything that we have tested in here. But in terms of performance, you can pick any of the Sonnet models or the Opus model, but you're going to run into rate limits pretty soon.
FAQ