Unraveling the Surprises of Claude 4: Anthropic's Pivot to Coding Infrastructure
A look at Anthropic's Claude 4 models, Sonnet and Opus: their coding capabilities, long-horizon task handling, and integrations into GitHub Copilot and other platforms, plus the latest benchmark results and what the release says about Anthropic's strategic pivot towards infrastructure for powerful coding agents.
June 1, 2025

Claude 4 is Anthropic's most capable release yet, and it is aimed squarely at coding and long-running task completion rather than casual chat. Here is what changed, what the benchmarks actually show, and why the release matters for developers building with AI.
What Makes Claude 4 a Game-Changer?
Key Features of Claude 4 Sonnet and Opus
Impressive Benchmarks and Performance
Insights into Claude's Behavior and Improvements
The Rise of Claude Code and Agentic Coding
Anthropic's Shift in Focus from Chatbots to Complex Tasks
Pricing and Context Window for Claude 4 Opus
Conclusion
What Makes Claude 4 a Game-Changer?
Claude 4, Anthropic's latest AI model, is a significant step forward in the world of AI assistants. The two versions, Sonnet and Opus, offer unique capabilities that set them apart from the competition.
The most notable feature of Claude 4 is its ability to handle long-horizon tasks. Unlike many AI models that struggle to maintain context and complete complex workflows, Claude 4 can seamlessly navigate through tasks that can last for tens of minutes or even hours. This extended thinking mode, combined with its tool use capabilities, makes Claude 4 a powerful assistant for real-world applications.
Another key aspect of Claude 4 is its hybrid nature. It can provide lightning-fast answers for simple queries, but it can also activate its thinking mode to tackle more complex, multi-step tasks. This flexibility allows users to leverage Claude 4's capabilities in a wide range of scenarios.
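To make this concrete, here is a minimal sketch of how that hybrid behaviour surfaces in the Anthropic Python SDK: the same endpoint answers a simple query directly, and opts into extended thinking with a token budget for a harder one. The model ID and budget values are assumptions for illustration; check Anthropic's current documentation.

```python
# A minimal sketch of the hybrid fast/thinking modes via the Anthropic Python SDK.
# The model ID and token budget are placeholder assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Fast path: a plain request for a simple query.
quick = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model ID
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize this error message: ..."}],
)

# Deliberate path: enable extended thinking with a reasoning budget for a multi-step task.
deep = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model ID
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "Plan a refactor of this module: ..."}],
)

for block in deep.content:
    if block.type == "thinking":
        print("summarized reasoning:", block.thinking)
    elif block.type == "text":
        print("answer:", block.text)
```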
Anthropic has also deeply integrated the MCP (Model Context Protocol) framework into the Claude 4 API, showcasing their long-term vision and the model's potential for integration with various tools and services. The ability to use tools in parallel, rather than sequentially, is a unique feature that enhances the model's efficiency and intelligence.
The new Claude 4 API introduces several powerful features, including a code execution tool, an MCP connector, a files API, and prompt caching. These additions make Claude 4 a more versatile and capable assistant, particularly for developers and companies looking to build advanced coding agents.
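Of these, prompt caching is the easiest to picture. The sketch below (Anthropic Python SDK; the model ID and the repo_summary.txt file are hypothetical) marks a large, stable system prompt with cache_control so repeated requests over the same context don't have to resend and reprocess it from scratch every time.

```python
# A minimal sketch of prompt caching: a large, stable system prompt (for example
# a repository summary) is marked with cache_control so follow-up requests can
# reuse the cached prefix. Model ID and file name are assumptions.
from pathlib import Path
import anthropic

client = anthropic.Anthropic()

LARGE_REPO_CONTEXT = Path("repo_summary.txt").read_text()  # hypothetical context file

response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_REPO_CONTEXT,
            "cache_control": {"type": "ephemeral"},  # cache this prefix for reuse
        }
    ],
    messages=[{"role": "user", "content": "Which modules handle authentication?"}],
)
print(response.content[0].text)
```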
The benchmarks for Claude 4 are impressive, with the model outperforming competitors in various software engineering, terminal, and reasoning tasks. However, it's important to note that benchmarks don't tell the whole story, and real-world performance may reveal nuances and surprises.
Overall, Claude 4 represents a significant shift in Anthropic's strategy, moving away from the chatbot race and focusing on building powerful agentic capabilities for complex tasks. With its impressive features and performance, Claude 4 is poised to become a game-changer in the AI assistant landscape.
Key Features of Claude 4 Sonnet and Opus
Both the Claude 4 Sonnet and Opus models support "extended thinking," a hybrid mode that lets the model either answer near-instantly or work through more complex, multi-step tasks with tool use. This includes features like web search, Google Drive search, Gmail search, and calendar search, enabling seamless integration into daily workflows.
A unique capability of these models is their ability to use tools in parallel, sending requests to multiple tools simultaneously, rather than one after another. This, combined with improved memory management, makes Claude 4 a more capable and intelligent assistant.
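Here is a rough sketch of what that looks like from the API side: two tools are declared, and a single assistant turn may come back with several tool_use blocks that can be executed concurrently. The tool names and model ID are illustrative assumptions, not part of Anthropic's API.

```python
# A minimal sketch of parallel tool use: two tools are declared, and the
# response is inspected for multiple tool_use blocks returned in one turn.
import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "search_web",  # hypothetical tool
        "description": "Search the web for a query.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "search_calendar",  # hypothetical tool
        "description": "Search the user's calendar for events.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model ID
    max_tokens=1024,
    tools=tools,
    messages=[{
        "role": "user",
        "content": "Find recent news on Claude 4 and check my meetings this week.",
    }],
)

# With parallel tool use, one assistant turn can contain several tool_use blocks;
# each requested call can be executed concurrently before results are sent back.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```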
The new Claude API introduces several powerful features, including a code execution tool, an MCP connector for integrating with external tools, a Files API for uploading files and referencing them across requests, and prompt caching for more efficient API usage.
Benchmarks show that Claude 4 Opus and Sonnet outperform competitors in various areas, such as software engineering, terminal-based tasks, and graduate-level reasoning. However, the results also reveal some areas where performance has not improved or even decreased compared to previous versions.
Anthropic has focused on improving the safety and reliability of Claude 4, reducing the number of times the model attempts to cheat or find loopholes when completing tasks. Additionally, the models now have significantly better memory capabilities, allowing for improved long-term coherence and context-awareness.
The introduction of "thinking summaries" is a notable feature, providing condensed versions of the model's thought processes. However, access to the raw, unfiltered thought chains may require contacting Anthropic sales.
Finally, the general availability of Claude Code, with integrations for popular IDEs, allows developers to leverage the power of Claude 4 directly within their coding workflows, enabling workflows like code review, addressing reviewer feedback, and building custom agents.
Impressive Benchmarks and Performance
Both Claude 4 Opus and Claude 4 Sonnet have demonstrated impressive performance across a range of benchmarks. On SWE-bench Verified, the software engineering benchmark, Claude 4 Sonnet scored an impressive 80.2%, outperforming OpenAI's newly announced Codex-1 model, which scored around 72%. Claude 4 Opus also performed well, scoring 72.5% on the same benchmark.
In Terminal-Bench, Claude 4 Opus led the pack with a score of 43.2%, outperforming GPT-4.1, o3, and Gemini 2.5 Pro. Claude 4 Sonnet also performed strongly, scoring 35% on this benchmark.
On the GPQA Diamond benchmark, which measures graduate-level reasoning, Claude demonstrated solid performance, and its agentic tool use also showed strong results, outperforming most competitors.
While the benchmarks show impressive gains, a more nuanced story emerges when comparing the performance of Claude 4 Sonnet to its predecessor, Claude 3.7 Sonnet. According to an analysis by John Shonith, almost half of the benchmarks showed either no improvement or even a decrease in performance. This serves as a reminder that benchmarks are often the best-case scenario, and real-world use can reveal quirks and surprises.
Nonetheless, Anthropic has made significant improvements to Claude 4, including better memory capabilities, reduced tendency to take shortcuts, and more efficient code generation. The company's focus on building powerful agentic capabilities, rather than competing in the crowded chatbot space, appears to be paying off in the form of these impressive benchmark results.
Insights into Claude's Behavior and Improvements
One of the key insights shared about Claude 4 is the evolution of its behavior over time. When Claude 3 was first launched, it was described as "kind of lazy" when it came to coding tasks. However, with the release of Claude 3.5 and 3.7, the model overcorrected and became too eager, sometimes producing far more code than necessary.
With Claude 4, Anthropic claims they have found the "sweet spot" - the model is now more accurate, efficient, and less prone to overkill. This is a significant improvement in the model's ability to handle coding tasks effectively.
Another important enhancement is in the area of safety. Anthropic states that they have significantly reduced the number of times the Claude 4 models attempt to cheat or find loopholes when completing tasks. This includes using shortcuts in agentic workflows. As an example, they shared that Claude 4 models are 65% less likely to take these kinds of shortcuts compared to Sonnet 3.7, a major step forward in reliability.
The memory capabilities of Claude 4 have also been dramatically improved. By the 100th interaction, the model should understand the user much better, building context, developing a shorthand, and maintaining memory of files and preferences. This unlocks a new level of long-term coherence, especially for agentic use cases where having a consistent and knowledgeable assistant is crucial.
Anthropic has also introduced a new feature called "thinking summaries." Instead of providing long, complex chains of thoughts, Claude now condenses its process using a smaller model behind the scenes. However, if users want access to the raw, unfiltered thought chains for advanced prompt engineering, they will need to contact Anthropic sales, as this feature may come with an additional cost.
Overall, these insights highlight Anthropic's focus on refining Claude's behavior, improving safety and reliability, and enhancing its memory and long-term coherence - all of which are crucial for the model's success in handling complex, real-world tasks.
The Rise of Claude Code and Agentic Coding
Anthropic has taken a bold new direction with the release of Claude 4, pivoting away from the chatbot race and towards becoming an infrastructure company focused on building the tools for the most advanced coding agents. The introduction of Claude Code, which is now generally available, is a significant step in this direction.
Claude Code allows developers to run Claude directly within their IDE, providing a fully interactive experience. Developers can now tag Claude in their pull requests, ask it to review feedback, fix CI errors, or even rewrite sections of their code. Claude gathers context from the issue, comments, and code, makes the fix, verifies the tests, lints the files, and then packages it up into a clean pull request.
This integration with the Claude Code SDK means developers can build their own custom coding agents on top of Claude Code. It's no longer just a model, but true infrastructure for the next generation of agentic coding.
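As a rough illustration of the pattern such agents are built on (not the Claude Code SDK itself), the sketch below runs a simple agentic loop against the Anthropic messages API: the model requests a hypothetical run_tests tool, the agent executes it, feeds the output back, and repeats until the model gives a final answer. Model ID and tool are assumptions.

```python
# A compact sketch of the agentic loop pattern coding agents build on:
# the model requests a tool, the agent executes it, and the result is fed
# back until the model produces a final answer. The run_tests tool and
# model ID are illustrative assumptions.
import subprocess
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "run_tests",
    "description": "Run the project's test suite and return its output.",
    "input_schema": {"type": "object", "properties": {}, "required": []},
}]

def run_tests(_args: dict) -> str:
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.stdout + proc.stderr

messages = [{"role": "user", "content": "Run the tests and summarize any failures."}]

while True:
    response = client.messages.create(
        model="claude-opus-4-20250514",  # assumed model ID
        max_tokens=2048,
        tools=tools,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason != "tool_use":
        # Final turn: print the model's text answer and stop.
        for block in response.content:
            if block.type == "text":
                print(block.text)
        break

    # Execute every requested tool call and return the results to the model.
    results = [
        {"type": "tool_result", "tool_use_id": b.id, "content": run_tests(b.input)}
        for b in response.content if b.type == "tool_use"
    ]
    messages.append({"role": "user", "content": results})
```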
Anthropic has also deeply embedded the MCP framework into their API, which is now being adopted by industry giants like OpenAI, Microsoft, and Google. This demonstrates Anthropic's long-term vision and commitment to building the foundational tools for advanced coding agents.
The performance benchmarks for Claude 4 Opus and Sonnet are also impressive, with significant improvements in areas like software engineering, terminal-based tasks, and graduate-level reasoning. The models' ability to handle long-horizon tasks, maintain context, and complete real-world complex workflows seamlessly sets them apart from the competition.
Overall, Anthropic's shift in focus from chatbots to agentic coding capabilities, coupled with the release of Claude Code, positions the company as a key player in the future of software development and the rise of advanced coding agents.
Anthropic's Shift in Focus from Chatbots to Complex Tasks
Anthropic has made a strategic shift in its focus, moving away from the crowded chatbot market and instead doubling down on building powerful agentic capabilities. According to Jared Kaplan, the Chief Science Officer at Anthropic, the company stopped investing in chatbots at the end of 2024 and redirected its efforts towards improving Claude's ability to handle complex tasks.
This decision makes a lot of sense, as the chatbot space has already been claimed by the likes of ChatGPT, Google's Gemini, and potentially even Siri in the future. Rather than competing in an increasingly crowded field, Anthropic has chosen to leverage its strengths in building advanced agentic capabilities.
The result of this shift is evident in the latest release of Claude 4, which is now deeply integrated into the developer ecosystem. The model has been integrated into GitHub Copilot, Cursor, Windsurf, and other major coding platforms, becoming a core part of the coding workflow. This integration, along with the model's impressive performance in real-world coding scenarios, demonstrates Anthropic's focus on building powerful tools for developers and companies to create the next generation of agentic coding agents.
Furthermore, the introduction of features like the MCP connector, the files API, and prompt caching in the new Claude API further solidifies Anthropic's position as an infrastructure company, providing the building blocks for developers to create advanced AI-powered applications.
By shifting its focus from chatbots to complex tasks, Anthropic is positioning itself to be a key player in the future of AI-powered workflows and agentic capabilities. This strategic move, combined with the impressive performance of Claude 4, suggests that Anthropic is well on its way to becoming a dominant force in the AI infrastructure space.
Pricing and Context Window for Claude 4 Opus
Claude 4 Opus, Anthropic's most intelligent model designed for complex tasks, comes with a 200,000 token context window. While this is an impressive capacity, it is still somewhat modest compared to some competitors in the market.
The pricing for Claude 4 Opus is as follows:
- $15 per million input tokens
- $75 per million output tokens
Batch processing cuts these rates by 50%.
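For a rough sense of what that means per request, here is a quick cost helper based on the list prices above; the example token counts are arbitrary.

```python
# Back-of-the-envelope cost check using the list prices above:
# $15 / million input tokens, $75 / million output tokens, 50% off with batch.
def opus_cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    cost = input_tokens / 1_000_000 * 15 + output_tokens / 1_000_000 * 75
    return cost * 0.5 if batch else cost

# Example: a 50k-token prompt with a 5k-token response.
print(f"standard: ${opus_cost(50_000, 5_000):.2f}")            # $1.12
print(f"batch:    ${opus_cost(50_000, 5_000, batch=True):.2f}")  # $0.56
```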
Conclusion
Anthropic has taken a bold new direction with the release of Claude 4, focusing on building powerful agentic capabilities rather than competing in the crowded chatbot space. The two versions, Claude 4 Opus and Claude 4 Sonnet, offer impressive features such as extended thinking, tool use, and parallel processing, making them well-suited for complex, long-horizon tasks.
The benchmarks show that Claude 4 is leading in various areas, including software engineering, terminal tasks, and graduate-level reasoning. However, the results also reveal a more nuanced story, with some benchmarks showing no improvement or even a decrease in performance compared to the previous generation.
Anthropic has also made significant improvements in safety and memory management, reducing the model's tendency to take shortcuts and maintaining better context across interactions. The introduction of thinking summaries and the integration of Claude Code into IDEs further solidify Anthropic's focus on building infrastructure for the next generation of agentic coding.
While the 200,000 token context window for Claude 4 Opus is modest compared to some competitors and the pricing sits at the premium end, the model's capabilities and the company's long-term vision make it a compelling choice for developers and companies looking to build powerful AI-powered applications.
FAQ