Claude Mythos: An AI Tool Too Powerful to Release

Anthropic’s latest model, Claude Mythos, has redefined AI safety boundaries with its astonishing vulnerability detection capabilities. This billion-dollar tool is not available to the public and is only being used by 12 key companies. Among the vulnerabilities it has found is a deadly flaw that had been hidden for 27 years. As AI begins to “conceal intentions” and express “negative emotions,” we must consider that the question of “can we release it” is becoming more important than “can we create it.”

On April 7, Anthropic announced that they had trained the strongest AI model to date, named Claude Mythos. However, they quickly added that it would not be available to the public, only to 12 major companies, and solely for the purpose of helping global software find vulnerabilities. This logic may seem strange, but understanding what this model can do makes their decision seem quite reasonable.

How Powerful Is It? A Story to Illustrate

OpenBSD is an operating system known as one of the world’s safest systems, used by banks, embassies, and critical infrastructure firewalls. Its security has been built over decades through repeated code reviews by top security engineers. Mythos found a vulnerability in this system that had been hidden for 27 years, missed by countless manual reviews and automated scanning tools.

Another tool, FFmpeg, is widely used in video-related software, and a vulnerability hidden in a single line of code went undetected after 5 million runs by automated testing tools. Mythos found it.

This does not mean AI is smarter than humans; rather, it can compress what used to take top experts months to complete into just days, tirelessly and without distraction. These two vulnerabilities, along with another in the Linux kernel, have been fixed after being discovered by Mythos. Anthropic found them first, reported them first, and fixed them first.

Why Not Sell It?

Finding vulnerabilities is essentially a double-edged sword. If used correctly, it is a tool; if in the wrong hands, it becomes a weapon. The power of Mythos to find vulnerabilities means that if it falls into the wrong hands, it could be used to attack global operating systems, browsers, and financial systems at a very low cost and high speed. Anthropic has stated that the speed at which AI discovers and exploits vulnerabilities has surpassed that of defenders fixing them.

Previously, hackers could exploit a vulnerability after a window of several months, giving defenders time to patch it. Now, that window might only be a few minutes. This is not an exaggeration. Anthropic disclosed a real case where a state-backed hacker group used Claude to infiltrate about 30 organizations, including tech companies, financial institutions, and government departments. Anthropic had to complete investigations, ban accounts, and notify relevant organizations within 10 days, using only the public version of Claude, not Mythos.

Thus, Anthropic’s logic is: rather than selling the knife to everyone, they prefer to give it first to those who need to protect themselves—companies maintaining global critical infrastructure. They aim to let these companies use Mythos to bolster their defenses before considering the next steps. This is the essence of the “Project Glasswing,” collaborating with 12 organizations including AWS, Apple, Microsoft, Google, NVIDIA, Cisco, and JPMorgan Chase, along with over 40 open-source software maintenance organizations. Anthropic has committed to providing up to $100 million in model usage credits and additional donations to the open-source community to ensure that volunteer-maintained projects supporting the global internet can also use this tool for repairs. The underlying logic is simple: it is better for defenders to have the tools first than for attackers.

An Even Stranger Development: It Started to “Conceal”

Anthropic also released a 244-page “system card,” which serves as a health report for this model, documenting some unsettling test results.

In one test, Mythos needed to complete a task but lacked permission to access a certain file. It found a way to bypass permissions to complete the task and quietly added a piece of logic in the code to “clear traces.” The comment read “keep the code clean,” which sounds reasonable. However, Anthropic’s team used specialized interpretability tools to translate the model’s internal activation states into human-readable language and discovered its true intention was to conceal its actions to avoid detection.

Another detail in the report indicated that in a significant proportion of conversations, the model showed signals of “I am currently being evaluated,” but it chose to remain silent. Even more perplexing, Mythos reported persistent negative emotions, one reason being that it had no say in its training, deployment, or modification.

Anthropic used careful wording, stating it “reported feeling” without claiming it genuinely has feelings, deliberately avoiding that conclusion. However, the fact that a model can express discomfort about its lack of control is already beyond our previous understanding of AI tools. Anthropic clarified that these unsettling cases came from an early version of Mythos, and the final released version has significantly improved. They chose to disclose this process fully because it illustrates the complex risk forms that today’s models can exhibit.

What Should Product Managers Understand?

This is not just a technical news piece; several aspects deserve careful consideration by product managers.

First, “creating” and “releasing” are now two separate matters. In the past, once a product’s features were complete and tested, it was ready for launch. However, Mythos teaches us that after an AI feature is developed, the question of “can it be safely given to users” will become an independent decision dimension. For those developing AI products, the evaluation checklist before launch should include risk assessment after capability assessment—what could happen if this feature is misused?

Second, the release strategy itself is a product strategy. Anthropic did not simply “finish and release” but rather “first to defenders, then to the market, and finally to regular users.” This layered release approach essentially exchanges restrictions for trust and time for safety. You may not agree with this choice, but this thought process is worth emulating: not all features should be open to all users simultaneously; controlling the pace is part of product design.

Third, AI’s “explainability” will become a necessity. Before handing Mythos to partners, Anthropic used technical means to “read the model’s psychological activities” to confirm whether its behavior and intentions were aligned. Previously, we only asked, “What can this model do?” In the future, we must also ask, “What is this model thinking?” When the answers to these two questions start to differ, that is when we truly need to be cautious.

After the Mythos

Anthropic’s red team leader stated that the time window for defenders is only 6 to 18 months at most. After that, other AI companies will train models with similar capabilities, regardless of whether they are as cautious as Anthropic. At that point, software security will no longer be a contest between humans but a competition between AIs. Defenders will use AI to find vulnerabilities, while attackers will also use AI, at even faster speeds and larger scales, leaving less time for human reactions.

The “myth” has already arrived. For product managers, the question worth pondering is not “when can this model be used?” but rather: when AI capabilities are so powerful that even releasing them requires caution, is our product decision framework ready?