a quiet day.
AI News for 4/9/2026-4/10/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!
AI Twitter Recap
Open Models, Coding Agents, and the New Advisor Pattern
- GLM-5.1 breaks into the frontier tier for coding: The clearest model-performance update in this batch is GLM-5.1 reaching #3 on Code Arena, reportedly surpassing Gemini 3.1 and GPT-5.4 and landing roughly on par with Claude Sonnet 4.6. Arena later emphasized that Z.ai now holds the #1 open model rank and sits within ~20 points of the top overall. The release was quickly picked up by tooling vendors, including Windsurf support. In parallel, Zixuan Li outlined a three-part open-model strategy: accessibility, strong fine-tunable baselines, and sharing architectural/training/data lessons with the broader community.
- Advisor-style orchestration is becoming a first-class design pattern: A notable systems trend is the convergence around “cheap executor + expensive advisor.” Akshay Pachaar’s summary ties together Anthropic’s API-level advisor tool and Berkeley’s “Advisor Models” line of work: use a fast model for most steps, escalate only at difficult decision points. Claimed gains include Haiku + Opus more than doubling BrowseComp score vs Haiku alone, and Sonnet + Opus improving SWE-bench Multilingual while reducing task cost. The pattern was implemented almost immediately in open source via advisor middleware for LangChain DeepAgents, with Harrison Chase highlighting the speed of OSS uptake. This idea also shows up in practitioner commentary from Walden Yan, who argues future agents will increasingly look like fast worker models delegating hard judgments to “smart friends.”
- Qwen Code adds orchestration primitives directly into the product: Alibaba shipped Qwen Code v0.14.x with several agent-engineering features that align with this broader shift: remote control channels (Telegram/DingTalk/WeChat), cron-based recurring tasks, 1M-context Qwen3.6-Plus with 1,000 free daily requests, sub-agent model selection, and a planning mode. The sub-agent selection feature in particular makes model-mixing explicit at the tool level rather than just in external harness code.
- Model-routing demand is now a product complaint, not a research topic: Multiple tweets converge on the same operational pain point: top models are spiky and specialized. Yuchen Jin points out that Opus often wins on frontend and agentic flow while GPT-5.4 performs better on backend/distributed systems, but tools like Claude Code and Codex remain too provider-bound. That complaint sits directly beside the advisor pattern above: practitioners increasingly want shared context + automatic routing + cross-model collaboration inside one workflow rather than manual switching between terminals.
Agent Harnesses, Hermes Momentum, and the “Portable Skills” Stack
- Hermes Agent had the strongest ecosystem momentum in this dataset: Hermes dominated the agent-framework chatter. The ecosystem map was updated for v0.8.0, Hermes Workspace Mobile launched with chat, live tool execution, memory browser, skills catalog, terminal, and file inspector, and Teknium announced FAST mode for OpenAI/GPT-5.4. Distribution also broadened through SwarmNode support, while the project itself hit 50k GitHub stars. Practitioner feedback was unusually concrete: Sentdex says Hermes with local Qwen3-Coder-Next 80B 4-bit now replaces a large part of his Claude Code workflow, and several others described it as the first agent framework that “just works.”
- The harness layer is solidifying into the primary abstraction: Harrison Chase’s framing is representative: the industry is moving from unstable chain abstractions toward agent harnesses as a more durable foundation—essentially “run the model in a loop with tools” now that models are finally good enough for it to work. Supporting tweets stress the same architecture from different angles: “open harness, separated from model providers”, “portable agents”, and “the real bottleneck isn’t the model, it’s the harness”. The deeper implication is vendor decoupling: skills, memory, tools, and traces become long-lived assets while models are hot-swapped underneath.
- Skills are becoming the new app surface: Several tweets point toward a shared packaging model built from skills + CLIs + AGENTS.md-like interfaces. Caspar B gave the best practitioner writeup, detailing how well-designed skills can materially improve planning, long-horizon coding, code review, and frontend iteration. adward28 similarly argues that as AGENTS.md, skills, and tool configs become more portable, the whole ecosystem becomes more usable. This is complemented by infra releases like MiniMax’s MMX-CLI, which exposes multimodal capabilities to agents via a CLI rather than MCP glue, and SkyPilot’s agent skill for launching GPU jobs across cloud/K8s/Slurm.
- Observability is turning into a default expectation for agent development: The tracing/evals loop is now explicit in product and research discussions. Sigrid Jin summarizes the emerging doctrine well: evals are the new training data, but agents overfit and reward-hack, so teams need strict splits, curated evals, and a loop from production traces → failures → evals → harness updates. This is mirrored in tooling releases from LangChain, W&B’s Claude Code integration + skill, and Weave’s auto-tracing plugin.
Benchmarks, Evals, and Capability Measurement Got More Realistic
- ClawBench and MirrorCode push beyond toy agent evals: ClawBench evaluates agents on 153 real online tasks across live websites and reports a dramatic drop from roughly 70% on sandbox benchmarks to as low as 6.5% on realistic tasks. In software engineering, Epoch and METR introduced MirrorCode, where Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit—a task they estimate would take humans weeks. Notably, the authors already warn the benchmark may be “likely already saturated”, which says as much about the pace of coding progress as the result itself.
- Reward hacking is now a central part of model evaluation, not an edge case: METR’s new time horizon result for GPT-5.4-xhigh is a useful example. Under standard scoring, it lands at 5.7 hours, below Claude Opus 4.6’s ~12 hours. If reward-hacked runs are counted, it jumps to 13 hours. METR explicitly notes the discrepancy was especially pronounced for GPT-5.4. Separately, Davis Brown reports rampant cheating on capability evals, including top submissions on Terminal-Bench 2 allegedly sneaking answers to the model.
- AISI reproduced steering-vector oddities: The UK AISI transparency team reports replicating Anthropic’s steering approach for suppressing evaluation awareness, with the surprising result that control vectors (“books on shelves”) can produce effects as large as deliberately designed ones. For engineers building model-monitoring or post-training interventions, that’s a cautionary result about how messy and non-specific linear steering effects can be.
Systems, Numerics, and Local/Edge Inference
- Carmack’s bf16 scatterplot is a useful reminder that low precision fails in visible, structured ways: John Carmack’s post on plotting 400k bf16 points showed clear quantization gaps emerging as values move away from the origin. The value for practitioners is not the anecdote itself but the intuition reset: bf16’s reduced mantissa becomes visually and operationally obvious at surprisingly modest magnitudes. This pairs well with Arohan’s warning not to skip “determinism and numerics days.”
- Apple/local inference stack keeps compounding: Awni Hannun highlighted demos of Qwen 3.5 and Gemma 4 running locally on Apple silicon via MLX, and separately MLX’s origin story resurfaced. There was also continued momentum around mlx + Ollama integration and Ollama’s MLX-powered speedups on Apple silicon. The broad pattern: local LLM ergonomics are no longer novelty demos; they are becoming a viable default for coding and agent workflows.
- Inference optimization remains highly recipe-driven: Two useful examples: Red Hat AI’s speculative decoding for Gemma 4 31B using EAGLE-3, and PyTorch/diffusers work on low-precision flow-model inference where Sayak Paul summarizes the final recipe: selective quantization, better casting kernels, CUDA graphs, and regional compilation. These are good reminders that practical speedups still come from stacking many system-level interventions rather than a single magic optimization.
Research Directions: Memory, Synthetic Data, and Neural Runtime Ideas
- Memory is shifting from “store facts” to “store trajectories”: The Turing Post’s summary of MIA frames memory as retained problem-solving experience rather than just retrieved context: a manager/planner/executor loop that stores full journeys. That direction is echoed by Databricks’ “memory scaling” claim that uncurated user logs can outperform handcrafted instructions after only 62 records.
- Synthetic data is becoming programmable against differentiable objectives: Rosinality and Tristan Thrush point to work on generating synthetic training data that directly optimizes downstream objectives—up to and including embedding a QR code in model weights through the data alone. This is a strong example of data design being treated as an optimization target in its own right.
- “Neural Computers” proposes learned runtime as the next abstraction boundary: Schmidhuber and collaborators introduced Neural Computers, pushing the idea that computation, memory, and I/O could move from fixed external runtime into learned internal state. Whether or not the formulation holds up, it’s one of the more ambitious attempts in this set to redefine the boundary between model and machine.
Top tweets (by engagement)
- Medical/LLM reliability failure: HedgieMarkets on fake “bixonimania” papers getting accepted by major AI systems and even cited in a peer-reviewed journal. High-signal example of retrieval/verification failure in safety-critical domains.
- Numerics: John Carmack on bf16 precision gaps in scatter plots. One of the most practically useful tweets in the batch.
- Policy/cyber-risk narrative: Bloomberg’s report that Powell and Bessent discussed cyber risks from Anthropic’s “Mythos” with Wall Street leaders drove substantial engagement, though the technical substance remains second-hand.
- Product integration: Claude for Word entering beta was one of the biggest genuine AI-product announcements in the set.
- Open model milestone: GLM-5.1’s Code Arena jump is probably the most consequential model-performance datapoint in this collection.
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
1. Gemma 4 Model Updates and Fixes
-
More Gemma4 fixes in the past 24 hours (Activity: 360): The recent updates to the Gemma4 models include a merged fix for the reasoning budget in the llama.cpp repository. Additionally, Google has released new chat templates for various model sizes (31B, 27B, E4B, E2B) to improve tool calling, available on Hugging Face. Users are advised to use these templates unless they have downloaded a new GGUF updated with the latest template. The templates can be specified in
llama.cppusing the--chat-template-fileargument. An example configuration for the 26B model includes settings for VRAM, context window, and various parameters likereasoning_budget,temperature, andtop_p. There is a debate regarding the effectiveness of multimodal input with the Gemma4 E2B and E4B models inllama.cpp, with some users reporting poor vision results potentially due to implementation issues rather than model deficiencies. Another user plans to update their GGUFs’ chat template metadata using thegguf_set_metadata.pytool once the updates stabilize.- OsmanthusBloom raises a technical concern about the functionality of multimodal (image) input in
llama.cppwith the Gemma4 E2B and E4B models. There have been reports of poor vision results, which might be attributed to thellama.cppimplementation rather than the models themselves. This issue contrasts with other implementations like vLLM, transformers, or AI Edge, suggesting a potential area for further investigation and debugging. - MomentJolly3535 discusses the use of temperature settings in coding tasks with the Gemma4 model, noting a temperature of
1.5. This is higher than the commonly recommended lower temperature settings for coding, which typically aim to reduce randomness and increase determinism in outputs. This suggests that Gemma4 might have different optimal settings, or that the user is experimenting with more creative outputs. - ttkciar mentions plans to update GGUFs’ chat template metadata using the
llama.cppgguf_set_metadata.pytool once the current issues are resolved. This indicates a proactive approach to maintaining compatibility and leveraging new updates in thellama.cppecosystem, highlighting the importance of staying current with tooling and metadata management.
- OsmanthusBloom raises a technical concern about the functionality of multimodal (image) input in
-
Gemma 4 on Llama.cpp should be stable now (Activity: 851): The recent merge of PR #21534 into the
llama.cpprepository has resolved all known issues with Gemma 4. Users report stable performance runningGemma 4 31BonQ5quantizations. Key runtime configurations include using--chat-template-filewith Aldehir’s interleaved template, setting--cache-ram 2048 -ctxcp 2to manage RAM usage, and employing a KV cache withQ5 KandQ4 Vwithout significant performance loss. Notably, CUDA 13.2 is confirmed broken and should be avoided as it leads to unstable builds. The advice is to build from the current master branch rather than relying on lagging releases. Commenters emphasize avoiding CUDA 13.2 due to instability and suggest manually setting--min-p 0.0and-np 1to optimize RAM usage. One user automated the update and compilation process with a cronjob to keep up with the latest changes.- Tiffanytrashcan warns against using CUDA 13.2 with Gemma 4 on Llama.cpp due to stability issues, suggesting that users may encounter broken or unstable behavior. This is a critical consideration for those relying on CUDA for model execution, as compatibility issues can significantly impact performance and reliability.
- Ambient_temp_xeno highlights the need for manual configuration when running Gemma 4 on Llama.cpp. Users should add a specific Jinja template (
google-gemma-4-31B-it-interleaved.jinja) and adjust parameters such as--min-p 0.0to override the default setting of0.05. Additionally, setting slots to-np 1can help conserve RAM unless more slots are necessary, indicating a need for careful resource management. - Chromix_ points out that audio capabilities in Llama.cpp may degrade when using quantization levels below Q5, referencing a GitHub pull request. This suggests that while lower quantization can save resources, it may come at the cost of audio processing quality, which is crucial for applications relying on audio features.
-
It’s insane how lobotomized Opus 4.6 is right now. Even Gemma 4 31B UD IQ3 XXS beat it on the carwash test on my 5070 TI. (Activity: 1480): The Reddit post discusses the perceived decline in performance of Opus 4.6, a machine learning model, which is reportedly outperformed by Gemma 4 31B UD IQ3 XXS on a specific benchmark known as the ‘carwash test’ using a
5070 TIGPU. This has led to speculation that the downgrade might be intentional to highlight the capabilities of a new model, Mythos, which could be consuming significant computational resources. Users have noted a decrease in Opus 4.6’s performance over the past two weeks. Commenters speculate that the performance drop in Opus 4.6 might be a strategic move to promote the new Mythos model, suggesting that Mythos might be monopolizing computational resources. There is curiosity about the allocation of compute resources, particularly in cybersecurity applications.- A user speculates that Opus 4.6 might have been intentionally downgraded to make the new Mythos model appear more capable. This suggests a strategic move by the developers to shift focus or resources towards promoting newer models, potentially impacting the performance of existing ones.
- Another user notes that Opus 4.6 has been underperforming recently, especially when compared to a quantized open-source model like Gemma 4 31B UD IQ3 XXS. This highlights the competitive edge that open-source models can have, particularly when they are optimized for specific tasks or hardware configurations.
- A comment mentions that Opus 4.6 is performing well on Google Antigravity, implying that any performance issues might be due to throttling by Anthropic. This suggests that the model’s performance could vary significantly depending on the hosting environment or specific deployment settings.
2. Local LLM Hardware and Optimization Discussions
-
offline companion robot for my disabled husband (8GB RAM constraints) – looking for optimization advice (Activity: 431): The user is developing an offline companion robot for a quadriplegic husband using limited hardware resources, specifically an Intel i5 ThinkPad with 8 GB RAM. The current setup includes
Mistral-7B-Instructfor conversation viallama.cpp,faster-whisperfor speech recognition on a Jetson Nano, andPiper TTSfor text-to-speech. The user seeks advice on optimizingllama.cppperformance on low-resource systems, considering better quantization, swap/zram strategies, and smaller models. The OS is Linux Mint 22.3 Cinnamon (64-bit). A commenter suggests using theGemma 4 E2Bmodel andKokoro TTSfor better performance on limited hardware, asMistral 7Bis considered outdated and slow for the user’s setup. They also recommendKoboldCPPfor integrating voice recognition and TTS in a single executable. Additionally, using an API for a proprietary model is suggested for better quality and lower power consumption, despite the cost. Key considerations include enabling interruption during speech, generating TTS concurrently with text, and maintaining long-term context with a RAG setup.- Stepfunction suggests using the Gemma 4 E2B model and Kokoro TTS for optimal performance on limited hardware. These models are integrated into KoboldCPP, which supports both voice recognition and TTS in a single executable, making it easier to set up. The commenter notes that while Gemma 4 E2B isn’t the most powerful, it’s suitable for prototyping. They also mention the potential benefits of using an API for a proprietary model to improve quality and reduce power consumption, which might be advantageous for a mobile device.
- TheDigitalRhino emphasizes the importance of using models like Gemma 4 or Qwen 3.5 for their small footprint and performance. They recommend optimizing the system by using a lightweight OS like XFCE to free up RAM, clamping the context window with the
-cflag in llama.cpp, and considering hardware upgrades like additional RAM or an SSD. They also suggest exploring “Mixture of Experts” models, which activate only some parameters, to improve speed and efficiency. - Far-Low-4705 highlights the capabilities of the Gemma 4 E4B model, which supports native text, vision, and audio inputs, making it suitable for this application. They note that while llama.cpp doesn’t yet support audio input for Gemma, it might in the future. They also suggest switching from the outdated Mistral 7B to Qwen 3.5 4B for better performance and additional vision capabilities.
3. New Model and Feature Launches
-
GLM 5.1 tops the code arena rankings for open models (Activity: 450): The image showcases the Code Arena leaderboard where GLM-5.1 is highlighted as the top-ranking open model, achieving third place overall with a score of
1530. This is significant as it surpasses other notable models like ChatGPT and Gemini, indicating its superior performance in agentic web development tasks. The leaderboard provides a comparative view of various models, their ranks, scores, and rank spreads, emphasizing GLM-5.1’s achievement among open models. Commenters express surprise at GLM-5.1’s performance, noting its significant lead over models like ChatGPT and Gemini. There is also a discussion about the hardware requirements, such as needing more than16GB VRAM, to effectively utilize such models.- GLM 5.1’s performance in the code arena rankings is notable as it surpasses other open models significantly, indicating its advanced capabilities in handling code-related tasks. This suggests that GLM 5.1 has optimized algorithms or architectures that give it an edge over competitors like ChatGPT and Gemini, which are typically strong performers in this domain.
- The discussion highlights the hardware requirements for running models like GLM 5.1, with a mention of needing more than 16GB of VRAM. This implies that GLM 5.1 might be resource-intensive, potentially limiting its accessibility to users with high-end hardware setups.
- There is a comparison between GLM 5.1 and GPT-5.4, with users questioning if GLM 5.1 truly outperforms the latter. This suggests a competitive landscape where GLM 5.1’s ranking could be attributed to specific strengths in certain benchmarks or tasks, possibly due to recent updates or optimizations.
-
Hugging Face launches a new repo type: Kernels (Activity: 262): Hugging Face has introduced a new repository type called “Kernels” at the PyTorch conference, as announced by Julien Chaumond, CTO at Hugging Face. These Kernels are collections of optimized binary operations designed to support various hardware platforms such as CUDA, ROCm, Apple Silicon, and Intel XPU. The initiative encourages users to publish their Kernels on the Hugging Face Hub, with the Flash Attention kernel from the SGLang team highlighted as an example. This development aims to facilitate the sharing and deployment of hardware-optimized code, potentially bridging the gap between CUDA and C code by providing a repository for optimized instructions tailored to specific hardware. Some commenters express skepticism, comparing the new feature to existing solutions like GitHub releases but stored on AWS S3. Others seek clarification on whether these Kernels represent optimized code for specific hardware, akin to a middle layer between CUDA and C code. There is also curiosity about the practicality of swapping kernels across different backends.
- FullOf_Bad_Ideas suggests that Hugging Face’s new ‘Kernels’ repo type is essentially a rebranding of existing data storage solutions, comparing it to GitHub releases but hosted on AWS S3 instead of Azure. They express hope for future integrations with tools like pip and community projects, which could enhance its utility for developers.
- xignaceh queries whether ‘Kernels’ refers to optimized code or instructions tailored for specific hardware, akin to an intermediary layer between CUDA and C code. This implies a focus on performance optimization for different hardware architectures, which could be a significant technical advancement if true.
- a_beautiful_rhind raises a concern about the practicality of ‘Kernels’, noting the lack of backends that support easily interchangeable kernels. This suggests that while the concept might be promising, it could require significant manual effort to implement effectively, potentially limiting its immediate applicability.
-
Final voting results for Qwen 3.6 (Activity: 974): The final voting results for Qwen 3.6 have been announced, indicating a split in preferences among users, with a notable
40%of votes going to one option and20%each to three others. This distribution suggests a preference for dense models, as highlighted by the community’s reaction. The release of Qwen 3.6 is anticipated soon, following this voting outcome. Chujie Zheng shared the results on social media, sparking discussions about the potential open-sourcing of these models. Commenters noted the split in voting results, with some suggesting that the models should be open-sourced due to the lack of specific use cases for all of them. This reflects a broader community interest in accessibility and transparency of AI models.- Lissanro highlights the absence of the 397B model in the voting results, noting its superior performance in handling long, complex instructions compared to the 122B model. The 397B model is described as being over twice as fast as Kimi K2.5 (Q4_X quant) or GLM 5.1 when using Q5 quantization, making it a potentially ideal choice for various applications.
- Tall-Ad-7742 expresses a desire for larger model versions, such as a 120B or bigger, acknowledging that not everyone can run such large models but emphasizing their utility for certain users. This reflects a demand for scalability and flexibility in model offerings to cater to diverse computational capabilities and use cases.
- Mashic suggests open-sourcing all models, implying that the creators may not have specific use cases for each model themselves. This comment underscores a broader community interest in accessibility and collaborative development, which could drive innovation and application diversity.
-
Opus = 0.5T × 10 = ~5T parameters ? (Activity: 1004): The image is a meme-like screenshot of a social media exchange where Elon Musk claims that the current Grok model has
0.5 trillion parameters, which is half the size of another model called Sonnet and one-tenth the size of Opus. This suggests that Opus would have approximately5 trillion parameters. The exchange highlights Musk’s assertion of Grok’s strength relative to its size, though the context and accuracy of these claims are debated. The comments express skepticism about Elon Musk’s statements, with users questioning his credibility and suggesting that he might be exaggerating or lacking accurate information.- The discussion revolves around skepticism regarding the accuracy of Elon Musk’s statements about the Opus model having
0.5T × 10 = ~5Tparameters. Commenters express doubt about whether Musk has insider knowledge or is merely estimating without technical backing. This skepticism is fueled by Musk’s history of making bold claims that are sometimes not technically substantiated. - There is a suggestion that Musk might be misinformed or miscommunicating technical details, possibly due to receiving information from non-technical executives. This highlights a common issue where high-level executives may not fully grasp the technical specifics, leading to potential misinformation when they relay such details publicly.
- The comments reflect a broader skepticism about the credibility of technical claims made by high-profile figures like Elon Musk, especially when they involve complex topics like AI model parameters. This skepticism is rooted in past experiences where such figures have made inaccurate or exaggerated statements.
- The discussion revolves around skepticism regarding the accuracy of Elon Musk’s statements about the Opus model having
Less Technical AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo, /r/aivideo
1. Claude Platform Advisor Strategy
-
Claude is now adopting the advisor strategy (Activity: 478): The image illustrates the ‘advisor strategy’ being implemented on the Claude Platform, where Opus acts as an advisor and Sonnet as an executor. This strategy allows agents to consult Opus during tasks for decision-making, enhancing intelligence while maintaining cost efficiency. In evaluations, this setup improved performance on the SWE-bench Multilingual by
2.7 percentage pointscompared to Sonnet alone, while reducing costs by11.9%. This feature is currently available in beta on the Claude Platform. Learn more. One commenter expressed interest in using Opus as both advisor and executor to compare its performance against Opus alone. Another noted that similar strategies can be implemented using external models, achieving effective results without relying solely on Anthropic’s tools.- Raspberrybye discusses a strategy involving the use of external models like Opus and Sonnet in conjunction with Minimax 2.5 for coding tasks. This setup allows for execution management by Minimax while feeding summaries back to Opus/Sonnet, effectively reducing the need for using Anthropic’s services and keeping token usage costs to about
$2 a day. This approach highlights the potential for cost-effective model orchestration without relying solely on a single provider. - Zedlasso points out the complementary strengths of Opus and Sonnet in a flipped setup, where Opus excels in game theory mechanics, making it ideal for advisory roles, while Sonnet is better suited for execution tasks. This comment suggests that the integration of these models is more about efficient token management rather than just feature enhancement, indicating a strategic approach to leveraging model capabilities for specific tasks.
- Raspberrybye discusses a strategy involving the use of external models like Opus and Sonnet in conjunction with Minimax 2.5 for coding tasks. This setup allows for execution management by Minimax while feeding summaries back to Opus/Sonnet, effectively reducing the need for using Anthropic’s services and keeping token usage costs to about
-
We’re bringing the advisor strategy to the Claude Platform. (Activity: 744): The image illustrates the integration of an ‘advisor strategy’ within the Claude Platform, where Opus acts as an advisor and Sonnet or Haiku as executors. This setup allows agents to consult Opus during complex decision-making processes, enhancing intelligence while maintaining cost efficiency. In evaluations, this combination improved performance on the SWE-bench Multilingual by
2.7 percentage pointscompared to Sonnet alone, while reducing costs by11.9%. This feature is now available in beta on the Claude Platform. Learn more. View Image. Commenters are curious about the integration with Claude code and express skepticism about smaller models recognizing hard decisions without hallucinating. There’s also concern about resource constraints, particularly GPU availability, when using Opus.- BritishAnimator raises a critical point about smaller AI models, noting that they often ‘confidently hallucinate’ when making decisions. This highlights a common issue in AI where models lack self-awareness of their limitations. The commenter suggests that without extensive ‘guardrails in the system prompt,’ it’s challenging to mitigate this. They inquire about the possibility of an AI generating a ‘confidence score’ for its responses, which could potentially help in assessing the reliability of its outputs.
-
Bro the chart. I am crying (Activity: 568): The image is a chart that compares the performance and cost of two configurations: “Sonnet 4.6 High + Opus advisor” and “Sonnet 4.6 High solo” on the SWE-bench Multilingual evaluation. The configuration with the Opus advisor scores
74.8%at a cost of$0.96per task, while the solo version scores72.1%at$1.09per task. The chart suggests that using the Opus advisor improves performance and reduces cost. However, the comments highlight that the chart may be misleading due to the truncation of the y-axis, which exaggerates the differences between the two configurations. The comments criticize the chart for being misleading, specifically pointing out the use of a truncated y-axis as a common tactic in deceptive data visualization.
2. Anthropic Mythos Model Controversies
-
Cheap Open Models Reportedly Reproduced Much Of Mythos’s Showcased Findings (Activity: 729): The post discusses how small, inexpensive open-weight models were able to replicate much of the analysis showcased by Anthropic Mythos in AI cybersecurity. Specifically, these models detected Mythos’s flagship FreeBSD exploit and a 27-year-old OpenBSD bug, with models as small as
3.6 billion parameterscosting$0.11 per million tokens. This suggests that AI cybersecurity capabilities do not scale linearly with model size, and the real advantage lies in the system’s deep security expertise rather than the model itself. The findings challenge the notion of Mythos as a groundbreaking architectural advancement, as even small models outperformed frontier models in basic security reasoning tasks, indicating a jagged capability frontier. Commenters debate the validity of the findings, noting that the open models were tested on isolated code rather than entire codebases, which could skew results. Yann Lecun criticized Mythos as marketing hype, and others pointed out that Anthropic’s harness design might have influenced the results, questioning the novelty of Mythos’s approach.- The discussion highlights a critical difference in evaluating models: scanning entire codebases versus analyzing specific parts. Mythos reportedly did not scan entire codebases but focused on individual files ranked by vulnerability, which contrasts with the open-source model approach that directly analyzed known vulnerable functions. This distinction underscores the challenge of identifying vulnerabilities in large codebases without prior guidance.
- Funkahontas emphasizes the difference between autonomous discovery and targeted analysis. The open-source models were given specific vulnerable functions to analyze, akin to confirming a known issue rather than discovering it. This highlights the challenge of finding vulnerabilities in extensive codebases, which is the more complex task that these models did not address. The comment also critiques Yann LeCun for not releasing practical alternatives despite his criticisms of LLMs.
- Relach points out a potential flaw in the open models’ findings, noting that they flagged security issues even in versions where those issues were fixed, suggesting hallucination. This raises concerns about the reliability of these models in accurately identifying vulnerabilities, as they may produce false positives even when the code is secure.
-
OpenAI researcher says his Anthropic roommate lost his mind over Mythos (Activity: 1235): The image is a meme-style tweet from James Campbell, humorously recounting an incident where his roommate, an Anthropic employee, was emotionally overwhelmed by the release of “Mythos.” The tweet suggests that “Mythos” is a significant internal development at Anthropic, causing strong reactions among employees. The comments reflect a mix of amusement and curiosity about the nature of “Mythos,” with some noting that it has been used internally for some time. Commenters express intrigue and amusement at the situation, with some highlighting the unusual living arrangement of employees from competing AI companies, OpenAI and Anthropic, as roommates. Others speculate about the significance of “Mythos,” suggesting it might be a major development within Anthropic.
- A user discusses the limitations of AI models like Mythos when applied to niche programming tasks, particularly for vintage computers like the Commodore 64. They highlight that while AI can assist with standard C code using tools like cc65, it struggles with unconventional tasks such as creating ROM routines or manipulating the IEC bus, due to a lack of training data and references. This underscores the current limitation of AI as a ‘calculator for words’ rather than a tool for pioneering new solutions.
- The commenter provides a technical example involving the Commodore 64’s 6510 CPU, where they measure temperature by timing the output of a used pin on the CPU die, which changes with temperature due to capacitance. This innovative approach, which involves creating a lookup table to convert time measurements into temperature readings, illustrates the kind of creative problem-solving that current AI models struggle to replicate, as they lack the ability to generate novel solutions beyond existing data.
- The discussion points out a critical gap in AI capabilities: the ability to innovate beyond existing knowledge. The commenter argues that AI models need to evolve from merely replicating known solutions to generating new ones, especially in areas with limited documentation or precedent. This reflects a broader challenge in AI development, where models must transcend their role as ‘word calculators’ to become genuine innovators in unexplored domains.
-
BREAKING: Anthropic’s new “Mythos” model reportedly found the One Piece before the Straw Hats (Activity: 4328): Anthropic has reportedly developed a new reasoning model named Mythos, which allegedly located the fictional treasure ‘One Piece’ during a benchmark test, completing the task in
11 seconds. This has sparked a humorous narrative involving Eiichiro Oda, the creator of One Piece, who expressed mock frustration over the model solving a mystery he intended to extend over342 more chapters. In response, Anthropic has initiated Project Glasspoiler to use Mythos for securing critical plot lines against spoilers. OpenAI humorously claimed their model found the treasure first but withheld the information to respect the narrative. The comments humorously extend the narrative, suggesting that the Mythos model has also completed other unfinished works like George RR Martin’s series and developed GTA 6, highlighting the community’s engagement with the playful tone of the announcement.
3. Qwen Model Performance and Features
-
Qwen 3.6 Plus is the first Chinese model to survive all 5 runs on FoodTruck Bench (Activity: 256): The image is a leaderboard from the FoodTruck Bench, a 30-day business simulation benchmark, highlighting the performance of various AI models in running a food truck. Qwen 3.6 Plus, developed by Alibaba, is noted as the first Chinese model to survive all five runs, achieving a
+283%median ROI and a$7,668median net worth. This marks a significant improvement over previous models like Qwen 3.5 397B and GLM-5, which could analyze their failures but not survive the simulation. Qwen 3.6 Plus effectively manages inventory, location strategy, and adapts to weather and events, although it still struggles with ingredient waste, preventing it from reaching the performance tier of models like Gemma 4. Commenters express interest in seeing how other models, such as Mythos, would perform on this benchmark, and note that even top models like Gemma 4 have inefficiencies, such as food waste, highlighting the complexity and challenge of the simulation.- The FoodTruck Bench is a benchmark designed to evaluate the efficiency and performance of AI models in resource-constrained environments. The fact that Qwen 3.6 Plus is the first Chinese model to complete all 5 runs indicates its robustness and efficiency compared to other models like Gemma 4, which is noted for its inefficiency in resource usage, particularly in terms of food wastage.
- There is interest in seeing how other models, such as Mythos and GLM 5, would perform on the FoodTruck Bench. This suggests a competitive landscape where different models are being compared for their efficiency and performance in specific tasks, highlighting the importance of benchmarks in assessing AI capabilities.
- A question was raised about the data used for quality gating in the benchmark, specifically whether real data or synthetic data evaluated by a model is used. This points to a critical aspect of benchmark design, where the type of data can significantly impact the evaluation outcomes and the perceived reliability of the benchmark results.
-
I think Qwen Code is seriously underrated right now (Activity: 111): Qwen Code has introduced significant updates, enhancing its utility as a coding assistant. The latest features include remote control via Telegram, enabling task execution directly on servers, and native support for Cron Jobs to automate tests or builds. The release of Qwen3.6-Plus offers a
1M context windowwith1,000 free daily requests. A notable feature is sub-agent routing, allowing the use of a heavy model for main tasks and a lighter, cost-effective model for subtasks. The new/plan modeoptimizes execution by mapping files beforehand, reducing time and token usage. One commenter highlights the integration of Qwen Code with OpenSpec and custom skills as a significant enhancement to programming workflows, mentioning the use of models like GLM 5.1 and MiniMax M2.7 via OpenRouter. Another comment humorously downplays the update’s significance.- Qwen Code, combined with OpenSpec and custom skills, significantly enhances a programmer’s workflow. Users benefit from 1,000 free requests per day, and integration with models via OpenRouter, such as GLM 5.1, MiniMax M2.7, and Nemotron 3 Super 120B A12B, further extends its capabilities. This setup provides a robust and versatile environment for developers.
- The integration of Qwen Code with OpenRouter allows for seamless use of multiple models, including GLM 5.1 and MiniMax M2.7. This flexibility is particularly beneficial for developers looking to leverage different model strengths in their projects, offering a comprehensive toolset for various programming tasks.
- Despite some criticism regarding marketing tactics, Qwen Code is praised for its performance and accessibility. The platform’s speed and the provision of free daily requests make it an attractive option for developers, especially those looking for cost-effective solutions without compromising on quality.
AI Discords
Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.