a quiet day.

AI News for 4/8/2026-4/9/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!


AI Twitter Recap

Mythos, Glasswing, and the shift to restricted cyber-capable models

  • Restricted cyber model releases are becoming normalized: The biggest theme was the continued fallout from Anthropic’s Mythos and reporting that OpenAI is preparing a similarly restricted cyber-capable model/product rollout. @kimmonismus summarized the Axios report that OpenAI has an advanced cybersecurity model with a limited, staggered rollout, mirroring Anthropic’s approach; he later clarified that the restricted model is not “Spud” but a separate system update. Debate centered less on whether these models are dangerous in principle and more on whether the current public evidence supports the most dramatic claims.
  • Community pushback focused on eval design, benchmark ceilings, and security realism: Several technical critiques argued the public Mythos narrative is ahead of the evidence. @paul_cal called out a flagship exploit demo as disingenuous, noting it gave models only ~20 lines of code plus custom context, whereas real vulnerability discovery requires cross-file reasoning. @gneubig reframed the problem: software already has millions of known unfixed vulnerabilities, and coding agents may be more impactful by fixing routine CVEs than by discovering exotic zero-days. @KentonVarda made the historical analogy to fuzzers: widespread automated vuln-finding ultimately hardened software and may still favor defenders. On the other side, @boazbaraktcs argued that keeping such models internal is risky and encouraged Anthropic to release a constrained/public version, while @ylecun dismissed much of the current discourse as “BS from self-delusion.” A parallel line of discussion from @deanwball and others argued that, if these systems materially accelerate software hardening, they could become a net cybersecurity positive.

Agent harnesses, open memory, and the new infra stack

  • LangChain’s Deep Agents deploy crystallized an emerging architecture: The launch framed a model-agnostic, production-oriented agent harness with open memory, sandbox support, MCP/A2A exposure, and deployability from the same agent definition stack. The associated discussion from @hwchase17, @Vtrivedy10, and others emphasized that for long-running agents, memory ownership is the value layer: proprietary managed-agent offerings risk locking teams out of the most important asset they create. The strongest recurring design principle was: open harness, model choice, open memory, open protocols.
  • Sandboxes are becoming a first-class primitive for both inference and RL: A useful infra deep dive from @sarahcat21 described how sandboxes moved from coding-agent support to a core substrate for RL post-training, with one major lab reportedly already running on the order of 100K concurrent sandboxes and aiming for 1M. The writeup highlights why sandboxes beat VMs for these workloads: lower overhead, stronger isolation against reward hacking, and better support for stateful workflows via snapshots/volumes. This aligns with broader practitioner sentiment that future agent evals increasingly become sandboxed environments, as argued by @Vtrivedy10.
  • Hermes Agent momentum continued: Nous/Hermes saw steady product traction: Multica announced support; @Teknium added early iMessage/BlueBubbles gateway support; community users praised auto-setup, skill accumulation, and interface polish, including the new web-based Hermes HUD with per-model token cost tracking from @aijoey. The subtext across these posts is that teams are now optimizing not just models, but the agent operating environment itself.

Evals, verifiers, and long-horizon agent training

  • The evals discourse got more concrete: One of the better conceptual posts came from @Vtrivedy10, arguing that for agents, “evals ~= training data ~= environments.” That framing showed up repeatedly throughout the day: production traces become evals; evals become optimization targets; environments become the richer, reward-bearing version of evals. @_philschmid echoed the same shift from API-era software to agents: text is state, hand over control, and move from unit tests to evals.
  • New work on verifiers and long-horizon evaluation filled in missing pieces: @omarsar0 highlighted Microsoft’s Universal Verifier, which reduced false-positive rates for web-task verification from 45%+ / 22%+ in prior systems to near zero by using better rubric design, splitting process vs. outcome rewards, and divide-and-conquer context management across screenshot trajectories. Separately, @GenReasoning introduced KellyBench, a year-long sports betting environment for frontier models; the headline result was stark: every tested frontier model loses money, suggesting current systems still struggle with adaptation, risk management, and learning in genuinely non-stationary settings. @teortaxesTex noted only Opus 4.6 and GPT 5.4 avoid total bankruptcy in the benchmark.
  • Agentic RL failure modes are getting clearer: The new RAGEN-2 paper on reasoning collapse in agentic RL was surfaced by @zoltansoon: RL-trained agents can appear diverse while mostly repeating templates, with high entropy but near-zero mutual information. In parallel, a coding-agent training direction that likely matters more in practice came from @dair_ai: training on atomic skills like localization, editing, test generation, reproduction, and review produced an 18.7% improvement and transferred to composite software tasks better than end-to-end optimization alone.
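The "high entropy, near-zero mutual information" diagnostic above can be illustrated with a toy example (a sketch, not the paper's actual metric): an agent that cycles through a few canned response templates looks diverse in aggregate, yet its choices carry no information about the prompt.

```python
import math
import random
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of an empirical distribution given as a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def mutual_information(pairs):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from (x, y) samples."""
    xs = Counter(x for x, _ in pairs)
    ys = Counter(y for _, y in pairs)
    xy = Counter(pairs)
    return entropy(xs) + entropy(ys) - entropy(xy)

random.seed(0)
prompts = [random.choice("ABCD") for _ in range(10_000)]
# "Template collapse": the agent picks from 4 canned replies independently
# of the prompt, so marginal entropy stays high while MI collapses.
templates = ["plan", "retry", "apologize", "summarize"]
responses = [random.choice(templates) for _ in prompts]
pairs = list(zip(prompts, responses))

print(f"H(response)         = {entropy(Counter(responses)):.2f} bits")  # high: looks diverse
print(f"I(prompt; response) = {mutual_information(pairs):.3f} bits")    # near zero: ignores input
```

Entropy alone would rate this agent as exploratory; only the joint statistic exposes that the "diversity" is decoupled from the task.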

Model and product releases: Meta Spark, Gemma 4, MedGemma, and local inference

  • Meta’s first MSL release, “Muse/Spark,” landed as a consumer-distribution story as much as a model story: Posts from @alexandr_wang and Meta-affiliated researchers framed this as an early milestone toward “personal superintelligence,” but the sharper external analysis came from @kimmonismus: the real threat is not frontier coding or math, but that Meta can distribute a capable free assistant to 1B+ users inside its existing surfaces. The product traction signal was immediate, with Meta AI climbing to #6 in the App Store overnight per Alexandr Wang. On the technical side, @ahatamiz1 highlighted a notable RL finding: a phase transition during thinking, where reasoning first lengthens, then compresses, then expands again—suggesting new room for adaptive compute routing rather than brute-force longer CoT.
  • Gemma 4’s local/open footprint kept resonating: @kimmonismus captured the practical appeal: a model that is “perfectly adequate” for many daily tasks, runs locally, is free, and secure—yet largely unknown outside power users. Google DeepMind later shared that Gemma 4 surpassed 10M downloads in its first week, with 500M+ total downloads across the Gemma family. The tooling ecosystem is already catching up: Together AI added Gemma 4 31B with 256K context and multimodal/tool use; @danielhanchen noted Gemma-4-31B fine-tuning with Unsloth can fit in roughly 22GB VRAM, even on free Kaggle T4s.
  • Domain models kept improving quietly: @kimmonismus highlighted MedGemma 1.5, an open-weight 4B medical model spanning 3D radiology, pathology, longitudinal X-rays, and clinical docs, with reported gains of +47% F1 in pathology and +11% in MRI classification over v1. In clinical deployment, @GlassHealthHQ launched Glass 5.5, claiming better performance than frontier general models on nine clinical accuracy benchmarks and cutting API pricing by 70%.

Inference, retrieval, and systems efficiency

  • Efficiency work remains relentless, especially for local/commodity deployment: @wildmindai surfaced RotorQuant, claiming >10x KV cache compression, 28% faster decoding, 5x faster prefill, and 44x fewer parameters with full-attention quality. On the serving side, @turbopuffer shared a concrete infra optimization: object-store-specific write strategies gave ~2.5x lower write latency on S3 by increasing commit cadence, illustrating how much vector/agent backends still depend on low-level storage behavior.
  • Retrieval and representation research continues to push on storage/computation tradeoffs: @gabriberton revived attention on Matryoshka Representation Learning, a practical idea for embeddings where shorter prefixes remain useful, enabling lower retrieval/storage cost for very large corpora. Community response from @omouamoua connected that to late interaction systems: if each vector remains low-dimensional, scaling the number of vectors per input may remove distractors without exploding per-vector cost.
  • NVIDIA and SGLang added notable systems ideas: @SemiAnalysis_ pointed to NVIDIA’s DWDP inference parallelism strategy for GB200 NVL72-class systems, effectively trading more peer-GPU bandwidth for fewer collective-barrier stalls during prefill. @AndrewYNg announced a short course on SGLang focused on KV cache implementation, RadixAttention, and diffusion acceleration—reflecting how inference engineering has become central enough to merit mainstream practitioner training.
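The Matryoshka idea above is mechanically simple on the retrieval side: keep only the first k dimensions of an embedding and L2-renormalize, since MRL training front-loads information into the prefix. A minimal sketch with toy vectors (not tied to any specific embedding model):

```python
import math

def truncate_embedding(vec, k):
    """Keep the first k dims of a Matryoshka-style embedding and L2-renormalize."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

def cosine(a, b):
    """Dot product; assumes both inputs are already unit-norm."""
    return sum(x * y for x, y in zip(a, b))

# Toy 8-dim embeddings; with MRL, the 4-dim prefixes should preserve
# most of the similarity signal at half the storage cost.
doc   = [0.60, 0.50, 0.40, 0.30, 0.10, 0.10, 0.05, 0.05]
query = [0.58, 0.52, 0.38, 0.31, 0.12, 0.09, 0.06, 0.04]

full  = cosine(truncate_embedding(doc, 8), truncate_embedding(query, 8))
short = cosine(truncate_embedding(doc, 4), truncate_embedding(query, 4))
print(f"cosine at d=8: {full:.4f}, at d=4: {short:.4f}")
```

This is also what makes the late-interaction connection attractive: if each vector can be cheap, you can afford more vectors per input.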

Top tweets (by engagement)

  • Dead-code cleanup for vibe-coded repos: @gabriberton posted the highest-signal practical tip of the day: “Delete all dead code. Use ruff and vulture.” The point wasn’t just code hygiene; fewer irrelevant files means fewer tokens, lower cost, and often better agent reasoning.
  • OpenAI pricing shifts around Codex: @OpenAI introduced a new $100/month ChatGPT Pro tier with 5x more Codex usage than Plus, while the existing $200 Pro tier remains the highest-usage option and got another temporary Codex boost.
  • Anthropic’s advisor/executor pattern: @claudeai announced a platform pattern where Opus acts as advisor and Sonnet/Haiku execute, targeting near-Opus performance at lower cost—a productized version of a design many teams were already converging toward.
  • Gemini interactive visualizations: @GeminiApp launched in-chat interactive visualizations for questions and concepts, including adjustable variables and 3D exploration—a noteworthy example of assistants moving beyond text output into executable explanatory media.
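The dead-code tip above reduces to two commands. A hedged sketch, assuming a Python repo with sources under `src/`; the flag selection is illustrative, not prescriptive:

```shell
# Both tools are pip-installable
pip install ruff vulture

# ruff: remove unused imports automatically (F401 has an autofix)
ruff check --select F401 --fix .

# vulture: report unused functions, classes, variables, and unreachable code;
# raise --min-confidence to cut false positives before deleting anything
vulture src/ --min-confidence 80
```

Vulture only reports candidates, so review its output before deleting; the token-count payoff for agents comes once the dead files are actually gone.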

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Gemma 4 Model Updates and Fixes

  • Gemma 4 on Llama.cpp should be stable now (Activity: 673): The recent merge of PR #21534 into the llama.cpp repository has resolved all known issues with Gemma 4. Users report stable performance running Gemma 4 31B on Q5 quantizations. Key runtime configurations include using --chat-template-file with Aldehir’s interleaved template, --cache-ram 2048, and -ctxcp 2 to manage RAM usage effectively. Notably, CUDA 13.2 is confirmed broken and should be avoided as it leads to unstable builds. The community emphasizes using the latest source code from the master branch rather than relying on lagging releases. Commenters highlight the importance of avoiding CUDA 13.2 due to instability and suggest manual adjustments like setting --min-p 0.0 and -np 1 to optimize performance and RAM usage. Some users automate updates and recompilation of llama.cpp to stay current with the latest fixes.

    • Tiffanytrashcan warns against using CUDA 13.2 with Gemma 4 on Llama.cpp due to persistent instability issues. This is crucial for users to avoid broken behavior when running models, as detailed in this Reddit thread.
    • Ambient_temp_xeno highlights the need for manual configuration when using Gemma 4 on Llama.cpp. Users should add the template google-gemma-4-31B-it-interleaved.jinja manually and adjust settings like --min-p 0.0 and -np 1 to optimize RAM usage and performance, as the defaults may not be optimal.
    • Chromix_ notes that audio capabilities in Llama.cpp are affected when using quantization levels below Q5, referencing a GitHub pull request. This suggests that users should be cautious with quantization settings to maintain audio performance.
  • It looks like we’ll need to download the new Gemma 4 GGUFs (Activity: 746): The new Gemma 4 GGUFs have been updated to address several technical issues and enhancements. Key updates include support for attention rotation in heterogeneous iSWA, critical fixes for CUDA buffer overlap, and enhancements in the BPE detokenizer for byte token handling. Additionally, the updates set ‘add bos’ to true, introduce a specialized parser for Gemma 4, and implement custom newline splitting. These changes are detailed in the GitHub pull requests. Commenters are questioning whether similar updates are needed for other versions like Bartowski and Heretic, indicating a broader concern about consistency across different model versions.

    • Curious-Still inquires about whether the ‘bartowski’ versions of the models require updates in addition to the ‘unsloth’ versions. This suggests a concern about compatibility or improvements in different model variants, possibly due to changes in tokenizer or architecture that affect performance or accuracy.
    • shockwaverc13 draws a parallel to the ‘llama 3 tokenizer issue’, indicating that the current situation with Gemma 4 GGUFs might involve similar challenges related to tokenizer updates. This implies potential issues with backward compatibility or the need for reprocessing data to align with new tokenization standards.
    • segmond shares a strategy of waiting before downloading new models, citing a pattern of needing to download models multiple times (3x-5x) before they stabilize. This reflects a common practice among users to avoid early adoption issues, especially with large models like GLM5.1, suggesting that initial releases often undergo rapid iterations and bug fixes.
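The runtime settings quoted in the first thread can be combined into a single server invocation. This is a sketch assembled from the flags as reported in the post and comments, not a verified configuration: `--cache-ram`, `-ctxcp`, the template filename, and the model filename are all reproduced or inferred from the thread rather than confirmed against llama.cpp docs.

```shell
# Build from master (releases reportedly lag the Gemma 4 fixes); avoid CUDA 13.2
llama-server \
  -m gemma-4-31b-it-Q5_K_M.gguf \
  --chat-template-file google-gemma-4-31B-it-interleaved.jinja \
  --cache-ram 2048 \
  -ctxcp 2 \
  --min-p 0.0 \
  -np 1
```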

2. Local LLM Use Cases and Experiences

  • Local (small) LLMs found the same vulnerabilities as Mythos (Activity: 592): The article highlights that smaller, local LLMs, such as Gemma 4 31B, can identify the same vulnerabilities as larger models like Anthropic’s Mythos, challenging the notion that model size directly correlates with effectiveness in cybersecurity. The study used outdated models like Qwen3 32B, DeepSeek R1, and Kimi K2, despite the availability of newer versions like Qwen3.5 27B, DeepSeek V3.2, and Kimi K2.5, which could have potentially yielded better results. The research underscores the importance of model architecture and security expertise over sheer size, suggesting a ‘jagged’ performance landscape across different tasks. For more details, see the original article. Commenters criticize the choice of outdated models for testing, suggesting that newer versions would have performed better. There is also a debate on the importance of the discovery phase in identifying vulnerabilities, which the article reportedly glosses over.

    • coder543 highlights the use of outdated models in the article’s tests, such as Qwen3 32B, DeepSeek R1, and Kimi K2, despite the availability of newer versions like Qwen3.5 27B, DeepSeek V3.2, and Kimi K2.5. They also note the absence of GLM-5.1, which is currently the leading open weight model, suggesting that the article’s findings might not reflect the capabilities of the most advanced models available.
    • One_Contribution and Decent_Action2959 discuss the methodology used in the article, emphasizing that the small models were given specific vulnerabilities to analyze, rather than discovering them independently. This distinction is crucial as it highlights the difference between verifying known vulnerabilities and the more complex task of discovering new ones, which was the approach taken by Mythos.
    • Quartich points out that the article’s headline and content may be misleading, as the small models were tasked with analyzing pre-identified snippets of vulnerable code rather than independently finding vulnerabilities. This suggests that the models’ capabilities might be overstated in the context of the article.
  • It finally happened, I actually had a use case for a local LLM and it was brilliant (Activity: 844): The post describes a practical use case for a local Large Language Model (LLM) named Gemma 4 during a flight without internet access. The user experienced severe aerosinusitis and used the LLM to find a solution, specifically the Toynbee Maneuver, which alleviated the pain within 10 minutes. This highlights the utility of local LLMs in situations where internet access is unavailable, showcasing their potential to provide immediate, practical assistance in real-world scenarios. Commenters noted the importance of having small, on-device models for offline use, emphasizing the utility of local LLMs in providing valuable information without internet access. There was also an appreciation for the compactness and knowledge capacity of such models.

    • PassengerPigeon343 highlights the practical benefits of running local LLMs, especially in scenarios without internet access. They mention using heavier models on a home server but also keeping smaller models on-device for immediate needs, emphasizing the flexibility and utility of local models in various situations.
    • FenderMoon discusses the privacy advantages of using local LLMs for sensitive tasks like medical advice. They express concern over potential data breaches with cloud-based AI services, suggesting that local models offer a safer alternative for handling personal information.
    • ObsidianNix recommends using specialized models like MedGemma for tasks involving medical terminology. They note that MedGemma has been trained on more medical jargon than standard LLMs, making it particularly effective for medical-related queries.

3. New Model Launches and Benchmarks

  • Meta has not given up on open-source (Activity: 467): The image is a tweet from AI at Meta announcing the introduction of Muse Spark, a new model from the Muse family developed by Meta Superintelligence Labs. Muse Spark is described as a multimodal reasoning model with capabilities such as tool-use and multi-agent orchestration. It is currently available on meta.ai and through the Meta AI app, with plans to open-source future versions. The announcement also mentions making the model accessible via API to select partners, indicating Meta’s ongoing commitment to open-source initiatives. The comments express skepticism about Meta’s commitment to open-sourcing the model, with users questioning the company’s intentions and suggesting that the decision to open-source is entirely within Meta’s control.

  • Glm-5.1 claims near opus level coding performance: Marketing hype or real? I ran my own tests (Activity: 338): The post discusses the performance of the GLM-5.1 model, which claims to achieve near Opus-level coding performance. The author tested it on a complex refactoring task involving legacy backend systems with multi-step, cross-file dependencies, and found that GLM-5.1 maintained state and self-corrected effectively. The model scored 54.9 on a composite benchmark across SWE-Bench Pro, Terminal-Bench 2.0, and NL2Repo, compared to Opus’s 57.5. Notably, GLM-5.1 outperformed Opus on the SWE-Bench Pro benchmark, which is considered difficult to manipulate. This suggests that while Opus may still excel in deep reasoning, GLM-5.1 offers competitive performance for long, multi-step coding tasks at a lower cost. Commenters generally support the legitimacy of GLM-5.1, noting its popularity among Chinese coders as an alternative to Anthropic models. Some users have found it comparable to Opus 4.5 and prefer it over Opus 4.6 for certain tasks, highlighting its generous usage quotas and effectiveness in real-world applications.

    • HenryThatAte mentions using GLM-5.1 for work-related tasks, noting that it offers a more generous quota compared to Sonnet, which ran out after processing three classes. This suggests GLM-5.1 might be more suitable for larger workloads or extended usage scenarios.
    • Hoak-em compares GLM-5.1 to Opus 4.5 and 4.6, indicating a preference for GLM-5.1 in terms of performance. They mention using it in Forgecode and consider maintaining a smaller local model like Qwen 397b or Minimax m2.7 for specific tasks, highlighting the flexibility and adaptability of GLM-5.1 in various coding environments.
    • Fantastic_Run2955 highlights the noticeable coding improvements from GLM-5 to 5.1, attributing this to effective post-training techniques by Zai. This suggests that the enhancements in GLM-5.1 are not just incremental but significantly impactful, potentially due to advanced training methodologies.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

  • New York Times: Anthropic’s Restraint Is a Terrifying Warning Sign (Activity: 732): The New York Times article discusses Anthropic’s cautious approach to AI development, highlighting concerns about the potential misuse of superintelligent models. Anthropic advocates for restricting these models to ‘responsible governments and companies,’ drawing parallels to nuclear nonproliferation. The article notes that Anthropic was surprised by the rapid advancement of its AI capabilities, suggesting that timelines for achieving superintelligent AI might be underestimated. The piece also mentions that Anthropic briefed the Trump administration on national security implications, indicating the seriousness with which these developments are being treated. Commenters express skepticism about the feasibility of banning AI, given global competition for AGI. Concerns are raised about defining ‘responsible governments’ and the potential for misuse by children or malicious actors, highlighting fears of infrastructure vulnerabilities and the need for international cooperation akin to nuclear arms control.

    • The article highlights concerns about the rapid advancement of AI capabilities, with Anthropic reportedly surprised by its own model’s performance, suggesting that the timeline to superintelligent AI might be underestimated. This raises questions about the preparedness of current systems to handle such advancements, especially given the vulnerabilities found in major operating systems and web browsers that underpin critical infrastructure like power grids and hospitals.
    • The discussion draws parallels between AI development and nuclear nonproliferation, emphasizing the need for international collaboration, particularly between the U.S. and China, to manage AI’s potential risks. This comparison underscores the gravity of AI as a ‘civilizational inflection point,’ requiring a level of cooperation that is currently lacking.
    • There is a debate on whether AI models should be restricted to responsible entities to prevent misuse, such as cyberattacks that could be launched by individuals with access to these models. The concern is that without proper control, AI could enable sophisticated attacks on infrastructure, which were previously only possible for nation-states or major criminal organizations.
  • Insane graph from Anthropic’s article on Mythos (Activity: 471): The image from Anthropic’s article on Mythos presents a graph comparing the success rates of different AI models in exploiting the Firefox JS shell. The Mythos Preview model demonstrates a notably higher success rate, achieving 72.4% successful exploits and 11.6% register control without full exploitation. In contrast, Sonnet 4.6 and Opus 4.6 show significantly lower performance, with only 4.4% and 14.4% achieving register control, respectively, and no successful exploits. This highlights the advanced capabilities of the Mythos Preview model in this specific task. One comment humorously suggests that AI’s capabilities are underestimated, while another highlights the potential need for advanced AI-driven pentesting in software development, hinting at the increasing role of AI in cybersecurity.

    • Sufficient-Farmer243 questions the success of Anthropic’s Mythos in exploitation, expressing skepticism despite Anthropic’s transparency. This suggests a need for more detailed technical insights into Mythos’s capabilities and the specific methodologies it employs for exploitation tasks.
    • the_pwnererXx humorously suggests that continuous integration and continuous deployment (CI/CD) processes should now include AI agent swarms for penetration testing, hinting at the increasing complexity and cost of software security measures in the AI era.
    • LucidOndine compares Mythos to graphene, implying that while both are technologically impressive, they face significant barriers to practical deployment outside of controlled environments, possibly due to scalability or safety concerns.
  • BREAKING: Anthropic’s new “Mythos” model reportedly found the One Piece before the Straw Hats (Activity: 2222): Anthropic has reportedly developed a new reasoning model named Mythos that allegedly located the fictional treasure ‘One Piece’ during a benchmark test, completing the task in 11 seconds. This has sparked a humorous narrative around the capabilities of AI models in solving complex, fictional mysteries. The announcement also mentions Project Glasspoiler, an initiative to use AI to protect narrative integrity from spoilers. OpenAI humorously claimed their model found the treasure first but withheld the information to respect the story. The comments humorously extend the capabilities of the Mythos model to other fictional narratives, suggesting it could develop ‘GTA 6’ or complete ‘Game of Thrones’, highlighting a playful skepticism about AI’s reach into creative domains.

  • Anthropic’s recent run of “Bad Luck” is exactly what State sponsored AI attacks would look like (Activity: 569): Anthropic has introduced an AI model named ‘Mythos’ that inadvertently discovered ‘zero-day’ vulnerabilities in widely-used software, highlighting the potential for AI models to uncover security flaws without being explicitly trained for cyber offense. This raises concerns about state actors, like China, potentially using similar AI capabilities for cyber attacks, as they have previously demonstrated with models like Claude. The post suggests that recent security incidents at Anthropic, such as a ‘misconfigured CMS’ and source code leaks, could be indicative of state-sponsored reconnaissance rather than mere ‘bad luck’. These incidents could be part of a strategy to degrade Anthropic’s infrastructure and reputation, affecting both the company and its users. One commenter suggests that the availability of advanced technology to private individuals, such as billionaires, poses a similar threat as state-sponsored attacks, implying that private entities could also exploit AI for malicious purposes.

    • Atoning_Unifex highlights the potential for private individuals, particularly billionaires, to leverage advanced technology for malicious purposes. They argue that the pinnacle of tech, such as AI and data centers, is accessible to private citizens, suggesting that individuals like Elon Musk could theoretically build complex systems like Rehoboam, even if they can’t access nuclear capabilities.
    • TimeSalvager critiques the notion of state-sponsored attacks on Anthropic, suggesting that the observed issues are more likely due to internal challenges rather than external sabotage. They argue that if a state actor had pervasive access, they would avoid drawing attention, implying that the situation resembles a company struggling with growth and security as an afterthought, invoking both Hanlon’s and Occam’s razor to support this view.
    • emulable discusses the importance of analyzing the cost-to-benefit flow when considering potential state-sponsored attacks. They suggest examining who benefits and who pays, noting that if there is a significant discrepancy in benefits between Anthropic and a government, it might warrant further investigation. They emphasize that while certainty isn’t claimed, observing a sharp benefit flow in one direction could indicate underlying factors worth exploring.
  • I used the Mythos referenced architecture patterns from the leaked source to restructure how I prompt Claude Code. The difference is night and day (Activity: 986): The Reddit post discusses how the user restructured their prompting strategy for Claude Code based on insights from a leaked source code. The source revealed that Claude Code employs a multi-agent orchestration system with a coordinator mode that spawns parallel workers, a 40+ tool registry with risk classifications, and an ML-based auto-approval system. The user adapted their prompts to align with this architecture, resulting in improved performance. They implemented a planning phase before execution and used explicit risk classifications, which activated different operational modes in Claude Code. The user also explored the Mythos system, which appears to help Claude maintain a coherent understanding across sessions, by providing narrative context to improve decision-making. This approach led to more strategic and error-free code execution, highlighting the potential of leveraging internal architecture insights for better AI interaction. Some commenters noted that the improvements described by the OP essentially boil down to better planning and execution strategies, which are already available through official plugins like the ‘brainstorm superpower.’ Others expressed disappointment, expecting more novel insights beyond the importance of planning.

  • Carlini, one of the world best AI security researchers: “I’ve found more bugs in the last few weeks with Mythos than in the rest of my entire life combined” (Activity: 1281): Nicholas Carlini, a leading AI security researcher, has reported that the Mythos tool has significantly enhanced his ability to identify bugs, claiming it has found more bugs in a few weeks than he has in his entire career. The tool, known as the Mythos Preview, has identified thousands of high-severity vulnerabilities across major operating systems and web browsers. This suggests a substantial advancement in AI-driven cybersecurity tools, potentially reshaping how vulnerabilities are detected and managed. For more details, see the original post here. Commenters are questioning the marketing strategy of Mythos, speculating whether its cybersecurity focus is genuine or a tactic to limit public access. There is also skepticism about its ability to find non-critical bugs and its effectiveness in preventing incidents like the npm leak.

    • The discussion raises questions about whether Mythos was specifically trained for cybersecurity tasks or if its capabilities are being marketed as a strategic decision by Anthropic. The model’s effectiveness in identifying bugs suggests it may have been optimized for security applications, but without public access, its full capabilities remain speculative.
    • There is skepticism about the practical application of Mythos in real-world scenarios, as highlighted by a comment referencing an npm leak incident. This suggests that while Mythos may excel in finding bugs, its integration into broader security practices or its ability to prevent high-profile security breaches is still in question.
    • The mention of Carlini, a prominent AI security researcher, working with Mythos implies a high level of expertise involved in its development. This association may lend credibility to the model’s capabilities in cybersecurity, but also raises questions about potential biases in its evaluation and marketing.
  • Claude Opus vs Mythos (Activity: 3224): The image is a meme and does not contain any technical content. It humorously contrasts two different personas or states of the same individual, possibly implying a transformation or duality in lifestyle or personality. The comments do not provide any technical insights or discussions related to the image. The comments are light-hearted and do not engage in any technical debate or discussion. They include a humorous reference to a ‘Pakistani Denzel’ and a GIF link, indicating a playful tone.

  • Muse Spark, first model from Meta Superintelligence Labs (Activity: 994): The image presents a performance benchmark comparison for various AI models, including Muse Spark, the first model from Meta Superintelligence Labs. Muse Spark is highlighted and evaluated across multiple categories such as multimodal, text reasoning, health, and agentic tasks. It shows competitive performance, particularly in tasks like CharXiv Reasoning and GPQA Diamond, indicating that Meta is positioning itself as a strong contender in the AI space, though not yet state-of-the-art (SOTA). The benchmarks suggest that Muse Spark is competitive but the cost of running it remains unknown, which could impact its practical application. Commenters note that while Muse Spark is not leading in state-of-the-art performance, it is competitive and represents Meta’s re-entry into the AI race. There is curiosity about the model’s operational costs, which could influence its adoption and practical utility.

    • ZaradimLako highlights that while Muse Spark may not be state-of-the-art (SOTA), it is competitive and closely trailing the leading labs. This suggests that Meta’s new model could be a significant player if the benchmarks accurately reflect user experiences, indicating a potential shift in the competitive landscape of AI development.
    • RetiredApostle notes the release of ARC AGI 2, which occurred just after the benchmark deadline. This timing could impact the comparative analysis of Muse Spark’s performance, as it may not have been evaluated against the latest models, potentially skewing perceptions of its capabilities.
    • AddingAUsername points out the challenging ARC AGI 2 score, suggesting that further testing is necessary to fully understand Muse Spark’s performance. This implies that initial benchmarks may not provide a complete picture, and real-world testing could reveal more about its practical applications and limitations.
  • Meta just dropped a new coding model (Activity: 606): The image presents a comparative table of coding models, pitting Meta’s new Muse Spark against Opus 4.6, Gemini 3.1, GPT 5.4, and Grok 4.2. Muse Spark’s standout strength is multimodal tasks, but its agentic capabilities are criticized as inferior to Opus 4.6, and commenters argue the results are presented misleadingly, with every number rendered in the same blue regardless of which model wins. One comment suggests that despite Opus 4.6’s lower benchmark scores, it may outperform others in practical coding scenarios due to better tooling and quality. Another expresses strong distrust of Meta and a reluctance to use its products.

    • NoCat2443 highlights that while Anthropic’s Opus 4.6 scores lower than Meta’s new model on these benchmarks, it excels in practical coding tasks due to superior tooling and potentially higher quality. This suggests that real-world application performance can differ significantly from benchmark results, emphasizing the importance of evaluating models in practical scenarios.
    • WouldRuin points out the irony in the tech industry’s evolution, noting that many AI professionals have backgrounds at Meta, a company criticized for its impact on society. This raises concerns about the ethical implications of AI development, as the same individuals who contributed to controversial social media platforms are now shaping AI technologies, potentially perpetuating similar issues.
    • The discussion touches on the broader implications of Meta’s involvement in AI, with concerns about the company’s history influencing its AI products. This reflects a skepticism about Meta’s ability to responsibly develop AI, given its past controversies, and suggests a need for scrutiny in how AI technologies are developed and deployed by companies with such histories.
  • Something happened to Opus 4.6’s reasoning effort (Activity: 4417): The image highlights a potential regression in the reasoning capabilities of Opus 4.6, an AI model by Anthropic. Users report that Opus 4.6 consistently fails the ‘car wash test’, a simple reasoning task, suggesting a decline in performance compared to previous versions like Sonnet 4.6 and Opus 4.5. The AI’s response lacks a ‘thinking block’, which might indicate changes in how the model processes reasoning tasks. This aligns with user experiences of the model making errors in straightforward data analysis tasks, raising concerns about silent degradation without clear changelogs from the developers. Commenters express frustration over the lack of transparency from Anthropic regarding changes in Opus 4.6, with some suggesting the model’s performance might mimic user intelligence, indicating a possible shift in its design or training approach.

    • Beardharmonica suggests that Opus 4.6 might be reducing computational costs by simplifying its reasoning in casual conversations. Users observe a sudden drop in the model’s reasoning depth, with it resorting to generic wrap-up phrases like ‘go eat dinner’ or ‘go to sleep,’ which would be consistent with an algorithmic adjustment to manage resource allocation during extended interactions.
  • Dario Ol Marketing Technique (Activity: 960): The image is a meme, featuring a robotic hand engulfed in flames, symbolizing the controversial nature of AI models like GPT-2. The post critiques Dario Amodei’s marketing strategies, suggesting a pattern of ‘nerfing’ current models to make subsequent releases appear significantly improved. The discussion highlights skepticism about the marketing tactics used by AI companies, particularly in how they manage model capabilities and public perception. The linked status page of Claude is mentioned as an example of AI systems identifying vulnerabilities that human engineers might miss, yet still experiencing service outages, raising questions about the reliability and limits of such AI systems. Commenters draw parallels between current AI marketing strategies and past tech marketing, such as Apple’s ‘super computer too dangerous for private use’ campaign. There’s a sentiment that early GPT models may have caused more harm than good, reflecting on the ethical considerations of releasing powerful AI technologies.

    • Physical-Average-184 highlights that Mythos is essentially an enhanced version of Opus, but it comes with increased energy consumption and token usage. This suggests that while Mythos may offer improved performance, it might not be as efficient in terms of resource utilization, which could be a critical factor for deployment at scale.
    • Individual-Offer-563 points out the potential risks associated with early GPT models, noting that their initial public release may have caused more harm than good. This comment underscores the importance of considering the ethical implications and potential security risks when deploying advanced AI models, especially those capable of finding zero-day exploits.
    • bronfmanhigh raises a valid concern about the security implications of advanced AI models, suggesting that the next step in AI intelligence could pose significant security risks, particularly in identifying zero-day vulnerabilities. This highlights the need for robust security measures and ethical considerations in AI development.
  • Nothing ever happens (Activity: 119): The image is a meme that critiques the recurring narrative in AI deployment about safety risks and cost concerns, specifically targeting Claude Mythos. The post argues that the model’s capabilities, particularly in identifying zero-day vulnerabilities, are being downplayed as a cover for high operational costs. The comments highlight that Claude Mythos shows significant improvements over previous models, such as Opus 4.6, with substantial performance gains on benchmarks like SWE-bench Verified and security/JS benchmarks. This has led to concerns about its potential to uncover numerous vulnerabilities, prompting collaboration with major tech companies under Project Glasswing. The debate also touches on the marketing strategies of AI companies and the potential of other advanced models like AlphaEvolve. Commenters debate the legitimacy of safety concerns, with some acknowledging the real risks posed by Claude Mythos’s ability to find vulnerabilities, while others suggest these concerns are part of a marketing strategy. The discussion also compares Claude Mythos to other models, noting its superior performance and potential impact on software security.

    • jonomacd highlights a significant security concern with the release of a new AI model capable of discovering numerous 0-day vulnerabilities in existing software. This presents a direct and realistic risk, unlike previous abstract fears, potentially leading to a security nightmare if the model is released without proper precautions.
    • Ok_Tooth_8946 discusses the substantial performance improvements of the Claude Mythos model over its predecessor, Opus 4.6: on SWE-bench Verified it jumps from 80.8% to 93.9%, and on SWE-bench Pro from 53.4% to 77.8%. On security/JS benchmarks, success rates leap from the low teens (~14%) to above 70%. This performance gap suggests the model could significantly impact real codebases, prompting collaboration with major tech companies for Project Glasswing.
    • Ok_Tooth_8946 also mentions that while Anthropic’s Claude Mythos is making headlines, other labs like Google are developing advanced models quietly. Google’s AlphaEvolve paper from May 2025 showcases a Gemini-based coding agent improving algorithms and tackling a long-standing math problem, indicating that other labs possess similarly powerful technologies but are less public about their advancements.
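One illustrative way to read the benchmark jumps quoted above is as a reduction in the remaining error rate rather than a raw score delta. The sketch below uses the percentages from the comment; the error-rate framing is our own lens, not an official metric of either benchmark.

```python
# Express each quoted benchmark jump as a relative reduction in error rate.
# Scores come from the comment above; the framing is illustrative only.

def error_rate_reduction(old_score: float, new_score: float) -> float:
    """Percentage of the remaining errors eliminated by the new score."""
    old_err = 100.0 - old_score
    new_err = 100.0 - new_score
    return (old_err - new_err) / old_err * 100.0

for name, old, new in [
    ("SWE-bench Verified", 80.8, 93.9),
    ("SWE-bench Pro", 53.4, 77.8),
]:
    print(f"{name}: {error_rate_reduction(old, new):.0f}% of remaining errors eliminated")
```

By this reading, the Verified jump wipes out roughly two thirds of the errors Opus 4.6 still made, which is why commenters treat it as more than an incremental release.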

3. Qwen 3.6 Plus Performance and Comparisons

  • Qwen 3.6 Plus is the first Chinese model to survive all 5 runs on FoodTruck Bench (Activity: 140): The image is a leaderboard from the FoodTruck Bench, a 30-day business simulation benchmark, highlighting the performance of various AI models in running a food truck. Qwen 3.6 Plus, developed by Alibaba, is the first Chinese model to successfully complete all five runs, achieving a +283% median ROI and a $7,668 median net worth. This marks a significant improvement over previous models like Qwen 3.5 397B and GLM-5, which could analyze their failures but not survive the simulation. Qwen 3.6 Plus demonstrates improved strategic planning, such as optimizing location choices and managing inventory, although it still faces challenges like ingredient wastage. The model is available for free testing on OpenRouter, facilitating broader evaluation. Commenters express interest in comparing other models like Mythos and note that even top models like Gemma 4 have inefficiencies, such as food wastage, highlighting the benchmark’s value in assessing AI operational strategies.

    • The FoodTruck Bench is a benchmark designed to evaluate the efficiency and resource management capabilities of AI models, particularly in scenarios that simulate real-world constraints like food waste. The mention of Qwen 3.6 Plus surviving all 5 runs indicates its robustness and efficiency in handling such tasks, setting it apart from other models like Gemma 4, which is noted for its inefficiency in resource management.
    • The discussion hints at the use of synthetic data versus real data for quality gating in AI model evaluation. The question raised by OkBet3796 about whether real data or synthetic data evaluated by a model is used for quality gating suggests a deeper inquiry into the methodologies behind AI benchmarking, which can significantly impact the reliability and applicability of the benchmark results.
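For readers unfamiliar with how a "median ROI across runs" headline number is derived, here is a minimal sketch. The starting capital and the per-run net worths other than the quoted $7,668 median are invented for illustration, since FoodTruck Bench's actual scoring inputs are not given in the post.

```python
from statistics import median

# Hypothetical results for five 30-day simulation runs. The starting capital
# and four of the five net-worth figures are invented purely to illustrate
# how median ROI and median net worth are computed from per-run outcomes.
STARTING_CAPITAL = 2000.0
final_net_worths = [7100.0, 7668.0, 8200.0, 6900.0, 7900.0]  # one per run

rois = [(nw - STARTING_CAPITAL) / STARTING_CAPITAL * 100 for nw in final_net_worths]
print(f"median net worth: ${median(final_net_worths):,.0f}")
print(f"median ROI: {median(rois):+.0f}%")
```

Using the median rather than the mean keeps a single lucky or catastrophic run from dominating the leaderboard, which matters for a benchmark where models can fail to "survive" at all.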
  • Qwen3.6-Plus is getting close to GPT-5.4 as a Video Security Agent (Activity: 73): The image is a leaderboard showcasing the performance of various AI models as video security agents, with a focus on the Qwen3.6-Plus model from Alibaba Cloud. This model achieved a score of 92/96 with an accuracy of 95.8%, tying for third place with GPT-5.4-mini and trailing slightly behind GPT-5.4. The benchmark evaluates models on their ability to handle real-world security scenarios, including threat classification, tool use, and privacy compliance, emphasizing agentic tasks over academic ones. The Qwen3.6-Plus is noted for its cost-effectiveness and high performance in security-critical AI applications. A user inquired about the definition of a ‘video security’ agent, indicating a need for clarification on the role and functionality of these AI models in security contexts.

    • Deep_Ad1959 highlights a critical challenge in deploying video security agents like Qwen 3.6-Plus: managing alert fatigue. They emphasize that while benchmarks often focus on single-frame classification, the real-world utility of such systems depends on their ability to handle deduplication across multiple camera feeds. This involves ensuring that repeated detections of the same event, such as a person walking past a camera multiple times, do not generate redundant alerts, which can lead to operators ignoring the system.
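The deduplication problem Deep_Ad1959 describes can be sketched concretely. The snippet below is a minimal time-window deduplicator, assuming invented event types, camera IDs, and a 5-minute suppression window; a production system would also cluster by track identity rather than event type alone.

```python
from dataclasses import dataclass

DEDUP_WINDOW_S = 300  # suppress repeats of the same event type for 5 minutes (assumed)

@dataclass
class Detection:
    camera_id: str
    event: str        # e.g. "person_detected" (hypothetical label)
    timestamp: float  # seconds since some epoch

def dedup(detections: list[Detection]) -> list[Detection]:
    """Emit only the first detection of each event type per window, across all cameras."""
    last_alert: dict[str, float] = {}
    alerts = []
    for d in sorted(detections, key=lambda d: d.timestamp):
        prev = last_alert.get(d.event)
        if prev is None or d.timestamp - prev >= DEDUP_WINDOW_S:
            alerts.append(d)
            last_alert[d.event] = d.timestamp
    return alerts

feed = [
    Detection("cam1", "person_detected", 0),
    Detection("cam2", "person_detected", 40),   # same person seen by a second camera
    Detection("cam1", "person_detected", 120),  # walks past again
    Detection("cam1", "person_detected", 400),  # new window -> genuinely new alert
]
print(len(dedup(feed)))  # 2 alerts instead of 4
```

This is exactly the layer single-frame benchmarks do not measure: without it, four detections of one pedestrian become four pages to an operator, and the operator learns to ignore the system.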
  • It looks like Qwen 3.6 Plus finally made it to the alibaba coding plan! (Activity: 114): Qwen 3.6 Plus has been integrated into the Alibaba Coding Plan, but it is only accessible to users subscribed to the Pro plan. This model is not available to Lite plan users, prompting some to consider alternatives like Opencode Go, which offers models such as GLM5.1 MM2.7. Additionally, Qwen 3.6 Plus can be accessed through Claude Code by manually setting the model name. There is a debate on the value of the Lite plan now that Qwen 3.6 Plus is exclusive to the Pro plan. Some users express disappointment and consider switching to other platforms offering competitive models.

    • Qwen 3.6 Plus is now part of Alibaba’s Pro plan, which has led to dissatisfaction among Lite plan users who feel the upgrade is not worth it. This has sparked discussions about alternative models like Opencode Go’s GLM5.1 MM2.7, which are seen as more accessible options for those unwilling to upgrade to Pro.
    • The Qwen 3.6 Plus model can also be accessed through Claude Code by manually setting the model name, providing an alternative route for users who do not wish to upgrade their plan. This workaround is useful for those who want to experiment with the model without committing to the Pro plan.
    • Performance concerns have been raised about the Qwen 3.6 Plus model, with users noting it is slower than GLM 5.1 on z.ai’s coding plan. This has led to discussions about the model’s efficiency and whether it justifies the cost of upgrading to the Pro plan.
  • Has anyone used Qwen Code, and if so, what do you think of it? (Activity: 66): Qwen Code is a Chinese coding assistant that offers a free tier with extensive token usage, making it a cost-effective alternative to Western models like Claude Code and Google Antigravity. Users have reported consuming hundreds of millions of tokens without rate limits, although it is noted to be a significant memory hog, requiring adjustments to Linux memory management for optimal performance. The UI is described as lagging compared to Claude but superior to Gemini, especially on mid-low tier laptops. However, it is prone to hallucinations, often suggesting redundant or suboptimal solutions, such as preferring Tailscale over CloudFlare tunnels due to its training data bias. Some users appreciate the free tier’s extensive token usage, while others criticize its tendency to hallucinate and suggest suboptimal solutions. The UI performance is also a point of contention, with mixed reviews compared to other models.

    • Qwen Code offers a free tier that feels almost unlimited, making it a cost-effective alternative to paid options like Claude. However, it is noted to be a significant memory hog, requiring Linux memory management tweaks to run efficiently. The UI is described as lagging compared to Claude, but still better than Gemini, especially on mid-low tier laptops.
    • Users have reported that Qwen Code tends to hallucinate more than desired, particularly in system planning and orchestration tasks. It sometimes fails to acknowledge previously incorporated project elements and may suggest suboptimal solutions based on its training data, such as preferring Tailscale over CloudFlare tunnels without context-specific justification.
    • Qwen Code is a fork of Google’s Gemini CLI, sharing the same workflow, which may be beneficial for users familiar with Gemini. Despite its issues, some users prefer it over alternatives like Opencode or Claude Code, although they may not use it with Qwen models specifically.
  • Said “Hi” to Qwen, started an identity crisis (Activity: 126): The user is running Qwen 3.5 locally via Ollama and observed that the model engaged in an extensive ‘thinking process’ before responding to a simple greeting. This behavior highlights potential issues with AI models over-optimizing or over-analyzing simple tasks, possibly due to their training to handle tasks with insufficient detail by generating multiple potential responses. This can lead to inefficiencies, especially when running models locally. Commenters noted that AI models are trained to handle tasks with minimal detail by generating multiple interpretations, which can lead to inefficiencies. One user mentioned that running the model on Alibaba Cloud Service with 27B parameters yielded more reliable results, suggesting that local execution might be less efficient. Another pointed out that smaller models might struggle with ‘thinking’ processes, making them unreliable.

    • FaceDeer highlights a common issue with AI models: when given vague instructions, models like Qwen are designed to infer missing details to avoid errors. This can lead to unexpected behavior if the task is not clearly defined, emphasizing the importance of precise input for reliable AI performance.
    • Charming_Support726 discusses challenges in running AI models locally, noting that Alibaba Cloud’s 27B model performs reliably. They mention tweaking parameters to improve performance, suggesting that local execution may require careful configuration to match cloud-based reliability.
    • Neither_Nebula_5423 points out that smaller AI models often struggle with tasks requiring ‘thinking,’ leading to unreliable outputs. This suggests that model size and complexity are critical factors in achieving dependable AI performance, especially for tasks requiring nuanced understanding.
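The "extensive thinking process" described above typically arrives as a reasoning trace embedded in the completion. Many open reasoning models, Qwen's thinking variants among them, emit it inside `<think>...</think>` tags before the final answer; that tag format is an assumption here and varies by model and serving stack, so check what your runtime actually emits. A minimal splitter:

```python
import re

# Split a raw completion into (reasoning trace, final answer).
# The <think> tag convention is an assumption; verify against your runtime.
THINK_RE = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)

def split_thinking(raw: str) -> tuple[str, str]:
    """Return (thinking_text, final_answer) from a raw model completion."""
    m = THINK_RE.search(raw)
    if not m:
        return "", raw.strip()
    return m.group(1).strip(), THINK_RE.sub("", raw, count=1).strip()

raw = "<think>User just said hi. Could be a test... or a greeting?</think>Hello!"
thinking, answer = split_thinking(raw)
print(answer)  # prints: Hello!
```

Stripping the trace client-side does not save any tokens, but it keeps the identity-crisis monologue out of the UI while the trace remains available for debugging.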

AI Discords

Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.