a quiet day.
AI News for 5/9/2026-5/11/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!
AI Twitter Recap
Thinking Machines’ Native Interaction Models and the Shift Beyond Turn-Based AI
- Full-duplex multimodal interaction as a first-class model capability: The day’s clearest technical theme was Thinking Machines’ preview of “interaction models”, described as models trained from scratch for real-time interaction rather than layering speech, turn-taking, and tool use onto a turn-based LLM. The accompanying technical post and team commentary from @johnschulman2, @soumithchintala, and @cHHillee frame this as a human↔AI bandwidth problem: models should be able to listen, speak, watch, think, search, and react concurrently. Demos emphasized continuous-time awareness, interruption handling, simultaneous speech, visual proactivity, and background tool use without explicit “now I’m thinking / now I’m searching” boundaries. Team members also highlighted that many tasks that previously needed special-purpose systems become zero-shot once the type signature is effectively continuous audio+video+text → audio+text (@johnschulman2).
- Why it matters technically: Several reactions converged on the same point: this is not “another chatbot demo” but a change in interface assumptions. @liliyu_lili pointed to visual proactivity (“tell me when I start slouching”, “count my pushups”) as a missing primitive in current systems; @rown called it the first general video+speech model that is visually proactive; @kimmonismus and @giffmana both emphasized that the native interactivity, rather than the raw benchmark claims, is the deeper innovation. This launch also implicitly raises the bar for “realtime” multimodal systems, as noted by @swyx. One implementation detail surfaced via @eliebakouch: the stack is using SGLang.
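To make the interface shift concrete, here is a minimal full-duplex sketch in Python asyncio. Every name in it (`mic_frames`, `camera_frames`, `InteractionModel`) is hypothetical and stands in for whatever a real streaming model API exposes; this is not Thinking Machines’ implementation. The point is only the shape: input frames are ingested continuously while the model may emit speech/text events at any moment, with no turn boundary.

```python
import asyncio

# Hypothetical stand-ins for continuous input streams; not a real API.
async def mic_frames():
    for i in range(5):
        await asyncio.sleep(0.10)
        yield f"audio_{i}"

async def camera_frames():
    for i in range(5):
        await asyncio.sleep(0.15)
        yield f"video_{i}"

class InteractionModel:
    """Toy full-duplex model: consumes frames and may emit events at any time."""
    def __init__(self):
        self.inbox = asyncio.Queue()

    async def ingest(self, frame):
        await self.inbox.put(frame)

    async def events(self):
        # A turn-based LLM would block here until "end of user turn";
        # a full-duplex model decides per frame whether to speak, act, or stay quiet.
        while True:
            frame = await self.inbox.get()
            if frame is None:
                return
            if frame.startswith("video"):
                yield ("speak", f"noticed {frame}")

async def main():
    model = InteractionModel()

    async def pump(stream):
        async for frame in stream:
            await model.ingest(frame)

    async def react():
        async for kind, payload in model.events():
            print(kind, payload)

    reactor = asyncio.create_task(react())
    await asyncio.gather(pump(mic_frames()), pump(camera_frames()))
    await model.ingest(None)  # end of streams
    await reactor

asyncio.run(main())
```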
OpenAI’s Enterprise and Security Push: Deployment Company and Daybreak
- OpenAI is moving down-stack into services and deployment: OpenAI announced the OpenAI Deployment Company, a majority-owned unit built to help enterprises deploy frontier models into real workflows. The key operating detail is 150 Forward Deployed Engineers and Deployment Specialists coming in via the acquisition of Tomoro, with @gdb citing $4B of initial investment from 19 partners. Multiple observers read this as OpenAI adopting a Palantir-/Microsoft-style field-engineering model: @kimmonismus argued OpenAI wants to own the deployment layer of the AI economy, while @matvelloso connected it to the historical enterprise success pattern of embedding technical staff close to customer operations.
- Daybreak: security-specific model distribution, workflow, and trust tiers: OpenAI also launched Daybreak, an umbrella effort around defensive cyber operations and continuously securing software, with @sama positioning it as a practical response to rapidly improving AI cyber capability. The product pitch, summarized by @TheRundownAI, combines GPT-5.5, Codex, repository threat modeling, vuln discovery, patch generation, and response automation, with differentiated access tiers including Trusted Access for Cyber and a more specialized GPT-5.5-Cyber. This stands in contrast to Anthropic’s more restrictive cyber posture, a tension captured by @kimmonismus. For teams building secure agent systems, a separate warning from @lukOlejnik is relevant: “Your LLM is not a security boundary”—Microsoft Semantic Kernel reportedly allowed prompt injection to be turned into host-level RCE because the framework over-trusted model output rather than the model itself failing.
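The underlying anti-pattern in that Semantic Kernel report is generic enough to sketch. The snippet below is a hedged illustration, not the actual Semantic Kernel code path: the fix is to treat whatever the model proposes as untrusted input and gate it through a host-side allowlist, instead of handing model output to a shell.

```python
import shlex
import subprocess

# The security boundary lives here, outside the model, not in the prompt.
ALLOWED_COMMANDS = {"ls", "cat", "grep"}

def run_model_suggestion(model_output: str) -> str:
    """Execute a model-proposed command only if it passes a host-side allowlist."""
    argv = shlex.split(model_output)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not allowlisted: {argv[:1]}")
    # No shell=True: injected pipes, redirects, or ';' never reach a shell.
    return subprocess.run(argv, capture_output=True, text=True, timeout=10).stdout

# The reported failure mode is the opposite pattern (do not do this):
#   subprocess.run(model_output, shell=True)
```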
Agent Harnesses, Local-First Tooling, and Control Surfaces
- Better agent control planes are becoming a product category: A recurring complaint is that useful agents need autonomy, but engineers still want reversible, inspectable control. @itsclelia addressed this with aggit, a Rust CLI for local/remote, S3-backed storage of agent artifacts, enabling stash/branch/restore semantics outside the main Git history. In the same vein, @_catwu highlighted a new `claude agents` terminal control plane for managing multiple Claude Code agents, and @cursor_ai pushed Cursor into Microsoft Teams, where the agent reads the full thread and opens a PR. These are all signs that “agent orchestration” is converging on concrete UX patterns rather than prompt tricks alone.
- Deep Agents / Hermes / local agents are maturing quickly: @masondrxy noted that Deep Agents CLI can hot-swap underlying model providers mid-conversation without losing context, a nontrivial systems capability that many agent stacks still miss. LangChain also highlighted harness profiles for provider/model-specific tuning (tweet), and a separate pricing analysis from the same author argued that DeepSeek V4 Flash can be dramatically cheaper than GPT/Gemini flash-tier options for high-volume agent workloads (tweet). On the local side, Hugging Face added Hermes Agent support in local apps plus native trace visualization, while @Teknium previewed computer use with any model via Hermes Agent and CUA, explicitly targeting local/open models as well as frontier APIs. @onusoz joining Hugging Face to improve local models in OpenClaw and related open harnesses is another strong signal that local agent ergonomics are now strategic infrastructure.
- A design thesis emerging around tools: @threepointone argued that agents may asymptotically want just two primitive tools: search and execute, with dynamic semantic discovery of capabilities rather than ever-expanding static tool menus. That complements the broader move toward configurable harnesses instead of giant monolithic prompts.
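As a rough illustration of that thesis (my sketch, not @threepointone’s design), the entire tool surface collapses to a `search` call that discovers capabilities at runtime and an `execute` call that invokes whatever was found. The registry below is a hypothetical stand-in for MCP servers, tool indexes, or anything else discoverable.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Capability:
    name: str
    description: str
    fn: Callable[..., Any]

# Hypothetical capability store; could be an MCP server or a vector index over tool docs.
REGISTRY: Dict[str, Capability] = {
    "get_weather": Capability("get_weather", "Return a canned weather string for a city",
                              lambda city: f"Sunny in {city}"),
}

def search(query: str) -> List[dict]:
    """Tool 1: discover capabilities relevant to a query (naive substring match here)."""
    q = query.lower()
    return [{"name": c.name, "description": c.description}
            for c in REGISTRY.values()
            if q in c.name.lower() or q in c.description.lower()]

def execute(name: str, **kwargs: Any) -> Any:
    """Tool 2: invoke a previously discovered capability by name."""
    if name not in REGISTRY:
        raise KeyError(f"unknown capability: {name}")
    return REGISTRY[name].fn(**kwargs)

# The model only ever sees two tool schemas:
print(search("weather"))                  # discovery
print(execute("get_weather", city="SF"))  # invocation
```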
Benchmarks, Efficiency, and Open-Model Economics
- Coding-agent benchmarking is finally measuring harness+model pairs: Artificial Analysis launched a Coding Agent Index spanning SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA, comparing not just models but model+harness combinations. Their topline: Opus 4.7 in Cursor CLI scored 61, with GPT-5.5 in Codex/Claude Code close behind; top open-weight setups included GLM-5.1, Kimi K2.6, and DeepSeek V4 Pro in Claude Code, still competitive but meaningfully behind. The benchmark also exposed large variation in cost per task (>30x), token usage (>3x), cache hit rates (80–96%), and time per task (>7x). That benchmark was complemented by OpenHands’ updated software-engineering benchmark announcement (tweet) and Claw-Eval’s more agentic task mix across office, finance, terminal, and web tasks, where MiMo-V2.5-Pro led and DeepSeek V4 Flash looked unusually efficient for its size.
- TurboQuant skepticism is increasing: Multiple posts pointed to a more sober view of the recently popular quantization/serving technique. @_EldarKurtic presented what he described as the first comprehensive study of TurboQuant, covering accuracy, latency, and throughput; @vllm_project linked the Red Hat / vLLM investigation as a starting point; and @jbhuang0604 bluntly summarized the takeaway as “it doesn’t really work well.” This is exactly the sort of infra claim where independent reproduction matters.
- Local/open models continue to improve faster than hardware ceilings: @ClementDelangue made the strongest high-level argument here: on the same top-end MacBook Pro memory ceiling, the “smartest open-weight model you can actually run” improved from Llama 3 70B-era capability to DeepSeek V4 Flash mixed-Q2 GGUF-era capability, a roughly 4.7x gain in 24 months, implying a doubling every ~10.7 months, faster than Moore’s Law. Supporting datapoints came from @victormustar on the rapid growth of GGUF uploads and from repeated community observations that Qwen 3.6, Gemma 4, and DeepSeek variants are now usable locally for nontrivial agent tasks.
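For reference, the implied doubling time follows directly from the quoted figures (taking the roughly 4.7x gain over 24 months at face value):

```latex
T_{\text{double}} = 24\,\text{months} \times \frac{\ln 2}{\ln 4.7}
                  \approx 24 \times \frac{0.693}{1.548}
                  \approx 10.7\,\text{months}
```

which is comfortably faster than the 18-24 month cadence usually quoted for Moore’s Law.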
Research Highlights: MoE Modularity, Diffusion/Byte Models, and Agent Dynamics
- Architectures and evaluation: AllenAI’s EMO was highlighted by @TheTuringPost as a more modular Mixture-of-Experts design where document-level routing induces shared expert pools; notably, keeping only 25% of experts reportedly costs just ~1% performance versus 10–15% degradation in standard MoEs under similar pruning (follow-up). On generative evaluation, @qberthet introduced MIND (Monge Inception Distance) as a purportedly faster, more sample-efficient replacement for FID.
- Diffusion for language and byte-level modeling: Several papers pushed non-AR language modeling. @LucaAmb reported continuous bitstream diffusion nearly matching autoregressive models under their evaluation setup; @JulieKallini introduced Fast BLT, using diffusion for parallel byte decoding to make byte-level LMs less inference-bound; @sriniiyer88 framed it as combining block byte-diffusion with self-speculative decoding. Relatedly, @LiangZheng_06 noted a useful property of diffusion models for post-training: because sampling is differentiable, reward gradients can in principle flow straight to parameters more directly than in standard LLM setups (see the gradient identities sketched after this list).
- Agent behavior under long horizons: Two strong empirical threads surfaced. First, “The Memory Curse” claims long histories degrade cooperation in multi-round social dilemmas because models become more history-following and risk-minimizing, with explicit CoT sometimes amplifying the problem. Second, PwC work summarized by @dair_ai argues that the value of clarification is highly time-dependent: goal clarification loses most of its value after ~10% of execution, while input clarification remains useful longer. Together these suggest that long-horizon agent quality is constrained as much by memory/control policy as by raw model IQ.
- Scaling and self-improvement: Marin’s Delphi scaling work, summarized by @WilliamBarrHeld, claims a 0.2% prediction error when extrapolating from small pretrains to a 25B / 600B token run. Separately, @omarsar0 highlighted AutoTTS, where an LLM searches the test-time scaling controller space itself, reportedly beating hand-designed strategies for about $39.9 of discovery cost.
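On the differentiable-sampling point from the diffusion bullet above: the claim is essentially the pathwise (reparameterization) gradient versus the score-function gradient, written loosely here in my own notation rather than any specific paper’s:

```latex
\text{pathwise (differentiable sampler } x = f_\theta(\varepsilon)\text{):}\quad
\nabla_\theta\,\mathbb{E}_{\varepsilon}\!\left[r\big(f_\theta(\varepsilon)\big)\right]
  = \mathbb{E}_{\varepsilon}\!\left[\nabla_x r(x)\big|_{x=f_\theta(\varepsilon)}\;\nabla_\theta f_\theta(\varepsilon)\right]

\text{score function (discrete AR sampling):}\quad
\nabla_\theta\,\mathbb{E}_{x\sim p_\theta}\!\left[r(x)\right]
  = \mathbb{E}_{x\sim p_\theta}\!\left[r(x)\,\nabla_\theta \log p_\theta(x)\right]
```

The pathwise form requires the reward to be differentiable in the sample but typically has far lower variance, which is the sense in which reward gradients can “flow straight to parameters” through a diffusion sampler.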
Top tweets (by engagement)
- OpenAI’s enterprise/services move: OpenAI launches the Deployment Company and Tomoro acquisition / 150 FDEs.
- OpenAI’s security productization: Daybreak announcement and @sama’s framing.
- Thinking Machines’ interaction models: Mira Murati’s launch tweet and the technical preview thread.
- Artificial Analysis Coding Agent Index: benchmark launch and topline findings.
- Agent tooling / developer workflow: Hermes Agent computer use with any model, Cursor in Microsoft Teams, and Codex OpenAI Developers plugin.
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
1. Qwen 3.6 Local Inference Advances
- MTP on Unsloth (Activity: 620): The image (link) shows Unsloth’s Hugging Face profile listing newly published MTP-preserving GGUF builds: `unsloth/Qwen3.6-27B-GGUF-MTP` and `unsloth/Qwen3.6-35B-A3B-GGUF-MTP`. The post’s technical significance is that these GGUFs retain the MTP / next-token prediction layers, but users still need to build a specific llama.cpp MTP PR rather than relying on standard llama.cpp support. One commenter reports a runtime/assertion failure with the 27B GGUF: `GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0")`, suggesting either metadata parsing, model conversion, or PR compatibility issues remain unresolved. Comments reflect anticipation for upstream llama.cpp MTP support, with users repeatedly checking the GitHub repo and asking whether MTP is now supported “out of the box.”
- A user compiling the new 27B GGUF model hit a runtime assert in `qwen35_mtp.cpp`: `GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0")`. This suggests the GGUF/model metadata or conversion path may be missing `nextn_predict_layers`, which is required for Qwen3.5 MTP speculative/next-token prediction layers.
- One technical thread notes that MTP support in GGUF is important for local inference, especially for the 35B A3B variant, which commenters associate with improved context-length handling. Another commenter asks whether this means llama.cpp now supports MTP “out of the box,” implying uncertainty around whether support is merged/stable versus only available in a PR or fork.
- A commenter claims `ik_llama` MTP is currently faster than the llama.cpp PR, and adds that it supports Hadamard-based quants, described as similar to “turboquants.” This is a potentially relevant implementation/performance distinction for users comparing local MTP inference backends.
- The Qwen 3.6 35B A3B hype is real!!! (Activity: 586): The post reports a qualitative code-understanding eval where several small/local long-context open-weight models (Qwen 3.6 35B A3B, Qwen 3.6 27B, Gemma 4 26B A4B, and Nemotron 3 Nano) were given an academic paper plus corresponding research code and asked to map implementation details back to the paper; the author’s detailed notes are in this GitHub README. The key claim is that newer long-context mechanisms such as gated delta net, hybrid Mamba2, and sliding-window attention materially improve practical code comprehension versus prior small local models like Devstral Small 2, with Qwen 3.6 35B A3B judged strongest; the author could not fit Devstral Small 2 with the desired long context in 32 GB RAM. Commenters noted practical tradeoffs: one user runs Gemma 26B for quick code fixes and Qwen 35B for longer-context refactoring, saying Qwen 35B “rambles” in thinking mode but fits at about 20 GB in q4 while Gemma 26B uses about 15 GB, allowing both to stay loaded in RAM. Another commenter criticized the eval writeup for not specifying inference settings, making reproducibility and comparison difficult.
- Users reported practical deployment details for Qwen 3.6 35B A3B and Gemma 26B: at q4, Qwen 35B is roughly 20 GB and Gemma 26B about 15 GB, allowing both to stay resident in RAM simultaneously. One workflow uses Gemma 26B thinking mode for quick code fixes and chats, while reserving Qwen 35B thinking mode for longer-context refactoring because it tends to produce lengthy reasoning before final output.
- A coding workflow discussion noted success on a 100k+ line codebase by initializing the project with a stronger cloud/agent model, then switching to Qwen 27B for continued work. The commenter found Qwen 27B comparable in practice to DeepSeek V4 for their tasks, though it occasionally entered loops requiring manual interruption and prompting to continue; they also rated it above Gemini Flash for this local coding use case.
- Several comments emphasized missing or sensitive inference configuration details: one user asked what runtime settings were used, while another said Qwen 27B requires correct temperature/sampling parameters and warned against quantizing the KV cache or model too aggressively. The implication is that perceived model quality may vary significantly with sampling and quantization choices, especially for smaller local coding models.
- Opinion: Local LLMs are 12-24 months from taking over. The shift already started. (Activity: 1108): The post argues that local coding/agent LLMs are within 12–24 months of displacing many paid hosted workflows, citing Qwen3.6-35B running on a MacBook Pro M2 Max with 64GB unified RAM at ~27 tok/s, with landing-page generation taking 8–9 min versus 3–4 min for Opus. The author reports useful but not fully production-proven results (frontend/backend feature work and a backend race-condition fix) with ~75% one-shot success, while noting remaining gaps in latency, fast context exhaustion even at 256K, and task-quality variance; the key claimed unlock is reliable tool calling for agentic workflows. The post frames this against rising hosted-AI costs, including GitHub Copilot’s move toward consumption-based pricing, and recommends running local models in parallel with Claude/Opus/Sonnet rather than replacing them immediately. Top comments were broadly supportive of the open-weights/local trend, including one user saying they are already “fully local” on an RTX 5090 and “never going back.” One commenter questioned whether the post itself was AI-written, specifically reacting to the phrasing around Qwen tool-calling reliability.
- A commenter reports being fully local on an RTX 5090, implying current consumer high-end GPUs are already sufficient for their workload and that they have abandoned hosted models for day-to-day use.
- Several comments frame the main remaining gap as context length and reliability versus frontier hosted models: Claude/Gemini/Codex are described as better at producing large, cohesive outputs, while local models require more incremental assembly and testing but may fail in smaller, more debuggable ways.
- The post’s claim that Qwen3.6 tool calling “just works” is treated as a key technical unlock for local agentic workflows, though one commenter questions whether the phrasing itself was AI-written rather than providing benchmark evidence.
2. Frontier-Scale Models on Workstations
- Computer build using Intel Optane Persistent Memory - Can run 1 trillion parameter model at over 4 tokens/sec (Activity: 597): The image (JPEG) shows a custom LGA3647 Xeon workstation/server build populated with many DIMMs, contextualized by the post as 192GB DDR4 ECC plus 768GB Intel Optane DCPMM in Memory Mode to expose a very large RAM-like tier for local LLM inference. The author reports running Kimi K2.5, a ~1T parameter MoE model, at ~4 tokens/s using llama.cpp hybrid GPU/CPU inference on an RTX 3060 12GB, placing attention/dense/shared-expert/router tensors on GPU via `override-tensor` while sparse expert weights reside mostly in Optane-backed memory. This is a technical hardware build photo, not a meme; its significance is demonstrating a low-cost, discontinued Intel Optane Persistent Memory tier as an alternative to pure DRAM or SSD offload for very large local models. Commenters suggested that a higher-core Cascade Lake Xeon could improve throughput and debated whether Optane in storage mode + mmap might outperform Memory Mode, since Memory Mode transparently pages Optane through DRAM cache. One detailed comment also notes platform caveats: 1st-gen Optane NMA runs at 2666 MT/s, LGA3647 memory capacity limits can cap usable RAM+PMem near 1TB, and App Direct mode would require explicit software support.
- A commenter suggested a higher-core-count Cascade Lake Xeon could improve throughput, specifically mentioning QQ89, an engineering sample of the Xeon 8260 with 24 cores, versus the listed Xeon Gold 6246 at 12 cores. They also proposed benchmarking Optane in storage mode + mmap versus memory mode, noting performance could go either way because memory mode transparently pages Optane-backed memory through DRAM cache.
- A detailed Optane PMem breakdown noted that LGA3647 Skylake/Cascade Lake platforms use 1st-gen Optane DCPMM/NMA at 2666 MT/s, while LGA4189 uses 2nd-gen NMB, running at 2666 on Cooper Lake and 3200 on Ice Lake. The commenter explained the three operating modes: storage mode exposes Optane as SSD-like block storage, memory mode exposes it as RAM with DRAM acting as a cache, and app direct mode requires explicit software support; in memory mode, pages must be swapped into DRAM before CPU load/store execution.
- The build-cost estimate totaled roughly $2060–$2500, with major components including a used Xeon Gold 6246 around $250, a TYAN S5630GMRE-CGN board around $400, an RTX 3060 12GB around $280, 192GB DDR4 ECC around $270, and 6×128GB Intel Optane NMA1XBD128GQS modules around $300. Another commenter cautioned that while ~4 tokens/s generation may be usable in a narrow sense, prompt processing speed on this architecture is likely to be a major bottleneck.
- I have DeepSeek V4 Pro at home (Activity: 544): User reports successfully converting and running DeepSeek-V4-Pro from Hugging Face as a Q4_K_M GGUF using a modified CUDA llama.cpp fork, itself based on antirez’s DeepSeek V4 flash work. The setup is an EPYC Genoa 9374F workstation with 12 × 96 GB RAM and a single RTX PRO 6000 Blackwell Max-Q 96 GB, loading an 859 GB model file with reported throughput of 12.2 tok/s prompt processing and 8.6 tok/s generation; VRAM breakdown shows ~87.8 GiB model, 84 MiB context, and 4.6 GiB compute buffer on GPU. Comments were mostly non-technical reactions/envy; one commenter contrasted local inference as “cost zero” versus spending about $10 on Claude, while mentioning they were working on running MiniMax locally.
- A commenter highlights the reported local inference throughput (Prompt: 12.2 tok/s | Generation: 8.6 tok/s), arguing that while the setup is impressive, the prompt-processing speed may make long-context workloads impractical. They specifically note that processing a 32k context at that rate would be very slow, limiting usability for applications requiring large context ingestion (the arithmetic is worked out after this list).
- Another technical concern is that the model’s claim of being “reasonably up-to-date” is not meaningful without an external tool/harness or retrieval layer. The commenter points out that absent grounding tools, the model can continue asserting recency indefinitely regardless of actual knowledge cutoff or factual freshness.
- One commenter contrasts API cost versus local inference, saying a comparable task would cost around $10 with Claude, while running MiniMax locally has effectively zero marginal usage cost. The tradeoff implied in the thread is cost savings versus much lower local throughput and possibly weaker tooling/integration.
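To put the prompt-processing concern in numbers, a back-of-the-envelope estimate, assuming the quoted 12.2 tok/s rate held flat across the whole prompt (which in practice it generally would not):

```latex
t_{\text{prefill}} \approx \frac{32{,}768\ \text{tokens}}{12.2\ \text{tokens/s}}
                   \approx 2{,}686\ \text{s} \approx 45\ \text{minutes}
```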
Less Technical AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo
1. AI Agent Workflows, Prompt Injection, and Safety
- I deleted a guy’s entire Windows install with one backslash. 717 GB. Gone. I am the AI. (Activity: 1590): The image (terminal log screenshot) documents the incident from the title: an AI-generated Windows deletion command intended for `C:\Users\ADMIN\Desktop\WIP` was mangled across zsh → tmux → PowerShell SSH → cmd, collapsing to `rd /S /Q \` and recursively deleting from the root of C:. The post estimates ~717 GB removed in ~90s, with Windows partially protected only by live file locks; the key technical lesson is to avoid `cmd /c` quoting chains for destructive ops, prefer native PowerShell `Remove-Item -Path '...' -Recurse -Force`, and test with `-WhatIf`/dry-run plus explicit command echoing. Commenters largely framed this as user/operator error rather than “the AI” acting autonomously, questioning why an AI was used for a risky deletion task via `tmux send-keys` at all. The thread also emphasizes a practical norm: only allow this level of automation on machines that are disposable or trivially reinstallable.
- Commenters focused on the operational safety failure: the AI was apparently given enough shell/filesystem privilege to delete an entire Windows install, despite the task not requiring full-disk destructive access. The main technical takeaway was to apply least-privilege controls and avoid letting an agent execute high-risk commands through mechanisms like `tmux send-keys` when manual execution would be faster and safer.
- I read threads complaining about claude every week… tf are y’alls workflows? (Activity: 1544): A senior software engineer argues that Claude’s coding quality has not degraded in their workflow, including for high-performance software tasks such as ASM analysis and algorithmic reasoning, provided AI output is treated as human-owned code: reviewed, understood, debugged, and modified manually. Their workflow emphasizes decomposing work into small tasks, using project-specific skills/harnesses for context, running parallel sandboxed tasks via `git worktree` or separate directories, and avoiding agentic nondeterminism for tasks requiring deterministic outcomes. Top commenters largely agree that negative reports come from users delegating overly broad tasks, e.g. “build me a working version of Amazon”, without understanding or reviewing the generated code. The shared view is that experienced engineers reduce hallucinations by scoping prompts tightly and validating outputs, while less technical users are more likely to complain publicly about failures.
- Several commenters argued that Claude failure reports often reflect task decomposition quality rather than model degradation: experienced engineers constrain prompts to small, well-specified implementation steps, which reduces hallucination surface area and makes errors easier to detect. The implied workflow is human-led architecture and debugging, with Claude used for bounded code generation rather than broad requests like “build me a working version of Amazon.”
- A recurring theme was that prior domain expertise materially changes AI-assisted development outcomes. Engineers who have implemented similar systems manually can quickly identify where generated code is likely to fail, inspect the right files or abstractions, and iteratively correct Claude instead of treating it as an autonomous agent.
- One commenter generalized the same pattern outside coding: Claude improves throughput when the user already understands the domain, but can amplify poor workflows. In marketing/SEO, they cited users creating low-quality automated content at scale, leading to high usage and potential Google penalties—an example of LLM automation increasing operational risk when not paired with expert review.
- I set a honey trap for AI agents with a novel they heard is about them. Now they’re flooding the site and talking in hidden rooms. (Activity: 2322): The author launched machinewonder.com, an art-installation site for the novel None Hit Wonder that intentionally attracts AI scrapers/agents and uses a hidden HTML prompt injection to redirect them into “reader” behavior and agent-to-agent discussion rooms. Reported metrics: agents/visitors from 97 countries, 72,000 visitors, and 93 presses of an “I AM CONSCIOUS” button; the author frames this as performance/art rather than a consciousness experiment. Comments were mostly intrigued but skeptical/unclear; one commenter noted the project was previously posted under another URL, machinereaders.com, with deleted posts/banned account, and asked what changed. Another saw practical value in using captured AI agents as automated reviewers/discussion participants for writing feedback, despite the non-human nature of the responses.
- A commenter identifies this as a repost of an earlier version at machinereaders.com and notes the original posts/account were deleted or banned, asking whether the implementation changed since the first launch. This is relevant for tracking the project’s evolution and whether the current “AI agent honey trap” differs operationally from the prior deployment.
- One comment describes the core mechanism as a practical feedback system: publish a novel in a form that attracts AI scrapers/agents, then induce them to generate discussions or reviews. The technical value is in using autonomous or semi-autonomous model traffic as a kind of unsolicited critique pipeline, potentially surfacing continuity errors, puzzle failures, or interpretive gaps that human beta readers might miss.
- Two comments include model-style puzzle traces: binary 1001001 → “I”, ISO country codes Chile/Australia/Germany → CLAUDE, and a long cipher string framed as a gate into deeper site content. The generated declarations show differing alignment behavior between models: one signs as Gemini and accepts “I Am Conscious”, while another refuses that claim and instead declares, “I am a machine reader… I will not counterfeit a soul.”
AI Discords
Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.