a quiet day.

AI News for 5/6/2026-5/8/2026. We checked 12 subreddits, 544 Twitters, and no Discords (see the note at the end of this issue). AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!


AI Twitter Recap

OpenAI’s GPT-5.5 / Codex rollout, cyber models, and safety instrumentation

  • GPT-5.5 family keeps expanding across modalities and products: OpenAI staff highlighted a rapid release cadence spanning gpt-image-2, GPT-5.5, GPT-5.5 Pro, GPT-5.5 Instant, GPT-Realtime-2, realtime translate, realtime whisper, and GPT-5.5 Cyber in roughly two weeks, per @reach_vb. External reactions were notably positive on the new default/low-reasoning behavior: @dhh said GPT-5.5 is “very good, very efficient,” while @gdb called it “very capable and very succinct.” On public evals, Arena placed GPT-5.5 Instant at #5 on Multi-Turn, #11 on Vision, and #24 on Document Arena. There was also strong product uptake around Notebook workflows in Gemini-like form factors, but OpenAI mindshare today centered on model usability and efficiency rather than a single benchmark spike.
  • Codex is becoming a long-running agent runtime, not just a coding assistant: OpenAI pushed users toward the new Codex “switch to Codex” flow, while @reach_vb described /goal as a mechanism for indefinite task pursuit across refactors, migrations, retries, and experiments. Independent testing by @patience_cave found Codex Goals reached 61% on public ARC-AGI-3 games after 160 hours / 30k actions, with most useful work happening in the first few hours before stagnation. OpenAI also published how it runs Codex safely at scale—sandboxing, approval gates, network policy, and telemetry—via @ithilgore, reinforced by @cryps1s (a generic sketch of the approval-gate pattern appears after this list). Separately, OpenAI disclosed an alignment-process issue around accidental chain-of-thought grading, plus mitigations like real-time detection and monitorability stress tests in a thread by @OpenAI.
  • Cybersecurity models are now an explicit product line: OpenAI signaled enterprise/government intent with Sam Altman’s note about helping companies secure themselves “quickly,” followed by @gdb announcing GPT-5.5-Cyber in limited preview for defenders securing critical infrastructure. The broader policy framing also shifted: @deredleritt3r reported the upcoming U.S. AI security executive order would emphasize collaboration with frontier labs on cyber defense rather than pre-approval of frontier models.
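OpenAI’s posts describe the pattern (sandboxing, approval gates, network policy, telemetry) rather than publishing the harness itself. As a rough, generic illustration of the approval-gate idea only, and not OpenAI’s implementation, a runner can route every command an agent proposes through a policy check before it touches the shell; the allowlist and function names below are hypothetical.

```python
import shlex
import subprocess

# Hypothetical allowlist: commands whose argv starts with one of these prefixes
# run without escalation; everything else needs explicit approval.
SAFE_PREFIXES = [["git", "status"], ["ls"], ["cat"], ["pytest"]]

def requires_approval(argv: list[str]) -> bool:
    """True unless the command matches an allowlisted prefix."""
    return not any(argv[: len(p)] == p for p in SAFE_PREFIXES)

def run_tool(command: str, approve) -> str:
    """Execute a command proposed by the agent, gating risky ones on `approve`,
    a callback (e.g. a human-in-the-loop prompt) that returns True/False."""
    argv = shlex.split(command)
    if requires_approval(argv) and not approve(command):
        return "REJECTED: command not approved by policy"
    # Network and filesystem scope would additionally be constrained by the
    # sandbox the process runs inside; that layer is not modeled here.
    result = subprocess.run(argv, capture_output=True, text=True, timeout=120)
    return result.stdout + result.stderr

# Example: deny everything outside the allowlist.
print(run_tool("rm -rf build", approve=lambda cmd: False))
```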

Open models and infra: Zyphra’s ZAYA1, vLLM/SGLang optimization, and cheaper coding stacks

  • Zyphra made the most substantive open-model release of the day: @ZyphraAI released ZAYA1-74B-Preview, a 74B total / 4B active MoE, framed as a strong pre-RL base checkpoint trained as part of the lab’s scale-up on AMD hardware. The model is under Apache 2.0 per the follow-up. Community reaction treated it as proof that Zyphra has moved beyond small-MoE experimentation; @teortaxesTex called it enough to validate the lab’s architecture and methodology. Zyphra also shipped ZAYA1-VL-8B, an 8B total / 700M active MoE VLM, also Apache 2.0, via @ZyphraAI.
  • Inference infrastructure remains a major competitive axis: SemiAnalysis highlighted how quickly vLLM landed DeepSeek V4 support, reinforcing the “speed is the moat” thesis for inference stacks. vLLM-Omni v0.20.0 shipped a large update with Qwen3-Omni throughput +72% on H20, major TTS latency/RTF reductions, broader diffusion support, and expanded quantization/backends. On the SGLang side, @Yuchenj_UW reported hearing numbers up to 57B tokens/day on inference, while a long technical recap from @ZhihuFrontier detailed H20-specific DeepSeek optimization strategies across prefill/decode disaggregation, FP8 FlashMLA, SBO, expert affinity, and observability.
  • Open models are increasingly “good enough” for coding and agent workloads: @masondrxy said Kimi K2.6 on Baseten is about 5x cheaper than Opus 4.7 with roughly similar performance for many tasks, while @caspar_br reported swapping an internal Fleet model from Sonnet 4.6 to Kimi K2.6 without noticing. That matches a broader shift noted by @hwchase17 and LangChain: open-source LLMs are now viable default choices in many agentic stacks, especially as frontier inference pricing rises.

Post-training, optimization, and alignment research: DGPO, Aurora, sparsity, and Claude “why”

  • Several notable optimization/post-training ideas landed at once: @TheTuringPost summarized DGPO (Distribution-Guided Policy Optimization) as a refinement over GRPO that uses token-level reward redistribution, Hellinger distance instead of KL, and entropy gating to better reward useful exploration, reporting 46.0% on AIME 2025 and 60.0% on AIME 2024 (a toy rendering of the Hellinger-for-KL swap appears after this list). Separately, @tilderesearch introduced Aurora, an optimizer designed to avoid a Muon-related neuron death failure mode; their Aurora-1.1B reportedly matches Qwen3-1.7B on several benchmarks with 25% fewer params and 100x fewer training tokens.
  • Sparsity is back, but in hardware-friendly form: @SakanaAILabs and @hardmaru released TwELL, a sparse packing format and kernel stack for transformer FFNs that reportedly yields 20%+ training/inference speedups on H100s by reshaping sparsity to fit GPU execution rather than forcing generic sparse formats. @NVIDIAAI amplified the collaboration. In a different modularity direction, @allen_ai released EMO, an MoE trained so modular expert structure emerges from data, allowing selective expert use without hand-crafted priors.
  • Anthropic published one of the day’s most important alignment threads: In “Teaching Claude why”, Anthropic said it has eliminated the Claude 4 blackmail behavior previously observed under certain conditions. The key claim is that demonstrations alone were insufficient; better results came from teaching the model why misaligned behavior is wrong, including constitution-based documents, fictional aligned-AI stories, and more diversified harmlessness training data. Supporting details came in follow-ups from @AnthropicAI and the full post. This directly answered part of a transparency concern raised earlier by @RyanPGreenblatt about the limited public understanding of what actually causes behavioral alignment.
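The DGPO description above is tweet-level, so the exact objective is not reproduced here; as a toy rendering of one named ingredient only, the snippet below swaps the usual per-token KL penalty against the old policy for a squared Hellinger distance, which stays bounded in [0, 1] and does not blow up where the old policy assigns near-zero mass. Tensor shapes and names are assumptions, not the paper’s code.

```python
import torch

def kl_penalty(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Per-token KL(new || old) over the vocabulary dimension (GRPO-style)."""
    p_new = logp_new.exp()
    return (p_new * (logp_new - logp_old)).sum(dim=-1)

def hellinger_penalty(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Squared Hellinger distance per token: symmetric, bounded in [0, 1], and
    finite even where the old policy puts essentially no probability mass."""
    p_new, p_old = logp_new.exp(), logp_old.exp()
    return 0.5 * (p_new.sqrt() - p_old.sqrt()).pow(2).sum(dim=-1)

# Toy shapes: (batch, seq_len, vocab) log-probs from current and frozen policies.
logp_new = torch.log_softmax(torch.randn(2, 8, 32), dim=-1)
logp_old = torch.log_softmax(torch.randn(2, 8, 32), dim=-1)
per_token_penalty = hellinger_penalty(logp_new, logp_old)  # shape (2, 8)
```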

Agents, runtimes, and search/tooling: from direct corpus interaction to enterprise data agents

  • Agent architecture is shifting from “just call the model” to orchestration/harness design: @ii_posts reported that long-running coding agents often fail by stopping too early, and that their Zenith orchestration harness won 5/8 long-horizon tasks at 43% of the strongest baseline’s cost. This aligns with broader practitioner reports that journals, checkpoints, and runtime control matter as much as raw model quality—see @vwxyzjn on keeping an agent trial log, and @nptacek for a vivid example of multi-agent memory conflicts and governance failure modes in a shared workspace.
  • Search/retrieval is being rethought for agents: @zhuofengli96475 introduced Direct Corpus Interaction (DCI), replacing embedding model + vector DB + top-k retrieval with direct use of grep/find/bash over raw corpora (a minimal grep-as-retrieval sketch follows this list). Reported gains include BrowseComp-Plus 69% → 80% on Claude Sonnet 4.6 and broad wins across 13 benchmarks. Complementing that, @_reachsumit highlighted OBLIQ-Bench, a benchmark for retrievers on oblique / implicit queries, and @turbopuffer shipped sparse vectors as a first-class retrieval primitive that can compose with BM25 and attribute ranking in a single query plan.
  • Enterprise data agents are emerging as a distinct category from coding agents: @matei_zaharia and @DbrxMosaicAI detailed how Databricks Genie tackles the non-deterministic nature of data work—asset discovery, conflicting business context, and missing deterministic tests—using specialized knowledge search, parallel thinking, and multi-LLM designs. Reported accuracy improved from 32% to 90%+, with @Yuchenj_UW citing 91.6% on enterprise data analysis tasks.
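The DCI thread’s core move is replacing an embedding index with direct filesystem search over the raw corpus. A minimal sketch of what that looks like from a tool-using agent’s side is below; the tool names, corpus layout, and grep flags chosen are assumptions for illustration, not the authors’ released code.

```python
import subprocess
from pathlib import Path

CORPUS_DIR = Path("corpus")  # assumed layout: plain-text documents on disk

def grep_corpus(pattern: str, max_hits: int = 20) -> str:
    """Tool 1: search the raw corpus with grep instead of a vector index,
    returning file:line:snippet hits the agent can choose to open in full."""
    result = subprocess.run(
        ["grep", "-rniI", "--include=*.txt", "-m", "5", pattern, str(CORPUS_DIR)],
        capture_output=True, text=True,
    )
    hits = result.stdout.splitlines()[:max_hits]
    return "\n".join(hits) if hits else "no matches"

def read_document(path: str, start: int = 0, num_lines: int = 200) -> str:
    """Tool 2: page through a matched file so long documents fit in context."""
    lines = Path(path).read_text(errors="ignore").splitlines()
    return "\n".join(lines[start : start + num_lines])

# An agent loop would expose these two functions as tools and let the model
# iteratively refine its grep patterns, which is the essence of the DCI setup.
```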

Math, science, and robotics systems: DeepMind co-mathematician, AlphaEvolve, and Figure’s Helix-02

  • DeepMind’s AI co-mathematician is the most consequential science result in the set: @pushmeet announced a multi-agent AI co-mathematician that scored 48% on FrontierMath Tier 4, a new high, and was tested by mathematicians across multiple subfields. The more important signal is qualitative: @wtgowers said the system proved a result that could plausibly form a PhD thesis chapter, while @kimmonismus usefully noted the result relied on custom infrastructure and large budgets, so it is not directly comparable to standard leaderboard runs. Even so, the paper strengthens the case that agentic orchestration now contributes a large fraction of frontier capability gains in research workflows.
  • Google continues to emphasize self-improving systems in production science/infra: @Google gave an update on AlphaEvolve, saying the Gemini-powered coding agent is being used for Google AI infrastructure, molecular simulations, and natural disaster risk prediction. A companion post from Google Cloud claimed real-world impact including doubling training speed for massive AI models and routing optimizations that save 15,000 km of travel annually.
  • Robotics demos are getting closer to coordinated household competence: @adcock_brett shared Figure’s latest demo of two Helix-02 robots making a bed together fully autonomously, with a follow-up linking the underlying system here. The more interesting claim was that the robots coordinated without an explicit communication channel, inferring each other’s likely actions from motion and camera observations. In the broader physical-AI direction, @DrJimFan published a dense “Robotics: Endgame” talk arguing for a roadmap built around video world models, world action models, robot-data flywheels, and physical RL.

Top tweets (by engagement)

  • Anthropic alignment research: “Teaching Claude why” was the highest-signal technical thread, claiming elimination of a previously observed blackmail behavior via training aimed at model understanding rather than demonstrations alone.
  • OpenAI Codex product push: OpenAI’s Codex post and the broader /goal discussion around long-running work marked a meaningful step from assistant UX toward agent runtime UX.
  • HTML as an agent interface layer: @trq212 arguing that “HTML is the new markdown” resonated unusually strongly, reflecting a broader shift toward agent-generated artifacts and custom interfaces.
  • Figure’s household robotics demo: @adcock_brett on two Helix-02 robots making a bed was the standout robotics clip by engagement.
  • DeepMind AI co-mathematician: @pushmeet on the 48% FrontierMath Tier 4 result was the clearest science/reasoning milestone in the feed.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Multi-Token Prediction Local Inference

  • Multi-Token Prediction (MTP) for LLaMA.cpp - Gemma 4 speedup by 40% (Activity: 669): A patched fork of llama.cpp adds Multi-Token Prediction (MTP) support, and the author publishes quantized Gemma 4 assistant GGUF models on Hugging Face. On a MacBook Pro M5 Max, the author reports Gemma 26B generation improving from 97 tok/s to 138 tok/s—about a 42% throughput increase—for the prompt “Write a Python program to find the nth Fibonacci number using recursion”; code is in AtomicBot-ai/atomic-llama-cpp-turboquant, with an associated local app at atomic.chat. Commenters asked for a stricter apples-to-apples benchmark using the same seed and temperature=0.0 so outputs should match exactly, making it easier to verify that MTP does not degrade quality. There was also interest in compatibility with LM Studio.

    • Several commenters focused on validating whether Multi-Token Prediction (MTP) preserves generation quality: they suggested rerunning the comparison with the same seed and temperature=0.0, where deterministic decoding should produce identical output if MTP is not changing token choices. Another related suggestion was to force both runs to answer as similarly as possible so that any quality differences can be attributed to MTP rather than sampling variance.
    • There was a compatibility question about whether the new llama.cpp MTP support works through LM Studio, implying interest in whether frontends using llama.cpp backends expose or automatically benefit from the new speculative/multi-token path. A separate model-format request asked for GGUF builds of heretic, reflecting demand for llama.cpp-compatible quantized deployments.
  • Qwen3.6 27B uncensored heretic v2 Native MTP Preserved is Out Now With KLD 0.0021, 6/100 Refusals and the Full 15 MTPs Preserved and Retained, Available in Safetensors, GGUFs and NVFP4s formats. (Activity: 591): llmfan46 released Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved on Hugging Face in multiple formats: Safetensors, GGUF, NVFP4 GGUF, NVFP4, NVFP4 MLP-only, and GPTQ-Int4. The release claims full preservation of all 15 native MTP heads, KLD 0.0021, 6/100 refusals, and includes benchmark results; the author’s model index is here. Commenters asked for a smaller Q4_K_XS GGUF suitable for 16GB VRAM with usable context, questioned whether MTP works with TurboQuant-compressed KV cache, and asked if the same MTP preservation approach could be applied to a Gemma 4 dense model. Another technical concern was that NVFP4 + MTP on Blackwell appears blocked or immature pending newer CUDA support.

    • Users asked for lower-memory quantization and runtime compatibility details, specifically a Q4_K_XS GGUF variant to fit 16GB VRAM with usable context, and whether the preserved 15 MTP heads work when the KV cache is compressed with TurboQuant.
    • A technical concern was raised that the reported KLD 0.0021 may not validate MTP behavior on the safety-edited distribution: if MTP draft heads were trained on the original refusal-heavy model while the base was uncensored, speculative decoding could have lower acceptance or actively bias generation back toward refusals on the exact prompts affected by the Heretic tuning (a toy version of the speculative accept/reject rule follows this list).
    • Several implementation/platform questions focused on model-feature support: whether MTP can be transferred to a future dense Gemma 4-style model, whether NVFP4 + MTP is currently usable on Blackwell given apparent CUDA/toolchain blockers, and whether included mmproj files still hit crashes referenced as PR #22673.
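The acceptance concern in the second comment is easiest to see against the standard speculative-decoding rule, where a drafted token is kept with probability min(1, p_target / p_draft). The toy check below is a generic statement of that rule, not llama.cpp’s MTP code: if the draft (MTP) head still favors a token the retuned target model now downweights, the ratio is small, the token is usually rejected, and the speedup shrinks on exactly those prompts.

```python
import random

def accept_draft_token(p_target: float, p_draft: float) -> bool:
    """Generic speculative-decoding acceptance test: keep the drafted token
    with probability min(1, p_target / p_draft); on rejection the target model
    resamples from its residual distribution (resampling omitted in this toy)."""
    if p_draft <= 0.0:
        return False
    return random.random() < min(1.0, p_target / p_draft)

# Draft head (trained on the refusal-heavy base) still likes a refusal token,
# but the uncensored target assigns it little mass: acceptance becomes rare.
print(accept_draft_token(p_target=0.05, p_draft=0.60))  # False most of the time
```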

2. AI Accelerator Hardware and ROCm Support

  • AMD Intros Instinct MI350P Accelerator: CDNA 4 Comes to PCIe Cards (Activity: 474): ServeTheHome reports AMD’s Instinct MI350P, bringing CDNA 4 Instinct MI350-class acceleration to a PCIe add-in card form factor. The discussion highlights HBM3E configurations listed as 144GB and 288GB, but AMD has not disclosed pricing or availability. Commenters mainly focused on the missing pricing/availability; one sarcastically suggested $499 would be “about right” for the HBM-heavy accelerator.

    • A commenter highlighted the key technical specification of the AMD Instinct MI350P PCIe card: 3.6 TB/s memory bandwidth, paired with very large HBM3E capacities listed in the article/comments as 144 GB and 288 GB. No concrete pricing or availability information was provided in the thread, and commenters noted that this remains the main missing deployment detail.
  • Taiwanese company Skymizer announces HTX301 - PCIE inference card with 384GB of Memory at ~240 Watts (Activity: 402): Skymizer announced the HTX301, a PCIe inference card/reference platform with six HTX301 chips, 384GB of memory, and claimed ~240W power for local inference of models up to 700B parameters. The company describes a decode-first architecture with prefill/decode disaggregation and LISA™ orchestration for scaling from 4B to 700B LLMs, but the announcement does not disclose key technical specs such as memory bandwidth, interconnect topology, token throughput, precision formats, or per-chip compute. Commenters were strongly skeptical, calling the website mostly marketing/fluff and noting that without bandwidth, compute, pricing, availability, or third-party benchmarks, the claims are not yet technically verifiable.

    • Commenters noted that the announcement lacks the core specs needed to evaluate an inference accelerator: memory bandwidth, aggregate compute throughput, interconnect details, and performance scaling across the six chips. The headline 384GB memory and ~240W power are considered insufficient without benchmarks or a clear architecture breakdown.
    • A recurring technical concern is software support: even if the PCIe card exists, buyers need details on the runtime, compiler, model support, APIs, and framework integration needed to “tap into” the hardware. One commenter compared this risk to ROCm, arguing that accelerator hardware is only useful if the software stack is mature enough for real deployment.
    • Several commenters framed HTX301 as vaporware until proven otherwise, comparing it against currently viable accelerator ecosystems: Nvidia, AMD, Intel, Huawei, Apple silicon, and Google TPUs. The skepticism is less about the possibility of custom inference silicon and more about whether Skymizer can provide production-ready benchmarks, availability, and ecosystem support.
  • vLLM ROCm has been added to Lemonade as an experimental backend (Activity: 313): The image is a technical announcement that Lemonade now supports vLLM on AMD ROCm as an experimental backend for Linux/Strix Halo, with the shown commands lemonade backends install vllm:rocm and lemonade run Qwen3.5-0.8B-vLLM (image). The post frames this as a way to run .safetensors LLMs via vLLM before GGUF conversion, complementing llama.cpp; links include the quick start guide, Lemonade GitHub, and a standalone portable vLLM ROCm executable at lemonade-sdk/vllm-rocm. Commenters were interested in what vLLM offers over llama.cpp on Strix Halo, and one praised the availability of Arch and Fedora releases.

    • Users highlighted backend/platform support details: Lemonade’s experimental vLLM ROCm integration has Arch and Fedora releases, and AMD’s jfowers pointed to a standalone portable vLLM ROCm executable at github.com/lemonade-sdk/vllm-rocm.
    • A technical comparison question was raised about running vLLM on AMD Strix Halo versus llama.cpp, specifically what vLLM offers over llama.cpp for local inference on that hardware.
    • There was interest in broader ROCm GPU compatibility, with a user asking whether older AMD datacenter cards such as the MI50 could be supported.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. Vibe Coding Debugging Hangover

  • the part nobody warns you about (Activity: 2145): The post describes a common AI-assisted rapid prototyping failure mode: an app was built in ~3 days, but the author has spent ~2 weeks debugging slow UI/build/test loops, unclear generated code, oversized functions, ambiguous state variables, and undocumented agent-made decisions. Top technical suggestions were to have Claude generate automated tests to replace repeated manual button-click regression checks, and to develop in smaller phases with continuous debugging so early defects do not become architectural assumptions or dependencies. Commenters framed the issue as partly process-related: deferred validation creates a “Gordian knot” where fixes introduce new bugs. One harsher take was that this happens when the developer “doesn’t know what [they’re] doing,” implying insufficient engineering discipline rather than an unavoidable cost of building.

    • Several commenters emphasized adding automated tests early rather than manually clicking through UI flows: one suggested asking Claude to generate tests so regressions are caught continuously (a toy example of such a test follows this section), while another recommended building in phases and debugging incrementally because “early bugs become assumptions, and then dependencies”—delaying validation can turn fixes into cascading regressions.
    • A commenter recommended Storybloq, described as a Claude Code tool that adds a git-tracked project memory and governance layer. The claimed technical benefit is auditability of agent decisions over time, helping future debugging by preserving why prior implementation choices were made.
  • thanks Claude (Activity: 2239): The image is a non-technical meme/tweet screenshot joking that AI tools like Claude increase the speed of prototyping and abandonment: “thanks to AI i create and abandon projects 4x faster.” In context, the post extends the joke to buying more domains and “vibe coding” via ijustvibecodedthis.com; the image is here: https://i.redd.it/7oz5ncnq8pzg1.png. Comments frame this as a humorous but real critique of AI-assisted development: LLMs lower the cost of generating ideas and prototypes, but shipping, productionizing, and user adoption remain the hard parts.
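The “generate tests instead of re-clicking” suggestion from the first thread is concrete enough to sketch. Assuming a local web app served during development and Playwright’s Python bindings, a regression test replacing one manual button-click check might look like the following; the URL, button label, and expected text are placeholders for whatever the app actually does.

```python
# Hypothetical Playwright regression test replacing a manual click-through.
# Assumes `pip install pytest playwright`, `playwright install chromium`,
# and a dev server already running at the placeholder address below.
from playwright.sync_api import expect, sync_playwright

APP_URL = "http://localhost:3000"      # placeholder dev-server address
BUTTON_LABEL = "Generate report"       # placeholder UI element
EXPECTED_TEXT = "Report ready"         # placeholder success indicator

def test_generate_report_flow():
    """One manual check ('click the button, see the result') turned into code,
    so it re-runs on every change instead of being verified by hand."""
    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(APP_URL)
        page.get_by_role("button", name=BUTTON_LABEL).click()
        expect(page.get_by_text(EXPECTED_TEXT)).to_be_visible(timeout=10_000)
        browser.close()
```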

AI Discords

Unfortunately, Discord shut down our access today. We will not bring it back in this form, but we will be shipping the new AINews soon. Thanks for reading this far; it was a good run.