a quiet day.

AI News for 5/6/2026-5/7/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

OpenAI launched realtime-1.5 3 months ago, but it was a relatively modest release because it was still 4o-based intelligence (a +5% bump on Big Bench Audio). You could feel the sheer confidence behind today’s realtime-2 release (a +15.2% bump on BBA), and it was appropriately well received:

As the blogpost explains, 3 models are being released, which one might simplify to “voice-in, voice-out, and voice-to-voice”:

The focus is less on “voice quality” and more on usability. TLDR:

Preambles: Developers can enable short phrases before a main response, like “let me check that” or “one moment while I look into it”.

Parallel tool calls and tool transparency: The model can call multiple tools at once and make those actions audible with phrases like “checking your calendar” or “looking that up now,” helping agents stay responsive while completing tasks.

Stronger recovery behavior: The model can recover more gracefully by saying things like “I’m having trouble with that right now,” instead of failing or breaking.

Longer context: 32K → 128K

Stronger domain understanding: The model better retains specialized terminology, proper nouns, healthcare terms, and other domain vocabulary.

More controllable tone and delivery: The model can better adjust its tone—speaking calmly, empathetically, or in an upbeat register—based on context.

Adjustable reasoning effort: Developers can now select from minimal, low, medium, high, and xhigh reasoning levels, with low as the default.
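
Most of these knobs live in session configuration. Below is a minimal sketch of setting a preamble instruction and a reasoning-effort level over the Realtime API’s WebSocket interface; the model id and the `reasoning_effort` field are assumptions, and the `session.update` event shape is borrowed from earlier Realtime API versions, not confirmed for this release.

```python
# Hypothetical sketch: configure a Realtime session with a preamble
# instruction and an explicit reasoning-effort level. The model id and
# `reasoning_effort` field are assumptions, not documented parameters.
import asyncio
import json
import os

import websockets  # pip install websockets


async def main():
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"  # assumed model id
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    # `additional_headers` is the keyword on current websockets versions
    # (older versions used `extra_headers`).
    async with websockets.connect(url, additional_headers=headers) as ws:
        # session.update is the config event used by earlier Realtime API
        # versions; the fields below mirror the features described above.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": (
                    "Before long lookups, say a short preamble such as "
                    "'let me check that'. Announce tool use out loud, "
                    "e.g. 'checking your calendar'."
                ),
                "reasoning_effort": "low",  # assumed: minimal|low|medium|high|xhigh
            },
        }))
        async for message in ws:
            print(json.loads(message)["type"])  # inspect server events


asyncio.run(main())
```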

The demo video showed off how the audio model is better tuned to recognize when the main speaker is talking to someone else, so it stops interrupting so much:


AI Twitter Recap

Top Story: GPT-Realtime-2 and OpenAI voice AI commentary

What happened

OpenAI launched three new streaming audio models in the Realtime API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. OpenAI positioned GPT-Realtime-2 as its “most intelligent voice model yet,” bringing “GPT-5-class reasoning” to real-time voice agents that can listen, reason, handle interruptions, use tools, and sustain longer conversations as they unfold @OpenAI. The companion models target live speech translation and transcription: GPT-Realtime-Translate supports streaming translation from 70+ input languages into 13 output languages, while GPT-Realtime-Whisper streams transcription/captions as speech is produced @OpenAI, @OpenAIDevs. OpenAI said the models are available in the Realtime API now, while ChatGPT voice upgrades are still pending: “Stay tuned, we’re cooking” @OpenAI. Sam Altman framed the launch around a behavioral shift: users increasingly use voice with AI when they need to “dump” lots of context, and OpenAI is also working on improvements to ChatGPT voice @sama.

Facts vs. opinions

Factual / directly claimed by OpenAI and evaluators

  • Model family: GPT-Realtime-2, GPT-Realtime-Translate, GPT-Realtime-Whisper are available in the Realtime API today @OpenAIDevs.
  • GPT-Realtime-2 capabilities: reasoning-oriented native speech-to-speech model for production voice agents; supports tool use/action, interruption recovery, longer conversations, and “GPT-5-class reasoning” per OpenAI’s wording @OpenAI, @reach_vb.
  • Context window: community/OpenAI-dev commentary reported 128K context for GPT-Realtime-2 voice agents @reach_vb; Artificial Analysis independently reported the context window increased from 32K to 128K, with 32K max output tokens @ArtificialAnlys.
  • Translation: GPT-Realtime-Translate supports live speech translation from 70+ input languages into 13 output languages @OpenAI, @reach_vb.
  • Transcription: GPT-Realtime-Whisper provides low-latency streaming transcription in the Realtime API for captions, notes, and continuous speech understanding @OpenAIDevs.
  • Prompting/control: OpenAI published a voice prompting guide covering reasoning effort, preambles, tool behavior, unclear audio handling, exact entity capture, and state maintenance in long sessions @OpenAIDevs.
  • Independent benchmarks: Scale AI reported GPT-Realtime-2 took the top spot on its Audio MultiChallenge S2S leaderboard, with instruction retention rising from 36.7% to 70.8% APR versus GPT-Realtime-1.5 and strong performance on voice editing/real-time repair @ScaleAILabs.
  • Independent benchmarks: Artificial Analysis reported 96.6% on Big Bench Audio speech-to-speech reasoning, 96.1% on its Conversational Dynamics benchmark, average time-to-first-audio of 2.33s at high reasoning and 1.12s at minimal reasoning, and unchanged audio pricing of $1.15/hour input and $4.61/hour output @ArtificialAnlys, @ArtificialAnlys.
  • Reasoning-effort controls: Artificial Analysis reported adjustable reasoning levels: minimal, low, medium, high, xhigh, with low as default @ArtificialAnlys.
  • Enterprise/product evals: Glean said GPT-Realtime-2 delivered a 42.9% relative increase in helpfulness over the previous version in internal evals for real-time organizational voice interactions @glean. Genspark said its Call for Me Agent moved to GPT-Realtime-2 and saw +26% effective conversation rate and fewer dropped calls @genspark_ai.

Opinions / interpretation / commentary

  • Supporters described the launch as a “big step forward” for voice agents @sama, “total realtime victory” @reach_vb, and the first speech-to-speech model good enough for “real work” in complex voice agents @kwindla.
  • A more cautious view: Simon Willison noted the announcement does not mean ChatGPT Voice Mode itself has upgraded yet; the ChatGPT upgrade “sounds” like it is coming soon @simonw, @simonw.
  • Interface skepticism: Will Depue compared audio to VR—frequently exciting, but historically not sticky as an interface—while arguing that real-time tool use, reasoning while speaking, and live translation are the kinds of capabilities that could make audio interfaces finally take off @willdepue.
  • Broader UX optimism: several commenters framed voice as more natural and bandwidth-efficient for humans @BorisMPower, a path toward Jarvis-like always-available computer agents @willdepue, or eventually displaced by even higher-bandwidth BCIs @iScienceLuvr.
  • Competitive context: Elon Musk pushed Grok Voice for customer support @elonmusk, underscoring that real-time voice support/customer-service automation is now a competitive surface across labs.

Technical details and benchmark data

GPT-Realtime-2

  • Native speech-to-speech / real-time voice model, released via OpenAI’s Realtime API @OpenAI.
  • Framed as “GPT-5-class reasoning” for voice agents @OpenAI.
  • Designed for agents that can:
    • reason mid-conversation,
    • use tools/take actions,
    • handle interruptions,
    • recover when users revise or repair speech,
    • sustain longer sessions with expanded context @OpenAI, @reach_vb.
  • Reported context: 128K tokens, up from 32K @ArtificialAnlys.
  • Reported max output: 32K tokens @ArtificialAnlys.
  • Inputs reported by Artificial Analysis: text, audio, and image @ArtificialAnlys.
  • Reasoning effort levels: minimal, low, medium, high, xhigh; default low @ArtificialAnlys.
  • Time-to-first-audio:
    • 2.33s average at high reasoning,
    • 1.12s average at minimal reasoning @ArtificialAnlys.
  • Pricing:
    • $1.15/hour audio input,
    • $4.61/hour audio output,
    • unchanged versus prior model according to Artificial Analysis @ArtificialAnlys.
  • Conversational features: supports short preambles before main responses—e.g. “let me check that”—and audible transparency during tool calls—e.g. “checking your calendar” @ArtificialAnlys.

Benchmarks

  • Scale AI Audio MultiChallenge S2S: GPT-Realtime-2 placed #1; instruction retention improved from 36.7% to 70.8% APR versus GPT-Realtime-1.5; strong voice editing when users repair/revise speech in real time @ScaleAILabs.
  • Artificial Analysis Big Bench Audio: GPT-Realtime-2 high variant scored 96.6%, reported as equal to Gemini 3.1 Flash Live Preview High and roughly 13% above the previous highest result @ArtificialAnlys.
  • Justin Uberti separately summarized the improvement as 15 percentage points vs. GPT-Realtime-1.5 on Big Bench Audio, near saturation @juberti.
  • Conversational Dynamics / Full Duplex Bench subset: GPT-Realtime-2 minimal variant scored 96.1%, with strengths in pause handling and turn-taking @ArtificialAnlys.

GPT-Realtime-Translate

  • Live streaming speech translation from 70+ input languages to 13 output languages @OpenAI.
  • OpenAI cofounder Greg Brockman said real-time voice-to-voice translation has been an anticipated OpenAI application since the company’s early days and is now available for anyone to build with @gdb.
  • Vimeo demonstrated live dubbing with no pre-loaded captions, showing translations generated fully live @Vimeo.
  • Junling Zhang highlighted the new real-time translation model and encouraged API usage @jxnlco.
  • Boris Power said live translation “actually works incredibly well” and plans to use it regularly @BorisMPower.

GPT-Realtime-Whisper

  • Streaming transcription as people speak, for real-time captions, notes, and speech understanding @OpenAI.
  • Justin Uberti described it as “Whisper, but now with realtime streaming” and updated demos to use the new model @juberti.
  • Uberti also built a delay selector to expose the latency/accuracy tradeoff in a real-time typing demo @juberti.

Product integrations and demos

  • Glean: shipped real-time voice powered by GPT-Realtime-2, grounded in organizational context; internal evals showed 42.9% relative helpfulness increase over the previous version @glean.
  • Vimeo: demonstrated live dubbing using GPT-Realtime-Translate, with translations generated live and no pre-loaded captions @Vimeo.
  • Genspark: upgraded its Call for Me Agent to GPT-Realtime-2; Genspark Realtime Voice is next; claimed sharper reasoning, tighter instruction following, +26% effective conversation rate, and fewer dropped calls @genspark_ai.
  • Gradient Bang / game-agent demo: Kyle Windland said GPT-Realtime-2 is the first OpenAI speech-to-speech model good enough for his voice agents that do “real work,” showing it as the ship AI in a complex agent with tool calls and subagents @kwindla.
  • Voice-controlled market dashboard: Levin Stanley demoed GPT-Realtime-2 controlling an interface by intent—“Focus on Apple,” “How did it do over the last 30 days?”, “Go back”—arguing that real-time interruption and reasoning change the UI loop from navigation to direction @levinstanley.
  • Realtime demos: Justin Uberti updated hello-realtime for GPT-Realtime-2 and provided a phone demo number @juberti; Diego Cabezas posted a quick GPT-Realtime-2 demo @diegocabezas01; Ray Fernando hosted a “Building a Live Translator” broadcast @RayFernando1337.
  • Reachy Mini / robotics voice interface interest: Clement Delangue asked who would add the new voice capabilities to Reachy Mini @ClementDelangue, after earlier asking voice AI labs such as Gradium, Kyutai, and ElevenLabs who could help with a robot voice use case @ClementDelangue.

Why this matters

The launch pushes voice agents from “speech I/O wrapper around a chatbot” toward full-duplex, tool-using, long-context, reasoning agents. The technical shift is not just better ASR or TTS; it is the combination of low-latency turn-taking, interruption handling, longer context, tool-call transparency, and adjustable reasoning effort in a single real-time loop. That matters for customer support, meetings, accessibility, live translation, robotics, browser/computer control, and hands-free workflows where text chat is too slow or awkward.

The most important engineering implication is that voice apps now need to be designed as stateful real-time systems, not prompt-response endpoints. OpenAI’s prompting guide explicitly points developers toward reasoning-effort tuning, preambles, tool behavior, unclear-audio recovery, entity capture, and long-session state management @OpenAIDevs. This suggests voice-agent quality will increasingly depend on harness design: latency budgets, interruption semantics, tool-call UX, conversational memory, and failure recovery—not just raw model selection.
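
To make that concrete: the harness looks less like a request handler and more like an event loop that owns conversational state. A minimal sketch, assuming the server event names (`input_audio_buffer.speech_started`, `response.created`, `response.done`) and the `response.cancel` client event from earlier Realtime API versions carry over to GPT-Realtime-2:

```python
# Sketch of a stateful voice-agent harness: an event loop that owns
# interruption semantics. Event names follow earlier Realtime API
# versions and may differ for GPT-Realtime-2.
import json


class VoiceAgentHarness:
    def __init__(self, ws):
        self.ws = ws             # an open Realtime API WebSocket
        self.responding = False  # conversational state lives in the harness

    async def run(self):
        async for raw in self.ws:
            event = json.loads(raw)
            etype = event.get("type")
            if etype == "input_audio_buffer.speech_started" and self.responding:
                # Interruption semantics: cancel the in-flight response
                # rather than talking over the user.
                await self.ws.send(json.dumps({"type": "response.cancel"}))
                self.responding = False
            elif etype == "response.created":
                self.responding = True
            elif etype == "response.done":
                self.responding = False
            elif etype == "error":
                # Failure recovery: keep the session alive on transient errors.
                print("transient error:", event.get("error"))
```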

The remaining uncertainty is distribution. The API model is available now, but ChatGPT voice mode has not yet received the upgrade, per Simon Willison’s observation @simonw. If and when ChatGPT Voice gets the same capabilities, the consumer impact could be much larger. Until then, the launch primarily benefits developers and platforms building specialized real-time agents.


OpenAI Voice, Codex, and Cybersecurity Releases

  • GPT-Realtime-2 and new audio stack: OpenAI released GPT-Realtime-2 in the API, described as its most capable voice model with GPT-5-class reasoning, tool use, interruption handling, and longer conversations; it ships alongside GPT-Realtime-Translate for streaming translation across 70+ input languages / 13 output languages and GPT-Realtime-Whisper for low-latency streaming transcription @OpenAI. OpenAI says ChatGPT voice updates are still forthcoming @OpenAI. Artificial Analysis reports GPT-Realtime-2 reaches 96.6% on Big Bench Audio, leads its Conversational Dynamics benchmark at 96.1%, expands context from 32K to 128K, and keeps audio pricing unchanged @ArtificialAnlys. Scale AI also placed GPT-Realtime-2 at #1 on its Audio MultiChallenge S2S leaderboard, with instruction retention rising from 36.7% to 70.8% APR versus GPT-Realtime-1.5 @ScaleAILabs.
  • Codex gets browser control: OpenAI shipped a Chrome plugin for Codex on macOS and Windows, letting Codex operate across background tabs without taking over the user’s browser; it can use plugins where possible, Chrome for logged-in sites, and combine tools for workflows like debugging browser flows, checking dashboards, research, or CRM updates @OpenAI. The dev team emphasized browser DevTools, multi-tab parallelism, and web-app testing as key use cases @OpenAIDevs.
  • Cyber-specific GPT-5.5 access: OpenAI announced GPT-5.5 with Trusted Access for Cyber for defensive workflows and a limited-preview GPT-5.5-Cyber for authorized red teaming, pentesting, and validation under enhanced verification and account controls @cryps1s. Separately, Micah Carroll said OpenAI found instances of accidental CoT grading in previous RL runs after building a scanner, but did not find clear evidence those instances degraded CoT monitorability @MicahCarroll.

Anthropic, Interpretability, and AI Safety Tooling

  • Natural Language Autoencoders: Anthropic introduced Natural Language Autoencoders, a method for translating model activations into human-readable text so researchers can inspect “thought-like” internal representations rather than only sparse features or supervised probes @AnthropicAI. Miles Brundage/ML-powered commentary framed NLAs as complementary to probing and dictionary learning, noting they revealed planning behavior and helped identify training-pipeline translation bugs; open-model NLAs are available on Neuronpedia @mlpowered. Ryan Greenblatt cautioned that early tests did not recover “internal CoT” on single-forward-pass math cases, suggesting limitations or missing activation locations @RyanPGreenblatt.
  • Goodfire’s neural geometry agenda: Goodfire launched a research series arguing neural networks “think in shapes,” with manifolds as a core primitive for interpreting and controlling behavior @GoodfireAI. The thread contrasts manifold-level structure with SAE-style feature shattering, includes examples where steering along a learned manifold preserves coherent world-model behavior, and teases work on unsupervised manifold discovery and in-context geometry @GoodfireAI. Goodfire also linked the agenda to scientific discovery, citing reverse-engineering of a scientific foundation model to uncover biomarker structure in a curved manifold @GoodfireAI.
  • Anthropic safety infrastructure: Anthropic shared the research agenda for The Anthropic Institute, focused on economic diffusion, threats/resilience, AI systems in the wild, and AI-driven R&D with human visibility and control @AnthropicAI. It also moved Petri, its open-source interactive behavioral-evals tool, to Meridian Labs as an independent project @AnthropicAI, and opened its security bug bounty publicly on HackerOne @AnthropicAI.

Agents, RL Environments, and Coding Workflows

  • Prime Intellect Lab and Ramp Fast Ask: Prime Intellect brought Lab out of beta as a full stack for building RL environments/evals, evaluating, post-training, deploying, and serving agents @PrimeIntellect. Ramp Labs used Prime Intellect to train Fast Ask, a small RL-trained subagent for spreadsheet QA that reportedly scores +4% exact-match over Opus at Haiku-level latency @RampLabs; Prime says it outperformed Opus 4.6 while running faster and cheaper @PrimeIntellect.
  • Hermes Agent momentum: Nous/Teknium shipped Hermes Agent v0.13.0 with multi-agent orchestration via Kanban, enforced goal completion with /goal, disk-usage optimizations, custom LLM providers, and custom gateway channels @Teknium. Earlier updates added agent-free cron jobs via Hermes Gateway for programmatic recurring tasks @Teknium, blank-slate profiles with --no-skills @Teknium, and Lightpanda as a machine-native browser backend with Chrome fallback @lightpanda_io.
  • Cursor orchestration and PR workflows: Cursor introduced /orchestrate, a skill that recursively spawns planner, worker, and verifier agents via the Cursor SDK; internally it reportedly cut skill token use by 20% while improving evals and reduced backend cold-start time by 80% @cursor_ai. Cursor 3 also added an integrated PR review experience with diffs, commits, comments, review status, a file tree, and skill quick-action pills @cursor_ai.
  • Agent infra patterns: LangGraph is adding delta channels, storing checkpoint history as diffs to control storage bloat for long-context agents @sydneyrunkle. Deep Agents added sandbox backends for provider-agnostic isolated execution across Daytona, Modal, Runloop, and LangSmith, with an auth proxy pattern to keep credentials out of prompt-injectable sandboxes @sydneyrunkle.
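
The delta-channel idea is simple to sketch: store the first checkpoint whole and later ones as key-level diffs against the previous state. The following is an illustration of the storage pattern only, not LangGraph’s actual implementation or API:

```python
# Illustrative diff-based checkpointing (not LangGraph's implementation):
# keep the first checkpoint whole, store later ones as key-level deltas.
from typing import Any


def make_delta(prev: dict[str, Any], curr: dict[str, Any]) -> dict[str, Any]:
    """Record only keys that changed or were added since `prev`.
    (Key deletion is not handled in this simplified sketch.)"""
    return {k: v for k, v in curr.items() if prev.get(k) != v}


def replay(base: dict[str, Any], deltas: list[dict[str, Any]]) -> dict[str, Any]:
    """Reconstruct the latest checkpoint by applying deltas in order."""
    state = dict(base)
    for delta in deltas:
        state.update(delta)
    return state


base = {"messages": ["hi"], "scratchpad": ""}
d1 = make_delta(base, {"messages": ["hi", "hello"], "scratchpad": ""})
assert replay(base, [d1])["messages"] == ["hi", "hello"]
```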

Models, Benchmarks, and Inference Systems

  • xAI, Zhipu, Zyphra, DeepSeek ecosystem: xAI made Image Generation Quality Mode available on the xAI API after powering more than 300M images in Grok, claiming better realism, text rendering, and creative control @xai. Zhipu published the GLM-5V-Turbo technical report, highlighting CogViT dual-teacher distillation, multimodal multi-token prediction, multimodal coding/tool use, and RL across 30+ task categories @Zai_org. Zyphra’s ZAYA1-8B was described as AMD-trained, using under 1B active parameters, large-scale RL, and a test-time method called Markovian RSA @kimmonismus. Antirez also released DS4, a specialized inference engine for DeepSeek v4 Flash built on llama.cpp/GGML lineage @antirez.
  • Google model and API updates: Google AI Studio announced Gemini 3.1 Flash-Lite as its most cost-efficient model for high-volume agentic tasks, translation, and simple data processing @GoogleAIStudio. Google also evolved the Gemini Interactions API from role-based user/model messages to typed steps such as user_input, thought, function_call, tool_call, and model_output, targeting richer multi-step agent workflows @GoogleAIStudio (a hypothetical typed-step trace is sketched after this list). Gemma 4’s MTP/speculative decoding was reported to deliver up to 3× faster on-device inference @googlegemma, with independent vLLM tests showing large throughput gains and 129 tok/s on simple generation on an RTX Pro 6000 @bnjmn_marie.
  • Sequence models and coding evals: Aviv Bick and Albert Gu introduced Raven, a fixed-state sequence model that learns which finite memory slots to update, aiming to fix persistence failures in SSMs and sliding-window attention and outperform prior linear models at 16Ă— training sequence length @avivbick, @_albertgu. Scale released the SWE Atlas Refactoring leaderboard, testing whether agents can restructure code without regressions; Claude Opus 4.7 with Claude Code leads @ScaleAILabs. Arena’s longitudinal analysis says open models have largely closed the Text Arena gap, with the proprietary lead now around +30 Arena points, though expert prompts remain harder @arena.
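
The typed-step shape of the Interactions API is easy to picture. Below is a hypothetical trace using only the step types named in the announcement; every field other than `type` is an assumption, not Google’s documented schema:

```python
# Hypothetical multi-step trace using the typed steps named in the
# announcement; field names other than the step types are assumptions.
interaction = [
    {"type": "user_input", "text": "What's on my calendar tomorrow?"},
    {"type": "thought", "text": "Need the calendar tool for tomorrow's date."},
    {"type": "function_call", "name": "get_events", "args": {"date": "2026-05-08"}},
    {"type": "tool_call", "result": [{"title": "Design review", "time": "10:00"}]},
    {"type": "model_output", "text": "You have a design review at 10:00."},
]
```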

AI Infrastructure, Health, Robotics, and Applied Products

  • Compute and infrastructure: Anthropic’s SpaceX/xAI compute deal remained a major theme: Dario Amodei called the SpaceX partnership “visionary engineering + Claude” @Mononofu, while Simon Willison highlighted that Anthropic reportedly gets Colossus 1, xAI keeps the larger Colossus 2, and Colossus 1 has been the subject of environmental controversy @simonw. Lambda closed a $1B senior secured credit facility to expand AI factories @LambdaAPI, AMD promoted MI350P PCIe with 144GB HBM3E and up to 2299 TFLOPS MXFP4 @AMD, and Ai2 brought new NSF OMAI compute online with NVIDIA Blackwell Ultra systems from a $152M NSF/NVIDIA investment @allen_ai.
  • Google Health and medical AI: Google is turning Fitbit into the Google Health app on May 26, combining Fitbit tracking with Google services and a Gemini-powered Google Health Coach @googlehealth. Google says Health Premium will be included in AI Pro and Ultra plans @shimritby, and announced Fitbit Air, a screenless wearable with up to one-week battery and $99.99 preorder pricing @Google. Separately, Glass Health launched an ambient scribing API at $0.85/hour for transcription plus token-priced note generation @GlassHealthHQ.
  • Robotics and local agents: Perplexity released Personal Computer in a new Mac app, letting agents operate across local files, native Mac apps, web, and Perplexity servers, including remote initiation from iPhone and always-on Mac mini setups @perplexity_ai. NVIDIA Robotics highlighted Hugging Face’s Reachy Mini “agentic robotics app store” and Isaac GR00T N integration with LeRobot workflows @NVIDIARobotics. EO-1 is now available through the standard LeRobot policy interface for robot-control training/eval/deploy workflows @SongHaomin92651.

Top tweets by engagement

  • OpenAI GPT-Realtime-2 API launch — 11.7K engagement @OpenAI
  • Anthropic Natural Language Autoencoders — 10.1K engagement @AnthropicAI
  • Claude Mythos helped Firefox fix more security bugs in April than in the prior 15 months — 9.7K engagement @alexalbert__
  • OpenAI Codex Chrome plugin — 7.7K engagement @OpenAI
  • Goodfire neural geometry research agenda — 5.1K engagement @GoodfireAI
  • Sam Altman on voice as a high-context AI interface — 5.0K engagement @sama
  • xAI Image Generation Quality Mode API — 4.5K engagement @xai

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen3.6 27B Local Inference and Quantization

  • 2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints (Activity: 1798): A recent llama.cpp MTP PR (#22673) enables Qwen 3.6 27B’s built-in multi-token prediction tensors for speculative decoding; the poster converted MTP-capable GGUF quants (HF) and reports ~2.5× faster generation on an M2 Max 96GB, reaching 28 tok/s with --spec-type mtp --spec-draft-n-max 3. They also published fixed Jinja chat templates (HF) and provide llama-server settings for OpenAI/Anthropic-compatible local serving with q8_0 KV cache and up to 262144 context; recommendations emphasize q8_0-mtp as the best speed/quality quant, avoiding q4_0 KV beyond 64k, and note that Qwen3.6-27B only uses KV cache in 16 of 65 layers due to hybrid linear attention, reducing KV memory ~4×. A commenter reports that on an RTX Pro 6000 Max-Q, Qwen 3.6 “2.7B” Q8 generation increases from 36 tok/s to 78 tok/s with MTP, at ~20% slower prompt processing and no observed output-quality degradation; the post also warns that vision currently crashes llama.cpp when combined with MTP. Commenters broadly frame this as part of a major recent acceleration in local inference, making consumer-hardware agentic coding more viable; one technical question asks whether turbo3/turbo4 was merged separately or is part of the MTP PR. (A local-usage sketch follows the comment summaries below.)

    • A user benchmarked qwen 3.6 2.7B Q8 on an RTX Pro 6000 MaxQ and reported generation increasing from 36 tok/s to 78 tok/s with MTP, roughly a 2.17x speedup. They noted an approximately 20% prompt-processing slowdown, but said output quality appeared unchanged, making the tradeoff favorable for generation-heavy workloads.
    • One commenter asked whether the speedup depends on the recent turbo3/turbo4 merge or is specifically part of the MTP PR, highlighting that the implementation path matters for reproducing the claimed inference gains.
    • There was a technical comparison question against Qwen 3.6 Dflash variants and low-bit iq3_XS quantizations. The commenter reported usually fitting 256k context into 16GB VRAM and asked whether these quants can also support 256k context without mmproj, indicating interest in KV-cache/context-length feasibility across quant formats.
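
Since the quants are served with drop-in OpenAI-compatible endpoints, local usage reduces to pointing the standard client at llama-server. A sketch under those assumptions (the GGUF filename and model string are placeholders; the --spec-* flags are as quoted in the post):

```python
# Sketch: talk to a local llama-server exposing an OpenAI-compatible API.
# Assumed server launch, combining known llama-server flags with the MTP
# flags quoted in the post (the GGUF filename is a placeholder):
#   llama-server -m qwen3.6-27b-q8_0-mtp.gguf -c 262144 \
#       --cache-type-k q8_0 --cache-type-v q8_0 \
#       --spec-type mtp --spec-draft-n-max 3
from openai import OpenAI

# llama-server listens on port 8080 by default; the api_key is unused locally.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="qwen3.6-27b",  # placeholder; llama-server serves whatever it loaded
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
)
print(resp.choices[0].message.content)
```
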
  • Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,…) (Activity: 820): The post benchmarks Qwen 3.6 27B GGUF quantizations on a deliberately odd PGN-to-SVG chess-rendering task, testing board-state tracking, piece placement, orientation, and last-move highlighting with identical llama.cpp sampling settings (temp=0.6, top_p=0.95, top_k=20, ctx=65536). The author reports BF16/Q8_0 as essentially correct, Q6_K showing placement degradation, Q5_K_XL/Q4_K_XL/IQ4_XS still usable, IQ3_XXS mostly correct but with wrong board orientation, and Q2_K_XL structurally broken despite correct piece positions; full outputs are posted at qwen3-6-27b-benchmark.vercel.app. For local 16 GB VRAM use, they prefer IQ4_XS, reporting about pp 100 tps / tg 8 tps on vanilla llama.cpp, improved to roughly pp 760 tps / tg 22 tps using TheTom’s TurboQuant fork with -ngl 99, turbo4/turbo2 KV-cache quantization, and context limited below ~75k. The main technical caveat raised in comments is that the evaluation appears to be single-run, so stochastic variance could make individual quantization results outliers; commenters still noted that the observed degradation trend broadly matches expectations.

    • Several commenters questioned whether the quantization comparison used single-run evaluations or repeated trials, noting that LLM outputs can vary enough that “one run is not enough” and may produce misleading conclusions from statistical noise or outlier generations. They still observed an apparent expected trend of quality degradation as quantization becomes more aggressive, but wanted multiple samples per quant level to support the findings.
    • One technically substantive takeaway was that 4-bit quantization appears to remain the practical sweet spot, with 3-bit quants still described as usable despite common skepticism. A commenter argued that above roughly 5-bit, users may often gain more by moving to a larger/better model rather than preserving extra precision on a smaller one, citing comparisons like 122B UD-Q3_K_XL versus 35B IQ4_NL.
  • Qwen3.6 27B uncensored heretic v2 Native MTP Preserved is Out Now With KLD 0.0021, 6/100 Refusals and the Full 15 MTPs Preserved and Retained, Available in Safetensors, GGUFs and NVFP4s formats. (Activity: 530): llmfan46 released Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved on Hugging Face, claiming KLD = 0.0021, 6/100 refusals, and preservation/retention of the full 15 native MTP heads across Safetensors, GGUF, NVFP4, NVFP4-GGUF, NVFP4-MLP-only, and GPTQ-Int4 variants. The post says the release includes benchmarks and that all variants were checked for full MTP retention; the author’s full model list is here. Commenters requested additional deployment-oriented quantization support, especially Q4_K_XS for 16GB systems, and asked whether MTP works with TurboQuant-compressed KV cache or could be applied to Gemma 4 dense models. One technical concern was that if the MTP draft heads were trained on the original refusal-aligned model while only the base was fine-tuned, MTP acceptance may degrade or “fight the heretic” specifically on newly unlocked refusal/tail-behavior cases despite the low aggregate KLD = 0.0021.

    • A key concern was whether preserving the full 15 MTP heads is actually beneficial after an uncensoring/heretic fine-tune: if the draft heads retain the original refusal distribution while the base model was modified, speculative decoding may “fight” the newly unlocked outputs. One commenter noted that the reported KLD 0.0021 indicates the base stayed close overall, but may not capture tail behavior on refusal/unlocked prompts, making MTP acceptance rate on heretic cases the more important validation metric.
    • Users asked for deployment-specific quantization details, including a Q4_K_XS GGUF target to fit 16GB VRAM while retaining useful context, and whether preserved MTP remains compatible with TurboQuant-compressed KV cache. Another hardware-focused question flagged that NVFP4 + MTP on Blackwell may currently be blocked by CUDA/tooling support, with the commenter saying the stack appears “dead in the water until a new CUDA version is released.”
    • There were implementation questions around multimodal packaging and stability: commenters noted the inclusion of mmproj files and asked whether crashes related to PR #22673 are still present. Another asked whether the same MTP-preservation approach could apply to a future Gemma 4 dense model, implying interest in portability of native MTP heads across architectures/fine-tunes.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. Claude Limits Raised via SpaceX Compute

  • Doubled Rate Limits for Claude Code (Activity: 3901): Anthropic says a new compute-capacity partnership with SpaceX, plus other recent compute deals, enabled higher usage limits across Claude Code and the Claude API (announcement). Effective immediately, Claude Code Pro/Max no longer has the prior peak-hours limit reduction, and Opus-model API rate limits are being “substantially” raised. Top comments were mostly non-technical reactions: surprise/skepticism about whether the announcement is real, plus speculation that the SpaceX/Anthropic tie-up reflects Elon Musk’s rivalry with Sam Altman.

  • SpaceX Compute Deal - Double Limits (Activity: 1931): Anthropic announced a compute partnership with SpaceX to “substantially increase” capacity, alongside other compute deals, and is immediately changing limits: removing peak-hours limit reductions for Claude Code Pro/Max and substantially raising API rate limits for Opus models (Anthropic announcement). The post does not specify exact new rate-limit numbers or the nature of the SpaceX compute arrangement. Comments are skeptical that higher limits will materially improve usable capacity, with one noting users may simply hit weekly caps faster and another comparing Claude unfavorably to OpenAI Codex usage economics. There’s also concern that any improvement may be temporary and regress within weeks or months.

    • Several commenters argue that a raw compute-capacity deal would not materially improve Claude Chat unless Anthropic also changes product-level throttles: “A usage limit increase that doesn’t change the weekly limit is practically useless.” The key technical/product distinction raised is between backend compute availability and enforced per-user weekly quota policy.
    • One comparison frames Anthropic’s quota pressure against OpenAI Codex pricing/usage: a user claims “$20 on codex gets you infinitely more usage than Claude,” suggesting Anthropic may be reacting to user churn caused by stricter effective compute limits. The discussion implies that any short-term limit relaxation may be temporary if demand again saturates available capacity.

2. AI Lab Corporate Governance Drama

  • Sam Altman texts Mira Murati. November 19, 2023. [This document is from Musk v. Altman (2026).] (Activity: 5431): The post references an image/document titled “Sam Altman texts Mira Murati. November 19, 2023”, allegedly from Musk v. Altman (2026), but the linked Reddit gallery was inaccessible due to 403 Forbidden, so the actual text-message contents could not be verified or summarized. No technical claims, model details, benchmarks, implementation facts, or litigation-document substance were available from the provided post metadata.

  • xAI will be dissolved as a separate entity. (Activity: 2116): The image is a non-technical screenshot of an X.com post attributed to Elon Musk, claiming that xAI would be dissolved as a separate company and folded into “SpaceXAI,” described as AI products from SpaceX: image. No implementation details, model changes, infrastructure plans, or product roadmap are provided in the post/title, so the significance is primarily corporate-structure/contextual, not technical. Comments frame the move as consistent with Musk’s prior desire to combine AI work with his other companies, while skeptics characterize it as potentially moving unprofitable AI efforts into SpaceX, a profitable/government-contract-supported entity.

AI Discords

Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.