a quiet day.

AI News for 6/5/2026-6/8/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!


AI Twitter Recap

Coding Agents, Loops, and the Shift from “Passing Tests” to Mergeable Software

Model Releases, Local Inference, and Serving Stack Upgrades

  • Kimi shipped both a stronger coding agent and a desktop agent product: Moonshot released a major update to Kimi Code, its open-source coding agent, adding one-line CLI install, drag-and-drop video as coding context, ACP support, plugins, and IDE integration (announcement). It also launched Kimi Work, a desktop agent product with up to 300 local sub-agents, browser-use via extension, finance-focused tool access, and persistent memory (product launch, desktop availability).
  • Google pushed hard on efficient local deployment: Gemma got several notable upgrades. New QAT Gemma 4 checkpoints reportedly preserve performance while using ~4x less memory, with Gemma 4 E2B fitting in about 1GB using a mobile quantization format (@_philschmid). Separately, Gemma 4 MTP was merged into llama.cpp, enabling faster decoding when paired with QAT checkpoints (Gemma team). llama.cpp also added video input support, expanding local multimodal use cases.
  • Open-source/open-weight competition remains intense: Artificial Analysis reported MiniMax-M3 at 55 on its Intelligence Index, which would make it the leading open-weights model once weights are released. M3 adds native multimodality and a 1M token context window, with strong GPQA/MMMU-Pro numbers but notable abstention on hallucination-sensitive evals. Meanwhile norpadon announced Apple-hardware-optimized quantized Qwen3.5 checkpoints.
  • Serving infrastructure is broadening from text LLMs to world models and omni models: vLLM-Omni 0.22.0 added day-0 support for NVIDIA Cosmos 3 world models, robot serving APIs, TTS models such as Qwen3-TTS and VoxCPM2, faster image/video serving, and broader quantization/hardware coverage (release). This reflects a broader trend toward generalized multimodal serving rather than text-only inference stacks.

Benchmarks, Evaluation Methodology, and Real-World Agent Measurement

  • Agent evaluation is moving from synthetic tasks to in-the-wild telemetry: Arena launched Agent Arena, a leaderboard based on over 1M real-world sessions, using causal tracing rather than voting to estimate treatment effects of orchestrators/harnesses across five signals: confirmed success, praise vs complaint, steerability, bash recovery, and tool hallucination (overview, methodology thread). Whether the methodology fully holds up remains to be seen, but it’s one of the clearest attempts yet to benchmark deployed agents using actual usage traces.
  • Specialized benchmarks keep proliferating into new output domains: Hugging Face and Mecado released CADGenBench, a benchmark for generating and editing engineering-grade 3D CAD parts from drawings or STEP modifications, with metrics covering geometry, topology, interface compatibility, and CAD validity (launch thread, Thom Wolf summary). This is a meaningful shift: evaluation is expanding beyond text/code into structured artifacts where correctness is physical and geometric.
  • A recurring thesis: good benchmarks become training pipelines: Ofir Press argued that the best benchmarks are scalable and rooted in real-world crawled data sources, making them useful not just for measurement but also for data generation. That view shows up implicitly in both FrontierCode and Agent Arena: benchmarks are no longer static scoreboards; they are becoming feedback loops for product and RL improvement.

Google, Apple, and the Consumer AI Platform Race

  • Google expanded AI packaging, Search, and developer surfaces: Google announced a more capable NotebookLM with agentic chat, stronger reasoning, and more output formats for Ultra subscribers (launch). It also cut Google AI Plus pricing from $7.99 to $4.99/month while doubling storage to 400GB (pricing update). On the platform side, Google highlighted a major Search upgrade, including multimodal search and Gemini 3.5 Flash as the new default in AI Mode.
  • Apple’s WWDC AI story centered on integration, not frontier leadership: Commentary around WWDC focused on a rebuilt Siri AI with on-screen awareness, app actions, personal context, and better voice interaction, alongside concerns about EU availability and hardware gating (kimmonismus live thread, regional limitation note). A technically notable detail came from awnihannun: Apple’s on-device model is reportedly a 20B-parameter query-routed architecture that loads experts from NAND into RAM once per query, a nonstandard design optimized for device constraints.

Research Directions: Continual Learning, Agent Training, and Optimization Debates

Top Tweets (by engagement)


AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Commodity-Hardware LLM Inference Updates

  • llama.cpp Gemma4 MTP support merged! (Activity: 1097): llama.cpp merged PR #23398, adding Gemma 4 multi-token prediction (MTP) support via --spec-type draft-mtp and a draft/assistant GGUF model, enabling speculative-style decoding for supported Gemma 4 variants. A commenter reports 140 tok/s on Gemma 4 12B using 12GB VRAM on an RTX 4070 Super with Unsloth QAT GGUF, an MTP assistant/drafter Q8_0 GGUF, and --spec-draft-n-max 4; the PR’s mtp-bench results show throughput gains of more than 2× for dense models versus non-MTP, while MoE variants reportedly did not speed up on the author’s system. The implementation is reported to reproduce Gemma team AIME-26 performance of about 87% for the 31B and 26B-A4B models; E4B/E2B variants remain unsupported, and multi-GPU may require --spec-draft-device with -sm layer. Commenters are enthusiastic about combining QAT + MTP, with explicit thanks to contributor u/am17an for the llama.cpp integration.

    • A user reports Gemma 4 12B running at 140 tok/s on an RTX 4070 Super with 12GB VRAM using the newly merged llama.cpp MTP support, Unsloth QAT GGUF weights, and an MTP drafter model. Their command uses --model-draft, --spec-type draft-mtp, --spec-draft-n-max 4, and a large --ctx-size 131072, with model links to Unsloth QAT GGUF and MTP assistant/drafter Q8_0 GGUF.
    • One benchmark on NVIDIA GB10 Grace Blackwell / Asus Ascent GX10 tested Gemma-4-31B-it-Q8_0.gguf with gemma-4-31B-it-MTP-Q8_0.gguf, describing Q8 as “basically full precision.” Without MTP, throughput was consistently around 6.2–6.4 tok/s; with --spec-type draft-mtp --spec-draft-n-max 7, throughput rose to 15.7–31.2 tok/s depending on task, roughly a 2.5–5x speedup while preserving reasoning mode via --reasoning on.
    • The detailed MTP benchmark shows task-dependent acceptance behavior: translation reached 31.2 tok/s with 0.699 draft acceptance, summarization hit 29.4 tok/s with 0.645, while creative writing was much lower at 15.7 tok/s with only 0.277 acceptance. This suggests Gemma 4 MTP acceleration is highly workload-sensitive, with deterministic or constrained tasks benefiting more from speculative multi-token prediction than open-ended creative generation.
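The benchmark figures above can be sanity-checked with a little arithmetic. The sketch below is not from the post: it computes the speedup range implied by the reported tok/s numbers, then applies the standard speculative-decoding estimate of tokens emitted per verification pass, (1 - alpha^(k+1)) / (1 - alpha), under the simplifying assumption that each drafted token is accepted independently with probability alpha. Treating the reported draft acceptance figure as that alpha is an assumption made here for illustration only.

```python
def speedup_range(base_tps, mtp_tps):
    """Return (min, max) speedup implied by two (low, high) tok/s ranges."""
    base_lo, base_hi = base_tps
    mtp_lo, mtp_hi = mtp_tps
    # Worst case: slowest MTP run vs. fastest baseline; best case: the reverse.
    return mtp_lo / base_hi, mtp_hi / base_lo


def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per verification pass with k drafted tokens.

    Assumes each drafted token is accepted independently with probability
    alpha (a simplification; real acceptance depends on position and task).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    low, high = speedup_range((6.2, 6.4), (15.7, 31.2))
    print(f"implied speedup: {low:.2f}x to {high:.2f}x")
    reported = [
        ("translation", 0.699),
        ("summarization", 0.645),
        ("creative writing", 0.277),
    ]
    for task, alpha in reported:
        tokens = expected_tokens_per_pass(alpha, k=7)
        print(f"{task}: ~{tokens:.2f} tokens per pass (idealized)")
```

On the reported figures, the low end works out to about 2.5x (creative writing) and the high end to about 5x (translation). The idealized tokens-per-pass estimate ignores drafting overhead, so it should be read as an upper bound on the benefit, not a prediction of measured throughput.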
  • You don’t need a GPU to run gemma-4-26B-A4B (Activity: 902): OP reports running Gemma 26B-A4B CPU-only on an Intel i5-8500 + 32GB RAM, Linux, via KoboldCpp, achieving roughly 7 tok/s with no GPU; prior ~12B dense models were usable but slower. Commenters note the key technical reason is that the model has only about 4B active parameters despite 26B total parameters, so CPU inference is practical as long as the quantized weights fit in system RAM. Comments broadly agree that capable local inference does not necessarily require cloud access or high-end GPU rigs, though one commenter argues even a cheap used GPU with 8GB VRAM would provide a large speedup.

    • Commenters note that Gemma 26B-A4B is relatively feasible on CPU/consumer hardware because it has only about 4B active parameters per token despite a larger total parameter count; the main constraint is fitting the model weights in system RAM rather than requiring high-end GPU compute.
    • A technical caveat raised is that even a small used GPU with 8GB VRAM could significantly improve usability, with one commenter estimating roughly 5x better performance versus CPU-only execution, assuming the model or active working set can benefit from GPU acceleration.
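The commenters' point about fitting the weights in system RAM is easy to check with back-of-envelope arithmetic. The sketch below is illustrative only: the bits-per-weight values are assumed quantization levels (the post does not say which quantization was used), the `headroom` margin is a hypothetical allowance, and the estimate ignores KV cache, activations, and runtime overhead.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate storage for model weights, ignoring metadata and overhead."""
    return n_params * bits_per_weight / 8


def fits_in_ram(n_params, bits_per_weight, ram_bytes, headroom=0.8):
    """True if the weights fit within a fraction of RAM.

    headroom is an assumed safety margin left for the OS, KV cache,
    and activations; it is not a measured value.
    """
    return weight_bytes(n_params, bits_per_weight) <= ram_bytes * headroom


TOTAL_PARAMS = 26e9   # total parameters, from the model name
ACTIVE_PARAMS = 4e9   # approximate active parameters per token, per the thread
RAM_BYTES = 32e9      # the OP's 32GB of system RAM, treated as decimal GB

if __name__ == "__main__":
    for bits in (16, 8, 4):  # assumed precisions, not taken from the post
        size_gb = weight_bytes(TOTAL_PARAMS, bits) / 1e9
        ok = fits_in_ram(TOTAL_PARAMS, bits, RAM_BYTES)
        print(f"{bits}-bit weights: {size_gb:.1f} GB, fits with headroom: {ok}")
    # Per-token compute scales with active, not total, parameters.
    print(f"active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.0%}")
```

The design point is that memory capacity is set by total parameters while per-token work is set by active parameters, which is why a sparse 26B model can be more usable on a CPU than a smaller dense one.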
  • Xiaomi just claimed 1,000+ tps on a 1T model using a standard 8-GPU server (Activity: 818): Xiaomi MiMo claims MiMo-V2.5-Pro-UltraSpeed reaches 1000+ tokens/s decode throughput (reportedly up to ~1200 tps) for a 1T-parameter MoE on a single “standard” 8-GPU commodity node, via TileRT persistent/fused/pipelined kernels plus DFlash speculative decoding with acceptance lengths around 4.3–6.3 tokens. The key model-side optimization is selective MXFP4 QAT: Xiaomi says naively applying FP4 hurts reasoning/code, so they quantize only the MoE experts, which hold the bulk of parameters and are the most quantization-tolerant modules, while keeping other modules at original precision to reduce bandwidth pressure with minimal quality loss. Access is positioned as a limited enterprise/API trial from June 9–23, 2026, with promotional pricing at 3× MiMo-V2.5-Pro. Commenters focused on whether “standard 8-GPU node” is underspecified, asking which GPUs were used, and framed the result as evidence that compressed sparse/MoE architectures are becoming increasingly economical despite prior skepticism. One commenter argued the real “Token Winter” is not model capability but consumer hardware scarcity/pricing while datacenters monopolize GPUs for inefficient inference.

    • Commenters highlighted that Xiaomi’s reported 1,000+ TPS depends heavily on the unspecified “standard 8-GPU server” configuration, with questions about whether the GPUs are datacenter-class cards or consumer GPUs such as RTX 5090/3090, making the throughput claim hard to evaluate without hardware details.
    • A key technical point was Xiaomi’s selective FP4 quantization strategy for MiMo-V2.5-Pro: instead of applying FP4 to the full model, they quantize only the MoE expert layers, which contain most parameters and are more quantization-tolerant, while keeping non-expert modules at original precision. The cited claim is that FP4 QAT preserves reasoning/code capability while reducing model size and improving memory-bandwidth utilization.
    • The released weights were linked on Hugging Face: XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash. One commenter also questioned the architecture notation, asking whether the model is effectively 1T-A1B, implying a very large total-parameter MoE model with a much smaller active parameter count per token.
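To see why quantizing only the experts can capture most of the savings, the sketch below estimates average bits per weight under selective quantization. Two inputs are assumptions rather than Xiaomi's figures: the fraction of parameters that sit in expert layers (`expert_fraction`, a hypothetical value) and 16-bit storage for the unquantized modules. The 4.25 bits per MXFP4 weight follows from the Microscaling format's 4-bit elements plus one shared 8-bit scale per 32-element block.

```python
# MXFP4: 4-bit elements plus one shared 8-bit scale per block of 32 elements.
MXFP4_BITS = 4 + 8 / 32


def average_bits(expert_fraction, expert_bits=MXFP4_BITS, other_bits=16.0):
    """Average bits per weight when only expert layers are quantized.

    expert_fraction is the share of parameters in MoE experts (hypothetical
    input); other_bits assumes the remaining modules stay at 16-bit.
    """
    if not 0.0 <= expert_fraction <= 1.0:
        raise ValueError("expert_fraction must be in [0, 1]")
    return expert_fraction * expert_bits + (1.0 - expert_fraction) * other_bits


def compression_ratio(expert_fraction, baseline_bits=16.0):
    """Size reduction relative to storing every weight at baseline_bits."""
    return baseline_bits / average_bits(expert_fraction)


if __name__ == "__main__":
    for fraction in (0.80, 0.90, 0.95):  # illustrative values only
        bits = average_bits(fraction)
        ratio = compression_ratio(fraction)
        print(f"experts={fraction:.0%}: {bits:.2f} bits/weight, {ratio:.2f}x smaller")
```

The closer the expert share is to 100%, the closer selective quantization gets to the full-model FP4 footprint, while the precision-sensitive non-expert modules stay untouched.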
  • Gemma 4 Chat Template now has preserve thinking (Activity: 447): Google’s Gemma Team has updated the official google/gemma-4-31B-it chat template to support preserve_thinking, according to the linked Hugging Face discussion. The thread also documents practical inference/deployment paths for the multimodal 31B instruction model, including transformers pipeline / AutoProcessor + AutoModelForImageTextToText, plus OpenAI-compatible serving via vLLM and SGLang. Commenters view official preserve_thinking support as validation of earlier community “aftermarket” chat-template modifications, with one noting they “know that it works very well.” Several users want a larger Gemma 4 124B MoE variant to better exploit the updated template, especially for agentic coding workloads.

    • Users note that the official Gemma 4 chat template appears to be adding preserve_thinking, a behavior some had already enabled via aftermarket/custom templates and found effective. The technical claim is that retaining hidden/structured reasoning across turns is particularly useful for agentic coding workflows, where tool-use and multi-step context continuity matter.
    • One commenter cautions that the change may not actually be live yet: they report it is still an open PR, not merged, and that the model files show no update for roughly 21 days. This suggests users should verify the template version before assuming preserve_thinking is available in official Gemma 4 artifacts.
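Given the caveat that the change may still be an unmerged PR, one way to verify a locally downloaded checkpoint is to search its chat template for the flag. The sketch below assumes the template is stored either in a `chat_template.jinja` file or under a `chat_template` key in `tokenizer_config.json`; both locations are assumptions about the checkpoint layout, and the helper name is hypothetical.

```python
import json
from pathlib import Path


def template_mentions(model_dir, flag="preserve_thinking"):
    """Return True if any chat template found under model_dir mentions flag."""
    root = Path(model_dir)
    templates = []

    # Assumed location 1: a standalone Jinja template file.
    jinja_file = root / "chat_template.jinja"
    if jinja_file.is_file():
        templates.append(jinja_file.read_text(encoding="utf-8"))

    # Assumed location 2: a string embedded in the tokenizer config.
    config_file = root / "tokenizer_config.json"
    if config_file.is_file():
        config = json.loads(config_file.read_text(encoding="utf-8"))
        template = config.get("chat_template")
        if isinstance(template, str):
            templates.append(template)

    return any(flag in text for text in templates)
```

A plain substring match only shows that the template references the flag; it does not confirm that the serving stack actually passes the option through.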

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. Claude Code Security, Privacy, and Token Limits

  • An active attack is planting backdoors inside Claude Code right now. If you use npm, your credentials may already be compromised. (Activity: 1039): The post alleges an active npm supply-chain campaign against @redhat-cloud-services packages (32 packages, ~117k weekly downloads) plus a later “Phantom Gyp” wave (57 packages, ~647k monthly downloads), where malicious install/build hooks exfiltrate credentials and persist via ~/.claude/settings.json Claude Code SessionStart hooks and .vscode/tasks.json folderOpen tasks; sources cited include Microsoft’s Miasma writeup, StepSecurity on binding.gyp abuse, and Snyk cleanup guidance. The recommended incident-response order is: check dependency trees/lockfiles for affected packages/versions, inspect editor persistence, disconnect and clean before rotating secrets, then rotate from a trusted machine across npm/GitHub/SSH/cloud/Kubernetes/Vault, audit npm publish history/GitHub security logs/self-hosted runners/OIDC trusts, and temporarily use npm install --ignore-scripts plus lockfile integrity hashes and least-privilege CI/CD tokens. Top comments are mostly operational: one commenter thanks the author, while another asks whether this is the same as an earlier incident or a separate new campaign.

    • A detailed remediation checklist identifies potentially affected npm packages: @redhat-cloud-services, @vapi-ai/server-sdk, and ai-sdk-ollama, recommending npm ls checks plus lockfile review for versions published around June 1 and June 3–4. The guidance emphasizes containment before token rotation: inspect ~/.claude/settings.json for unexpected SessionStart hooks and .vscode/tasks.json for suspicious folderOpen tasks, then disconnect/clean before rotating credentials from a trusted machine.
    • The comment describes suspected worm behavior across GitHub/npm supply-chain surfaces: checking the GitHub security log for unexpected repos, GitHub Actions workflows, self-hosted runners, and references such as “Miasma” or “Shai-Hulud.” It specifically calls out GitHub Actions OIDC trust relationships as a high-value rotation target, noting this as the hole allegedly used in the Red Hat compromise, and advises reviewing npm publish history for unauthorized republished package versions.
    • Mitigations discussed include pinning dependencies with integrity hashes so republished packages with different contents fail before execution, and temporarily using npm install --ignore-scripts to block malicious install hooks, binding.gyp, and node-gyp build-time execution. Another commenter questions why the alleged direct pushes to Red Hat repositories were possible at all, arguing that protected main/master branches should require PR-based merges with multiple approvers.
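The persistence checks in the checklist above could be scripted roughly as follows. This is a hedged sketch, not a vetted detection tool: it assumes Claude Code stores hooks under a top-level `hooks` object keyed by event name in `~/.claude/settings.json`, and that VS Code auto-run tasks are marked with `runOptions.runOn` set to `folderOpen`. It only lists entries for manual review, the function names are hypothetical, and a `tasks.json` that contains comments will not parse with the strict `json` module.

```python
import json
from pathlib import Path


def load_json(path):
    """Parse a JSON file, returning None if it is missing or not strict JSON."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None


def session_start_commands(settings):
    """Collect commands registered under hooks.SessionStart (assumed schema)."""
    commands = []
    if not isinstance(settings, dict):
        return commands
    for group in settings.get("hooks", {}).get("SessionStart", []):
        for hook in group.get("hooks", []):
            if "command" in hook:
                commands.append(hook["command"])
    return commands


def folder_open_tasks(tasks_config):
    """Collect labels of tasks configured to run when a folder is opened."""
    labels = []
    if not isinstance(tasks_config, dict):
        return labels
    for task in tasks_config.get("tasks", []):
        if task.get("runOptions", {}).get("runOn") == "folderOpen":
            labels.append(task.get("label", "<unlabeled>"))
    return labels


if __name__ == "__main__":
    settings = load_json(Path.home() / ".claude" / "settings.json")
    for command in session_start_commands(settings):
        print("SessionStart hook:", command)
    tasks = load_json(Path(".vscode") / "tasks.json")
    for label in folder_open_tasks(tasks):
        print("folderOpen task:", label)
```

Any entry the script prints still needs human judgment, since legitimate hooks and tasks use the same mechanisms. Consistent with the guidance above, inspection like this belongs before credential rotation, not after.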
  • Anthropic changed their privacy policy today and there’s a specific clause that every Claude user needs to know about (Activity: 784): OP claims Anthropic published a revised Privacy Policy on 2026-06-08, effective 2026-07-08, changing law-enforcement disclosure from externally compelled legal process to disclosure based on Anthropic’s internal “good faith belief” that it is necessary. The post argues this creates risk for false positives from automated safety classifiers—especially roleplay, fiction, threats in narrative context, or mental-health venting—because conversations could allegedly be escalated to authorities without a court order, user notice, appeal path, or defined evidentiary threshold. OP also compares this unfavorably with OpenAI/Mistral policies and raises UK GDPR/DBS concerns, but no direct policy-change link was provided in the post; a top commenter explicitly asked for the source URL. Top comments were strongly negative, framing the change as a major privacy regression and part of broader “enshittification”; one commenter said they would move back to Codex due to perceived high cost, restrictive behavior, and weakened privacy. Another commenter requested a link to verify the claimed new policy.

    • One commenter connected Anthropic’s policy change to broader AI-provider duty-of-care questions, citing a lawsuit against OpenAI/Sam Altman where families allege a mass shooter’s ChatGPT use had been flagged internally but not reported to police (BIV report). The implication is that providers may increasingly reserve rights to monitor/escalate user activity when internal safety systems identify severe risks.
    • Another commenter argued that Anthropic escalation may be justified for high-severity misuse, specifically linking Anthropic’s own biorisk red-team work (Anthropic Red Teaming: Biorisk). This frames the privacy-policy concern against concrete threat models such as AI assistance for biological harm, where user-content review or reporting could be positioned as a safety control.
  • Claude’s new usage limits are insane. (Activity: 1122): The screenshot (image) shows a Claude coding session on Opus 4.8 with 1M context consuming 1.1M tokens over ~12m 54s, leaving the user at 21% of a 5-hour limit after a single prompt. The post argues that combining Opus + 1M context + UltraCode can multiply token usage because multiple parallel agents may each read large context, making one request behave like many expensive calls rather than a single efficient inference pass. Commenters largely push back on the complaint, arguing this is expected behavior when using the most expensive model/context/agent mode combination—“crush an ant with an excavator” was the analogy. They emphasize that UltraCode is intentionally not token-efficient and should be reserved for narrow, high-value tasks rather than treated as a default “max thinking” mode.

    • Several commenters argued the high usage was expected because the user combined the most token-intensive settings: UltraCode, high “thinking” level, and large context. The technical takeaway was that UltraCode is not a token-efficient replacement for “Max thinking”; it is designed for a narrower class of tasks where much higher token burn and cost are acceptable.
    • A recurring point was that developers need to choose model/tool configurations based on task complexity and cost constraints. Commenters framed the issue as an optimization problem: using an overpowered coding mode for routine work will predictably exhaust limits, so workflows should reserve UltraCode-style modes for cases where the extra reasoning/context budget materially improves outcomes.

2. Mythos 5 and Ideogram 4.0 Creative Model Reports

  • Mythos 5: We’re Not Ready (Activity: 1348): A post claims Anthropic’s “Mythos 5” test model is unusually strong at SVG/code-based visual generation, frontend/UI creation, games, websites, and even code-generated music, with outputs sometimes taking minutes to produce. It also cites an alleged Anthropic internal result of up to 52× training-code optimization speedups versus ~4× for skilled humans, and expects the public release to be expensive and likely nerfed relative to the test model. Top comments were mostly skeptical or sarcastic: commenters questioned the “too dangerous SVG generation” framing, and the only claim one commenter found plausible was that any public model would be a downgraded/nerfed version; another objected to the expected higher cost.

    • A commenter doubts that the released model will match the internal test version, quoting the claim that “the public version will likely be a nerfed version of the current testing model.” The technical implication is that reported capability claims for Mythos 5 may not transfer to production if Anthropic applies post-training restrictions, capability gating, or safety/performance tradeoffs before public release.
    • One substantive suggestion is that if Mythos 5 is significantly more expensive to run, Anthropic may need to ship smaller, cheaper, domain-specialized models rather than relying on a single frontier generalist model. This reflects a common deployment tradeoff: specialized models can reduce inference cost and latency while preserving task performance in constrained domains.
  • Ideogram 4.0’s Understanding of Characters and IP is Crazy for an Open Model (Activity: 1081): The post reports strong zero-LoRA character/IP recall from Ideogram 4.0 run locally in ComfyUI using INT8 model variants at 1440×1024 (~1.5 MP), with Kijai’s Ideogram 4 Prompt Builder KJ node and SilverOxide’s workflow (Pastebin). The author also highlights Ideogram 4.0’s inpainting quality, optionally using ComfyUI-Inpaint-CropAndStitch, and shared a Mario/Sonic prompt JSON using structured fields like high_level_description, style_description, and bounding-box-based compositional_deconstruction. Commenters were notably surprised that the samples used no LoRAs, with one asking whether LoRA training for Ideogram 4.0 is already practical. Another commenter praised specific IP/detail handling, e.g. the “note from Link to Zelda.”

    • The OP reports that Ideogram 4.0 can reproduce recognizable character/IP concepts without LoRAs, calling it the strongest open model they have tried for this use case. Images were generated locally in ComfyUI at 1440x1024 (~1.5 MP) using the INT8 Ideogram 4.0 models, plus Kijai’s Ideogram 4 Prompt Builder KJ node and SilverOxide’s workflow (pastebin).
    • A technical workflow detail shared was the use of structured prompt JSON with fields for high_level_description, style_description, and compositional_deconstruction, including object-level bbox regions and descriptions. The example prompt explicitly placed Mario and Sonic with bounding boxes, gestures, facial expressions, and background franchise context, suggesting Ideogram 4.0 benefits from spatially decomposed prompting.
    • The OP also notes that Ideogram 4.0 performs well for inpainting, often enough that cleanup is unnecessary, but they use ComfyUI-Inpaint-CropAndStitch for masked face/detail fixes when needed (GitHub). This enables a practical workflow of generating at lower megapixels, then selectively inpainting problematic regions for higher fidelity.
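A structured prompt of the kind described above might be assembled like this. Only the three top-level field names come from the post; the nested keys (`objects`, `bbox`, `description`), the normalized `[x0, y0, x1, y1]` box convention, and the helper name are assumptions for illustration, and the schema the prompt-builder node actually expects may differ.

```python
import json


def build_prompt(high_level, style, objects):
    """Assemble a spatially decomposed prompt dict.

    Each object is a dict with a 'bbox' of normalized [x0, y0, x1, y1]
    coordinates and a free-text 'description' (both hypothetical keys).
    """
    for obj in objects:
        x0, y0, x1, y1 = obj["bbox"]
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"bbox out of range or inverted: {obj['bbox']}")
    return {
        "high_level_description": high_level,
        "style_description": style,
        "compositional_deconstruction": {"objects": list(objects)},
    }


if __name__ == "__main__":
    prompt = build_prompt(
        high_level="Two mascot characters greeting each other in a park",
        style="Bright, clean 3D render with soft lighting",
        objects=[
            {"bbox": [0.05, 0.20, 0.45, 0.95], "description": "left character, waving"},
            {"bbox": [0.55, 0.20, 0.95, 0.95], "description": "right character, thumbs up"},
        ],
    )
    print(json.dumps(prompt, indent=2))
```

Validating the boxes before serializing catches inverted or out-of-range regions early, which is cheaper than discovering them after a slow local generation run.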

AI Discords

Unfortunately, Discord shut down our access today. We will not bring it back in this form, but we will be shipping the new AINews soon. Thanks for reading this far; it was a good run.