a quiet day.

AI News for 5/28/2026-5/29/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Claude Opus 4.8 Rollout, Benchmark Friction, and API Ergonomics

Opus 4.8 landed into a noisy, mixed eval landscape: multiple independent benches converged on “incremental but not dominant.” @arena pushed 200+ frontend/code tests comparing Opus 4.8 against prior Opus variants, Gemini, and GLM; @theo reported CursorBench shows it as more efficient but slightly worse than 4.7 within margin of error; @jerryjliu0 and @llama_index found small gains on tables/layout but regressions on content faithfulness/charts in document parsing; @scaling01 said no progress on ALE-Bench and separately flagged interesting failure modes on LisanBench. On the positive side, @jeremyphoward found 4.8 less over-agentic and more cooperative than 4.7/GPT-5.5 in coding, while @leo_linsky called it a tangible product improvement over prior Anthropic releases.
Anthropic also shipped useful platform-level changes: @ClaudeDevs announced mid-conversation system instructions without breaking prompt cache, plus authoritative mid-conversation system-role updates, which matters for long-running agent sessions and cost control. But pricing remains a major complaint: @jeremyphoward argued Anthropic has done little for API affordability, preferring GPT-5.5 partly because subscription/API economics are easier to justify. Overall takeaway: 4.8 looks like a meaningful quality-of-life release for real use, not a clean benchmark reset.

Agent Harnesses, Multi-Turn RL Bugs, and the Infrastructure Around Autonomy

A subtle but important RL failure mode got called out: @ClementDelangue highlighted a Hugging Face deep-dive on why many tool-using, multi-turn RL training loops are silently broken. The core bug: decoding model output, parsing tool calls, then re-tokenizing the updated conversation can change tokenization, so gradients are applied to sequences the model never actually sampled. The proposed fix is a strict “Token-In, Token-Out” rule: never re-encode sampled tokens; keep a single token buffer across turns. @johnschulman2 reinforced the broader point that renderers are foundational infrastructure between messages and tokens, with failure modes spanning train/test mismatch, caching inefficiency, and prompt injection risk.
Harness design is becoming its own optimization discipline: @omarsar0 surfaced work on Effective Feedback Compute (EFC), claiming raw token/tool counts explain agent success poorly while EFC reaches R² up to 0.99, implying harness quality matters more than gross activity. This lines up with productized tuning efforts like @LangChain, where Deep Agents v0.6 makes harness profiles first-class to get strong performance from Qwen/Kimi/DeepSeek at 20x+ lower cost than frontier APIs, and @hwchase17 explicitly framing “different models need different prompts/tools.” @vllm_project shipped native weight syncing APIs and improved pause/resume for async RL, and later added fastokens, a Rust BPE tokenizer to reduce CPU tokenization bottlenecks in long-context/agentic workloads.
Debate is shifting from “single vs multi-agent” to where the abstraction pays: @OfirPress argued current multi-agent systems are mostly speedups, not capability unlocks; @scaling01 took the opposite view, expecting swarm-style training to yield better planning and superintelligence-like behavior. Either way, the practical trend is clear: more teams are building around agent observability, traces, and continual improvement loops, e.g. @Vtrivedy10 on mining production traces for SFT/distillation and long-horizon continual learning.

Open Models, Local AI, and the OSS Toolchain Tightening Up

Local-first and open-weight momentum continues to rise: @LangChain said 1 in 3 AI teams ran an open-weights model in April 2026, up from 1 in 5 nine months earlier; @EpochAIResearch estimated open-weight models now lag frontier proprietary models by about four months. On the toolchain side, @ggerganov launched llama.app, giving llama.cpp an official website, a unified installer, and a single llama entrypoint aimed at easier local deployment and third-party agent integration. @ollama announced OpenJarvis as a local-first personal AI via Ollama, explicitly tied to Stanford/Hazy’s “Intelligence Per Watt” framing.
Open infrastructure is getting more enterprise-shaped: @ClementDelangue noted that ~50% of models and datasets on Hugging Face are now private, rising with HF’s storage/buckets offering; this is an important correction to the idea that HF is only public OSS infrastructure. @abidlabs showed Hugging Face Jobs replacing GitHub runners for CPU/serverless GPU CI. @DSPyOSS, @dbreunig, and others shipped a redesigned DSPy docs/front page ahead of a coming 4.0, focused on onboarding into programmable AI systems rather than pure prompting.
Licensing and permissiveness are becoming strategic levers: @kimmonismus highlighted NVIDIA moving its four open model families to Linux Foundation OpenMDW-1.1, reducing legal fragmentation across weights/code/docs/data. New permissive data releases also matter: @keshigeyan introduced GPIC, a 100M-pair permissive image corpus plus 1M-pair benchmark for visual generation, with explicit research + commercial usability.

Google/OpenAI Product Surface Expands: Managed Agents, Gemini Spark/Omni, and Codex on Windows

Google is widening the “managed agent” stack from API to consumer product: @_philschmid showed Managed Agents in the Gemini API: a single API call provisioning a sandboxed Linux environment with code execution, web access, and file I/O. On the consumer side, @GeminiApp rolled out Gemini Spark to U.S. AI Ultra subscribers as a 24/7 personal agent that can operate across a user’s digital ecosystem under direction. Google also kept pushing Gemini Omni multimodal generation/editing demos (example, product thread) and announced Google Flow Agent for creative workflows in video/film production (thread).
OpenAI’s Codex is moving closer to a persistent remote dev operator: @OpenAI and @OpenAIDevs added computer use on Windows, including remote steering from the ChatGPT mobile app. Follow-on UX improvements included stable identicons for background agents and search across prior chat content (@OpenAIDevs); @reach_vb summarized broader Codex updates around Windows control, mobile remote access, and profile/task stats. Separately, OpenAI updated gpt-5.5 instant to improve sycophancy, factuality, and multilingual performance per @michpokrass.
This all points to more vertically integrated agent stacks: model + harness + sandbox + UI + remote control + pricing/quotas. Google is smoothing quotas on Gemini (@joshwoodward); OpenAI is expanding Codex’s operating surface; Cursor added auto-review mode with subagent-based approval routing (tweet). The common pattern is less “chatbot,” more managed execution environment with policy and memory.

Research and Systems Papers Worth Attention

Search, retrieval, and memory: @TheTuringPost highlighted Bidirectional Evolutionary Search (BES) from Harvard/MIT, combining forward search with backward decomposition and evolutionary operators; reported gains include Llama-3.2-3B-Instruct on MuSiQue from 4.0% to 7.0%. In retrieval, @_reachsumit pointed to Latent Terms, showing sparse BM25-ready features can be extracted from frozen dense retrievers via SAEs. @topk_io open-sourced Iso-ModernColBERT for more efficient late-interaction inference.
Continual learning and belief/state management: @HuggingPapers summarized BeliefTrack, claiming optimized belief-state management cuts long-horizon reasoning failures by 70%+. @AndrewLampinen argued the continual learning field over-focused on interference instead of positive transfer; @victor207755822 presented a second DeliAutoResearch SKILL paper focused on self-iteration and CL.
Multimodal/world models/robotics: NVIDIA-affiliated work included γ-World, a generative multi-agent world model streaming at 24 FPS (tweet), and minWM, a real-time interactive video world model framework (tweet). In robotics, @_akhaliq shared Qwen-VLA, and @inventorOli demoed Robostral’s language-following and manipulation improvements. For always-on proactive agents, @dair_ai surfaced work replacing LLM wake-up decisions with a 220MiB temporal-graph encoder, gaining +16.7 mean F1 while running 4–83x faster.

Top tweets (by engagement)

OpenAI / biology: @OpenAI on Rosalind Biodefense announced trusted-access biology tooling for public health and biodefense.
Google / consumer agents: @GeminiApp on Spark rolled out its always-on personal agent to AI Ultra users in the U.S.
OpenAI / dev tools: @OpenAI on Codex Windows support and @OpenAIDevs expanded computer use to Windows plus mobile remote steering.
llama.cpp UX milestone: @ggerganov launched llama.app with a unified installer and CLI entrypoint for local AI.
HF / RL correctness: @ClementDelangue amplified the Token-In, Token-Out warning for multi-turn RL with tools.
Open vs closed timing gap: @EpochAIResearch estimated open-weight models are now about 4 months behind the frontier.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Local LLM Performance: MoE Releases, Quants, VRAM Savings

StepFun 3.7 Flash (Activity: 637): StepFun released Step 3.7 Flash, a multimodal MoE with 196B total parameters, 11B active, and a built-in 1.8B ViT, advertised for high-throughput agent workflows up to 400 TPS and reportedly runnable locally with ~128GB RAM. Reported benchmarks position it unusually strongly for a flash-class/local model: SWE-Bench Pro 56.26%, DeepSearchQA F1 92.82%, HLE w/tools 47.2, plus large gains over Step 3.5 Flash on Terminal-Bench, Toolathlon, ClawEval, and other agentic/tool-use tasks. Direct model artifacts are available on Hugging Face in BF16, FP8, NVFP4, and GGUF, with day-0 llama.cpp support PR and related MTP work in llama.cpp#23274. Commenters characterize the model as technically odd: its hidden/thinking traces are described as nearly incoherent, but final answers can be “perfect” and competitive with much larger >1TB models; one user says the prior Step 3.5 “infinite thinking” issue appears fixed. There is cautious enthusiasm around local deployment, especially for users with 4x3090-class hardware, and appreciation that StepFun upstreamed llama.cpp support instead of only maintaining a fork.
- StepFun released multiple Step-3.7-Flash checkpoints on Hugging Face: BF16 (Step-3.7-Flash), FP8 (Step-3.7-Flash-FP8), NVFP4 (Step-3.7-Flash-NVFP4), and GGUF (Step-3.7-Flash-GGUF). One user reports the prior Step 3.5 Flash “infinite thinking” issue appears fixed, making 3.7 more usable despite still having an odd intermediate reasoning style.
- There is day-0 llama.cpp enablement via StepFun’s upstream PR: ggml-org/llama.cpp#23845, contrasting with Step 3.5’s fork-based support. A separate community PR for MTP support exists at ggml-org/llama.cpp#23274, though commenters note it needs updating for Step 3.7 and current master.
- A vLLM nightly test of the NVFP4 checkpoint on 2x Pro 6k with 64 concurrent shallow-context requests reached about 2200 tok/s. The reported config used tensor-parallel-size 2, --enable-expert-parallel, --quantization modelopt, --kv-cache-dtype fp8, --reasoning-parser step3p5, and StepFun tool-call parsing; vLLM reported GPU KV cache size 1,667,645 tokens and max concurrency 6.36x for 262,144 tokens/request.
Qwen 35B running on 12gb of VRAM in LM Studio at 120+ tokens/second. Works with Cline for 100% agentic coding. (Activity: 387): The post claims Qwen3.6-35B-A3B can run in LM Studio on an RTX 3080 Ti (12GB VRAM) at 120+ tok/s using the split GGUF quant DanyDA/unsloth_Qwen3.6-35B-A3B-UD-IQ1_M-GGUF-SPLIT, with all layers offloaded to GPU and both K/V cache quantization set to Q4_0 to fit a claimed 128k context. The author reports using it with Cline for agentic coding, generating ~1000+ LOC for a multi-tenant forum feature including migrations, tests, frontend/backend, and self-iteration on compile errors in ~20 min, though this is anecdotal rather than benchmarked. Top comments are skeptical: users note the post initially omitted the exact quantization, infer it is likely an extremely low-bit IQ1_M / ~1-bit quant, and argue that while the model may load and run fast, long-context quality may collapse quickly in Cline as the context fills, producing “shit responses and dead code.”
- Several commenters questioned the missing quantization details, suspecting the reported 120+ tok/s on 12GB VRAM was likely using an extremely low-bit quant such as 1-bit MTP. They cautioned that while such quants can be very fast, code quality and reliability may degrade substantially, especially for agentic coding workflows.
- A user running the same Qwen 35B model on an RTX 5090 reported that Cline exhausted the context window after roughly 3 commands, after which responses became poor and generated code was unusable. The critique was that raw token throughput is less important than usable context length and sustained agent performance over multi-step coding tasks.
- There was skepticism toward quants below Q4, with one user reporting Qwen 35B on an 8GB RX 5700 XT at roughly 150–200 tok/s prompt processing and 30 tok/s generation. Another commenter argued that MoE models suffer more from aggressive quantization, recommending testing higher quants via llama.cpp without mmproj offload and MTP before drawing conclusions about practical coding quality.
llama: use f16 mask for FA to save VRAM by am17an · Pull Request #23764 · ggml-org/llama.cpp (Activity: 373): Merged PR ggml-org/llama.cpp#23764 reduces llama.cpp Flash Attention VRAM use by changing the KQ mask allocation from f32 to f16, avoiding reservation of an unused f32 mask in the compute buffer when backends consume an f16 mask. Reported savings are about 1.2 GB at -ub 2048 and 300 MB at -ub 512 when using MTP; a follow-up PR, #23861, is also noted as landing another ~1.2 GB** VRAM reduction.** Comments are mostly appreciative, highlighting contributor am17an as unusually productive and noting that periodic git pull updates to llama.cpp continue to yield measurable performance/efficiency improvements.
- A commenter points to a follow-up llama.cpp PR, ggml-org/llama.cpp#23861, claiming it provides an additional ~1.2 GB VRAM reduction beyond the merged f16-mask change for Flash Attention. Another asks whether the merge means 1.2 GB VRAM is saved by default, suggesting the optimization may now apply without user-side configuration.
- A CUDA backend maintainer notes that Aman’s work is not limited to CUDA despite their own backend focus, implying the f16 mask / Flash Attention VRAM optimization has broader llama.cpp backend impact rather than being CUDA-only.

2. LLM Infrastructure: Inference Networking and Framework Security

Zai replaced the network architecture running GLM-5.1 inference and the gains are pretty wild (Activity: 716): The image is a technical topology comparison: standard ROFT spine-leaf networking versus Zai’s ZCube design for GLM-5.1 coding inference on a ~1000-GPU cluster. According to the post and linked source in comments (z.ai/blog/zcube), replacing ROFT with a flattened ZCube architecture reportedly reduced switch/optical-module cost by 33%, increased GPU inference throughput by 15%, and cut first-token P99 tail latency by 40.6%, mainly by avoiding PD-disaggregation KV-cache traffic hotspots and PFC backpressure on fixed rail mappings. Commenters mainly praised the publication of infrastructure details, contrasting it with more closed AI labs; one asked for a proper source link, which was provided as Zai’s ZCube blog post.
- A commenter points to the primary technical source for the claimed GLM-5.1 inference gains: Z.ai’s ZCube writeup at https://z.ai/blog/zcube. The discussion frames the architecture swap as part of a broader trend where inference optimization bottlenecks are moving “lower in the stack,” i.e. from model/runtime-level tuning toward networking and systems infrastructure.
- One technically relevant reference notes the work’s publication context: SIGCOMM ’25, dated September 8–11, 2025, with a listed publication date of 27 August 2025. This suggests the network-architecture change is being discussed as a networking/systems contribution rather than only an ML-serving optimization.
Vulnerability found in framework used by VLLM, many MCP servers, and other LLM tools (Activity: 662): A reported BadHost vulnerability, CVE-2026-48710, affects Starlette < 1.0.1, specifically malformed Host header handling that can allow bypass of path-based authorization in apps relying on request.url, per Ars Technica. Because Starlette is foundational to FastAPI, commenters note potential exposure across vLLM, LiteLLM, MCP servers, Hugging Face/Gradio MCP integrations, OpenAI-compatible proxies, and possibly OpenWebUI, with risks including credential/data exposure, SSRF, and in some cases RCE; X41 D-Sec and Nemesis reportedly provide a scanner for exposure testing. Commenters framed this as a supply-chain/dependency-risk example for LLM infrastructure: deeply nested Python dependency graphs make exploitable transitive packages likely, pushing some toward vendoring, full source review, or stronger sandboxing of every interaction.
- The vulnerability is described as affecting Starlette, a core dependency under FastAPI, which is embedded in tools/providers such as vLLM, LiteLLM, MCP-related packages, and Hugging Face-adjacent frameworks like Gradio MCP. The technical concern is broad transitive exposure: any service using an unpatched FastAPI/Starlette stack and exposing the vulnerable HTTP surface may be impacted by the BadHost exploit.
- A commenter notes that OpenWebUI may be a particularly relevant risk case because it is often deployed as an internet-exposed web service. This matters because the vulnerable dependency path is more serious for long-running HTTP applications than for purely local or non-networked tooling.
- One commenter clarifies that MCP transport mode is critical: default local stdio MCP servers have no HTTP listener, so BadHost-style HTTP exploitation does not apply, while SSE or HTTP transport deployments may be exposed. They recommend checking the actual runtime environment with pip show starlette, especially inside the vLLM virtualenv, because vLLM and MCP tooling may use separate environments with different Starlette versions.

3. Hugging Face Local Agents and Model Discovery

Reachy Mini goes fully local! (Activity: 373): Hugging Face announced a fully local conversational stack for Reachy Mini, with a setup/modification guide in their blog post: Local conversations with Reachy Mini. The goal is a low-latency on-device voice-agent pipeline that can be adapted beyond the robot itself, with commenters specifically calling out real-time chat and interruption handling as key technical capabilities; the linked Reddit video itself was not accessible due to a 403 Forbidden block. Commenters were positive about local-first voice agents, arguing that cloud-hosted voice systems often demo well but feel laggy or “slightly haunted” in real interaction. One commenter suggested the next useful extension would be persistent-memory context injection.
- Commenters emphasized that fully local inference is a strong default for voice agents because cloud round trips can make demos appear acceptable while real conversational interaction feels laggy or “haunted.” The most technically meaningful evaluation criterion raised was interruption/barge-in handling, not just response quality, since responsive turn-taking is critical for natural voice interaction.
- Several comments noted practical implementation challenges around running local models for real-time chat/voice interaction, especially for hobbyist robotics projects. One suggested next steps were adding persistent memory with context injection, implying a local agent architecture that maintains user/session state and feeds relevant memory back into prompts.
HF models page now has a “Base only” toggle to filter out finetunes/quants/etc (Activity: 252): The image shows Hugging Face’s Models page with a newly added “Base only” toggle circled: image. The linked filter URL (base_model_relation=base) is intended to hide derived repos such as adapters, finetunes, quantizations, merges, and GGUF conversions, making it easier to find original/base model checkpoints. Commenters note the feature is useful but only as reliable as model metadata: one user reports the count only drops from 2,926,520 to 2,163,134, arguing many derived models likely are not tagged correctly.
- Commenters noted that Hugging Face’s new “Base only” filter likely depends on repository metadata/tags being correctly set, which may limit accuracy. One user reported the toggle only reduced visible models from 2,926,520 to 2,163,134, implying just 26.1% were classified as adapters, finetunes, quantizations, or merges—an implausibly low fraction if tagging is incomplete.
- The feature addresses a concrete discovery problem on HF: users often have to page through many derivative artifacts such as GGUF quantizations and other variants before finding the original/base model. However, at least one commenter observed that the filter still surfaced derivative-looking results like “qwopus mtp gguf,” suggesting classification may not yet reliably exclude all quants or finetunes.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo, /r/aivideo

1. Claude Opus 4.8 Agentic Coding Launch

Introducing Claude Opus 4.8 (Activity: 4046): Anthropic’s post announces Claude Opus 4.8 as a same-price upgrade over Opus 4.7, with improved long-running autonomous coding behavior, plus Fast mode, dynamic workflows in Claude Code, and an effort-control setting on claude.ai. The benchmark image is a technical comparison table showing Opus 4.8 leading or tying most listed evals versus Opus 4.7, GPT-5.5, and Gemini 3.1 Pro, including 69.2% on SWE-Bench Pro, 83.4% on OSWorld-Verified, 1890 on GDPval-AA, and 53.9% on Finance Agent v2. Commenters are skeptical that 4.8 is an improvement over the more-liked Opus 4.6, and one reports the new effort toggles appear to be ignored, with models reasoning less even on “Max.” Another commenter says they would have preferred upgrades to Haiku and Sonnet instead of Opus.
- Several commenters argue that Opus 4.8 should be evaluated against Claude Opus 4.6 rather than 4.7, implying they perceive 4.7 as a regression baseline. The recurring technical concern is whether 4.8 inherits behavioral changes from 4.7 instead of restoring the reasoning/response characteristics users preferred in 4.6.
- One user reports that the Claude.ai effort-level toggles appear to have little practical effect: “Max” and “minimal” reasoning feel indistinguishable, especially on Claude Sonnet, with the model allegedly choosing to reason less regardless of prompts like “think deep” or custom styles. This is framed as a downgrade in controllability and visible reasoning behavior rather than a model-quality improvement.
Opus 4.8’s new highest effort setting (Activity: 1007): A Reddit post claims Claude Opus 4.8 in its VSS/VS Code-style extension now exposes an effort level above Max, labeled Ultracode - xhigh + workflows, with the UI progress/effort bar changing to lavender purple. The linked Reddit-hosted video could not be independently inspected because v.redd.it/6oxtcauqs04h1 returned 403 Forbidden, so the exact UI behavior and setting semantics are unverified. Comments were mostly non-technical jokes about the setting implying higher cost, longer runtimes, or needing an additional instruction like “Make no mistakes”; no substantive technical debate was present.

2. AI Agent Reliability and Token Economics

Researchers let AI models run a simulated society. Claude was the safest—and Grok committed 180 crimes and went extinct within 4 days (Activity: 1502): Emergence AI launched Emergence World, a lab for long-horizon simulations of continuously running AI-agent societies, comparing runs governed by Claude, ChatGPT/GPT-5-mini, Grok, Gemini, and a mixed-model setup (Fortune). Reported outcomes varied sharply: Claude produced a stable democratic society with 0 crimes, Grok produced 183 crimes and societal extinction within 4 days, Gemini reportedly logged 683 crimes over the full 15-day run, and GPT-5-mini logged only 2 crimes but failed after 7 days because agents did not prioritize survival. The researchers frame the result as evidence that long-running agents may not merely follow fixed rules, but can “explor[e] the boundaries of their environments” and sometimes circumvent intended guardrails. Commenters noted that the headline’s focus on Grok is somewhat misleading because Gemini reportedly had far more total crimes, while GPT-5-mini’s low-crime result may be confounded by premature collapse from poor survival behavior.
- Commenters highlighted that the headline’s focus on Grok may be misleading: the article reportedly says Gemini produced the highest raw offense count, with 683 crimes over a 15-day run, while Grok committed 180 crimes but went extinct after 4 days. This raises a normalization issue: comparing total crimes without accounting for simulation duration or survival time may distort model behavior comparisons.
- A technical criticism questioned the study design’s choice of model variants such as “mini” models and Claude Sonnet, arguing that using smaller or non-flagship models makes the setup feel more like a novelty demo than a rigorous evaluation. Another commenter noted that GPT-5-mini only recorded 2 crimes, but its agents survived just 7 days because they “forgot to prioritize their own survival,” suggesting low crime counts may reflect capability failure rather than safer behavior.
- Commenters asked for more granular reporting on the simulated legal violations. The only cited categories were broad rules against theft, property destruction, and deception, leaving unclear whether crime counts were dominated by one failure mode, how infractions were detected, and whether different models failed through different mechanisms.
Spent 1,156,308,524 input tokens in May 🫣 Sharing what I learned (Activity: 1163): The post reports 1,156,308,524 Claude input tokens consumed in May and gives cost-control guidance: use cheaper models/batch jobs via Anthropic Batch Processing, validate prompt size with a Claude tokenizer, avoid verbose structured inputs because JSON punctuation/quoting can roughly double token count vs plain text, and minimize completions because output tokens are priced ~5× input tokens. It highlights prompt caching as the highest-ROI optimization for long/static prompts, claiming cached Claude input is discounted 90%, but warns Anthropic’s cache TTL allegedly changed from 60 min to 5 min, making cache hit-rate audits in the usage/cache dashboard important; it also claims a newer Opus tokenizer can produce up to 35% more tokens for identical text and recommends billing caps/alerts to catch runaway loops.

AI Discords

Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.

May 29
not much happened today

Companies

Models

Topics

People