a quiet day.
AI News for 5/13/2026-5/14/2026. We checked 12 subreddits and 544 Twitters; Discords are no longer checked (see the note at the end of this issue). AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!
AI Twitter Recap
Coding Agent Tooling: Codex Mobile, GitHub’s New App, VS Code Multi-Agent UX, and Hermes/Codex Interop
- OpenAI pushed Codex further into day-to-day workflows: the biggest product launch in this set was Codex in the ChatGPT mobile app, letting users start tasks, review outputs, approve commands, and steer execution remotely while Codex continues running on a laptop, Mac mini, or devbox. OpenAI also noted Remote SSH is now generally available for managed remote environments, and later added hooks plus programmatic access tokens for Business/Enterprise automation around the Codex loop (OpenAI, OpenAI follow-up, @OpenAIDevs on mobile workflow, @OpenAIDevs on Remote SSH, @OpenAIDevs on hooks/tokens). Separately, OpenAI published a technical writeup on the Windows sandbox for Codex, focused on the tradeoff between utility and constrained machine access for coding agents (OpenAI Devs, @gdb).
- The broader IDE/app ecosystem is converging on “agent-first” UX: GitHub announced a technical preview of the GitHub Copilot App, described as a desktop environment for parallel workstreams, repo/PR lifecycle management, and model flexibility (GitHub, @adrianmg, @OrenMe). VS Code shipped a new Agents window for multi-agent, multi-project workflows, browser/mobile support via vscode.dev/agents, BYOK improvements, and token-efficiency features like compressed terminal output (VS Code, remote/browser support, BYOK updates, terminal compression). On the open side, Nous/Hermes Agent added Codex runtime integration, effectively routing OpenAI-backed turns through Codex CLI/app-server and reusing ChatGPT subscription-backed execution in Hermes sessions (Nous Research, @Teknium, @HermesAgentTips). Kimi also shipped Kimi Web Bridge, a browser extension exposing human-like web interaction to Kimi Code CLI, Claude Code, Cursor, Codex, Hermes, and others (Moonshot AI).
Agent Infrastructure and Self-Improvement Loops: LangSmith Engine, SmithDB, Sandboxes, and Continual Learning
- LangChain’s launch stack was the most substantive agent-infra release cluster: SmithDB is a database purpose-built for agent trace data, while LangSmith Engine consumes traces, clusters failures, identifies likely code issues, and proposes fixes/evals—turning observability into an improvement loop rather than passive inspection (@hwchase17, @caspar_br on Engine, @bentannyhill); a sketch of that loop’s shape follows after this list. Community commentary emphasized SmithDB’s architectural shift toward object storage and a custom storage/query path for this workload shape (@caspar_br on SmithDB, @ngates_, Chinese summary).
- LangChain also announced LangChain Labs, an applied research effort around continual learning for agents, with the thesis that production traces should become training signal, evals, and targeted capability improvements over long horizons (LangChain, @jakebroekhuizen, @willccbb, Prime Intellect partnership).
- Execution isolation for agents continues to mature: W&B/CoreWeave launched CoreWeave Sandboxes for isolated execution in RL, tool use, and eval workloads, explicitly testing destructive commands like `rm -rf /` at scale (Weights & Biases). In a similar spirit, open-source/local dev tooling surfaced around agent debugging: @benhylak highlighted a free local agent debugging stack with traces exposed to Codex/Claude Code for automated eval authoring.
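The isolation pattern itself is easy to sketch. Below is a minimal, generic version using a throwaway Docker container (plain Docker flags, not the CoreWeave Sandboxes API), showing why destructive commands become safe to test:

```python
# Minimal sketch of the isolation pattern these products formalize: run an
# agent's shell command inside a throwaway, network-less container so even
# `rm -rf /` only destroys the sandbox. Generic Docker usage, not the
# CoreWeave Sandboxes API.
import subprocess

def run_sandboxed(command: str, image: str = "python:3.12-slim", timeout: int = 60):
    proc = subprocess.run(
        [
            "docker", "run",
            "--rm",               # delete the container afterwards
            "--network", "none",  # no outbound access
            "--memory", "512m",   # cap resources
            image, "sh", "-c", command,
        ],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr

# Destructive commands become observable experiments instead of incidents:
# run_sandboxed("rm -rf / ; echo survived")
```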
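As for the Engine-style loop flagged in the first bullet, here is a minimal sketch of its shape. Every helper name is hypothetical; this mirrors the workflow described in the announcements, not the LangSmith API:

```python
# Illustrative trace -> cluster -> fix loop. All names are hypothetical.
from collections import defaultdict

def improvement_loop(traces, propose_fix, run_evals):
    # 1. Keep only failed runs (errors, bad feedback scores, etc.)
    failures = [t for t in traces
                if t["status"] == "error" or t.get("score", 1.0) < 0.5]

    # 2. Cluster failures by a coarse signature. A real system would embed
    #    and cluster traces; here we just bucket on the failing tool/step.
    clusters = defaultdict(list)
    for t in failures:
        clusters[t["failing_step"]].append(t)

    # 3. For each cluster (biggest first), propose a code/prompt fix and
    #    gate it on evals built from the same traces, closing the loop.
    for step, examples in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
        fix = propose_fix(step, examples)   # e.g. an LLM-suggested patch
        if run_evals(fix, examples):        # regression set from traces
            yield step, fix
```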
Anthropic Claude Code Restrictions and the Developer Backlash
- The sharpest ecosystem reaction was to Anthropic restricting/reshaping Claude Code usage, especially for third-party wrappers and high-volume programmatic workflows. Theo’s thread became the focal point: he argued users of T3 Code were effectively hit with dramatic rate-limit reductions despite integrating through the officially supported path, and he subsequently cancelled his subscription while encouraging others to post cancellation screenshots for open-source donations (@theo initial thread, subscription cancellation, donation thread, T3 Code clarification). Other prominent builders echoed the complaint that Anthropic had effectively cut off open-source devs/apps and destabilized harnesses built around `claude -p` (@theo, @andersonbcdefg).
- There was also a more strategic counterargument: some users argued Anthropic does not owe developers heavily subsidized flat-fee tokens for third-party apps, and that the ecosystem will likely shift toward more explicit API economics and smarter routing between expensive and cheap models (Sentdex, @tadasayy). Still, the visible churn signal was nontrivial, including users estimating meaningful ARR loss from reply-thread cancellations alone (@thegenioo, Uncle Bob Martin, Theo later). For agent engineers, the practical takeaway is straightforward: subscription-backed harnesses are not stable platform primitives; provider/model abstraction and BYOK paths look increasingly mandatory (a minimal routing sketch follows below).
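On the provider/model abstraction and BYOK takeaway, a minimal routing sketch (all names hypothetical) of what that abstraction layer tends to look like:

```python
# Hedged sketch of the provider-abstraction pattern: wrap each vendor behind
# one interface and route/fail over explicitly, so a policy change at one
# provider doesn't strand the whole harness. All names are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Provider:
    name: str
    complete: Callable[[str], str]  # prompt -> text, however the vendor does it
    cost_per_mtok: float            # rough economics for routing decisions

def route(providers: list[Provider], prompt: str, cheap_ok: bool = False) -> str:
    # Prefer the cheapest provider for low-stakes turns; fall back in order
    # when one errors out or rate-limits.
    ordered = sorted(providers, key=lambda p: p.cost_per_mtok) if cheap_ok else providers
    last_err = None
    for p in ordered:
        try:
            return p.complete(prompt)
        except Exception as e:      # rate limit, auth revocation, outage...
            last_err = e
    raise RuntimeError(f"all providers failed: {last_err}")
```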
Robotics and Embodied AI: Figure’s 24/7 Sorting Stream and the Broader Automation Signal
- Figure’s livestream dominated robotics discussion. The company first showed 8 hours of fully autonomous, unsupervised work, then extended to a 24/7 livestream, eventually reporting 24+ hours of continuous autonomous operation without failure, around human-parity throughput on small package sorting, and operation by Helix-02 running entirely onboard with automatic resets for OOD cases—explicitly claiming no teleoperation (Figure CEO Brett Adcock, 24h update, detailed technical clarifications, Day 2 livestream). The repeated “Bob, Frank, and Gary” updates were fluffier, but the core signal was sustained autonomous operation at production-like uptime.
- Interpretation split between skepticism about Figure specifically and broader conviction about robotics acceleration. Some commenters argued that critics were underestimating what these demonstrations imply for near-term labor substitution, while others noted skepticism was directed more at Figure than at robotics as a category (@cloneofsimo, @iScienceLuvr, @kimmonismus). Either way, this was one of the clearest “continuous uptime” demos in the batch.
Research, Benchmarks, and Open Models: Diffusion LMs, Time-Series FMs, Mechanistic Interpretability, and RL/Search
- A few technically significant model/research releases stood out:
- Zyphra’s ZAYA1-8B-Diffusion-Preview claims a 4.6–7.7x decoding speedup versus autoregressive generation with limited quality loss, making the usual case that diffusion LMs enable cheaper rollouts and richer generation modes (Zyphra).
- Datadog’s Toto 2.0 released 5 open-weights time-series forecasting models from 4M to 2.5B params under Apache 2.0, claiming #1 on BOOM, GIFT-Eval, and TIME and, more importantly, presenting evidence that scaling laws may finally hold cleanly for TSFMs (Datadog, @atalwalkar, @ClementDelangue).
- Goodfire’s interpretability post argued that Llama uses a geometric “shape-rotating calculator” / Fourier-feature-like mechanism for arithmetic, with steering-based evidence rather than pure post-hoc description (GoodfireAI, follow-up); a toy illustration of the rotation idea follows after this list.
- On RL/search and optimizer-style progress, several threads were notable: a survey framing LLM RL as rollout engineering across Generate / Filter / Control / Replay rather than just PPO-vs-GRPO (The Turing Post); Pedagogical RL using privileged information to actively find useful rollouts (Souradip Chakraborty, @lateinteraction); and Prime Intellect’s autonomous optimizer search on the nanoGPT speedrun benchmark, where Opus 4.7 reached 2930 steps and GPT-5.5 2950, beating the 2990 human baseline after ~10k runs / ~14k H200 hours (Prime Intellect, @eliebakouch). Also noteworthy: Kimi K2.6 was reported as #1 open-weight model on Finance Agent Benchmark V2 (Moonshot AI), and Ring-2.6-1T got day-0 vLLM support as an open release (vLLM).
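To make the Goodfire “shape-rotating” picture concrete, here is a toy illustration (ours, not Goodfire’s code): encode an integer as (cos, sin) features at a few frequencies, and adding numbers becomes composing rotations via the angle-sum identities.

```python
# Toy Fourier-feature arithmetic: encode an integer n as (cos, sin) pairs at
# a few frequencies; addition of numbers is then exactly rotation (angle
# addition) in each 2D plane. Our own toy, not Goodfire's analysis code.
import numpy as np

FREQS = [2, 5, 100]  # periods; several coprime-ish periods pin down n (CRT-style)

def encode(n: int) -> np.ndarray:
    angles = [2 * np.pi * n / T for T in FREQS]
    return np.array([[np.cos(a), np.sin(a)] for a in angles])  # shape (F, 2)

def add(enc_a: np.ndarray, enc_b: np.ndarray) -> np.ndarray:
    # Rotate each of a's (cos, sin) pairs by b's angle: 2D rotation
    # composition, i.e. the angle-sum identities.
    out = np.empty_like(enc_a)
    out[:, 0] = enc_a[:, 0] * enc_b[:, 0] - enc_a[:, 1] * enc_b[:, 1]  # cos(x+y)
    out[:, 1] = enc_a[:, 0] * enc_b[:, 1] + enc_a[:, 1] * enc_b[:, 0]  # sin(x+y)
    return out

# encode(27 + 58) and add(encode(27), encode(58)) agree to float precision:
assert np.allclose(encode(27 + 58), add(encode(27), encode(58)))
```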
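And for the rollout-engineering framing in the last bullet, a minimal sketch of the Generate / Filter / Control / Replay stages (hypothetical scaffolding, not any particular framework’s API):

```python
# Minimal sketch of the Generate / Filter / Control / Replay decomposition of
# LLM RL rollout engineering. Hypothetical scaffolding to show the stages.
import random

def rl_step(policy, prompts, reward_fn, replay_buffer, k=8, keep_frac=0.5):
    # Generate: sample k rollouts per prompt from the current policy.
    rollouts = [(p, policy.sample(p)) for p in prompts for _ in range(k)]

    # Filter: score and keep only the informative fraction (e.g. drop
    # rollout groups that carry no gradient signal).
    scored = [(p, r, reward_fn(p, r)) for p, r in rollouts]
    scored.sort(key=lambda x: x[2], reverse=True)
    kept = scored[: int(len(scored) * keep_frac)]

    # Control: collect prompts whose rollouts all scored 0 so the caller can
    # re-queue them and steer future generation toward under-covered cases.
    hard = {p for p, _, s in scored if s == 0}

    # Replay: mix fresh rollouts with buffered past ones before the update.
    replay_buffer.extend(kept)
    batch = kept + random.sample(replay_buffer, min(len(replay_buffer), len(kept)))
    policy.update(batch)
    return hard
```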
Top Tweets (by engagement)
- OpenAI’s Codex mobile launch was the clearest product winner by engagement and practical relevance: remote control/review of running coding-agent sessions from ChatGPT mobile (OpenAI).
- Theo’s Claude Code backlash threads captured the strongest developer sentiment shift around platform risk and subscription-backed agent workflows (@theo, @theo donations thread).
- Figure’s autonomous humanoid sorting livestream remained one of the most discussed embodied-AI demos, especially once it crossed the 24-hour mark with detailed claims about onboard policy execution and no teleop (Brett Adcock).
- GitHub’s Copilot App and LangChain’s Engine/SmithDB/Labs were the most important non-OpenAI tooling launches for agent engineers this cycle (GitHub, LangChain, @hwchase17).
- Prime Intellect’s autonomous optimizer-search result is worth watching as a concrete example of coding agents being looped into open-ended ML optimization, not just app dev (Prime Intellect).
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
1. Qwen 3.6 Local Inference Speedups and Quantization
- Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant (Activity: 514): A patched llama.cpp fork adds Multi-Token Prediction (MTP) support for Qwen plus TurboQuant, reporting 21 tok/s → 34 tok/s on a MacBook Pro M5 Max 64GB with a claimed 90% MTP acceptance rate; note the raw speedup is ~62%, not 40%. Code is published at `AtomicBot-ai/atomic-llama-cpp-turboquant`, with GGUF MTP quantizations for Qwen 3.6 27B/35B in the `AtomicChat/qwen-36-udt-mtp` HF collection. Commenters questioned the TurboQuant framing, arguing it is often slower than `f16`, `q8`, or `q4`; one noted a TurboQuant PR to llama.cpp was rejected because existing Q4 KV-quant rotation support already covered most benefits, with gains mainly at Q3 where quality degradation becomes a concern. Others asked for quality/eval data, since higher speculative/MTP acceptance and tokens/s do not alone establish output parity.
  - Several commenters argued that TurboQuant is not generally faster in llama.cpp, with one noting it can be slower than `f16`, `q8`, or `q4`. A prior TurboQuant PR to llama.cpp was reportedly rejected because llama.cpp already implements rotations for `Q4` KV-cache quantization, where standard `Q4` was faster and showed little gain; TurboQuant may only help around `Q3`, but with notable quality degradation.
  - Users distinguished between speed, quality, and context tradeoffs: MTP without TurboQuant was suggested for speed, while standard `Q4_1` or `Q4_0` quantization was recommended for longer context/quality retention. One commenter questioned whether TurboQuant had any Mac-specific advantage, implying the benefit is hardware- or workload-dependent rather than broadly useful.
  - A commenter recommended using dflash instead of built-in MTP, claiming it is 30–40% faster. They also mentioned that a pull request for this already existed, suggesting the implementation work may duplicate prior llama.cpp integration efforts.
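For readers new to MTP, the quoted acceptance rate comes from a draft-and-verify mechanic similar to speculative decoding: predict several tokens cheaply, verify with the full model, and keep the matching prefix. A toy greedy sketch (toy models as plain functions, not the llama.cpp implementation):

```python
# Toy draft-and-verify decoding, the mechanic behind the quoted "90%
# acceptance rate". Greedy verification only; real MTP/speculative decoding
# uses proper rejection sampling. Not the llama.cpp implementation.
def speculative_decode(target_next, draft_next, prefix, k=4, max_new=64):
    """target_next/draft_next: fn(tokens) -> next token (greedy toy models)."""
    tokens = list(prefix)
    accepted = proposed = 0
    while len(tokens) - len(prefix) < max_new:
        # Draft k tokens with the cheap head/model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # Verify with the full model; keep the longest matching prefix.
        n_ok = 0
        for i, d in enumerate(draft):
            if target_next(tokens + draft[:i]) == d:
                n_ok += 1
            else:
                break
        proposed += k
        accepted += n_ok
        # Always gain at least one token: on mismatch, take the target's own.
        if n_ok == k:
            tokens += draft
        else:
            tokens += draft[:n_ok] + [target_next(tokens + draft[:n_ok])]
    return tokens, accepted / proposed  # acceptance rate drives the speedup
```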
- we really all are going to make it, aren’t we? 2x3090 setup. (Activity: 487): A dual RTX 3090 (48 GB VRAM total, no NVLink) setup running `club-3090` reportedly improved from WSL2 performance of ~30 tok/s generation and ~400 pp/s prompt processing to native Ubuntu at ~113 tok/s and ~4000 pp/s. The author says recent fixes for an “sse-session drop bug” and tool-calling made local workflows viable, with Qwen “3.6” 27B at 262k context feeling “almost-Sonnet level” for coding, monkey patches, and code review on consumer GPUs. Commenters frame this as evidence that local AI has crossed from demos into practical coding workloads, crediting faster runtimes, infrastructure, and small-model quality. There is cautious optimism that domain-specific frontier-class models may fit prosumer hardware within 1–2 years, while one user recommends avoiding dual boot and running a dedicated Ubuntu GPU server/API box instead.
  - Commenters noted a major capability jump in local inference: consumer dual-RTX 3090 setups are now being described as usable for near-Claude-Sonnet-level coding workflows, rather than just toy 7B summarization demos. The discussion attributes this to faster-than-expected gains in runtime/software optimization, smaller-model capability, and local inference infrastructure, with speculation that domain-specific frontier-quality models may fit on prosumer hardware within 1–2 years.
  - One user described running a 2x RTX 3090 Ubuntu box in a garage at 100% GPU utilization while serving API calls remotely, suggesting a practical local-server deployment pattern rather than desktop dual-boot usage. This highlights the shift from experimentation to always-on local inference infrastructure using commodity GPUs.
- I don’t get Quants, I’m running Qwen3.6-27b flawlessly at iq3, makes no sense (Activity: 325): The poster reports running a bartowski GGUF quant of a Qwen 27B dense coding-capable model at roughly `IQ3` quantization, fitting ~90k context in 16GB VRAM and generating around 30 tok/s, while still performing well on Godot/GDScript tasks. They observe little apparent degradation from low-bit quantization and hypothesize the strong results may come from the Pi harness plus Context7/ContextQMD retrieval/checking against current syntax, since the same model allegedly performs worse in other harnesses such as Opencode despite similar tool connections.
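Back-of-the-envelope on why this fits: IQ3-family quants land at roughly 3 to 3.7 bits per weight depending on variant (our approximation), so the weights alone occupy well under 16 GB:

```python
# Rough memory math for the setup above. The bits-per-weight figure is an
# approximation for IQ3-family quants (varies by variant, e.g. IQ3_XXS vs
# IQ3_M), so this is an illustrative estimate, not a measurement.
params = 27e9
bpw = 3.4                               # ~IQ3-class bits per weight
weights_gb = params * bpw / 8 / 1e9     # ≈ 11.5 GB of quantized weights
print(f"weights: {weights_gb:.1f} GB")  # leaves ~4.5 GB of a 16 GB card
# The remaining headroom must hold the KV cache, which is why a ~90k context
# typically also requires quantizing the cache (e.g. q8/q4 KV).
```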
2. Open-Source Local AI App and Voice Model Releases
- TextGen is now a native desktop app. Open-source alternative to LM Studio (formerly text-generation-webui). (Activity: 1092): oobabooga/textgen has been repackaged from the long-running `text-generation-webui` into a portable, no-install Electron desktop app for Windows/Linux/macOS, with self-contained `user_data` storage and release variants for CUDA, Vulkan, CPU-only, Apple Silicon/Intel macOS, and ROCm via the GitHub releases. The author positions it as a private, open-source LM Studio alternative, emphasizing zero outbound requests, `ik_llama.cpp` support with newer quantization formats like `IQ4_KS`/`IQ5_KS`, OpenAI/Anthropic-compatible APIs including Claude Code compatibility via `ANTHROPIC_BASE_URL=http://127.0.0.1:5000`, plus built-in web search, PDF extraction via `PyMuPDF`, `trafilatura` page cleanup, Jinja2 chat-template rendering, and tool calling through Python files or MCP servers; source is AGPLv3 at github.com/oobabooga/textgen. Top comments were largely positive and non-technical, focusing on excitement about a more private LM Studio competitor and recognition of oobabooga from the earlier `text-generation-webui` era.
  - One user highlighted that text-generation-webui/oobabooga helped them learn that most local LLM frontends ultimately expose or consume an OpenAI-compatible API, implying that frontend choice often comes down to UX, packaging, and local model/runtime integration rather than a fundamentally different serving abstraction.
  - A commenter reported the new desktop app working successfully with Gemma 4 31B, saying it was intuitive and sufficient for their workflow. They also noted they now prefer it over KoboldCPP, suggesting the app may be competitive for users who want a local desktop frontend rather than a web UI or standalone llama.cpp-style runner.
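The OpenAI-compatible API point generalizes to any client that accepts a base URL. A hedged sketch with the `openai` Python client; the `/v1` path and dummy key follow the common local-server convention and are our assumptions, not verified against textgen’s docs:

```python
# Pointing a standard OpenAI-style client at a local server. The exact
# endpoint path (/v1) and throwaway API key are assumptions based on the
# common local-server convention, not verified against textgen's docs.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:5000/v1",  # local textgen-style endpoint
    api_key="not-needed-locally",         # most local servers ignore the key
)

resp = client.chat.completions.create(
    model="local",  # local servers often ignore or map this name
    messages=[{"role": "user", "content": "Summarize why BYOK matters."}],
)
print(resp.choices[0].message.content)
```

For Claude Code specifically, the post’s documented route is the environment variable shown above, `ANTHROPIC_BASE_URL=http://127.0.0.1:5000`.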
- DramaBox - Most Expressive Voice model ever based on LTX 2.3 (Activity: 405): Resemble AI released DramaBox, an open expressive voice/TTS model based on LTX 2.3, with code on GitHub, weights on Hugging Face, and a hosted HF Space. The post positions it as a highly emotive voice model; commenters frame it as potentially useful for indie game voice acting and other character-dialogue workflows. Top comments are broadly positive on expressiveness—“actually sounds like a real person emotes”—but one technical critique says the model reaches roughly 95% speaker/character likeness while still feeling only around 60% in audio naturalness due to robotic or low-quality artifacts.
  - A commenter assessed the model as achieving roughly 95% voice likeness while still sounding only about 60% free of robotic/low-quality audio artifacts, implying that DramaBox/LTX 2.3 may have strong speaker similarity and expressiveness but still needs improvements in audio fidelity and naturalness.
  - Several comments framed the model as practically useful for indie game development, especially because it is described as an open model capable of more human-like emotional delivery than typical TTS/voice models.
  - One user referenced the creator’s earlier post and thanked them for releasing the code, suggesting the project has an ongoing public implementation rather than being only a demo.
3. Retrieval Bottlenecks for Local LLM Workflows
- Web-Search is coming to a screeching performance halt as Google shuts down their free search index, and traffic defenders like Cloudflare challenge AI at every gateway. What are our options? (Activity: 838): The post argues that AI-agent web search/retrieval pipelines are degrading as Google restricts free site-specific/custom search to 50 domains with a legacy cutoff of 2027-01-01, while Cloudflare defaults to challenging AI scrapers across customer sites, reportedly extending via a GoDaddy partnership. Commenters identify existing alternatives: decentralized YaCy, self-hosted meta-search SearXNG, Common Crawl for non-real-time bulk web data, Brave Search API with an independent index and 2,000 free queries/month, plus retrieval fallbacks such as the Wayback Machine, archive.today, and Jina Reader. The main debate is economic rather than purely technical: commenters expect a shift toward paid search because bot/API traffic does not monetize through ads—“how do you monetize searches when there’s no human eyes to land on advertising?” The likely near-term stack is viewed as paid or federated search APIs plus caching/reader services, not unlimited free Google-backed search.
  - Several commenters framed the problem as an infrastructure/economics shift: API-driven AI search has no ad impressions, so free high-volume access to commercial indexes is likely unsustainable. Suggested alternatives included SearXNG as a self-hosted metasearch layer over Bing/DuckDuckGo/Brave, Brave Search API with an independent index and a cited free tier of 2,000 queries/month, and Common Crawl for non-real-time use cases where petabyte-scale public crawl data can be indexed locally.
  - A technically important distinction was made between search and content retrieval: search APIs can still return URLs, but Cloudflare-style bot challenges mostly break the subsequent scraping/fetching step. Proposed mitigations included cached or archived sources such as the Wayback Machine API, Google Cache while available, archive.today, and reader/extraction services such as Jina Reader (`r.jina.ai`) that are designed to retrieve simplified page content (a fallback-chain sketch follows after this post).
  - One commenter pointed to YaCy (yacy.net, Wikipedia) as a long-running open-source P2P decentralized search engine, arguing that centralized indexes becoming paid or restricted could make distributed crawling/indexing more relevant. Another suggested a more radical variant: scrape content once, package it into distributable archives, and share it over P2P to reduce repeated bandwidth costs on origin sites.
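A minimal sketch of that search-vs-retrieval fallback chain: direct fetch, then the Jina Reader proxy, then a Wayback snapshot. The `r.jina.ai` and `archive.org/wayback/available` endpoints are real public services; the error handling here is simplified.

```python
# Fallback-chain sketch for the search-vs-retrieval split described above:
# direct fetch -> Jina Reader proxy -> Wayback Machine snapshot.
import requests

def fetch_with_fallbacks(url: str, timeout: int = 20) -> str | None:
    # 1. Direct fetch (fails on Cloudflare-style bot challenges).
    try:
        r = requests.get(url, timeout=timeout,
                         headers={"User-Agent": "research-bot"})
        if r.ok:
            return r.text
    except requests.RequestException:
        pass

    # 2. Reader proxy: r.jina.ai returns a simplified text rendering.
    try:
        r = requests.get(f"https://r.jina.ai/{url}", timeout=timeout)
        if r.ok:
            return r.text
    except requests.RequestException:
        pass

    # 3. Archived copy via the Wayback Machine availability API.
    try:
        r = requests.get("https://archive.org/wayback/available",
                         params={"url": url}, timeout=timeout)
        snap = r.json().get("archived_snapshots", {}).get("closest")
        if snap and snap.get("available"):
            return requests.get(snap["url"], timeout=timeout).text
    except requests.RequestException:
        pass
    return None
```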
- Anyone actually using a local LLM as their daily knowledge base? Not for coding, for life stuff. What’s your setup? (Activity: 719): The thread asks whether local LLMs are viable as a daily personal knowledge base over private notes/PDFs, with concerns around RAG reliability, quant/model choice, framework complexity, and context growth. The most concrete setup reported: M3 Max 36GB, Qwen3-32B via Ollama, bge-m3 embeddings, Obsidian as source of truth, Postgres + pgvector, and ~300 lines of custom Python instead of LlamaIndex; key implementation details were heading-based Markdown chunking with title/parent-heading prefixes, hybrid BM25 + dense retrieval with RRF, mandatory source citations/quotes, and nightly full reindexing of ~3000 notes in ~4 min. Another commenter described a non-KB but practical local-AI workflow using speech-to-text/translation, screenshot-to-vision translation, clipboard automation, TTS, and future document extraction for business task tracking, noting Whisper-class ASR and vision models were more reliable than older speech/OCR pipelines. The strongest technical opinion was that retrieval quality matters more than context length or model choice: “you do not need 200k context… you need the right 6 chunks in 8k context,” and large context is often used to mask poor retrieval. They also warned that mixing daily journals with reference notes degrades retrieval because emotional fragments surface during factual queries, recommending separate indexes routed at query time.
  - One commenter described an 8-month daily local RAG setup on a 36GB M3 Max using Qwen3 32B as the answer model, bge-m3 embeddings, Obsidian as source-of-truth, Postgres + pgvector for indexing, and Ollama for serving, with a hand-written Python retriever instead of LlamaIndex. Their main technical findings were that markdown-heading-based chunking with prepended document/parent-heading context dramatically improved recall, BM25 + dense hybrid retrieval with RRF fusion fixed proper-noun failures at a cost of about +50ms, and citations/quoted chunks were required to detect hallucinated claims (an RRF sketch follows after this post).
  - The same RAG user argued that very long context windows are often compensating for poor retrieval: “you do not need 200k context. you need to put the right 6 chunks in 8k context.” They rebuild the index nightly via cron in about 4 minutes for roughly 3000 Obsidian notes, and found that daily journals should be indexed separately from reference notes because emotionally phrased journal fragments polluted factual retrieval results.
  - Another commenter built a local-ish multilingual gaming assistant combining speech-to-text, vision translation, clipboard automation, and TTS: holding middle mouse records speech, translates it to Spanish, and copies it for game chat; a hotkey screenshots the chat region and sends it to an AI vision model for translation because OCR was unreliable. They specifically noted Whisper speech recognition was accurate enough that they had not noticed transcription errors, and they are extending similar document-ingestion ideas to scan staff task sheets, extract text, create database tasks, and generate summaries.
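Reciprocal Rank Fusion, the fusion step credited with fixing proper-noun failures, is small enough to show in full. This is the standard textbook formula (score = Σ 1/(k + rank), k = 60 by convention), not the poster’s code:

```python
# Minimal Reciprocal Rank Fusion: merge BM25 and dense-retrieval rankings by
# summing 1/(k + rank) across lists. Generic textbook RRF, not the
# commenter's actual retriever.
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 6) -> list[str]:
    """rankings: e.g. [bm25_doc_ids, dense_doc_ids], each best-first."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Example: a proper-noun query where BM25 and dense retrieval disagree.
bm25  = ["note_tax_2025", "note_passport", "journal_day3"]
dense = ["note_passport", "note_travel_checklist", "note_tax_2025"]
print(rrf_fuse([bm25, dense]))  # docs ranked well by both lists rise to the top
```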
Less Technical AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo
1. Claude SDK Credit Limits Backlash
- Anthropic just ripped off everyone and they still managed to make it sound deceptively friendly (Activity: 2761): The image is a screenshot of a ClaudeDevs/X announcement saying that, starting June 15, paid Claude plans can claim dedicated monthly credits for programmatic usage via the Claude Agent SDK, `claude -p`, Claude Code GitHub Actions, and third-party Agent SDK apps. In context, the Reddit post argues this is effectively a pricing/usage-limit nerf: programmatic Claude Code usage that previously benefited from opaque, heavily subsidized subscription limits is now routed through a fixed dollar credit allowance, allegedly reducing practical value from roughly “$2000 of tokens” to $200 for heavy SDK/CLI users. Commenters largely view the change as a disguised downgrade rather than a benefit, especially for autonomous `claude -p` workflows where credits may burn faster than interactive subscription usage. One user says this pushes them toward a “permanent local mode,” reflecting concern that Anthropic is making cloud-based coding-agent workflows less economical.
  - Commenters focused on the impact to Claude Code / `claude -p` autonomous programmatic usage, arguing that runs may now be constrained by a separate monthly credit pool rather than effectively sharing subscription access. One user noted that these credits appear to “burn faster than the sub usage,” which would materially affect agentic workflows that make many repeated CLI/API-style calls.
  - Several users highlighted ambiguity in Anthropic’s wording around a “dedicated monthly credit for programmatic usage”, specifically whether it changes normal Claude Code usage versus only autonomous or scripted usage. The concern is that unclear product/usage-billing boundaries make it hard to estimate costs or decide whether to migrate to local models sooner.
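For context on why programmatic usage burns credits differently: `claude -p` (named in the announcement above) runs Claude Code non-interactively, so scripts can trigger a full agent run per invocation. The batch loop below is our own illustration of the usage pattern being metered:

```python
# Why scripted `claude -p` usage burns credits quickly: each invocation is a
# complete non-interactive agent run, and loops multiply that. The -p flag is
# from the announcement above; this batch loop is our own illustration.
import pathlib
import subprocess

for path in pathlib.Path("src").rglob("*.py"):
    result = subprocess.run(
        ["claude", "-p", f"Review {path} for bugs and summarize in one line."],
        capture_output=True, text=True,
    )
    print(path, "->", result.stdout.strip()[:120])
# A 500-file repo means 500 full agent runs per sweep, which is exactly the
# usage pattern a fixed monthly credit pool constrains.
```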
- In Time (2011) was a documentary about Claude Pro users and nobody told us (Activity: 5292): The image is a non-technical meme linking the film In Time’s glowing life-clock mechanic to Claude Pro token/message limits, showing a forearm counter reading `Tokens Remaining: 125` (image). The post frames paid LLM usage caps as a productivity-era version of a life-or-death countdown, joking that “Justin Timberlake was just a guy trying to finish his PR before the window closed.” Comments mostly extend the joke, but one commenter makes a broader critique that AI companies exploit users’ data and collective human intelligence, arguing that high-quality human-generated real-world data is the scarce resource analogous to “time” in the movie.
  - One comment frames the “resource extraction” analogy around AI training data and human-generated intelligence, arguing that LLM progress depends on high-quality real-world human data rather than recursively training on LLM outputs. The commenter specifically claims that models “can’t get smarter by training on other LLMs” and that AI companies are effectively competing to exploit scarce, high-signal human-produced data.
2. AI Image Perception and Generation Glitches
- Twitter user posts a real Monet and says it’s AI (Activity: 3110): This is a non-technical meme/social experiment: the image is a screenshot of an X/Twitter post where a user labels what is allegedly a real Claude Monet water-lily painting as “AI-generated,” prompting replies that confidently identify supposed AI flaws such as poor depth, cohesion, brushwork, and lack of “feeling.” The contextual significance is about cognitive bias in AI art discourse—once viewers are told an image is AI, they may reverse-engineer criticisms even when the work is human-made. Image link Top comments frame it as a useful example of confirmation bias and ideology-driven perception, with users joking that “all of a sudden everyone’s an expert on impressionism” and warning not to show it to anti-AI communities.
- Bruh… (Activity: 2856): The image is a non-technical meme about an AI image-editing model failing a hand-vectorization request: it first generates a hand with extra fingers, then “corrects” it by changing the gesture into a raised middle finger. It illustrates a common generative-image failure mode around hand/finger topology and poor instruction-following during iterative edits. Image
AI Discords
Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.