a quiet day.
AI News for 5/12/2026-5/13/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!
AI Twitter Recap
Agent Infrastructure, Harnesses, and Developer Platforms
- Cline, LangChain, Notion, and Cursor all pushed deeper into agent platform territory: Cline open-sourced a rebuilt Cline SDK and refreshed CLI with a TUI, agent teams, scheduled jobs, and connectors, positioning its harness as a reusable substrate for custom coding agents. LangChain shipped a large batch of agent lifecycle infrastructure at Interrupt: LangSmith Engine, SmithDB, Sandboxes, Managed Deep Agents, LLM Gateway, Context Hub, and Deep Agents 0.6. The most technically notable piece is SmithDB, a purpose-built observability database for nested, long-running traces with large payloads, reportedly yielding 12–15× faster access on key workloads; the team says it is built atop Apache DataFusion and Vortex. In parallel, Notion's External Agents API lets third-party agents such as Claude, Codex, Cursor, Decagon, Warp, and Devin operate directly inside Notion as a shared, reviewable context layer rather than another silo. Cursor expanded cloud agents with fully configured development environments including cloned repos, dependencies, version history, rollback, scoped egress, and isolated secrets.
- Agent UX is increasingly about long-running state, streaming, and orchestration rather than chat: Several launches converged on the same design direction. Duet Agent proposes a state-machine harness for jobs that last weeks or months, with parent/sub-agent coordination and memory replacing compaction. LangChain's OSS updates added streaming typed projections, checkpoint storage, code interpreter, harness profiles, and model-specific tuning, all aimed at richer agent event streams than plain tokens. Tabracadabra moved from autocomplete to a context-aware assistant in any textbox, while VS Code introduced an Agents window and better multi-project task review. The architectural message across these releases is that production agents increasingly need durable execution, inspectable intermediate state, and tool-native UI surfaces rather than stateless prompt/response loops.
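The shared pattern in the launches above (durable execution with inspectable, checkpointed intermediate state rather than a stateless prompt/response loop) can be illustrated with a minimal sketch. All names here are hypothetical, not any vendor's API:

```python
import json

class DurableAgent:
    """Toy agent loop that persists state after every step, so a
    long-running job can be resumed or inspected mid-flight."""

    def __init__(self, steps, store=None):
        self.steps = steps  # ordered list of step functions
        self.store = store if store is not None else {}  # stand-in for a durable DB

    def run(self, state):
        # Resume from the last checkpoint if one exists.
        start = self.store.get("next_step", 0)
        for i in range(start, len(self.steps)):
            state = self.steps[i](state)
            # Checkpoint: inspectable intermediate state survives a crash.
            self.store["next_step"] = i + 1
            self.store["state"] = json.dumps(state)
        return state

# Usage: two steps of a toy job; the store could be handed to a fresh
# process to resume where the previous one stopped.
steps = [lambda s: {**s, "plan": "draft"}, lambda s: {**s, "result": "done"}]
agent = DurableAgent(steps)
final = agent.run({"task": "demo"})
```

The point of the sketch is only the shape: every intermediate state is serialized to an external store, so the harness (not the model) owns progress and recovery.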
Model Training, Architecture, and Data Efficiency
- Pretraining efficiency and architectural experimentation were the strongest research throughline: Nous Research's Token Superposition Training modifies the early phase of pretraining so the model reads/predicts contiguous bags of tokens before reverting to standard next-token prediction; they report 2–3× wall-clock speedup at matched FLOPs with no inference-time architecture change, validated from 270M to 3B dense and 10B-A1B MoE. Jonas Geiping et al. argued current message-based/chat training overly constrains agents to a single stream and released a multi-stream LLM paper claiming lower latency, cleaner separation of concerns, and more legible parallel reasoning/tool use; paper and code are linked here. δ-mem proposed an external online associative memory attached to a frozen full-attention backbone, with an 8×8 state reportedly improving average score by 1.10× and beating non-δ-mem baselines by 1.15×, with larger gains on memory-heavy benchmarks.
- Post-training/compression and data curation also produced notable results: NVIDIA's Star Elastic claims one post-training run can derive a family of reasoning model sizes, at 360× lower cost than pretraining a family and 7× better than SOTA compression. Datology's VLM work, highlighted by Siddharth Joshi and Pratyush Maini, argues data curation alone can produce major multimodal gains: +11.7 points across 20 public VLM benchmarks at 2B, beating InternVL3.5-2B by roughly 10 points at about 17× less training compute, and near-frontier 4B performance with 3.3× lower response FLOPs than Qwen3-VL-4B. On the open data side, Percy Liang said the next Marin run already has 18T tokens in its mix and is still seeking more pretraining, mid-training, and SFT data, with a companion token viewer shared here.
- Open evaluation and dataset work is maturing alongside model building: Kevin Li's SWE-ZERO-12M-trajectories is positioned as the largest open agentic trace dataset: 112B tokens, 12M trajectories, 122K PRs, 3K repos, 16 languages. Victor Mustar flagged llama-eval as a step toward more comparable llama.cpp community evals. Meanwhile, Steve Rabinovich and Sayash Kapoor argued credible agent evaluation requires log analysis, not outcome-only metrics, because stronger agents expose hidden benchmark bugs and reward-hacking paths.
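Nous Research's bag-of-tokens result above can be loosely illustrated in code. This is our interpretation of "reading/predicting contiguous bags of tokens", not their implementation: inputs become a superposition (here, a mean) of k contiguous token embeddings, and targets become a multi-hot bag of the next k tokens, so each training step covers k positions at once.

```python
import numpy as np

def superpose_inputs(token_ids, embed, k):
    """Average k contiguous token embeddings into one 'superposed' input.
    embed: (vocab, dim) embedding table; token_ids: 1-D int array."""
    vecs = embed[token_ids]                                    # (T, dim)
    T = (len(token_ids) // k) * k                              # drop the ragged tail
    return vecs[:T].reshape(-1, k, embed.shape[1]).mean(axis=1)  # (T//k, dim)

def bag_targets(token_ids, vocab, k):
    """Multi-hot target: which token ids appear in each bag of k."""
    T = (len(token_ids) // k) * k
    bags = np.asarray(token_ids[:T]).reshape(-1, k)
    out = np.zeros((bags.shape[0], vocab))
    for row, bag in enumerate(bags):
        out[row, bag] = 1.0
    return out

rng = np.random.default_rng(0)
embed = rng.normal(size=(100, 8))       # toy 100-token vocab, dim 8
ids = np.array([3, 7, 7, 42, 5, 9])
X = superpose_inputs(ids, embed, k=2)   # 3 superposed positions
Y = bag_targets(ids, vocab=100, k=2)    # 3 multi-hot bags
```

Under this reading, the speedup comes from the shorter effective sequence during the early phase; later in training k collapses to 1 and the objective reverts to ordinary next-token prediction, which is why no inference-time architecture change is needed.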
Enterprise AI Pricing, Platform Competition, and Distribution
- Anthropic vs OpenAI competition sharpened around enterprise distribution and developer lock-in: Ramp data cited by Andrew Curran showed Anthropic at 34.4% of businesses vs OpenAI at 32.3% in April, the first apparent lead change in business adoption; The Rundown amplified the same figures. At the same time, Anthropic changed plan economics: ClaudeDevs announced that paid Claude plans will get a dedicated monthly credit for programmatic usage across the Agent SDK, `claude -p`, GitHub Actions, and third-party SDK apps. This was immediately read by power users as a major restriction on subscription-subsidized harnesses, with criticism from Theo, Jeremy Howard, Matt Pocock, and Omar Sanseviero. Anthropic partially offset that backlash with a separate 50% increase in Claude Code weekly limits through July 13, stacked on the previously announced 2× 5-hour limit increase.
- OpenAI responded aggressively with Codex enterprise incentives: OpenAI Devs and Sam Altman offered two months of free Codex usage for enterprise customers switching in the next 30 days. OpenAI also published more technical platform detail, including a Windows sandbox design write-up describing the combination of local users, firewall rules, ACLs, write-restricted tokens, DPAPI, and helper executables needed to safely run coding agents with local filesystem/tool access. The competitive dynamic now looks less like "best model wins" and more like subsidy + workflow control + harness compatibility.
- Enterprise adoption is increasingly tied to runtime/security assurances: Perplexity described a hardware-isolated sandbox architecture with VPC-level separation, short-lived proxy tokens, and scanning of external content before agent actions, with additional details on encryption and auto-deletion. Aravind Srinivas framed this as foundational to Perplexity becoming an enterprise knowledge/research platform. The broader pattern: agent vendors are no longer selling only intelligence; they're selling bounded execution environments.
Autonomous Science, Cyber Capability, and Robotics
- Recursive self-improvement moved from idea to startup cluster: The largest single meta-theme was the launch of Recursive, founded to build AI that automates science and safely improves itself. Launch posts from Richard Socher, Josh Tobin, Dominik Schmidt, Jenny Zhang, and Shengran Hu suggest a team drawn from open-endedness, AI Scientist, and research automation work. In adjacent work, Adaption's AutoScientist aims to automate the full training-research loop outside frontier labs, with Sarah Hooker arguing that most model training failures are due to research-loop brittleness rather than mere compute scarcity.
- Cyber capability evaluations continue to steepen: The UK AI Security Institute said the length of cyber tasks frontier models can complete has been doubling every few months, and that recent models are beating prior trends. Anthropic/Glasswing's Logan Graham said Claude Mythos Preview is the first model to solve both AISI end-to-end cyber ranges, including Cooling Tower, and the only one to clear every task under the institute's 2.5M-token cap. XBOW reportedly found "token-for-token, unprecedented precision," and partner usage allegedly surfaced thousands of high/critical vulnerabilities in weeks. Independent commentary from scaling01 claimed a newer Mythos version completed a cyber range 6/10 times vs 3/10 for the preview baseline.
- Robotics got a concrete long-horizon deployment demo: Figure's Brett Adcock streamed humanoid robots running a full 8-hour autonomous shift on package sorting using Helix-02, with follow-up details that the robots reason from camera pixels, operate around human parity (~3s/package), perform on-device inference, coordinate as a networked fleet, autonomously swap for low battery, and self-diagnose/fail over to maintenance when needed here. This is one of the clearer public demonstrations of multi-robot, long-duration, no-human-in-the-loop orchestration rather than a short benchmark clip.
Top tweets (by engagement)
- Claude Code pricing and limits: @ClaudeDevs on 50% higher weekly limits, @ClaudeDevs on programmatic credits, and the ensuing developer backlash from @theo made pricing policy the day's most consequential developer story.
- Codex enterprise push: @sama offering two free months of Codex usage for switchers and @OpenAIDevs' enterprise call-to-action signaled an unusually direct go-to-market counterpunch.
- Figure's 8-hour humanoid shift: @adcock_brett's livestream post drew enormous attention and is one of the few viral posts in the set with clear technical substance.
- Cline SDK launch: @cline's SDK release was one of the highest-engagement genuinely technical launches, reflecting demand for open coding-agent harnesses.
- Token Superposition Training: @NousResearch's TST post stood out as a rare pretraining-method tweet that broke through widely, likely because the claim (2–3× training speedup without changing inference-time architecture) is concrete and economically important.
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
1. Efficient On-Device LLM Inference
-
Needle: We Distilled Gemini Tool Calling Into a 26M Model (Activity: 451): Cactus Compute open-sourced Needle, a 26M-parameter single-shot function/tool-calling model using a "Simple Attention Network" architecture (attention + gating with no FFNs/MLPs), arguing tool use is mainly retrieval/slot extraction/JSON assembly rather than deep reasoning. It was pretrained on 200B tokens over 16 TPU v6e in 27h, post-trained on 2B Gemini-synthesized function-calling tokens in 45m, claims 6000 tok/s prefill and 1200 tok/s decode on consumer devices, and reportedly beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling; code/weights are MIT-licensed on GitHub, Hugging Face, with architecture notes in the SAN writeup. Commenters framed Needle as potentially useful as a lightweight router that selects tools or dispatches queries to larger LLMs with parameters, while questioning whether the same no-FFN/cross-attention approach could generalize to summarization. One technical caution noted the repository apparently includes Python `pickle` files, which are discouraged due to code-execution/security risks and Python-specific portability issues.
- Several commenters focused on the architectural implication of a 26M distilled tool-calling model as a lightweight router: it could classify/route requests to the appropriate larger LLM, tool, or RAG pipeline with the right parameters, rather than generating full answers itself. One suggested this could be extended into a small post-trained model that consumes structured RAG output and verbalizes it in natural language.
- A technical point was raised around the claimed "no FFN" result: if external structured knowledge is always supplied via tools/RAG/retrieval, the model may not need FFN layers to store factual knowledge in weights. This implies a possible design pattern where compact attention-heavy models specialize in orchestration or grounding over provided context instead of memorization.
- One commenter noted that publishing pickle files is increasingly uncommon because of Python-specific dependency coupling and arbitrary-code-execution risks during deserialization. Another highlighted that Gemini has had visible tool-calling quirks, including system-prompt-level patches around tool specificity and avoiding inefficient file operations like `cat` in favor of dedicated tools such as `grep_search`, which could matter if Gemini-generated traces were used as distillation data.
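The "lightweight router" reading above can be sketched as a dispatch layer. In the sketch below the keyword matcher is a stand-in for the small tool-calling model, and all tool names and the JSON call format are hypothetical, not Needle's actual interface:

```python
import json

# Hypothetical tool registry a small router model would choose from.
# The handlers here are toy stubs standing in for larger LLMs, tools,
# or RAG pipelines.
TOOLS = {
    "get_weather": lambda city: f"weather({city})",
    "search_docs": lambda query: f"docs({query})",
}

def tiny_router(prompt):
    """Stand-in for a 26M-class single-shot tool-calling model: emit a
    JSON tool call (tool name + arguments) instead of a full answer."""
    if "weather" in prompt.lower():
        city = prompt.split()[-1].strip("?")
        return json.dumps({"tool": "get_weather", "args": {"city": city}})
    return json.dumps({"tool": "search_docs", "args": {"query": prompt}})

def dispatch(prompt):
    """Parse the router's structured output and invoke the chosen tool."""
    call = json.loads(tiny_router(prompt))
    return TOOLS[call["tool"]](**call["args"])

print(dispatch("What is the weather in Paris?"))  # weather(Paris)
```

The design point is that the router only has to emit a small structured object (tool name + slots), which is closer to slot extraction/JSON assembly than open-ended generation, consistent with the "no deep reasoning needed" argument in the post.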
-
I got a real transformer language model running locally on a stock Game Boy Color! (Activity: 1326): The image shows a stock Game Boy Color running a local transformer demo labeled `TINYSTORIES Q8 GBC`, matching the post's claim that Andrej Karpathy's TinyStories-260K was converted to INT8/fixed-point and executed directly on-device without PC, Wi-Fi, link cable, or cloud inference: image. The project uses GBDK-2020, an MBC5 Game Boy ROM, bank-switched cartridge ROM for weights, cartridge SRAM for the KV cache, and on-device tokenization/prompt entry; the author notes generation is extremely slow and mostly gibberish due to heavy quantization/approximation, but the transformer prefill + autoregressive loop works. Source code: github.com/maddiedreese/gbc-transformer. Comments are mostly impressed rather than technical, framing it as an impractical but compelling proof-of-concept, e.g. "Pointless. Therefore, indispensable." and interest in porting similar experiments to other retro hardware like the N64.
- A commenter references a related prior project, GBALM, linking to
https://code.heni.lol/heni/gbalm. The comment does not provide implementation details, but the link may be relevant for readers comparing other attempts at running language-model-like systems on Game Boy-class hardware.
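The INT8/fixed-point conversion mentioned above can be illustrated with symmetric per-tensor quantization; this is a generic scheme, not necessarily the one the project uses, but it shows why 8-bit weights fit in bank-switched cartridge ROM at the cost of precision:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map floats into [-127, 127]
    with a single scale factor stored alongside the int8 weights."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate floats; on real fixed-point hardware the
    scale would be folded into integer shifts/multiplies instead."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.01, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Reconstruction error is bounded by half a quantization step.
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

With a 260K-parameter model this cuts weight storage to ~260 KB of int8 values, which is why the weights can live in cartridge ROM banks; the "mostly gibberish" output the author reports is the flip side of the same precision loss.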
-
Solar Powered Qwen 3.6 Server (Activity: 449): A user reports running a local Qwen 27B GGUF model build from Unsloth, `UD-Q4_K_XL`, with 100k context on an M1 Max 32GB, achieving roughly ~10 tok/s. The inference server is powered by 3 × 100 W solar panels feeding an Anker 1.25 kW all-in-one power unit; observed power draw is ~80–85 W under inference load, sometimes dropping to ~30 W, with idle draw ≤5 W. The user says performance is "really good" in Hermes and opencode workflows. Commenters mainly highlighted the practicality of Apple Silicon for off-grid inference due to low power draw, with one noting that non-Mac solutions would drain batteries too quickly and that winter operation is challenging for fully off-grid setups, especially in northern climates.
- One technically relevant thread notes that an off-grid whole-house power setup constrains hardware choice: the commenter uses Macs because alternative server/GPU solutions would drain battery capacity too quickly. They also highlight seasonal reliability issues for solar/off-grid compute, saying winter near the Baltic is difficult enough that they plan to move to a hybrid power setup.
-
Stop wasting electricity (Activity: 1104): A user reports that running `llama.cpp` `llama-server` on an RTX 4090 with `Qwen3.6-27B-UD-Q4_K_XL.gguf`, `--flash-attn on`, `-ngl all`, `-ctk q4_0 -ctv q4_0`, and `-c 262144` remains GPU power-limit-bound under `nvidia-smi -pl N`, implying actual board power tracks the configured cap. Their observation is that reducing the GPU power limit can cut consumption to roughly 40% without materially hurting decode/token-generation throughput, while also reducing heat/noise; a commenter adds that prefill is more sensitive but reportedly only drops about 15–20% when reducing from 450 W to 270 W, depending on model. Commenters push for separating prefill/prompt-processing from decode benchmarks, since decode throughput may hide power-limit-induced regressions. Another user notes they already power-cap an RTX 5090 due to connector/thermal concerns and may lower the cap further based on these results.
- Users discussed GPU power limiting for local inference, specifically that reducing an RTX 5090 from 450 W to 270 W reportedly has little impact on decode/token generation (tg) throughput, while prefill (pp) performance drops more noticeably but only around 15–20% depending on the model. This suggests a potentially favorable efficiency tradeoff for inference workloads where decode dominates runtime.
- One commenter noted capping a 5090 due to concerns about connector or hardware overheating, while another mentioned heavily power-limiting 3090s to reduce noise for overnight operation. The technical implication is that aggressive power caps may materially improve thermals/acoustics and power efficiency without proportionally reducing LLM inference throughput, especially during decode-heavy workloads.
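The tradeoff discussed above is easy to quantify: if capping from 450 W to 270 W costs ~17.5% of prefill throughput and roughly nothing on decode, energy per token improves in both phases. A back-of-envelope check (throughput numbers are hypothetical; only the ratios from the thread matter):

```python
def tokens_per_joule(tok_per_s, watts):
    # tokens/second divided by joules/second = tokens per joule
    return tok_per_s / watts

# Hypothetical baseline rates; the thread only reports relative changes.
prefill_450 = tokens_per_joule(1000.0, 450.0)          # uncapped prefill
prefill_270 = tokens_per_joule(1000.0 * 0.825, 270.0)  # ~17.5% slower under the cap
decode_450 = tokens_per_joule(50.0, 450.0)
decode_270 = tokens_per_joule(50.0, 270.0)             # decode throughput ~unchanged

print(prefill_270 / prefill_450)  # prefill efficiency gain under the cap
print(decode_270 / decode_450)    # decode efficiency gain under the cap
```

Under these assumptions the cap yields roughly 1.37× more prefill tokens per joule and 1.67× more decode tokens per joule, which is why the thread frames power limiting as nearly free for decode-dominated local inference.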
2. Open-Source Local Agent Interfaces
-
TextGen is now a native desktop app. Open-source alternative to LM Studio (formerly text-generation-webui). (Activity: 795): oobabooga/TextGen has been refactored from `text-generation-webui` into a portable, no-install Electron desktop app for Windows/Linux/macOS, with self-contained `user_data` storage and builds for CUDA, Vulkan, CPU-only, ROCm, and Apple Silicon/Intel macOS via the GitHub releases. The app positions itself as an open-source LM Studio alternative with zero outbound requests, `ik_llama.cpp` support for newer quant types like `IQ4_KS`/`IQ5_KS`, built-in web search via `ddgs`, Python/HTTP/stdio MCP tool calling with approval gates, OpenAI/Anthropic-compatible APIs including Claude Code support, PDF extraction via `PyMuPDF`, web cleanup via `trafilatura`, and Jinja2 chat-template rendering; source is AGPLv3 at oobabooga/textgen. Top comments are mostly enthusiastic rather than technical, emphasizing recognition of oobabooga and demand for a more private, open alternative to LM Studio.
- A commenter framed the project as filling a gap for an open-source, private native desktop alternative to LM Studio, contrasting it with prior local LLM UX options that were often web UI-centric rather than packaged app workflows.
- One technical observation noted that after using text-generation-webui, they realized much of the local LLM ecosystem converges around an OpenAI-compatible API, implying that frontends and tooling can often be swapped as long as they target that API surface.
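The convergence on an OpenAI-compatible API noted above means a frontend only needs to swap the base URL to target a different local server. A minimal sketch of the request shape, following the OpenAI chat-completions convention (the server URL and model name are hypothetical):

```python
import json

def chat_request(base_url, model, user_message):
    """Build an OpenAI-compatible /v1/chat/completions request.
    Any server speaking this dialect (llama.cpp, TextGen, LM Studio, ...)
    accepts the same payload; only base_url changes between backends."""
    return {
        "url": base_url.rstrip("/") + "/v1/chat/completions",
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": user_message}],
        }),
    }

# Point the same payload at whichever local server is running.
req = chat_request("http://localhost:5000", "local-model", "hello")
print(req["url"])  # http://localhost:5000/v1/chat/completions
```

This is the interchangeability the commenter is pointing at: tooling written against this one surface can swap frontends and backends freely, which is also how TextGen's OpenAI-compatible API can serve clients built for other servers.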
-
Let's build claude code from scratch! (Activity: 462): The image is a technical terminal screenshot (not a meme) showing a custom CLI coding agent branded as "NANO CLAUDE" in `~/projects/nano-claude`, described as "Claude Code · from scratch" and prompting the user to enter a coding request. The post links a build-from-scratch tutorial video and GitHub repo for the implementation: YouTube, GitHub, and the screenshot is available here. Commenters mainly warned that using "Claude" in the project name may create trademark risk with Anthropic, citing prior renaming pressure around OpenClaw/Clawdbot. Others suggested similar tools already exist, such as `opencode`, or pointed to Pi as an alternative.
- One commenter argued that reimplementing a Claude Code-like agent is valuable for understanding the underlying agent/tool loop, since many users rely on these tools without understanding how model calls, tool invocation, and iterative execution are orchestrated under the hood.
- Another commenter pointed to opencode as an existing implementation in this space, implying that similar Claude Code-style coding agents already exist and may be useful as references before starting a from-scratch build.
Less Technical AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo
1. Real-World AI Agent Failure Modes
-
Inherited a 3-month old repo from a Vibe Engineer. Wrote the most satisfying PR in my career (Activity: 6187): The image shows an extreme PR diff of +10,197 additions and −3,618,778 deletions, contextualizing the post's claim that a 3-month-old backend repo produced via "agentic"/vibe coding had accumulated massive generated or unnecessary code, docs, logs, secrets, and unused handlers. The author says they rewrote the repo in a week with Claude, preserving functionality while replacing a bloated architecture (309k LOC, 240k docs, 1M+ markdown log lines, 220 handlers with only ~20 used, and 40+ secrets with only 2 needed) with a cleaner backend and integration tests. The comments shown are mostly non-technical jokes around the term "vibe engineer" and the irony of using AI-assisted coding to clean up an AI-generated codebase; there is no substantive technical debate in the provided top comments.
- Several commenters framed the repo as an example of AI/agent-generated technical debt, suggesting that "fixing vibe-coded mess" may become a lucrative maintenance niche as teams inherit code produced without conventional engineering discipline. The discussion also notes a credibility gap: praise for "agentic approaches" often comes from people who are not software professionals, implying that generated code may look impressive while still requiring significant human refactoring, deletion, and validation.
-
I made an AI concierge for my wedding guests. The second most popular thing they did with it was try to jailbreak it. (Activity: 2003): The image is an illustrated usage report for a custom wedding AI concierge ("Aido") built for a destination wedding in Mauritius, reportedly connected to wedding/travel info via an API/MCP server. It shows 719 sessions, 8,678 messages, and 29 users, with the largest categories being sincere logistics (35%) and jailbreak/hack attempts (25%), highlighting that even low-stakes private assistants attract adversarial prompting. Image: AI Concierge Report Card. Commenters found the project more interesting than a generic chatbot, but were surprised by the engagement volume (over 8,000 messages from only 29 users) and amused that jailbreak attempts were the second-largest use case.
- The OP described building a two-part system: first a wedding planning assistant for a destination wedding in Mauritius, then a guest-facing AI concierge connected to an API through an MCP server so it could retrieve event/travel information dynamically for users.
- One commenter highlighted the usage volume as notable for a small deployment: only 29 users generated over 8,000 messages, implying unusually high engagement and/or repeated probing such as jailbreak attempts.
- A privacy concern was raised around observability and message logging: a commenter asked whether guests were uncomfortable with the OP being able to read their interactions, which is relevant for any personal-event chatbot that stores or inspects user messages.
AI Discords
Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.