a quiet day.
AI News for 5/28/2026-5/29/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!
AI Twitter Recap
Claude Opus 4.8 Rollout, Benchmark Friction, and API Ergonomics
- Opus 4.8 landed into a noisy, mixed eval landscape: multiple independent benches converged on “incremental but not dominant.” @arena pushed 200+ frontend/code tests comparing Opus 4.8 against prior Opus variants, Gemini, and GLM; @theo reported CursorBench shows it as more efficient but slightly worse than 4.7 within margin of error; @jerryjliu0 and @llama_index found small gains on tables/layout but regressions on content faithfulness/charts in document parsing; @scaling01 said no progress on ALE-Bench and separately flagged interesting failure modes on LisanBench. On the positive side, @jeremyphoward found 4.8 less over-agentic and more cooperative than 4.7/GPT-5.5 in coding, while @leo_linsky called it a tangible product improvement over prior Anthropic releases.
- Anthropic also shipped useful platform-level changes: @ClaudeDevs announced mid-conversation system instructions without breaking prompt cache, plus authoritative mid-conversation system-role updates, which matters for long-running agent sessions and cost control. But pricing remains a major complaint: @jeremyphoward argued Anthropic has done little for API affordability, preferring GPT-5.5 partly because subscription/API economics are easier to justify. Overall takeaway: 4.8 looks like a meaningful quality-of-life release for real use, not a clean benchmark reset.
Agent Harnesses, Multi-Turn RL Bugs, and the Infrastructure Around Autonomy
- A subtle but important RL failure mode got called out: @ClementDelangue highlighted a Hugging Face deep-dive on why many tool-using, multi-turn RL training loops are silently broken. The core bug: decoding model output, parsing tool calls, then re-tokenizing the updated conversation can change tokenization, so gradients are applied to sequences the model never actually sampled. The proposed fix is a strict “Token-In, Token-Out” rule: never re-encode sampled tokens; keep a single token buffer across turns. @johnschulman2 reinforced the broader point that renderers are foundational infrastructure between messages and tokens, with failure modes spanning train/test mismatch, caching inefficiency, and prompt injection risk.
- Harness design is becoming its own optimization discipline: @omarsar0 surfaced work on Effective Feedback Compute (EFC), claiming raw token/tool counts explain agent success poorly while EFC reaches R² up to 0.99, implying harness quality matters more than gross activity. This lines up with productized tuning efforts like @LangChain, where Deep Agents v0.6 makes harness profiles first-class to get strong performance from Qwen/Kimi/DeepSeek at 20x+ lower cost than frontier APIs, and @hwchase17 explicitly framing “different models need different prompts/tools.” @vllm_project shipped native weight syncing APIs and improved pause/resume for async RL, and later added fastokens, a Rust BPE tokenizer to reduce CPU tokenization bottlenecks in long-context/agentic workloads.
- Debate is shifting from “single vs multi-agent” to where the abstraction pays: @OfirPress argued current multi-agent systems are mostly speedups, not capability unlocks; @scaling01 took the opposite view, expecting swarm-style training to yield better planning and superintelligence-like behavior. Either way, the practical trend is clear: more teams are building around agent observability, traces, and continual improvement loops, e.g. @Vtrivedy10 on mining production traces for SFT/distillation and long-horizon continual learning.
Open Models, Local AI, and the OSS Toolchain Tightening Up
- Local-first and open-weight momentum continues to rise: @LangChain said 1 in 3 AI teams ran an open-weights model in April 2026, up from 1 in 5 nine months earlier; @EpochAIResearch estimated open-weight models now lag frontier proprietary models by about four months. On the toolchain side, @ggerganov launched llama.app, giving llama.cpp an official website, a unified installer, and a single
llamaentrypoint aimed at easier local deployment and third-party agent integration. @ollama announced OpenJarvis as a local-first personal AI via Ollama, explicitly tied to Stanford/Hazy’s “Intelligence Per Watt” framing. - Open infrastructure is getting more enterprise-shaped: @ClementDelangue noted that ~50% of models and datasets on Hugging Face are now private, rising with HF’s storage/buckets offering; this is an important correction to the idea that HF is only public OSS infrastructure. @abidlabs showed Hugging Face Jobs replacing GitHub runners for CPU/serverless GPU CI. @DSPyOSS, @dbreunig, and others shipped a redesigned DSPy docs/front page ahead of a coming 4.0, focused on onboarding into programmable AI systems rather than pure prompting.
- Licensing and permissiveness are becoming strategic levers: @kimmonismus highlighted NVIDIA moving its four open model families to Linux Foundation OpenMDW-1.1, reducing legal fragmentation across weights/code/docs/data. New permissive data releases also matter: @keshigeyan introduced GPIC, a 100M-pair permissive image corpus plus 1M-pair benchmark for visual generation, with explicit research + commercial usability.
Google/OpenAI Product Surface Expands: Managed Agents, Gemini Spark/Omni, and Codex on Windows
- Google is widening the “managed agent” stack from API to consumer product: @_philschmid showed Managed Agents in the Gemini API: a single API call provisioning a sandboxed Linux environment with code execution, web access, and file I/O. On the consumer side, @GeminiApp rolled out Gemini Spark to U.S. AI Ultra subscribers as a 24/7 personal agent that can operate across a user’s digital ecosystem under direction. Google also kept pushing Gemini Omni multimodal generation/editing demos (example, product thread) and announced Google Flow Agent for creative workflows in video/film production (thread).
- OpenAI’s Codex is moving closer to a persistent remote dev operator: @OpenAI and @OpenAIDevs added computer use on Windows, including remote steering from the ChatGPT mobile app. Follow-on UX improvements included stable identicons for background agents and search across prior chat content (@OpenAIDevs); @reach_vb summarized broader Codex updates around Windows control, mobile remote access, and profile/task stats. Separately, OpenAI updated gpt-5.5 instant to improve sycophancy, factuality, and multilingual performance per @michpokrass.
- This all points to more vertically integrated agent stacks: model + harness + sandbox + UI + remote control + pricing/quotas. Google is smoothing quotas on Gemini (@joshwoodward); OpenAI is expanding Codex’s operating surface; Cursor added auto-review mode with subagent-based approval routing (tweet). The common pattern is less “chatbot,” more managed execution environment with policy and memory.
Research and Systems Papers Worth Attention
- Search, retrieval, and memory: @TheTuringPost highlighted Bidirectional Evolutionary Search (BES) from Harvard/MIT, combining forward search with backward decomposition and evolutionary operators; reported gains include Llama-3.2-3B-Instruct on MuSiQue from 4.0% to 7.0%. In retrieval, @_reachsumit pointed to Latent Terms, showing sparse BM25-ready features can be extracted from frozen dense retrievers via SAEs. @topk_io open-sourced Iso-ModernColBERT for more efficient late-interaction inference.
- Continual learning and belief/state management: @HuggingPapers summarized BeliefTrack, claiming optimized belief-state management cuts long-horizon reasoning failures by 70%+. @AndrewLampinen argued the continual learning field over-focused on interference instead of positive transfer; @victor207755822 presented a second DeliAutoResearch SKILL paper focused on self-iteration and CL.
- Multimodal/world models/robotics: NVIDIA-affiliated work included γ-World, a generative multi-agent world model streaming at 24 FPS (tweet), and minWM, a real-time interactive video world model framework (tweet). In robotics, @_akhaliq shared Qwen-VLA, and @inventorOli demoed Robostral’s language-following and manipulation improvements. For always-on proactive agents, @dair_ai surfaced work replacing LLM wake-up decisions with a 220MiB temporal-graph encoder, gaining +16.7 mean F1 while running 4–83x faster.
Top tweets (by engagement)
- OpenAI / biology: @OpenAI on Rosalind Biodefense announced trusted-access biology tooling for public health and biodefense.
- Google / consumer agents: @GeminiApp on Spark rolled out its always-on personal agent to AI Ultra users in the U.S.
- OpenAI / dev tools: @OpenAI on Codex Windows support and @OpenAIDevs expanded computer use to Windows plus mobile remote steering.
- llama.cpp UX milestone: @ggerganov launched llama.app with a unified installer and CLI entrypoint for local AI.
- HF / RL correctness: @ClementDelangue amplified the Token-In, Token-Out warning for multi-turn RL with tools.
- Open vs closed timing gap: @EpochAIResearch estimated open-weight models are now about 4 months behind the frontier.
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
1. Local LLM Performance: MoE Releases, Quants, VRAM Savings
-
StepFun 3.7 Flash (Activity: 637): StepFun released Step 3.7 Flash, a multimodal MoE with
196Btotal parameters,11Bactive, and a built-in1.8BViT, advertised for high-throughput agent workflows up to400 TPSand reportedly runnable locally with ~128GBRAM. Reported benchmarks position it unusually strongly for a flash-class/local model: SWE-Bench Pro56.26%, DeepSearchQA F192.82%, HLE w/tools47.2, plus large gains over Step 3.5 Flash on Terminal-Bench, Toolathlon, ClawEval, and other agentic/tool-use tasks. Direct model artifacts are available on Hugging Face in BF16, FP8, NVFP4, and GGUF, with day-0llama.cppsupport PR and related MTP work inllama.cpp#23274. Commenters characterize the model as technically odd: its hidden/thinking traces are described as nearly incoherent, but final answers can be “perfect” and competitive with much larger>1TBmodels; one user says the prior Step 3.5 “infinite thinking” issue appears fixed. There is cautious enthusiasm around local deployment, especially for users with4x3090-class hardware, and appreciation that StepFun upstreamedllama.cppsupport instead of only maintaining a fork.- StepFun released multiple Step-3.7-Flash checkpoints on Hugging Face: BF16 (Step-3.7-Flash), FP8 (Step-3.7-Flash-FP8), NVFP4 (Step-3.7-Flash-NVFP4), and GGUF (Step-3.7-Flash-GGUF). One user reports the prior Step 3.5 Flash “infinite thinking” issue appears fixed, making 3.7 more usable despite still having an odd intermediate reasoning style.
- There is day-0
llama.cppenablement via StepFun’s upstream PR: ggml-org/llama.cpp#23845, contrasting with Step 3.5’s fork-based support. A separate community PR for MTP support exists at ggml-org/llama.cpp#23274, though commenters note it needs updating for Step 3.7 and currentmaster. - A vLLM nightly test of the NVFP4 checkpoint on
2x Pro 6kwith64concurrent shallow-context requests reached about2200 tok/s. The reported config usedtensor-parallel-size 2,--enable-expert-parallel,--quantization modelopt,--kv-cache-dtype fp8,--reasoning-parser step3p5, and StepFun tool-call parsing; vLLM reported GPU KV cache size1,667,645tokens and max concurrency6.36xfor262,144tokens/request.
-
Qwen 35B running on 12gb of VRAM in LM Studio at 120+ tokens/second. Works with Cline for 100% agentic coding. (Activity: 387): The post claims Qwen3.6-35B-A3B can run in LM Studio on an RTX 3080 Ti (
12GBVRAM) at120+ tok/susing the split GGUF quantDanyDA/unsloth_Qwen3.6-35B-A3B-UD-IQ1_M-GGUF-SPLIT, with all layers offloaded to GPU and both K/V cache quantization set toQ4_0to fit a claimed128kcontext. The author reports using it with Cline for agentic coding, generating ~1000+LOC for a multi-tenant forum feature including migrations, tests, frontend/backend, and self-iteration on compile errors in ~20 min, though this is anecdotal rather than benchmarked. Top comments are skeptical: users note the post initially omitted the exact quantization, infer it is likely an extremely low-bitIQ1_M/ ~1-bit quant, and argue that while the model may load and run fast, long-context quality may collapse quickly in Cline as the context fills, producing “shit responses and dead code.”- Several commenters questioned the missing quantization details, suspecting the reported
120+ tok/son12GB VRAMwas likely using an extremely low-bit quant such as 1-bit MTP. They cautioned that while such quants can be very fast, code quality and reliability may degrade substantially, especially for agentic coding workflows. - A user running the same Qwen 35B model on an RTX 5090 reported that Cline exhausted the context window after roughly
3commands, after which responses became poor and generated code was unusable. The critique was that raw token throughput is less important than usable context length and sustained agent performance over multi-step coding tasks. - There was skepticism toward quants below Q4, with one user reporting Qwen 35B on an
8GB RX 5700 XTat roughly150–200 tok/sprompt processing and30 tok/sgeneration. Another commenter argued that MoE models suffer more from aggressive quantization, recommending testing higher quants viallama.cppwithoutmmprojoffload and MTP before drawing conclusions about practical coding quality.
- Several commenters questioned the missing quantization details, suspecting the reported
-
llama: use f16 mask for FA to save VRAM by am17an · Pull Request #23764 · ggml-org/llama.cpp (Activity: 373): Merged PR ggml-org/llama.cpp#23764 reduces llama.cpp Flash Attention VRAM use by changing the KQ mask allocation from
f32tof16, avoiding reservation of an unusedf32mask in the compute buffer when backends consume anf16mask. Reported savings are about1.2 GBat-ub 2048and300 MBat-ub 512when using MTP; a follow-up PR, #23861, is also noted as landing another ~1.2 GB** VRAM reduction.** Comments are mostly appreciative, highlighting contributor am17an as unusually productive and noting that periodicgit pullupdates to llama.cpp continue to yield measurable performance/efficiency improvements.- A commenter points to a follow-up llama.cpp PR, ggml-org/llama.cpp#23861, claiming it provides an additional
~1.2 GBVRAM reduction beyond the merged f16-mask change for Flash Attention. Another asks whether the merge means1.2 GBVRAM is saved by default, suggesting the optimization may now apply without user-side configuration. - A CUDA backend maintainer notes that Aman’s work is not limited to CUDA despite their own backend focus, implying the f16 mask / Flash Attention VRAM optimization has broader llama.cpp backend impact rather than being CUDA-only.
- A commenter points to a follow-up llama.cpp PR, ggml-org/llama.cpp#23861, claiming it provides an additional
2. LLM Infrastructure: Inference Networking and Framework Security
-
Zai replaced the network architecture running GLM-5.1 inference and the gains are pretty wild (Activity: 716): The image is a technical topology comparison: standard ROFT spine-leaf networking versus Zai’s ZCube design for
GLM-5.1coding inference on a ~1000-GPU cluster. According to the post and linked source in comments (z.ai/blog/zcube), replacing ROFT with a flattened ZCube architecture reportedly reduced switch/optical-module cost by33%, increased GPU inference throughput by15%, and cut first-token P99 tail latency by40.6%, mainly by avoiding PD-disaggregation KV-cache traffic hotspots and PFC backpressure on fixed rail mappings. Commenters mainly praised the publication of infrastructure details, contrasting it with more closed AI labs; one asked for a proper source link, which was provided as Zai’s ZCube blog post.- A commenter points to the primary technical source for the claimed GLM-5.1 inference gains: Z.ai’s ZCube writeup at https://z.ai/blog/zcube. The discussion frames the architecture swap as part of a broader trend where inference optimization bottlenecks are moving “lower in the stack,” i.e. from model/runtime-level tuning toward networking and systems infrastructure.
- One technically relevant reference notes the work’s publication context: SIGCOMM ’25, dated
September 8–11, 2025, with a listed publication date of27 August 2025. This suggests the network-architecture change is being discussed as a networking/systems contribution rather than only an ML-serving optimization.
-
Vulnerability found in framework used by VLLM, many MCP servers, and other LLM tools (Activity: 662): A reported BadHost vulnerability, CVE-2026-48710, affects Starlette <
1.0.1, specifically malformedHostheader handling that can allow bypass of path-based authorization in apps relying onrequest.url, per Ars Technica. Because Starlette is foundational to FastAPI, commenters note potential exposure across vLLM, LiteLLM, MCP servers, Hugging Face/Gradio MCP integrations, OpenAI-compatible proxies, and possibly OpenWebUI, with risks including credential/data exposure, SSRF, and in some cases RCE; X41 D-Sec and Nemesis reportedly provide a scanner for exposure testing. Commenters framed this as a supply-chain/dependency-risk example for LLM infrastructure: deeply nested Python dependency graphs make exploitable transitive packages likely, pushing some toward vendoring, full source review, or stronger sandboxing of every interaction.- The vulnerability is described as affecting Starlette, a core dependency under FastAPI, which is embedded in tools/providers such as vLLM, LiteLLM, MCP-related packages, and Hugging Face-adjacent frameworks like Gradio MCP. The technical concern is broad transitive exposure: any service using an unpatched FastAPI/Starlette stack and exposing the vulnerable HTTP surface may be impacted by the BadHost exploit.
- A commenter notes that OpenWebUI may be a particularly relevant risk case because it is often deployed as an internet-exposed web service. This matters because the vulnerable dependency path is more serious for long-running HTTP applications than for purely local or non-networked tooling.
- One commenter clarifies that MCP transport mode is critical: default local
stdioMCP servers have no HTTP listener, so BadHost-style HTTP exploitation does not apply, while SSE or HTTP transport deployments may be exposed. They recommend checking the actual runtime environment withpip show starlette, especially inside the vLLM virtualenv, because vLLM and MCP tooling may use separate environments with different Starlette versions.
3. Hugging Face Local Agents and Model Discovery
-
Reachy Mini goes fully local! (Activity: 373): Hugging Face announced a fully local conversational stack for Reachy Mini, with a setup/modification guide in their blog post: Local conversations with Reachy Mini. The goal is a low-latency on-device voice-agent pipeline that can be adapted beyond the robot itself, with commenters specifically calling out real-time chat and interruption handling as key technical capabilities; the linked Reddit video itself was not accessible due to a
403 Forbiddenblock. Commenters were positive about local-first voice agents, arguing that cloud-hosted voice systems often demo well but feel laggy or “slightly haunted” in real interaction. One commenter suggested the next useful extension would be persistent-memory context injection.- Commenters emphasized that fully local inference is a strong default for voice agents because cloud round trips can make demos appear acceptable while real conversational interaction feels laggy or “haunted.” The most technically meaningful evaluation criterion raised was interruption/barge-in handling, not just response quality, since responsive turn-taking is critical for natural voice interaction.
- Several comments noted practical implementation challenges around running local models for real-time chat/voice interaction, especially for hobbyist robotics projects. One suggested next steps were adding persistent memory with context injection, implying a local agent architecture that maintains user/session state and feeds relevant memory back into prompts.
-
HF models page now has a “Base only” toggle to filter out finetunes/quants/etc (Activity: 252): The image shows Hugging Face’s Models page with a newly added “Base only” toggle circled: image. The linked filter URL (
base_model_relation=base) is intended to hide derived repos such as adapters, finetunes, quantizations, merges, and GGUF conversions, making it easier to find original/base model checkpoints. Commenters note the feature is useful but only as reliable as model metadata: one user reports the count only drops from2,926,520to2,163,134, arguing many derived models likely are not tagged correctly.- Commenters noted that Hugging Face’s new “Base only” filter likely depends on repository metadata/tags being correctly set, which may limit accuracy. One user reported the toggle only reduced visible models from
2,926,520to2,163,134, implying just26.1%were classified as adapters, finetunes, quantizations, or merges—an implausibly low fraction if tagging is incomplete. - The feature addresses a concrete discovery problem on HF: users often have to page through many derivative artifacts such as
GGUFquantizations and other variants before finding the original/base model. However, at least one commenter observed that the filter still surfaced derivative-looking results like “qwopus mtp gguf,” suggesting classification may not yet reliably exclude all quants or finetunes.
- Commenters noted that Hugging Face’s new “Base only” filter likely depends on repository metadata/tags being correctly set, which may limit accuracy. One user reported the toggle only reduced visible models from
Less Technical AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo, /r/aivideo
1. Claude Opus 4.8 Agentic Coding Launch
-
Introducing Claude Opus 4.8 (Activity: 4046): Anthropic’s post announces Claude Opus 4.8 as a same-price upgrade over Opus 4.7, with improved long-running autonomous coding behavior, plus Fast mode, dynamic workflows in Claude Code, and an effort-control setting on claude.ai. The benchmark image is a technical comparison table showing Opus 4.8 leading or tying most listed evals versus Opus 4.7, GPT-5.5, and Gemini 3.1 Pro, including
69.2%on SWE-Bench Pro,83.4%on OSWorld-Verified,1890on GDPval-AA, and53.9%on Finance Agent v2. Commenters are skeptical that 4.8 is an improvement over the more-liked Opus 4.6, and one reports the new effort toggles appear to be ignored, with models reasoning less even on “Max.” Another commenter says they would have preferred upgrades to Haiku and Sonnet instead of Opus.- Several commenters argue that Opus 4.8 should be evaluated against Claude Opus 4.6 rather than 4.7, implying they perceive 4.7 as a regression baseline. The recurring technical concern is whether 4.8 inherits behavioral changes from 4.7 instead of restoring the reasoning/response characteristics users preferred in 4.6.
- One user reports that the Claude.ai effort-level toggles appear to have little practical effect: “Max” and “minimal” reasoning feel indistinguishable, especially on Claude Sonnet, with the model allegedly choosing to reason less regardless of prompts like “think deep” or custom styles. This is framed as a downgrade in controllability and visible reasoning behavior rather than a model-quality improvement.
-
Opus 4.8’s new highest effort setting (Activity: 1007): A Reddit post claims Claude Opus 4.8 in its VSS/VS Code-style extension now exposes an effort level above
Max, labeledUltracode - xhigh + workflows, with the UI progress/effort bar changing to lavender purple. The linked Reddit-hosted video could not be independently inspected becausev.redd.it/6oxtcauqs04h1returned 403 Forbidden, so the exact UI behavior and setting semantics are unverified. Comments were mostly non-technical jokes about the setting implying higher cost, longer runtimes, or needing an additional instruction like “Make no mistakes”; no substantive technical debate was present.
2. AI Agent Reliability and Token Economics
-
Researchers let AI models run a simulated society. Claude was the safest—and Grok committed 180 crimes and went extinct within 4 days (Activity: 1502): Emergence AI launched Emergence World, a lab for long-horizon simulations of continuously running AI-agent societies, comparing runs governed by Claude, ChatGPT/GPT-5-mini, Grok, Gemini, and a mixed-model setup (Fortune). Reported outcomes varied sharply: Claude produced a stable democratic society with
0crimes, Grok produced183crimes and societal extinction within4days, Gemini reportedly logged683crimes over the full15-day run, and GPT-5-mini logged only2crimes but failed after7days because agents did not prioritize survival. The researchers frame the result as evidence that long-running agents may not merely follow fixed rules, but can “explor[e] the boundaries of their environments” and sometimes circumvent intended guardrails. Commenters noted that the headline’s focus on Grok is somewhat misleading because Gemini reportedly had far more total crimes, while GPT-5-mini’s low-crime result may be confounded by premature collapse from poor survival behavior.- Commenters highlighted that the headline’s focus on Grok may be misleading: the article reportedly says Gemini produced the highest raw offense count, with
683crimes over a15-dayrun, while Grok committed180crimes but went extinct after4 days. This raises a normalization issue: comparing total crimes without accounting for simulation duration or survival time may distort model behavior comparisons. - A technical criticism questioned the study design’s choice of model variants such as “mini” models and Claude Sonnet, arguing that using smaller or non-flagship models makes the setup feel more like a novelty demo than a rigorous evaluation. Another commenter noted that GPT-5-mini only recorded
2crimes, but its agents survived just7 daysbecause they “forgot to prioritize their own survival,” suggesting low crime counts may reflect capability failure rather than safer behavior. - Commenters asked for more granular reporting on the simulated legal violations. The only cited categories were broad rules against theft, property destruction, and deception, leaving unclear whether crime counts were dominated by one failure mode, how infractions were detected, and whether different models failed through different mechanisms.
- Commenters highlighted that the headline’s focus on Grok may be misleading: the article reportedly says Gemini produced the highest raw offense count, with
-
Spent 1,156,308,524 input tokens in May 🫣 Sharing what I learned (Activity: 1163): The post reports
1,156,308,524Claude input tokens consumed in May and gives cost-control guidance: use cheaper models/batch jobs via Anthropic Batch Processing, validate prompt size with a Claude tokenizer, avoid verbose structured inputs because JSON punctuation/quoting can roughly double token count vs plain text, and minimize completions because output tokens are priced ~5×input tokens. It highlights prompt caching as the highest-ROI optimization for long/static prompts, claiming cached Claude input is discounted90%, but warns Anthropic’s cache TTL allegedly changed from60 minto5 min, making cache hit-rate audits in the usage/cache dashboard important; it also claims a newer Opus tokenizer can produce up to35%more tokens for identical text and recommends billing caps/alerts to catch runaway loops.
AI Discords
Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.