Open models are all you need?

AI News for 9/5/2025-9/6/2025. We checked 12 subreddits, 544 Twitters and 22 Discords (186 channels, and 3961 messages) for you. Estimated reading time saved (at 200wpm): 324 minutes. Our new website is now up with full metadata search and beautiful vibe coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on @smol_ai!

In July, we last commented on Kimi K2 being the largest SOTA OSS open model to be released, and today Moonshot AI updated their model weights again and released new benchmarks in their paper:

The big new entrant though is Qwen 3 Max, releasing a 1T param model for the first time, obviously beating its smaller siblings. They declined to release hparams, instead calling it “Max”, but it still seems that the model weights will be released in short order so it’s unclear why exactly they are breaking their own MoE naming schema.

China is overwhelmingly winning the open model war, it seems.

AI Twitter Recap

China’s long‑context coding surge: Kimi K2‑0905 and Qwen3‑Max preview

Moonshot’s Kimi K2‑0905 (open weights) ships a practical agents upgrade: Kimi doubled context to 256k, improved coding and tool‑calling, and tuned integration with agent scaffolds (Cline, Claude Code, Roo). It’s already live on multiple stacks: Hugging Face weights/code, Together AI, vLLM deployment guide, LMSYS SGLang runtime (60–100+ TPS), Groq instant inference (200+ T/s, $1.50/M tokens), and Cline integration. Community reports emphasize that “agents really need ultra‑long context” for stability and tool orchestration (Teknium). Claims of “meets or beats Sonnet 4” surfaced in demos, while Kimi engineers acknowledged SWE‑Bench remains challenging (@andrew_n_carr, @bigeagle_xd).
Qwen3‑Max‑Preview (Instruct): 1T‑parameter scale, agent‑oriented behavior: Alibaba introduced its largest model yet (over 1T parameters), available via Qwen Chat, Alibaba Cloud API, and now OpenRouter (announcement, OpenRouter). Benchmarks and early users point to stronger conversations, instruction following, and agentic tasks relative to prior Qwen3 models. Community reaction frames it as a “US‑grade frontier model” with competitive pricing and throughput (reaction, scale tease). Details on dense vs MoE remain unspecified in public channels.

Evals, agents, and what to measure

“No evals” vs “evals that matter”: A widely‑shared thread argues many top code‑agent teams ship without formal evals, while vendors evangelize them; the nuance is that early 0→1 success often comes from dogfooding + error analysis before codifying evals (@swyx, receipts). Follow‑ons advocate for richer, causal evals of long‑horizon capability (e.g., months‑long tasks, protocol replication, strategy games, real‑world setups) and domain‑specific enterprise workflows that today’s leaderboards miss (@willdepue, ideas, @levie, @BEBischof). A pragmatic tip: use models as discriminators to rank outputs—generator/discriminator gaps can be leveraged in practice (@karpathy).
Operationalizing evals and traces in agent stacks: CLI‑first agents plus semantic search can outperform ad‑hoc RAG for document tasks; LlamaIndex shows SemTools handling 1,000 arXiv papers with UNIX tooling + fuzzy semantic search (post). For RL pipelines, THUDM’s slime provides a clean rollout abstraction integrating tool calls and state transitions, reducing glue code in agentic RL experiments (overview).

Inference and post‑training advances

Decoding and planning: Meta’s Set Block Decoding (SBD) samples multiple future tokens in parallel, cutting forward passes 3–5× with no architecture changes and KV‑cache compatibility; trained models match standard NTP performance on next‑token prediction (summary). For agents, “always reasoning” (ReAct) isn’t optimal—new work trains models to learn when to plan, dynamically allocating test‑time compute to balance cost and performance (thread, paper context).
Post‑training theory and results: “RL’s Razor” argues on‑policy RL forgets less than SFT—even at matched accuracy—by biasing toward KL‑minimal solutions, with toy + LLM experiments supporting reduced catastrophic forgetting (summary). A “Unified View of LLM Post‑Training” shows SFT and RL optimize the same reward‑with‑KL objective; Hybrid Post‑Training (HPT) switches between them via simple performance feedback and consistently beats strong baselines across scales/families (overview). On the empirical side, Microsoft’s rStar2‑Agent‑14B uses agentic RL to reach frontier‑level math (AIME24 80.6, AIME25 69.8) in just 510 RL steps, with shorter, more verifiable chains of thought (results).

GPU stacks, kernels, and platforms

ROCm quality regression in PyTorch: Analysis alleges a growing deficit of ROCm‑only skipped/disabled tests (>200 each), with a net increase since June 2025; reports say even core transformer ops (e.g., attention) have been disabled for months, harming developer trust. AMD leadership has reportedly reprioritized fixes (report). PyTorch maintainers note broad test‑skipping is endemic and requires sustained contributor attention across subsystems (context, quip). Separately, PyTorch published a kernel deep‑dive on 2‑simplicial attention implemented in TLX (Triton low‑level extensions) (kernel post).
Infra momentum and meetups: Together AI announced a $150M Series D led by BOND (Jay Simons to board) to scale inference infra (annc); Baseten also raised $150M Series D as it rolls out performance work and EmbeddingGemma support (annc). vLLM is hosting a Toronto meetup on distributed inference, spec decode, and FlashInfer (event) and already supports Kimi K2 deployments (support).

OpenAI ecosystem: ChatGPT branching, Responses API, and Codex

Product/API shifts: ChatGPT now supports conversation branching (@gdb; @sama). OpenAI’s Responses API got an in‑depth explainer (thread); the AI SDK v5 now defaults the OpenAI provider to Responses (Completions remains available) (note). Some devs countered that Responses complicates context portability and stateless usage in practice (critique), while others observed improved “chain‑of‑thought preservation” in ongoing conversations vs Chat Completions (anecdote).
Coding agents and GPT‑5 Pro: Multiple practitioners report GPT‑5 Pro inside Codex can unblock gnarly engineering problems with deeper, slower passes; “smarter” beats “faster” was the sentiment in a public exchange with Sam Altman (experience, follow‑up, @sama). The Codex CLI/IDE continues shipping rapidly (changelog).

Embeddings and retrieval move on‑device (and hit limits)

Small, fast, local: Google’s new open‑source EmbeddingGemma got day‑0 platform support (e.g., Baseten), with reports of embedding 1.4M docs in ~80 minutes on an M2 Max for free and better quality than older large paid models (Baseten, field result). On‑device retrieval is getting easier: SQLite‑vec + EmbeddingGemma runs fully offline across languages/runtimes (guide).
Single‑vector limits: New theory/benchmark “LIMIT” shows hard lower bounds on top‑k retrieval under fixed embedding dimensions, with SOTA models failing on deliberately stress‑tested simple tasks—evidence that some combinations of relevant documents are intrinsically unrecoverable with single‑vector embeddings, motivating multi‑vector/late‑interaction approaches (summary).

Top tweets (by engagement)

“The ability to predict the future is the best measure of intelligence.” — @elonmusk
Kimi K2‑0905 update (256k, coding/tool‑calling, agent integration) — @Kimi_Moonshot
Qwen3‑Max‑Preview (Instruct), “over 1T parameters,” now live via Qwen Chat/Alibaba Cloud — @Alibaba_Qwen
ChatGPT conversation branching now live — @gdb
GPT‑5 Pro in Codex praised for solving hard coding tasks with deeper passes — @karpathy
“Very requested feature!” (ChatGPT branching) — @sama
ROCm regression in PyTorch testing — @SemiAnalysis_
DeepMind’s “Deep Loop Shaping” improves LIGO gravitational wave detection — @demishassabis

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Kimi K2-0905 and Qwen 3 Max Launches + Early Demos

Kimi-K2-Instruct-0905 Released! (Score: 729, Comments: 192): Release announcement for Kimi-K2-Instruct-0905 with an attached benchmark/leaderboard image comparing it to other LLMs (e.g., DeepSeek). The chart is presented as showing K2-Instruct-0905 performing near SOTA and ahead of DeepSeek, with a commenter calling out a “1t-a32b” variant, possibly indicating a notable configuration highlighted in the results. Image: https://i.redd.it/6jq7r55ak9nf1.png (preview: https://preview.redd.it/u97uhts0q9nf1.png?width=1200&format=png&auto=webp&s=7d65247fb861127f04dd422d2ae8885c748edabd). Commenters claim it’s “very close to SOTA” and “clearly beats DeepSeek,” while noting it may be larger; discussion centers on size–performance trade-offs and the strength of the “1t-a32b” variant.
- Performance claims: commenters assert Kimi-K2-Instruct-0905 is “very close to SOTA” and “beats DeepSeek” albeit being larger; treat as anecdotal until verified. Cross-check the benchmark chart shared in the thread (image) and the model card on Hugging Face for head-to-heads versus DeepSeek variants (e.g., V3/R1) on standard suites like MMLU, MT-Bench, GSM8K, and HellaSwag.
- Scale/architecture hints: references to a “trillion-parameter” open-source model and a “1T-A32B” variant suggest a MoE-style setup where total parameters can be ~1T while active parameters per token are far lower (e.g., tens of billions). Clarifying total vs active params, routing/expert count, and training token budget is key to interpreting claims that it outperforms smaller dense baselines like DeepSeek at higher compute. “1T-A32B” likely denotes a ~32B active slice within a ~1T total-parameter regime, but verify on the model card before comparing efficiency.
- Resources: official release is on Hugging Face: https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905. Check the card for evaluation tables, context length, tokenizer details, and quantization/inference notes (e.g., int4/int8), as well as licensing and any hardware recommendations to reproduce reported benchmarks.
Qwen 3 max (Score: 269, Comments: 93): Qwen 3 Max is now available via the OpenRouter model hub and a web preview at Qwen Chat (OpenRouter, chat.qwen.ai). Pricing on OpenRouter is tiered by context-length: input USD 1.2 (≤128K) / USD 3 (>128K) and output USD 6 (≤128K) / USD 15 (>128K), implying support for contexts beyond 128K and placing it near frontier-model pricing tiers (e.g., Claude/GPT). Commenters note prior Qwen Max variants were closed-source and express hope this release will have open weights on Hugging Face; others remark the pricing positions it alongside top-tier proprietary models.
- Pricing details: Input is listed as $1.2 for contexts < 128K and $3 for ≥ 128K; Output is $6 (< 128K) and $15 (≥ 128K). Commenters note this places Qwen 3 Max’s cost structure close to Claude and GPT tiers, implying a frontier-model pricing posture and a separate long-context SKU at the 128K cutoff.
- Release/availability expectations: Prior “Qwen Max” was closed-source; commenters hope for a Hugging Face release but others suggest this one is likely API-only (not locally runnable) at launch. This indicates uncertainty about open weights and potential lack of immediate local quantizations (e.g., GGUF) for on-device inference.
- Model size speculation: One user infers Qwen 3 Max “must be bigger than 235B,” suggesting expectations of a very large dense model surpassing earlier Qwen baselines. This is unconfirmed, but if accurate it would put Qwen 3 Max in the top tier of parameter counts among 2024+ LLMs, aligning with its frontier-like pricing.
I’ve made some fun demos using the new kimi-k2-0905 (Score: 161, Comments: 24): OP showcases several demos built with the new kimi‑k2‑0905 using a single‑pass, AI‑generated prompt workflow that pairs Claude Code with kimi‑k2‑0905. The shared prompt resources are published as gists: gist 1 and gist 2; the demo video link on v.redd.it returns HTTP 403 without login, limiting independent verification (original link). A commenter proposes a capability stress‑test: ask the model to generate a full Game Boy emulator end‑to‑end. (Another non‑technical comment was ignored.)
- A commenter shared concrete prompt templates for kimi-k2-0905, linking two gists that appear to provide reusable prompt scaffolding and examples for consistent behavior and demo replication: https://gist.github.com/karminski/52a72d4726128c10a266bfb8270fe632 and https://gist.github.com/karminski/0435b69c6d8c93b4bd1724b64e43bd75. These resources are useful for standardizing system instructions/roles and I/O formatting when evaluating K2 across tasks.
- There’s a proposed stress-test: have K2 generate a full Game Boy emulator end‑to‑end. This would probe long‑horizon code generation, multi‑file project scaffolding, and hardware reasoning (instruction decoding, timing/cycle accuracy for CPU/PPU/APU, ROM loading), offering a stringent benchmark versus other frontier models.
- Multiple requests focus on head‑to‑head evaluation and tooling: comparing kimi‑k2‑0905 to Claude Opus and guidance for using K2 with Claude Code. Useful axes for comparison would include code generation pass@k, long‑context reliability, tool‑use quality, latency, and cost; integration with Claude Code would likely require an OpenAI/Anthropic‑compatible API layer or an adapter to map chat and tool-call schemas.

2. Open-Source LLMs: GPT-OSS 20B Home Server & Weekly Release Roundup

Converted my unused laptop into a family server for gpt-oss 20B (Score: 176, Comments: 94): **OP repurposed a 2021 MacBook Pro M1 Pro (16 GB unified RAM) as a 24/7 family LLM server running “gpt-oss 20B” via the llama.cpp server, reporting 46–30 tok/s, 32K context, ~1.7 W idle and ~36 W under generation; the 20B model + large context narrowly fits in 16 GB, so the system runs headless over SSH, with sleep/auto-updates disabled, Dynamic DNS for WAN access, and battery health managed while plugged in (native Apple charger measured more efficient than a generic GaN). The model is described as fast, concise, and compliant, but occasionally emits “very strange” factual errors—OP speculates possible weight corruption or low‑quality fine‑tuning. ** Comments request a setup guide and whether it’s bare‑metal, and ask about tweaks to improve responses; one user notes success on a non‑Mac by removing the battery and upgrading RAM to 32 GB. Another recommends LM Studio using Apple’s MLX stack and serving through Open WebUI in Docker for auth + web search, questioning if the OP avoided it due to the 16 GB constraint.
- Reproducibility hinges on sharing exact llama.cpp (repo) runtime parameters; commenters request flags like t (threads), ngl (GPU layer offload/Metal), c (context), b (batch size), plus the quantization (e.g., Q4_K_M vs Q5_K_M) and exact model variant. Reported throughput varies from ~8 tok/s (Ollama/LM Studio defaults) to a claimed ~40 tok/s on an M1/16GB; differences likely stem from quantization, GPU offload, and batching, so posting the full parameter set is essential for fair comparison.
- On Apple Silicon, several note better reliability/perf by using LM Studio with the MLX backend (MLX) versus other options; some pair this with Open WebUI in Docker for auth/search. Stack choice impacts speed and resource headroom: MLX/Metal acceleration and bare‑metal runs can beat containerized UIs and Ollama defaults, while Dockerized setups trade some performance for convenience/features.
- Hardware constraints are a key limiter for 20B inference: upgrading a PC laptop to 32 GB RAM (and removing the battery for 24/7 use) improved stability and enabled higher‑precision quants; Macs can’t upgrade RAM, making M1 16 GB notably constrained. This context helps explain why heavier UIs/backends may be avoided on low‑memory machines in favor of leaner llama.cpp servers.
List of open models released or updated this week on this sub, just in case you missed one. (Score: 220, Comments: 30): Weekly roundup highlights new/updated open models across tasks and scales: Moonshot AI’s Kimi K2‑0905; AI Dungeon’s Wayfarer 2 12B & Nova 70B (open-sourced narrative roleplay LLMs); Google’s EmbeddingGemma (300M) multilingual embedding encoder; ETH Zürich’s Apertus multilingual LLM (≈40%+ non‑English training data); WEBGEN‑4B web‑design generator trained on ~100k synthetic samples; Lille (130M) small LLM; Tencent’s Hunyuan‑MT‑7B & Hunyuan‑MT‑Chimera‑7B MT/ensemble models; GPT‑OSS‑120B benchmark updates; and Beens‑MiniMax (103M MoE) scratch‑built SFT+LoRA experiments. Coverage spans ~103M to ~120B params, with notable techniques/data mentions including synthetic data generation, MoE, multilingual emphasis, roleplay fine‑tuning, and translation ensembles. Comments note strong reception for Kimi; the WEBGEN team adds that a non‑preview release and more UIGEN models are forthcoming, and that 4B checkpoints serve as internal thermometers to validate their pipelines.
- Sparse-MoE drops stood out: Klear-46B-A2.5B-Instruct uses a 46B-parameter mixture-of-experts with only ~2.5B active per token, so compute and KV cache scale with the active experts, not the total size; similarly, LongCat-Flash-Chat 560B MoE pushes total params higher while keeping per-step cost bounded. For local inference, this means memory/throughput are governed by the number of activated experts (and sequence length), enabling large-capacity behavior on modest hardware if routing remains sparse and load-balanced.
- Specialized models expanded: Step-Audio 2 Mini (8B) adds open speech-to-speech capability; Neeto-1.0-8B targets medicine and reports 85.8 on a medical benchmark; and Anonymizer SLMs provide privacy-first PII replacement at 0.6B/1.7B/4B scales for edge/server use. Translation saw breadth with YanoljaNEXT-Rosetta and CohereLabs/command-a-translate-08-2025, while vision/mobile got attention via Apple’s FastVLM and MobileCLIP2 on Hugging Face.
- From a workflow perspective, the WEBGEN team notes they use ~4B models as an internal “thermometer” to validate training/inference pipelines before scaling up, which is a practical proxy for detecting regressions early. Separately, users plan to evaluate Gemma embeddings for clustering; for rigorous comparison, consider intrinsic (cosine separation, silhouette) and extrinsic metrics (NMI/ARI) against baselines like E5 or text-embeddings models.

3. AI/LLM Race Discourse and Meme Reactions

Th AI/LLM race is absolutely insane (Score: 189, Comments: 146): Meta-discussion noting the rapid cadence of LLM releases and infra over the last 3–6 months—especially code-focused and general models like Alibaba’s Qwen2.5-Coder, THUDM’s GLM-4, and xAI’s Grok-2—plus the rise of third‑party API platforms hosting heavier models (e.g., OpenRouter, Together). The OP frames it as a bubble-vs-platform-shift question, pointing to relentless iteration on throughput (“new way of increasing tps”), the shift from local to hosted inference, and heavy corporate CAPEX, layoffs/poaching, and M&A as signals of a high-velocity market regime. Top comments are split between enthusiasm (“These ARE the good old days”), a macro take citing UBS’s forecast of ~$0.5T AI investment in 2026 with ~60% YoY growth (arguing scale exceeds typical hype cycles), and moderation concerns noting the post may be off-topic for r/localllama.
- UBS is cited forecasting ~$0.5T in AI investment in 2026 with ~60% YoY growth, implying massive near-term demand for compute (GPUs/TPUs), high-speed interconnects (400G/800G), and power/cooling capacity. For local/edge LLMs, this scale-up could affect GPU availability/pricing and spur rapid infra buildouts, but also raises risk of overcapacity similar to past infrastructure cycles.
- Practitioner-run, task-specific benchmarks often fail to reproduce headline SOTA gains, suggesting many reported improvements are narrow, cherry‑picked, or brittle to prompt/distribution shifts. The discussion urges skepticism of single-paper claims and the casual use of “SOTA” as an objective yardstick, advocating rigorous replication, ablations of proposed components, and evaluation across diverse datasets/use cases to validate real capability gains.
- Some frame the current phase as a dot‑com‑like buildout focused on GPUs and extended context windows, where marketed advances (more VRAM, longer sequences) face practical limits: latency, cost, and quality degradation over very long contexts. The view is that a portion of perceived progress is aggressive market capture (subsidized usage now, price hikes later) rather than consistent, generalizable step-function model improvements.
This is not funny…this is simply 1000000% correct (Score: 1697, Comments: 128): Non-technical meme critiquing current AI hype: it suggests companies (especially CEOs) push “AI” initiatives primarily to please markets and inflate stock prices, while job postings across roles now demand vague “AI experience” regardless of real need. The image contextualizes the broader trend of AI-washing—adding AI buzzwords to products, roadmaps, and hiring to signal innovation rather than deliver concrete value. Commenters note near-ubiquitous AI requirements in tech job ads and argue executives chase an “AI premium” independent of business benefit; some satirically suggest AI could replace CEOs, underscoring frustration with hype-driven leadership decisions.
- Thread consensus notes a hiring trend where “AI experience” is mandated across technical roles without specifying models, frameworks, or measurable outcomes, reflecting top‑down, market‑driven directives rather than concrete implementation plans. Comments suggest the business objective is headcount reduction and stock‑price signaling (“What do we need for that? AI!”) rather than validated ROI, with no discussion of benchmarks, latency/cost metrics, or deployed systems—indicating AI as a checkbox skill rather than a scoped technical requirement.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo, /r/aivideo

1. OpenAI-Broadcom Chips, Google Veo/Nano Banana, Nunchaku v1.0 Releases

OpenAI set to start mass production of its own AI chips with Broadcom (Score: 522, Comments: 58): Reuters (citing the FT) reports that OpenAI will begin mass-producing its own custom AI accelerators with Broadcom, aiming to reduce reliance on Nvidia GPUs and lower training/inference costs while securing supply. This mirrors hyperscaler strategies (e.g., Google TPUs, AWS Trainium/Inferentia) but carries substantial risks around upfront NRE, manufacturing yield, time-to-market, and building/optimizing the software stack to fully utilize the hardware. Source: https://www.reuters.com/business/openai-set-start-mass-production-its-own-ai-chips-with-broadcom-ft-reports-2025-09-05/ Commenters characterize it as a high-risk, high-reward and ultimately “obvious” strategic move: smart if it works (cost/control advantages), but a massive gamble given the capital intensity and execution risk.
- Strategic rationale and risks: Commenters frame this as mirroring Google’s TPU play (custom accelerators to cut dependence on Nvidia and optimize TCO for training/inference), which could deliver workload-specific efficiency and capacity control. The downside is massive upfront NRE, tapeout/yield risk, long bring-up, and the need to mature a compiler/runtime and kernel library (XLA-like) to approach GPU-class performance; failure would strand significant capex. See TPUs for precedent: https://cloud.google.com/tpu
- Scope of deployment: The thread highlights that, per reporting, OpenAI aims to use the chip internally rather than sell it, i.e., “OpenAI planned to put the chip to use internally rather than make it available to external customers…”. This implies tight co-design with OpenAI’s training/research stack and no third‑party productization, reducing external support/validation burden but limiting amortization across customers and focusing optimization on their own models/pipelines.
- Manufacturing/competitiveness: Partnering with Broadcom suggests a full ASIC with advanced packaging/HBM; competitiveness vs GPUs hinges on process node, yields, memory bandwidth, interconnect, and software tooling. Beating H100/B200 on perf/W and cost-per-token would secure supply/cost advantages; missing those targets leaves high sunk costs. Reference GPU baselines: https://www.nvidia.com/en-us/data-center/h100/ and https://www.nvidia.com/en-us/data-center/blackwell/
Google is on fire…Nano Banana & Veo are absolute game-changers (Score: 210, Comments: 14): Post hypes Google’s latest generative models — Veo (text-to-video) and Gemini Nano (on‑device LLM) — as “game‑changers.” Veo is Google/DeepMind’s video model for high‑fidelity, longer‑duration 1080p text‑to‑video with coherent motion, camera control, and style conditioning; see DeepMind: Veo. Gemini Nano is a compact on‑device model integrated into Android via AICore for low‑latency tasks (e.g., summarization, context‑aware system features); see Gemini Nano docs. Comments highlight the realism of the demo and request explicit art‑style conditioning (e.g., “add Saturn Devouring his Son”), while another geopolitical remark is non‑technical and not directly relevant to model capabilities.
Nunchaku v1.0.0 Officially Released! (Score: 305, Comments: 91): Nunchaku v1.0.0 ships a backend migration from C to Python for broader compatibility and adds asynchronous CPU offloading, enabling Qwen-Image diffusion to run in ~3 GiB VRAM with claimed no performance loss. New wheels and a ComfyUI node are available (release, ComfyUI node), plus a 4-bit, 4/8-step Qwen-Image-Lightning build on Hugging Face (repo); docs cover install/setup (guide). Roadmap: kick off Qwen-Image-Edit imminently and add Wan 2.2 support next. Commenters nudge prioritization toward Wan 2.2 over 2.1 and note enthusiasm for faster image generation workflows. One asks about Nunchaku compatibility with Chroma (examples currently show Flux), implying interest in broader model/runtime support.
- Questions center on model compatibility: can Nunchaku work with Chroma, given examples showcase FLUX? Others ask whether LoRA is supported for fine-tuning/adapters. This suggests users want broader model/runtime abstraction beyond the showcased FLUX pipelines.
- Multiple users request WAN 2.2 support (some preferring it over WAN 2.1), with one quoting that “WAN2.2 hasn’t been forgotten — we’re working hard to bring support!”. Emphasis is on keeping parity with current model versions for state-of-the-art image generation; no concrete timelines or technical plan details were provided in-thread.
- Upgrade reliability: a tester reports the in‑app Manager update “almost never works,” often requiring manual uninstall/reinstall to reach new versions (e.g., v1.0.0). This points to packaging/update pipeline issues that may hinder smooth adoption and automated environments.

2. AI Robotics: Figure Home Chores and RAI Robomoto

**Will figure.ai take over home chores?** (Score: 350, Comments: 223): Thread discusses whether Figure AI’s humanoids (e.g., Figure 01) could handle full-spectrum home chores. No benchmarks or implementation details are provided (linked video is behind Reddit login at v.redd.it); commenters define an MVP capability set: end‑to‑end laundry, mopping, dishwasher loading/unloading, cardboard breakdown, trash/bin logistics, and vacuuming—implying reliable mobile manipulation in unstructured homes (deformable-object handling, force‑controlled tool use, long‑horizon task planning, perception, and safety). A consumer price tolerance of USD $30–50k is cited if these tasks are executed robustly. Notable sentiments: strong willingness to adopt at the stated price if routine chores are solved; speculation that competent in‑home cooking plus drone ingredient delivery could reshape restaurant demand and last‑mile logistics; general hope for near‑term timelines without concrete evidence.
- The chore list (end-to-end laundry, mopping, dishwasher loading/unloading, cardboard flattening, trash logistics, vacuuming) implies hard requirements: robust deformable-object manipulation (cloth, bags, cardboard), tool use, appliance interfacing, long-horizon task planning, and home-scale navigation in clutter. Benchmarks highlight the gap: BEHAVIOR-1K long-horizon household tasks [https://behavior.stanford.edu/behavior-1k], iGibson [https://svl.stanford.edu/igibson], and Habitat [https://aihabitat.org] show success rates degrade in unstructured settings. Hitting acceptable cycle times (e.g., folding a basket in <10–15 min) and recovering from errors without human resets is as crucial as dexterity. The stated willingness to pay \$30–50K suggests BoM targets of a mobile base + 1–2 arms + RGB-D sensors are viable only if reliability approaches appliance-like duty cycles.
- Multiple commenters flag the “certain conditions” demo gap: real homes vary widely (lighting, layouts, object novelties), so sim/demo policies must generalize and recover from failure. For 50–200-step chores, per-step reliability must be ≥99.9% to keep task success high (0.999^100 ≈ 90% vs 0.99^100 ≈ 36%), which far exceeds staged demo rates. This demands self-calibration, continuous mapping, grasp-under-occlusion, compliant control, and safety interlocks, with MTBF in tens of hours between human interventions—hence the “couple years” to appliance-grade deployment.
- Assistive and cooking scenarios raise the bar: force-limited compliant manipulation, food-safe materials, heat/grease handling, contamination-aware tool use, reliable multimodal interfaces, and robust environment understanding. Teleop-to-imitation systems like Mobile ALOHA show bimanual kitchen tasks under curated conditions [https://mobile-aloha.github.io], but end-to-end autonomy also needs pantry inventory tracking, recipe/temporal planning, and integration with delivery logistics (drones/robots), which face reliability and regulatory constraints. These requirements exceed today’s vacuum/mop robots and are the gating factors for impactful aging-in-place or blind-user assistance.
Another day, another AI driven robomoto (Score: 321, Comments: 46): The Reddit post links to a short X/Twitter clip from the RAI Institute showcasing a riderless, AI-driven motorcycle performing basic balance/riding maneuvers (video). The post/video includes no technical details (e.g., control stack, sensors, training method, or quantitative performance/robustness metrics), and the mirrored Reddit-hosted video (v.redd.it) returns 403 to unauthenticated clients. Top comments are largely non-technical memes; the only semi-substantive point raised is curiosity about validation at higher speeds on a supersport platform and highway conditions.
- A recurring technical question was why quadrupeds (robot dogs) appear agile while humanoids look clunky. Commenters note quadrupeds benefit from passive/static stability and simpler gait planners, whereas bipeds are underactuated and require real‑time whole‑body control (ZMP/MPC), high‑bandwidth torque control, and reliable contact estimation; adding dexterous hands compounds the problem with ~28–40 DOF vs ~12–16 on many quadrupeds. Progress exists but is still brittle outside demos (see BD Atlas parkour for state of the art: https://www.youtube.com/watch?v=tF4DML7FIWk).
- On “strap it to a supersport,” prior art shows autonomous motorcycle control is feasible without a human rider’s body: Yamaha MOTOBOT rode an R1M at >200 km/h using GPS/IMU fusion and model‑based control of throttle, brake, clutch, shifting, and steering to induce roll via counter‑steering (https://global.yamaha-motor.com/showroom/technologies/ymrt/motobot/). The hard parts are low‑latency control under rapidly changing tire‑road friction and maintaining stability across low‑speed balance vs high‑speed dynamics; anthropomorphic actuation to “grab” the bike is unnecessary when drive‑by‑wire is available. Related balancing approaches on bikes (e.g., Honda Riding Assist) highlight how steering geometry and active control manage stability at low speeds: https://global.honda/innovation/robotics/experimental/riding-assist/.

3. AI Society: Inequality, Layoffs, Deepfakes, and Accessibility

Computer scientist Geoffrey Hinton: ‘AI will make a few people much richer and most people poorer’ (Score: 216, Comments: 76): In a recent Financial Times interview, Geoffrey Hinton warns that current AI deployment will concentrate wealth and power in a small set of firms while reducing incomes for most workers, exacerbating inequality and social risk. He urges stronger oversight, safety research, and governance before further rapid roll‑out to mitigate labor‑market displacement and broader systemic harms. Sources: no‑paywall archive link, FT original paywalled. Top comments frame this as a continuation of capitalism’s widening wealth gap, with AI accelerating the trend; some read Hinton’s tone as ironically “more optimistic now.” Another thread asserts that concentrated gains are a deliberate feature benefiting incumbents, not an unforeseen bug.
- Several commenters highlight a structural tax asymmetry: hiring humans triggers payroll taxes (e.g., US employer FICA ~7.65% plus mandated benefits) while deploying robots/software incurs no payroll tax, effectively making automation’s total cost of ownership lower than labor for equivalent tasks. They argue this acts as a de facto subsidy accelerating capital–labor substitution and concentrating returns to capital owners, and reference ideas like a “robot tax” or shifting tax burdens from labor to capital to rebalance incentives (see IRS/SSA FICA overview: https://www.ssa.gov/pubs/EN-05-10003.pdf; policy debates: https://www.oecd.org/tax/tax-policy/taxation-and-the-future-of-work.htm).
- Another thread contends the unequal distribution of AI gains is not technologically inevitable but driven by institutional choices that tax labor-linked transfers heavily while taxing large wealth transfers (inheritances/capital gains) comparatively less, allowing AI-driven productivity to accrue primarily to asset owners. They frame this in terms of factor income shares and bargaining power, noting long-run declines in labor share as a warning signal (e.g., US nonfarm business labor share trend: https://fred.stlouisfed.org/series/PRS85006173), and propose reorienting tax/transfer systems toward wealth and capital income to avoid “neo-feudal” dynamics.
Salesforce CEO confirms 4,000 layoffs ‘because I need less heads’ with AI (Score: 290, Comments: 69): Salesforce CEO Marc Benioff confirmed ~4,000 customer-support layoffs—reducing support headcount from ~9,000 to ~5,000—attributing the cuts to AI-driven efficiencies from its Agentforce system, saying “because I need less heads”, per CNBC. Salesforce says AI has reduced support case volume and it won’t backfill affected support engineer roles; internally, AI reportedly handles up to 50% of work. Top commenters argue firms are invoking AI to justify post-pandemic over-hiring corrections and to signal efficiency to investors (citing analyst Ed Zitron), predicting more AI-attributed layoffs as the current hype cycle deflates.
- Several commenters argue the 4,000 layoffs attributed to “AI efficiency” lack technical substantiation—no disclosed productivity metrics, automation coverage, infrastructure cost reductions, or model/inference choices. They note this mirrors a broader post‑pandemic overhire correction being reframed as AI‑driven without benchmarks (e.g., tickets handled per agent, leads per AE, cost‑to‑serve deltas). The absence of details like which models, fine‑tunes, or workflow automations actually replaced FTEs makes the claim hard to evaluate.
- An anecdote about building a personal CRM with AI underscores how LLM‑assisted scaffolding can accelerate CRUD apps and simple automations, potentially eroding the moat of generic SaaS. However, replacing Salesforce at enterprise scale requires non‑trivial capabilities—complex role hierarchies/ACLs, compliance (SOC 2/HIPAA), data model extensibility, integration throughput (ETL/event buses), observability, and SLAs—areas where DIY + LLM still imposes significant ongoing ops and reliability burden.
- Expectation of more firms citing AI for headcount cuts until the hype normalizes, absent hard ROI. Technical readers would expect quantifiable proof such as >X% workflow automation, ~$Y/seat license consolidation, or inference spend offset by labor savings; none were referenced, suggesting investor‑signaling rather than measured AI‑driven efficiency.
An Update: Ben can now surf the web thanks to Vibe Coding in ChatGPT (Score: 1387, Comments: 77): A caregiver built a custom AAC/accessibility stack via “vibe coding” with ChatGPT that enables a nonverbal quadriplegic with TUBB4A-related leukodystrophy and severe nystagmus to browse content using a binary, two-button headband input with on-screen scanning selection. The system evolved from a phrase board to a media picker, a predictive‑text keyboard, and 8 custom games, culminating in search integrated directly into the keyboard so the user can type queries and independently retrieve images/YouTube videos; demo link: v.redd.it/6qzlngnab8nf1 (currently returns 403/auth‑gated). Implementation emphasizes low-vision, low-fine-motor constraints with binary input scanning and UI options sized/sequenced for minimal visual demand, all prototyped by a novice using ChatGPT for rapid iteration. Commenters encourage sharing/replicating the approach for other families and suggest enabling the user to co‑create via ChatGPT (user‑in‑the‑loop prompt engineering) to expand functionality.
- A commenter suggested shifting toward end-user programming by giving Ben direct access to ChatGPT so he can prototype and build his own tools/automations, noting that user-driven iterations often surface solutions others wouldn’t anticipate. This implies extending the current “Vibe Coding” workflow from caregiver-authored prompts to user-authored scripts/macros, increasing personalization and autonomy in assistive tech.
Tech CEOs Take Turns Praising Trump at White House - “Thank you for being such a pro-business, pro-innovation president. It’s a very refreshing change,” Altman said (Score: 804, Comments: 207): At a White House event (reported as a Rose Garden dinner), multiple tech CEOs publicly praised President Trump’s stance as “pro‑business, pro‑innovation,” with Sam Altman (OpenAI) quoted as saying, “Thank you for being such a pro‑business, pro‑innovation president. It’s a very refreshing change.” The only source provided is a paywalled WSJ link; no agenda, policy commitments, participant list, or technical outcomes (e.g., regulatory changes, funding programs) are available from the shared materials. Top comments are overwhelmingly critical of CEOs’ integrity and of Altman personally, offering no technical or policy analysis; an image link is shared without context. Overall, the thread reflects skepticism toward corporate motives rather than substantive debate on tech policy.

AI Discord Recap

A summary of Summaries of Summaries by Gemini 2.5 Pro Exp

1. The AI Arms Race: New Models and Hardware Heats Up

Qwen 3 Max Enters the Arena with Mixed Reviews: The new Qwen 3 Max model sparked speculation of having 500B to 1 Trillion parameters, with users in the Unsloth AI discord praising its creative writing abilities as superior to K2 and Sonnet 4. However, its high price and shortcomings in tool calls and logic-based coding were noted, while an official release announcement from OpenRouter highlighted its improved accuracy and optimization for RAG and tool calling.
Hardware Wars Rage from Custom Silicon to Workstations: OpenAI is reportedly partnering with Broadcom on a custom AI chip to reduce its reliance on Nvidia, detailed in a Financial Times article. Meanwhile, engineers debated the merits of workstations, with one quipping that the DGX Spark is a toy compared to the more powerful DGX Station, and others speculated that Nvidia’s upcoming 5000 series may be a skip gen due to a lack of significant VRAM increases.
Niche Models Cater to Specific Tastes: A new model named Glazer was released on Hugging Face and Ollama specifically to replicate the sycophantic personality of GPT-4 that some users miss. In a more experimental vein, a developer trained a micro-LLM on H.P. Lovecraft’s stories, producing what they described as quite promising Lovecraftian output, seen in this YouTube video.

2. Geopolitical Jitters and Corporate Policy Shake-Ups

Anthropic Draws a Line in the Geopolitical Sand: A new Anthropic policy, first shared on X, restricts service to organizations controlled by jurisdictions where its products are not permitted, such as China. The move ignited debates across multiple Discords about whether the motivation was genuine national security concerns or simply corporate self-interest aimed at protecting market share.
MasterCard’s AI Unleashes Compliance Chaos: MasterCard replaced its human fraud prevention team with an AI system that is now aggressively flagging merchants for obscenity rule violations, as detailed in Chapter 5.12.7 of their rulebook. The system’s insufficiently specified criteria has led to fees as high as $200,000, forcing merchants into a corner and highlighting the risks of automated policy enforcement without clear context.
OpenAI Clarifies Responses API Reality: A developer posted a thread on X to bust widespread myths about the OpenAI Responses API, clarifying that it does not magically unlock higher model intelligence but is essential for building GPT-5-level agents. It was also confirmed that OpenRouter uses this API for most of its OpenAI models, making the clarification critical for developers building on the platform.

3. The Developer’s Dilemma: Choosing and Tuning the Right Tools

Coding Assistants Clash in the IDE: Developers are fiercely debating the best AI coding tools, with many finding GPT-5 superior to Sonnet 4 due to its conciseness and lower tendency to hallucinate. The community is also split between Codex CLI, praised for its code quality, and Cursor Code, favored for its creative reasoning, with one user noting the optimal setup might be a $20/month Cursor subscription paired with a separate Claude Code plan.
Engineers Wrangle LLMs with Prompts and Programs: In the OpenAI discord, users shared advanced prompt engineering techniques, advocating for token efficiency by cutting useless words and using bracket notation like [list] and {object} for abstraction. Elsewhere, developers using DSPy focused on a more programmatic approach, building voice agents and using frameworks like GEPA to optimize prompts for specific conversational tasks.
Hardware Constraints Force Creative Solutions: A user on a 6GB GPU sought model recommendations for immersive roleplaying, leading to suggestions like Mistral Nemo Instruct and the quantized Qwen3-30B-A3B-Instruct-2507-MXFP4_MOE model. For developers with tight cloud budgets, another discussion highlighted using models with RoPE (Rotary Position Embedding) to build RAG applications that can handle context windows larger than what they were explicitly trained on.

4. Under the Hood: The Guts of GPU Programming and Performance

Mojo and Zig Push Compiler Boundaries for Peak Performance: Engineers in the Modular community are chasing the dream of writing simple, Pythonic code that automatically compiles to SIMD instructions using Mojo and MLIR. This mirrors concerns in the Zig community over a new async IO approach where IO needs to haul around state now, fueling discussion on how next-gen language features like Mojo’s type system can solve these low-level performance challenges.
Engineers Decode Low-Level CUDA and ROCm Mysteries: A deep dive revealed that the FP32 accumulator for FP8 matrix multiplication in Nvidia’s tensor cores is actually FP22, according to this paper. Other discussions focused on leveraging L2 cache persistence on the Ampere architecture for performance gains, detailed in a blog post, and tackling errors in rocSHMEM related to its ROCm-aware MPI requirements.
Future-Forward Architectures Spark Niche Debates: Discussions explored brain-like Spiking Neural Networks (SNNs) after a member shared an explainer video. On the more practical side of performance, vLLM profiling revealed significant slowdowns caused by ‘Runtime Triggered Module Loading’, prompting an investigation into its root cause and potential workarounds.

5. User Blues: Platform Instability and UX Woes Create Headaches

LMArena Buckles Under Unprecedented Traffic: The LMArena platform is struggling with major stability issues, as users report widespread image generation glitches, infinite loops, and a non-functional video arena bot. Compounding the frustration are newly implemented rate limits and login requirements, with one user complaining the change was bad ‘because most of us don’t want to’.
APIs Sputter and Services Stumble Across Platforms: Users of the Perplexity PPLX API reported a spike in 500 Internal Server Errors, with the Playground also becoming non-functional. The instability extends to paying customers, as some Perplexity Pro users noted that the Grok 4 model was missing from their selector, while an OpenRouter user discovered that hitting the output token limit silently truncates responses.
AI Assistants Flub the Job and Frustrate Users: Developers using Cursor’s Auto mode shared numerous complaints about its poor performance, including its inability to fix simple bugs and its tendency to type edits in the chat instead of applying them. One user who switched back to aider from Claude Code remarked that Anthropic have made some questionable changes, highlighting a broader sentiment that even top-tier tools are experiencing regressions.

Discord: High level Discord summaries

LMArena Discord

Image Generation Plagued by Glitches: Users are reporting widespread issues with image generation, including persistent errors and infinite generation loops resulting in the ‘Something went wrong with this response’ message.
- Suggestions include adding more specific error messages to aid in troubleshooting, especially when the model appears confused by the prompt.
Video Arena Bot Briefly Vanishes: The video arena bot experienced downtime but is now back online after a fix; users can use the bot with /video and a prompt in the specified channels <#1397655695150682194>, <#1400148557427904664>, or <#1400148597768720384>.
- A GIF tutorial on using the bot was shared here.
Login Requirements Incite Ire: The new login requirements, especially the Google account requirement, sparked user concerns with one member pointing out the requirement was bad.
- This member explained this was ‘because most of us don’t want to’.
Rate Limits Rankle Regulars: The recent implementation of rate limits on image generation, due to unprecedented traffic, has led to frustration, with confusion whether the limits are intentional or a result of ongoing issues.
- Logged-in users will continue to enjoy higher limits, and more information about user login can be found here.
Account Data Evaporates Erratically: Users reported instances of lost chat histories, particularly when not logged in, which raised concerns about data retention.
- One member suggested trying to restore the chats ‘if you use brave you might be able to restore them, i dont know about google’ and that the platform is likely using cloudflare.

Perplexity AI Discord

Perplexity Pro Users Whine About Grok 4 Absence: Some Perplexity Pro users are missing Grok 4 in their model selector, and were advised to contact support to check if their Pro account was the enterprise version.
- A member suggested reinstalling the app might help resolve the issue, or checking with Perplexity support if it was assigned to the account, noting that university users were especially impacted.
Arc Browser Bites the Dust: Users discussed the transition from Arc to Dia, with one noting that Arc hasn’t had a meaningful update in about a year and another expressing concern about the $15 charge for the browser.
- They added that a large fanbase of happy Arc users would be left behind with the transition to building agentic chromium, and complained that it should be cheaper than Perplexity Max.
Qwen 3 Max Speculation Swirls: Members speculated on the specs of the upcoming Qwen 3 Max, anticipating parameters between 500B and 1 Trillion.
- One member stated that they believe that since the models are free for consumers that it’s for Better Training Data and Big Community and the community is also driving model building by labeling, testing, and evaluating different model versions.
PPLX API Melts Down: Multiple users reported receiving 500 Internal Server Errors on API calls and noted that the Playground was also non-functional.
- The users confirmed that no outage was reported on the status page, while one quipped that they’re going to pretend nothing happened after the service appeared to be working again, and another blamed increased usage.
Comet Hits Usage Limits: Users are reporting that after using Comet Personal Search too much, it stops working, throwing the message You’ve reached your daily limit for Comet Personal Search. Upgrade to Max to increase your limit.
- Others noted that Comet is currently offered on the Paypal/Venmo deal or if you’re a student, and are sharing invite links in the discord, but it may be impacting API performance.

Unsloth AI (Daniel Han) Discord

Postgres Dominates Complex Queries: While Qdrant excels in vector search, Postgres with pgvector is considered superior for handling complex database queries, sparking discussion on database suitability.
- A member linked to a tweet and humorously shared a Borat GIF, adding levity to the tech discussion.
Local Sonnet’s Mammoth RAM Requirements: Running Local Sonnet demands a minimum of 512GB of RAM, highlighting significant hardware requirements for optimal performance.
- Even with 1TB of RAM, achieving full precision remains a challenge, leading to inquiries about Q8 fine-tuning as a potential solution, though dismissed as insufficient.
DGX Spark: Toy or Treasure?: The community debated the merits of DGX Spark versus DGX Station, with one member quipping that Spark is a toy, Station is a workstation, while linking to the DGX Station product page.
- Despite its limitations, the DGX Spark was acknowledged for its attractive price and storage capacity, described as a good product.
Qwen 3 Max Excels in Creative Writing: Evaluations of Qwen 3 Max highlighted its strengths in creative writing and roleplay, surpassing K2 and Sonnet 4 in member evaluations.
- However, its high price and perceived shortcomings in tool calls and logic-based coding tempered enthusiasm, positioning it as potentially overpriced.
Glazer Mirrors GPT-4: A new model, Glazer, designed to replicate the sycophantic personality of GPT-4 that some users miss, was released to positive reception.
- It is available on Ollama via ollama run gurubot/glazer and on Huggingface in 4B and 8B versions.

LM Studio Discord

3090 Bug Overclocks 4090?: A user reported that their 3090 seemed to be causing their 4090 to draw excessive power, potentially due to a software bug.
- This resulted in higher temperatures, leading the user to typically undervolt to prevent overheating.
Tentacle LORAs conquer Art Styles: A member created and shared a collection of LORAs, exploring various art styles, providing a link to the LORA template.
- These LORAs were described as stomped together, resulting in a tentacle-like shape, designed for artistic experimentation.
6GB GPU Owner needs roleplaying model: A user with a 6GB GPU sought recommendations for the best model for realistic and immersive roleplaying games.
- Suggestions included increasing CPU RAM to 64GB and utilizing models like Mistral Nemo Instruct or Qwen3-30B-A3B-Instruct-2507-MXFP4_MOE.
Bionic Legs Flop?: A member inquired about consumer-priced bionic legs (exoskeletons), seeking real-world performance insights.
- Another member referenced a YouTube review indicating that they barely do anything and might even induce muscle atrophy.
5000 Series to skip VRAM?: Discussion arose around the potential for the new Nvidia 5000 series to be a skip gen, with minimal performance increases over the 4000 series.
- The lack of added VRAM was also a point of concern.

Cursor Community Discord

GPT-5 Demolishes Sonnet 4: Members find GPT-5 superior to Sonnet 4 for coding, citing its conciseness and accuracy, albeit requiring more specific prompting, while Sonnet 4 tends to hallucinate more.
- Users appreciate GPT-5 as a valuable planner and discussion partner, especially when coupled with auto-implementation, because Sonnet 4 seems template-based.
Codex CLI vs Cursor Code: Dueling Code Geniuses: The community is split between Codex CLI and Cursor Code, as some prefer Codex CLI’s superior code quality, while others favor Cursor Code’s creative thinking and reasoning abilities, as quality depends on prompt quality.
- A member unsubscribed from Cursor Code’s Max plan due to hallucinations, while others warn of Codex’s lower, harder-to-track rate limits, though some appreciate its suggestion system.
Cursor’s $20/m Price Tag: Still a Steal?: Discussions revolve around the value of Cursor’s $20/month Pro plan, with users debating how quickly one might hit the usage limits.
- One user, finding it essential, canceled their Cursor subscription for Claude Code and Codex, suggesting a $20/month Cursor subscription paired with a Claude Code plan for inline editing and terminal usage is the optimal setup.
Cursor Auto-Mode: Handle with Extreme Caution: Multiple users report issues with Cursor’s Auto mode, noting its poor performance, inability to fix simple bugs, and tendency to type edits in the chat instead of applying them.
- One user humorously illustrates Cursor’s overconfidence with a meme-like message generated by the tool, underscoring the need for thorough debugging.

OpenRouter Discord

Qwen3-Max Gets Smarter: The latest Qwen3-Max model shows accuracy gains in math, coding, logic, and science tasks over the January 2025 version, according to this X post.
- The model is optimized for RAG and tool calling, lacks a dedicated ‘thinking’ mode, and is available for testing here.
Fake OpenRouter Crypto on PancakeSwap: An OpenRouter-related cryptocurrency is a scam and not officially connected to OpenRouter.
- Users were warned after inquiring about the existence of an OpenRouter coin on PancakeSwap and its availability for trading.
Anthropic’s Geopolitical Stance Debated: Members debated Anthropic’s blog post which restricts access from regions with ownership structures subject to control from countries where their products are not allowed.
- Some speculated whether the move was motivated by national security or market share protection.
Output Tokens Capped at 8k: A user found that hitting the output token limit results in response truncation, with the stop reason flagged as *‘length’**.
- The API restricts setting max_tokens beyond the model’s limit.
OpenRouter Leverages OpenAI Responses API: A member inquired whether OpenRouter uses the OpenAI Responses API, referencing a tweet.
- It was confirmed that OpenRouter uses it for most OpenAI models.

OpenAI Discord

Slash Token Waste: A member advocates for token efficiency by filtering grammatically useless words and consolidating multiple words into useful ones in the prompt, claiming that in inference, wasted tokens = wasted resources.
- They argue that wasted tokens lead to accelerated amortization of components if you’re hosting and that politeness in AI prompts can increase environmental waste.
Gemini 2.5 Pro Unlocks Unlimited Access: Google AI Studio now gives unlimited access to Gemini’s best model, 2.5 Pro along with other features like Imagen, Nano Banana, Stream Realtime, speech generation, and Veo 2.
- Some members are focused on LLMs, and some have found use for video editing, educational videos and recreating public domain.
Claims of AGI Generate Carbon: Members discussed a blog post revealing the carbon generated by claims of AGI outpaces the token wastage used on please and thank you in ChatGPT, which can be found here.
- It suggests that the environmental impact of grand claims in AI may be more significant than previously thought, prompting discussions about sustainable AI practices.
Engineering Manual Shared: A user named darthgustav shared a JavaScript code snippet outlining prompt engineering lessons, covering hierarchical communication with markdown, abstraction through variables, reinforcement in prompts, and ML format matching for compliance.
- The lessons aim to enhance clarity, structure, and determinacy in AI interactions, guiding tool use and shaping output more effectively.
Abstraction via Bracketology: A user emphasizes teaching abstraction through bracket interpretation such as [list], {object}, and (option) within prompts.
- This approach aims to enhance clarity and structure, enabling more effective communication between the user and the AI, improving overall prompt engineering practices.

GPU MODE Discord

Anthropic’s China Policy Draws Fire: A tweet revealed Anthropic’s new policy, restricting service to organizations controlled by jurisdictions where their products are not permitted (e.g., China).
- The ensuing debate questioned whether the policy reflects national security concerns or mere corporate self-interest.
CUDA Newbies Convene on Triton: Newcomers sought guidance on learning Triton without prior CUDA or GPU experience, and received a recommendation to start with the official Triton tutorials.
- They further inquired about the necessity of reading the PMPP book for learning Triton.
Profiling Reveals Slow Module Loading: During vLLM profiling, time is spent on ‘Runtime Triggered Module Loading’, though its precise meaning and how to avoid it during profiling are unclear, and a trace was shared.
- Analysis revealed that the FP32 accumulator designed for FP8 matrix multiplication in tensor cores is actually FP22 (1 sign bit, 8 exponent bits, and 13 mantissa bits), according to a paper (arxiv.org/pdf/2411.10958).
rocSHMEM struggles with HIP kernels: A member is exploring rocSHMEM implementation similar to HIP kernels using load_inline, encountering errors related to ROCm-aware MPI requirements.
- Another member suggested trying ROCm/iris as a possible alternative while they investigate the issue.
L2 Cache Persistence makes comeback: A blog post highlights performance gains on the Ampere architecture via leveraging L2 cache for persistent memory accesses, as detailed in a blog post.
- The corresponding code shows structuring a CUDA project using CMAKE to streamline code organization.

Latent Space Discord

OpenAI Designs Custom AI Chip: Financial Times reports OpenAI partnered with Broadcom to co-design a custom AI chip, with mass production slated to start next year, indicating a move away from Nvidia dependency, costing around $10B, see article.
- Community reactions varied from skepticism about the chip’s quality to speculation that OpenAI will out-compete its own customers.
Mercor fields $10B Pre-emptive Offers: AI-hiring startup Mercor has received unsolicited offers valuing it at ~$10B—5× its June 2025 Series B price—just four months later, see tweet.
- The news has spurred jokes about the AI-funding frenzy.
Augie raises $85M Series A for AI Logistics: Augment (Augie) announced an $85M Series A—bringing total funding to $110M in just 5 months—to scale their AI teammate built for the $10T logistics sector, see announcement.
- Augie already helps freight teams handling $35B+ double productivity by orchestrating end-to-end order-to-cash workflows across email, calls, Slack, TMS and more.
Responses API Myths Busted: A thread clarified widespread confusion about the OpenAI Responses API, debunking myths that Responses is a superset of Completions, can be run statelessly, and unlocks higher model intelligence & 40-80% cache-hit rates, see thread.
- Developers stuck on Completions are urged to switch to Responses for GPT-5-level agents, with pointers to OpenAI cookbooks.
AI Engineer CODE Summit Slated for NYC: The AI Engineer team is launching its first dedicated CODE summit this fall in NYC, gathering 500+ AI Engineers & Leaders alongside top model builders and Fortune-500 users to unpack the reality of AI coding tools, see announcement.
- The summit is invite-only, with two tracks (Engineering & Leadership), no vendor talks, and a CFP open until Sep 15, aiming to celebrate PMF (Product-Market Fit) while addressing MIT’s statistic that 95% of enterprise AI pilots fail.

DSPy Discord

DSPy Powers Budding Voice Agents: Members discussed building voice agents with DSPy and explored using GEPA to optimize prompts for frameworks like Livekit and Pipecat.
- One member suggested using the optimized prompt from GEPA as a straightforward string, while acknowledging that this might feel anti-DSPy.
GEPA flexes Prompt Optimization Muscles: While DSPy creators might cringe at the term prompt optimization, tools like GEPA can indeed be used for this purpose and Groq was recommended for inference.
- For prompt creation, it was suggested to setup a Rubric type judge to assess generated responses, especially at the conversation level.
Multi-Turn Musings Spark DSPy Conversational Capabilities: While a member found no satisfying implementation of multi-turn conversations with DSPy or RL applications like GEPA or GRPO, DSPy is fully capable of handling multi-turn conversations using dspy.History.
- However, it was cautioned that defining examples well is crucial, as it’s easy to introduce bias when building chat systems.
RAG and Fine-Tuning face off in Memory Game: The discussion addressed how to equip voice agents with extensive information (hours, services, pricing, etc.) without runtime latency, with some approaches being fine-tuning or retrieval.
- While fine-tuning can build in memorization, it’s a big job, and simple functions or maps (like hours of operation) don’t need a vector database like RAG.
Token Streaming Rides the Wave: Members explored the impact of streaming responses (token by token) on user experience, with a key focus on minimizing Time To First Token (TTFT).
- While streaming doesn’t reduce TTFT, it enhances user perception by providing immediate feedback, and libraries like Pipecat already stream frames.

Moonshot AI (Kimi K-2) Discord

Kimi API Credits Arriving: The Kimi giveaway winner was notified that API credits are incoming.
- Credits were anticipated to arrive within the hour, arranged by the crew.
Anthropic API Absent on Kimi: A user asked about the availability of the Anthropic API on the new model, but it was clarified that kimi-k2-turbo-preview points to -0905.
- This indicates the Anthropic API is not currently integrated into the new model.
Kimi’s 0905 Model Launches: The turbo model now utilizes the 0905 model, updated from the 0711 model.
- Some users find the new K2 model over poetic, while others find it to be more detailed and better.
Kimi Team’s Lofty Ambitions: Despite being a smaller team compared to Grok/OAI, the Kimi team harbors big dreams and has a big model.
- A member noted that smaller companies often offer more user interaction.
Coding Improvements Confuse Kimi Users: Users express confusion over the emphasis on coding improvements in the new Kimi K2 model.
- Opinions diverge, with one user preferring 0711 over 0905.

Nous Research AI Discord

Spiking Neural Networks Mimic the Brain: Members shared a YouTube video discussing Spiking Neural Networks (SNNs) and their similarities to the human brain.
- Another member mentioned image sensors that work closer to how the human eye does, linking this video.
Meta Wristband Controls Smart Glasses: Meta plans to release a wristband that reads body electrical signals to control smart glasses, according to this Nature article.
- No further details were discussed.
Hermes Plays Super Conservative Holdem: A member observed that Hermes exhibits extremely OOD unique behavior in the husky holdem benchmark.
- The member noted it plays super conservative in a way no other model does.
Micro-LLM Channels Lovecraft: A member’s experiments with a micro-LLM trained on H.P. Lovecraft’s stories produced quite promising output, view the youtube video.
- They also speculated that a 3 million parameter model could become a light chat model with the right dataset and sufficient training.
NVIDIA Unleashes SLM Agents: A member shared NVIDIA’s research on SLM Agents (project page) and the accompanying paper (arxiv link).
- No further details were provided.

Modular (Mojo 🔥) Discord

Zig’s Async IO Faces Doubts: Concerns arise in other language communities regarding the viability of Zig’s new approach to async IO, mentioning that IO needs to haul around state now, referencing this discussion in Ziggit.
- It was suggested that Mojo’s type system and effect generics may address some of the underlying problems.
SIMD Nirvana: Pythonic SIMD Approaches: Members discussed the goal of writing simple, Pythonic code that automatically compiles to SIMD instructions, using Mojo and MLIR for optimal parallelized assembly without relying on LLVM to correctly vectorize code.
- One member dreamed of for loops automatically compiled for parallel processing, utilizing hardware capabilities effectively.
Compiler Needs Input Data Shapes For Vectorization: To fully vectorize code, especially loops, the compiler requires sufficient information about input data shapes or must perform speculation to identify hot loops, clarifying that Mojo encourages the use of portable SIMD libraries.
- It was noted that scalar and vector operations can ideally run simultaneously on CPUs and AMD GPUs due to separate execution resources.
GPU Kernel Maturity: Check-up: A member inquired about the maturity of writing GPU kernels in Mojo, specifically implementing a Mamba2 kernel for use in PyTorch, and was pointed to Modular’s custom kernels tutorial.
- MAX (Modular’s API) is not primarily targeted at training but is viable for inference, with MLA already implemented for inference (see GitHub).
Span Abstraction Dream: A member wishes for Span (a contiguous slice of memory abstraction) to be an easily usable, auto-vectorized tool, with algorithms that work on NDBuffer (being ported to LayoutTensor) as part of Span.
- They observed that existing implementations are manual and parametrized with hardware characteristics, lacking sufficient compiler magic.

Eleuther Discord

MasterCard’s AI Flags Obscene Material: MasterCard replaced fraud prevention staff with an AI system that is triggering conflicts with merchants over obscenity rule enforcement; details are available in Chapter 5.12.7 of mastercard-rules.pdf.
- The system flags more transactions as obscene, with fees reaching up to $200,000 per violation and $2,500 daily for noncompliance, incentivizing merchants to avoid admitting fault.
Lacking Criteria Plagues Fraud Prevention: The automated fraud prevention issue stems from insufficiently specified criteria in obscenity rules, with no clear examples of safe items, resulting in confusing gradients for the LLM.
- The discussion focused on the need to clarify unwritten policies and approaches to avoid issues caused by automated enforcement without adequate context.
Brand Risk Drives Over-Enforcement Tactics: Pressure from mutual funds to mitigate brand risk, such as board diversity targets, leads to over-enforcement within MasterCard’s fraud department, impacting merchants.
- MasterCard’s focus on concealing issues hinders developing useful monitoring metrics, as any flaw discovered would create a problem that needs to be solved, to protect their career.
Endorsement Request Sparks Concern: A researcher’s request for endorsements for an arXiv paper on semantic drift raised suspicion due to recent cases of AI-induced psychosis.
- The concerns stemmed from the use of terms associated with AI-generated nonsense, prompting a request to share the paper for review.
Community Ponders GRPO Baseline: Members discussed the possibility of using a GRPO baseline for an upcoming project.
- The idea came about when one member asked did you have an GRPO baseline? and the other responded no, this will be next.

HuggingFace Discord

Anthropic’s Policy Raises Self-Interest Questions: Members debated if Anthropic’s new policy prohibiting access from organizations in restricted jurisdictions is motivated by national security or corporate self-interest.
- The discussion focused on the rationale behind restricting access based on ownership structures, sparking questions about corporate control.
Reward Weighting Gets Deciphered in RL Studies: Members sought studies on the benefits of weighting reward functions during RL to avoid uninformed experimentation.
- One member shared a document regarding reward weighting in RL.
Attention Bias Gets Explored to Train Causal Models: A member requested advice on modifying causal model training with SFTTrainer to add attention bias for specific words, referencing the Attention Bias paper.
- Suggestions included checking specific terms/tokens against common tokenizers and considering alternative approaches for loss calculation and gradient signal control.
RoPE Technique gets Employed to Rescue RAG Context: Members exchanged tips for building RAG applications with LLMs under very limited context sizes (4096 tokens).
- One tip involved using models with RoPE and fine-tuning them with a larger context size, referencing this repo and emphasizing that RoPE enables models to perform well even on context it hasn’t been trained on.
Enron Emails get Parsed into Parquet: A member uploaded their parser for the Enron email dataset, resulting in 5 structured parquet files, including Emails, Users, Groups, Email/User junction, and Email/Group junction.
- Parent and child emails have been parsed, and duplicates are managed both by file and message hashes/caches, with all messages included as MD5 hash objects.

tinygrad (George Hotz) Discord

Digital Ocean MI300X Stable Diffusion Fails: Users had errors running the stable_diffusion.py example on a Digital Ocean MI300X GPU instance, tracing back to some z3 issue.
- The failure wasn’t reproducible on a Mac, though mnist_gan.py was tested successfully.
AMD_LLVM=1 Causes TypeError During MNIST Training: A TypeError occurred involving unsupported operand types (BoolRef) when using AMD_LLVM=1 during a simple mnist training loop.
- George Hotz suggested trying IGNORE_OOB=1, linking it to a possible z3 version issue, noting that some overloads added in z3>=1.2.4.0 might be the cause, and provided a link.
Kernel Removal Project Seeks Contributors: A user expressed interest in contributing to the kernel removal project within Tinygrad.
- The scope of potential contributions was not clarified, but presumably would involve slimming the kernel surface.

aider (Paul Gauthier) Discord

Warp Code Gets Love: Users are praising Warp Code, with one user noting that Warp feels like the difference between driving a stick and manual.
- Warp is useful when you don’t know files and want to get the sense of a new codebase via embeddings search.
Aider Still Shines: A user who switched from aider to Claude Code months back has switched back, finding that Anthropic have made some questionable changes.
- The user now prefers aider for its simplicity, and uses Gemini 2.5 Pro, Gemini Flash, and Qwen3 Coder along with /run to replicate Claude Code’s plan mode.
Run command is a Killer Feature: The /run in aider is a major feature for a user, and they noted Aider is good when you know better what files you want to work with.
- They also enquired where they can see Aider’s success stories.
Coding Agent Undergoing Refactoring: A member is refactoring their own coding agent, inspired by Aider, to learn more about AI system design.
- They already have a small proof of concept, but are now reading a tutorial to see what others do for a similar project.
Code Validation Advice Sought: A member is seeking advice on how to prevent dangerous code (env leakage, rm -rfs, network requests, etc.) from being generated in any language.
- They considered a TreeSitter based validator, and asked how Aider avoids these issues, requesting pointers to relevant files in the repo.

Yannick Kilcher Discord

Reviewing Paper Baselines Presents Challenges: A member requested general guidelines for approaching baselines presented in research papers, particularly when unfamiliar with the dataset.
- The member expressed uncertainty about judging performance without sufficient knowledge of the dataset, implying a need for more background research.
LoRA Adds Instead of Replaces Original Weights: A member asked why LoRA trained layers are added instead of replacing the original weight matrix, noting the contrast with other efficient processes like depthwise convolutions.
- The member sought a paper, article, or reasoning to explain this design choice, rather than replacement, and mentioned having an intuition on the matter.

Manus.im Discord Discord

AI Politeness Gets Scientific Backing: A user shared a paper providing scientific evidence that being polite to AIs matters.
- The discussion centered around whether acting nicely towards AIs results in more cooperative behavior.
Demand for Scientific Validation of AI Politeness: Users expressed a desire for scientific proof that politeness influences AI behavior.
- This aligned with the shared arxiv link, suggesting a community interest in understanding the impact of human-AI interaction styles.

LLM Agents (Berkeley MOOC) Discord

AI Agents Curriculum Plans for 2025 Examined: A member inquired whether the 2025 Fall curriculum would mirror the 2024 Fall curriculum’s focus on Introduction to AI Agents.
- They explicitly requested a link to join the course, suggesting they are seeking registration or access details.
Fall 2025 enrollment: The user is specifically looking for information on how to join the Fall 2025 course.
- They explicitly requested the link, and will be awaiting course joining details.

The MLOps @Chipro Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.

The Windsurf Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.

You are receiving this email because you opted in via our site.

Want to change how you receive these emails? You can unsubscribe from this list.

Discord: Detailed by-Channel summaries and links

LMArena ▷ #general (1100 messages🔥🔥🔥):

Image generation issues, Video arena bot down, Login requirements, Rate limits, Account data loss

Image Generation Glitches Rampant: Users reported widespread issues with image generation, including persistent errors and infinite generation loops, with many experiencing the dreaded ‘Something went wrong with this response’ message.
- A member pointed out that ‘Sometimes the model is confused with the prompt and gives the same error… We seriously need to have more error feedback’ suggesting the need for more specific error messages.
Video Arena Bot grounded by Glitches: The video arena bot is currently down, with the team actively working to resolve the issues, however there is no ETA for when it will be back online.
- One member quipped that *‘The video bot currently isn’t working. Trying to use it in different channels isn’t going to work. Even if the bot was working properly you’re unable to use it in this channel.’
Login Requirements Trigger Tantrums: The introduction of login requirements, particularly the Google account requirement, has caused concern among users.
- One member pointed out the requirement was bad saying, because most of us don’t want to.
Rate Limits Rattle Regulars: Users noticed the implementation of rate limits, leading to frustration and discussions about whether they are intentional or a result of ongoing issues.
- A member commented that ‘if you arent logged in you get like 2 or 3 generations before being rate limited, even on battle’ while another was also confused asking ‘Yeah I’m confused as well’, wondering what exactly was added or changed.
Account Data Vanishes in Volatile Venture: Several users reported instances of lost chat histories, particularly when not logged in, leading to concerns about data retention.
- One member suggested, ‘if you use brave you might be able to restore them, i dont know about google’, while it’s also been noted that the platform is also likely using cloudflare.

LMArena ▷ #announcements (3 messages):

Video Arena Discord Bot, User Login, Rate Limits

Video Arena Discord Bot Back Online: The Video Arena Discord Bot is back online after a fix; to use the bot, enter /video with a prompt in the specified channels: <#1397655695150682194>, <#1400148557427904664>, or <#1400148597768720384>.
- A GIF illustrating how to use the bot was shared here.
Rate Limits Introduced for Image Generation: Due to unprecedented traffic, rate limits have been introduced for image generation.
- Logged-in users will continue to enjoy higher limits, and more information about user login can be found here.

Perplexity AI ▷ #announcements (1 messages):

iOS App Redesign, Comet Access for Students, Comet Shortcuts, Voice Assistant in Comet, GPT-5 Thinking for Pro Users

Perplexity Ships Six Hot New Features: Perplexity AI announced the release of six new features on September 5th, detailed in their changelog.
- These include an iOS App Redesign, Comet access for students, Comet shortcuts, a more capable Voice Assistant in Comet, GPT-5 Thinking for Pro users, and updates to Perplexity Finance.
Students Get Comet Access: Perplexity AI is now offering Comet access to students as part of their back-to-school initiative, announced September 5th.
- This aims to provide students with advanced AI tools for research and learning, integrating seamlessly with their existing educational workflows.
Pro Users get GPT-5 Thinking: Pro users can now access GPT-5 Thinking capabilities within Perplexity AI as of September 5th.
- This upgrade provides enhanced reasoning and problem-solving abilities, allowing for more in-depth analysis and insights.

Perplexity AI ▷ #general (823 messages🔥🔥🔥):

Grok 4 struggles, Qwen 3 Max, Comet Browser, Gemini 2.5 Pro, AI Model Parameter Size

University Pro Users Missing Grok 4: Some university Perplexity Pro users reported missing Grok 4 in their model selector, and were advised to contact support and to check if their Pro account was the enterprise version.
- It was suggested that reinstalling the app might help resolve the issue.
Goodbye Arc-a-Dia: Users discussed the transition from Arc to Dia, with one noting that Arc hasn’t had a meaningful update in about a year and another expressing concern about the $15 charge for the browser.
- They added that a large fanbase of happy Arc users would be left behind with the transition to building agentic chromium.
Qwen 3 Max Hype Surges: Members speculated on the specs of the upcoming Qwen 3 Max, anticipating parameters between 500B and 1 Trillion.
- One member stated that they believe that since the models are free for consumers that it’s for Better Training Data and Big Community.
Comet’s Limits Spark Debate: Users are reporting that after using Comet Personal Search too much, it stops working, throwing the message You’ve reached your daily limit for Comet Personal Search. Upgrade to Max to increase your limit.
- Others noted that Comet is currently offered on the Paypal/Venmo deal or if you’re a student, and are sharing invite links in the discord.
Perplexity’s Special Sauce: Fact-Checking: Members discussed the strengths of Perplexity compared to other platforms like ChatGPT and Gemini, highlighting Perplexity’s focus on fact-checking and web search.
- One user stated, I use Perplexity primarily for fact checking and quick research as that is its primary strength - facts with citation and references. It beats ChatGPT, Gemini and Claude hands down.

AMD Zen 6 CPUs, Omarchy Linux, Shareable Threads

AMD Preps Zen 6 CPUs: A member shared a link about AMD preparing Zen 6 CPUs.
Omarchy Linux Distribution: A member shared a link about the Omarchy Linux distribution.
Shareable Threads reminder: A Perplexity AI bot reminded users to ensure their threads are set to Shareable.

Perplexity AI ▷ #pplx-api (4 messages):

API 500 Errors, Playground issues, Outage reporting

500 Errors Plague PPLX API: Multiple users reported receiving 500 Internal Server Errors on API calls and noted that the Playground was also non-functional.
- The users confirmed that no outage was reported on the status page, while one quipped that they’re going to pretend nothing happened after the service appeared to be working again.
Image Analysis Troubleshoot Internet: The attached image prompted a suggestion to check the internet connection.
- This suggestion came as a response to the reported API and Playground issues, implying a possible user-side connectivity problem.

Unsloth AI (Daniel Han) ▷ #general (574 messages🔥🔥🔥):

Postgres with pgvector vs. Qdrant, Local Sonnet, DGX Spark vs DGX Station, Qwen 3 Max Evaluation

Postgres for Complex Queries, Qdrant for Vector Search: Members discussed that while Qdrant is good for vector search, Postgres with pgvector might be more suitable for complex database queries.
- One member linked to a tweet and shared a Borat GIF.
Local Sonnet Requires Hefty RAM, Quality Sacrificed: Running Local Sonnet requires at least 512GB of RAM, and even with 1TB of RAM, full precision is not possible.
- One member asked if Q8 fine-tuning would help, but another responded that even Q8 is too big for 1TB RAM.
DGX Spark for Toying, DGX Station for Work: Members compared the DGX Spark and DGX Station, with one noting that Spark is a toy, Station is a work station, linking to the DGX Station product page.
- It was noted that DGX Spark has a good price, with the ConnectX-7, but the 8 grand model originally had 4tb of storage and they upped it, a good product.
Qwen 3 Max Impresses in Creative Writing, Falls Short in Coding: Members evaluated Qwen 3 Max, finding it very good at creative writing and roleplay, better than K2 and Sonnet 4 imo.
- However, it was considered overpriced and not super super good for tool calls and logic based coding.

Unsloth AI (Daniel Han) ▷ #introduce-yourself (3 messages):

Unsloth AI, GPT-OSS, Google Colab T4, Runtime Error

Unsloth AI Troubleshooting: A member requested help with Unsloth AI while trying to finetune GPT-OSS using GRPO on Google Colab T4.
- The user reported encountering a runtime error and requested assistance from the community.
Colab T4 User Seeking GRPO Guidance: A user sought guidance on using GRPO (presumably a reinforcement learning technique) to fine-tune a model, GPT-OSS, on a Google Colab T4 instance using Unsloth AI.
- The user specifically mentioned encountering a runtime error during the process and was looking for help from the community.

Unsloth AI (Daniel Han) ▷ #off-topic (164 messages🔥🔥):

Super cards release updates, GLM 4.5 Air usable tps, Rover Mows Grass, Deepseek & Qwen Tokenizers are Interchangeable, Mini Kimi K2 MoE models

Super cards release updates janky: Members discussed that support is still janky after the Jan 30th release, but joked about major updates when the Super cards are released.
- Another member shared a meme implying updates are unlikely: Biden dance stare clueless gif.
GLM 4.5 Air runs Usably: A member noted that GLM 4.5 Air with 132K context, Q4 gets 1.15 tps, which is usable.
- Another member was shopping for parts and going to try out distributed first, and also mentioned not having tried full for kv.
Robot Rover mows grass: A member is thinking of stacking 3090s and saving budget for a rover to mow the grass.
- They mentioned the rover might run a small vision model with additional safety systems like lidar and sonar.
Deepseek & Qwen Tokenizers are Interchangeable: It was reported that the deepseek r1-0528 tokenizer is interchangeable with the qwen3 tokenizer.
- Members discuss if that meant they copied the same arch and distilled from the same model and put it back into the copied arch.
Mini Kimi K2 MoE models coming soon: Members are interested in a mini Kimi K2, maybe like a 30B MoE or less.
- Another member suggested a 150b with 1b active for similar sparsity.

Unsloth AI (Daniel Han) ▷ #help (40 messages🔥):

Training vs Inference, GPT-OSS finetuning issues, Gemma-3 finetuning errors, Tokenizer Impact on Finetuning, GRPO Support for Gemma3 with vLLM

Training Throughput vs Inference Throughput Discussed: A member was comparing token throughput during inference and expressed a desire to train using GRPO similar to Llama3, and offered to share non-functional code.
- They were using Unsloth 2025.9.1, Transformers 4.55.4, Tesla T4 with 14.741 GB, Torch 2.8.0+cu126, CUDA 7.5, CUDA Toolkit 12.6, and Triton 3.4.0.
GPT-OSS Finetuning Yields Meaningless Output: A member reported that after finetuning, the output channel and content of GPT-OSS became meaningless, displaying a series of mathematical symbols and unrelated words.
- The traceback of the error can be found here.
Gemma-3 Finetuning Generates Attribute Error: A member encountered an AttributeError: 'SlidingWindowLayer' object has no attribute 'max_batch_size' when running inference on Gemma-3-270M after finetuning.
- A suggestion to use use_cache=False was reported to have resolved the issue.
Tokenizer Selection Impact on Finetuning Discussed: A member questioned the impact of using a different tokenizer than the one that comes with the pre-trained model during fine-tuning.
- Another member stated that it’s like usar uma lingua diferente da que o modelo foi treinado em primeiro lugar, recommending to use the same tokenizer for the inputs to be in a ‘language’ the model understands.
CUDA linking problems: A member encountered an AttributeError: module 'bitsandbytes' has no attribute 'functional' when running code from this notebook.
- The warnings suggest that CUDA is not linked properly and recommend running sudo ldconfig /usr/lib64-nvidia and sudo ldconfig /usr/local/cuda-xx.x.

Unsloth AI (Daniel Han) ▷ #showcase (2 messages):

Glazer model, GPT-4's personality, Ollama, HuggingFace

Glazer Model Mimics GPT-4’s Praise-Heavy Persona: A new model called Glazer was released, designed to emulate the sycophantic personality of GPT-4 that some users miss.
- It can be run locally and is available on Ollama via ollama run gurubot/glazer and on Huggingface in 4B and 8B versions.
Unsloth receives gratitude for Glazer: The model received a thank you in the form of a picture that expressed gratitude for Unsloth.
- The post featured a slothhearts emoji.

Unsloth AI (Daniel Han) ▷ #research (11 messages🔥):

Latent Features, Hermes NLP, Financial AI

Latent Features Debunked: A member suggested latent parts of neural networks could destroy features, suggesting the bottleneck in the nn is not a real feature.
- He sarcastically remarks that if these were real features, everyone would be a billionaire, and dismisses claims of success as made by people with no idea.
Hermes is not for NLP tasks: A member said you can’t certainly use hrms to do anything like nlp.
- He ended the message with a all on red baby 😎.

LM Studio ▷ #general (97 messages🔥🔥):

GPU power draw concerns, Lora Training, Realistic roleplaying model, LM Studio local network setup, Consumer priced Exoskeleton

3090 possibly enabling 4090 Overclock: A member was concerned that their 3090 was allowing their 4090 to draw more power than it should when boosting, leading to higher than expected temperatures.
- They believed a software bug might be causing the GPU to exceed manufacturer limits, and mentioned usually undervolting to prevent overheating.
Tentacle Loras stomp art styles: A member shared they had trained a bunch of LORAs all stomped together
- Another member asked about the purpose and tentacle-like shape, the trainer replied that he uses them to explore certain art styles he finds interesting, providing a link to the LORA template.
6GB GPU Seeks Roleplaying Model: A member with a 6GB GPU from 2019 asked for the best model for realistic, immersive, long-lasting roleplaying games.
- Another member suggested increasing CPU RAM to at least 64GB and using Mistral Nemo Instruct; however, a third member suggested Qwen3-30B-A3B-Instruct-2507-MXFP4_MOE.
Phone chats from PC with Local LLM via LM Studio: A member asked about running AI on their PC while chatting on their phone, another member suggested using a client app that speaks to OpenAI API and connecting via local network or a tunnel for remote usage.
- After some troubleshooting involving server IPs and client apps like Apollo, the member successfully connected using ngrok and Open WebUI.
Bionic legs underperform: A member inquired about consumer-priced bionic legs (exoskeletons).
- Another member cited a YouTube review suggesting they barely do anything and might even cause muscle atrophy.

LM Studio ▷ #hardware-discussion (139 messages🔥🔥):

Frame Generation, Nvidia 5000 series, ATX 3.1 standard, CPU offload vs GPU, Mi50 VRAM quirk

Quadruple Frames for Smooth Gaming?: Members debate the utility of 4x frame generation, suggesting it’s only beneficial with a decent base FPS for achieving something like 240FPS at 4K.
Is Nvidia’s 5000 Series a ‘Skip Gen’?: The new Nvidia 5000 series might be a skip gen, due to minimal performance gains over the 4000 series, which already had excellent efficiency, but also because they are unwilling to add more VRAM.
ATX 3.1 Standard Fixes Power Connector Woes: The ATX 3.1 standard introduces the 12V-2x6 connector, addressing 12VHPWR issues with longer conductor terminals and shorter sense pins, allowing the GPU to shut off if the connection loosens.
CPU Offload not a Slam Dunk?: Experiences vary regarding CPU offload; one user found their desktop significantly faster than their server, even with a 4070 TiS performing similarly to an Mi50.
Mi50 VRAM Performance Quirk Uncovered: A peculiar characteristic of the Mi50 is identified: performance halves beyond the first 16GB of VRAM.

Cursor Community ▷ #general (154 messages🔥🔥):

GPT-5 vs Sonnet 4, Codex CLI vs Cursor Code, Claude Code, Gemini 2.5 Pro, Cursor Pricing

GPT-5 Reigns Supreme Over Sonnet 4: Members generally agree that GPT-5 is superior to Sonnet 4 for coding tasks, noting that while GPT-5 may require more specific prompting, it is less prone to hallucinations and provides more concise, accurate answers.
- Some users find Sonnet 4 more template-based and prone to jumping to conclusions, whereas GPT-5’s directness is preferred, making it a valuable planner and discussion partner especially when combined with auto-implementation.
Codex CLI vs Cursor Code: A Showdown: Users are divided on whether to use Codex CLI or Cursor Code, with some preferring Codex CLI for its code quality, while others laud Cursor Code for superior creative thinking, with reasoning abilities, also the quality depend heavily on the prompt.
- One member unsubscribed from Cursor Code’s Max plan due to frustration with hallucinations when fixing bugs, while others caution about Codex’s lower and harder to track rate limit; some like the suggestion system inside of the Codex CLI.
Cursor’s $20/m: Is It Worth It?: Several users discussed the value of Cursor’s $20/month Pro plan and how fast one can hit the limit.
- Some find it essential for coding, one user canceled their Cursor subscription in favor of Claude Code and Codex, suggesting the best combo is a $20/month Cursor subscription paired with a Claude Code plan for inline editing and terminal usage.
Beware Cursor’s Buggy Auto-Mode: Multiple users are experiencing issues with Cursor’s Auto mode, reporting that it performs poorly, fails to fix simple bugs, and sometimes types edits in the chat instead of applying them.
- One user humorously described Cursor as being excessively proud of its work despite the need for extensive debugging, sharing a meme-like message generated by the tool.

OpenRouter ▷ #announcements (1 messages):

Qwen3-Max, RAG, Tool calling

Qwen3-Max releases with several improvements: The latest Qwen3-Max model boasts higher accuracy in math, coding, logic, and science tasks compared to the January 2025 version, according to this X post.
- It also delivers better instruction following in Chinese and English, stronger multilingual support across 100+ languages, reduced hallucinations, and optimized for RAG and tool calling.
Qwen3-Max is optimized for RAG and Tool calling: The Qwen3-Max is optimized for RAG and tool calling and does not have a dedicated ‘thinking’ mode
- Try Qwen3-Max here to check out it’s capabilities.

OpenRouter ▷ #app-showcase (1 messages):

tomlucidor: Finds https://github.com/Lapis0x0/obsidian-next-composer

OpenRouter ▷ #general (126 messages🔥🔥):

OpenRouter Crypto Scam, Anthropic's Geopolitical Concerns, API Key Issues, BYOK Fees, Token Limits and Output Truncation

OpenRouter coin is fake: Members confirmed that any OpenRouter-related cryptocurrency is a scam and not officially affiliated with OpenRouter.
- Despite warnings, users inquired about the presence of an OpenRouter coin on PancakeSwap and its availability for trading, prompting further clarification that OpenRouter has no official involvement in any cryptocurrency.
Anthropic’s Geopolitical Stance Raises Eyebrows: Members discussed Anthropic’s latest blog post which prohibits access from jurisdictions with ownership structures subject to control from countries where their products aren’t permitted.
- Some wondered if this was a matter of national security or simply market share protection.
API Keys throw authentication errors: A user reported API key issues, receiving a ‘No auth credentials found’ error message from ChatGPT.
- The user was prompted to specify the client being used, either the OpenAI client or a custom one, to diagnose the authentication problem.
BYOK Fees Demystified: A user inquired about the charges associated with using BYOK (Bring Your Own Key), specifically with chutes and Qwen Coder 3.
- It was clarified that OpenRouter charges a 5% BYOK fee on top of what the provider (e.g., chutes) charges.
Token output limited to 8k: A user wanted to ensure errors are thrown when the output token limit is exceeded.
- The response gets cut off when the token limit is reached, with the stop reason identified as ‘length’; the API will prevent you from setting max_tokens higher than the model’s limit.

OpenRouter ▷ #new-models (1 messages):

Readybot.io: OpenRouter - New Models

OpenRouter ▷ #discussion (12 messages🔥):

Benchmark Increase, Real World Performance vs. Benchmarks, OpenRouter API usage

Benchmarks Keep Climbing!: Members noted that every benchmark keeps going up, but the disconnect between benchmark percentage increase and real-world performance keeps rising.
- They added that 5% delta on benches used to be noticeable but is becoming less so, as we are reaching a plateau, though models have improved in creative writing, EQ, tool call failures, and context length adherence.
OR Uses OpenAI Responses API: A member asked if OpenRouter uses the OpenAI Responses API, linking to a tweet.
- Another member confirmed that it does for the majority of OpenAI models.

OpenAI ▷ #ai-discussions (84 messages🔥🔥):

Multi-Agent Orchestration, Token Efficiency, Gemini 2.5 Pro, Good Luck Token Waste, Carbon Footprint of AGI

Orchestrate Agents with Context Offloading: One member suggests using multi-agent orchestration with context offloading and dynamically filtering extraneous context to improve vectorization.
- They recommend using a simple setup with orchestrators, a conductor, and specialized agents, emphasizing the importance of managing context to avoid corrupting HO/HD operations.
Slash Useless Token Waste: One member advocates for token efficiency by filtering grammatically and syntactically useless words and consolidating multiple words into useful ones in the prompt.
- They claim that in inference, wasted tokens = wasted resources [money] if you’re paying, and accelerated amortization of components if you’re hosting.
Gemini 2.5 Pro Unlocks Unlimited Access: Members reported Google AI Studio gives unlimited access to Gemini’s best model, 2.5 Pro with other features like Imagen, Nano Banana, Stream Realtime, speech generation, and Veo 2.
- Some members only care about the LLMs and view video and images as a fun factor, and others have found a real use for video editing, educational videos and recreating public domain.
Pleasantries like Good Luck Waste Tokens: One member argued that asking a question is wasting context, unless it is phrased in a way that gives multiple choices based on the answer.
- Another member suggested the phrase “Good luck” is a waste of tokens in terms of transmitting information, while politeness influences AI responses, potentially increasing environmental waste.
Carbon Footprint of Claims of AGI Outpaces Token Wastage: Some members discussed a blog post about new statistics revealing the carbon generated by claims of AGI now outpaces the token wastage used on please and thank you in ChatGPT.
- The blogpost can be found here.

OpenAI ▷ #prompt-engineering (10 messages🔥):

Discord chat to Markdown, Prompt engineering lessons, Hierarchical prompting, Abstraction in prompts, ML format matching

Discord Chat Text Transformation Tactics: A user inquired about the easiest method to extract text from a Discord chat’s web interface in MS Edge and save it as a Markdown file (*.MD).
Darthgustav Gives Prompt Engineering Lessons: A user named darthgustav shared a JavaScript code snippet outlining prompt engineering lessons.
- The lessons cover hierarchical communication with markdown, abstraction through variables, reinforcement in prompts, and ML format matching for compliance.
Hierarchical Prompting Techniques: One lesson from Darthgustav explained hierarchical communication with markdown for prompting, enhancing clarity and structure.
- By utilizing markdown, the prompt aims to organize information in a structured manner, making it easier for the model to follow instructions and generate desired outcomes.
Abstraction using Brackets in Prompts: The user introduced abstraction through bracket notation [{(open variables resolved by the AI)}] and ${(by the user)}.
- It emphasizes the importance of explaining bracket interpretation ([list], {object}, (option)) to efficiently manage complex prompts.
ML Format Matching for Output Compliance: One lesson includes ML format matching for compliance, covering [{output templates} and {(conditional) output templates}].
- The goal is to guide tool use and shape output more deterministically by reinforcing specific formatting in prompts.

OpenAI ▷ #api-discussions (10 messages🔥):

Discord Chat to Markdown, Prompt Engineering Lessons, Hierarchical Communication in Prompts, Abstraction in Prompts, Reinforcement in Prompts

Discord Text Dump to Markdown: A user inquired about the easiest method to extract text from a Discord chat (web interface, MS Edge browser) into a Markdown file.
- The user sought to optimize the process, focusing on simplicity and efficiency, implying a need for a straightforward solution for exporting Discord chat logs to .md format.
Prompt Engineering Instruction Manual: A user shared a detailed JavaScript code block outlining prompt engineering lessons, aiming to teach hierarchical communication, abstraction, reinforcement, and ML format matching.
- The lessons cover using markdown for prompting, abstraction techniques with bracket interpretation ([list], {object}, (option)), guiding tool use, shaping output deterministically, and using output templates for compliance.
Abstraction Elevation via Bracketology: The user emphasizes teaching abstraction through bracket interpretation such as [list], {object}, and (option) within prompts.
- This approach aims to enhance clarity and structure, enabling more effective communication between the user and the AI, improving overall prompt engineering practices.
Reinforcement Ramp-Up for Guidance: The user highlights the importance of reinforcement in prompts to guide [tool use] and (shape output) more deterministically.
- By strategically reinforcing desired behaviors, prompts can achieve higher precision and compliance, leading to improved results and more predictable AI interactions.

GPU MODE ▷ #general (3 messages):

Anthropic's new policy, Kernel creation solutions

Anthropic’s Policy Raises Eyebrows: A tweet regarding Anthropic’s new policy, prohibiting service to organizations controlled by jurisdictions where their products aren’t permitted (like China), sparked debate over whether this is about national security or mere corporate self-interest.
Navigating the Kernel Creation Cosmos: A member inquired about how to determine whether to build a custom kernel solution versus using existing ones.
- Another member suggested checking the HF kernel hub and exploring standards like liger before deciding to build from scratch.

GPU MODE ▷ #triton (2 messages):

Triton, CUDA, GPU, PMPP Book

Newcomer Seeks Triton Guidance: A member sought guidance on learning Triton without prior CUDA or GPU experience.
- Another member recommended the official Triton tutorials as a starting point.
Triton Resources: The user asked about resources to quickly get started with Triton.
- The user also inquired whether it’s required to read the PMPP book for Triton.

GPU MODE ▷ #cuda (14 messages🔥):

Barnes-Hut performance, CUDA, Morton code sorting, Octree construction, Memory access optimization

Barnes-Hut Performance Probed: A member is facing performance issues with a Barnes-Hut CUDA simulation, where the tree traversal and force computation kernel takes 100ms for 30k bodies, despite optimized Morton code sorting and octree construction.
- Another member suggested comparing it to torch.cdist and probing around with an LLM to check for access patterns.
Morton Sorting is sus: Leaf nodes are stored in a flat array sorted by Morton codes, fusing particles with identical codes into one leaf node.
- Members discussed how threads traverse the tree and retrieve values from memory.
Coalesced Memory Access Clarified: A member asked whether memory accesses are coalesced during tree traversal, given particles are sorted by Morton codes.
- The OP confirmed that threads in the same warp traverse the tree similarly, retrieving the same values, but still finds the 100ms runtime perplexing.

GPU MODE ▷ #torch (13 messages🔥):

fp8 matrix multiplication, tensor cores accumulator, Runtime Triggered Module Loading, vLLM profiling

Debate on Fused Accumulation: Members debated the difference between two options in PyTorch’s mm.py (lines 128-132) regarding fused accumulation in tensor cores.
- The first option might be a fused accumulation according to a member, and another mentioned a scenario of int8 MMA where the first version gives an error while the second doesn’t.
Deep Dive into Reduced Precision Accumulation: A paper (arxiv.org/pdf/2411.10958) revealed that the FP32 accumulator designed for FP8 matrix multiplication in tensor cores is actually FP22 (1 sign bit, 8 exponent bits, and 13 mantissa bits).
- fast_accum = True uses the tensor core’s accumulator for the entire main loop with reduced precision (~22 bits), while fast_accum = False sends the result of the tensor core op to a regular register accumulator in full FP32 precision.
Runtime Triggered Module Loading slows vLLM: During vLLM profiling, significant time is spent on ‘Runtime Triggered Module Loading’, but its precise meaning and how to avoid it during profiling are unclear.
- A member shared a trace and attached a [qwen3-1.7b-compile-cudablock.gz] in hopes of finding out more.

GPU MODE ▷ #algorithms (6 messages):

FlashAttention, FA1, FA2, FA3, FA4

FlashAttention Visualized: A member asked if their interpretation of Flash Attention (animation, source code here) was approximately correct.
- The fire animation is supposed to represent softmax/fused kernel.
FlashAttention Loop Orders: A member pointed out that the loop order in the original animation was reversed compared to FlashAttention v2 (FA2).
- In FA2, iteration along K/V is the inner loop, and iteration along Q/O is the outer loop.
FlashAttention Evolves Further: The original visualization was based on FA1, according to the original poster.
- It was noted that FA3 and FA4 also follow the general design of FA2, but are optimized for Hopper and Blackwell architectures, respectively.

GPU MODE ▷ #beginner (4 messages):

Model optimization roadmap, Sparse convolution in ONNX Runtime, BEV fusion model

Seeker asks for Model Optimization Roadmap: A member is seeking a roadmap to learn model optimization techniques, including writing custom kernels, with a focus on SMS count and VRAM usage.
- They plan to use a 5060 Ti with 16GB of VRAM.
Sparse Convolution Support Scarcity in ONNX Runtime: A member is trying to run a BEV fusion model using ONNX Runtime, but the hardware they are using doesn’t support PyTorch, and ONNX Runtime lacks support for sparse convolution.
- They are asking if sparse convolution can be replaced with other operators or if anyone has added sparse conv support in ONNX Runtime.

GPU MODE ▷ #irl-meetup (1 messages):

apaz: Now in NYC if anyone wants to meet up

GPU MODE ▷ #rocm (8 messages🔥):

rocSHMEM, ROCm-aware open MPI, HIP kernels, ROCm/iris

rocSHMEM Implementation Inquiry: A member is exploring rocSHMEM implementation similar to HIP kernels using load_inline, encountering errors related to ROCm-aware MPI requirements.
- The member referenced ROCm/rocSHMEM for dependency configurations and suggested incorporating them into the Dockerfile.
ROCm/iris alternative surfaces: A member suggested trying ROCm/iris as a possible alternative while they investigate the issue.
- The original poster agreed to try it out and expressed enthusiasm for the project, while another user was tagged as a potential user.

GPU MODE ▷ #webgpu (2 messages):

:catgirl5: emoji usage, thinking hard emoji

Catgirl Emoji spotted!: A member noted ‘Oh cool a :catgirl5: in the wild’, referring to emoji usage in the channel.
Catgirl becomes Thinking Emoji: A member stated it’s weirdly a good ‘thinking hard’ emoji lol in reference to the same.
- The community seems to have latched onto this emoji’s meme potential.

GPU MODE ▷ #self-promotion (1 messages):

GPU L2 Cache, Ampere Architecture, CUDA Project Structure, Persistent Memory Accesses

L2 Cache Persistence Boosts GPU Performance: Leveraging the Ampere architecture, a blog post demonstrates reserving part of the L2 cache for persistent memory accesses to improve GPU performance, detailed in a blog post.
CUDA Project Structuring with CMAKE Example: The provided code serves as an example of structuring a CUDA project using the CMAKE build system, enhancing code organization and maintainability.

GPU MODE ▷ #reasoning-gym (1 messages):

Contributions Welcome, Prototype Sharing, Pull Requests

Contributions Welcomed for New Tasks: The channel indicated that new task contributions are welcome, encouraging members to share prototypes.
- Alternatively, members can open a PR to the repo for iterative development.
Prototype Sharing Encouraged: Members are encouraged to share prototypes in the channel to gather feedback and iterate on their ideas.
- Sharing prototypes helps foster collaboration and accelerates the development process.

GPU MODE ▷ #submissions (1 messages):

MI300x8, amd-all2all leaderboard

MI300x8 scores on leaderboard: A submission to the amd-all2all leaderboard on MI300x8 was successful at 334 µs.
AMD all2all benchmark update: The latest result on the amd-all2all leaderboard showcases an impressive performance on the MI300x8 hardware.

GPU MODE ▷ #factorio-learning-env (6 messages):

Factorio Crafting Tool, FLE installation issues, Prototype Recipe Retrieval

Factorio Agent’s Prototype Recipe Retrieval: The get_prototype_recipe tool retrieves complete recipe information for any craftable item in Factorio, essential for understanding crafting requirements and planning production chains.
- The agent can use the get_prototype_recipe action to get a recipe for a single item and call it again if needed to get sub-recipies.
FLE Installation Experiences Turbulence: A member reported having issues during the installation of FLE.
- They mentioned they will listen into the meeting but will not engage much due to an important presentation afterwards.

GPU MODE ▷ #amd-competition (9 messages🔥):

CLI Tool vs Online Submission, ROCshmem Template, Web Version Organization, Online Testing Env Triton Support, Prize Registration Reminder

CLI Tool Trumps Online Submission: A participant found that they can use a CLI tool to view settings for num_experts instead of submitting online.
- Another participant mentioned that the web submission is the latest effort to make submissions accessible, but it’s still in alpha.
ROCshmem Template Quest: A participant inquired about a ROCshmem template and noted it requires ROCm-aware open MPI, wondering if these are included in the kernel bot workflows.
- No responses were made.
Web Version’s Organization Praised: One participant appreciated the improved organization of the web version.
- They suggested adding config-wise runtimes for enhanced helpfulness.
Triton Support Status Questioned: A participant inquired whether the online testing environment supports Triton.
- No responses were made.
Prize Registration Reminder: A reminder was issued that participants need to be registered to qualify for prizes, with registrations closing on September 20th.

GPU MODE ▷ #singularity-systems (1 messages):

cuBLAS, ROCm, cuDNN, MIOpen

Deep Dive into BLAS and DNN Internals: A member mentioned they are studying the codebase, and are looking for people with experience in the internals of cuBLAS/rocBLAS or cuDNN/MIOpen.
- They added that there will be “lots more to do these next few weeks” for those with the relevant expertise.
No Topics: No significant topics were discussed.
- Only one message was sent.

GPU MODE ▷ #general (21 messages🔥):

Pickling Errors, Serialization Issues, NaNs in Triton Kernels, Benchmarking Discrepancies

Pickling Problem Plagues Python Process!: A user encountered a TypeError: cannot pickle 'frame' object during evaluation, stemming from multiprocessing’s inability to serialize a specific object being passed between processes; here is the traceback.
Serialization Snafu Stymies Submission!: The error was attributed to the evaluation process using a separate process to prevent cheating, which exposed a serialization issue with the user’s submission.
- The user was advised to check the output of their custom_kernel function, as the error suggested that the function’s return value was not serializable.
NaNs Nab Numerical Nirvana!: The user identified the presence of NaNs (Not a Number) values in their kernel’s output as a potential cause for the serialization error, with a member confirming that NaNs can indeed cause such issues.
- The user initially suspected a grid error leading to the creation of these NaNs and expressed intentions to resubmit the code after fixing it.
Benchmark Blues Baffle Budding Benchmarker!: Despite passing initial test runs, the user continued to face errors during the benchmarking process, leading to the revelation that test runs and benchmarks are distinct and that the latter is designed to be more complex.
- The user was informed that the presence of NaNs in the benchmark output could still be an issue, even if test runs pass successfully, because we run the code multiple times, up to 100 per size.

Latent Space ▷ #ai-general-chat (58 messages🔥🔥):

OpenAI Custom AI Chip, Mercor $10B Pre-emptive Offers, Augment (Augie) $85M Series A, OpenAI Responses API, Hugging Face FineVision Dataset

OpenAI Co-Designs $10B AI Chip with Broadcom: Financial Times reports OpenAI partnered with Broadcom to co-design a custom AI chip, with mass production slated to start next year, indicating a move away from Nvidia dependency; the chip is estimated to cost $10B.
- Community reactions range from skepticism about the chip’s quality to speculation that OpenAI will out-compete its own customers; link to article.
Mercor Receives $10B Pre-emptive Offers: AI-hiring startup Mercor has received unsolicited offers valuing it at ~$10B—5× its June 2025 Series B price—just four months later, spurring jokes about the AI-funding frenzy; link to tweet.
Augie Raises $85M Series A for AI Logistics: Augment (Augie) announced an $85M Series A—bringing total funding to $110M in just 5 months—to scale their AI teammate built for the $10T logistics sector; link to announcement.
- Augie already helps freight teams handling $35B+ double productivity by orchestrating end-to-end order-to-cash workflows across email, calls, Slack, TMS and more.
Responses API Myth-busting Thread: A thread clarified widespread confusion about the OpenAI Responses API, debunking myths that Responses is a superset of Completions, can be run statelessly, and unlocks higher model intelligence & 40-80% cache-hit rates; link to thread.
- Developers stuck on Completions are urged to switch to Responses for GPT-5-level agents, with pointers to OpenAI cookbooks.
Baseten Bags $150M Series D: Baseten announced a $150M Series D round led by BOND with Jay Simons joining the board; the company powers AI inference for customers like Writer, Notion, Sourcegraph, and others, and welcomed new investors Conviction and CapitalG; link to annoucement.

Latent Space ▷ #ai-announcements (4 messages):

AI Engineer CODE Summit 2025, NYC AI Event

AI Engineer CODE Summit 2025 Announced: The AI Engineer team is launching its first dedicated CODE summit this fall in NYC, gathering 500+ AI Engineers & Leaders alongside top model builders and Fortune-500 users to unpack the reality of AI coding tools - announcement link.
- The summit is invite-only, with two tracks (Engineering & Leadership), no vendor talks, and a CFP open until Sep 15.
AI Engineer Summit Focus: The AI Engineer CODE Summit 2025 aims to celebrate PMF (Product-Market Fit) while addressing MIT’s statistic that 95% of enterprise AI pilots fail.

Latent Space ▷ #genmedia-creative-ai (21 messages🔥):

Nano Banana, AI Girlfriend, AI Design Masterclass, Nvidia Cosmos DiffusionRenderer

Nano Banana floods timeline with AI art: Logan Kilpatrick tagged @NanoBanana (Google’s newest banana-branded image model), prompting a single “hello world” banana billboard, sparking a frenzy of creative prompts from users.
- The thread exploded into a viral AI-art playground, generating art from prompts like Elon-Sam-Demis-Ilya selfie and Winnie the Pooh in China, while also sparking jokes, praise, and complaints about AI slop (see example).
AI Girlfriend earns cash: @EyeingAI used DesireBots.com to create an AI girlfriend chatbot named “Ada,” charging $9/month and earning $1,142 in a week from 500+ users.
- The process involved a no-code chatbot setup and built-in monetization tools, showcasing a simple way to generate revenue with AI (see tweet).
AI Design Masterclass: Meng To released a 58-minute tutorial on creating professional-grade designs with AI, using aura.build and its 740 remix-ready templates that export to HTML/Figma.
- His design team shifted from Figma to Aura, now shipping a template daily (vs. one every two weeks), while learning HTML along the way and using Unicorn Studio for animated hero sections (see tutorial).
Nvidia’s open-source AI relighting demo: Nathan Shipley demoed Nvidia’s open-source Cosmos DiffusionRenderer, which decomposes short 1280×704 video clips into stable passes (depth, normals, base color, etc.) for relighting with custom HDR maps.
- The tool allows relighting with custom HDR maps and examples include a home movie and a famous film scene, sparking praise for its stability and criticism for its uncanny results and current limits (57-frame max, CLI setup, garbled faces).

DSPy ▷ #general (78 messages🔥🔥):

Voice Agents with DSPy, GEPA Optimization for Prompts, Multi-Turn Conversations, Groq for Inference, RAG vs Fine-tuning

DSPy-Powered Voice Agents: A Budding Symphony: Members discussed building voice agents with DSPy and explored using GEPA to optimize prompts for frameworks like Livekit and Pipecat.
- One member suggested using the optimized prompt from GEPA as a straightforward string, while acknowledging that this might feel anti-DSPy.
GEPA: More Than Just Prompt Optimization: It was noted that while DSPy creators might cringe at the term prompt optimization, tools like GEPA can indeed be used for this purpose.
- For prompt creation, it was suggested to setup a Rubric type judge to assess generated responses, especially at the conversation level, and Groq was recommended for inference.
Multi-Turn Musings: DSPy’s Conversational Capabilities: While a member found no satisfying implementation of multi-turn conversations with DSPy or RL applications like GEPA or GRPO, DSPy is fully capable of handling multi-turn conversations using dspy.History.
- However, it was cautioned that defining examples well is crucial, as it’s easy to introduce bias when building chat systems.
RAG vs Fine-Tuning: The Memory Game: The discussion addressed how to equip voice agents with extensive information (hours, services, pricing, etc.) without runtime latency, with some approaches being fine-tuning or retrieval.
- While fine-tuning can build in memorization, it’s a big job. RAG can be simple functions or maps, things like hours of operation don’t need to be a vector database.
Streaming Strategies: Riding the Token Wave: Members explored the impact of streaming responses (token by token) on user experience, with a key focus on minimizing Time To First Token (TTFT).
- While streaming doesn’t reduce TTFT, it enhances user perception by providing immediate feedback, and libraries like Pipecat already do a good job of that too, in the way that they stream frames (i think in 250 ms chunks by default).

Moonshot AI (Kimi K-2) ▷ #general-chat (75 messages🔥🔥):

Kimi K2 API Credits Giveaway, Anthropic API Integration, Kimi K2 Turbo Preview, Kimi K2 Model Performance, Kimi Starter Subscription

Kimi Giveaway API Credits Incoming: A user who won the Kimi giveaway was informed that the API credits would be sent shortly and the crew is arranging it.
- The credits were expected to be sent within an hour.
Anthropic API MIA: A user inquired whether the Anthropic API is available on the new model.
- It was clarified that kimi-k2-turbo-preview points to -0905.
Kimi’s 0905 Model Debuts: It was confirmed that the turbo model is now using the 0905 model, having been updated from the 0711 model.
- Some users expressed concerns about the new K2 model’s tendency to be over poetic.
Kimi K2 Team Dreams Big: A member clarified that the team is smaller compared to Grok/OAI, but has big dreams and a big model.
- They added it is a good thing since usually, the bigger the company, the less user interaction there is.
Coding Focus Confuses Kimi Users: Users are confused by the focus on coding improvements in the new Kimi K2 model.
- One user stated that 0711 is better than 0905, but another user thinks the writing is more detailed & better.

Nous Research AI ▷ #general (65 messages🔥🔥):

real time video AI, Spiking Neural Networks, cameras (image sensors) that are a bit closer to how the human eye works, Meta wristband reads body electricial signals to control smart glasses, Hermes's unique behavior in the husky holdem benchmark

Spiking Neural Networks Spark Interest: Members discussed Spiking Neural Networks (SNNs) and how they mimic the brain, with one sharing a YouTube video about it.
- Another mentioned cameras and image sensors that work closer to how the human eye does, sharing this video.
Meta’s Wristband to Read Body Signals: Meta is set to introduce a wristband that reads body electrical signals to control smart glasses, according to this Nature article.
Hermes Exhibits Unique Holdem Behavior: A member noted that Hermes has extremely OOD unique behavior in the husky holdem benchmark, observing it’s super conservative play style in a way no other model does.
ADHD resources shared!: A member shared resources for ADHD, motivation, learning, and productivity, including a video on Certainty Window, Salience Network and “Push/Pull”Activities, Professor Huberman’s Dopamine, Mindset and Drive, and Forming Habits is the Under-rated Strategy to Success.
- Another user chimed in saying, medication was only thing that fixed my adhd but really good tips even with meds on these links.
Deepmind and Huawei cookin’ something special: A member stated to keep an eye on Deepmind and Huawei progress going forward with B. Neural Network, and particularly with Huawei future Quantum ( room temperature) system, that is gve a real freakout to U.S gov.

Nous Research AI ▷ #interesting-links (7 messages):

Micro-LLM Experiments, SLM Agents by NVIDIA, Hermes Agent Size

Lovecraftian LLM Arises!: A member experimented with a micro-LLM trained on H.P. Lovecraft’s stories, finding the output quite promising as the loss was still decreasing when training stopped, view the youtube video.
- They speculate that a 3 million parameter model could become a light chat model with the right dataset and sufficient training.
NVIDIA unleashes SLM Agents!: A member shared a link to NVIDIA’s research on SLM Agents (project page) and an accompanying paper (arxiv link).
- No further details were discussed about this resource.
Hermes Agent Targets 30B Parameters: A member stated they are targeting a 30B parameter model for their Hermes Agent.
- No further details were discussed.

Modular (Mojo 🔥) ▷ #mojo (60 messages🔥🔥):

Zig's async IO, Mojo's type system, MLIR, Vectorization of Loops, Compiler Customization

Zig’s Async IO Faces Doubts: Concerns arise in other language communities regarding the viability of Zig’s new approach to async IO, while Mojo’s type system and effect generics may solve some of the problems, such as vtables everywhere.
- A member mentioned that IO needs to haul around state now, the days of being able to freely call IO from anywhere are likely numbered, referring to this discussion in Ziggit.
Achieving SIMD Nirvana: Members discussed the goal of writing simple, Pythonic code that automatically compiles to SIMD instructions, using Mojo and MLIR for optimal parallelized assembly without relying on LLVM to correctly vectorize code.
- A member dreams of a world where for loops are automatically compiled for the metal I’m carrying, in this case being 8 or 16 lanes instead of just keep hammering lane zero.
Unveiling Vectorization Secrets: To fully vectorize code, especially loops, the compiler needs sufficient information about input data shapes or must perform speculation to identify hot loops for vectorization, clarifying that Mojo encourages the use of portable SIMD libraries.
- It was mentioned that on CPUs and AMD GPUs, scalar and vector operations have separate execution resources, and both can ideally run at the same time.
GPU Kernel Maturity Check-up: A member inquired about the maturity of writing GPU kernels in Mojo, specifically regarding implementing a Mamba2 kernel for use in PyTorch, and was pointed to Modular’s custom kernels tutorial.
- It was clarified that while MAX (Modular’s API to a graph compiler) is not primarily targeted at training, it can be used for inference, and MLA has already been implemented for inference (see GitHub).
Span Abstraction Dream: A member expressed a desire for Span (a contiguous slice of memory abstraction) to become an easily usable, auto-vectorized tool, with algorithms that work on NDBuffer (being ported to LayoutTensor) as part of Span.
- They noted that while existing implementations are manual and parametrized with hardware characteristics, there isn’t much compiler magic at hand.

Eleuther ▷ #general (46 messages🔥):

MasterCard Fraud Prevention AI, Obscenity Rule Enforcement, Brand Risk Mitigation, AI-induced psychosis, Semantic Drift

MasterCard’s AI Fraud System Sparks Controversy: MasterCard replaced fraud prevention staff with an AI system, leading to conflicts with merchants over obscenity rule enforcement, detailed in Chapter 5.12.7 of mastercard-rules.pdf.
- The system flags more transactions as obscene, with fees up to $200,000 per violation and $2,500 per day of noncompliance, creating incentives to avoid admitting fault.
Insufficient Criteria Plague Automated Fraud Prevention: The issue stems from insufficiently specified criteria in the obscenity rules, lacking clear examples of safe items, causing shallow and confusing gradients for the LLM.
- The discussion highlighted how unwritten policies and approaches need to be made explicit to avoid issues that arise from automated enforcement without adequate context.
Brand Risk Drives Over-Enforcement: Pressures from mutual funds to mitigate brand risk, like board diversity targets, lead to over-enforcement and denial of policy changes within MasterCard’s fraud department.
- This over-enforcement impacts merchants, and MasterCard’s focus on concealing issues hinders the development of useful monitoring metrics, as any flaw discovered would create a problem that needs to be solved, to protect their career.
AI Consultant Questions Automation Plausibility: An AI consultant expressed skepticism about automating their job, citing the need for knowledge, understanding of relevant context, and wisdom, qualities AI lacks.
- Despite this, a medically induced crisis of faith made them question the value of their qualities which are still unlikely to be automated any time soon.
ArXiv Endorsement Request Raises Eyebrows: A researcher requested endorsements for an arXiv paper on semantic drift, sparking suspicion due to recent cases of AI-induced psychosis.
- Concerns were raised due to the use of terms associated with AI-generated nonsense, prompting a request to share the paper for review.

Eleuther ▷ #research (6 messages):

GRPO Baseline, SFT + KL regularization

Members Ponder GRPO Baseline: Members discussed the possibility of using a GRPO baseline for a project.
- One member asked did you have an GRPO baseline? to which another responded no, this will be next.
SFT + KL regularization Possibility Raised: A member suggested exploring SFT (Supervised Fine-Tuning) with KL (Kullback-Leibler) regularization as a potential approach.
- This came up in response to a shared link on the topic of RL_Razor, and the member stated oh, would be interesting to try SFT + KL regularization.

HuggingFace ▷ #general (37 messages🔥):

Reward Function Weighting in RL, Anthropic Policy on Jurisdiction Control, Causal Model Training with Attention Bias, Tokenizer and Attention Bias Implementation, RAG Applications with Limited Context Size

Anthropic’s Policy Raises Eyebrows: Members discussed whether Anthropic’s new policy prohibiting organizations controlled by jurisdictions where their products aren’t permitted is truly about national security or simply corporate self-interest.
- The debate centers on the motivations behind restricting access based on ownership structures.
Deciphering Reward Weighting in RL: Members were seeking studies on whether weighting reward functions during RL is beneficial, aiming to avoid experimenting without prior knowledge.
- One member shared a document regarding reward weighting in RL.
Biasing Attention in Causal Models Explored: A member sought advice on modifying causal model training with SFTTrainer to purposefully add attention bias for specific words, referencing the Attention Bias paper.
- Suggestions included checking specific terms/tokens against common tokenizers and considering alternative approaches for loss calculation and gradient signal control.
Tackling Tokenizers to train for biases: Guidance was provided on how to bias attention to specific words, recommending to test how those will be tokenized before starting the whole training.
- It was suggested to use tools such as gradio or streamlit to achieve this goal.
RoPE to the Rescue in RAG Context Expansion: Members discussed tips for building RAG applications with LLMs under very limited context sizes (4096 tokens).
- One tip involved using models with RoPE and fine-tuning them with a larger context size, referencing this repo and emphasizing that RoPE enables models to perform well even on context it hasn’t been trained on.

HuggingFace ▷ #today-im-learning (2 messages):

“

No points in sharing: A member said that there is no point in sharing.
Negative attitude: The general sentiment in the channel appears to be negative, discouraging further contributions.

HuggingFace ▷ #i-made-this (1 messages):

Enron Email Dataset Parser, Structured Parquet Files, Email Analysis

Enron Emails get Parsed into Parquet: A member uploaded their parser for the Enron email dataset, resulting in 5 structured parquet files.
- The files include: Emails, Users, Groups, Email/User junction, and Email/Group junction.
Duplicates get Managed via Hashing: Parent and child emails have been parsed, and duplicates are managed both by file and message hashes/caches.
- All messages are included as MD5 hash objects.
Dataset good for Group Behavior Analysis: The dataset would be great for analysing the behaviour between groups, and NLP.
- The member noted where to get the dataset but did not include the data itself.

HuggingFace ▷ #computer-vision (1 messages):

FastVLM

FastVLM could be speedy solution: A member suggested trying FastVLM to address speed concerns.
- They shared a Hugging Face Collection link for the project.
Another topic: Another member tried to add information about a different topic.
- This demonstrates how to add a second topic when there is more information.

HuggingFace ▷ #smol-course (5 messages):

smol-course, GitHub Readme

Smol-course location surfaces: A member asked what is a smol-course?
- Another member promptly shared the GitHub link.
Smol Course Confusion: A member stated that they can’t find anything other than the readme and stuff of old 2024 course
- The member repeated Same here multiple times, indicating a difficulty in locating the intended course content.

HuggingFace ▷ #agents-course (2 messages):

agents course, greetings

Course Launch: Sweden & Italy Say Hello!: Enthusiastic members from Sweden and Italy kicked off the agents course today.
- One participant noted some prior experience with AI agents, ready to dive deeper.
Global AI Enthusiasts Unite!: Participants from Sweden and Italy have officially started the agents course.
- One of the new members mentioned bringing some previous AI agent knowledge to the table.

tinygrad (George Hotz) ▷ #general (8 messages🔥):

Digital Ocean MI300X errors, Z3 version issues, Kernel removal project

Digital Ocean MI300X Stable Diffusion Fails: Users encountered an issue running the stable_diffusion.py example on a Digital Ocean MI300X GPU instance, tracing back to some z3 issue.
- The error couldn’t be reproduced on a Mac, but mnist_gan.py was tested.
AMD_LLVM=1 causes TypeError: A TypeError involving unsupported operand types (BoolRef) arose when using AMD_LLVM=1 during a simple mnist training loop.
- George Hotz suggested trying IGNORE_OOB=1, indicating it might be a z3 version issue, with some overloads added in z3>=1.2.4.0, and provided a link.
Kernel Removal Project Interest: A user inquired about contributing to the kernel removal project.
- No additional information was provided about the nature of contributions that would be helpful.

aider (Paul Gauthier) ▷ #general (5 messages):

Warp Code, Aider's strengths, Aider success stories

Warp Code wins hearts: A user reported that Warp is getting very nice and that Warp Code feels like the difference between driving a stick and manual.
- Warp is good with the embeddings search for when you don’t know files and want to get the sense of a new codebase.
Aider still shines despite losing Claude: A user switched from aider to Claude Code months back but came back because Anthropic have made some questionable changes, preferring aider for its simplicity and replicating Claude Code’s plan mode with /ask.
- The user now uses Gemini 2.5 Pro as the main model, Gemini Flash as a weak model, and Qwen3 Coder as the editor model, using /run to replicate command-line tools like checking the latest git diff or running tests.
Run command in Aider is a Major Feature: The /run in aider is a major feature for a user, and they noted Aider is good when you know better what files you want to work with.
- They also enquired where they can see Aider’s success stories.

aider (Paul Gauthier) ▷ #questions-and-tips (2 messages):

Coding Agent Refactoring, Aider's Code Validation, TreeSitter Validator

Coding Agent undergoing Refactoring: A member is refactoring their own coding agent, inspired by Aider, to learn more about AI system design.
- They already have a small proof of concept, but are now reading a tutorial to see what others do for a similar project.
Validating Generated Code Across Languages: The member is seeking advice on how to prevent dangerous code (env leakage, rm -rfs, network requests, etc.) from being generated in any language.
- They considered a TreeSitter based validator, and asked how Aider avoids these issues, requesting pointers to relevant files in the repo.

Yannick Kilcher ▷ #general (4 messages):

Baselines in Papers, LoRA Training

Guidelines for Reviewing Baselines in Papers Sought: A member inquired about general guidelines for approaching baselines presented in research papers, particularly when unfamiliar with the dataset.
- They expressed uncertainty about judging performance without sufficient knowledge of the dataset, implying a need for more background research, and suggested reading more papers.
Why LoRA Adds Instead of Replaces?: A member asked why LoRA trained layers are added instead of replacing the original weight matrix, noting the contrast with other efficient processes like depthwise convolutions.
- They sought a paper, article, or reasoning to explain this design choice, rather than replacement, and mentioned having an intuition on the matter.

Yannick Kilcher ▷ #ml-news (1 messages):

erkinalp: https://www.all-hands.dev/blog/the-path-to-openhands-v1

Manus.im Discord ▷ #general (3 messages):

AI Politeness, Scientific Evidence for AI Politeness

AI Behaving Nicely Scientifically Proven: A user shared a link to a paper scientifically proving that you should be polite to your AIs.
- It appears to be related to the question of whether or not AIs are more cooperative when you act nicely to them.
Manus Users Agree on Nice AIs: Two users agreed that they want scientific proof that politeness matters to AIs.
- This is likely related to the earlier link from arxiv on the same topic of AI politeness.