Open models are all you need?
AI News for 9/5/2025-9/6/2025. We checked 12 subreddits, 544 Twitters and 22 Discords (186 channels, and 3961 messages) for you. Estimated reading time saved (at 200wpm): 324 minutes. Our new website is now up with full metadata search and beautiful vibe coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on @smol_ai!
In July, we last commented on Kimi K2 being the largest SOTA open-weights model released to date, and today Moonshot AI updated their model weights again and released new benchmarks in their paper:
The big new entrant though is Qwen 3 Max, releasing a 1T param model for the first time, obviously beating its smaller siblings. They declined to release hparams, instead calling it "Max", but it still seems that the model weights will be released in short order, so it's unclear why exactly they are breaking their own MoE naming schema.
China is overwhelmingly winning the open model war, it seems.
AI Twitter Recap
China's long-context coding surge: Kimi K2-0905 and Qwen3-Max preview
- Moonshot's Kimi K2-0905 (open weights) ships a practical agents upgrade: Kimi doubled context to 256k, improved coding and tool-calling, and tuned integration with agent scaffolds (Cline, Claude Code, Roo). It's already live on multiple stacks: Hugging Face weights/code, Together AI, vLLM deployment guide, LMSYS SGLang runtime (60-100+ TPS), Groq instant inference (200+ T/s, $1.50/M tokens), and Cline integration. Community reports emphasize that "agents really need ultra-long context" for stability and tool orchestration (Teknium). Claims of "meets or beats Sonnet 4" surfaced in demos, while Kimi engineers acknowledged SWE-Bench remains challenging (@andrew_n_carr, @bigeagle_xd).
- Qwen3-Max-Preview (Instruct): 1T-parameter scale, agent-oriented behavior: Alibaba introduced its largest model yet (over 1T parameters), available via Qwen Chat, Alibaba Cloud API, and now OpenRouter (announcement, OpenRouter). Benchmarks and early users point to stronger conversations, instruction following, and agentic tasks relative to prior Qwen3 models. Community reaction frames it as a "US-grade frontier model" with competitive pricing and throughput (reaction, scale tease). Details on dense vs MoE remain unspecified in public channels.
Evals, agents, and what to measure
- "No evals" vs "evals that matter": A widely-shared thread argues many top code-agent teams ship without formal evals, while vendors evangelize them; the nuance is that early 0-to-1 success often comes from dogfooding + error analysis before codifying evals (@swyx, receipts). Follow-ons advocate for richer, causal evals of long-horizon capability (e.g., months-long tasks, protocol replication, strategy games, real-world setups) and domain-specific enterprise workflows that today's leaderboards miss (@willdepue, ideas, @levie, @BEBischof). A pragmatic tip: use models as discriminators to rank outputs, since generator/discriminator gaps can be leveraged in practice (@karpathy).
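Karpathy's discriminator tip is easy to operationalize. A minimal sketch, assuming the OpenAI Python SDK; the model id, prompts, and the best_of_n helper are illustrative, not from the thread:

```python
# Minimal sketch of the generator/discriminator pattern: sample N candidates,
# then have a model pick the winner (judging is often easier than generating).
from openai import OpenAI

client = OpenAI()

def best_of_n(task: str, n: int = 4, model: str = "gpt-4o-mini") -> str:
    # Generator pass: n independent, higher-temperature samples.
    candidates = [
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": task}],
            temperature=0.9,
        ).choices[0].message.content
        for _ in range(n)
    ]
    # Discriminator pass: rank the candidates with the same (or a stronger) model.
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    verdict = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": (
            f"Task: {task}\n\nCandidates:\n{numbered}\n\n"
            "Reply with only the index of the best candidate."
        )}],
        temperature=0,
    ).choices[0].message.content
    return candidates[int(verdict.strip().strip("[]"))]
```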
- Operationalizing evals and traces in agent stacks: CLI-first agents plus semantic search can outperform ad-hoc RAG for document tasks; LlamaIndex shows SemTools handling 1,000 arXiv papers with UNIX tooling + fuzzy semantic search (post). For RL pipelines, THUDM's slime provides a clean rollout abstraction integrating tool calls and state transitions, reducing glue code in agentic RL experiments (overview).
Inference and post-training advances
- Decoding and planning: Meta's Set Block Decoding (SBD) samples multiple future tokens in parallel, cutting forward passes 3-5x with no architecture changes and KV-cache compatibility; trained models match standard NTP performance on next-token prediction (summary). For agents, "always reasoning" (ReAct) isn't optimal; new work trains models to learn when to plan, dynamically allocating test-time compute to balance cost and performance (thread, paper context).
- Post-training theory and results: "RL's Razor" argues on-policy RL forgets less than SFT, even at matched accuracy, by biasing toward KL-minimal solutions, with toy + LLM experiments supporting reduced catastrophic forgetting (summary). A "Unified View of LLM Post-Training" shows SFT and RL optimize the same reward-with-KL objective; Hybrid Post-Training (HPT) switches between them via simple performance feedback and consistently beats strong baselines across scales/families (overview). On the empirical side, Microsoft's rStar2-Agent-14B uses agentic RL to reach frontier-level math (AIME24 80.6, AIME25 69.8) in just 510 RL steps, with shorter, more verifiable chains of thought (results).
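For reference, the shared objective both papers revolve around is the standard reward-with-KL trade-off (notation assumed here, not copied from either paper); SFT and on-policy RL can then be read, roughly, as different estimators of the same quantity, which is what HPT's switching exploits:

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot\mid x)}\!\left[r(x,y)\right]
\;-\;
\beta\,\mathbb{D}_{\mathrm{KL}}\!\left[\pi_{\theta}(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right]
```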
GPU stacks, kernels, and platforms
- ROCm quality regression in PyTorch: Analysis alleges a growing deficit of ROCm-only skipped/disabled tests (>200 each), with a net increase since June 2025; reports say even core transformer ops (e.g., attention) have been disabled for months, harming developer trust. AMD leadership has reportedly reprioritized fixes (report). PyTorch maintainers note broad test-skipping is endemic and requires sustained contributor attention across subsystems (context, quip). Separately, PyTorch published a kernel deep-dive on 2-simplicial attention implemented in TLX (Triton low-level extensions) (kernel post).
- Infra momentum and meetups: Together AI announced a $150M Series D led by BOND (Jay Simons to board) to scale inference infra (annc); Baseten also raised $150M Series D as it rolls out performance work and EmbeddingGemma support (annc). vLLM is hosting a Toronto meetup on distributed inference, spec decode, and FlashInfer (event) and already supports Kimi K2 deployments (support).
OpenAI ecosystem: ChatGPT branching, Responses API, and Codex
- Product/API shifts: ChatGPT now supports conversation branching (@gdb; @sama). OpenAI's Responses API got an in-depth explainer (thread); the AI SDK v5 now defaults the OpenAI provider to Responses (Completions remains available) (note). Some devs countered that Responses complicates context portability and stateless usage in practice (critique), while others observed improved "chain-of-thought preservation" in ongoing conversations vs Chat Completions (anecdote).
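For readers comparing the two styles, the calls look roughly like this in the OpenAI Python SDK (model id is a placeholder). The portability debate boils down to: Chat Completions is stateless (the caller resends history), while Responses can thread state server-side via previous_response_id:

```python
from openai import OpenAI

client = OpenAI()

# Chat Completions: caller owns the conversation state.
chat = client.chat.completions.create(
    model="gpt-4o",  # placeholder
    messages=[{"role": "user", "content": "One-line summary of KV caching?"}],
)
print(chat.choices[0].message.content)

# Responses API: the server threads state via previous_response_id.
first = client.responses.create(model="gpt-4o", input="One-line summary of KV caching?")
follow = client.responses.create(
    model="gpt-4o",
    input="Now give a concrete example.",
    previous_response_id=first.id,  # server-side continuation
)
print(follow.output_text)
```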
- Coding agents and GPT-5 Pro: Multiple practitioners report GPT-5 Pro inside Codex can unblock gnarly engineering problems with deeper, slower passes; "smarter" beats "faster" was the sentiment in a public exchange with Sam Altman (experience, follow-up, @sama). The Codex CLI/IDE continues shipping rapidly (changelog).
Embeddings and retrieval move on-device (and hit limits)
- Small, fast, local: Google's new open-source EmbeddingGemma got day-0 platform support (e.g., Baseten), with reports of embedding 1.4M docs in ~80 minutes on an M2 Max for free and better quality than older large paid models (Baseten, field result). On-device retrieval is getting easier: SQLite-vec + EmbeddingGemma runs fully offline across languages/runtimes (guide).
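A minimal local-retrieval sketch in the spirit of the guide, using sentence-transformers with plain cosine similarity instead of sqlite-vec; the Hugging Face model id and the tiny corpus are assumptions for illustration:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")  # assumed HF id

docs = [
    "SQLite is an embedded relational database.",
    "RoPE rotates query/key vectors to encode token positions.",
    "EmbeddingGemma is a compact multilingual embedding model.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)  # unit-norm rows

def search(query: str, k: int = 2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                      # cosine similarity
    top = np.argsort(-scores)[:k]
    return [(docs[i], float(scores[i])) for i in top]

print(search("small local embedding models"))
```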
- Single-vector limits: New theory/benchmark "LIMIT" shows hard lower bounds on top-k retrieval under fixed embedding dimensions, with SOTA models failing on deliberately stress-tested simple tasks: evidence that some combinations of relevant documents are intrinsically unrecoverable with single-vector embeddings, motivating multi-vector/late-interaction approaches (summary).
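A toy 2-D illustration of the geometric intuition (this is not the LIMIT construction): under inner-product scoring, a document strictly inside the convex hull of the others can never rank first, no matter which query vector you choose:

```python
# Some relevance patterns are unreachable for a single-vector embedder.
import numpy as np

docs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, -1.0],
    [0.05, 0.05],   # interior doc: always dominated by one of the others
])

rng = np.random.default_rng(0)
queries = rng.normal(size=(100_000, 2))
winners = np.argmax(queries @ docs.T, axis=1)
print(np.bincount(winners, minlength=4))  # doc 3 never wins top-1
```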
Top tweets (by engagement)
- "The ability to predict the future is the best measure of intelligence." – @elonmusk
- Kimi K2-0905 update (256k, coding/tool-calling, agent integration) – @Kimi_Moonshot
- Qwen3-Max-Preview (Instruct), "over 1T parameters," now live via Qwen Chat/Alibaba Cloud – @Alibaba_Qwen
- ChatGPT conversation branching now live – @gdb
- GPT-5 Pro in Codex praised for solving hard coding tasks with deeper passes – @karpathy
- "Very requested feature!" (ChatGPT branching) – @sama
- ROCm regression in PyTorch testing – @SemiAnalysis_
- DeepMind's "Deep Loop Shaping" improves LIGO gravitational wave detection – @demishassabis
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
1. Kimi K2-0905 and Qwen 3 Max Launches + Early Demos
- Kimi-K2-Instruct-0905 Released! (Score: 729, Comments: 192): Release announcement for Kimi-K2-Instruct-0905 with an attached benchmark/leaderboard image comparing it to other LLMs (e.g., DeepSeek). The chart is presented as showing K2-Instruct-0905 performing near SOTA and ahead of DeepSeek, with a commenter calling out a "1t-a32b" variant, possibly indicating a notable configuration highlighted in the results. Image: https://i.redd.it/6jq7r55ak9nf1.png. Commenters claim it's "very close to SOTA" and "clearly beats DeepSeek," while noting it may be larger; discussion centers on size-performance trade-offs and the strength of the "1t-a32b" variant.
- Performance claims: commenters assert Kimi-K2-Instruct-0905 is "very close to SOTA" and "beats DeepSeek" albeit being larger; treat as anecdotal until verified. Cross-check the benchmark chart shared in the thread (image) and the model card on Hugging Face for head-to-heads versus DeepSeek variants (e.g., V3/R1) on standard suites like MMLU, MT-Bench, GSM8K, and HellaSwag.
- Scale/architecture hints: references to a "trillion-parameter" open-source model and a "1T-A32B" variant suggest a MoE-style setup where total parameters can be ~1T while active parameters per token are far lower (e.g., tens of billions). Clarifying total vs active params, routing/expert count, and training token budget is key to interpreting claims that it outperforms smaller dense baselines like DeepSeek at higher compute. "1T-A32B" likely denotes a ~32B active slice within a ~1T total-parameter regime, but verify on the model card before comparing efficiency (see the back-of-envelope sketch after this list).
- Resources: official release is on Hugging Face: https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905. Check the card for evaluation tables, context length, tokenizer details, and quantization/inference notes (e.g., int4/int8), as well as licensing and any hardware recommendations to reproduce reported benchmarks.
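A back-of-envelope sketch of what the "1T total / 32B active" reading would imply for memory vs compute; every number here is an assumption to be checked against the model card:

```python
# Rough MoE sizing arithmetic for a hypothetical "1T-A32B" config.
total_params = 1.0e12   # ~1T parameters across all experts (assumed)
active_params = 32e9    # ~32B parameters touched per token (assumed)

# Storage: every expert must be held somewhere, even if rarely routed to.
for name, bytes_per_param in (("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)):
    print(f"weights at {name}: ~{total_params * bytes_per_param / 1e9:,.0f} GB")

# Compute: a forward pass costs ~2 FLOPs per ACTIVE parameter, so per-token
# cost resembles a 32B dense model despite the 1T capacity.
print(f"~{2 * active_params / 1e9:.0f} GFLOPs per token")
print(f"active fraction: {active_params / total_params:.1%}")
```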
- Qwen 3 max (Score: 269, Comments: 93): Qwen 3 Max is now available via the OpenRouter model hub and a web preview at Qwen Chat (OpenRouter, chat.qwen.ai). Pricing on OpenRouter is tiered by context length: input USD 1.2 (≤128K) / USD 3 (>128K) and output USD 6 (≤128K) / USD 15 (>128K), implying support for contexts beyond 128K and placing it near frontier-model pricing tiers (e.g., Claude/GPT). Commenters note prior Qwen Max variants were closed-source and express hope this release will have open weights on Hugging Face; others remark the pricing positions it alongside top-tier proprietary models.
  - Pricing details: Input is listed as $1.2 for contexts <128K and $3 for ≥128K; output is $6 (<128K) and $15 (≥128K). Commenters note this places Qwen 3 Max's cost structure close to Claude and GPT tiers, implying a frontier-model pricing posture and a separate long-context SKU at the 128K cutoff.
  - Release/availability expectations: Prior "Qwen Max" was closed-source; commenters hope for a Hugging Face release but others suggest this one is likely API-only (not locally runnable) at launch. This indicates uncertainty about open weights and potential lack of immediate local quantizations (e.g., GGUF) for on-device inference.
  - Model size speculation: One user infers Qwen 3 Max "must be bigger than 235B," suggesting expectations of a very large dense model surpassing earlier Qwen baselines. This is unconfirmed, but if accurate it would put Qwen 3 Max in the top tier of parameter counts among 2024+ LLMs, aligning with its frontier-like pricing.
- I've made some fun demos using the new kimi-k2-0905 (Score: 161, Comments: 24): OP showcases several demos built with the new kimi-k2-0905 using a single-pass, AI-generated prompt workflow that pairs Claude Code with kimi-k2-0905. The shared prompt resources are published as gists: gist 1 and gist 2; the demo video link on v.redd.it returns HTTP 403 without login, limiting independent verification (original link). A commenter proposes a capability stress-test: ask the model to generate a full Game Boy emulator end-to-end. (Another non-technical comment was ignored.)
  - A commenter shared concrete prompt templates for kimi-k2-0905, linking two gists that appear to provide reusable prompt scaffolding and examples for consistent behavior and demo replication: https://gist.github.com/karminski/52a72d4726128c10a266bfb8270fe632 and https://gist.github.com/karminski/0435b69c6d8c93b4bd1724b64e43bd75. These resources are useful for standardizing system instructions/roles and I/O formatting when evaluating K2 across tasks.
  - There's a proposed stress-test: have K2 generate a full Game Boy emulator end-to-end. This would probe long-horizon code generation, multi-file project scaffolding, and hardware reasoning (instruction decoding, timing/cycle accuracy for CPU/PPU/APU, ROM loading), offering a stringent benchmark versus other frontier models.
  - Multiple requests focus on head-to-head evaluation and tooling: comparing kimi-k2-0905 to Claude Opus and guidance for using K2 with Claude Code. Useful axes for comparison would include code generation pass@k, long-context reliability, tool-use quality, latency, and cost; integration with Claude Code would likely require an OpenAI/Anthropic-compatible API layer or an adapter to map chat and tool-call schemas.
2. Open-Source LLMs: GPT-OSS 20B Home Server & Weekly Release Roundup
- Converted my unused laptop into a family server for gpt-oss 20B (Score: 176, Comments: 94): OP repurposed a 2021 MacBook Pro M1 Pro (16 GB unified RAM) as a 24/7 family LLM server running gpt-oss 20B via the llama.cpp server, reporting 46-30 tok/s, 32K context, ~1.7 W idle and ~36 W under generation; the 20B model + large context narrowly fits in 16 GB, so the system runs headless over SSH, with sleep/auto-updates disabled, Dynamic DNS for WAN access, and battery health managed while plugged in (a native Apple charger measured more efficient than a generic GaN). The model is described as fast, concise, and compliant, but occasionally emits "very strange" factual errors; OP speculates possible weight corruption or low-quality fine-tuning. Comments request a setup guide and whether it's bare-metal, and ask about tweaks to improve responses; one user notes success on a non-Mac by removing the battery and upgrading RAM to 32 GB. Another recommends LM Studio using Apple's MLX stack and serving through Open WebUI in Docker for auth + web search, questioning if the OP avoided it due to the 16 GB constraint.
  - Reproducibility hinges on sharing exact llama.cpp (repo) runtime parameters; commenters request flags like -t (threads), -ngl (GPU layer offload/Metal), -c (context), and -b (batch size), plus the quantization (e.g., Q4_K_M vs Q5_K_M) and exact model variant. Reported throughput varies from ~8 tok/s (Ollama/LM Studio defaults) to a claimed ~40 tok/s on an M1/16GB; differences likely stem from quantization, GPU offload, and batching, so posting the full parameter set is essential for fair comparison (see the sketch after this list).
  - On Apple Silicon, several note better reliability/perf by using LM Studio with the MLX backend (MLX) versus other options; some pair this with Open WebUI in Docker for auth/search. Stack choice impacts speed and resource headroom: MLX/Metal acceleration and bare-metal runs can beat containerized UIs and Ollama defaults, while Dockerized setups trade some performance for convenience/features.
  - Hardware constraints are a key limiter for 20B inference: upgrading a PC laptop to 32 GB RAM (and removing the battery for 24/7 use) improved stability and enabled higher-precision quants; Macs can't upgrade RAM, making M1 16 GB notably constrained. This context helps explain why heavier UIs/backends may be avoided on low-memory machines in favor of leaner llama.cpp servers.
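As a hypothetical reproduction recipe, the requested llama.cpp flags map onto llama-cpp-python like this; the model file and every value are placeholders, not the OP's actual settings:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="gpt-oss-20b.Q4_K_M.gguf",  # placeholder quant + filename
    n_ctx=32768,        # -c: context window the post reports
    n_gpu_layers=-1,    # -ngl: offload all layers (Metal on Apple Silicon)
    n_threads=8,        # -t: CPU threads for any non-offloaded work
    n_batch=512,        # -b: prompt-processing batch size
)

out = llm("Q: Name one moon of Jupiter.\nA:", max_tokens=16)
print(out["choices"][0]["text"])
```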
- List of open models released or updated this week on this sub, just in case you missed one. (Score: 220, Comments: 30): Weekly roundup highlights new/updated open models across tasks and scales: Moonshot AI's Kimi K2-0905; AI Dungeon's Wayfarer 2 12B & Nova 70B (open-sourced narrative roleplay LLMs); Google's EmbeddingGemma (300M) multilingual embedding encoder; ETH Zürich's Apertus multilingual LLM (~40%+ non-English training data); WEBGEN-4B web-design generator trained on ~100k synthetic samples; Lille (130M) small LLM; Tencent's Hunyuan-MT-7B & Hunyuan-MT-Chimera-7B MT/ensemble models; GPT-OSS-120B benchmark updates; and Beens-MiniMax (103M MoE) scratch-built SFT+LoRA experiments. Coverage spans ~103M to ~120B params, with notable techniques/data mentions including synthetic data generation, MoE, multilingual emphasis, roleplay fine-tuning, and translation ensembles. Comments note strong reception for Kimi; the WEBGEN team adds that a non-preview release and more UIGEN models are forthcoming, and that 4B checkpoints serve as internal thermometers to validate their pipelines.
  - Sparse-MoE drops stood out: Klear-46B-A2.5B-Instruct uses a 46B-parameter mixture-of-experts with only ~2.5B active per token, so per-token compute scales with the active experts, not the total size; similarly, LongCat-Flash-Chat 560B MoE pushes total params higher while keeping per-step cost bounded. For local inference, throughput is governed by the number of activated experts (and sequence length), enabling large-capacity behavior at modest compute if routing remains sparse and load-balanced, though all expert weights must still fit in memory or be offloaded.
  - Specialized models expanded: Step-Audio 2 Mini (8B) adds open speech-to-speech capability; Neeto-1.0-8B targets medicine and reports 85.8 on a medical benchmark; and Anonymizer SLMs provide privacy-first PII replacement at 0.6B/1.7B/4B scales for edge/server use. Translation saw breadth with YanoljaNEXT-Rosetta and CohereLabs/command-a-translate-08-2025, while vision/mobile got attention via Apple's FastVLM and MobileCLIP2 on Hugging Face.
  - From a workflow perspective, the WEBGEN team notes they use ~4B models as an internal "thermometer" to validate training/inference pipelines before scaling up, which is a practical proxy for detecting regressions early. Separately, users plan to evaluate Gemma embeddings for clustering; for rigorous comparison, consider intrinsic (cosine separation, silhouette) and extrinsic metrics (NMI/ARI) against baselines like E5 or text-embeddings models (a minimal sketch follows this list).
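A minimal sketch of that intrinsic/extrinsic comparison with scikit-learn; the embeddings and labels are synthetic stand-ins for a Gemma-vs-baseline comparison:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score, silhouette_score)

rng = np.random.default_rng(0)
# Pretend these are document embeddings from the model under evaluation.
emb = np.vstack([rng.normal(0, 1, (50, 64)), rng.normal(3, 1, (50, 64))])
true_labels = np.array([0] * 50 + [1] * 50)

pred = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(emb)
print("silhouette:", round(silhouette_score(emb, pred), 3))            # intrinsic
print("NMI:", round(normalized_mutual_info_score(true_labels, pred), 3))  # extrinsic
print("ARI:", round(adjusted_rand_score(true_labels, pred), 3))           # extrinsic
```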
3. AI/LLM Race Discourse and Meme Reactions
- The AI/LLM race is absolutely insane (Score: 189, Comments: 146): Meta-discussion noting the rapid cadence of LLM releases and infra over the last 3-6 months, especially code-focused and general models like Alibaba's Qwen2.5-Coder, THUDM's GLM-4, and xAI's Grok-2, plus the rise of third-party API platforms hosting heavier models (e.g., OpenRouter, Together). The OP frames it as a bubble-vs-platform-shift question, pointing to relentless iteration on throughput ("new way of increasing tps"), the shift from local to hosted inference, and heavy corporate CAPEX, layoffs/poaching, and M&A as signals of a high-velocity market regime. Top comments are split between enthusiasm ("These ARE the good old days"), a macro take citing UBS's forecast of ~$0.5T AI investment in 2026 with ~60% YoY growth (arguing scale exceeds typical hype cycles), and moderation concerns noting the post may be off-topic for r/localllama.
  - UBS is cited forecasting ~$0.5T in AI investment in 2026 with ~60% YoY growth, implying massive near-term demand for compute (GPUs/TPUs), high-speed interconnects (400G/800G), and power/cooling capacity. For local/edge LLMs, this scale-up could affect GPU availability/pricing and spur rapid infra buildouts, but also raises the risk of overcapacity similar to past infrastructure cycles.
  - Practitioner-run, task-specific benchmarks often fail to reproduce headline SOTA gains, suggesting many reported improvements are narrow, cherry-picked, or brittle to prompt/distribution shifts. The discussion urges skepticism of single-paper claims and the casual use of "SOTA" as an objective yardstick, advocating rigorous replication, ablations of proposed components, and evaluation across diverse datasets/use cases to validate real capability gains.
  - Some frame the current phase as a dot-com-like buildout focused on GPUs and extended context windows, where marketed advances (more VRAM, longer sequences) face practical limits: latency, cost, and quality degradation over very long contexts. The view is that a portion of perceived progress is aggressive market capture (subsidized usage now, price hikes later) rather than consistent, generalizable step-function model improvements.
- This is not funny... this is simply 1000000% correct (Score: 1697, Comments: 128): Non-technical meme critiquing current AI hype: it suggests companies (especially CEOs) push "AI" initiatives primarily to please markets and inflate stock prices, while job postings across roles now demand vague "AI experience" regardless of real need. The image contextualizes the broader trend of AI-washing: adding AI buzzwords to products, roadmaps, and hiring to signal innovation rather than deliver concrete value. Commenters note near-ubiquitous AI requirements in tech job ads and argue executives chase an "AI premium" independent of business benefit; some satirically suggest AI could replace CEOs, underscoring frustration with hype-driven leadership decisions.
  - Thread consensus notes a hiring trend where "AI experience" is mandated across technical roles without specifying models, frameworks, or measurable outcomes, reflecting top-down, market-driven directives rather than concrete implementation plans. Comments suggest the business objective is headcount reduction and stock-price signaling ("What do we need for that? AI!") rather than validated ROI, with no discussion of benchmarks, latency/cost metrics, or deployed systems, indicating AI as a checkbox skill rather than a scoped technical requirement.
Less Technical AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo
1. OpenAI-Broadcom Chips, Google Veo/Nano Banana, Nunchaku v1.0 Releases
- OpenAI set to start mass production of its own AI chips with Broadcom (Score: 522, Comments: 58): Reuters (citing the FT) reports that OpenAI will begin mass-producing its own custom AI accelerators with Broadcom, aiming to reduce reliance on Nvidia GPUs and lower training/inference costs while securing supply. This mirrors hyperscaler strategies (e.g., Google TPUs, AWS Trainium/Inferentia) but carries substantial risks around upfront NRE, manufacturing yield, time-to-market, and building/optimizing the software stack to fully utilize the hardware. Source: https://www.reuters.com/business/openai-set-start-mass-production-its-own-ai-chips-with-broadcom-ft-reports-2025-09-05/ Commenters characterize it as a high-risk, high-reward and ultimately âobviousâ strategic move: smart if it works (cost/control advantages), but a massive gamble given the capital intensity and execution risk.
- Strategic rationale and risks: Commenters frame this as mirroring Google's TPU play (custom accelerators to cut dependence on Nvidia and optimize TCO for training/inference), which could deliver workload-specific efficiency and capacity control. The downside is massive upfront NRE, tapeout/yield risk, long bring-up, and the need to mature a compiler/runtime and kernel library (XLA-like) to approach GPU-class performance; failure would strand significant capex. See TPUs for precedent: https://cloud.google.com/tpu
- Scope of deployment: The thread highlights that, per reporting, OpenAI aims to use the chip internally rather than sell it, i.e., "OpenAI planned to put the chip to use internally rather than make it available to external customers...". This implies tight co-design with OpenAI's training/research stack and no third-party productization, reducing external support/validation burden but limiting amortization across customers and focusing optimization on their own models/pipelines.
- Manufacturing/competitiveness: Partnering with Broadcom suggests a full ASIC with advanced packaging/HBM; competitiveness vs GPUs hinges on process node, yields, memory bandwidth, interconnect, and software tooling. Beating H100/B200 on perf/W and cost-per-token would secure supply/cost advantages; missing those targets leaves high sunk costs. Reference GPU baselines: https://www.nvidia.com/en-us/data-center/h100/ and https://www.nvidia.com/en-us/data-center/blackwell/
- Google is on fire... Nano Banana & Veo are absolute game-changers (Score: 210, Comments: 14): Post hypes Google's latest generative models, Veo (text-to-video) and "Nano Banana" (the community nickname for Gemini's image generation/editing model, distinct from the on-device Gemini Nano LLM), as "game-changers." Veo is Google/DeepMind's video model for high-fidelity, longer-duration 1080p text-to-video with coherent motion, camera control, and style conditioning; see DeepMind: Veo. Comments highlight the realism of the demo and request explicit art-style conditioning (e.g., "add Saturn Devouring his Son"), while another geopolitical remark is non-technical and not directly relevant to model capabilities.
- Nunchaku v1.0.0 Officially Released! (Score: 305, Comments: 91): Nunchaku v1.0.0 ships a backend migration from C to Python for broader compatibility and adds asynchronous CPU offloading, enabling Qwen-Image diffusion to run in ~3 GiB VRAM with claimed no performance loss. New wheels and a ComfyUI node are available (release, ComfyUI node), plus a 4-bit, 4/8-step Qwen-Image-Lightning build on Hugging Face (repo); docs cover install/setup (guide). Roadmap: kick off Qwen-Image-Edit imminently and add Wan 2.2 support next. Commenters nudge prioritization toward Wan 2.2 over 2.1 and note enthusiasm for faster image generation workflows. One asks about Nunchaku compatibility with Chroma (examples currently show Flux), implying interest in broader model/runtime support.
  - Questions center on model compatibility: can Nunchaku work with Chroma, given examples showcase FLUX? Others ask whether LoRA is supported for fine-tuning/adapters. This suggests users want broader model/runtime abstraction beyond the showcased FLUX pipelines.
  - Multiple users request WAN 2.2 support (some preferring it over WAN 2.1), with one quoting that "WAN2.2 hasn't been forgotten – we're working hard to bring support!". Emphasis is on keeping parity with current model versions for state-of-the-art image generation; no concrete timelines or technical plan details were provided in-thread.
  - Upgrade reliability: a tester reports the in-app Manager update "almost never works," often requiring manual uninstall/reinstall to reach new versions (e.g., v1.0.0). This points to packaging/update pipeline issues that may hinder smooth adoption and automated environments.
2. AI Robotics: Figure Home Chores and RAI Robomoto
- **Will figure.ai take over home chores?** (Score: 350, Comments: 223): Thread discusses whether Figure AI's humanoids (e.g., Figure 01) could handle full-spectrum home chores. No benchmarks or implementation details are provided (linked video is behind Reddit login at v.redd.it); commenters define an MVP capability set: end-to-end laundry, mopping, dishwasher loading/unloading, cardboard breakdown, trash/bin logistics, and vacuuming, implying reliable mobile manipulation in unstructured homes (deformable-object handling, force-controlled tool use, long-horizon task planning, perception, and safety). A consumer price tolerance of USD $30-50k is cited if these tasks are executed robustly. Notable sentiments: strong willingness to adopt at the stated price if routine chores are solved; speculation that competent in-home cooking plus drone ingredient delivery could reshape restaurant demand and last-mile logistics; general hope for near-term timelines without concrete evidence.
  - The chore list (end-to-end laundry, mopping, dishwasher loading/unloading, cardboard flattening, trash logistics, vacuuming) implies hard requirements: robust deformable-object manipulation (cloth, bags, cardboard), tool use, appliance interfacing, long-horizon task planning, and home-scale navigation in clutter. Benchmarks highlight the gap: BEHAVIOR-1K long-horizon household tasks [https://behavior.stanford.edu/behavior-1k], iGibson [https://svl.stanford.edu/igibson], and Habitat [https://aihabitat.org] show success rates degrade in unstructured settings. Hitting acceptable cycle times (e.g., folding a basket in <10-15 min) and recovering from errors without human resets is as crucial as dexterity. The stated willingness to pay $30-50K suggests BoM targets of a mobile base + 1-2 arms + RGB-D sensors are viable only if reliability approaches appliance-like duty cycles.
  - Multiple commenters flag the "certain conditions" demo gap: real homes vary widely (lighting, layouts, object novelties), so sim/demo policies must generalize and recover from failure. For 50-200-step chores, per-step reliability must be ≥99.9% to keep task success high (0.999^100 ≈ 90% vs 0.99^100 ≈ 36%), which far exceeds staged demo rates; the compounding arithmetic is spelled out in the sketch after this list. This demands self-calibration, continuous mapping, grasp-under-occlusion, compliant control, and safety interlocks, with MTBF in tens of hours between human interventions; hence the "couple years" to appliance-grade deployment.
  - Assistive and cooking scenarios raise the bar: force-limited compliant manipulation, food-safe materials, heat/grease handling, contamination-aware tool use, reliable multimodal interfaces, and robust environment understanding. Teleop-to-imitation systems like Mobile ALOHA show bimanual kitchen tasks under curated conditions [https://mobile-aloha.github.io], but end-to-end autonomy also needs pantry inventory tracking, recipe/temporal planning, and integration with delivery logistics (drones/robots), which face reliability and regulatory constraints. These requirements exceed today's vacuum/mop robots and are the gating factors for impactful aging-in-place or blind-user assistance.
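The compounding arithmetic behind that per-step reliability point, assuming independent step successes:

```python
# Long-horizon task success compounds per-step reliability: ~p**n for n steps.
for p in (0.99, 0.999, 0.9999):
    row = ", ".join(f"{n} steps: {p**n:.1%}" for n in (50, 100, 200))
    print(f"per-step {p}: {row}")
# per-step 0.99 at 100 steps -> ~36.6%; per-step 0.999 at 100 steps -> ~90.5%
```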
- Another day, another AI driven robomoto (Score: 321, Comments: 46): The Reddit post links to a short X/Twitter clip from the RAI Institute showcasing a riderless, AI-driven motorcycle performing basic balance/riding maneuvers (video). The post/video includes no technical details (e.g., control stack, sensors, training method, or quantitative performance/robustness metrics), and the mirrored Reddit-hosted video (v.redd.it) returns 403 to unauthenticated clients. Top comments are largely non-technical memes; the only semi-substantive point raised is curiosity about validation at higher speeds on a supersport platform and highway conditions.
  - A recurring technical question was why quadrupeds (robot dogs) appear agile while humanoids look clunky. Commenters note quadrupeds benefit from passive/static stability and simpler gait planners, whereas bipeds are underactuated and require real-time whole-body control (ZMP/MPC), high-bandwidth torque control, and reliable contact estimation; adding dexterous hands compounds the problem with ~28-40 DOF vs ~12-16 on many quadrupeds. Progress exists but is still brittle outside demos (see BD Atlas parkour for state of the art: https://www.youtube.com/watch?v=tF4DML7FIWk).
  - On "strap it to a supersport," prior art shows autonomous motorcycle control is feasible without a human rider's body: Yamaha MOTOBOT rode an R1M at >200 km/h using GPS/IMU fusion and model-based control of throttle, brake, clutch, shifting, and steering to induce roll via counter-steering (https://global.yamaha-motor.com/showroom/technologies/ymrt/motobot/). The hard parts are low-latency control under rapidly changing tire-road friction and maintaining stability across low-speed balance vs high-speed dynamics; anthropomorphic actuation to "grab" the bike is unnecessary when drive-by-wire is available. Related balancing approaches on bikes (e.g., Honda Riding Assist) highlight how steering geometry and active control manage stability at low speeds: https://global.honda/innovation/robotics/experimental/riding-assist/.
3. AI Society: Inequality, Layoffs, Deepfakes, and Accessibility
- Computer scientist Geoffrey Hinton: "AI will make a few people much richer and most people poorer" (Score: 216, Comments: 76): In a recent Financial Times interview, Geoffrey Hinton warns that current AI deployment will concentrate wealth and power in a small set of firms while reducing incomes for most workers, exacerbating inequality and social risk. He urges stronger oversight, safety research, and governance before further rapid roll-out to mitigate labor-market displacement and broader systemic harms. Sources: no-paywall archive link, FT original paywalled. Top comments frame this as a continuation of capitalism's widening wealth gap, with AI accelerating the trend; some read Hinton's tone as ironically "more optimistic now." Another thread asserts that concentrated gains are a deliberate feature benefiting incumbents, not an unforeseen bug.
  - Several commenters highlight a structural tax asymmetry: hiring humans triggers payroll taxes (e.g., US employer FICA ~7.65% plus mandated benefits) while deploying robots/software incurs no payroll tax, effectively making automation's total cost of ownership lower than labor for equivalent tasks. They argue this acts as a de facto subsidy accelerating capital-labor substitution and concentrating returns to capital owners, and reference ideas like a "robot tax" or shifting tax burdens from labor to capital to rebalance incentives (see IRS/SSA FICA overview: https://www.ssa.gov/pubs/EN-05-10003.pdf; policy debates: https://www.oecd.org/tax/tax-policy/taxation-and-the-future-of-work.htm).
  - Another thread contends the unequal distribution of AI gains is not technologically inevitable but driven by institutional choices that tax labor-linked transfers heavily while taxing large wealth transfers (inheritances/capital gains) comparatively less, allowing AI-driven productivity to accrue primarily to asset owners. They frame this in terms of factor income shares and bargaining power, noting long-run declines in labor share as a warning signal (e.g., US nonfarm business labor share trend: https://fred.stlouisfed.org/series/PRS85006173), and propose reorienting tax/transfer systems toward wealth and capital income to avoid "neo-feudal" dynamics.
- Salesforce CEO confirms 4,000 layoffs "because I need less heads" with AI (Score: 290, Comments: 69): Salesforce CEO Marc Benioff confirmed ~4,000 customer-support layoffs, reducing support headcount from ~9,000 to ~5,000, attributing the cuts to AI-driven efficiencies from its Agentforce system, saying "because I need less heads", per CNBC. Salesforce says AI has reduced support case volume and it won't backfill affected support engineer roles; internally, AI reportedly handles up to 50% of work. Top commenters argue firms are invoking AI to justify post-pandemic over-hiring corrections and to signal efficiency to investors (citing analyst Ed Zitron), predicting more AI-attributed layoffs as the current hype cycle deflates.
  - Several commenters argue the 4,000 layoffs attributed to "AI efficiency" lack technical substantiation: no disclosed productivity metrics, automation coverage, infrastructure cost reductions, or model/inference choices. They note this mirrors a broader post-pandemic overhire correction being reframed as AI-driven without benchmarks (e.g., tickets handled per agent, leads per AE, cost-to-serve deltas). The absence of details like which models, fine-tunes, or workflow automations actually replaced FTEs makes the claim hard to evaluate.
  - An anecdote about building a personal CRM with AI underscores how LLM-assisted scaffolding can accelerate CRUD apps and simple automations, potentially eroding the moat of generic SaaS. However, replacing Salesforce at enterprise scale requires non-trivial capabilities: complex role hierarchies/ACLs, compliance (SOC 2/HIPAA), data model extensibility, integration throughput (ETL/event buses), observability, and SLAs; areas where DIY + LLM still imposes significant ongoing ops and reliability burden.
  - Expectation of more firms citing AI for headcount cuts until the hype normalizes, absent hard ROI. Technical readers would expect quantifiable proof such as >X% workflow automation, ~$Y/seat license consolidation, or inference spend offset by labor savings; none were referenced, suggesting investor-signaling rather than measured AI-driven efficiency.
- An Update: Ben can now surf the web thanks to Vibe Coding in ChatGPT (Score: 1387, Comments: 77): A caregiver built a custom AAC/accessibility stack via "vibe coding" with ChatGPT that enables a nonverbal quadriplegic with TUBB4A-related leukodystrophy and severe nystagmus to browse content using a binary, two-button headband input with on-screen scanning selection. The system evolved from a phrase board to a media picker, a predictive-text keyboard, and 8 custom games, culminating in search integrated directly into the keyboard so the user can type queries and independently retrieve images/YouTube videos; demo link: v.redd.it/6qzlngnab8nf1 (currently returns 403/auth-gated). Implementation emphasizes low-vision, low-fine-motor constraints with binary input scanning and UI options sized/sequenced for minimal visual demand, all prototyped by a novice using ChatGPT for rapid iteration. Commenters encourage sharing/replicating the approach for other families and suggest enabling the user to co-create via ChatGPT (user-in-the-loop prompt engineering) to expand functionality.
  - A commenter suggested shifting toward end-user programming by giving Ben direct access to ChatGPT so he can prototype and build his own tools/automations, noting that user-driven iterations often surface solutions others wouldn't anticipate. This implies extending the current "Vibe Coding" workflow from caregiver-authored prompts to user-authored scripts/macros, increasing personalization and autonomy in assistive tech.
- Tech CEOs Take Turns Praising Trump at White House - "Thank you for being such a pro-business, pro-innovation president. It's a very refreshing change," Altman said (Score: 804, Comments: 207): At a White House event (reported as a Rose Garden dinner), multiple tech CEOs publicly praised President Trump's stance as "pro-business, pro-innovation," with Sam Altman (OpenAI) quoted as saying, "Thank you for being such a pro-business, pro-innovation president. It's a very refreshing change." The only source provided is a paywalled WSJ link; no agenda, policy commitments, participant list, or technical outcomes (e.g., regulatory changes, funding programs) are available from the shared materials. Top comments are overwhelmingly critical of CEOs' integrity and of Altman personally, offering no technical or policy analysis; an image link is shared without context. Overall, the thread reflects skepticism toward corporate motives rather than substantive debate on tech policy.
AI Discord Recap
A summary of Summaries of Summaries by Gemini 2.5 Pro Exp
1. The AI Arms Race: New Models and Hardware Heats Up
- Qwen 3 Max Enters the Arena with Mixed Reviews: The new Qwen 3 Max model sparked speculation that it has 500B to 1 trillion parameters, with users in the Unsloth AI discord praising its creative writing abilities as superior to K2 and Sonnet 4. However, its high price and shortcomings in tool calls and logic-based coding were noted, while an official release announcement from OpenRouter highlighted its improved accuracy and optimization for RAG and tool calling.
- Hardware Wars Rage from Custom Silicon to Workstations: OpenAI is reportedly partnering with Broadcom on a custom AI chip to reduce its reliance on Nvidia, detailed in a Financial Times article. Meanwhile, engineers debated the merits of workstations, with one quipping that the DGX Spark is a toy compared to the more powerful DGX Station, and others speculated that Nvidiaâs upcoming 5000 series may be a skip gen due to a lack of significant VRAM increases.
- Niche Models Cater to Specific Tastes: A new model named Glazer was released on Hugging Face and Ollama specifically to replicate the sycophantic personality of GPT-4 that some users miss. In a more experimental vein, a developer trained a micro-LLM on H.P. Lovecraft's stories, producing what they described as quite promising Lovecraftian output, seen in this YouTube video.
2. Geopolitical Jitters and Corporate Policy Shake-Ups
- Anthropic Draws a Line in the Geopolitical Sand: A new Anthropic policy, first shared on X, restricts service to organizations controlled by jurisdictions where its products are not permitted, such as China. The move ignited debates across multiple Discords about whether the motivation was genuine national security concerns or simply corporate self-interest aimed at protecting market share.
- MasterCard's AI Unleashes Compliance Chaos: MasterCard replaced its human fraud prevention team with an AI system that is now aggressively flagging merchants for obscenity rule violations, as detailed in Chapter 5.12.7 of their rulebook. The system's insufficiently specified criteria have led to fees as high as $200,000, forcing merchants into a corner and highlighting the risks of automated policy enforcement without clear context.
- OpenAI Clarifies Responses API Reality: A developer posted a thread on X to bust widespread myths about the OpenAI Responses API, clarifying that it does not magically unlock higher model intelligence but is essential for building GPT-5-level agents. It was also confirmed that OpenRouter uses this API for most of its OpenAI models, making the clarification critical for developers building on the platform.
3. The Developer's Dilemma: Choosing and Tuning the Right Tools
- Coding Assistants Clash in the IDE: Developers are fiercely debating the best AI coding tools, with many finding GPT-5 superior to Sonnet 4 due to its conciseness and lower tendency to hallucinate. The community is also split between Codex CLI, praised for its code quality, and Cursor Code, favored for its creative reasoning, with one user noting the optimal setup might be a $20/month Cursor subscription paired with a separate Claude Code plan.
- Engineers Wrangle LLMs with Prompts and Programs: In the OpenAI discord, users shared advanced prompt engineering techniques, advocating for token efficiency by cutting useless words and using bracket notation like [list] and {object} for abstraction. Elsewhere, developers using DSPy focused on a more programmatic approach, building voice agents and using frameworks like GEPA to optimize prompts for specific conversational tasks.
- Hardware Constraints Force Creative Solutions: A user on a 6GB GPU sought model recommendations for immersive roleplaying, leading to suggestions like Mistral Nemo Instruct and the quantized Qwen3-30B-A3B-Instruct-2507-MXFP4_MOE model. For developers with tight cloud budgets, another discussion highlighted using models with RoPE (Rotary Position Embedding) to build RAG applications that can handle context windows larger than what they were explicitly trained on.
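A hedged sketch of that RoPE-stretching idea with Hugging Face transformers; the model id and scaling factor are placeholders, newer transformers versions spell the key "rope_type" instead of "type", and quality past the trained window still degrades without long-context fine-tuning:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"  # placeholder RoPE-based checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    rope_scaling={"type": "dynamic", "factor": 2.0},  # ~2x the trained window
)
```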
4. Under the Hood: The Guts of GPU Programming and Performance
- Mojo and Zig Push Compiler Boundaries for Peak Performance: Engineers in the Modular community are chasing the dream of writing simple, Pythonic code that automatically compiles to SIMD instructions using Mojo and MLIR. This mirrors concerns in the Zig community over a new async IO approach where IO needs to haul around state now, fueling discussion on how next-gen language features like Mojo's type system can solve these low-level performance challenges.
- Engineers Decode Low-Level CUDA and ROCm Mysteries: A deep dive revealed that the FP32 accumulator for FP8 matrix multiplication in Nvidia's tensor cores is actually FP22, according to this paper. Other discussions focused on leveraging L2 cache persistence on the Ampere architecture for performance gains, detailed in a blog post, and tackling errors in rocSHMEM related to its ROCm-aware MPI requirements.
- Future-Forward Architectures Spark Niche Debates: Discussions explored brain-like Spiking Neural Networks (SNNs) after a member shared an explainer video. On the more practical side of performance, vLLM profiling revealed significant slowdowns caused by "Runtime Triggered Module Loading", prompting an investigation into its root cause and potential workarounds.
5. User Blues: Platform Instability and UX Woes Create Headaches
- LMArena Buckles Under Unprecedented Traffic: The LMArena platform is struggling with major stability issues, as users report widespread image generation glitches, infinite loops, and a non-functional video arena bot. Compounding the frustration are newly implemented rate limits and login requirements, with one user complaining the change was bad "because most of us don't want to".
- APIs Sputter and Services Stumble Across Platforms: Users of the Perplexity PPLX API reported a spike in 500 Internal Server Errors, with the Playground also becoming non-functional. The instability extends to paying customers, as some Perplexity Pro users noted that the Grok 4 model was missing from their selector, while an OpenRouter user discovered that hitting the output token limit silently truncates responses.
- AI Assistants Flub the Job and Frustrate Users: Developers using Cursor's Auto mode shared numerous complaints about its poor performance, including its inability to fix simple bugs and its tendency to type edits in the chat instead of applying them. One user who switched back to aider from Claude Code remarked that Anthropic have made some questionable changes, highlighting a broader sentiment that even top-tier tools are experiencing regressions.
Discord: High level Discord summaries
LMArena Discord
- Image Generation Plagued by Glitches: Users are reporting widespread issues with image generation, including persistent errors and infinite generation loops resulting in the "Something went wrong with this response" message.
- Suggestions include adding more specific error messages to aid in troubleshooting, especially when the model appears confused by the prompt.
- Video Arena Bot Briefly Vanishes: The video arena bot experienced downtime but is now back online after a fix; users can use the bot with /video and a prompt in the specified channels <#1397655695150682194>, <#1400148557427904664>, or <#1400148597768720384>.
  - A GIF tutorial on using the bot was shared here.
- Login Requirements Incite Ire: The new login requirements, especially the Google account requirement, sparked user concerns, with one member pointing out the requirement was bad.
- This member explained this was "because most of us don't want to".
- Rate Limits Rankle Regulars: The recent implementation of rate limits on image generation, due to unprecedented traffic, has led to frustration, with confusion whether the limits are intentional or a result of ongoing issues.
- Logged-in users will continue to enjoy higher limits, and more information about user login can be found here.
- Account Data Evaporates Erratically: Users reported instances of lost chat histories, particularly when not logged in, which raised concerns about data retention.
- One member suggested trying to restore the chats ("if you use brave you might be able to restore them, i dont know about google") and that the platform is likely using Cloudflare.
Perplexity AI Discord
- Perplexity Pro Users Whine About Grok 4 Absence: Some Perplexity Pro users are missing Grok 4 in their model selector, and were advised to contact support to check if their Pro account was the enterprise version.
- A member suggested reinstalling the app might help resolve the issue, or checking with Perplexity support if it was assigned to the account, noting that university users were especially impacted.
- Arc Browser Bites the Dust: Users discussed the transition from Arc to Dia, with one noting that Arc hasn't had a meaningful update in about a year and another expressing concern about the $15 charge for the browser.
- They added that a large fanbase of happy Arc users would be left behind with the transition to building agentic chromium, and complained that it should be cheaper than Perplexity Max.
- Qwen 3 Max Speculation Swirls: Members speculated on the specs of the upcoming Qwen 3 Max, anticipating parameters between 500B and 1 Trillion.
- One member stated that they believe that, since the models are free for consumers, it's for Better Training Data and Big Community, and that the community is also driving model building by labeling, testing, and evaluating different model versions.
- PPLX API Melts Down: Multiple users reported receiving 500 Internal Server Errors on API calls and noted that the Playground was also non-functional.
- The users confirmed that no outage was reported on the status page, while one quipped that they're going to pretend nothing happened after the service appeared to be working again, and another blamed increased usage.
- Comet Hits Usage Limits: Users are reporting that after using Comet Personal Search too much, it stops working, throwing the message You've reached your daily limit for Comet Personal Search. Upgrade to Max to increase your limit.
  - Others noted that Comet is currently offered on the Paypal/Venmo deal or if you're a student, and are sharing invite links in the discord, but it may be impacting API performance.
Unsloth AI (Daniel Han) Discord
- Postgres Dominates Complex Queries: While Qdrant excels in vector search, Postgres with pgvector is considered superior for handling complex database queries, sparking discussion on database suitability.
- A member linked to a tweet and humorously shared a Borat GIF, adding levity to the tech discussion.
- Local Sonnetâs Mammoth RAM Requirements: Running Local Sonnet demands a minimum of 512GB of RAM, highlighting significant hardware requirements for optimal performance.
- Even with 1TB of RAM, achieving full precision remains a challenge, leading to inquiries about Q8 fine-tuning as a potential solution, though dismissed as insufficient.
- DGX Spark: Toy or Treasure?: The community debated the merits of DGX Spark versus DGX Station, with one member quipping that Spark is a toy, Station is a workstation, while linking to the DGX Station product page.
- Despite its limitations, the DGX Spark was acknowledged for its attractive price and storage capacity, described as a good product.
- Qwen 3 Max Excels in Creative Writing: Evaluations of Qwen 3 Max highlighted its strengths in creative writing and roleplay, surpassing K2 and Sonnet 4 in member evaluations.
- However, its high price and perceived shortcomings in tool calls and logic-based coding tempered enthusiasm, positioning it as potentially overpriced.
- Glazer Mirrors GPT-4: A new model, Glazer, designed to replicate the sycophantic personality of GPT-4 that some users miss, was released to positive reception.
- It is available on Ollama via ollama run gurubot/glazer and on Huggingface in 4B and 8B versions.
LM Studio Discord
- 3090 Bug Overclocks 4090?: A user reported that their 3090 seemed to be causing their 4090 to draw excessive power, potentially due to a software bug.
- This resulted in higher temperatures, leading the user to typically undervolt to prevent overheating.
- Tentacle LORAs conquer Art Styles: A member created and shared a collection of LORAs, exploring various art styles, providing a link to the LORA template.
- These LORAs were described as stomped together, resulting in a tentacle-like shape, designed for artistic experimentation.
- 6GB GPU Owner needs roleplaying model: A user with a 6GB GPU sought recommendations for the best model for realistic and immersive roleplaying games.
- Suggestions included increasing CPU RAM to 64GB and utilizing models like Mistral Nemo Instruct or Qwen3-30B-A3B-Instruct-2507-MXFP4_MOE.
- Bionic Legs Flop?: A member inquired about consumer-priced bionic legs (exoskeletons), seeking real-world performance insights.
- Another member referenced a YouTube review indicating that they barely do anything and might even induce muscle atrophy.
- 5000 Series to skip VRAM?: Discussion arose around the potential for the new Nvidia 5000 series to be a skip gen, with minimal performance increases over the 4000 series.
- The lack of added VRAM was also a point of concern.
Cursor Community Discord
- GPT-5 Demolishes Sonnet 4: Members find GPT-5 superior to Sonnet 4 for coding, citing its conciseness and accuracy, albeit requiring more specific prompting, while Sonnet 4 tends to hallucinate more.
- Users appreciate GPT-5 as a valuable planner and discussion partner, especially when coupled with auto-implementation, because Sonnet 4 seems template-based.
- Codex CLI vs Cursor Code: Dueling Code Geniuses: The community is split between Codex CLI and Cursor Code, as some prefer Codex CLI's superior code quality, while others favor Cursor Code's creative thinking and reasoning abilities, as quality depends on prompt quality.
  - A member unsubscribed from Cursor Code's Max plan due to hallucinations, while others warn of Codex's lower, harder-to-track rate limits, though some appreciate its suggestion system.
- Cursor's $20/m Price Tag: Still a Steal?: Discussions revolve around the value of Cursor's $20/month Pro plan, with users debating how quickly one might hit the usage limits.
- One user, finding it essential, canceled their Cursor subscription for Claude Code and Codex, suggesting a $20/month Cursor subscription paired with a Claude Code plan for inline editing and terminal usage is the optimal setup.
- Cursor Auto-Mode: Handle with Extreme Caution: Multiple users report issues with Cursor's Auto mode, noting its poor performance, inability to fix simple bugs, and tendency to type edits in the chat instead of applying them.
  - One user humorously illustrates Cursor's overconfidence with a meme-like message generated by the tool, underscoring the need for thorough debugging.
OpenRouter Discord
- Qwen3-Max Gets Smarter: The latest Qwen3-Max model shows accuracy gains in math, coding, logic, and science tasks over the January 2025 version, according to this X post.
- The model is optimized for RAG and tool calling, lacks a dedicated "thinking" mode, and is available for testing here.
- Fake OpenRouter Crypto on PancakeSwap: An OpenRouter-related cryptocurrency is a scam and not officially connected to OpenRouter.
- Users were warned after inquiring about the existence of an OpenRouter coin on PancakeSwap and its availability for trading.
- Anthropic's Geopolitical Stance Debated: Members debated Anthropic's blog post which restricts access from regions with ownership structures subject to control from countries where their products are not allowed.
- Some speculated whether the move was motivated by national security or market share protection.
- Output Tokens Capped at 8k: A user found that hitting the output token limit results in response truncation, with the stop reason flagged as "length".
- The API restricts setting `max_tokens` beyond the model's limit.
- OpenRouter Leverages OpenAI Responses API: A member inquired whether OpenRouter uses the OpenAI Responses API, referencing a tweet.
- It was confirmed that OpenRouter uses it for most OpenAI models.
OpenAI Discord
- Slash Token Waste: A member advocates for token efficiency by filtering grammatically useless words and consolidating multiple words into useful ones in the prompt, claiming that in inference, wasted tokens = wasted resources.
- They argue that wasted tokens lead to accelerated amortization of components if you're hosting and that politeness in AI prompts can increase environmental waste.
- Gemini 2.5 Pro Unlocks Unlimited Access: Google AI Studio now gives unlimited access to Gemini's best model, 2.5 Pro, along with other features like Imagen, Nano Banana, Stream Realtime, speech generation, and Veo 2.
- Some members are focused on LLMs, and some have found use for video editing, educational videos and recreating public domain.
- Claims of AGI Generate Carbon: Members discussed a blog post revealing the carbon generated by claims of AGI outpaces the token wastage used on please and thank you in ChatGPT, which can be found here.
- It suggests that the environmental impact of grand claims in AI may be more significant than previously thought, prompting discussions about sustainable AI practices.
- Engineering Manual Shared: A user named darthgustav shared a JavaScript code snippet outlining prompt engineering lessons, covering hierarchical communication with markdown, abstraction through variables, reinforcement in prompts, and ML format matching for compliance.
- The lessons aim to enhance clarity, structure, and determinacy in AI interactions, guiding tool use and shaping output more effectively.
- Abstraction via Bracketology: A user emphasizes teaching abstraction through bracket interpretation such as [list], {object}, and (option) within prompts.
- This approach aims to enhance clarity and structure, enabling more effective communication between the user and the AI, improving overall prompt engineering practices.
GPU MODE Discord
- Anthropic's China Policy Draws Fire: A tweet revealed Anthropic's new policy, restricting service to organizations controlled by jurisdictions where their products are not permitted (e.g., China).
- The ensuing debate questioned whether the policy reflects national security concerns or mere corporate self-interest.
- CUDA Newbies Convene on Triton: Newcomers sought guidance on learning Triton without prior CUDA or GPU experience, and received a recommendation to start with the official Triton tutorials.
- They further inquired about the necessity of reading the PMPP book for learning Triton.
- Profiling Reveals Slow Module Loading: During vLLM profiling, time is spent on "Runtime Triggered Module Loading", though its precise meaning and how to avoid it during profiling are unclear, and a trace was shared.
- Analysis revealed that the FP32 accumulator designed for FP8 matrix multiplication in tensor cores is actually FP22 (1 sign bit, 8 exponent bits, and 13 mantissa bits), according to a paper (arxiv.org/pdf/2411.10958).
- rocSHMEM struggles with HIP kernels: A member is exploring rocSHMEM implementation similar to HIP kernels using load_inline, encountering errors related to ROCm-aware MPI requirements.
- Another member suggested trying ROCm/iris as a possible alternative while they investigate the issue.
- L2 Cache Persistence makes comeback: Performance gains on the Ampere architecture can come from leveraging the L2 cache for persistent memory accesses, as detailed in a blog post.
- The corresponding code shows structuring a CUDA project using CMAKE to streamline code organization.
Latent Space Discord
- OpenAI Designs Custom AI Chip: Financial Times reports OpenAI partnered with Broadcom to co-design a custom AI chip, with mass production slated to start next year, indicating a move away from Nvidia dependency; the chip is estimated to cost around $10B, see article.
- Community reactions varied from skepticism about the chipâs quality to speculation that OpenAI will out-compete its own customers.
- Mercor fields $10B Pre-emptive Offers: AI-hiring startup Mercor has received unsolicited offers valuing it at ~$10B (5× its June 2025 Series B price) just four months later, see tweet.
- The news has spurred jokes about the AI-funding frenzy.
- Augie raises $85M Series A for AI Logistics: Augment (Augie) announced an $85M Series A (bringing total funding to $110M in just 5 months) to scale their AI teammate built for the $10T logistics sector, see announcement.
- Augie already helps freight teams handling $35B+ double productivity by orchestrating end-to-end order-to-cash workflows across email, calls, Slack, TMS and more.
- Responses API Myths Busted: A thread clarified widespread confusion about the OpenAI Responses API, debunking myths that Responses is a superset of Completions, can be run statelessly, and unlocks higher model intelligence & 40-80% cache-hit rates, see thread.
- Developers stuck on Completions are urged to switch to Responses for GPT-5-level agents, with pointers to OpenAI cookbooks.
- AI Engineer CODE Summit Slated for NYC: The AI Engineer team is launching its first dedicated CODE summit this fall in NYC, gathering 500+ AI Engineers & Leaders alongside top model builders and Fortune-500 users to unpack the reality of AI coding tools, see announcement.
- The summit is invite-only, with two tracks (Engineering & Leadership), no vendor talks, and a CFP open until Sep 15, aiming to celebrate PMF (Product-Market Fit) while addressing MIT's statistic that 95% of enterprise AI pilots fail.
DSPy Discord
- DSPy Powers Budding Voice Agents: Members discussed building voice agents with DSPy and explored using GEPA to optimize prompts for frameworks like Livekit and Pipecat.
- One member suggested using the optimized prompt from GEPA as a straightforward string, while acknowledging that this might feel anti-DSPy.
- GEPA flexes Prompt Optimization Muscles: While DSPy creators might cringe at the term prompt optimization, tools like GEPA can indeed be used for this purpose and Groq was recommended for inference.
- For prompt creation, it was suggested to setup a Rubric type judge to assess generated responses, especially at the conversation level.
- Multi-Turn Musings Spark DSPy Conversational Capabilities: While a member found no satisfying implementation of multi-turn conversations with DSPy or RL applications like GEPA or GRPO, DSPy is fully capable of handling multi-turn conversations using `dspy.History`.
- However, it was cautioned that defining examples well is crucial, as it's easy to introduce bias when building chat systems.
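For readers who haven't used it, a minimal sketch of multi-turn chat with `dspy.History` might look like the following; the model choice and field names are our own illustrative assumptions, not code from the discussion:

```python
import dspy

# Hypothetical model choice; any LM supported by DSPy works here.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class ChatTurn(dspy.Signature):
    """Answer the user, taking prior turns into account."""
    history: dspy.History = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

chat = dspy.Predict(ChatTurn)
history = dspy.History(messages=[])

for q in ["My name is Ada.", "What is my name?"]:
    result = chat(history=history, question=q)
    # Append the completed turn so the next call can see it.
    history.messages.append({"question": q, "answer": result.answer})
    print(result.answer)
```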
- RAG and Fine-Tuning face off in Memory Game: The discussion addressed how to equip voice agents with extensive information (hours, services, pricing, etc.) without runtime latency, with some approaches being fine-tuning or retrieval.
- While fine-tuning can build in memorization, it's a big job, and simple functions or maps (like hours of operation) don't need a vector database like RAG.
- Token Streaming Rides the Wave: Members explored the impact of streaming responses (token by token) on user experience, with a key focus on minimizing Time To First Token (TTFT).
- While streaming doesn't reduce TTFT, it enhances user perception by providing immediate feedback, and libraries like Pipecat already stream frames.
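As a rough illustration of the TTFT point, here is a sketch that measures time to first token with a streaming OpenAI-style client (the model name and prompt are placeholders):

```python
import os
import time
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

start = time.perf_counter()
ttft = None
parts = []

# stream=True yields tokens as they are generated; the user sees output
# as soon as the first token arrives instead of waiting for the whole reply.
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "What are your opening hours?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        parts.append(delta)

print(f"TTFT: {ttft:.2f}s, total: {time.perf_counter() - start:.2f}s")
print("".join(parts))
```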
Moonshot AI (Kimi K-2) Discord
- Kimi API Credits Arriving: The Kimi giveaway winner was notified that API credits are incoming.
- Credits were anticipated to arrive within the hour, arranged by the crew.
- Anthropic API Absent on Kimi: A user asked about the availability of the Anthropic API on the new model, but it was clarified that kimi-k2-turbo-preview points to -0905.
- This indicates the Anthropic API is not currently integrated into the new model.
- Kimiâs 0905 Model Launches: The turbo model now utilizes the 0905 model, updated from the 0711 model.
- Some users find the new K2 model overly poetic, while others find it to be more detailed and better.
- Kimi Teamâs Lofty Ambitions: Despite being a smaller team compared to Grok/OAI, the Kimi team harbors big dreams and has a big model.
- A member noted that smaller companies often offer more user interaction.
- Coding Improvements Confuse Kimi Users: Users express confusion over the emphasis on coding improvements in the new Kimi K2 model.
- Opinions diverge, with one user preferring 0711 over 0905.
Nous Research AI Discord
- Spiking Neural Networks Mimic the Brain: Members shared a YouTube video discussing Spiking Neural Networks (SNNs) and their similarities to the human brain.
- Another member mentioned image sensors that work closer to how the human eye does, linking this video.
- Meta Wristband Controls Smart Glasses: Meta plans to release a wristband that reads body electrical signals to control smart glasses, according to this Nature article.
- No further details were discussed.
- Hermes Plays Super Conservative Holdem: A member observed that Hermes exhibits extremely OOD unique behavior in the husky holdem benchmark.
- The member noted it plays super conservative in a way no other model does.
- Micro-LLM Channels Lovecraft: A member's experiments with a micro-LLM trained on H.P. Lovecraft's stories produced quite promising output; view the YouTube video.
- They also speculated that a 3 million parameter model could become a light chat model with the right dataset and sufficient training.
- NVIDIA Unleashes SLM Agents: A member shared NVIDIA's research on SLM Agents (project page) and the accompanying paper (arxiv link).
- No further details were provided.
Modular (Mojo 🔥) Discord
- Zig's Async IO Faces Doubts: Concerns arise in other language communities regarding the viability of Zig's new approach to async IO, mentioning that IO needs to haul around state now, referencing this discussion in Ziggit.
- It was suggested that Mojoâs type system and effect generics may address some of the underlying problems.
- SIMD Nirvana: Pythonic SIMD Approaches: Members discussed the goal of writing simple, Pythonic code that automatically compiles to SIMD instructions, using Mojo and MLIR for optimal parallelized assembly without relying on LLVM to correctly vectorize code.
- One member dreamed of for loops automatically compiled for parallel processing, utilizing hardware capabilities effectively.
- Compiler Needs Input Data Shapes For Vectorization: To fully vectorize code, especially loops, the compiler requires sufficient information about input data shapes or must perform speculation to identify hot loops, clarifying that Mojo encourages the use of portable SIMD libraries.
- It was noted that scalar and vector operations can ideally run simultaneously on CPUs and AMD GPUs due to separate execution resources.
- GPU Kernel Maturity: Check-up: A member inquired about the maturity of writing GPU kernels in Mojo, specifically implementing a Mamba2 kernel for use in PyTorch, and was pointed to Modularâs custom kernels tutorial.
- MAX (Modular's API) is not primarily targeted at training but is viable for inference, with MLA already implemented for inference (see GitHub).
- Span Abstraction Dream: A member wishes for Span (a contiguous slice of memory abstraction) to be an easily usable, auto-vectorized tool, with algorithms that work on NDBuffer (being ported to LayoutTensor) as part of Span.
- They observed that existing implementations are manual and parametrized with hardware characteristics, lacking sufficient compiler magic.
Eleuther Discord
- MasterCard's AI Flags Obscene Material: MasterCard replaced fraud prevention staff with an AI system that is triggering conflicts with merchants over obscenity rule enforcement; details are available in Chapter 5.12.7 of mastercard-rules.pdf.
- The system flags more transactions as obscene, with fees reaching up to $200,000 per violation and $2,500 daily for noncompliance, incentivizing merchants to avoid admitting fault.
- Lacking Criteria Plagues Fraud Prevention: The automated fraud prevention issue stems from insufficiently specified criteria in obscenity rules, with no clear examples of safe items, resulting in confusing gradients for the LLM.
- The discussion focused on the need to clarify unwritten policies and approaches to avoid issues caused by automated enforcement without adequate context.
- Brand Risk Drives Over-Enforcement Tactics: Pressure from mutual funds to mitigate brand risk, such as board diversity targets, leads to over-enforcement within MasterCard's fraud department, impacting merchants.
- MasterCard's focus on concealing issues hinders the development of useful monitoring metrics, since any flaw discovered would create a problem that someone would then have to solve, giving staff a career incentive not to look.
- Endorsement Request Sparks Concern: A researcherâs request for endorsements for an arXiv paper on semantic drift raised suspicion due to recent cases of AI-induced psychosis.
- The concerns stemmed from the use of terms associated with AI-generated nonsense, prompting a request to share the paper for review.
- Community Ponders GRPO Baseline: Members discussed the possibility of using a GRPO baseline for an upcoming project.
- The idea came about when one member asked "did you have a GRPO baseline?" and the other responded "no, this will be next".
HuggingFace Discord
- Anthropic's Policy Raises Self-Interest Questions: Members debated if Anthropic's new policy prohibiting access from organizations in restricted jurisdictions is motivated by national security or corporate self-interest.
- The discussion focused on the rationale behind restricting access based on ownership structures, sparking questions about corporate control.
- Reward Weighting Gets Deciphered in RL Studies: Members sought studies on the benefits of weighting reward functions during RL to avoid uninformed experimentation.
- One member shared a document regarding reward weighting in RL.
- Attention Bias Gets Explored to Train Causal Models: A member requested advice on modifying causal model training with SFTTrainer to add attention bias for specific words, referencing the Attention Bias paper.
- Suggestions included checking specific terms/tokens against common tokenizers and considering alternative approaches for loss calculation and gradient signal control.
- RoPE Technique gets Employed to Rescue RAG Context: Members exchanged tips for building RAG applications with LLMs under very limited context sizes (4096 tokens).
- One tip involved using models with RoPE and fine-tuning them with a larger context size, referencing this repo and emphasizing that RoPE enables models to perform well even on context lengths they haven't been trained on.
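As a sketch of the general technique (not the linked repo's code), RoPE scaling can be requested when loading a Llama-family model in Hugging Face transformers; the exact config keys vary by library version:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative base model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Linear RoPE scaling stretches the rotary position embeddings so a model
# pretrained at, say, 4096 tokens can attend over 4x longer inputs; results
# are best when the model is then fine-tuned at the longer length.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    rope_scaling={"type": "linear", "factor": 4.0},
)
```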
- Enron Emails get Parsed into Parquet: A member uploaded their parser for the Enron email dataset, resulting in 5 structured parquet files, including Emails, Users, Groups, Email/User junction, and Email/Group junction.
- Parent and child emails have been parsed, and duplicates are managed both by file and message hashes/caches, with all messages included as MD5 hash objects.
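A hypothetical sketch of querying such a layout with pandas (the file and column names below are assumptions, not the parser's actual schema):

```python
import pandas as pd

emails = pd.read_parquet("emails.parquet")          # one row per message
users = pd.read_parquet("users.parquet")            # one row per address
email_user = pd.read_parquet("email_user.parquet")  # Email/User junction

# Example: messages per address via the junction table.
counts = (
    email_user.merge(users, on="user_id")
    .groupby("address")  # hypothetical column name
    .size()
    .sort_values(ascending=False)
)
print(counts.head())
```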
tinygrad (George Hotz) Discord
- Digital Ocean MI300X Stable Diffusion Fails: Users had errors running the stable_diffusion.py example on a Digital Ocean MI300X GPU instance, tracing back to some z3 issue.
- The failure wasn't reproducible on a Mac, though mnist_gan.py was tested successfully.
- AMD_LLVM=1 Causes TypeError During MNIST Training: A `TypeError` involving unsupported operand types (BoolRef) occurred when using AMD_LLVM=1 during a simple MNIST training loop.
- George Hotz suggested trying IGNORE_OOB=1, linking it to a possible z3 version issue, noting that some overloads added in z3>=1.2.4.0 might be the cause, and provided a link.
- Kernel Removal Project Seeks Contributors: A user expressed interest in contributing to the kernel removal project within Tinygrad.
- The scope of potential contributions was not clarified, but presumably would involve slimming the kernel surface.
aider (Paul Gauthier) Discord
- Warp Code Gets Love: Users are praising Warp Code, with one user noting that Warp feels like the difference between driving a stick and manual.
- Warp is useful when you donât know files and want to get the sense of a new codebase via embeddings search.
- Aider Still Shines: A user who switched from aider to Claude Code months back has switched back, finding that Anthropic have made some questionable changes.
- The user now prefers aider for its simplicity, and uses Gemini 2.5 Pro, Gemini Flash, and Qwen3 Coder along with `/run` to replicate Claude Code's plan mode.
- Run command is a Killer Feature: The `/run` command in aider is a major feature for one user, who noted that aider is good when you know better which files you want to work with.
- They also enquired where they can see aider's success stories.
- Coding Agent Undergoing Refactoring: A member is refactoring their own coding agent, inspired by Aider, to learn more about AI system design.
- They already have a small proof of concept, but are now reading a tutorial to see what others do for a similar project.
- Code Validation Advice Sought: A member is seeking advice on how to prevent dangerous code (env leakage, rm -rfs, network requests, etc.) from being generated in any language.
- They considered a TreeSitter based validator, and asked how Aider avoids these issues, requesting pointers to relevant files in the repo.
Yannick Kilcher Discord
- Reviewing Paper Baselines Presents Challenges: A member requested general guidelines for approaching baselines presented in research papers, particularly when unfamiliar with the dataset.
- The member expressed uncertainty about judging performance without sufficient knowledge of the dataset, implying a need for more background research.
- LoRA Adds Instead of Replaces Original Weights: A member asked why LoRA trained layers are added instead of replacing the original weight matrix, noting the contrast with other efficient processes like depthwise convolutions.
- The member sought a paper, article, or reasoning to explain this design choice, rather than replacement, and mentioned having an intuition on the matter.
Manus.im Discord
- AI Politeness Gets Scientific Backing: A user shared a paper providing scientific evidence that being polite to AIs matters.
- The discussion centered around whether acting nicely towards AIs results in more cooperative behavior.
- Demand for Scientific Validation of AI Politeness: Users expressed a desire for scientific proof that politeness influences AI behavior.
- This aligned with the shared arxiv link, suggesting a community interest in understanding the impact of human-AI interaction styles.
LLM Agents (Berkeley MOOC) Discord
- AI Agents Curriculum Plans for 2025 Examined: A member inquired whether the 2025 Fall curriculum would mirror the 2024 Fall curriculumâs focus on Introduction to AI Agents.
- They explicitly requested a link to join the course, suggesting they are seeking registration or access details.
- Fall 2025 enrollment: The user is specifically looking for information on how to join the Fall 2025 course.
- They explicitly requested the link, and will be awaiting course joining details.
The MLOps @Chipro Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
The Windsurf Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
Discord: Detailed by-Channel summaries and links
LMArena ▷ #general (1100 messages🔥🔥🔥):
Image generation issues, Video arena bot down, Login requirements, Rate limits, Account data loss
- Image Generation Glitches Rampant: Users reported widespread issues with image generation, including persistent errors and infinite generation loops, with many experiencing the dreaded "Something went wrong with this response" message.
- A member pointed out that "Sometimes the model is confused with the prompt and gives the same error… We seriously need to have more error feedback", suggesting the need for more specific error messages.
- Video Arena Bot grounded by Glitches: The video arena bot is currently down, with the team actively working to resolve the issues, however there is no ETA for when it will be back online.
- One member quipped: "The video bot currently isn't working. Trying to use it in different channels isn't going to work. Even if the bot was working properly you're unable to use it in this channel."
- Login Requirements Trigger Tantrums: The introduction of login requirements, particularly the Google account requirement, has caused concern among users.
- One member called the requirement bad, saying "because most of us don't want to".
- Rate Limits Rattle Regulars: Users noticed the implementation of rate limits, leading to frustration and discussions about whether they are intentional or a result of ongoing issues.
- A member commented that "if you arent logged in you get like 2 or 3 generations before being rate limited, even on battle", while another was similarly confused, wondering what exactly was added or changed.
- Account Data Vanishes in Volatile Venture: Several users reported instances of lost chat histories, particularly when not logged in, leading to concerns about data retention.
- One member suggested, "if you use brave you might be able to restore them, i dont know about google", while it was also noted that the platform is likely using Cloudflare.
LMArena ▷ #announcements (3 messages):
Video Arena Discord Bot, User Login, Rate Limits
- Video Arena Discord Bot Back Online: The Video Arena Discord Bot is back online after a fix; to use the bot, enter `/video` with a prompt in the specified channels: <#1397655695150682194>, <#1400148557427904664>, or <#1400148597768720384>.
- A GIF illustrating how to use the bot was shared here.
- Rate Limits Introduced for Image Generation: Due to unprecedented traffic, rate limits have been introduced for image generation.
- Logged-in users will continue to enjoy higher limits, and more information about user login can be found here.
Perplexity AI ▷ #announcements (1 message):
iOS App Redesign, Comet Access for Students, Comet Shortcuts, Voice Assistant in Comet, GPT-5 Thinking for Pro Users
- Perplexity Ships Six Hot New Features: Perplexity AI announced the release of six new features on September 5th, detailed in their changelog.
- These include an iOS App Redesign, Comet access for students, Comet shortcuts, a more capable Voice Assistant in Comet, GPT-5 Thinking for Pro users, and updates to Perplexity Finance.
- Students Get Comet Access: Perplexity AI is now offering Comet access to students as part of their back-to-school initiative, announced September 5th.
- This aims to provide students with advanced AI tools for research and learning, integrating seamlessly with their existing educational workflows.
- Pro Users get GPT-5 Thinking: Pro users can now access GPT-5 Thinking capabilities within Perplexity AI as of September 5th.
- This upgrade provides enhanced reasoning and problem-solving abilities, allowing for more in-depth analysis and insights.
Perplexity AI ▷ #general (823 messages🔥🔥🔥):
Grok 4 struggles, Qwen 3 Max, Comet Browser, Gemini 2.5 Pro, AI Model Parameter Size
- University Pro Users Missing Grok 4: Some university Perplexity Pro users reported missing Grok 4 in their model selector, and were advised to contact support and to check if their Pro account was the enterprise version.
- It was suggested that reinstalling the app might help resolve the issue.
- Goodbye Arc-a-Dia: Users discussed the transition from Arc to Dia, with one noting that Arc hasn't had a meaningful update in about a year and another expressing concern about the $15 charge for the browser.
- They added that a large fanbase of happy Arc users would be left behind with the transition to building agentic chromium.
- Qwen 3 Max Hype Surges: Members speculated on the specs of the upcoming Qwen 3 Max, anticipating parameters between 500B and 1 Trillion.
- One member stated that they believe that since the models are free for consumers, it's for Better Training Data and Big Community.
- Comet's Limits Spark Debate: Users are reporting that after using Comet Personal Search too much, it stops working, throwing the message You've reached your daily limit for Comet Personal Search. Upgrade to Max to increase your limit.
- Others noted that Comet is currently offered on the Paypal/Venmo deal or if you're a student, and are sharing invite links in the discord.
- Perplexity's Special Sauce: Fact-Checking: Members discussed the strengths of Perplexity compared to other platforms like ChatGPT and Gemini, highlighting Perplexity's focus on fact-checking and web search.
- One user stated, I use Perplexity primarily for fact checking and quick research as that is its primary strength - facts with citation and references. It beats ChatGPT, Gemini and Claude hands down.
Perplexity AI ▷ #sharing (3 messages):
AMD Zen 6 CPUs, Omarchy Linux, Shareable Threads
- AMD Preps Zen 6 CPUs: A member shared a link about AMD preparing Zen 6 CPUs.
- Omarchy Linux Distribution: A member shared a link about the Omarchy Linux distribution.
- Shareable Threads reminder: A Perplexity AI bot reminded users to ensure their threads are set to `Shareable`.
Perplexity AI ▷ #pplx-api (4 messages):
API 500 Errors, Playground issues, Outage reporting
- 500 Errors Plague PPLX API: Multiple users reported receiving 500 Internal Server Errors on API calls and noted that the Playground was also non-functional.
- The users confirmed that no outage was reported on the status page, while one quipped that they're going to pretend nothing happened after the service appeared to be working again.
- Image Analysis Troubleshoot Internet: The attached image prompted a suggestion to check the internet connection.
- This suggestion came as a response to the reported API and Playground issues, implying a possible user-side connectivity problem.
Unsloth AI (Daniel Han) ▷ #general (574 messages🔥🔥🔥):
Postgres with pgvector vs. Qdrant, Local Sonnet, DGX Spark vs DGX Station, Qwen 3 Max Evaluation
- Postgres for Complex Queries, Qdrant for Vector Search: Members discussed that while Qdrant is good for vector search, Postgres with pgvector might be more suitable for complex database queries.
- One member linked to a tweet and shared a Borat GIF.
- Local Sonnet Requires Hefty RAM, Quality Sacrificed: Running Local Sonnet requires at least 512GB of RAM, and even with 1TB of RAM, full precision is not possible.
- One member asked if Q8 fine-tuning would help, but another responded that even Q8 is too big for 1TB RAM.
- DGX Spark for Toying, DGX Station for Work: Members compared the DGX Spark and DGX Station, with one noting that Spark is a toy, Station is a work station, linking to the DGX Station product page.
- It was noted that the DGX Spark is well priced given the included ConnectX-7, and that the $8k model originally had 4TB of storage before it was upped, making it a good product.
- Qwen 3 Max Impresses in Creative Writing, Falls Short in Coding: Members evaluated Qwen 3 Max, finding it very good at creative writing and roleplay, better than K2 and Sonnet 4 imo.
- However, it was considered overpriced and not super super good for tool calls and logic based coding.
Unsloth AI (Daniel Han) ▷ #introduce-yourself (3 messages):
Unsloth AI, GPT-OSS, Google Colab T4, Runtime Error
- Unsloth AI Troubleshooting: A member requested help with Unsloth AI while trying to finetune GPT-OSS using GRPO on Google Colab T4.
- The user reported encountering a runtime error and requested assistance from the community.
- Colab T4 User Seeking GRPO Guidance: A user sought guidance on using GRPO (presumably a reinforcement learning technique) to fine-tune a model, GPT-OSS, on a Google Colab T4 instance using Unsloth AI.
- The user specifically mentioned encountering a runtime error during the process and was looking for help from the community.
Unsloth AI (Daniel Han) ▷ #off-topic (164 messages🔥🔥):
Super cards release updates, GLM 4.5 Air usable tps, Rover Mows Grass, Deepseek & Qwen Tokenizers are Interchangeable, Mini Kimi K2 MoE models
- Super cards release updates janky: Members discussed that support is still janky after the Jan 30th release, but joked about major updates when the Super cards are released.
- Another member shared a meme implying updates are unlikely: Biden dance stare clueless gif.
- GLM 4.5 Air runs Usably: A member noted that GLM 4.5 Air with 132K context, Q4 gets 1.15 tps, which is usable.
- Another member was shopping for parts and going to try out distributed first, and also mentioned not having tried full precision for KV.
- Robot Rover mows grass: A member is thinking of stacking 3090s and saving budget for a rover to mow the grass.
- They mentioned the rover might run a small vision model with additional safety systems like lidar and sonar.
- Deepseek & Qwen Tokenizers are Interchangeable: It was reported that the deepseek r1-0528 tokenizer is interchangeable with the qwen3 tokenizer.
- Members discuss if that meant they copied the same arch and distilled from the same model and put it back into the copied arch.
- Mini Kimi K2 MoE models coming soon: Members are interested in a mini Kimi K2, maybe like a 30B MoE or less.
- Another member suggested a 150b with 1b active for similar sparsity.
Unsloth AI (Daniel Han) ▷ #help (40 messages🔥):
Training vs Inference, GPT-OSS finetuning issues, Gemma-3 finetuning errors, Tokenizer Impact on Finetuning, GRPO Support for Gemma3 with vLLM
- Training Throughput vs Inference Throughput Discussed: A member was comparing token throughput during inference and expressed a desire to train using GRPO similar to Llama3, and offered to share non-functional code.
- They were using Unsloth 2025.9.1, Transformers 4.55.4, Tesla T4 with 14.741 GB, Torch 2.8.0+cu126, CUDA 7.5, CUDA Toolkit 12.6, and Triton 3.4.0.
- GPT-OSS Finetuning Yields Meaningless Output: A member reported that after finetuning, the output channel and content of GPT-OSS became meaningless, displaying a series of mathematical symbols and unrelated words.
- The traceback of the error can be found here.
- Gemma-3 Finetuning Generates Attribute Error: A member encountered an `AttributeError: 'SlidingWindowLayer' object has no attribute 'max_batch_size'` when running inference on Gemma-3-270M after finetuning.
- A suggestion to use `use_cache=False` was reported to have resolved the issue.
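A minimal sketch of the workaround in transformers (the hub id is our assumption):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-270m"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Hello, how are", return_tensors="pt")
# Disabling the KV cache sidesteps the SlidingWindowLayer attribute error,
# at the cost of recomputing attention at every decoding step.
out = model.generate(**inputs, max_new_tokens=20, use_cache=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```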
- Tokenizer Selection Impact on Finetuning Discussed: A member questioned the impact of using a different tokenizer than the one that comes with the pre-trained model during fine-tuning.
- Another member said that it's like using a different language from the one the model was trained on in the first place, recommending the same tokenizer so the inputs are in a "language" the model understands.
- CUDA linking problems: A member encountered an `AttributeError: module 'bitsandbytes' has no attribute 'functional'` when running code from this notebook.
- The warnings suggest that CUDA is not linked properly and recommend running `sudo ldconfig /usr/lib64-nvidia` and `sudo ldconfig /usr/local/cuda-xx.x`.
Unsloth AI (Daniel Han) ▷ #showcase (2 messages):
Glazer model, GPT-4's personality, Ollama, HuggingFace
- Glazer Model Mimics GPT-4's Praise-Heavy Persona: A new model called Glazer was released, designed to emulate the sycophantic personality of GPT-4 that some users miss.
- It can be run locally and is available on Ollama via `ollama run gurubot/glazer` and on Hugging Face in 4B and 8B versions.
- Unsloth receives gratitude for Glazer: The model received a thank you in the form of a picture that expressed gratitude for Unsloth.
- The post featured a slothhearts emoji.
Unsloth AI (Daniel Han) ▷ #research (11 messages🔥):
Latent Features, Hermes NLP, Financial AI
- Latent Features Debunked: A member suggested latent parts of neural networks could destroy features, suggesting the bottleneck in the nn is not a real feature.
- He sarcastically remarks that if these were real features, everyone would be a billionaire, and dismisses claims of success as made by people with no idea.
- Hermes is not for NLP tasks: A member said you can't certainly use hrms to do anything like nlp.
- He ended the message with all on red baby and an emoji.
LM Studio ▷ #general (97 messages🔥🔥):
GPU power draw concerns, Lora Training, Realistic roleplaying model, LM Studio local network setup, Consumer priced Exoskeleton
- 3090 possibly enabling 4090 Overclock: A member was concerned that their 3090 was allowing their 4090 to draw more power than it should when boosting, leading to higher than expected temperatures.
- They believed a software bug might be causing the GPU to exceed manufacturer limits, and mentioned usually undervolting to prevent overheating.
- Tentacle Loras stomp art styles: A member shared they had trained a bunch of LORAs all stomped together
- Another member asked about the purpose and tentacle-like shape, the trainer replied that he uses them to explore certain art styles he finds interesting, providing a link to the LORA template.
- 6GB GPU Seeks Roleplaying Model: A member with a 6GB GPU from 2019 asked for the best model for realistic, immersive, long-lasting roleplaying games.
- Another member suggested increasing CPU RAM to at least 64GB and using Mistral Nemo Instruct; however, a third member suggested Qwen3-30B-A3B-Instruct-2507-MXFP4_MOE.
- Phone chats from PC with Local LLM via LM Studio: A member asked about running AI on their PC while chatting on their phone, another member suggested using a client app that speaks to OpenAI API and connecting via local network or a tunnel for remote usage.
- After some troubleshooting involving server IPs and client apps like Apollo, the member successfully connected using ngrok and Open WebUI.
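For anyone replicating this, a minimal sketch of the client side, assuming LM Studio's OpenAI-compatible server on its default port 1234 (the IP and model name are placeholders; an ngrok URL works the same way):

```python
from openai import OpenAI

# LM Studio's local server speaks the OpenAI API; any string works as the key.
client = OpenAI(base_url="http://192.168.1.50:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="local-model",  # LM Studio routes this to the loaded model
    messages=[{"role": "user", "content": "Hello from my phone!"}],
)
print(resp.choices[0].message.content)
```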
- Bionic legs underperform: A member inquired about consumer-priced bionic legs (exoskeletons).
- Another member cited a YouTube review suggesting they barely do anything and might even cause muscle atrophy.
LM Studio ▷ #hardware-discussion (139 messages🔥🔥):
Frame Generation, Nvidia 5000 series, ATX 3.1 standard, CPU offload vs GPU, Mi50 VRAM quirk
- Quadruple Frames for Smooth Gaming?: Members debate the utility of 4x frame generation, suggesting itâs only beneficial with a decent base FPS for achieving something like 240FPS at 4K.
- Is Nvidia's 5000 Series a "Skip Gen"?: The new Nvidia 5000 series might be a skip gen, due to minimal performance gains over the 4000 series, which already had excellent efficiency, but also because they are unwilling to add more VRAM.
- ATX 3.1 Standard Fixes Power Connector Woes: The ATX 3.1 standard introduces the 12V-2x6 connector, addressing 12VHPWR issues with longer conductor terminals and shorter sense pins, allowing the GPU to shut off if the connection loosens.
- CPU Offload not a Slam Dunk?: Experiences vary regarding CPU offload; one user found their desktop significantly faster than their server, even with a 4070 TiS performing similarly to an Mi50.
- Mi50 VRAM Performance Quirk Uncovered: A peculiar characteristic of the Mi50 is identified: performance halves beyond the first 16GB of VRAM.
Cursor Community ▷ #general (154 messages🔥🔥):
GPT-5 vs Sonnet 4, Codex CLI vs Cursor Code, Claude Code, Gemini 2.5 Pro, Cursor Pricing
- GPT-5 Reigns Supreme Over Sonnet 4: Members generally agree that GPT-5 is superior to Sonnet 4 for coding tasks, noting that while GPT-5 may require more specific prompting, it is less prone to hallucinations and provides more concise, accurate answers.
- Some users find Sonnet 4 more template-based and prone to jumping to conclusions, whereas GPT-5's directness is preferred, making it a valuable planner and discussion partner especially when combined with auto-implementation.
- Codex CLI vs Cursor Code: A Showdown: Users are divided on whether to use Codex CLI or Cursor Code, with some preferring Codex CLI for its code quality while others laud Cursor Code for superior creative thinking and reasoning abilities; either way, quality depends heavily on the prompt.
- One member unsubscribed from Cursor Code's Max plan due to frustration with hallucinations when fixing bugs, while others caution about Codex's lower and harder-to-track rate limit; some like the suggestion system inside the Codex CLI.
- Cursor's $20/m: Is It Worth It?: Several users discussed the value of Cursor's $20/month Pro plan and how fast one can hit the limit.
- Some find it essential for coding, one user canceled their Cursor subscription in favor of Claude Code and Codex, suggesting the best combo is a $20/month Cursor subscription paired with a Claude Code plan for inline editing and terminal usage.
- Beware Cursor's Buggy Auto-Mode: Multiple users are experiencing issues with Cursor's Auto mode, reporting that it performs poorly, fails to fix simple bugs, and sometimes types edits in the chat instead of applying them.
- One user humorously described Cursor as being excessively proud of its work despite the need for extensive debugging, sharing a meme-like message generated by the tool.
OpenRouter ▷ #announcements (1 message):
Qwen3-Max, RAG, Tool calling
- Qwen3-Max releases with several improvements: The latest Qwen3-Max model boasts higher accuracy in math, coding, logic, and science tasks compared to the January 2025 version, according to this X post.
- It also delivers better instruction following in Chinese and English, stronger multilingual support across 100+ languages, reduced hallucinations, and optimized for RAG and tool calling.
- Qwen3-Max is optimized for RAG and Tool calling: The Qwen3-Max is optimized for RAG and tool calling and does not have a dedicated "thinking" mode.
- Try Qwen3-Max here to check out its capabilities.
OpenRouter ▷ #app-showcase (1 message):
tomlucidor: Finds https://github.com/Lapis0x0/obsidian-next-composer
OpenRouter ▷ #general (126 messages🔥🔥):
OpenRouter Crypto Scam, Anthropic's Geopolitical Concerns, API Key Issues, BYOK Fees, Token Limits and Output Truncation
- OpenRouter coin is fake: Members confirmed that any OpenRouter-related cryptocurrency is a scam and not officially affiliated with OpenRouter.
- Despite warnings, users inquired about the presence of an OpenRouter coin on PancakeSwap and its availability for trading, prompting further clarification that OpenRouter has no official involvement in any cryptocurrency.
- Anthropic's Geopolitical Stance Raises Eyebrows: Members discussed Anthropic's latest blog post, which prohibits access for organizations whose ownership structures are subject to control from countries where its products aren't permitted.
- Some wondered if this was a matter of national security or simply market share protection.
- API Keys throw authentication errors: A user reported API key issues, receiving a "No auth credentials found" error message from ChatGPT.
- The user was prompted to specify the client being used, either the OpenAI client or a custom one, to diagnose the authentication problem.
- BYOK Fees Demystified: A user inquired about the charges associated with using BYOK (Bring Your Own Key), specifically with chutes and Qwen Coder 3.
- It was clarified that OpenRouter charges a 5% BYOK fee on top of what the provider (e.g., chutes) charges.
- Token output limited to 8k: A user wanted to ensure errors are thrown when the output token limit is exceeded.
- The response gets cut off when the token limit is reached, with the stop reason identified as "length"; the API will prevent you from setting `max_tokens` higher than the model's limit.
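A minimal sketch of detecting the truncation from an OpenAI-compatible client (the model slug is an assumption):

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="qwen/qwen3-max",  # assumed model slug
    messages=[{"role": "user", "content": "Write a very long story."}],
    max_tokens=512,  # values above the model's limit are rejected up front
)

choice = resp.choices[0]
if choice.finish_reason == "length":
    # The model hit max_tokens and the response was cut off.
    print("Truncated; raise max_tokens or continue the generation.")
print(choice.message.content)
```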
OpenRouter ▷ #new-models (1 message):
Readybot.io: OpenRouter - New Models
OpenRouter ▷ #discussion (12 messages🔥):
Benchmark Increase, Real World Performance vs. Benchmarks, OpenRouter API usage
- Benchmarks Keep Climbing!: Members noted that every benchmark keeps going up, but the disconnect between benchmark percentage increase and real-world performance keeps rising.
- They added that 5% delta on benches used to be noticeable but is becoming less so, as we are reaching a plateau, though models have improved in creative writing, EQ, tool call failures, and context length adherence.
- OR Uses OpenAI Responses API: A member asked if OpenRouter uses the OpenAI Responses API, linking to a tweet.
- Another member confirmed that it does for the majority of OpenAI models.
OpenAI ▷ #ai-discussions (84 messages🔥🔥):
Multi-Agent Orchestration, Token Efficiency, Gemini 2.5 Pro, Good Luck Token Waste, Carbon Footprint of AGI
- Orchestrate Agents with Context Offloading: One member suggests using multi-agent orchestration with context offloading and dynamically filtering extraneous context to improve vectorization.
- They recommend using a simple setup with orchestrators, a conductor, and specialized agents, emphasizing the importance of managing context to avoid corrupting HO/HD operations.
- Slash Useless Token Waste: One member advocates for token efficiency by filtering grammatically and syntactically useless words and consolidating multiple words into useful ones in the prompt.
- They claim that in inference, wasted tokens = wasted resources [money] if you're paying, and accelerated amortization of components if you're hosting.
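For illustration only, a toy sketch of the idea; the filler-word list is an arbitrary assumption, and real token savings depend on the tokenizer:

```python
# Words assumed (for this sketch) to carry little instruction-relevant signal.
FILLER = {"please", "kindly", "just", "very", "really", "basically",
          "actually", "the", "a", "an"}

def compress_prompt(prompt: str) -> str:
    """Drop filler words; aggressive filtering can change meaning,
    so treat this as a starting point, not a rule."""
    kept = [w for w in prompt.split() if w.lower().strip(".,!?") not in FILLER]
    return " ".join(kept)

print(compress_prompt("Please just summarize the following text very briefly."))
# -> "summarize following text briefly."
```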
- Gemini 2.5 Pro Unlocks Unlimited Access: Members reported Google AI Studio gives unlimited access to Gemini's best model, 2.5 Pro, with other features like Imagen, Nano Banana, Stream Realtime, speech generation, and Veo 2.
- Some members only care about the LLMs and view video and images as a fun factor, and others have found a real use for video editing, educational videos and recreating public domain.
- Pleasantries like Good Luck Waste Tokens: One member argued that asking a question is wasting context, unless it is phrased in a way that gives multiple choices based on the answer.
- Another member suggested the phrase "Good luck" is a waste of tokens in terms of transmitting information, while politeness influences AI responses, potentially increasing environmental waste.
- Carbon Footprint of Claims of AGI Outpaces Token Wastage: Some members discussed a blog post about new statistics revealing the carbon generated by claims of AGI now outpaces the token wastage used on please and thank you in ChatGPT.
- The blogpost can be found here.
OpenAI ▷ #prompt-engineering (10 messages🔥):
Discord chat to Markdown, Prompt engineering lessons, Hierarchical prompting, Abstraction in prompts, ML format matching
- Discord Chat Text Transformation Tactics: A user inquired about the easiest method to extract text from a Discord chat's web interface in MS Edge and save it as a Markdown file (*.MD).
- Darthgustav Gives Prompt Engineering Lessons: A user named darthgustav shared a JavaScript code snippet outlining prompt engineering lessons.
- The lessons cover hierarchical communication with markdown, abstraction through variables, reinforcement in prompts, and ML format matching for compliance.
- Hierarchical Prompting Techniques: One lesson from Darthgustav explained hierarchical communication with markdown for prompting, enhancing clarity and structure.
- By utilizing markdown, the prompt aims to organize information in a structured manner, making it easier for the model to follow instructions and generate desired outcomes.
- Abstraction using Brackets in Prompts: The user introduced abstraction through bracket notation [{(open variables resolved by the AI)}] and ${(by the user)}.
- It emphasizes the importance of explaining bracket interpretation ([list], {object}, (option)) to efficiently manage complex prompts.
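As an illustration of the convention (our own sketch, not darthgustav's original snippet), such a bracket key might be embedded in a prompt template like this:

```python
# The key is stated once; the brackets then carry structure for the model.
PROMPT = """\
## Bracket key
- [items]  : a list the AI expands
- {object} : a structured object the AI fills in
- (option) : optional, may be omitted

## Task
Summarize {document} as [3-5 key points] (add a one-line TL;DR).
"""

print(PROMPT.replace("{document}", "the pasted article text"))
```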
- ML Format Matching for Output Compliance: One lesson includes ML format matching for compliance, covering [{output templates} and {(conditional) output templates}].
- The goal is to guide tool use and shape output more deterministically by reinforcing specific formatting in prompts.
OpenAI ▷ #api-discussions (10 messages🔥):
Discord Chat to Markdown, Prompt Engineering Lessons, Hierarchical Communication in Prompts, Abstraction in Prompts, Reinforcement in Prompts
- Discord Text Dump to Markdown: A user inquired about the easiest method to extract text from a Discord chat (web interface, MS Edge browser) into a Markdown file.
- The user sought to optimize the process, focusing on simplicity and efficiency, implying a need for a straightforward solution for exporting Discord chat logs to .md format.
- Prompt Engineering Instruction Manual: A user shared a detailed JavaScript code block outlining prompt engineering lessons, aiming to teach hierarchical communication, abstraction, reinforcement, and ML format matching.
- The lessons cover using markdown for prompting, abstraction techniques with bracket interpretation ([list], {object}, (option)), guiding tool use, shaping output deterministically, and using output templates for compliance.
- Abstraction Elevation via Bracketology: The user emphasizes teaching abstraction through bracket interpretation such as [list], {object}, and (option) within prompts.
- This approach aims to enhance clarity and structure, enabling more effective communication between the user and the AI, improving overall prompt engineering practices.
- Reinforcement Ramp-Up for Guidance: The user highlights the importance of reinforcement in prompts to guide [tool use] and (shape output) more deterministically.
- By strategically reinforcing desired behaviors, prompts can achieve higher precision and compliance, leading to improved results and more predictable AI interactions.
GPU MODE ▷ #general (3 messages):
Anthropic's new policy, Kernel creation solutions
- Anthropic's Policy Raises Eyebrows: A tweet regarding Anthropic's new policy, prohibiting service to organizations controlled by jurisdictions where their products aren't permitted (like China), sparked debate over whether this is about national security or mere corporate self-interest.
- Navigating the Kernel Creation Cosmos: A member inquired about how to determine whether to build a custom kernel solution versus using existing ones.
- Another member suggested checking the HF kernel hub and exploring standards like liger before deciding to build from scratch.
GPU MODE ▷ #triton (2 messages):
Triton, CUDA, GPU, PMPP Book
- Newcomer Seeks Triton Guidance: A member sought guidance on learning Triton without prior CUDA or GPU experience.
- Another member recommended the official Triton tutorials as a starting point.
- Triton Resources: The user asked about resources to quickly get started with Triton.
- The user also inquired whether it's required to read the PMPP book for Triton.
GPU MODE ▷ #cuda (14 messages🔥):
Barnes-Hut performance, CUDA, Morton code sorting, Octree construction, Memory access optimization
- Barnes-Hut Performance Probed: A member is facing performance issues with a Barnes-Hut CUDA simulation, where the tree traversal and force computation kernel takes 100ms for 30k bodies, despite optimized Morton code sorting and octree construction.
- Another member suggested comparing it to `torch.cdist` and probing around with an LLM to check for access patterns.
- Morton Sorting is sus: Leaf nodes are stored in a flat array sorted by Morton codes, fusing particles with identical codes into one leaf node.
- Members discussed how threads traverse the tree and retrieve values from memory.
- Coalesced Memory Access Clarified: A member asked whether memory accesses are coalesced during tree traversal, given particles are sorted by Morton codes.
- The OP confirmed that threads in the same warp traverse the tree similarly, retrieving the same values, but still finds the 100ms runtime perplexing.
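For context, a minimal Python sketch of the standard 30-bit 3D Morton encoding (the bit-spreading trick popularized by NVIDIA's LBVH articles), which is presumably what the sorting step computes:

```python
def expand_bits(v: int) -> int:
    """Spread the low 10 bits of v so two zero bits separate each
    original bit, making room to interleave three coordinates."""
    v = (v * 0x00010001) & 0xFF0000FF
    v = (v * 0x00000101) & 0x0F00F00F
    v = (v * 0x00000011) & 0xC30C30C3
    v = (v * 0x00000005) & 0x49249249
    return v

def morton3d(x: float, y: float, z: float) -> int:
    """Map a point in the unit cube to a 30-bit Morton code. Nearby
    points get nearby codes, so sorting bodies by this key groups
    spatial neighbors together in memory."""
    def quantize(c: float) -> int:
        return min(max(int(c * 1024.0), 0), 1023)
    return (expand_bits(quantize(x)) << 2) \
         | (expand_bits(quantize(y)) << 1) \
         | expand_bits(quantize(z))

print(hex(morton3d(0.5, 0.25, 0.75)))
```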
GPU MODE ▷ #torch (13 messages🔥):
fp8 matrix multiplication, tensor cores accumulator, Runtime Triggered Module Loading, vLLM profiling
- Debate on Fused Accumulation: Members debated the difference between two options in PyTorch's `mm.py` (lines 128-132) regarding fused accumulation in tensor cores.
- The first option might be a fused accumulation according to a member, and another mentioned a scenario of int8 MMA where the first version gives an error while the second doesn't.
- Deep Dive into Reduced Precision Accumulation: A paper (arxiv.org/pdf/2411.10958) revealed that the FP32 accumulator designed for FP8 matrix multiplication in tensor cores is actually FP22 (1 sign bit, 8 exponent bits, and 13 mantissa bits).
- fast_accum = True uses the tensor core's accumulator for the entire main loop with reduced precision (~22 bits), while fast_accum = False sends the result of the tensor core op to a regular register accumulator in full FP32 precision.
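A hedged sketch of toggling the two paths via PyTorch's private `torch._scaled_mm` API; the signature has shifted across releases and FP8 needs recent hardware (e.g., Hopper or Ada), so treat this as illustrative:

```python
import torch

# FP8 inputs; _scaled_mm expects the second operand in column-major layout.
a = torch.randn(256, 256, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(256, 256, device="cuda").to(torch.float8_e4m3fn).t()
scale = torch.tensor(1.0, device="cuda")  # per-tensor scales

# use_fast_accum=True keeps the running sum in the tensor core's reduced-
# precision (~22-bit) accumulator for the whole main loop; False routes
# partial results through a full-FP32 register accumulator instead.
fast = torch._scaled_mm(a, b, scale_a=scale, scale_b=scale,
                        out_dtype=torch.bfloat16, use_fast_accum=True)
slow = torch._scaled_mm(a, b, scale_a=scale, scale_b=scale,
                        out_dtype=torch.bfloat16, use_fast_accum=False)
print((fast.float() - slow.float()).abs().max())
```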
- Runtime Triggered Module Loading slows vLLM: During vLLM profiling, significant time is spent on "Runtime Triggered Module Loading", but its precise meaning and how to avoid it during profiling are unclear.
- A member shared a trace and attached a [qwen3-1.7b-compile-cudablock.gz] in hopes of finding out more.
GPU MODE ▷ #algorithms (6 messages):
FlashAttention, FA1, FA2, FA3, FA4
- FlashAttention Visualized: A member asked if their interpretation of Flash Attention (animation, source code here) was approximately correct.
- The fire animation is supposed to represent softmax/fused kernel.
- FlashAttention Loop Orders: A member pointed out that the loop order in the original animation was reversed compared to FlashAttention v2 (FA2).
- In FA2, iteration along K/V is the inner loop, and iteration along Q/O is the outer loop.
- FlashAttention Evolves Further: The original visualization was based on FA1, according to the original poster.
- It was noted that FA3 and FA4 also follow the general design of FA2, but are optimized for Hopper and Blackwell architectures, respectively.
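To make the loop order concrete, a minimal NumPy sketch of FA2-style blocked attention with an online softmax; it ignores shared-memory tiling and everything else a real kernel must handle:

```python
import numpy as np

def attention_fa2(Q, K, V, block=64):
    """Outer loop over Q/O blocks, inner loop streaming K/V blocks,
    with a running max m and normalizer l (the online softmax)."""
    n, d = Q.shape
    O = np.zeros_like(Q)
    for qs in range(0, n, block):             # outer: Q/O blocks
        q = Q[qs:qs + block]
        m = np.full(q.shape[0], -np.inf)      # running row max
        l = np.zeros(q.shape[0])              # running denominator
        acc = np.zeros((q.shape[0], d))
        for ks in range(0, n, block):         # inner: K/V blocks
            s = q @ K[ks:ks + block].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - m_new)         # rescale old partial sums
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ V[ks:ks + block]
            m = m_new
        O[qs:qs + block] = acc / l[:, None]
    return O

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 32)) for _ in range(3))
s = Q @ K.T / np.sqrt(32)
ref = np.exp(s - s.max(1, keepdims=True))
ref = (ref / ref.sum(1, keepdims=True)) @ V
assert np.allclose(attention_fa2(Q, K, V), ref)
```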
GPU MODE ▷ #beginner (4 messages):
Model optimization roadmap, Sparse convolution in ONNX Runtime, BEV fusion model
- Seeker asks for Model Optimization Roadmap: A member is seeking a roadmap to learn model optimization techniques, including writing custom kernels, with a focus on SM count and VRAM usage.
- They plan to use a 5060 Ti with 16GB of VRAM.
- Sparse Convolution Support Scarcity in ONNX Runtime: A member is trying to run a BEV fusion model using ONNX Runtime, but the hardware they are using doesn't support PyTorch, and ONNX Runtime lacks support for sparse convolution.
- They are asking if sparse convolution can be replaced with other operators or if anyone has added sparse conv support in ONNX Runtime.
GPU MODE ▷ #irl-meetup (1 message):
apaz: Now in NYC if anyone wants to meet up
GPU MODE ▷ #rocm (8 messages🔥):
rocSHMEM, ROCm-aware open MPI, HIP kernels, ROCm/iris
- rocSHMEM Implementation Inquiry: A member is exploring rocSHMEM implementation similar to HIP kernels using load_inline, encountering errors related to ROCm-aware MPI requirements.
- The member referenced ROCm/rocSHMEM for dependency configurations and suggested incorporating them into the Dockerfile.
- ROCm/iris alternative surfaces: A member suggested trying ROCm/iris as a possible alternative while they investigate the issue.
- The original poster agreed to try it out and expressed enthusiasm for the project, while another user was tagged as a potential user.
GPU MODE ▷ #webgpu (2 messages):
:catgirl5: emoji usage, thinking hard emoji
- Catgirl Emoji spotted!: A member noted "Oh cool a :catgirl5: in the wild", referring to emoji usage in the channel.
- Catgirl becomes Thinking Emoji: A member stated it's weirdly a good "thinking hard" emoji lol in reference to the same.
- The community seems to have latched onto this emojiâs meme potential.
GPU MODE ▷ #self-promotion (1 message):
GPU L2 Cache, Ampere Architecture, CUDA Project Structure, Persistent Memory Accesses
- L2 Cache Persistence Boosts GPU Performance: Leveraging the Ampere architecture, a blog post demonstrates reserving part of the L2 cache for persistent memory accesses to improve GPU performance, detailed in a blog post.
- CUDA Project Structuring with CMAKE Example: The provided code serves as an example of structuring a CUDA project using the CMAKE build system, enhancing code organization and maintainability.
GPU MODE ▷ #reasoning-gym (1 message):
Contributions Welcome, Prototype Sharing, Pull Requests
- Contributions Welcomed for New Tasks: The channel indicated that new task contributions are welcome, encouraging members to share prototypes.
- Alternatively, members can open a PR to the repo for iterative development.
- Prototype Sharing Encouraged: Members are encouraged to share prototypes in the channel to gather feedback and iterate on their ideas.
- Sharing prototypes helps foster collaboration and accelerates the development process.
GPU MODE ▷ #submissions (1 message):
MI300x8, amd-all2all leaderboard
- MI300x8 scores on leaderboard: A submission to the `amd-all2all` leaderboard on MI300x8 was successful at 334 µs.
- AMD all2all benchmark update: The latest result on the `amd-all2all` leaderboard showcases an impressive performance on the MI300x8 hardware.
GPU MODE ▷ #factorio-learning-env (6 messages):
Factorio Crafting Tool, FLE installation issues, Prototype Recipe Retrieval
- Factorio Agent's Prototype Recipe Retrieval: The `get_prototype_recipe` tool retrieves complete recipe information for any craftable item in Factorio, essential for understanding crafting requirements and planning production chains.
- The agent can use the `get_prototype_recipe` action to get a recipe for a single item and call it again if needed to get sub-recipes.
- FLE Installation Experiences Turbulence: A member reported having issues during the installation of FLE.
- They mentioned they will listen into the meeting but will not engage much due to an important presentation afterwards.
GPU MODE ▷ #amd-competition (9 messages🔥):
CLI Tool vs Online Submission, ROCshmem Template, Web Version Organization, Online Testing Env Triton Support, Prize Registration Reminder
- CLI Tool Trumps Online Submission: A participant found that they can use a CLI tool to view settings for `num_experts` instead of submitting online.
- Another participant mentioned that the web submission is the latest effort to make submissions accessible, but it's still in alpha.
- ROCshmem Template Quest: A participant inquired about a ROCshmem template and noted it requires ROCm-aware open MPI, wondering if these are included in the kernel bot workflows.
- No responses were made.
- Web Versionâs Organization Praised: One participant appreciated the improved organization of the web version.
- They suggested adding config-wise runtimes for enhanced helpfulness.
- Triton Support Status Questioned: A participant inquired whether the online testing environment supports Triton.
- No responses were made.
- Prize Registration Reminder: A reminder was issued that participants need to be registered to qualify for prizes, with registrations closing on September 20th.
GPU MODE ▷ #singularity-systems (1 message):
cuBLAS, ROCm, cuDNN, MIOpen
- Deep Dive into BLAS and DNN Internals: A member mentioned they are studying the codebase, and are looking for people with experience in the internals of cuBLAS/rocBLAS or cuDNN/MIOpen.
- They added that there will be "lots more to do these next few weeks" for those with the relevant expertise.
GPU MODE ▷ #general (21 messages🔥):
Pickling Errors, Serialization Issues, NaNs in Triton Kernels, Benchmarking Discrepancies
- Pickling Problem Plagues Python Process!: A user encountered a `TypeError: cannot pickle 'frame' object` during evaluation, stemming from multiprocessing's inability to serialize a specific object being passed between processes; here is the traceback.
- The user was advised to check the output of their custom_kernel function, as the error suggested that the functionâs return value was not serializable.
- NaNs Nab Numerical Nirvana!: The user identified NaN (Not a Number) values in their kernel's output as a potential cause of the serialization error, with a member confirming that NaNs can indeed cause such issues.
- The user initially suspected a grid error leading to the creation of these NaNs and expressed intentions to resubmit the code after fixing it.
- Benchmark Blues Baffle Budding Benchmarker!: Despite passing initial test runs, the user continued to face errors during the benchmarking process, leading to the revelation that test runs and benchmarks are distinct and that the latter is designed to be more complex.
- The user was informed that the presence of NaNs in the benchmark output could still be an issue, even if test runs pass successfully, because we run the code multiple times, up to 100 per size.
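A defensive check along these lines, run before returning from a submission, can surface both failure modes early; torch and the exact harness behavior are assumptions based on the discussion above:

```python
import pickle
import torch

def check_submission_output(out: torch.Tensor) -> torch.Tensor:
    # NaNs/Infs in the result were the suspected root cause here;
    # catch them before the harness does.
    if torch.isnan(out).any() or torch.isinf(out).any():
        raise ValueError("kernel output contains NaN/Inf; check grid bounds")
    # The evaluation runs in a separate process, so the return value
    # must survive a round-trip through pickle.
    pickle.dumps(out.cpu())
    return out
```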
Latent Space ▷ #ai-general-chat (58 messages🔥🔥):
OpenAI Custom AI Chip, Mercor $10B Pre-emptive Offers, Augment (Augie) $85M Series A, OpenAI Responses API, Hugging Face FineVision Dataset
- OpenAI Co-Designs $10B AI Chip with Broadcom: The Financial Times reports OpenAI partnered with Broadcom to co-design a custom AI chip, with mass production slated to start next year, signaling a move away from Nvidia dependency; the deal is reportedly worth ~$10B.
- Community reactions range from skepticism about the chip's quality to speculation that OpenAI will out-compete its own customers; link to article.
- Mercor Receives $10B Pre-emptive Offers: AI-hiring startup Mercor has received unsolicited offers valuing it at ~$10B (5× its June 2025 Series B price) just four months later, spurring jokes about the AI-funding frenzy; link to tweet.
- Augie Raises $85M Series A for AI Logistics: Augment (Augie) announced an $85M Series A (bringing total funding to $110M in just 5 months) to scale their AI teammate built for the $10T logistics sector; link to announcement.
- Augie already helps freight teams handling $35B+ double productivity by orchestrating end-to-end order-to-cash workflows across email, calls, Slack, TMS and more.
- Responses API Myth-busting Thread: A thread cleared up widespread confusion about the OpenAI Responses API, debunking common myths and confirming that Responses is a superset of Completions, can be run statelessly, and unlocks higher model intelligence and 40-80% cache-hit rates; link to thread.
- Developers stuck on Completions are urged to switch to Responses for GPT-5-level agents, with pointers to OpenAI cookbooks.
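For reference, a minimal sketch of a stateless Responses call with the current OpenAI Python SDK; the model name is a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# store=False keeps the call stateless: nothing is persisted server-side,
# which is the point of the "Responses can't be stateless" myth-busting.
resp = client.responses.create(
    model="gpt-5",  # placeholder; use whichever model you have access to
    input="Summarize the common myths about the Responses API.",
    store=False,
)
print(resp.output_text)
```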
- Baseten Bags $150M Series D: Baseten announced a $150M Series D round led by BOND with Jay Simons joining the board; the company powers AI inference for customers like Writer, Notion, Sourcegraph, and others, and welcomed new investors Conviction and CapitalG; link to announcement.
Latent Space ▷ #ai-announcements (4 messages):
AI Engineer CODE Summit 2025, NYC AI Event
- AI Engineer CODE Summit 2025 Announced: The AI Engineer team is launching its first dedicated CODE summit this fall in NYC, gathering 500+ AI Engineers & Leaders alongside top model builders and Fortune-500 users to unpack the reality of AI coding tools - announcement link.
- The summit is invite-only, with two tracks (Engineering & Leadership), no vendor talks, and a CFP open until Sep 15.
- AI Engineer Summit Focus: The AI Engineer CODE Summit 2025 aims to celebrate PMF (Product-Market Fit) while addressing MIT's statistic that 95% of enterprise AI pilots fail.
Latent Space ▷ #genmedia-creative-ai (21 messages🔥):
Nano Banana, AI Girlfriend, AI Design Masterclass, Nvidia Cosmos DiffusionRenderer
- Nano Banana floods timeline with AI art: Logan Kilpatrick tagged @NanoBanana (Google's newest banana-branded image model), prompting a single "hello world" banana billboard and sparking a frenzy of creative prompts from users.
- The thread exploded into a viral AI-art playground, generating art from prompts like an Elon-Sam-Demis-Ilya selfie and Winnie the Pooh in China, while also sparking jokes, praise, and complaints about AI slop (see example).
- AI Girlfriend earns cash: @EyeingAI used DesireBots.com to create an AI girlfriend chatbot named "Ada," charging $9/month and earning $1,142 in a week from 500+ users.
- The process involved a no-code chatbot setup and built-in monetization tools, showcasing a simple way to generate revenue with AI (see tweet).
- AI Design Masterclass: Meng To released a 58-minute tutorial on creating professional-grade designs with AI, using aura.build and its 740 remix-ready templates that export to HTML/Figma.
- His design team shifted from Figma to Aura, now shipping a template daily (vs. one every two weeks), while learning HTML along the way and using Unicorn Studio for animated hero sections (see tutorial).
- Nvidia's open-source AI relighting demo: Nathan Shipley demoed Nvidia's open-source Cosmos DiffusionRenderer, which decomposes short 1280×704 video clips into stable passes (depth, normals, base color, etc.) for relighting with custom HDR maps.
- The tool allows relighting with custom HDR maps and examples include a home movie and a famous film scene, sparking praise for its stability and criticism for its uncanny results and current limits (57-frame max, CLI setup, garbled faces).
DSPy ▷ #general (78 messages🔥🔥):
Voice Agents with DSPy, GEPA Optimization for Prompts, Multi-Turn Conversations, Groq for Inference, RAG vs Fine-tuning
- DSPy-Powered Voice Agents: A Budding Symphony: Members discussed building voice agents with DSPy and explored using GEPA to optimize prompts for frameworks like Livekit and Pipecat.
- One member suggested using the optimized prompt from GEPA as a straightforward string, while acknowledging that this might feel anti-DSPy.
- GEPA: More Than Just Prompt Optimization: It was noted that while DSPy creators might cringe at the term prompt optimization, tools like GEPA can indeed be used for this purpose.
- For prompt creation, it was suggested to set up a rubric-style judge to assess generated responses, especially at the conversation level, and Groq was recommended for inference.
- Multi-Turn Musings: DSPy's Conversational Capabilities: While a member found no satisfying implementation of multi-turn conversations with DSPy or RL applications like GEPA or GRPO, DSPy is fully capable of handling multi-turn conversations using dspy.History.
- However, it was cautioned that defining examples well is crucial, as it's easy to introduce bias when building chat systems.
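A minimal multi-turn sketch using dspy.History, following the pattern in the DSPy docs; the LM configuration line and questions are placeholders:

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder LM

class ChatQA(dspy.Signature):
    """Answer the question, taking prior turns into account."""
    history: dspy.History = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

qa = dspy.Predict(ChatQA)
history = dspy.History(messages=[])

for question in ["What are your opening hours?", "And on weekends?"]:
    pred = qa(history=history, question=question)
    # Append the completed turn so the next call can see it.
    history.messages.append({"question": question, "answer": pred.answer})
```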
- RAG vs Fine-Tuning: The Memory Game: The discussion addressed how to equip voice agents with extensive information (hours, services, pricing, etc.) without runtime latency, with some approaches being fine-tuning or retrieval.
- While fine-tuning can build in memorization, it's a big job; RAG can be as simple as functions or lookup maps, and things like hours of operation don't need a vector database.
- Streaming Strategies: Riding the Token Wave: Members explored the impact of streaming responses (token by token) on user experience, with a key focus on minimizing Time To First Token (TTFT).
- While streaming doesn't reduce TTFT, it enhances user perception by providing immediate feedback, and libraries like Pipecat already do a good job of that too, in the way that they stream frames (i think in 250 ms chunks by default).
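One quick way to see the perceived-latency effect is to measure TTFT directly on a streaming call; a sketch using an OpenAI-compatible SDK, with the model name as a placeholder:

```python
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
first = None

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "What services do you offer?"}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta and first is None:
        first = time.perf_counter()
        print(f"TTFT: {first - start:.2f}s")  # what the user actually feels
print(f"total: {time.perf_counter() - start:.2f}s")
```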
Moonshot AI (Kimi K-2) ▷ #general-chat (75 messages🔥🔥):
Kimi K2 API Credits Giveaway, Anthropic API Integration, Kimi K2 Turbo Preview, Kimi K2 Model Performance, Kimi Starter Subscription
- Kimi Giveaway API Credits Incoming: A user who won the Kimi giveaway was informed that the API credits would be sent shortly and the crew is arranging it.
- The credits were expected to be sent within an hour.
- Anthropic API MIA: A user inquired whether the Anthropic API is available on the new model.
- It was clarified that kimi-k2-turbo-preview points to -0905.
- Kimiâs 0905 Model Debuts: It was confirmed that the turbo model is now using the 0905 model, having been updated from the 0711 model.
- Some users expressed concerns about the new K2 model's tendency to be overly poetic.
- Kimi K2 Team Dreams Big: A member clarified that the team is smaller compared to Grok/OAI, but has big dreams and a big model.
- They added it is a good thing since usually, the bigger the company, the less user interaction there is.
- Coding Focus Confuses Kimi Users: Users are confused by the focus on coding improvements in the new Kimi K2 model.
- One user stated that 0711 is better than 0905, but another user thinks the writing is more detailed & better.
Nous Research AI ▷ #general (65 messages🔥🔥):
real time video AI, Spiking Neural Networks, cameras (image sensors) that are a bit closer to how the human eye works, Meta wristband reads body electricial signals to control smart glasses, Hermes's unique behavior in the husky holdem benchmark
- Spiking Neural Networks Spark Interest: Members discussed Spiking Neural Networks (SNNs) and how they mimic the brain, with one sharing a YouTube video about it.
- Another mentioned cameras and image sensors that work closer to how the human eye does, sharing this video.
- Metaâs Wristband to Read Body Signals: Meta is set to introduce a wristband that reads body electrical signals to control smart glasses, according to this Nature article.
- Hermes Exhibits Unique Holdem Behavior: A member noted that Hermes shows extremely OOD behavior in the husky holdem benchmark, observing a super-conservative play style that no other model exhibits.
- ADHD resources shared!: A member shared resources for ADHD, motivation, learning, and productivity, including a video on Certainty Window, Salience Network and "Push/Pull" Activities, Professor Huberman's Dopamine, Mindset and Drive, and Forming Habits is the Under-rated Strategy to Success.
- Another user chimed in saying, medication was only thing that fixed my adhd but really good tips even with meds on these links.
- Deepmind and Huawei cookin' something special: A member said to keep an eye on Deepmind and Huawei's progress with B. Neural Network, and particularly Huawei's future room-temperature Quantum system, which they claimed could give a real freakout to the U.S. government.
Nous Research AI ▷ #interesting-links (7 messages):
Micro-LLM Experiments, SLM Agents by NVIDIA, Hermes Agent Size
- Lovecraftian LLM Arises!: A member experimented with a micro-LLM trained on H.P. Lovecraft's stories, finding the output quite promising since the loss was still decreasing when training stopped; view the YouTube video.
- They speculate that a 3 million parameter model could become a light chat model with the right dataset and sufficient training.
- NVIDIA unleashes SLM Agents!: A member shared a link to NVIDIAâs research on SLM Agents (project page) and an accompanying paper (arxiv link).
- No further details were discussed about this resource.
- Hermes Agent Targets 30B Parameters: A member stated they are targeting a 30B parameter model for their Hermes Agent.
- No further details were discussed.
Modular (Mojo 🔥) ▷ #mojo (60 messages🔥🔥):
Zig's async IO, Mojo's type system, MLIR, Vectorization of Loops, Compiler Customization
- Zig's Async IO Faces Doubts: Concerns arose in other language communities regarding the viability of Zig's new approach to async IO, while Mojo's type system and effect generics may solve some of the problems, such as vtables everywhere.
- A member mentioned that IO needs to haul around state now, the days of being able to freely call IO from anywhere are likely numbered, referring to this discussion on Ziggit.
- Achieving SIMD Nirvana: Members discussed the goal of writing simple, Pythonic code that automatically compiles to SIMD instructions, using Mojo and MLIR for optimal parallelized assembly without relying on LLVM to correctly vectorize code.
- A member dreams of a world where for loops are automatically compiled for the metal I'm carrying, in this case being 8 or 16 lanes instead of just keep hammering lane zero.
- Unveiling Vectorization Secrets: To fully vectorize code, especially loops, the compiler needs sufficient information about input data shapes or must perform speculation to identify hot loops for vectorization, clarifying that Mojo encourages the use of portable SIMD libraries.
- It was mentioned that on CPUs and AMD GPUs, scalar and vector operations have separate execution resources, and both can ideally run at the same time.
- GPU Kernel Maturity Check-up: A member inquired about the maturity of writing GPU kernels in Mojo, specifically regarding implementing a Mamba2 kernel for use in PyTorch, and was pointed to Modularâs custom kernels tutorial.
- It was clarified that while MAX (Modular's API to a graph compiler) is not primarily targeted at training, it can be used for inference, and MLA has already been implemented for inference (see GitHub).
- Span Abstraction Dream: A member expressed a desire for Span (a contiguous slice of memory abstraction) to become an easily usable, auto-vectorized tool, with algorithms that work on NDBuffer (being ported to LayoutTensor) as part of Span.
- They noted that while existing implementations are manual and parametrized with hardware characteristics, there isn't much compiler magic at hand.
Eleuther ▷ #general (46 messages🔥):
MasterCard Fraud Prevention AI, Obscenity Rule Enforcement, Brand Risk Mitigation, AI-induced psychosis, Semantic Drift
- MasterCard's AI Fraud System Sparks Controversy: MasterCard replaced fraud prevention staff with an AI system, leading to conflicts with merchants over obscenity rule enforcement, detailed in Chapter 5.12.7 of mastercard-rules.pdf.
- The system flags more transactions as obscene, with fees up to $200,000 per violation and $2,500 per day of noncompliance, creating incentives to avoid admitting fault.
- Insufficient Criteria Plague Automated Fraud Prevention: The issue stems from insufficiently specified criteria in the obscenity rules, lacking clear examples of safe items, causing shallow and confusing gradients for the LLM.
- The discussion highlighted how unwritten policies and approaches need to be made explicit to avoid issues that arise from automated enforcement without adequate context.
- Brand Risk Drives Over-Enforcement: Pressures from mutual funds to mitigate brand risk, like board diversity targets, lead to over-enforcement and denial of policy changes within MasterCard's fraud department.
- This over-enforcement hurts merchants, and MasterCard's focus on concealing issues hinders the development of useful monitoring metrics, since any flaw discovered would create a problem someone must solve, so employees protect their careers by not surfacing flaws.
- AI Consultant Questions Automation Plausibility: An AI consultant expressed skepticism about automating their job, citing the need for knowledge, understanding of relevant context, and wisdom, qualities AI lacks.
- Despite this, a medically induced crisis of faith led them to question the value of these qualities, which nonetheless remain unlikely to be automated any time soon.
- ArXiv Endorsement Request Raises Eyebrows: A researcher requested endorsements for an arXiv paper on semantic drift, sparking suspicion due to recent cases of AI-induced psychosis.
- Concerns were raised due to the use of terms associated with AI-generated nonsense, prompting a request to share the paper for review.
Eleuther ▷ #research (6 messages):
GRPO Baseline, SFT + KL regularization
- Members Ponder GRPO Baseline: Members discussed the possibility of using a GRPO baseline for a project.
- One member asked did you have a GRPO baseline?, to which another responded no, this will be next.
- SFT + KL regularization Possibility Raised: A member suggested exploring SFT (Supervised Fine-Tuning) with KL (Kullback-Leibler) regularization as a potential approach.
- This came up in response to a shared link on the topic of RL_Razor, and the member stated oh, would be interesting to try SFT + KL regularization.
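A sketch of what "SFT + KL regularization" might look like in practice: standard cross-entropy on the fine-tuning targets plus a KL penalty keeping the policy near a frozen reference model. Tensor shapes and the beta value are assumptions, not anything prescribed in the discussion:

```python
import torch
import torch.nn.functional as F

def sft_kl_loss(logits: torch.Tensor,      # (B, T, V) policy logits
                ref_logits: torch.Tensor,  # (B, T, V) frozen reference logits
                labels: torch.Tensor,      # (B, T) target token ids
                beta: float = 0.1) -> torch.Tensor:
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         labels.reshape(-1), ignore_index=-100)
    logp = F.log_softmax(logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # kl_div(input, target, log_target=True) computes KL(target || input),
    # so this penalizes the policy drifting away from the reference.
    kl = F.kl_div(ref_logp, logp, log_target=True, reduction="batchmean")
    return ce + beta * kl
```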
HuggingFace ▷ #general (37 messages🔥):
Reward Function Weighting in RL, Anthropic Policy on Jurisdiction Control, Causal Model Training with Attention Bias, Tokenizer and Attention Bias Implementation, RAG Applications with Limited Context Size
- Anthropic's Policy Raises Eyebrows: Members discussed whether Anthropic's new policy prohibiting organizations controlled by jurisdictions where their products aren't permitted is truly about national security or simply corporate self-interest.
- The debate centers on the motivations behind restricting access based on ownership structures.
- Deciphering Reward Weighting in RL: Members were seeking studies on whether weighting reward functions during RL is beneficial, aiming to avoid experimenting without prior knowledge.
- One member shared a document regarding reward weighting in RL.
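Absent prior studies, the usual baseline for the weighting question is a normalized weighted sum of reward components, so relative weights can be tuned without changing overall scale; a sketch with hypothetical component names:

```python
def combined_reward(components: dict[str, float],
                    weights: dict[str, float]) -> float:
    """Weighted sum, normalized so the total scale stays stable
    when weights are tuned or components are added/removed."""
    total = sum(weights.values())
    return sum(weights[k] * components[k] for k in components) / total

# Hypothetical components for an RL run:
r = combined_reward({"correctness": 1.0, "format": 0.5, "length": -0.2},
                    {"correctness": 3.0, "format": 1.0, "length": 1.0})
```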
- Biasing Attention in Causal Models Explored: A member sought advice on modifying causal model training with SFTTrainer to purposefully add attention bias for specific words, referencing the Attention Bias paper.
- Suggestions included checking specific terms/tokens against common tokenizers and considering alternative approaches for loss calculation and gradient signal control.
- Tackling Tokenizers to train for biases: Guidance was provided on how to bias attention toward specific words, recommending testing how they will be tokenized before starting the full training run.
- It was suggested to use tools such as gradio or streamlit to achieve this goal.
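Checking tokenization up front is cheap; a sketch with a placeholder tokenizer showing why multi-token words complicate per-token attention bias:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

for word in ["cat", "hippocampus"]:  # words you might want to bias toward
    ids = tok.encode(word, add_special_tokens=False)
    print(word, "->", tok.convert_ids_to_tokens(ids))
    # A word split into several pieces needs the bias applied to every
    # piece, not just one id -- worth knowing before training starts.
```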
- RoPE to the Rescue in RAG Context Expansion: Members discussed tips for building RAG applications with LLMs under very limited context sizes (4096 tokens).
- One tip involved using models with RoPE and fine-tuning them with a larger context size, referencing this repo and emphasizing that RoPE lets models perform well even on context lengths they haven't been trained on.
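Under a hard 4096-token limit, greedy token-budget packing of retrieved chunks is a common complement to the RoPE-extension tip; a sketch with a placeholder tokenizer and an assumed budget:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

def pack_context(ranked_chunks: list[str], budget: int = 3000) -> str:
    """Keep the highest-ranked chunks that fit, leaving headroom
    (4096 minus budget) for the question and the answer."""
    picked, used = [], 0
    for chunk in ranked_chunks:
        n = len(tok.encode(chunk))
        if used + n > budget:
            break
        picked.append(chunk)
        used += n
    return "\n\n".join(picked)
```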
HuggingFace ▷ #today-im-learning (2 messages):
- No point in sharing: A member said that there is no point in sharing.
- Negative attitude: The general sentiment in the channel appears to be negative, discouraging further contributions.
HuggingFace ▷ #i-made-this (1 messages):
Enron Email Dataset Parser, Structured Parquet Files, Email Analysis
- Enron Emails get Parsed into Parquet: A member uploaded their parser for the Enron email dataset, resulting in 5 structured parquet files.
- The files include: Emails, Users, Groups, Email/User junction, and Email/Group junction.
- Duplicates get Managed via Hashing: Parent and child emails have been parsed, and duplicates are managed both by file and message hashes/caches.
- All messages are included as MD5 hash objects.
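A sketch of message-level dedup via hashing, normalizing whitespace so near-identical copies collide; MD5 matches the dataset's stated scheme, though any stable hash would work:

```python
import hashlib

def message_hash(body: str) -> str:
    # Normalize whitespace so trivially reformatted copies hash the same.
    canon = " ".join(body.split()).encode("utf-8")
    return hashlib.md5(canon).hexdigest()

seen: set[str] = set()

def is_duplicate(body: str) -> bool:
    h = message_hash(body)
    if h in seen:
        return True
    seen.add(h)
    return False
```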
- Dataset good for Group Behavior Analysis: The dataset would be great for analysing behaviour between groups, and for NLP.
- The member noted where to get the dataset but did not include the data itself.
HuggingFace ▷ #computer-vision (1 messages):
FastVLM
- FastVLM could be speedy solution: A member suggested trying FastVLM to address speed concerns.
- They shared a Hugging Face Collection link for the project.
HuggingFace ▷ #smol-course (5 messages):
smol-course, GitHub Readme
- Smol-course location surfaces: A member asked what is a smol-course?
- Another member promptly shared the GitHub link.
- Smol Course Confusion: A member stated that they can't find anything other than the readme and stuff of old 2024 course.
- The member repeated Same here multiple times, indicating difficulty locating the intended course content.
HuggingFace ▷ #agents-course (2 messages):
agents course, greetings
- Course Launch: Sweden & Italy Say Hello!: Enthusiastic members from Sweden and Italy kicked off the agents course today.
- One participant noted some prior experience with AI agents, ready to dive deeper.
tinygrad (George Hotz) ▷ #general (8 messages🔥):
Digital Ocean MI300X errors, Z3 version issues, Kernel removal project
- Digital Ocean MI300X Stable Diffusion Fails: Users encountered an issue running the stable_diffusion.py example on a Digital Ocean MI300X GPU instance, tracing back to some z3 issue.
- The error couldn't be reproduced on a Mac, but mnist_gan.py was tested.
- AMD_LLVM=1 causes TypeError: A TypeError involving unsupported operand types (BoolRef) arose when using AMD_LLVM=1 during a simple mnist training loop.
- George Hotz suggested trying IGNORE_OOB=1, indicating it might be a z3 version issue, with some overloads added in z3>=1.2.4.0, and provided a link.
- Kernel Removal Project Interest: A user inquired about contributing to the kernel removal project.
- No additional information was provided about the nature of contributions that would be helpful.
aider (Paul Gauthier) ▷ #general (5 messages):
Warp Code, Aider's strengths, Aider success stories
- Warp Code wins hearts: A user reported that Warp is getting very nice and that Warp Code feels like the difference between driving a stick and an automatic.
- Warp is good with the embeddings search for when you don't know the files and want to get a sense of a new codebase.
- Aider still shines despite losing Claude: A user switched from aider to Claude Code months back but came back because Anthropic have made some questionable changes, preferring aider for its simplicity and replicating Claude Code's plan mode with /ask.
- The user now uses Gemini 2.5 Pro as the main model, Gemini Flash as the weak model, and Qwen3 Coder as the editor model, using /run to replicate command-line tools like checking the latest git diff or running tests.
- Run command in Aider is a Major Feature: The /run command in aider is a major feature for one user, who noted Aider is good when you know better which files you want to work with.
- They also enquired where they can find Aider's success stories.
aider (Paul Gauthier) ▷ #questions-and-tips (2 messages):
Coding Agent Refactoring, Aider's Code Validation, TreeSitter Validator
- Coding Agent undergoing Refactoring: A member is refactoring their own coding agent, inspired by Aider, to learn more about AI system design.
- They already have a small proof of concept, but are now reading a tutorial to see what others do for a similar project.
- Validating Generated Code Across Languages: The member is seeking advice on how to prevent dangerous code (env leakage, rm -rfs, network requests, etc.) from being generated in any language.
- They considered a Tree-sitter-based validator, and asked how Aider avoids these issues, requesting pointers to relevant files in the repo.
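Tree-sitter would generalize such a validator across languages; for Python alone, the stdlib ast module shows the shape of the idea. A sketch only (a linting pass, not a security boundary), with an illustrative denylist:

```python
import ast

BANNED_CALLS = {"eval", "exec", "system", "popen", "rmtree", "remove"}

def flag_dangerous(source: str) -> list[str]:
    """Walk the AST and flag calls whose name matches a denylist."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            fn = node.func
            name = fn.id if isinstance(fn, ast.Name) else getattr(fn, "attr", None)
            if name in BANNED_CALLS:
                findings.append(f"line {node.lineno}: call to {name!r}")
    return findings

print(flag_dangerous("import os\nos.system('rm -rf /')"))
```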
Yannick Kilcher ▷ #general (4 messages):
Baselines in Papers, LoRA Training
- Guidelines for Reviewing Baselines in Papers Sought: A member inquired about general guidelines for approaching baselines presented in research papers, particularly when unfamiliar with the dataset.
- They expressed uncertainty about judging performance without sufficient knowledge of the dataset, implying a need for more background research, and suggested reading more papers.
- Why LoRA Adds Instead of Replaces?: A member asked why LoRA-trained layers are added to, rather than replacing, the original weight matrix, noting the contrast with other efficient methods like depthwise convolutions.
- They sought a paper, article, or line of reasoning explaining this design choice, and mentioned having an intuition on the matter.
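One way to see the design choice: the low-rank update is additive so the pretrained weight stays frozen, the zero-initialized B means training starts exactly at the base model, and the delta can later be merged (W' = W + scale * B A) or dropped entirely; replacing W outright with a rank-r matrix would throw away the base model's full-rank capacity. A minimal PyTorch sketch of the standard formulation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha/r) * B A x, with W frozen and only A, B trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pretrained weight intact
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # At initialization B = 0, so the module reproduces the base layer.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(64, 64))
out = layer(torch.randn(2, 64))
```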
Yannick Kilcher ▷ #ml-news (1 messages):
erkinalp: https://www.all-hands.dev/blog/the-path-to-openhands-v1
Manus.im Discord ▷ #general (3 messages):
AI Politeness, Scientific Evidence for AI Politeness
- AI Behaving Nicely Scientifically Proven: A user shared a link to a paper claiming to scientifically prove that you should be polite to your AIs.
- It appears to be related to the question of whether or not AIs are more cooperative when you act nicely to them.
- Manus Users Agree on Nice AIs: Two users agreed that they want scientific proof that politeness matters to AIs.
- This is likely related to the earlier link from arxiv on the same topic of AI politeness.