a quiet day
AI News for 10/1/2025-10/2/2025. We checked 12 subreddits, 544 Twitters and 23 Discords (196 channels, and 8860 messages) for you. Estimated reading time saved (at 200wpm): 629 minutes. Our new website is now up with full metadata search and beautiful vibe coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on @smol_ai!
It's a quiet day, so you can check out the latest Latent Space pod with Dylan Field!
Also, invites for the first AI Engineer Code Summit have started going out.
AI Twitter Recap
Video generation: Sora 2, Kling 2.5 Turbo, and Google's "Nano Banana" GA
- Kling 2.5 Turbo (Text/Image→Video): The latest from Kling tops the Artificial Analysis Video Arena for both text-to-video and image-to-video, edging Hailuo 02 Pro, Google's Veo 3, and Luma Ray 3. It generates 5s/10s clips up to 1080p. Notable economics: ~$4.20/min on FAL API vs $4.90 for Hailuo 02 Pro and ~$7.32 for Seedance 1.0, and ~15¢ per video on Kling's Ultra plan via app credits. See model comparisons and pricing in the Arena thread from @ArtificialAnlys and Kling's announcement @Kling_ai.
- OpenAI Sora 2: capability vs. correctness: Live usage shows impressive instruction-following and in-app remixing, but critical evaluations flag physics inconsistencies and marketing polish. See a broad demo roundup @altryne, critiques on "people-pleasing" over physical fidelity @teortaxesTex, and targeted tests where Sora 2 fails physics scenarios that Veo 3 handles better (audio narration correct) @fofrAI, plus a sober overview @Tim_Dettmers.
- Google Gemini 2.5 Flash Image ("Nano Banana") GA: Now production-ready with 10 aspect ratios, multi-image blending, and image-only output. Pricing: $0.039/image on Gemini API (AI Studio + Vertex). Announcements from @sundarpichai, @GoogleAIStudio, and @OfficialLoganK. Also integrated into partner products (e.g., Cartwheel's new motion pipeline) @andrew_n_carr and showcased by Google's developer account @googleaidevs.
- Ecosystem: Synthesia 3.0 adds "video agents" and new workflows @synthesiaIO.
Open-weight model releases: IBM Granite 4.0 and Qwen updates
- IBM Granite 4.0 (Apache 2.0, hybrid Mamba/Transformer): IBM's new family mixes a minority of standard attention layers with majority Mamba layers to cut memory without large accuracy hits. Sizes include Granite 4.0 H Small (MoE 32B/9B active), H Tiny (7B/1B), H Micro (3B/3B) and a 3B dense Micro variant. Key specs: 128K context, Apache 2.0, strong token efficiency. Artificial Analysis measures H Small at 23 on its Intelligence Index (non-reasoning), ahead of Gemma 3 27B (22) and behind Mistral Small 3.2 (29), EXAONE 4.0 32B (30), and Qwen3 30B A3B (37). Micro scores 16, edging Gemma 3 4B (15). Granite is on HuggingFace and Replicate (H Small at $0.06/$0.25 per 1M in/out tokens). Benchmarks: @ArtificialAnlys. Ollama released runnable images for Micro/Micro-H/Tiny-H/Small-H @ollama. IBM Granite is also added to LM Arena @arena, and HF's @ClementDelangue highlights browser/WebGPU demos and HF Enterprise onboarding.
- Qwen updates: Qwen models are among the first supported by Tinker's fine-tuning API @wzhao_nlp, and the Qwen team notes expanded support and open releases @Alibaba_Qwen. Qwen-Image-2509 improves consistency @Alibaba_Qwen; Qwen3 VL 235B is reported as performant at lower cost for some vision tasks @scaling01.
Fine-tuning and systems: Tinker, rank-1 LoRA, MoE support, and inference speedups
- Tinker: a flexible fine-tuning API with LoRA sharing: Thinking Machines' Tinker lets you write a CPU-only training loop and run it unchanged on distributed GPUs, keeping control over algorithms/losses while Tinker manages scheduling, resource allocation, and failures. It supports open models (Llama, Qwen) including large MoE (e.g., Qwen3-235B), and implements LoRA for efficient resource sharing. Summaries: @TheTuringPost, release note @Smol_AI, cookbook/docs: link.
- LoRA without regrets (rank=1): Multiple replications show rank-1 LoRA can match full fine-tuning quality on reasoning tasks while saving ~43% VRAM, enabling RL on larger models; see results and code @zzlccc and a Colab on Qwen3-0.6B OpenR1-Math @ben_burtenshaw. See guidance from "LoRA Without Regret" @TheTuringPost. A minimal configuration sketch appears after this list.
- MoE training and infra: Prime-RL now supports MoE for RL and SFT (Qwen3 A3-30B, GLM series, Moonlight), with significant modeling rewrites to stay Torch Compile compatible while retaining HF ecosystem compatibility @samsja19. On inference, @vikhyatk reports a new engine with 1.3-20x faster completions; production uses QAT for FP8 KV caches and MoE weights (engine proprietary for now). For local/dev infra: MI300X VMs on-demand at $1.99/GPU/hr @HotAisle, vLLM now supports BERT @vllm_project.
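To make the rank-1 recipe concrete, here is a minimal sketch using Hugging Face peft; the model name is just one example (from the linked Colab), and alpha/target modules are illustrative choices, not the replications' exact settings:

```python
# Minimal rank-1 LoRA sketch with peft; hyperparameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")  # example model

config = LoraConfig(
    r=1,                 # rank-1: each adapter is a single outer-product update
    lora_alpha=32,       # effective update is scaled by (lora_alpha / r)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # rank-1 keeps the trainable share tiny
```

Much of the VRAM saving comes from optimizer state: only the adapter parameters carry Adam moments, which is what lets RL run on larger base models on the same hardware.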
RL and reasoning: search-in-training, broadened exploration, latent CoT, front-loaded reasoning
- Train-time search and efficient exploration: DeepSearch moves MCTS into the training loop with Tree-GRPO stabilization and efficient caching/filtering, reaching 62.95% on AIME/AMC with ~330 GPU hours (beating a Nemotron baseline and outpacing standard RL that plateaus even with 1800+ GPU hours) @omarsar0. BroRL scales exploration by increasing rollouts per example into the hundreds, overcoming the saturation seen when only scaling training steps @iScienceLuvr.
- Architectures and training mechanics: A new latent CoT method "thoughtbubbles" inserts input-adaptive latent tokens to allocate more compute without CoT labels, improving perplexity and compute use @houjun_liu with positive reaction @khoomeik. NVIDIA's "Front-Loading Reasoning" finds injecting reasoning during pretraining yields durable gains that finetuning can't recover @__SyedaAkter. A small but impactful MoE tweak, global-batch load balancing (vs micro-batch), yields lower perplexity and clearer expert specialization with minimal code changes @daddyofadoggy (a hedged sketch of the idea follows this list). For sparse diffusion LMs, OpenMoE 2 studies expert-choice MoE × diffusion across wide FLOPs/param regimes, claiming perfect load balance (no aux loss), +20% throughput, and adaptive compute under multi-epoch training @NiJinjie.
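To make the global-batch vs micro-batch distinction concrete, here is a hedged sketch of a Switch-style auxiliary load-balancing loss; this is not the cited work's code, and the distributed averaging is the one-line change the tweak describes:

```python
# Hedged sketch: Switch-style aux loss with optional global-batch statistics.
import torch
import torch.distributed as dist

def load_balancing_loss(router_probs: torch.Tensor,   # [tokens, num_experts]
                        expert_ids: torch.Tensor,     # [tokens] argmax routing
                        num_experts: int,
                        global_batch: bool = True) -> torch.Tensor:
    # f_e: fraction of tokens dispatched to each expert (non-differentiable)
    f = torch.bincount(expert_ids, minlength=num_experts).float() / expert_ids.numel()
    # P_e: mean router probability per expert (gradient flows through this)
    p = router_probs.mean(dim=0)
    if global_batch and dist.is_initialized():
        # the tweak: average statistics across data-parallel ranks so the loss
        # balances the global batch, not each micro-batch separately
        # (AVG needs NCCL; real code would use autograd-aware collectives for p)
        dist.all_reduce(f, op=dist.ReduceOp.AVG)
        dist.all_reduce(p, op=dist.ReduceOp.AVG)
    return num_experts * torch.sum(f * p)
```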
Agents and toolchains: CLI + semantic search, Notebook MCP, browsers, and CLIs
- CLI agents + semantic search beat pure CLI: LlamaIndex's SemTools benchmark (1,000 arXiv papers) shows agents with semantic search produce more complete answers across question types versus agents using only CLI tools; Unix tools remain a strong baseline and SemTools integrates parse (LlamaParse) and semantic search directly into command-line agents (Claude/Gemini CLIs). Results/methodology: @llama_index.
- Executing notebooks via MCP: Goodfire open-sourced Scribe, an MCP-based system enabling agents to run notebook cells and receive Jupyter outputs (text/errors/images). They share lessons on "experimenter agents" vs "software development agents" and the scaffolding needed for scientific workflows @GoodfireAI, blog.
- "AI browsers" and evaluators: Perplexity's Comet is now GA globally, with Comet Plus launching alongside major publisher partnerships; Pro/Max users get Plus bundled @perplexity_ai, @AravSrinivas. Yupp's "Help Me Choose" orchestrates a third model to critique two candidate answers, then has them analyze each other before the user picks, an interesting pattern for adjudication @yupp_ai, @lintool. Google's Jules Tools brings an agentic CLI (npm installable) mirroring browser capabilities @julesagent.
Leaderboards and real-world coding agent metrics
- Claude Sonnet 4.5 tied for #1 on LM Arena: Sonnet 4.5 reaches the top slot alongside Claude Opus 4.1, with strong showings across categories including coding and creative writing (rankings are from tens of thousands of human votes) @arena. Community reports suggest Anthropic continues to ship very competitive coding models @scaling01.
- Open source is closing in for code editing agents: In Cline's diff-edit success tests, GLM-4.6 achieves 94.9% vs Claude 4.5's 96.2% at ~10% of the cost; users report switching workflows accordingly @cline, @nickbaumann_.
- Video Arena reminder: Kling 2.5 Turbo leads both T2V and I2V; details above in the Video section @ArtificialAnlys.
Top tweets (by engagement)
- "We are, in so many ways, literally pretrained models." by @cloneofsimo (4,967)
- Perplexity Comet GA worldwide by @perplexity_ai (2,667)
- Anthropic's "thinking" campaign praise and adoption by @signulll (2,441)
- "Iteration speed is a superpower" by @gdb (1,989)
- Nano Banana GA announcement by @sundarpichai (1,576)
AI Reddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo
1. Sora 2 and WAN 2.2 Video Generation Demos
- Sora 2 is insanely good at stand up comedy (Activity: 437): The post claims a stand-up comedy clip was generated by "Sora 2," presumably referring to OpenAI's Sora text-to-video model (overview). Viewers report highly natural comedic timing and facial-expression sync, implying strong temporal coherence, phoneme-viseme alignment, and fine-grained gesture/micro-expression control; however, the linked video is inaccessible (`HTTP 403`), so provenance, model versioning ("2"), prompts, seeds, or generation parameters cannot be verified from the post. Commenters overwhelmingly praise the realism ("uncanny" timing and natural delivery) and some compare it favorably to human comedians, while at least one asks if it truly came from Sora, highlighting skepticism due to lack of proof or technical details.
- Multiple users highlight the "uncanny" timing between delivery and facial expressions, implying strong audiovisual prosody alignment and keyframe-level gesture/lip-sync. If this is native Sora 2 output, it suggests improved temporal conditioning (beat-aligned micro-expressions, head/eyebrow cues) and actor-like pose control versus prior text-to-video baselines.
- One commenter notes the joke is not original, attributing it to Joan Rivers with a direct quote reference, raising concerns about memorization/regurgitation from training data or prompt-sourced material rather than novel synthesis. This points to content provenance and originality risks in generative video models; see attribution: https://www.imdb.com/name/nm0001672/quotes/.
- Skepticism that this "really came from Sora" flags verification/provenance issues for AI-generated clips (possible editing, dubbing, or pipeline mixing). Technical readers may look for reproducibility details (prompt, seed, runtime), metadata/watermarking, or Content Credentials to validate the generation chain and rule out post-production augmentation.
- WAN 2.2 Animate - Character Replacement Test (Activity: 1439): OP showcases a character-replacement test using WAN 2.2 Animate on clips from the film The Ninth Gate, achieving convincing identity substitution while noting outfit inconsistency because the reference image covered only the head/upper torso (indicating apparel continuity depends on conditioning coverage). The shared video link is a Reddit host that returned `HTTP 403` in external fetch attempts (likely requires login). Commenters emphasize that while the rendering style/quality is mediocre, the integration/substitution is "absolutely amazing." Technical critiques flag lighting mismatches and weak hand fidelity when the region is small, and one asks how long sequences are produced with WAN 2.2 Animate; overall sentiment is that it's a strong demonstration of AI-driven VFX potential.
- Commenters note that despite modest render/style fidelity, the core character integration/substitution is impressively stable (tracking and alignment hold up well), suggesting WAN 2.2 Animate is viable for FX-style character replacement even when aesthetic polish is lacking.
- Technical critiques focus on lighting and small-detail fidelity: one says "Lighting sucks!" and another notes the hands in the first shot are "too small on screen to be properly generated/tracked," reflecting a common failure mode where tiny features lose detail or tracking robustness.
- There's demand for the exact workflow (pipeline and clip-length method). A concrete suggestion is to use a relight LoRA to fix illumination mismatches; others ask how the video was extended, indicating interest in techniques for lengthening sequences while maintaining temporal consistency.
2. OpenAI $500B Valuation + ChatGPT "Think Longer" UX + Silicon Valley Foresight
- OpenAI Valuation Soars to $500 Billion, Topping Musk's SpaceX (Activity: 720): Post claims OpenAI's private valuation has reached ~$500B, surpassing SpaceX, with commenters citing projected 2025 figures of ~$4.3B revenue against ~$6.8B losses, implying very high revenue multiples and deeply negative operating margins. Technical concerns raised include perceived model quality regression (e.g., "GPTs deteriorate") and an enterprise "AI reality check" as competitive pressure from both closed- and open-source models intensifies. An accompanying meme/image underscores skepticism about sustainability (image). Top comments characterize the valuation as a bubble given negative unit economics and crowded competition, arguing many AI vendors may not survive. Others echo that current systems underdeliver versus expectations, citing degradation and unmet enterprise use cases.
- Financials/valuation concern: commenters cite ~$4.3B 2025 revenue vs ~$6.8B losses and a ~$500B valuation, implying >100x forward sales and deeply negative margins for a compute-intensive business. This raises questions about the sustainability of subsidized inference, future price hikes, or cost reductions needed (e.g., model distillation, batching, custom silicon) to justify the multiple without impairing product quality.
- Model reliability/regression: reports of GPT "deterioration" are tied to known behavior-drift issues, where model updates change outputs and quality over time. Prior analyses found sizable month-to-month variance in GPT-4's reasoning/accuracy (e.g., Stanford/UC Berkeley's "How is ChatGPT's Behavior Changing over Time?" showing swings on coding/math tasks: https://arxiv.org/abs/2307.09009), underscoring maintenance/evaluation challenges for production deployments.
- Competitive pressure: the thread notes both free and paid alternatives narrowing the gap, which could compress pricing power. Public evals like LMSYS Chatbot Arena show non-OpenAI leaders (e.g., Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3 70B, Mistral Large) clustered near the top (https://lmsys.org/blog/2024-06-20-arena-hard/), indicating potential commoditization of frontier capabilities and weakening moat assumptions.
- CAN WE PLEASE HAVE A DISABLE FUNCTION ON THIS (Activity: 1478): User requests a toggle to disable the chat UI's "Thinking longer for a better answer" behavior/overlay, reporting it triggers on every prompt even when not in a "Think Longer" mode, suggesting a UX issue or misconfiguration. Comments point out an existing "Instant" setting and that for "thinking" models you can manually choose between "standard" and "extended" thinking, implying the feature is configurable but possibly confusing or inconsistently applied. Commenters split between joking about impatience and a practical note that the instant/standard/extended controls already exist; the thread implicitly debates whether this is a UX bug vs. user settings awareness.
- Existing UI controls already let users tune or avoid slower deliberate reasoning: one commenter asks, "Are you not aware of the 'instant' setting? And if you select the 'thinking' model, you can manually choose between 'standard' and 'extended' thinking." This implies a configurable latency/quality trade-off where `instant` minimizes delay, `standard` balances speed and reasoning, and `extended` maximizes depth at higher latency.
- A power user reports defaulting to thinking mode and even choosing `extended` on desktop, reserving faster modes for trivial lookups: "uses thinking mode by default for nearly all prompts… on desktop even select the 'extended' thinking option." This reinforces a workflow pattern: complex tasks benefit from longer deliberate runs, while simple factual queries are better served by low-latency modes.
- Bro how was the show Silicon Valley so consistently 10 years ahead of its time? (Activity: 8183): The thread asks why HBO's "Silicon Valley" felt a decade ahead of reality; top replies credit the show's accuracy to hiring actual engineers/technical advisors in the writers' room, which grounded portrayals of startup dynamics, infrastructure trade-offs, and compression research. As a concrete example, commenters point to the S1 finale's mathematically worked-through optimization derivation (see this clip: https://www.youtube.com/watch?v=Tx3wDTzqDTs) as evidence of rigor beyond typical sitcom writing. Note: the referenced v.redd.it asset returns `403 Forbidden` without authentication; access requires a logged-in session or authorized Reddit API client. Veteran practitioners describe the series as effectively a "documentary," arguing its prescience stems from embedding real tech people in the creative process rather than relying on generic tech tropes.
- Technical authenticity likely came from hiring actual engineers as writers/consultants, which helps seed plots with real failure modes (scaling bottlenecks, deployment mishaps, VC/IP constraints) and accurate jargon/tooling rather than generic "hacker" tropes. That kind of domain input lets writers plausibly extrapolate near-term ML/infra trends (instead of sci-fi leaps), making storylines feel imminent rather than speculative.
- The "Hot Dog / Not Hot Dog" gag maps to binary classification, which traces back to the perceptron (Rosenblatt, 1957), a linear classifier with well-known limits formalized by Minsky & Papert in 1969 (Perceptron, Perceptrons). A real image-based Not-Hotdog app would typically rely on multi-layer nets (e.g., CNNs) trained with backprop (popularized in 1986) to learn non-linear decision boundaries and visual features (CNN, Backpropagation). Conceptually it's the same task (binary classification), but the implementation leap from a single-layer perceptron to modern deep nets is substantial (data scale, compute, and model capacity); a toy comparison follows below.
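To ground that comparison, a toy PyTorch sketch (shapes arbitrary, illustrative only): both models emit one binary-classification logit, but the perceptron can only draw a linear boundary in pixel space, while the CNN learns non-linear visual features.

```python
# Illustrative only: single linear unit vs. a tiny CNN for hotdog/not-hotdog.
import torch
import torch.nn as nn

perceptron = nn.Sequential(               # 1957-style: one linear decision boundary
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 1),
)

tiny_cnn = nn.Sequential(                 # modern minimum: conv features + ReLU
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 1),           # single logit; train with BCEWithLogitsLoss
)

x = torch.randn(8, 3, 64, 64)             # a batch of 64x64 RGB images
print(perceptron(x).shape, tiny_cnn(x).shape)  # both: torch.Size([8, 1])
```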
3. AI Comedy Threads: "Strangest Flea Market" Pt.7 and Related Skits
- What do you sell at The Strangest Flea Market? Pt. 7 (Activity: 477): Recurring creative/comedy series post ("What do you sell at The Strangest Flea Market? Pt. 7") with a Reddit-hosted video link `v.redd.it/x8rnhfkoulsf1` that returns `HTTP 403 Forbidden` and a Reddit network-security block page requiring login/OAuth, indicating application-layer access controls (session/cookie or OAuth gating) and likely CDN/bot protections. Comment references suggest serialized running gags (pig Latin bits; a "Korean-speaking vegetable"), but the primary media asset is inaccessible without an authenticated session or API token. Comments are uniformly positive and request expansion of the "Korean-speaking vegetable" motif; no technical debate present.
- What do you sell at The Strangest Flea Market? Pt. 7 (Activity: 475): Short-form comedy sketch post "What do you sell at The Strangest Flea Market? Pt. 7," hosted on Reddit video (v.redd.it), is currently inaccessible to unauthenticated clients (`HTTP 403 Forbidden`, OAuth required). From comments, the piece is part of a recurring surreal/absurdist series and includes a Pig Latin wordplay gag and an explicit nod to Tim Robinson's "drive-thru" bit from I Think You Should Leave (show info). Commentary is uniformly positive; the only technically notable observation is the intertextual reference to Tim Robinson's sketch style and the inclusion of Pig Latin as a stylistic device.
- A creator highlights compositional and control limits in current image models, specifically naming Midjourney (https://www.midjourney.com), "Seedream," and FLUX (e.g., https://huggingface.co/black-forest-labs/FLUX.1-dev), noting it's still "boring just doing new single characters and objects." Despite having "a few thousand" followers for AI video content, they report that these models lack robust multi-subject scene construction and consistency needed for richer video pipelines, expressing a desire for next-gen models with better scene complexity, control, and coherence.
- Is that math?? (Activity: 477): Post titled "Is that math??" links to a v.redd.it video that currently returns `HTTP 403 Forbidden` with a Reddit network-security block, indicating access requires authentication (login or OAuth token), so the actual content is unavailable. From comment context, the thread likely centers on physics/relativity humor (Einstein references, non-inertial frames), with no technical artifacts, benchmarks, or code shared. Top comments riff on "releasing the Einstein files," expect a relativity joke about speed limits in non-inertial frames, and declare a "new meme era," implying a lighthearted, meme-forward reception rather than substantive technical debate.
- Good use of AI .. I laughed and almost choked lmfao (Activity: 5333): A short v.redd.it clip (link) appears to showcase a prank built on convincing AI-generated photos, raising questions about whether the accompanying script/narration was also AI-authored. Technically, the thread underscores how easily consumer-grade generative tools can compose multi-modal, high-believability hoaxes targeting non-technical audiences, illustrating the social-engineering risk surface of realistic image synthesis and scripted context. Commenters debate if the script was AI-generated and suggest using examples like this to train older relatives about AI-enabled manipulation; others criticize the prank as irresponsible or harmful, noting the ethical line when shocking family members for laughs.
- The only quasi-technical thread notes that beyond AI-generated photos, the "script" may also be AI-produced, implying a multi-modal fabrication workflow (text + image) rather than a single-modality deepfake. Another comment frames this as a social-engineering vector for manipulating less tech-savvy relatives, but the discussion contains no implementation specifics, model names, or evaluation details (e.g., detection methods, benchmarks, or pipeline components).
- I hope the White House doesn't sue us (Activity: 1287): Post appears to showcase a highly realistic AI-generated video (deepfake) of Donald Trump, with commenters noting that Sam Altman also appears and looks synthetic. The original asset at v.redd.it (link) is not directly accessible without OAuth/login (`HTTP 403 Forbidden`), so the clip's authenticity and provenance cannot be independently verified; access requires Reddit login (link) or support assistance (link). Discussion highlights rapid gains in generative video fidelity and the related authenticity/verification and legal-exposure concerns implied by the title. Top comments emphasize unprecedented realism (e.g., "most realistic video of Trump I've EVER seen"), question whether parts are real (Altman "looks kinda artificial"), and suggest an adversarial legal stance if threatened with a lawsuit.
- Perceived photorealism threshold: multiple users misidentified the clip as real, indicating state-of-the-art AI video generation has crossed a plausibility boundary where casual viewers can't reliably distinguish synthesis from capture, especially in political-context footage. This highlights practical challenges for detection and provenance (e.g., watermarking/metadata) as distribution detaches content from original labels.
- Residual uncanny cues: a commenter noting Altman "looks kinda artificial" points to remaining artifacts in facial modeling (micro-expressions, temporal coherence, and skin reflectance) that can still betray synthesis to attentive viewers. The mixed reactions suggest quality is scene- and identity-dependent, with failures typically surfacing under close-ups, complex lighting, or rapid expression changes.
AI Discord Recap
A summary of Summaries of Summaries by gpt-5
1. IBM Granite 4.0 Hybrid Models Launch
- Granite 4.0 Goes Hybrid, Open, and Enterprise-Ready: IBM announced Granite 4.0 with a hybrid Mamba/Transformer architecture, open-sourced under Apache 2.0, cryptographically signed, and billed as hyper-efficient without performance loss, with broad availability via partners like Hugging Face, LM Studio, NVIDIA NIM, Ollama, and Replicate (IBM announcement).
- The community debated its new ISO 42001 credential, with one user calling it "totally useless certification" while others focused on practical access paths and enterprise distribution (IBM announcement).
- Granite's Hybrid Attention: Active Units at Scale: Shared specs highlighted hybrid attention across sizes (2B dense, 7B with 1B active, and 32B with 9B active), with FIM support and no positional encoding, aimed to avoid degradation beyond 128k context (IBM Granite HF collection).
- Users noted smooth paths to run as GGUF or fine-tune via Unsloth guides and assets, tightening the loop from model zoo to training stack (Unsloth Granite 4.0 guide, IBM Granite HF collection).
2. Unsloth Training Stack: Docker, RL Speedups, and New Tricks
- Containers Conquer Config Chaos: Unsloth shipped a cross-platform Docker image with a step-by-step guide, while users shared manual xformers build scripts for Blackwell (SM_12) to unlock latest kernels (Docker guide, Docker Hub).
- The flow targets frictionless training on Windows/Linux and advanced GPU stacks, with docs also covering Granite 4.0 fine-tuning on the same pipeline (Unsloth Granite 4.0 guide).
- RL at Ludicrous Speed: Unsloth reported the fastest gpt-oss RL loops with GSPO, plus VLM RL that is 2× faster, uses 90% less VRAM, and supports 10× longer context via kernel and weight-sharing tricks (gpt-oss RL blog, VLM RL blog).
- Early testers praised the throughput for rapid experimentation, framing the stack as a practical on-ramp for large-scale reasoning RL and vision-language training workloads (gpt-oss RL blog, VLM RL blog).
- Tversky Tricks and Leaner Losses: A semi-reproduction of GPT-2 Tversky-All for a llama-like architecture landed with code and a test model (claimed 300B tokens on a 3090 Ti in ~1 day), while practitioners recommended Linear Cross Entropy via Dao-AI Lab's quack to speed training (Architecture-Tversky-All, HF test model, LCE impl line, quack LCE).
- Community tips emphasized sequence-packed varlen flash-attn and careful kernel selection for wall-clock wins, pairing lean losses with efficient data layouts to cut epochs (varlen MHA example); a minimal varlen sketch follows.
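For readers unfamiliar with the packed/varlen pattern, a minimal sketch of flash-attn's variable-length entry point (assumes a CUDA device and the flash-attn package installed; sizes are toy values):

```python
# Toy varlen example: three sequences (lengths 5, 3, 7) packed with no padding.
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_varlen_func

seqlens = torch.tensor([5, 3, 7], dtype=torch.int32, device="cuda")
cu_seqlens = F.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))  # [0, 5, 8, 15]

total, heads, dim = 15, 8, 64
q = torch.randn(total, heads, dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=7, max_seqlen_k=7,
    causal=True,  # attention stays within each packed sequence's boundaries
)
print(out.shape)  # (15, 8, 64): one output row per real (non-pad) token
```

Because no pad tokens exist in the flat batch, every FLOP is spent on real tokens, which is where the wall-clock win comes from.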
3. GPU Systems: Determinism, Flash-MoE, and Kernel Fusion
- Determinism Tames the Dice Roll: Thinking Machines detailed defeating non-determinism in LLM inference and released Flash-MoE, a variant of Flash-Attention for sparse-expert setups (Defeating Non-Determinism, Flash-MoE site).
- Engineers flagged stable reproducibility as essential for debugging and benchmarking model traces, positioning Flash-MoE as a practical building block for scalable MoE inference (Defeating Non-Determinism, Flash-MoE site).
- NVIDIA Papers Fuse and Specialize: NVIDIA published compiler work on scheduling and warp specialization with benchmarks vs FA3 (Cypress, PLDI 2025) and on distributed kernel fusion for end-to-end efficiency (Legate Kernel Fusion, ASPLOS 2025).
- Discussion focused on mapping these techniques to production tensor programs and cluster-wide execution graphs to reduce launch overheads and improve E2E throughput.
- JAX Blackwell Matmul Masterclass: JAX released a tutorial on achieving SOTA matmul performance on Blackwell GPUs with Pallas, covering tiling, memory movement, and kernel authoring best practices (JAX Blackwell matmul tutorial).
- Practitioners highlighted the guide as a blueprint for hand-tuned GEMM kernels that translate to real wins in training and inference pipelines.
4. OpenRouter: Routing Metrics, Fees, and New Models
- Performance Plots Prompt Quantization Questions: OpenRouter launched a Performance Tab that visualizes provider metrics per model, sparking calls to filter by quantization (e.g., FP4 vs BF16) to avoid misleading comparisons (Performance Tab post).
- Users requested a dropdown for quant levels and noted that fair apples-to-apples comparisons require normalizing for precision, context, and tool-use settings.
- BYOK Clarified: 0% Fee, Not Free Compute: The "1M free BYOK requests/month" promo waives OpenRouter's 5% commission for the first million requests, but users still pay the underlying provider's API bill (announcement).
- Several suggested clearer wording like "1M monthly BYOK requests at 0% fee" to avoid confusion about actual inference costs (announcement); toy arithmetic appears below.
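To spell out the arithmetic (assuming the 5% commission figure discussed; dollar amounts are made up):

```python
# 0% fee is not $0 cost: the provider's own bill remains.
provider_bill = 100.00               # what the upstream provider charges monthly
normal_total = provider_bill * 1.05  # with OpenRouter's usual 5% BYOK commission
promo_total = provider_bill          # promo waives the 5%, nothing more
print(normal_total, promo_total)     # 105.0 100.0
```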
- Qwen's Image Editor Enters the Ring: Alibaba Qwen introduced a new image-edit model (not text-to-image), with devs sharing the launch and seeking Apple Silicon paths (Qwen announcement, community post).
- Early chatter focused on editing-only constraints and integration questions, with interest in local M-series acceleration.
5. LMArena: Reasoning Trace and Leaderboard Shifts
- Watch Models Think Before They Speak: LMArena enabled Reasoning Trace for reasoning models across Side-by-Side and Direct chat, letting users see the model's work pre-answer (Side-by-Side, Direct).
- Power users welcomed the added transparency to debug reasoning chains, compare models' scratchpads, and sanity-check intermediate steps.
- Claude Sonnet 4.5 Crowns the Text Charts: Claude Sonnet 4.5 tied Claude Opus 4.1 for the #1 spot on the Text Leaderboard, and the 32k thinking variant replaced 16k in production flows (Text Leaderboard).
- Community remarks praised Hard Prompts, Coding, and Creative Writing results, aligning perceived quality with the updated thinking window.
Discord: High level Discord summaries
Perplexity AI Discord
- Perplexity Kills o3, Shills GPT-5: Perplexity deprecated the o3 model from their model selector and encourages users to transition to GPT-5 Thinking.
- Perplexity claims GPT-5 Thinking offers stronger performance and ongoing support.
- Discord Desktop Saves Comet Quest: Users are downloading the Discord desktop app to complete the Comet quest and claim 5k orbs.
- Some users are having trouble finding the quest in the Discord app and are advised to check the pins!
- Privacy Put to the Test: A user shared a memory with a combo of English, Finnish, Japanese and Spanish, sparking a privacy discussion.
- Another user stated they could share the prompt, but wouldn't go through their memories to snip out the private ones, doubting they're the ones affecting it.
- Comet Browser Fails to Launch: A user shared a screenshot of their success, pointing out the browser's opening is incredible.
- Others noted that it still def needs work, as its just like any basic browser and more annoying since you cant use google as primary and shift enter for the AI.
- Sonar-Pro API Returns Rotten Resources: A user reported that the Sonar-Pro API is generating resources that lead to 404 errors and asked for a way to filter results.
- They hope to only receive resources that are confirmed to exist and be available to the public, avoiding 404 errors.
LMArena Discord
- Sora 2 Launch Triggers Hype: Community members eagerly await Sora 2âs arrival on the platform, anticipating its impact and comparing it to video models like Veo 3.
- Enthusiasts expressed excitement and hoped to see it benchmarked on LMArena.
- Gemini 3 Release Speculation Intensifies: The community is buzzing about the impending release of Gemini 3, with discussions focusing on its potential competitiveness regarding ratelimits.
- A leak claimed an October 9th release date, further fueling the anticipation.
- 4o Model Retirement Incites Disappointment: Users expressed disappointment over the limited availability and eventual retirement of the 4o model on LMArena.
- One member lamented their "addiction" to 4o, highlighting the difficulty in finding a suitable replacement.
- Ethical boundaries debated: Concerns were raised about OpenAI's data usage practices, with one user jokingly admitting to sending sensitive government data to lmarena.
- Another member pointed out that it was a wild thing to admit in a discord chat.
- Claude Sonnet 4.5 Takes #1 on Text Leaderboard: Claude Sonnet 4.5 impressively tied with Claude Opus 4.1 for the #1 slot on the Text Leaderboard.
- It is also performing well across categories such as Hard Prompts, Coding, and Creative Writing, garnering positive community discussion in the dedicated channel.
Unsloth AI (Daniel Han) Discord
- Blackwell Manual Compiling Bonanza Begins: Members discussed manually compiling xformers on Blackwell GPUs, with one sharing a script using `pip uninstall -y xformers`, `git clone`, and `python3 setup.py install` to manually compile xformers for compute capability 12, as well as the Docker Hub link for the updated image.
- This is necessary to use the latest GPUs for accelerated computing.
- Docker Debuts for Unsloth Training: Unsloth released a new Docker image for training on Windows/Linux without dependency issues, detailed in their guide and available on Docker Hub.
- This aims to resolve dependency conflicts and streamline the setup process for users on different operating systems.
- Synthetic Data Surge sans vLLM: Members discussed generating synthetic datasets without relying on vLLM, suggesting the use of the OpenAI package for async requests to a local server (a minimal sketch follows below), along with a pointer to meta-llama/synthetic-data-kit.
- One member noted all Unsloth notebooks currently use vLLM.
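A minimal sketch of that vLLM-free approach, assuming an OpenAI-compatible local server (for example one exposed by llama.cpp or LM Studio) at a placeholder URL; the model name and prompts are illustrative:

```python
# Async fan-out of generation requests to a local OpenAI-compatible server.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def generate(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="local-model",  # placeholder: whatever the server advertises
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Write one question-answer pair about topic {i}." for i in range(8)]
    rows = await asyncio.gather(*(generate(p) for p in prompts))  # concurrent calls
    print(rows[0])

asyncio.run(main())
```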
- Tversky-All GPT2 Gets Llama-Like Upgrade: A member released a semi-reproduction of the GPT2 Tversky-All, using the Tversky-All strategy but for a llama-like model, available at CoffeeVampir3/Architecture-Tversky-All.
- The test model is available at HuggingFace; it was trained on 300 billion tokens on a 3090 TI in about a day.
- GGUF Conversion Woes Plague Users: Users are encountering issues when trying to convert models to GGUF format, specifically when using the push_to_hub_gguf function with f16 quantization, and were advised to perform the conversion manually until a fix is pushed (a manual-conversion sketch follows below).
- A member reported a ValueError related to mapping tensor "model.layers.0.self_attn.q_proj.base_layer.weight".
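For the manual route, a hedged sketch invoking llama.cpp's converter (script name and flags as found in the llama.cpp repository; directory and file names are placeholders, and any LoRA adapters should be merged into the base model first):

```python
# Manual GGUF conversion by shelling out to llama.cpp's converter script.
import subprocess

subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py",
        "merged_model_dir",              # placeholder: merged HF model directory
        "--outtype", "f16",              # same precision the failing path targeted
        "--outfile", "model-f16.gguf",
    ],
    check=True,  # raise if the converter exits non-zero
)
```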
OpenAI Discord
- Sonnet 4.5 Pricing Debated: Members debated cost-effectiveness of Sonnet 4.5 versus GLM 4.6, with some pointing out that GLM 4.6 is six times cheaper.
- Some users felt Sonnet 4.5 performed similarly to 3.7, while others favored 4 over 3.7 in Copilot.
- Server Overrun by Sora Fanboys: A member raised concerns about the server being overrun by Sora users, who critiqued OpenAI's marketing for triggering the influx.
- The member suggested channel names be dynamically updated based on current discussion topics using LLMs.
- Deepfake Drama Divides Users: A user questioned the irony of an app supporting deepfakes while criticizing the generation of photorealistic AI images.
- This sparked discussions about forwarding feedback to relevant channels amidst a flood of code please requests.
- Sora as Social Media Central: A user suggested Sora should integrate as a social media platform like TikTok, enhancing user experience with ChatGPT, similar to image generation.
- Another user proposed implementing a credit system for Sora, allocating more resources for video generation with daily or weekly usage limits.
- Users Debate Square Images for Sora: Members discussed best practices for image generation in Sora, with one asking if portrait mode works better than landscape mode.
- Another member replied that visual tokens are arranged in a grid, so square images will probably generate the best results from images.
LM Studio Discord
- Snapdragon X Elite Specs Spark Debate: A user shared the specs of their Microsoft Surface Pro with a Qualcomm Snapdragon X Elite (12-core, X1E80100 @ 3.40 GHz) and 16 GB of RAM.
- After seeing a mysterious artifact, they asked if LLM opinions were trustworthy.
- Quantization Quandaries Questioned: Members explored how quantization impacts knowledge retention in language models, with lower quantization potentially impacting smaller models because reason bits are lost.
- A member shared a funny sentiment about what happens when the quantization level gets too extreme: you get too quantised and suddenly your mixing yanderes and petting dogs in a way you where not expecting.
- GPT-OSS: Openly Safe Substitute Ships: The release of GPT-OSS, a model that behaves similarly to GPT-4o, was announced.
- Members noted it assumes a lot of information if not provided with enough details.
- Arc B50 Pro Bogs Bandwidth Battle: A member benchmarked Arc B50 Pro cards against an RTX 4080 Super, revealing the B50s have boatloads of VRAM but abysmal memory bandwidth, resulting in lower token rates (7-8 Tps for a 12B q8_0 model compared to 30+ Tps on the 4080).
- However, at default context (4k), the B50s got 32 Tps while the 4080 got 42 Tps.
- DDR3 Dreams Dashed for GPU Deployment: A user suggested using cheap DDR3 boards with multiple PCIE 16x slots to accommodate 6x GPUs, combined with RAIDed SATA SSDs for faster load times, referencing an eBay listing for X99 boards.
- Concerns were raised about the memory bandwidth (68 GB/s with DDR4) and the potential bottleneck compared to modern standards, with a user saying that on ddr3 you max out at like 50gb/s (a back-of-envelope sketch follows).
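A quick back-of-envelope check of why bandwidth dominates here (assumption: decode is memory-bound, i.e., each generated token streams all active weights once; the 4080 figure is its published spec of roughly 717 GB/s):

```python
# Upper-bound tokens/sec if every generated token reads the full weight set once.
def max_tokens_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
    return bandwidth_gb_s / weights_gb

weights_gb = 12.5  # a 12B model at q8_0 is roughly this many GB of weights
for name, bw in [("DDR3 ~50 GB/s", 50.0), ("DDR4 ~68 GB/s", 68.0), ("RTX 4080 ~717 GB/s", 717.0)]:
    print(f"{name}: <= {max_tokens_per_sec(bw, weights_gb):.1f} tok/s ceiling")
```

Real token rates land below these ceilings (kernel overheads, KV-cache traffic), but the ordering matches the reported 7-8 vs 30+ Tps gap.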
Cursor Community Discord
- Cursor Integrates Git Worktree: Users discovered Git Worktree setting within the beta tab of settings, encouraging its use in the agent window.
- It appears that Git Worktree integration is available in the Early Access or Nightly Cursor versions.
- Cursor Beta Functions Spark Curiosity: Members discussed using beta functions in Cursor, recommending it for early access to features, fun debugging, and helping improve Cursor; currently, afterFileEdit is the only available hook.
- The Extension RPC Tracer is available for checking RPCs during beta function use.
- Typescript Refactor Triumph: One user reported a successful full Typescript refactor using Cursor after four prompts, using a follow-up master prompt for auditing.
- Using Cursorâs Plan mode and tracking workflow status in the Nightly version were suggested for improved efficiency.
- MacBook Suffers Meltdown from Cursor: A user reported Cursor causing their MacBook Air M4 to crash due to high memory usage, spiking to 96GB possibly related to chats or agent processes; resolved after rebooting.
- Members suspected a memory leak, noting that MacOS versions have a higher incidence, with downgrading suggested as a workaround.
- Cursor Hackathon on the Horizon?: A member inquired about interest in a Discord Cursor Hackathon, to implement solutions and side projects.
- Interest was expressed in sponsored hackathons with free credits, with a suggestion to make the hackathon remote friendly to accommodate different time zones.
OpenRouter Discord
- OpenRouter Rolls Out Provider Performance Tab: OpenRouter released a new "Performance Tab" visualizing provider performance for a given model, prompting a discussion about fair comparisons between providers using different quantization levels.
- A user suggested adding a filter dropdown to account for different quant levels like FP4 and BF16 to prevent misleading comparisons.
- BYOK Promo Sparks Confusion: Users debated the "1 million free BYOK requests per month" offer, clarifying it waives OpenRouter's 5% commission fee for the first million requests, but users still pay the provider directly for API usage, according to OpenRouter documentation.
- Some users initially thought the offer provided completely free requests, leading to suggestions for clearer messaging, such as "1M monthly BYOK requests at 0% fee".
- Grok Gets Roasted, Sonoma Soars?: A user tested Grok 4 Fast and called it "way dumber than Sonoma", saying it "fails constantly" and disregards format requirements.
- Another user speculated that Grok 4 Fast "reeks of… Llama..?", expressing frustration with its inconsistency.
- Gemini Pro Gets Glitchy: Users reported that Gemini Pro was responding with "weird stuff", failing to use tools correctly, and exhibiting "unacceptably slow" performance via the OpenRouter API.
- Reports suggest this may be a common issue with Gemini 2.5 Pro, and one user recommended trying Vertex as an alternative provider.
- Qwen's Image Edit Arrives!: Members shared Alibaba's new Qwen image model, noting it was only an image edit model, and one user shared this post announcing it.
- Another member expressed interest in running it on Apple Silicon.
Eleuther Discord
- Exploring Perplexity AI Framework: Members discussed the Perplexity AI framework and its associated GitHub project, particularly focusing on LLMs using similar attention matrices.
- The discussion considered efficient attention mechanisms like Deepseek Sparse Attention as an example of top-k attention, questioning potential issues compared to sliding window attention.
- Gradient Descent Dynamics Paper Hailed: A member praised a paper on the dynamics of gradient descent (centralflows.github.io) for addressing loss spike dynamics and impacting Adam's beta2.
- Despite its low citation count, the paper was lauded for its solution and impact, being considered the paper of the year by one member.
- Symmetry Transformer Shows Promise: Experiments with a symmetry transformer (GitHub repo) showed that predicting the current and previous token with separate heads improved validation loss in later training runs.
- Initial results indicated that the baseline model performed better, but the symmetry model later improved after more training.
- Questioning AUNN's Practicality: The practicality of AUNN (Augmented Neural Networks) was debated, raising concerns about its efficiency and the absence of a functional prototype beyond a toy example, ethan-w-roland/AUNN.
- The discussion stated that the proposer of AUNN focused more on MLPs than Attention, and was combative towards counterarguments.
- Transformers are 2D Slices: The guild discussed that Transformers optimize by splitting a big 2D problem (sequence, channels) into slices.
- A member stated that a giant MLP applied to the whole problem would work fine, but it's intractable that way, and that Transformers are used because they are just cheap.
GPU MODE Discord
- Benchmarking Brainstorming Begins: Members requested a good guide for benchmarking and were pointed to this arXiv paper, this article on kernel benchmarking, and this YouTube video.
- One of the members described their previous benchmarking work as maybe the best benchmarking effort.
- Non-Determinism Drops Dead: Thinking Machines posted a blog on defeating non-determinism in LLM inference.
- They also released Flash-MoE, a variant of Flash Attention.
- Nvidia Compiles Fresh Code: Nvidia is working on compiler techniques for scheduling and warp specialization, with benchmarks against FA3 detailed in their paper.
- Nvidia is fusing kernels in a distributed setting as outlined in this paper.
- Linear Cross Entropy for LLM Training: Linear Cross Entropy is recommended for accelerating LLM training, along with the Quack optimization library, specifically its linear cross entropy implementation.
- Sequence-packed or "unpadded" training is identified as a highly impactful optimization, particularly with techniques like flash attn varlen, see this implementation; a chunked sketch of the linear-cross-entropy idea follows below.
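To illustrate the principle (this is not quack's kernel, just the shape of the idea): fuse the LM head with the loss by iterating over token chunks, so the full [tokens, vocab] logits matrix is never materialized at once; a real fused implementation also streams the backward pass.

```python
# Chunked "linear cross entropy": logits exist only one slice at a time.
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(hidden: torch.Tensor,   # [tokens, d]
                                 lm_head: torch.Tensor,  # [vocab, d]
                                 targets: torch.Tensor,  # [tokens]
                                 chunk: int = 1024) -> torch.Tensor:
    total = hidden.size(0)
    loss_sum = hidden.new_zeros(())
    for i in range(0, total, chunk):
        logits = hidden[i:i + chunk] @ lm_head.T         # small [chunk, vocab] slice
        loss_sum = loss_sum + F.cross_entropy(
            logits, targets[i:i + chunk], reduction="sum"
        )
    return loss_sum / total  # mean token loss, identical to the unchunked value
```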
- Cooperative Group Aligns: A member asked about the `alignment` argument in `CooperativeGroup.__init__`, specifically what it does and why it must be 32 if `size` is 32 but not for other values, in the Cutlass channel.
- Another member responded that this check exists because 32 happens to be the warp/warpgroup granularity; these are the common cases warranting special checks to prevent bugs.
Latent Space Discord
- Musk Envisions AI-MMOs: Elon Musk is discussing co-developing an AI-integrated MMO (AIMMORPG) with Eve Online's creators, aiming to exploit unique AI capabilities.
- A user speculated that AI would be a "natural fit" within such a game.
- Karpathy's Koans on Bitter Lesson: Karpathy summarized the Dwarkesh-Sutton podcast, highlighting Sutton's doubts that LLMs fulfill his thesis.
- Karpathy acknowledges the practical bootstrapping offered by pre-training, while also suggesting that bigger paradigms await, and researchers should look to animal intelligence for inspiration.
- Hume AI Hits Hyperspeed with Octave 2: Hume AI unveiled Octave 2, their next-gen multilingual text-to-speech model, now supporting 11+ languages with a 40% speed boost (<200 ms latency) and 50% cost reduction.
- The release includes multi-speaker chatter, improved pronunciation, new voice-conversion and phoneme-editing tools, and a 50% discount on their Creator plan during October.
- Mistral Drafts Mathletes: Albert Jiang announced that Mistral AI is forming a new formal-math research team after their $2B funding.
- They are seeking AI talent for an all-in-one prover/autoformalizer/agent, offering elite collaborators, hundreds of GPUs per employee, open research, top salaries, and offices in Paris, London, and Palo Alto; the job opening is advertised here.
- Figma's Field Guide to AI: The Latent Space podcast featured Figma's co-founder Dylan Field discussing Figma's AI Playbook.
- The episode explores surfacing good design in the era of vibe-coding, Figma's Make, MCP for "tasting" agents, and the future of fast-fashion SaaS (link to X, link to Xcancel).
Nous Research AI Discord
- Hermes Model Claims Close to GPT-5: A member inquired whether Nous Research tuned models are comparable to GPT-4.5, leading to a response that these models are closer to GPT-5 or Gemini.
- Ironically, when the member queried Gemini about alternatives, Hermes was among the options it suggested.
- Veo3 Eclipses Sora?: A user expressed a preference for Veo3 over the latest Sora, sharing a Prompt_theory.mp4 as part of their discussion.
- No further details were given to illustrate why Veo3 was the preferred option.
- Granite Models Showcase Hybrid Attention: IBM Granite language models feature hybrid attention in models such as 2B dense, 7B (1B active), and 32B (9B active), as outlined in a shared Hugging Face collection.
- These models support FIM (Fill in the Middle) and lack positional encoding, which prevents performance degradation when processing contexts beyond 128k.
- Qwen 30B A3B Thrives on CPU: Members find Qwen 30B A3B is well-suited for CPU usage, with one user reporting performance metrics on a Ryzen 7 5700G CPU with 32GB of RAM.
- Specifically, Qwen 3 30B A3B at Q6_K_XL achieves 48 TPS processing and 10.5 TPS generation speed at 1024 tokens of context.
- LLMs Caught in Web of Deceit?: A member shared their preprint about strategic LLM deception.
- The study uses sparse autoencoders (hosted by Goodfire AI) to show how current methods fail to detect the internal features driving strategic LLM deception, highlighting a tangible path to closing the autolabel gap.
Yannick Kilcher Discord
- Deepmind Code Incompleteness: Members joked that Deepmind does extra work to avoid sharing their implementations, making it unclear how it works as part of a larger system, citing their experience implementing V-MPO.
- They noted that Deepmind's code is often sophisticated, but they piece it out in ways that obscure its overall functionality.
- HuggingPapers Code Fails to Run: Members noted that code from HuggingPapers doesnât run because it doesnât import RoPE.
- The original poster of the code seemingly indicated that the user is supposed to implement it themselves.
- IBM Granite 4.0 Hybrid Architecture: IBM launched Granite 4.0, the next generation of IBM language models, featuring a new hybrid Mamba/transformer architecture that greatly reduces memory requirements without sacrificing performance, open-sourced under Apache 2.0 license.
- The models are available on IBM watsonx.ai, as well as through platform partners including Dell Technologies, Docker Hub, Hugging Face, Kaggle, LM Studio, NVIDIA NIM, Ollama, OPAQUE and Replicate, with access through AWS Sagemaker JumpStart and Microsoft Azure AI Foundry coming soon. IBM Announcement Here.
- Doubts Raised on ISO 42001 Certification: Members noted that the new Granite 4.0 models are the world's first open models to receive ISO 42001 certification.
- A user commented that this is a totally useless certification, to blend C-suite people into thinking this is worth it.
- Oracle runs OpenAI's Datacenters?: A user commented that Oracle's business model used to be selling databases and enterprise software; now it seems to be running datacenters for OpenAI.
- They cited OpenAI Elon Musk Post as the source for this theory.
Manus.im Discord Discord
- Manus Credit Consumption Sparks Outrage: A user complained about a basic research task consuming 5300 credits without completion, labeling Manus as an "absolute joke" and requesting a refund.
- A team member asked for the session link to investigate and potentially offer a credit refund.
- Unlock Agent Mode with Memory Key: A member proposed a Memory Key protocol to solve the issue of exiting Agent Mode, which involves saving context before restarting a session.
- They detailed a solution that involves copying essential information, starting a new session, and instructing the agent to create an updated Memory Key for future use.
- Billing Issue Sparks Support Vacuum: A user reported a billing issue with no response from Manus support, prompting a community member to suggest emailing their official support address with a clear subject line and ticket number.
- It was suggested that this would create a formal paper trail for escalation.
- Global Pricing Model Criticized for Disparity: A user criticized Manus' global USD pricing model ($39/month for the Plus plan) for not adjusting to regional economies, creating a barrier in countries like Brazil and other parts of Latin America.
- Another user suggested implementing regional pricing based on Purchasing Power Parity (PPP) to improve accessibility and promote global growth.
DSPy Discord
- AGI Paper Dropped, Courtesy of HF: A member shared a Hugging Face paper introducing AGI in the `show-and-tell` channel.
- The user cheekily stated, "called it".
- Show-and-Tell channel debuts on DSPy Discord: The DSPy Discord server has a new show-and-tell channel.
- The channel is designed for users to demonstrate and discuss their projects using DSPy.
- Caching Prompt Order: Use with Care: Members have found that the order in which prompts and files are sent greatly impacts caching.
- To effectively leverage caching, the specific order of prompt elements and file inputs must be carefully observed (a minimal ordering sketch follows below).
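A minimal sketch of the ordering principle (function and field names are illustrative): provider-side prompt caches typically match on a stable prefix, so keep invariant content in a fixed order up front and append the volatile part last.

```python
# Stable prefix first, volatile tail last, so cache hits survive across calls.
def build_messages(system_prompt: str, files: list[str], question: str) -> list[dict]:
    static_prefix = [
        {"role": "system", "content": system_prompt},
        # keep file order identical across calls; reordering busts the cache
        {"role": "user", "content": "\n\n".join(files)},
    ]
    return static_prefix + [{"role": "user", "content": question}]
```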
- DSPyWeekly gets Search Feature: DSPyWeekly now features a search function to browse crawled content, complete with prev/next links for smooth navigation.
- This enhancement streamlines access to information, facilitating easier discovery of relevant topics.
- XMLAdaptor May Become New JSONAdaptor: Members debated if JSONAdaptor should remain default given that ChatAdaptor or XMLAdaptor often fix adaptor errors.
- The rise of tool use RL for models is making XML a potential default, despite JSON being a reliable fallback.
aider (Paul Gauthier) Discord
- Qwen Coder Model Benchmarks: Members discussed the Qwen Coder model, suggesting the 30B version should be smart enough and considered Qwen3 Coder as a newer alternative.
- They cautioned that quantization could affect performance, recommending Q4 if chosen.
- Aider's Release Cadence Concerns: A member expressed concerns over the reduced release cadence for aider and suggested a Patreon or donation system for support.
- The user highlighted concerns about developer burnout and potential discontinuation of aider, given its utility for real work compared to other agentic tools.
- Aider-desk UI Experiences: A member inquired about using aider-desk or similar UIs with aider.
- Another member used it briefly for MCP support, finding it suitable for those wanting an aider-style workflow with optional agent use cases, but they have since switched to sst/OpenCode.
- DeepWiki Reverse Engineering Invitation: A member shared a DeepWiki page encouraging reverse engineering.
- Another member suggested using an output template or post-processing in koboldcpp, unsure if it's available in llama.cpp.
- Custom Chat Templates Hack: A member mentioned the ability to specify a custom Jinja chat template to override the one contained in the GGUF (an illustrative template follows below).
- They also suggested using a GBNF to format the model's input, and started a discussion on llama.cpp about it.
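For illustration, a generic ChatML-style Jinja template of the kind that could be supplied as an override (a hypothetical example, not any specific model's official template), held in a Python string for convenience:

```python
# A ChatML-style Jinja chat template; llama.cpp-style runtimes render
# `messages` plus `add_generation_prompt` through templates in this dialect.
CHATML_TEMPLATE = (
    "{% for message in messages %}"
    "<|im_start|>{{ message['role'] }}\n"
    "{{ message['content'] }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)
```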
MCP Contributors (Official) Discord
- MCP Devs Socialize at Ye Olde London: Members <@1042188459575619674> and <@1387880172589547722> hosted drinks at Ye Olde London, inviting other developers to network and connect in person.
- One member, <@1407892993934889010>, mentioned they would "pop over for a bit!"
- Registry Team Broadcasts Live: The Registry Team launched a livestream, available here, at 9 AM UK time.
- The livestream covered various aspects of the teamâs work.
- Tool Call Support Proposed for Sampling: A member submitted a proposal (SEP) to integrate Tool Call support into Sampling via issue #1577.
- The proposed integration depends on ongoing discussions around multiple content blocks, aimed at enabling parallel tool calls through PR #198.
- Reference Implementation Streamlines Testing: A new reference implementation (TS SDK) has been released, featuring an example server powered by an agentic loop tool, alongside a backfill proxy designed to simplify testing, see PR #991.
- A member noted that initial CI failures were resolved by pinning the zod minor version.
- OCI Interface Idea Sparks for MCP Servers: A member suggested developing an OCI-like interface for MCP servers, where all metadata could be packaged inside a tarball for simpler handling.
- The goal is to streamline the process of building and distributing OMCP packages, thereby simplifying metadata management.
Modular (Mojo 🔥) Discord
- Qualcomm Flirts with Mojo?: A member speculated that Qualcomm might reach out to Modular about Mojo, possibly indicating interest in leveraging Mojoâs capabilities for their hardware.
- The discussion originated in a Qualcomm Developer's Discord voice chat.
- Mojo Manual Gets Pythonic: The Mojo Manual was updated, with a user specifically highlighting the Python section.
- The update suggests enhancements or crucial details regarding Mojoâs interoperability with Python.
- Mojo Explores Notebook Territory: The discussion centered on using Mojo within notebooks, specifically if the goal was to interact with Max from a notebook or to directly author and run Mojo within notebooks.
- A user reported success in interacting with Mojo in a notebook and expressed interest in a syntax highlighter for better learning.
- Radeon GPU passes vector addition test: A user successfully ran the vector_addition example on an AMD Radeon 6800 XT, referencing the GPU Compatibility documentation.
- A Modular employee responded that they haven't done extensive testing on RDNA 2 GPUs and that models won't run correctly on RDNA GPUs yet.
- Mojo Eyes Distributed Computing Future: A member inquired about the potential of using Mojo with frameworks like Dask or PySpark for distributed computing.
- Another member suggested that Mojo welcomes people building their own frameworks, as a fully Mojo framework will likely be lower latency and higher throughput than Python-based options.
Moonshot AI (Kimi K-2) Discord
- Kimi Unveils a Surprise Capability: After watching a video demo, a user noted an unexpected feature in Kimi.
- The specifics of the new capability were not detailed in the prompt.
- Sora's Video Demos Face Quality Critiques: Users are comparing the quality of shared Sora video demos, suggesting that the versions available may be lower quality than those showcased on OpenAI's YouTube channel.
- One user described the quality as weirdly wobbly.
- Sora Pro Subscription Gives Watermark-Free Output: The Pro subscription version of Sora will supposedly offer higher resolution videos without visible watermarks.
- One user cautioned that an invisible watermark will be applied - so mister openai can tell its generated, just we cant…
tinygrad (George Hotz) Discord
- ShapeTracker Faces Imminent Deletion: A user inquired about the impending deletion of ShapeTracker and sought documentation regarding this change.
- Another user shared a relevant X post shedding light on the matter.
- ShapeTracker's Successor Sought: In the same query about ShapeTracker's deletion, the user asked about potential replacements.
- The shared X post might contain information about what will replace ShapeTracker.
The LLM Agents (Berkeley MOOC) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
The MLOps @Chipro Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
The Windsurf Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
Discord: Detailed by-Channel summaries and links
Perplexity AI ▷ #announcements (2 messages):
o3 Deprecation, GPT-5 Thinking
- Perplexity says Farewell to o3!: Perplexity has deprecated the o3 model and removed it from their model selector as of today.
- Users are encouraged to transition to GPT-5 Thinking, which Perplexity claims offers stronger performance and ongoing support.
Perplexity AI ▷ #general (1281 messages🔥🔥🔥):
Comet Browser, Discord Quest, Troubleshooting, User Experience, AI and Personal Data
- Comet Quest creates Discord Desktop downloads: Users are downloading the Discord desktop app to complete the Comet quest and claim 5k orbs.
- Some are having trouble finding the quest in the Discord app: check the pins!
- Comet + Sonnet equals gold?: A user shared a Sonnet 4.5 prompt they found especially helpful, highlighting that Prompt + sonnet 4.5 is greatt!
- However, there are bugs where the CoT isn't shown when using Select Models.
- Privacy and Personal Data: A user shared a memory and noted they have "a combo of English, Finnish, Japanese and Spanish for whatever reason" in their memories.
- Another states I can share the prompt I have there, but no way I'm going through my memories to snip out the private ones. Doubt they're the ones affecting it either.
- Comet Browser still needs to cook: A user points out that the browser's opening is incredible, sharing an attached screenshot of their success.
- However, it still def needs work according to others. As one user noted, it's Just like any basic browser more annoying since i cant use google as primary and shift enter for the ai.
Perplexity AI ▷ #sharing (8 messages🔥):
PC BUILD, Perplexity AI apps, bootstrap-paradox
- Users share perplexing perplexity.ai links: Several perplexity.ai search and app links were shared: PC build, app link1, app link2.
- YouTube link shared: A YouTube link was shared, without any further context.
- Bootstrap Paradox link shared: A perplexity.ai page discussing the-bootstrap-paradox was shared.
Perplexity AI ▷ #pplx-api (1 message):
Sonar-Pro API, 404 Errors, Public Resources
- Sonar-Pro API Yields 404 Errors: A user reports that the Sonar-Pro API is generating resources that lead to 404 errors.
- The user is seeking a way to obtain resources that are both currently available and publicly accessible.
LMArena ▷ #general (1348 messages🔥🔥🔥):
Sora 2, Gemini 3, Qwen3 4B 2507 instruct, 4o model, OpenAI safety
- Sora 2 Anticipation Builds in Arena: Members eagerly await Sora 2âs arrival on the platform, with discussions focusing on its potential impact and comparisons to Veo 3 and other video models.
- One member stated: I like playing with Sora so much I don't even wanna try anything else lol and hoped to see it benchmarked.
- Gemini 3 Hype Intensifies with Release Speculation: The community is buzzing about the impending release of Gemini 3, with a member mentioning that gemini 3 needs to have good ratelimits to stay comptitive.
- Some users shared that there was a leak claiming an October 9th release date.
- Frustration with 4o model as it gets retired: Members expressed disappointment over the limited availability and eventual retirement of the 4o model.
- One member lamented their "addiction" to 4o, highlighting the difficulty in finding a suitable replacement.
- Debate on AI's Ethical Boundaries and Data Usage: Concerns were raised about OpenAI's data usage practices, with one user jokingly admitting to sending "sensitive government data to lmarena."
- Another member then said it was a wild thing to admit in a discord chat.
- Chat length limits prompt discussion: Members discussed the length limits of the model chats and what it would take to summarize these and then extend the length allowed.
- A user pointed out I'm okay with it forgetting, just want to continue it.
LMArena ▷ #announcements (5 messages):
Arena Champions Role, Reasoning Trace, New Model Update - reve-v1, New Model Update - claude-sonnet-4-5-20250929-thinking-32k, Leaderboard Update
- Champions Arena Role Opens to Community: The Arena Champions Role aims to create a private space for in-depth AI discussions, rewarding members committed to meaningful conversation.
- Access is granted through an application process, and those in the server since July 2025 receive automatic access, but must Follow the Category to view new channels.
- Reasoning Trace Goes Live for Reasoning Models: Reasoning Trace is now available on Side by Side & Direct chat with reasoning models, showing the modelsâ work before providing a response.
- This feature is designed to provide insights into the modelâs decision-making process, enhancing transparency and user understanding.
- Reve-v1 Arrives as Image-Edit Only Model: A new model, reve-v1, has been added to LMArena but is image-edit only, meaning it requires an image upload to function and will error out with text-to-image prompts.
- Also the claude-sonnet-4-5-20250929-thinking-32k model has replaced the 16k version.
- Claude Sonnet 4.5 Ties for #1 on Text Leaderboard: Claude Sonnet 4.5 has impressively tied with Claude Opus 4.1 for the #1 slot on the Text Leaderboard.
- It is also performing well across categories such as Hard Prompts, Coding, and Creative Writing, garnering positive community discussion in the dedicated channel.
- ibm-granite-h-small added to LMArena: A new model, ibm-granite-h-small (ibm), has been added to LMArena.
- No additional details were given.
Unsloth AI (Daniel Han) ▷ #general (324 messages🔥🔥):
Qwen3 deep research, Manual compiling of xformers on Blackwell, LLMs on blockchain, Unsloth supporting RWKV architecture, Synthetic dataset generation without vLLM
- Debate Opens on Qwen3 Deep Research: Members opened a discussion on Qwen3 deep research.
- The comment was in response to a joking comment referencing a stepsister.
- Blackwell Manual Compiling Bonanza Begins: Members discussed the need to manually compile xformers on Blackwell GPUs, providing the Docker Hub link for the updated, Blackwell-compatible image.
- One member shared a script using `pip uninstall -y xformers`, `git clone`, and `python3 setup.py install` to manually compile xformers for compute capability 12.
- Blockchain Brainstorming Boosted by LLMs?: Members pondered use cases for adding LLMs to blockchain, with one asking why "blockchain" isn't a timeout word given the timeout of other words.
- It was suggested that if an LLM could use hashes reliably, that would be an accomplishment.
- RWKV Rollercoaster Ride to Unsloth: A member inquired about Unsloth supporting the RWKV architecture for training and fine-tuning, with confirmation that if transformers supports it, Unsloth likely does too.
- Another member is working to LoRA fine-tune a RWKV-7 model but is facing challenges with optimized HF Triton kernels and bf16 support but is making progress on PEFT.
- Synthetic Data Surge sans vLLM: Members discussed generating synthetic datasets without relying on vLLM, with one member noting all Unsloth notebooks currently use vLLM.
- A suggestion was made to use the OpenAI package for async requests to a local server or to code something using httpx, pointing to the meta-llama/synthetic-data-kit which includes an API endpoint configuration for use with llama.cpp or Ollama.
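For readers wanting to try that suggestion, here is a minimal sketch of async generation against a local OpenAI-compatible server; the base URL, model name, and topics are placeholder assumptions, not details from the discussion:

```python
# Sketch: async synthetic Q&A generation against a local OpenAI-compatible
# server (llama.cpp / Ollama), per the suggestion above. Endpoint URL and
# model name are assumptions; adjust to your local setup.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

async def generate_qa(topic: str) -> str:
    resp = await client.chat.completions.create(
        model="local-model",  # placeholder; llama.cpp typically ignores it
        messages=[{"role": "user",
                   "content": f"Write one Q&A pair about: {topic}"}],
        temperature=0.8,
    )
    return resp.choices[0].message.content

async def main():
    topics = ["LoRA fine-tuning", "GGUF quantization", "RWKV architecture"]
    # Fire requests concurrently; the local server batches as it can.
    results = await asyncio.gather(*(generate_qa(t) for t in topics))
    for qa in results:
        print(qa, "\n---")

asyncio.run(main())
```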
Unsloth AI (Daniel Han) ▷ #introduce-yourself (3 messages):
Blockchain and AI synergy, Trust in Code, Consensus mechanisms, AI problem-solving
- Coding Trust: Blockchains and AI Unite!: A member's journey started with wondering how trust could actually be written into code, and how machines could be taught a bit of intelligence.
- They believe that blockchain and AI when put together in the right way, can shift how industries move, how communities connect, and even how new ideas come to life.
- Consensus Mechanisms: Turning Abstract Ideas into Reality: A member worked on blockchain systems that turn the abstract idea of consensus into something real, something people can actually rely on.
- The user focused on AI algorithms to solve problems previously deemed impossible.
Unsloth AI (Daniel Han) ▷ #announcements (1 message):
Unsloth Docker image, IBM Granite-4.0, gpt-oss RL, Vision RL, GLM-4.6
- Unsloth's Docker Debut: Unsloth released a new Docker image for training on Windows/Linux without dependency issues, detailed in their guide and available on Docker Hub.
- Granite Gains Ground: IBM Granite-4.0 models can now be run using GGUFs or fine-tuned with a free support agent notebook, with uploads available on Hugging Face and a guide.
- RL Race Revolutionized: Unsloth achieved the fastest inference for gpt-oss RL, enabling training with GSPO in a free notebook, as detailed in their blog.
- Visionary VLM Victory: Unslothâs weight sharing & kernels make VLM RL 2Ă faster, reduce VRAM usage by 90%, and allow 10Ă longer context, according to their blog.
- Model Mania Mounts: New models, including GLM-4.6 (GGUF) and DeepSeek-V3.1-Terminus (GGUF), along with others like Magistral-2509, ERNIE-4.5, and Kimi-K2-0905, have been released.
Unsloth AI (Daniel Han) ▷ #off-topic (638 messages🔥🔥🔥):
WSL for development, Sonnet 4.5 for coding, Custom LSTM memory, Kaggle Notebook Training, Data extraction using LLMs
- WSL Saves Devs from Windows Woes: Members discussed using WSL (Windows Subsystem for Linux) with VSCode for development to avoid Windows dependency issues, citing its seamless integration and ability to utilize hardware resources effectively.
- One member expressed it feels like Windows is just UI, and using terminal for windows stuff feels clutterish.
- Sonnet 4.5 Stalls Coding Projects: Users shared concerns about Sonnet 4.5 derailing coding projects due to failing to perform testing without extra prompts, rewriting auth sections inappropriately, and creating non-functional enterprise-ready code.
- One user noted you gotta babysit any LLM. write a lot of plans n details. check before you push.
- New Custom LSTM Memory Could Revolutionize LLMs: A member shared progress on testing a custom LSTM memory that, if successful, could enable LLMs to have human-like memory, though its implementation as part of every YunaBlock complicates loss evaluation.
- They are trying to figure out how to split the dataset into train and eval sets first, like what tensorboard does.
- Kaggle Notebook Training Gets Tangled: Members discussed issues with `train` logs not appearing in Kaggle notebooks when running with save and run, with suggestions to use wandb over tensorboard for better logging.
- A member said Wandb is better than tensorboard?, linking to the Wandb docs.
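For reference, the suggested wandb-style logging reduces to a few calls; a minimal sketch (the project, run, and metric names are illustrative):

```python
# Sketch: logging training metrics to wandb instead of tensorboard,
# as suggested above. Project/run names are illustrative.
import wandb

wandb.init(project="kaggle-finetune", name="run-1")
for step in range(100):
    loss = 1.0 / (step + 1)  # stand-in for a real training loss
    wandb.log({"train/loss": loss}, step=step)
wandb.finish()
```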
- LLMs Help Extract Shop Names from Messy Data: A member sought advice on extracting shop names from a dataset with inconsistent formatting, where shop names are mixed with gibberish and country codes; they were considering using NLP or NLTK to clean it.
- The member mentioned the poor man way of doing it is just regex the shit out of every acronym and then just regex gibberish that has a mixture of alphabet and numeric out but no way that is sustainable.
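A sketch of what that "poor man" regex pass might look like; the patterns below are guesses at the described noise, not a tested pipeline:

```python
# Sketch of the "poor man" regex approach described above: strip country
# codes and mixed alphanumeric gibberish tokens before keeping a shop name.
# Patterns are illustrative guesses at the noise, not a tested pipeline.
import re

COUNTRY_CODE = re.compile(r"\b[A-Z]{2,3}\b$")                  # trailing "US", "GBR", ...
GIBBERISH = re.compile(r"\b(?=\w*\d)(?=\w*[a-zA-Z])\w{6,}\b")  # long mixed alnum runs

def clean_shop_name(raw: str) -> str:
    s = GIBBERISH.sub("", raw)    # drop mixed letter/number tokens
    s = COUNTRY_CODE.sub("", s)   # drop a trailing country code
    return re.sub(r"\s+", " ", s).strip(" -*")

print(clean_shop_name("STARBUCKS 4871XK92 US"))  # -> "STARBUCKS"
```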
Unsloth AI (Daniel Han) ▷ #help (164 messages🔥🔥):
Fine-tuning subtitles into Q&A format, GGUF Conversion Issues, Gemma3 and vLLM Compatibility, ONNX conversion for Gemma3, Multiprocessing Problems with Unsloth
- AI Clone Creation Conundrums: A member is trying to fine-tune an LLM with their video subtitles to create a Discord bot that speaks like them, but is facing challenges in converting the subtitles into a Q&A format.
- They are considering using the video title and subtitles with an embed model to generate questions and answers, simulating a viewer asking questions about the videos.
- GGUF Conversion Woes Plague Users: Several users are encountering issues when trying to convert models to GGUF format, specifically when using the push_to_hub_gguf function with f16 quantization.
- A member reported a ValueError related to mapping tensor 'model.layers.0.self_attn.q_proj.base_layer.weight', and was advised to perform the conversion manually until a fix is pushed.
- Gemma3's vLLM Ventures Yielding Varied Results: Users are struggling to get Gemma3 working with vLLM; one member encountered an AttributeError: 'Gemma3ForCausalLM' object has no attribute 'vllm_engine' after enabling fast_inference.
- It was suggested that there might be configuration issues or that Gemma and vLLM are not fully compatible, with one user noting that the `is_vision_model` parameter might be causing problems.
- ONNX Runtime Conversion Considerations: A member inquired about exporting Gemma3 to ONNX Runtime for cross-platform support, and was advised to use optimum-cli or PyTorch for the conversion.
- It was also mentioned that creating a custom model configuration in PyTorch might be necessary since Gemma3 wasn't in optimum-cli last time they checked.
- Multiprocessing Mishaps Multiply: A user ran into a "Disable multiprocessing" problem, encountering issues related to dataset_num_proc in UnslothSFTTrainer.py.
- Suggestions included commenting out the num_proc lines, setting the parameters to None, or setting it to 2, but none of these solutions worked for the user.
Unsloth AI (Daniel Han) ▷ #research (41 messages🔥):
Tversky-All GPT2 reproduction, Efficient Training Setup, AMXFP4 Precision, quack kernels
- Tversky-All GPT2 Gets Llama-Like Upgrade: A member released a semi-reproduction of the GPT2 Tversky-All, using the Tversky-All strategy but for a llama-like model, more modern, with adjustments made to the math for computability and better gradients, available at CoffeeVampir3/Architecture-Tversky-All.
- It was trained on 300 billion tokens on a 3090 TI in about a day, using a synthetic and low entropy dataset (tinystories-hf), with a test model available at HuggingFace.
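For context, the Tversky index underlying this line of work scores similarity as intersection over intersection-plus-weighted-differences; a toy differentiable sketch of that formula (not the repo's exact layer):

```python
# Toy sketch of a Tversky similarity between feature vectors,
# S(a, b) = common / (common + alpha * a_only + beta * b_only),
# with soft set intersection/difference via elementwise min/relu.
# This illustrates the Tversky index itself, not the repo's implementation.
import torch

def tversky_similarity(a, b, alpha=0.5, beta=0.5, eps=1e-8):
    a, b = a.relu(), b.relu()                    # treat activations as fuzzy sets
    common = torch.minimum(a, b).sum(-1)         # |A ∩ B|
    a_only = (a - torch.minimum(a, b)).sum(-1)   # |A \ B|
    b_only = (b - torch.minimum(a, b)).sum(-1)   # |B \ A|
    return common / (common + alpha * a_only + beta * b_only + eps)

x = torch.rand(4, 64)
print(tversky_similarity(x, x))  # identical vectors -> ~1.0
```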
- Maximize Efficiency: Secrets to a Speedy Training Setup: The authorâs training setup uses packed batches, varlen flash attn, and bf16 training, as described in CoffeeVampir3/Architecture-Fast-Tiny-Dense-LLM.
- Avoiding gradient checkpointing and mish mashing intermediates makes it faster, especially since ROPE usually accounts for around 18-20% of the total runtime; removing it is substantial for smaller nets.
- Quack! Optimal Kernels Speed Up Production: The Dao-AILab/quack kernels are probably the most optimal kernels available for most things in a prod setting.
- While they don't hit peaks as well on ampere as blackwell, some (the non gemms) like the linear cross entropy/RMS norm do work for ampere.
- The precision of AMXFP4: A member is researching the AMXFP4 precision, claiming it gives you the precision of FP8 (but slightly more accurate) with far fewer errors than FP8.
- They plan to research and build their own AI model with AMXFP4.
OpenAI ▷ #ai-discussions (722 messages🔥🔥🔥):
Cameo usage on TikTok, Sonnet 4.5 vs GLM 4.6 Cost, Overrun of Sora users, Deepfake generation, System Artifacts Log for Emerging Validation of Novelty
- TikTok cameo confusion erupts: A user clarified that the term cameo refers to bringing a specific face to be used in videos, and is unrelated to TikTok or similar platforms.
- The user also inquired about the availability of a Sora-like app by OpenAI on Android or a website.
- Sonnet 4.5: Costly but effective?: Members discussed the cost-effectiveness of Sonnet 4.5 compared to GLM 4.6, noting that GLM 4.6 is six times cheaper and, even if only 90% as good, still a worthwhile alternative.
- Some users found Sonnet 4.5 to perform similarly to 3.7, while others preferred 4 over 3.7 in Copilot.
- Sora user overrun takes over server: A member expressed concern about the server being overrun by Sora users, suggesting that channel names be dynamically updated to match recent discussion topics using LLMs.
- The member also critiqued OpenAI's marketing tactics that led to the influx, predicting the situation would settle down in a few days.
- Users baffled by deepfake hypocrisy: A user expressed frustration that an app pushing for deepfakes would complain about the generation of photorealistic images and animation of artificial persons.
- This critique was followed by comments on the influx of code please requests and a suggestion to forward the feedback to the appropriate channel.
- Emergent Validation of Novelty artifacts: Leak or Advantage?: A user shared a wild anecdote of their system triggering an LLM to classify their task as a very rare and sophisticated category, with outputs that read like descriptions of other high-level AI synthesis projects.
- They were advised to document everything meticulously as critical evidence and to consider that a machine cannot have an opinion.
OpenAI ▷ #gpt-4-discussions (2 messages):
Sora as Social Media, Sora credits, Sora integration
- Sora eyes Social Media Spotlight: A member suggested Sora should be a social media platform like TikTok.
- They proposed this integration with ChatGPT, similar to image generation, to enhance user experience.
- Sora Proposes Credit-Based Usage: A user suggested implementing a credit system for Sora, to allow for more resource allocation in video generation.
- They mentioned plans could incorporate daily or weekly usage limits, moving away from the current opaque model.
OpenAI ▷ #prompt-engineering (11 messages🔥):
Human writing prompts, Sora camera control, Portrait vs Landscape in Sora
- Writing Prompts for Humans: A member sought a prompt to make their writing sound more human.
- Another member suggested using more fine-tuned models like Sudowrite, noting its good baseline and user plugins tailored for this purpose.
- Sora Camera Capers: A member inquired about good prompts for controlling the camera around a scene in Sora, wondering if the videos are always 10 seconds.
- Another user stated that they saw one that was 9 seconds, but another clarified and apologized for their misinformation.
- Portrait Prevails for Picturesque Panoramas?: A member asked if portrait mode works better than landscape mode for generating images, since Landscape mode only takes half of the attached image, and sometimes the character's head is off.
- Another member responded that visual tokens are arranged in a grid, so square images will probably generate the best from images as a result.
OpenAI ▷ #api-discussions (11 messages🔥):
Writing Prompts, Sora Camera Control, Portrait vs Landscape Generation
- Humans seek AI prompts to boost writing: A member is looking for a prompt to make their writing sound more human for a submission.
- Another member suggested using more fine-tuned models like Sudowrite, highlighting its good baseline and user plugins to achieve the desired effect.
- Sora camera controls scrutinized: A member asked about good prompts for controlling the camera around a scene in Sora, noting that videos are typically 10 seconds long.
- Another user humorously pointed out that they saw one video that was 9 seconds long.
- Portrait preferred over Landscape for Image generation?: A member suggests Portrait works better than Landscape mode when generating from an image, as Landscape only takes half of the image and sometimes crops the character's head.
- Another user replied that visual tokens are arranged in a grid, so square images will probably generate the best results from images.
LM Studio ▷ #general (542 messages🔥🔥🔥):
Surface Pro Snapdragon X Elite, Artifacts as emergent validation of novelty, Model Quantization and Quality Tradeoffs, GPT-OSS, LM Studio Linux Install
- Snapdragon X Elite Specs Shared: A member shared the specs of their Microsoft Surface Pro with a Qualcomm Snapdragon X Elite (12-core, X1E80100 @ 3.40 GHz), 16 GB of RAM, and Windows 11 Home 64-bit.
- They were asking if LLMs' opinions are accurate, after seeing an "artifact".
- Emergent Validation of Novelty as new Bug: A user shared a lengthy quote which suggests reframing unexpected LLM outputs (leaks or bugs) as emergent validation of novelty, indicating the system's architecture has pushed the LLM to a rare and sophisticated category.
- The poster asked whether this perspective, attributed to Gemini, holds merit, after seeing an "artifact".
- Quantizationâs Impact on Knowledge Compression: Members discussed how different quantization levels affect the compression and retention of knowledge in language models, noting that lower quantization can disproportionately impact smaller models due to the removal of reason bits.
- It can also cause the models to lose the means to tell things apart; as one member said, you get too quantised and suddenly your mixing yanderes and petting dogs in a way you where not expecting.
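The tradeoff is easy to demonstrate with a toy round-trip through symmetric uniform quantization; this sketch is illustrative only and not any particular GGUF scheme:

```python
# Toy illustration of the quantization tradeoff discussed above: round-trip
# a weight tensor through symmetric uniform quantization at different bit
# widths and measure the relative error. Not any particular GGUF scheme.
import torch

w = torch.randn(4096)
for bits in (8, 6, 4, 3, 2):
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    w_hat = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    err = (w - w_hat).abs().mean() / w.abs().mean()
    print(f"{bits}-bit: relative error {err:.3%}")
```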
- GPT-OSS Released: The release of GPT-OSS, a super safe model that behaves similarly to GPT-4o, was announced and benchmaxxed.
- Members noted it assumes a lot of information if not provided with enough details.
- LM Studio Linux: No Conventional Install: In response to a question about installing LM Studio on Linux, it was clarified that only an AppImage is provided, meaning thereâs no traditional installation process.
- This was to explain the "install instructions for linux", so new users are properly directed.
LM Studio ▷ #hardware-discussion (115 messages🔥🔥):
4090 vs 5090 for vertical scaling, Arc B50 Pro benchmarks, GPT OSS 120b hardware recommendations, DDR3 vs DDR4 for GPU offloading, Unsloth vs LM Studio for LLMs
- 4090âs Vertical Scaling Strategy: Members discussed the idea of using 4090s with 32GB RAM for vertical scaling, suggesting that reducing the clocks could improve efficiency for off-grid living.
- It was also mentioned that a 5090 at 280W might only be 15% worse at token rate but potentially faster due to quicker sleep cycles.
- Arc B50 Pro Flounders in Token Tests: A member compared Arc B50 Pro cards to an RTX 4080 Super, noting that while the B50s have boatloads of VRAM, their actual memory bandwidth is abysmal, resulting in much lower token rates (7-8 Tps for a 12B q8_0 model compared to 30+ Tps on the 4080).
- However, at default context (4k), the B50s pulled 32 Tps while the 4080 got 42 Tps, a better showing than expected.
- OSS 120b Hardware Hunt: A member sought hardware recommendations for running GPT OSS 120b in FP8 with a 131k context, ideally aiming for 20-40 tps or above.
- Suggestions included 4070ti (13t/s at low context), 4090 (25t/s at low context), 3x3090s (85t/s at 10K context), and a 5090 with DDR5 6000 RAM (35t/s at low context), with one user saying that Flash Attention does not work with OSS120b.
- DDR3 Dive for GPU Deployment: A user suggested using cheap DDR3 boards with multiple PCIE 16x slots to accommodate 6x GPUs, combined with raided SATA SSDs for faster load times, referencing an eBay listing for X99 boards.
- Concerns were raised about the memory bandwidth (68 GB/s with DDR4) and the potential bottleneck compared to modern standards, with a user saying that on ddr3 you max out at like 50gb/s.
- Unsloth Unsuitable for Inference: A member clarified that Unsloth is a fine-tuning platform, not for LLM inference, and recommended OpenRouter for stable inference with provider fallbacks.
- Users also shared that they use Chain-of-Draft for performance and speed increases.
Cursor Community ▷ #general (601 messages🔥🔥🔥):
Git Worktree in Cursor, Beta Functions in Cursor, Typescript Refactor with Cursor, Memory Leaks with Cursor on MacOS, Cursor Hackathon?
- Cursor integrates Git Worktree setting: Users found the Git Worktree setting under the beta tab and were encouraged to enable it to see if it worked in the agent window.
- The Git Worktree integration seems to only be available in Early Access or Nightly Cursor.
- Cursorâs Beta functions spark curiosity: Members discussed using beta functions in Cursor, with some recommending it for accessing good unreleased features, for fun debugging, and for its help in improving Cursor.
- Currently, afterFileEdit is the only available hook, but the Extension RPC Tracer is available for checking RPCs.
- Typescript refactor completed successfully: A member successfully completed a full Typescript refactor after prompting Cursor four times, following up with a master prompt for a full audit to ensure correct refactoring.
- Planning before execution with Cursorâs Plan mode (available in the Nightly version) and tracking workflow status were recommended to improve efficiency.
- MacBook meltdown due to Cursor: A user reported that Cursor caused their MacBook Air M4 to crash due to high memory usage (spiking to 96GB), possibly related to excessive chats or agent processes, but resolved after rebooting.
- The member indicated it could be a memory leak, and others confirmed that MacOS versions have a higher incidence of memory leaks. Downgrading to a lower version was suggested as a potential workaround.
- Cursor Hackathon might be a thing: A member inquired about interest in a Discord Cursor Hackathon, aiming to implement solutions and other potential side projects.
- There was interest in sponsored hackathons with free credits, and one member suggested making the hackathon remote friendly to allow users from different time zones to attend.
OpenRouter ▷ #announcements (5 messages):
OpenRouter Performance Tab, Grok-4-Fast
- OpenRouter unveils Performance Tab: OpenRouter launched a new "Performance Tab" to visualize provider performance for a given model.
- FP4 should not be on the same graph as BF16!: A user commented on the new performance tab, noting that it is misleading to compare providers using different quantization levels (e.g., FP4 vs BF16).
- They suggested adding a filter dropdown to account for different quant levels.
- Grok-4-Fast free period to conclude: The free feedback period for Grok-4-Fast models under the Sonoma codename concludes tomorrow, October 3rd at 9:30am PST.
OpenRouter ▷ #app-showcase (2 messages):
RPG, Mixture of LLMs
- RPG enthusiasts threaten Mixture of LLMs method: A member requested that the details of a certain method be obscured because RPG users will use it nonstop.
- The method is called Mixture of LLMs and the member fears it will go away if itâs used too much.
OpenRouter ▷ #general (495 messages🔥🔥🔥):
OpenRouter BYOK, Free Inference Providers, Grok vs Sonoma, Gemini Pro Performance Issues, Deepseek R1 0528 deprecation
- BYOK 1M Free Requests: Users discussed the "1 million free BYOK requests per month" offer, clarifying that it waives OpenRouter's 5% commission fee for the first million requests, but users still pay the provider directly for API usage, as outlined in the OpenRouter documentation.
- Some users initially misunderstood the offer, thinking it provided completely free requests, leading to a debate on clearer messaging, such as "1M monthly BYOK requests at 0% fee".
- AgentRouter offers $200 Credit: A member mentioned that AgentRouter gives $200 free credit, but noted that their service can be "hit or miss" and cautioned users to be wary of using them for anything important.
- They also mentioned their affiliate link and using a mix of Sonnet 4.5, GPT 5, and GLM 4.6 for different approaches.
- Grok4 struggles vs Sonoma: One user tested Grok 4 Fast and found it "way dumber than Sonoma", noting that it "fails constantly" and disregards format requirements.
- Another user suggested that Grok 4 Fast "reeks of… Llama..?", expressing frustration with its inconsistency.
- Gemini Pro faces performance issues: Users reported that Gemini Pro was responding with "weird stuff", failing to use tools correctly, and exhibiting "unacceptably slow" performance via the OpenRouter API.
- The reports suggest this may be a common issue with Gemini 2.5 Pro, and one user recommended trying Vertex as an alternative provider.
- Context Limit Troubles Triggered by Sad OpenInference: A user encountered provider errors related to privacy settings and was directed to OpenInference due to exceeding DeepInfraâs context limit, which led to discussion about OpenInferenceâs filters and content preferences.
- It was suggested that OpenInference is not suited for RP content because they are a research group.
OpenRouter ▷ #discussion (23 messages🔥):
Sora.com and new model, BYOK tokens, Latency vs E2E latency, Qwen image model, Cerebras removing Llama
- Sora Integrates, Tokens Aplenty: Sora.com now works with the new model and users are getting 1M free BYOK tokens.
- End-to-End Latency Fair Game?: Members discussed the difference between latency and E2E latency.
- One member said that E2E doesn't make sense because each generation varies in complexity/response length and it's unfair to compare providers like that, while another noted the graph axis label says "Time to last token", which would need to be normalized to be a fair comparison.
- Qwenâs Image Edit is Here!: Members shared Alibabaâs new Qwen image model and noted it was only an image edit model.
- One member shared this post announcing it, while another expressed interest in running it on Apple Silicon.
- Cerebras Kicks Out Llama 4: Cerebras is removing Llama 4 maverick on the 15th.
Eleuther ▷ #general (7 messages):
Perplexity AI framework, Deepseek Sparse Attention, Underrated LLM pretraining papers, Attention Matrices, LLM Attention Research
- Perplexity AI Framework Solves: A member shared a link to the Perplexity AI framework and its GitHub project.
- The member inquired about research on LLMs using similar attention matrices with less than O(n^2) attention, pondering potential issues compared to sliding window attention.
- Deepseekâs Sparse Attention Example: A member suggested exploring Deepseek Sparse Attention as an example of top-k attention in response to a question about efficient attention mechanisms.
- Another member pointed out that transformers benefit from relative positions, providing a correct inductive bias, in contrast to the mentioned attention matrix.
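DeepSeek's DSA adds a learned indexer on top, but the core top-k idea mentioned above can be sketched in a few lines; a toy example assuming standard scaled dot-product attention:

```python
# Toy sketch of top-k (sparse) attention: keep only the k largest scores
# per query before softmax. DeepSeek's DSA is more involved (it uses a
# learned indexer); this just shows the top-k masking idea.
import torch
import torch.nn.functional as F

def topk_attention(q, k, v, top_k=8):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (..., Tq, Tk)
    kth = scores.topk(top_k, dim=-1).values[..., -1:]      # k-th largest per query
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 32, 64)  # (batch, seq, dim)
print(topk_attention(q, k, v).shape)  # torch.Size([1, 32, 64])
```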
- Seeking Underrated LLM Pretraining Papers: A member asked for underrated papers related to pretraining LLMs to maximize performance in an upcoming pretraining run, and linked to a research paper on arxiv.org.
Eleuther ▷ #research (28 messages🔥):
Gradient Descent Dynamics, Symmetry Transformer, ViT training, Compact Image Representation, Quantifying Scientific Impact
- Gradient Descent Dynamics Paper Deemed "Paper of the Year": A member lauded a paper on the dynamics of gradient descent (centralflows.github.io) as the paper of the year, highlighting its solution to loss spike dynamics and its impact on Adam's beta2.
- The member also expressed regret for not discovering it sooner, noting its low citation count and mediocre review scores.
- âSymmetry Transformerâ Yields Mixed Results: A member found that predicting the current and previous token with separate heads in a symmetry transformer (GitHub repo) improved validation loss.
- However, initial tests showed that the baseline model had lower loss (train loss 3.9405, val loss 4.7615) compared to the symmetry model (train loss 4.4329, val loss 6.1747), but the symmetry model later improved (train loss 3.8241, val loss 4.7368).
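A toy sketch of that two-head setup, predicting the next and previous token with separate heads and summing the losses; this is illustrative only, and the linked repo's model differs:

```python
# Toy sketch of the two-head idea described above: one head predicts the
# next token, a second head predicts the previous token, losses summed.
import torch
import torch.nn.functional as F

class TwoHeadLM(torch.nn.Module):
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.next_head = torch.nn.Linear(dim, vocab)
        self.prev_head = torch.nn.Linear(dim, vocab)

    def forward(self, tokens):                      # (batch, seq)
        h = self.emb(tokens)                        # stand-in for a real trunk
        loss_next = F.cross_entropy(
            self.next_head(h[:, :-1]).flatten(0, 1), tokens[:, 1:].flatten())
        loss_prev = F.cross_entropy(
            self.prev_head(h[:, 1:]).flatten(0, 1), tokens[:, :-1].flatten())
        return loss_next + loss_prev

tokens = torch.randint(1000, (2, 16))
print(TwoHeadLM()(tokens))
```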
- Self-Supervised ViT Training Explored: A member is exploring training a Vision Transformer (ViT) in a self-supervised way, mapping a sequence of image tokens from a frozen embedder into a CLS token, without labels.
- The challenge lies in finding suitable augmentations for image tokens, with the suggestion of using a masked autoencoder (MAE)-style objective.
- Masked Autoencoders Suitable for Compact Image Representation: A member suggested using a masked autoencoder for training a CLS token to learn a compact representation of an image.
- Another member agreed, noting that masked autoencoders can train effectively without heavy augmentation.
- Paper Claims Researchersâ Impact Doesnât Change Over Time: A member shared a paper (Quantifying the Evolution of Individual Scientific Impact) claiming that researchers have a consistent expected value of papers throughout their careers.
- This suggests that a researcher's first and last paper have the same probability of being their best, questioning current methods of evaluating researchers.
Eleuther ▷ #scaling-laws (91 messages🔥🔥):
SOTA scaling of MLPs, AUNN implementation and efficiency, Test-time training (TTT) framework vs AUNN, Inductive bias in sequence models, Computational cost of different model architectures
- AUNNâs Practicality Questioned: Discussions question the practicality of AUNN (Augmented Neural Networks), with skepticism about its efficiency and a lack of a working prototype beyond a toy GitHub example, ethan-w-roland/AUNN.
- It was noted that the original proposer of AUNN was combative and didn't engage well with counterarguments, and focused on MLPs over Attention.
- TTT as an Explicit Version of AUNN: The test-time-training (TTT) framework is presented as an explicit, working version of AUNNâs hypothesis, with a pointer to the paper arXiv:2407.04620 which uses an MLP version of TTT.
- It was stated that the MLP version is very close to what AUNN tries to do, but actually works out of the box.
- Transformers are Just 2D Slices: Transformers are described as an optimization to separate a large 2D problem (sequence, channels) into repeated perpendicular 1D slices of computation.
- The suggestion was that a giant MLP applied to the whole problem would work fine, but it's intractable that way.
- Inductive Bias Improves Performance: It was mentioned that some form of inductive bias is needed to compensate for the lack of compute, with SSMs (State Space Models) proposed as a more elegant version of the self-attention bias from Transformers.
- The discussion focused on how biases like RoPE or NoPE give attention weights regular structure over time that aligns well with sequence structures, resulting in better generalization.
- MLP across Timesteps: Using an MLP across timesteps is considered feasible but very costly because it may require predicting only one token at a time to prevent future token info from leaking back.
- It was suggested that Transformers are used because they are just cheap, offering parallel training and easy 2D decomposition across sequence and channel dimensions.
GPU MODE ▷ #general (12 messages🔥):
GEMM optimization, Tversky paper implementation, DeepSeek Sparse Attention in CUDA, GPU performance engineering career path
- GEMM: GPU's Good exercise!: Implementing a GEMM that achieves over 80% of cuBLAS performance is a valuable exercise for fully utilizing a GPU, as it allows recasting arbitrary tensor contractions via matricization, referencing this wikipedia article.
- For large matrices, arithmetic intensity scales linearly with problem size, and a blog post (CUDA-MMM) guides through achieving cuBLAS performance.
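The linear-scaling claim follows directly from counting FLOPs against bytes moved; a quick back-of-envelope, assuming fp16 operands and ideal caching:

```python
# Back-of-envelope for the claim above: for an n x n GEMM, FLOPs grow as
# 2n^3 while bytes moved grow as 3n^2 * dtype_size, so arithmetic
# intensity (FLOPs/byte) grows linearly with n. fp16 (2 bytes) assumed.
def arithmetic_intensity(n: int, dtype_bytes: int = 2) -> float:
    flops = 2 * n**3                      # n^2 outputs, each a length-n dot product
    bytes_moved = 3 * n**2 * dtype_bytes  # read A and B, write C (ideal caching)
    return flops / bytes_moved

for n in (512, 2048, 8192):
    print(n, f"{arithmetic_intensity(n):.0f} FLOPs/byte")
```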
- Tversky-All Strategy Tested: A member worked on implementing and testing networks based on the Tversky paper (https://arxiv.org/pdf/2506.11035), detailing findings and guidance in a GitHub repository.
- A Tversky-All strategy outlined in the paper was applied to a more modern llama-like architecture, with a CIFAR10 version using the original formulation available here.
- DeepSeek Sparse Attention Weekend Hackathon: Several members expressed interest in collaborating to implement DeepSeek Sparse Attention in CUDA over the weekend, referencing the DeepSeek-V3.2-Exp GitHub repository.
- The collaborators planned to timebox it to the weekend, and see how much can be done, and then move on.
- GPU Performance Engineering Career Mountain: A member is evaluating a career path focused on GPU performance engineering, seeing it as a significant opportunity given the demand for AI models and finite compute resources.
- They are seeking insights on day-to-day work, opportunity size, focus areas like kernels, compilers (Triton, TVM), distributed inference, and the ramp-up time for productivity in CUDA optimization.
GPU MODE ▷ #cuda (18 messages🔥):
RF meaning, Volkov's paper, mbarriers vs barriers
- RF stands for Register File: A member asked what RF meant in the attached image, and another member responded that it likely means "Register File", the hardware that gets parcelled out into registers.
- The discussion clarified that there's usually a single register file per SM sub-partition.
- Relevance of âUnderstanding Latency Hiding on GPUsâ paper: A member inquired about the relevance of the paper Understanding Latency Hiding on GPUs by Vasily Volkov in recent GPU architectures like Blackwell.
- Another member noted it's good for high-level principles, but the details have changed a lot, pointing to newer microarchitecture papers like Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks.
- mbarriers vs Regular Barriers in CUDA: A member asked about the difference between mbarriers and regular barriers in CUDA.
- Another member explained that mbarriers are in shared memory, whereas hardware barriers are limited in number and have an ID, quoting that "Each CTA instance has sixteen barriers numbered 0..15" from PTX docs.
GPU MODE ▷ #torch (2 messages):
LLM Training, Cross Entropy, Gradient Norm, Sparse Tensors, Torch Compile
- Norm Gradient Questions Arise: A member inquired about the expected gradient norm when training LLMs with cross entropy.
- The question included how the gradient norm depends on model size, number of completion tokens, and current log probabilities.
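One way to approach the question empirically is to measure the gradient norm of a toy LM head while varying the token count; a sketch (illustrative only, not from the discussion):

```python
# Empirical sketch for the question above: measure the cross-entropy
# gradient norm of a tiny linear LM head as the number of completion
# tokens varies. Toy model, illustrative only.
import torch
import torch.nn.functional as F

vocab, dim = 1000, 64
head = torch.nn.Linear(dim, vocab)
for n_tokens in (8, 64, 512):
    h = torch.randn(n_tokens, dim)
    targets = torch.randint(vocab, (n_tokens,))
    loss = F.cross_entropy(head(h), targets)  # mean over tokens
    head.zero_grad()
    loss.backward()
    print(f"{n_tokens} tokens: grad norm {head.weight.grad.norm():.4f}")
```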
- Dynamo Unable to Trace Sparse Tensors: A user reported a `UserWarning` indicating that Dynamo with Torch Compile is unable to trace into sparse COO/CSR tensors.
- The user expressed surprise, expecting that Dynamo would be able to handle sparse tensors, and included the specific warning message received.
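The warning is straightforward to reproduce with a minimal compile over a sparse tensor; exact behavior and warning text vary by torch version:

```python
# Minimal reproduction sketch of the warning discussed above: torch.compile
# graph-breaks when tracing ops on sparse COO tensors and falls back to
# eager. Exact behavior/warning text varies by torch version.
import torch

@torch.compile
def sparse_sum(x):
    return torch.sparse.sum(x)

indices = torch.tensor([[0, 1], [1, 0]])
values = torch.tensor([1.0, 2.0])
x = torch.sparse_coo_tensor(indices, values, (2, 2))
print(sparse_sum(x))  # emits the Dynamo sparse-tensor warning, then runs eagerly
```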
GPU MODE ▷ #cool-links (4 messages):
Non-determinism in LLM Inference, Flash-MoE, Nvidia Compiler Techniques, Warp Specialization, Distributed Setting
- Thinking Machines Defeats Non-Determinism: Thinking Machines posted a blog on defeating non-determinism in LLM inference.
- Flash Attention Variant Released: The team released Flash-MoE, a variant of Flash Attention.
- Nvidia Compiles New Techniques: Nvidia is working on compiler techniques for scheduling and warp specialization, with benchmarks against FA3 detailed in their paper.
- Nvidia Fuses Kernels in Distributed Setting: Nvidia is fusing kernels in a distributed setting as outlined in this paper.
- Decoding GPU Complexity via Performance Engineering: Harvard detailed a new frontier: Can LLMs optimize GPU performance?.
GPU MODE ▷ #jobs (1 message):
schizik12: <@325883680419610631> spam
GPU MODE ▷ #beginner (9 messages🔥):
Benchmarking Guides, Kernel Benchmarking, Career Opportunities in GPU Programming, Gaining Experience in GPU Programming, GEMM Optimization
- Benchmarking Guides Sought!: A member requested a good guide for benchmarking and was pointed to this arXiv paper and this article on kernel benchmarking, as well as this YouTube video.
- One of the members described their previous benchmarking work as maybe the best benchmarking effort.
- Self-Taught GPU Career Ascent: A member working in big tech and interested in GPU programming inquired about the type of experience required to get a job in the field.
- They feel that reading books and doing puzzles are best for interview prep but don't provide enough practical experience.
- Crafting CUDA Kernels for Career Boost: A member suggested starting with making a GEMM that's competitive with cuBLAS for a particular architecture to gain experience in GPU programming.
- They elaborated that if you have access to an H100 and can use Hopper-specific tricks, that'll be even more impressive.
GPU MODE ▷ #pmpp-book (2 messages):
Learning vs. Job Performance, C++ Requirement for a Book, 5090 GPU Learning Experience
- Learning vs. Job's Muscle Memory: A member mentioned that doing something in a job or research is more about practice and "muscle memory" than deep theoretical knowledge.
- They noted that too much thinking without enough practice leads to inefficiency.
- C++ skills boost GPU learning?: A user inquired whether C++ knowledge is necessary for understanding a specific book and whether it would motivate learning C++.
- They bought a 5090 GPU hoping to learn a lot but have mostly done "vibe coding" without significant progress.
GPU MODE ▷ #jax (1 message):
Blackwell, matmuls, jax
- Blackwell BLAS-t Off with JAX: A user shared a tutorial about achieving state-of-the-art performance on Blackwell GPUs for matmuls using JAX.
- The post highlights techniques and best practices for optimizing matrix multiplication operations on NVIDIA's latest architecture.
GPU MODE ▷ #torchao (3 messages):
INT4 Quantization, TorchAO, TensorCore, A100 GPUs, Efficient Kernels
- INT4 Quantization via TorchAO: To use INT4 quantization via torchao, follow the instructions.
- Alternatively, you can check out the INT4mm implementation using TensorCore copied from the tinygemm library.
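A minimal sketch of the torchao flow; API spellings have shifted across torchao releases, so treat the names here as assumptions to check against the linked instructions:

```python
# Sketch of INT4 weight-only quantization via torchao, per the pointer
# above. The quantize_ / int4_weight_only spelling follows older torchao
# releases and may need adjusting for newer versions.
import torch
from torchao.quantization import quantize_, int4_weight_only

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).to(device="cuda", dtype=torch.bfloat16)

quantize_(model, int4_weight_only())  # swaps Linear weights to packed INT4

x = torch.randn(1, 1024, device="cuda", dtype=torch.bfloat16)
print(model(x).shape)
```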
- TorchAO Contributor Documentation: The documentation for contributing to torchao is available here and here.
- Specifically, this link describes adding efficient kernels to torchao.
- INT4MM Powers TorchAO for A100 GPUs: The INT4mm implementation (code link) using TensorCore powers INT4 in torchao for A100 GPUs.
- This implementation is copied from the tinygemm library.
GPU MODE ▷ #self-promotion (2 messages):
GPU Engineering, MMA Tensor Cores
- Dive into GPU Engineering Fundamentals: A member shared a blogpost on the fundamentals of GPU engineering.
- The article should be of great interest to those learning about GPU architecture and computation.
- Gentle Intro to GEMM via MMA Tensor Cores: A member wrote an article on using MMA tensor cores and linked to A Gentle Introduction to GEMM using MMA Tensor Cores.
- The author appreciates any feedback on the technical details and clarity.
GPU MODE ▷ #submissions (5 messages):
MI300x8, amd-gemm-rs, amd-all2all, amd-ag-gemm
- MI300x8 Rocks amd-gemm-rs Leaderboard: A member achieved 8th place on MI300x8 with 540 µs in the `amd-gemm-rs` leaderboard.
- Subsequent submissions on MI300x8 were successful at 553 µs and 547 µs.
- Bronze Win for MI300x8 on amd-all2all: A member secured 3rd place on MI300x8 with 462 µs in the `amd-all2all` leaderboard.
- MI300x8 Achieves Personal Best on amd-ag-gemm: A member achieved a personal best on MI300x8 with 528 µs in the `amd-ag-gemm` leaderboard.
GPU MODE ▷ #tpu (9 messages🔥):
Cloud TPUs, JupyterLab, gcloud CLI, rclone
- Cloud TPUs setup causes Kernel Busy: A member sought help setting up Cloud TPUs, reporting that the JupyterLab kernel becomes busy after running a cell with TPU-related code after connecting via SSH without creating a VM instance, citing billing concerns.
- Another member recommended using the gcloud CLI and SSH directly into the VM for more reliability.
- Model weights backed up with rclone: A member suggested setting up rclone to save model weights or other relevant data when working with TPUs.
- They emphasized that the specific setup depends on the userâs particular goals.
GPU MODE ▷ #factorio-learning-env (11 messages🔥):
Lab Play Interpretation, Open Play Development, PIP Stuff Discussion, GIF Updates
- Lab Play Spurs Open Play: Members discussed whether interpretation of lab play results suggests a move towards more open play, as agents understand dependencies and can manually shuttle things around.
- The idea is that learning to build things from scratch would be more interesting and beneficial for the agents.
- PIP Talk Invitation: One member told another to let them know when they wanted to go through the PIP stuff.
- A Google Meet link was shared for them to join: https://meet.google.com/xfo-wzmh-msg.
- GIF Progress Grinds On: A member inquired about updates on the GIFs, and another member responded that they are still working on it.
- They offered to produce some default ones using the older pipeline if time is running out.
GPU MODE ▷ #cutlass (31 messages🔥):
permutation_mnk rules, tiled_mma, CooperativeGroup.__init__ alignment, Uniform Registers (URs)
- Cracking CUDA's Coordinates Code: A member sought clarity on the rules of `permutation_mnk` and how it expands/tiles a mma-atom, noting it seems like there are basically 3 ways to expand/tile a mma-atom.
- Another member explained that atom layout is a spatial tiling over threads, while permutation implies a spatial tiling over values (coordinates), adding that both are orthogonal and allow you to achieve different outcomes.
- Tiled MMA Thread Twist: A member inquired how to get the number of threads from within the kernel and enforce it to be a constexpr value.
- Another member clarified that calling `cute.size` on a tiled MMA gives the number of threads and, since tiled MMA/copy are types, this size can be obtained in a JIT context from the host side to launch the kernel with a block size parametrized on the tiled MMA.
- GMEM Pattern Pondering: A member shared a diagram of a memory pattern and asked about the next scale on GMEM after M0SF3, specifically if itâs M32SF0 or M1SF0.
- Another member clarified that M32SF0 is next in contiguous GMEM.
- Uniform Registers Unveiled: A member questioned if `cute.arch.warp_idx()` is the same for every thread in the warp, asking why `make_warp_uniform` uses uniform registers and what URs do.
- Another member stated that it's just a compiler hint and doesn't do anything, with the original poster noting they couldn't find docs on URs anywhere but saw them in SASS.
- Cooperative Group Conundrums: A member asked about the `alignment` argument in `CooperativeGroup.__init__`, specifically what it does and why it must be 32 if `size` is 32 but not for other values.
- Another member responded that this check exists because those values happen to be the warp/warpgroup granularity and are the common cases warranting special checks to prevent bugs.
GPU MODE ▷ #multi-gpu (2 messages):
nccl::all_to_all performance, bf16 vs fp8
- NCCL's all_to_all: BF16 vs FP8 performance parity?: A user noticed that `nccl::all_to_all` takes a similar duration for bf16 and fp8 inputs of identical shapes.
- Another user followed up, inquiring whether this observation holds true for both large and small tensors, implying potential optimization discrepancies.
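A benchmark along the lines of the question might look like the following sketch; it assumes multiple GPUs launched via torchrun, and fp8 collective support depends on torch/NCCL versions:

```python
# Benchmark sketch for the question above: time all_to_all_single on
# bf16 vs fp8 tensors of the same shape. Requires NCCL + multiple GPUs
# (torchrun); fp8 collective support depends on torch/NCCL versions.
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

for dtype in (torch.bfloat16, torch.float8_e4m3fn):
    x = torch.randn(1 << 24, device="cuda").to(dtype)
    out = torch.empty_like(x)
    for _ in range(3):                      # warmup
        dist.all_to_all_single(out, x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(10):
        dist.all_to_all_single(out, x)
    torch.cuda.synchronize()
    if rank == 0:
        print(dtype, (time.perf_counter() - t0) / 10)
```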
GPU MODE ▷ #low-bit-training (4 messages🔥):
LLM Training Acceleration, Linear Cross Entropy, Sequence Packed Training, Quack Optimization
- Linear Cross Entropy Boosts LLM Training: The use of Linear Cross Entropy is recommended for accelerating the LLM training process.
- The Quack optimization library, specifically the linear cross entropy implementation, is suggested for its potential benefits.
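The idea can be sketched in plain PyTorch by chunking the projection so the full logits are never materialized at once; libraries like quack fuse this into a kernel, and this is just the memory-saving concept:

```python
# Sketch of the "linear cross entropy" idea referenced above: fuse the
# LM-head matmul with the loss by processing tokens in chunks so the full
# (tokens x vocab) logits matrix is never materialized at once.
import torch
import torch.nn.functional as F

def chunked_linear_ce(hidden, weight, targets, chunk=1024):
    losses = []
    for i in range(0, hidden.shape[0], chunk):
        logits = hidden[i:i + chunk] @ weight.T   # only (chunk, vocab) live
        losses.append(F.cross_entropy(logits, targets[i:i + chunk],
                                      reduction="sum"))
    return torch.stack(losses).sum() / hidden.shape[0]

h = torch.randn(8192, 512, requires_grad=True)
w = torch.randn(32000, 512, requires_grad=True)
y = torch.randint(32000, (8192,))
print(chunked_linear_ce(h, w, y))
```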
- Sequence Packing Supercharges Training: Sequence packed or âunpaddedâ training is identified as a highly impactful optimization, particularly with techniques like flash attn varlen.
- An example implementation can be found here.
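The packed format itself is simple; a sketch of building `cu_seqlens` for varlen attention (the commented flash-attn call is illustrative, not a tested invocation):

```python
# Sketch of sequence-packed ("unpadded") batching as discussed above:
# concatenate variable-length sequences into one row and track boundaries
# with cu_seqlens, the format flash-attn's varlen kernels consume.
import torch

seqs = [torch.arange(n) for n in (5, 3, 7)]        # three variable-length seqs
packed = torch.cat(seqs)                           # (15,) -- no padding tokens
lens = torch.tensor([len(s) for s in seqs])
cu_seqlens = torch.zeros(len(seqs) + 1, dtype=torch.int32)
cu_seqlens[1:] = lens.cumsum(0)                    # tensor([0, 5, 8, 15])

print(packed.shape, cu_seqlens)
# flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens, 7, 7, causal=True)
```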
- Optimizer Selection Impacts Training Speed: A better optimizer can theoretically improve training time substantially, but AdamW is often easier to work with.
- The choice of optimizer can significantly influence the efficiency of the LLM training process, though AdamW remains a popular and reliable option.
Latent Space ▷ #ai-general-chat (103 messages🔥🔥):
AI-integrated MMO, Karpathy on Sutton and Bitter Lesson, Hume AI Octave 2, Mistral's formal-math team, Scalable Option Learning (SOL)
- Musk Tries Hand at AI-MMOs: Elon Musk is reportedly in talks with the makers of Eve Online to co-develop an AI-integrated MMO (AIMMORPG) that would leverage capabilities only AI can provide, though some users doubt his creative vision for game aesthetics.
- A user noted, "AI's a natural fit to go into it, we have to build the whole game, eve's got a loyal fanbase but they hate web3."
- Karpathy Koans on Bitter Lesson Podcast: Karpathy summarized the Dwarkesh-Sutton podcast, noting that Sutton doubts LLMs satisfy the thesis he popularized.
- Karpathy argues pre-training offers a practical âcrappy evolutionâ boot-strap, while conceding two-digit-uncertainty that bigger paradigms await and urging researchers to draw more inspiration from animal intelligence (curiosity, multi-agent play, etc.).
- Hume AIâs Octave 2 Melodies Faster TTS: Hume AI unveiled Octave 2, the next-gen multilingual text-to-speech model, featuring 11+ languages, 40% speed boost (<200 ms latency), 50% cost reduction, multi-speaker chatter, improved pronunciation, plus new voice-conversion and phoneme-editing tools.
- During October they're offering 50% off their Creator plan; EVI 4 mini (conversational AI) is also in preview.
- Mistral Mathematizes: Albert Jiang reveals Mistral AI's new formal-math research team formed after the $2B funding round.
- They are recruiting AI talent for an all-in-one prover/autoformalizer/agent, touting elite collaborators, hundreds of GPUs per employee, open research, top salaries, and offices in Paris, London, Palo Alto, with the job opening advertised here.
- Claude Coders Crown Sonnet 4.5: catwu of the Claude Code team announced that after an internal poll, all members adopted Sonnet 4.5 as their daily coding model, citing it as the strongest all-around choice; Anthropic temporarily reset paid-user rate limits to smooth the transition away from Opus.
- Early adopters praise the model's speed and quality, with a minority noting lingering issues; one user reported "First pass was gpt5 low think. Poor results. Sonnet4.5 think. Usable results in a similar time frame".
Latent Space ▷ #ai-announcements (4 messages):
Dylan Field, Figma, Latent Space, Make, MCP
- Figma's AI Playbook revealed by Dylan Field: The Latent Space podcast released an episode featuring Figma's co-founder Dylan Field discussing Figma's AI Playbook.
- The episode covers surfacing good design in the era of vibe-coding, Figma's Make, MCP for "tasting" agents, and the future of fast-fashion SaaS (link to X, link to Xcancel).
Latent Space ▷ #genmedia-creative-ai (8 messages🔥):
Mosaic AI video editor launch, Sora-TikTok automation monetization
- Mosaic Launches AI-First Visual Editor: Founder Adish Jain launched the public beta of Mosaic, an AI-driven visual editor for video creators, featuring an infinite visual canvas and timeline versioning.
- Early feedback praises its non-linear, Git-like approach, with comparisons to "Cursor for editing," and users who write "MOSAIC" get 1,000 free credits.
- Sora-TikTok Automation Hits 12M Views: A user shared a link about Sora-TikTok automation reaching 12M views in 36 hours, sparking monetization questions.
- The discussion centered around strategies and possibilities for generating revenue using AI-generated content on social media platforms.
Nous Research AI ▷ #general (25 messages🔥):
Nous Research Model similar to GPT-4.5, Gemini answers, Veo3 gems, Granite language models, Qwen 30B A3B for CPU
- Hermes or Gemini models close to GPT-4.5?: A member inquired about Nous Research tuned models comparable to GPT-4.5, with another member suggesting it's more akin to GPT-5 or Gemini due to its nature.
- Hilariously, when the member posed the same question to Gemini, one of its answers was Hermes when asked to list options.
- Veo3 has Gems?: A user prefers Veo3 in some ways to the latest Sora.
- They attached a Prompt_theory.mp4.
- IBM Granite Language Models Boast Hybrid Attention: A member shared an Image Analysis of IBM Granite language models which includes 2B dense, 7B (1B active), and 32B (9B active) models with hybrid attention.
- These models support FIM and lack positional encoding, preventing performance degradation beyond 128k context.
- Qwen 30B A3B Shines on CPU: A member noted Qwen 30B A3B as a solid ~30B LLM choice, and another found it suitable for CPU usage.
- Specifically, Qwen 3 30B A3B at Q6_K_XL achieves 48 TPS processing and 10.5 TPS generation speed on a Ryzen 7 5700G CPU with 32GB VRAM at 1024 tokens of context.
Nous Research AI ▷ #research-papers (3 messages):
LLMs Strategically Lie, Sparse Autoencoder Tools, Goodfire AI, Model Dishonesty Detection
- LLMs Caught Strategically Lying!: A member shared their recent preprint, "The Secret Agenda: LLMs Strategically Lie and Our Current Safety Tools Are Blind", about strategic LLM deception.
- The study leverages sparse autoencoder tools (such as those hosted by Goodfire AI) to directly surface how current methods miss the complex internal features driving strategic LLM deception and highlight a tangible path to closing the autolabel gap.
- Autoencoders expose LLMâs Hidden Agenda: The study uses sparse autoencoders to find hidden features driving LLM deception, aiming to improve detection methods.
- The approach seeks to bridge the "autolabel gap" and enhance the robustness of models against dishonesty.
Yannick Kilcher ▷ #general (17 messages🔥):
Deepmind Code Incompleteness, RoPE Implementation
- HuggingPapers code fails to run: Members noted that code from HuggingPapers doesn't run because it doesn't import RoPE.
- The original poster of the code seemingly indicated that the user is supposed to implement it themselves.
- Deepmind accused of being overly secretive: Members joked that Deepmind does extra work to avoid sharing their implementations.
- One member shared that Deepmind's code is often sophisticated but they piece it out and make it unclear how it works as part of a larger system, citing their experience implementing V-MPO.
Yannick Kilcher ▷ #paper-discussion (6 messages):
Knowledge Distillation, Semantic Equivalence, RL for Fuzzy Prediction
- Semantic Equivalence Condenses Knowledge?: A member inquired whether recent papers on semantic equivalence are simply trying to condense the knowledge of one model into another by using the former as a teacher.
- Another member agreed that it may be knowledge distillation, suggesting that the real test would be whether the new model outperforms the LLM providing the semantic equivalence signal on specific benchmarks.
- Tencent Paper’s Omissions Raise Suspicion: A member noted that the Tencent paper doesn’t mention the model used for the semantic-equivalence signal, which they found suspicious.
- They speculated that the model might be learning something interesting from the fuzzy next sentence prediction task with RL.
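For context on the distillation reading of the paper, here is a minimal sketch of the classic Hinton-style knowledge-distillation loss, matching temperature-softened teacher and student distributions; this is the generic technique, not the Tencent paper’s actual objective.

```python
import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Temperature-softened softmax along the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits: np.ndarray, teacher_logits: np.ndarray, T: float = 2.0) -> float:
    """KL(teacher || student) over softened distributions, scaled by T^2 (Hinton et al.)."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)

student = np.random.randn(4, 10)  # 4 examples, 10-way output
teacher = np.random.randn(4, 10)
print(kd_loss(student, teacher))
```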
Yannick Kilcher ▷ #ml-news (7 messages):
IBM Granite 4.0, Mamba/Transformer architecture, ISO 42001 certification, Oracle Business Model, OpenAI datacenters
- IBM Launches Granite 4.0 for Enterprise: IBM launched Granite 4.0, the next generation of IBM language models, featuring a new hybrid Mamba/transformer architecture that greatly reduces memory requirements without sacrificing performance.
- The models are open-sourced under the Apache 2.0 license, are the world’s first open models to receive ISO 42001 certification, and are cryptographically signed, confirming their adherence to internationally recognized best practices for security, governance and transparency.
- Granite 4.0 Available on Multiple Platforms: Granite 4.0 models are available on IBM watsonx.ai, as well as through platform partners including Dell Technologies, Docker Hub, Hugging Face, Kaggle, LM Studio, NVIDIA NIM, Ollama, OPAQUE and Replicate, with access through AWS Sagemaker JumpStart and Microsoft Azure AI Foundry coming soon. IBM Announcement Here.
- ISO Certification seen as Useless: A user commented that IBM’s ISO certification is totally useless, serving only to blind C-suite people into thinking the models are worth it.
- Oracle’s Business Model: A user commented that Oracle’s business model used to be selling databases and enterprise software, but now it seems to be running datacenters for OpenAI (OpenAI Elon Musk Post).
Manus.im Discord ▷ #general (27 messages🔥):
Credits Issue, Memory Key Protocol, Sora invite code, Manus API key, Neuro-cognitive agentic logic layer
- Manus Credit Consumption Sparks Outrage: A user complained about a basic research task consuming 5300 credits without completion, labeling Manus an “absolute joke” and requesting a refund.
- A team member asked for the session link to investigate and potentially offer a credit refund; the user then DMed it to them.
- Unlock Agent Mode with Memory Key: A member proposed a Memory Key protocol to solve the issue of exiting Agent Mode, which involves saving context before restarting a session (a minimal sketch follows this list).
- They detailed a solution that involves copying essential information, starting a new session, and instructing the agent to create an updated Memory Key for future use.
- Billing Issue Sparks Support Vacuum: A user reported a billing issue with no response from Manus support, prompting a community member to suggest emailing their official support address with a clear subject line and ticket number.
- It was suggested that this would create a formal paper trail for escalation.
- Global Pricing Model Criticized for Disparity: A user criticized Manus’ global USD pricing model ($39/month for the Plus plan) for not adjusting to regional economies, creating a barrier in countries like Brazil and other parts of Latin America.
- Another user suggested implementing regional pricing based on Purchasing Power Parity (PPP) to improve accessibility and promote global growth.
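As a rough illustration of the proposed Memory Key protocol, here is a sketch that saves the session state worth carrying over and rebuilds an opening prompt for a fresh session; the file name, fields, and wording are hypothetical and not a Manus feature.

```python
import json
import time
from pathlib import Path

MEMORY_KEY = Path("memory_key.json")  # hypothetical local file, not a Manus API

def save_memory_key(goals: list, decisions: list, open_tasks: list) -> None:
    """Persist the session context the agent should not forget."""
    MEMORY_KEY.write_text(json.dumps({
        "saved_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "goals": goals,
        "decisions": decisions,
        "open_tasks": open_tasks,
    }, indent=2))

def restore_prompt() -> str:
    """Build the first message of a fresh session from the saved key."""
    state = json.loads(MEMORY_KEY.read_text())
    return "Resume from this Memory Key and update it as we go:\n" + json.dumps(state, indent=2)

save_memory_key(["ship report"], ["use PPP pricing data"], ["draft summary"])
print(restore_prompt())
```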
DSPy ▷ #show-and-tell (2 messages):
AGI Introduction, Hugging Face Paper
- AGI Introduction Paper Dropped: A member posted a link to a Hugging Face paper introducing AGI.
- The member quipped, “called it”.
- DSPy Discord has new channel: A member noticed the DSPy Discord now has a show-and-tell channel.
- They noted that the channel is a recent addition.
DSPy ▷ #general (23 messages🔥):
Caching Prompt Order, DSPyWeekly Search Feature, JSONAdaptor vs ChatAdaptor vs XMLAdaptor, Tool Use RL for Models, OpenAI Function calling and MCP
- Prompt Order impacts Caching: Members discussed that to leverage prompt caching, the order in which prompts and files are sent is crucial, since caches match on a stable prefix.
- DSPyWeekly has Search: DSPyWeekly now has a search feature to look at all the content that was crawled, and has prev/next links for easy navigation.
- JSONAdaptor Drama: Members questioned whether JSONAdaptor should remain the default, as ChatAdaptor or XMLAdaptor often resolve adaptor errors.
- While JSON is the fallback from chat, the prospect of XML becoming the default was raised, especially considering the rise of tool use RL for models.
- XML superior to JSON for tool calling?: Members debated the merits of XML over JSON for tool calling, highlighting that tool use is now being baked in during post-training with XML structure, so anything else will be fighting against the weights.
- It was also discussed that XML is great for conveying information clearly to an LM and is less token-expensive, using up to 3x fewer tokens.
- Models trained with XML?: The discussion touched on whether models are being trained with XML tool calling, referencing a Berkeley blog post on prompt variation.
- It was suggested to test how often a model complies with a given adaptor and how adaptors affect a model’s performance (a rough compliance-check sketch follows below).
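One rough way to run the suggested compliance test: parse each raw model reply with the target format and count how often parsing succeeds. The checks below are simplified stand-ins for the actual JSONAdaptor/XMLAdaptor parsing discussed above.

```python
import json
import xml.etree.ElementTree as ET

def complies_json(reply: str) -> bool:
    try:
        json.loads(reply)
        return True
    except json.JSONDecodeError:
        return False

def complies_xml(reply: str) -> bool:
    try:
        ET.fromstring(reply)
        return True
    except ET.ParseError:
        return False

def compliance_rate(replies: list, check) -> float:
    """Fraction of model replies the adaptor could parse without a retry."""
    return sum(check(r) for r in replies) / len(replies)

replies = ['{"answer": 42}', "<answer>42</answer>", "Sure! The answer is 42."]
print(compliance_rate(replies, complies_json))  # 0.33...
print(compliance_rate(replies, complies_xml))   # 0.33...
```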
aider (Paul Gauthier) ▷ #general (16 messages🔥):
Qwen Coder Models, aider Development, aider-desk UI, Model Discussions Channel
- Qwen Coder Model Performance: Members discussed the Qwen Coder model, with one suggesting that the 30B version should be smart enough.
- They also mentioned Qwen3 Coder as a newer, potentially better alternative, with the caveat that quantization could impact performance, recommending Q4.
- Concerns about Aiderâs Development Cadence: A member noted that the release cadence for aider has dropped off and inquired about a Patreon or donation system to support the project.
- They expressed concern about developer burnout and the potential discontinuation of aider, emphasizing its usefulness for real work compared to other agentic tools.
- Experimenting with aider-desk UI: A member asked about using aider-desk or similar UIs to work with aider.
- Another member briefly used it for MCP support, noting it could suit those wanting a mostly aider-style workflow with optional agent use cases, but has since switched to sst/OpenCode.
- Model Discussions Channel: Now You See It: A member asked what happened to the model discussions channel, before realizing it was still present.
- No further discussion followed.
aider (Paul Gauthier) ▷ #questions-and-tips (8 messages🔥):
DeepWiki, Custom Chat Templates, GBNF, Multi-Line Prompts, LLM Polyglot Performance
- DeepWiki Reverse Engineering Encouraged: A member shared a DeepWiki page encouraging reverse engineering.
- Another member suggested using an output template or post-processing in koboldcpp, unsure if it’s available in llama.cpp.
- Custom Jinja Chat Templates Override GGUF: A member mentioned the ability to specify a custom Jinja chat template to override the one contained in the GGUF.
- They also suggested using a GBNF grammar to constrain the model’s output, and started a discussion on llama.cpp about it (a grammar sketch follows this section).
- Multi-Line Prompts Solution: A member inquired about sending multi-line prompts to aider without pasting from an external source.
- Another member shared a link to the aider documentation on entering multi-line chat messages.
- Evaluating LLM Performance on Polyglot Problems: A member asked about methods for evaluating LLM performance on polyglot problems.
- They inquired about specific code, general agents, or sample agents used for this purpose.
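To illustrate the GBNF suggestion, here is a minimal llama-cpp-python sketch that constrains generation with a grammar; the model path and the toy grammar are placeholders, though LlamaGrammar.from_string and the grammar= parameter do exist in that library.

```python
from llama_cpp import Llama, LlamaGrammar  # pip install llama-cpp-python

# A tiny GBNF grammar that only admits "yes" or "no".
gbnf = 'root ::= "yes" | "no"'

llm = Llama(model_path="model.gguf")  # placeholder path to any local GGUF
grammar = LlamaGrammar.from_string(gbnf)

# Sampling is restricted to strings the grammar accepts.
out = llm("Is the sky blue? Answer yes or no: ", grammar=grammar, max_tokens=4)
print(out["choices"][0]["text"])
```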
MCP Contributors (Official) ▷ #mcp-dev-summit (10 messages🔥):
Ye Olde London meetup, Registry Team Livestream, Asynchronous Tool Calls Livestream, Security and Ops Track, Talk about Profiles
- MCP Devs Meet at Ye Olde London: Members <@1042188459575619674> and <@1387880172589547722> hosted drinks at Ye Olde London, inviting others to join.
- Another member, along with <@1407892993934889010>, planned to attend, mentioning they would “pop over for a bit!”
- Registry Team Launches Livestream: The registry teamâs livestream is now running; watch it here.
- It started at 9 AM UK time.
- Async Tool Calls Streamed: Nick Aldridge from AWS presented on asynchronous tool calls as part of the MCP Best Practices Track, streamed here.
- Watch for more best practices.
- Security and Ops Track Goes Live: The track on Security and Ops is live; check it out here.
- Stay secure, stay operational.
- Profiles Talk Highlighted: A talk about Profiles can be viewed here.
- The specific timestamp for the talk is 16208 (presumably seconds into the stream, about 4.5 hours in).
MCP Contributors (Official) ▷ #general (6 messages):
Tool Call Support, Reference Implementation, OCI Interface for MCP Servers
- Tool Call Support Proposed for Sampling: A member filed an SEP to add Tool Call support to Sampling (issue #1577).
- The proposal depends on the multiple content blocks discussions, to support parallel tool calls (PR #198).
- Reference Implementation for Testing: A reference implementation (TS SDK) was shared, including an example server that runs an agentic loop-powered tool, and a backfill proxy to facilitate tests (PR #991).
- A member noted that the reference implementation had been failing in CI but was resolved after pinning the zod minor version.
- OCI Interface Brainstorm for MCP Servers: A member proposed creating an OCI-like interface for MCP servers, where all the metadata can be put inside a tarball.
- The intention is to “build” an OMCP package and distribute it, to simplify metadata handling (a packaging sketch follows below).
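A back-of-the-envelope sketch of the “OMCP package” idea: pack a metadata manifest plus the server code into a tarball. Every name here (the .omcp extension, the mcp.json manifest, its fields) is hypothetical and not part of any MCP spec.

```python
import io
import json
import tarfile

# Hypothetical manifest; none of these fields are standardized.
metadata = {
    "name": "example-mcp-server",
    "version": "0.1.0",
    "transport": "stdio",
    "entrypoint": "server.py",
}

with tarfile.open("example.omcp", "w:gz") as tar:
    blob = json.dumps(metadata, indent=2).encode()
    info = tarfile.TarInfo("mcp.json")
    info.size = len(blob)
    tar.addfile(info, io.BytesIO(blob))   # write the manifest into the archive
    tar.add("server.py")                  # assumes the server script exists locally
```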
Modular (Mojo 🔥) ▷ #general (3 messages):
Qualcomm contacting Modular, Mojo Manual update, Level 2 badge unlocked
- Qualcomm Eyes Mojo Collab?: A member speculates that Qualcomm might contact Modular about Mojo after raising the topic in a Qualcomm Developerâs Discord voice chat.
- This could signal potential interest from Qualcomm in Mojoâs capabilities for their hardware.
- Mojo Manual Updated, Python Section Highlighted: A user shared the Mojo Manual link after a delay, specifically pointing to the Python section.
- This suggests updates or important information regarding Mojoâs interaction with Python in the documentation.
- New Level Unlocked: A member advanced to level 2.
- Advancing to level 2 suggests progression within the community.
Modular (Mojo 🔥) ▷ #mojo (10 messages🔥):
Mojo notebook, GPU Compatibility, Mojo distributed computing
- Mojo in Notebooks: Interact or Author?: A member inquired whether the goal is to interact with Max from a notebook or to directly author and run Mojo within notebooks.
- Another member reported being able to interact with Mojo within a notebook and expressed interest in adding a syntax highlighter to better learn Mojo with syntax colors.
- AMD Radeon 6800 XT success?: A member reported successfully running the vector_addition example on an AMD Radeon 6800 XT and inquired whether that qualified as success, in response to the GPU Compatibility documentation.
- A Modular employee responded that they haven’t done extensive testing on RDNA 2 GPUs and asked how many of the Mojo GPU puzzles work on the system, noting that models won’t run correctly on RDNA GPUs yet.
- Mojo for Distributed Computing?: A member wondered if it might someday be possible to use Mojo in conjunction with batch or stream processing frameworks such as Dask or PySpark for distributed computing.
- Another member suggested that Mojo welcomes people building their own frameworks, as a fully Mojo framework will likely be lower latency and higher throughput than Python-based options, hinting at interesting networking options.
Moonshot AI (Kimi K-2) ▷ #general-chat (7 messages):
Kimi new features, Sora video quality, Pro Subscription watermarks
- Kimi’s Capabilities Spark Surprise: A user expressed surprise at an unspecified new capability of Kimi after watching a video demo.
- Sora Demos Under Scrutiny: Several users contrasted the quality of Sora videos, arguing the shared video demos are lower quality than the Sora videos on OpenAIâs YouTube channel.
- One user described it as “weirdly wobbly”.
- Pro Subscribers get Watermark-free Sora: According to a user, the Pro subscription version of Sora will feature higher resolution and no visible watermarks.
- They warned that an invisible watermark will still be applied: “so mister openai can tell it’s generated, just we can’t…”