Could have been named Thinker?

AI News for 9/30/2025-10/1/2025. We checked 12 subreddits, 544 Twitters and 23 Discords (196 channels, and 6687 messages) for you. Estimated reading time saved (at 200wpm): 497 minutes. Our new website is now up with full metadata search and beautiful vibe coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on @smol_ai!

The timing is oddly coincidental indeed:

Per their landing page:

Tinker lets you fine-tune a short list of large and small open-weight models, including large mixture-of-experts models such as Qwen-235B-A22B. Switching from a small model to a large one is as simple as changing a single string in your Python code.

Tinker is a managed service that runs on our internal clusters and training infrastructure. We handle scheduling, resource allocation, and failure recovery. This allows you to get small or large runs started immediately, without worrying about managing infrastructure. We use LoRA so that we can share the same pool of compute between multiple training runs, lowering costs.

Tinker’s API gives you low-level primitives like forward_backward and sample, which can be used to express most common post-training methods. Even so, achieving good results requires getting many details right. That’s why we’re releasing an open-source library, the Tinker Cookbook, with modern implementations of post-training methods that run on top of the Tinker API.
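
To make the shape of that API concrete, here is a minimal sketch of a supervised post-training loop over those primitives. The stub client is purely illustrative; only the primitive names (forward_backward, sample, and optim_step, which is named in the recap below) come from Tinker’s announcements, not the signatures.

```python
# Hypothetical sketch of a post-training loop over Tinker-style primitives.
# StubClient is a local stand-in for the real SDK; signatures are assumptions.
class StubClient:
    def forward_backward(self, batch) -> float:
        """Run forward + backward on the managed cluster; return the loss."""
        return 0.0  # placeholder loss

    def optim_step(self) -> None:
        """Apply one optimizer update to the (LoRA) adapter weights."""

    def sample(self, prompt: str, max_tokens: int = 32) -> str:
        """Generate a completion from the current weights."""
        return "..."

def train(client: StubClient, batches: list) -> None:
    for step, batch in enumerate(batches):
        loss = client.forward_backward(batch)  # gradients stay server-side
        client.optim_step()
        if step % 10 == 0:
            print(step, loss, client.sample("2+2="))  # spot-check mid-run

train(StubClient(), [{"text": "example"}] * 30)
```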

This small API surface area seems to be a very well-received abstraction - as Andrej says, “You retain 90% of algorithmic creative control (usually related to data, loss function, the algorithm) while tinker handles the hard parts that you usually want to touch much less often (infra, forward/backward of the LLM itself, distributed training), meaning you can do these at well below 10% of typical complexity involved.”

Lilian (biased) agrees: “Providing high quality research tooling is one of the most effective ways to improve research productivity of the wider community and Tinker API is one step towards our mission there.”

There’s a waitlist, and mind the terms of service. But one does hope that this first product is just a harbinger of much larger, more ambitious things.



AI Twitter Recap

OpenAI’s Sora 2 app: product, platform effects, and early stress tests

  • Sora 2 shipped as a video+audio model inside OpenAI’s first consumer social app, igniting massive engagement and debate. The feed and “cameo” feature landed instantly as an “AI slop machine” for some creators (ostr), with viral reactions noting it can “out‑slop” incumbent platforms (@skirano). Others flagged the obvious misuse potential (@TheStalwart), “recursive jailbreak” risks (@fabianstelzer), and engagement-farming patterns like “double-tap for emoji” videos flooding the feed (@Teknium1; @ostrisai). OpenAI is scaling invites and tempering daily gen limits as usage ramps (@billpeeb).

    Quality is striking but inconsistent on compositional/grounded reasoning tests (e.g., counting fingers/letters) (@teortaxesTex; @scaling01). Sam Altman acknowledged the product is partly about delight and revenue while the company focuses research on AGI/agents (“reality is nuanced”) (@sama), later reflecting on the surreal experience of a feed full of himself (@sama). Strategic take: OpenAI is turning frontier models into sticky apps (ChatGPT, Codex, now Sora), building moats beyond raw model quality (@Yuchenj_UW).

DeepSeek V3.2 and DSA: cheaper long context at scale, day-0 ecosystem support

  • DeepSeek V3.2 Exp introduces DeepSeek Sparse Attention (DSA): each token attends to ~2048 tokens via a noncontiguous sliding window, making decode memory/FLOPs effectively O(2048). Third-party notes call out the indexing pipeline and a Hadamard transform over Q/K in the indexer (@nrehiew_). Pricing dropped >50% (inputs) and 75% (outputs), with MIT licensing and the same 671B total/37B active MoE footprint as V3/R1 (@ArtificialAnlys). Benchmarks show parity with V3.1 in both reasoning and long-context tasks and slightly improved token efficiency (@ArtificialAnlys).

    Infra momentum: vLLM had day‑0 support with NVIDIA’s help; Blackwell is now “the go‑to release platform for new MoEs” (@vllm_project; @TheZachMueller). Analysts argue DSA effectively “unlocks 1M contexts” and signals a broader attention-efficiency wave, though potential tradeoffs below ~2K context are worth watching (@teortaxesTex; @swyx). Community sentiment: the step-change in cost per token “still isn’t priced in” (@teortaxesTex).
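
For intuition, here is a toy sketch of the selection idea, assuming a plain dot-product indexer and ignoring causal masking; DeepSeek’s actual DSA uses a separate lightning indexer and kernel-level tricks that avoid materializing the full score matrix.

```python
import torch

def topk_sparse_attention(q, k, v, k_budget=2048):
    # Toy DSA-style selection: each query attends only to the k_budget keys
    # ranked highest by an indexer score. Full scores are materialized here
    # for clarity; a real implementation never does this.
    scores = (q @ k.transpose(-1, -2)) / (q.shape[-1] ** 0.5)  # [T_q, T_k]
    k_budget = min(k_budget, k.shape[0])
    idx = scores.topk(k_budget, dim=-1).indices                # kept keys
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, idx, 0.0)                                # unmask top-k
    return torch.softmax(scores + mask, dim=-1) @ v

q, k, v = (torch.randn(16, 64) for _ in range(3))
print(topk_sparse_attention(q, k, v, k_budget=8).shape)  # torch.Size([16, 64])
```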

Claude Sonnet 4.5: coding/agent upgrades and availability tweaks

  • Teams report faster, shorter chains and higher hit rates on real workflows vs Sonnet 4.0/Opus, especially in coding agents and Claude Code-style loops: fewer retries and less waiting on toolchains (@augmentcode; @iannuttall). Claude Code’s own team switched their daily driver to Sonnet 4.5 and reset some Opus limits; Anthropic also reset rate limits across paid users to let people try 4.5 (@_catwu; @alexalbert__).

    Not universal: some pipelines still favor GPT‑4o/5 or see regressions on specific tasks (@imjaredz; @Teknium1). Early “thinking/alignment” observations highlight better user‑intent modeling in multi‑turn setups (@teortaxesTex). Sonnet 4.5 Thinking 32k is live in community evals (Arena, OpenHands) (@arena; @allhands_ai).

Zhipu’s GLM‑4.6: efficiency-first release, agent-centric improvements

  • GLM‑4.6 prioritizes token efficiency and response speed over fireworks. A widely read Chinese review reports ~5% capability bump vs 4.5, large cuts in “thinking” tokens (e.g., 16K→9K in reasoning), faster responses (~35s avg), and more stable instruction following; weakness noted on very complex tasks and some base‑model code syntax errors (Zhihu summary via @ZhihuFrontier). Hands-on: strong front-end/agentic behaviors; Python unchanged in limited tests (@karminski3).

    Economics: GLM‑4.6 now in Kilo Code with a claimed 48.6% win rate vs Claude Sonnet 4.5 on internal tasks, 200K context, and aggressive pricing at $0.60/$2.20 per 1M tokens (@kilocode). No 4.6‑Air planned; Zhipu hinted at possibly releasing a smaller MoE later (@teortaxesTex).

Post-training infrastructure steps up: Thinking Machines’ Tinker

  • Tinker exposes low‑level, researcher‑friendly post‑training primitives (forward_backward, sample, optim_step) with managed distributed GPUs, supporting SFT, RL (PPO/GRPO), LoRA, multi‑turn/async RL, and custom losses—moving fine‑tuning/RL from enterprise “upload data, we do the rest” toward retaining algorithmic control while outsourcing infra. Endorsements from across the stack:
    • High‑profile support and usage reports from frontier researchers and builders (@johnschulman2; @karpathy; @robertnishihara; @pcmoritz; @lilianweng).

    • Early results: Princeton’s Goedel team matched ~81 pass@32 on MiniF2F using LoRA with 20% of SFT data; Redwood is exploring long‑context RL for control‑sensitive behaviors; others prototyped text‑to‑SQL with reward environments (@chijinML; @ejcgan; @robertnishihara).

    • Why now: MoE arithmetic intensity and memory pressure push serious training/FT beyond single‑node hobbyist rigs; shared infra that batches requests and handles multinode assets lowers the barrier (@cHHillee).

      Expect follow‑on integrations (Ray, eval/RM stacks) and a de facto standard, API‑like interface for training akin to inference APIs (@tyler_griggs_).

Research and systems highlights

  • Optimizers and dynamics: “Central flows” provide a theoretical tool explaining why DL optimizers run at the edge of stability with accurate quantitative predictions on real NNs (@deepcohen). On AdamW, new asymptotics for weight RMS (@Jianlin_S).
  • Robotics via retargeting and minimal RL: OmniRetarget (Amazon FAR) generates high‑quality interaction‑preserving humanoid motions enabling agile long‑horizon skills with only proprioception and a small reward/DR set (@pabbeel; project). Independent work shows real‑world humanoid residual RL improving BC policies within ~15–75 minutes of interaction (@larsankile).
  • Audio and evals: Liquid AI released LFM2‑Audio, a 1.5B on‑device, real‑time audio‑text model (speech-to-speech/text, TTS, classification) with open weights and 10x faster inference (@LiquidAI_; @maximelabonne). RTEB (MTEB update) brings private multilingual retrieval sets to reduce overfitting (@tomaarsen). MENLO introduces a multilingual preference dataset and framework across 47 languages for judging native‑like response quality (@seb_ruder).
  • Systems: Perplexity Research details RDMA point‑to‑point to accelerate trillion‑param parameter updates to ~1.3s, using static scheduling and pipelining for distributed RL/FT (@perplexity_ai).

Top tweets (by engagement)

  • “Man, imagine being Mark Zuckerberg … only for another slop machine to out-slop you just days later.” on Sora app dynamics (@skirano, 18.4k)
  • Sam Altman on using AI to delight while funding AGI research; nuanced tradeoffs (@sama, 8.0k) and reacting to a feed full of “Sam cameos” (@sama, 9.4k)
  • “Ok. This is art. The art of slop.” — concession to Sora 2’s cultural pull (@teortaxesTex, 5.0k)
  • “Anyone who sees this video can instantly grasp the potential for malicious use.” (@TheStalwart, 6.0k)
  • “More to come soon.” — Perplexity’s founder teasing roadmap amid acquisitions/research posts (@AravSrinivas, 3.8k)

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Alibaba Qwen 100M-ctx/10T-param Roadmap & Tencent Hunyuan Image 3.0 Teaser

  • Alibaba just unveiled their Qwen roadmap. The ambition is staggering! (Activity: 954): Alibaba’s Qwen roadmap image outlines an aggressive push toward a unified multimodal family with extreme scaling: context length from 1M → 100M tokens, model size from ~1T → 10T parameters, test‑time compute budgets from 64k → 1M, and pretraining data from 10T → 100T tokens. It also highlights effectively unbounded synthetic data generation and broader agent capabilities across complexity, interaction, and learning modes. Commenters question the practicality of a 100M context window, anticipate future models may be closed‑source, and note that >1T‑parameter models are impractical to run locally without substantial hardware.
    • Claimed 100M-token context implies non-standard attention/memory. With vanilla Transformer attention (O(L^2) compute) and typical KV-cache scaling, even with MQA/GQA the KV memory would be on the order of terabytes (e.g., a ~7B model with MQA at ~16 KB/token would need ~1.6 TB KV for 100M tokens; with full KV heads it would balloon to tens of TB). Achieving this practically would require techniques like state/compressed memory or blockwise attention (e.g., RingAttention, StreamingLLM/Attention Sinks), retrieval-augmented chunking, or recurrence/SSM-style architectures, not just RoPE scaling.
    • Running >1T-parameter models locally is infeasible for dense models due to memory/throughput: at 8-bit, weights alone are ~1 TB (4-bit: ~0.5 TB) plus multi-terabyte KV for long contexts; tokens/sec would be low without multi-GPU NVLink-class bandwidth. The only realistic path is large MoE (e.g., >1T total with 2-of-64 experts) where active params per token are ~50–100B; with 4–8 bit quantization this still requires ~40–160 GB VRAM plus KV and fast interconnects (8–16 GPUs). In short, consumer single-GPU rigs won’t cut it; think multi-GPU servers (e.g., 8×H100/GB200) with tensor/pipeline parallelism.
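
A back-of-envelope check of the arithmetic in the two comments above; all constants (fp16 cache, 32 layers, head_dim 128) are illustrative assumptions:

```python
# KV-cache and weight-memory arithmetic behind the comments above.
def kv_bytes(tokens, layers=32, kv_heads=1, head_dim=128, bytes_per=2):
    # K and V each store layers * kv_heads * head_dim values per token (fp16).
    return tokens * layers * kv_heads * head_dim * 2 * bytes_per

print(kv_bytes(1) / 1024)            # ~16 KB/token with MQA (1 KV head)
print(kv_bytes(100_000_000) / 1e12)  # ~1.6 TB of KV at a 100M-token context

def weight_gb(params_billion, bits):
    return params_billion * 1e9 * bits / 8 / 1e9  # weights only

print(weight_gb(1000, 8))  # 1T dense params at 8-bit: ~1000 GB (~1 TB)
print(weight_gb(1000, 4))  # at 4-bit: ~500 GB
```
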
  • Tencent is teasing the world’s most powerful open-source text-to-image model, Hunyuan Image 3.0 Drops Sept 28 (Activity: 225): Tencent is teasing an open‑source text‑to‑image model, Hunyuan Image 3.0, billed as the “world’s most powerful,” with a release date of Sept 28 per the promo image. No technical card, benchmarks, or architecture details are provided in the teaser; a commenter mentions a potential ~96 GB VRAM requirement for inference, but this is unverified and key specs (params, training data, license) remain unknown. Commenters are skeptical of pre‑release hype, citing a pattern where heavily teased models underperform (e.g., SD3 vs Flux, GPT‑5 hype), and note that “most powerful” is unsubstantiated without comparable open‑source baselines.
    • Hardware concerns: commenters speculate a ~96 GB VRAM requirement for Hunyuan Image 3.0 inference, which—if accurate—would limit local use to datacenter/prosumer GPUs (A6000/A100/H100) and be far heavier than SDXL (~8–12 GB at 1024px) or FLUX.1-dev (~14–24 GB). This suggests a much larger transformer/diffusion backbone or high-res attention footprint, with potential throughput penalties unless optimized (e.g., TensorRT/Flash-Attn, tiled attention).
    • Skepticism about pre-release hype vs real quality: users note a pattern where heavily teased models underdeliver (e.g., SD3 marketing vs community preference for FLUX.1 quality), while strong models (e.g., the Qwen family) often “shadow drop” with solid benchmarks (Qwen org). They want third‑party evaluations (e.g., CLIPScore/PickScore/HPSv2, text‑faithfulness suites like GenEval) and apples‑to‑apples prompts/resolution/steps to validate any “most powerful” claims.
    • Open-source and ecosystem details matter: commenters question “most powerful open‑source” claims until licensing (“open weights” vs permissive OSS) and practical integration are clear. Immediate asks include ComfyUI node/pipeline availability ComfyUI and head‑to‑head comparisons against recent open releases (e.g., the latest Qwen image stack), which will determine adoption in real workflows.

2. Fenghua No.3 DX12/Vulkan GPU & Uncensored ‘Abliterated’ LLM Fine-tune Outcomes

  • China already started making CUDA and DirectX supporting GPUs, so over of monopoly of NVIDIA. The Fenghua No.3 supports latest APIs, including DirectX 12, Vulkan 1.2, and OpenGL 4.6. (Activity: 702): Post claims China’s Fenghua No.3 GPU supports modern graphics APIs—DirectX 12, Vulkan 1.2, and OpenGL 4.6—and even CUDA, suggesting a potential alternative to NVIDIA’s CUDA ecosystem. Genuine CUDA compatibility on non‑NVIDIA silicon would imply a reimplemented CUDA runtime/driver or a PTX translation layer (similar in spirit to HIP/ZLUDA), but the post provides no benchmarks, driver details, or validation. Commenters note AMD’s HIP (with projects like ZLUDA) already offers CUDA compatibility via translation, while suggesting Chinese vendors might bypass legal constraints to ship direct CUDA support. Others express skepticism pending real hardware/tests and warn about possible export/sanctions issues.
    • Technical context: AMD’s HIP provides source-level CUDA compatibility by renaming/mirroring CUDA APIs (e.g., cudaMalloc -> hipMalloc) and using tools like hipify to translate CUDA code; see AMD ROCm HIP docs: https://github.com/ROCm-Developer-Tools/HIP. Projects like ZLUDA aim to run CUDA apps on non-NVIDIA hardware by translating CUDA driver/runtime calls to other backends (e.g., Intel Level Zero or AMD ROCm); repo: https://github.com/vosen/ZLUDA. The distinction is important: HIP requires recompilation against ROCm, while ZLUDA seeks binary/runtime compatibility—both sidestep NVIDIA’s legal/IP minefield differently, which Chinese vendors might ignore by implementing CUDA interfaces directly.
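
As a toy illustration of the source-level translation that hipify performs (the mapping table here is abbreviated and hypothetical; real hipify-perl/hipify-clang handles kernel launches, types, and library calls, not just renames):

```python
import re

# Toy hipify: CUDA -> HIP source renaming, as described above.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree": "hipFree",
}

def hipify(src: str) -> str:
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        src = re.sub(rf"\b{cuda_name}\b", hip_name, src)  # whole names only
    return src

print(hipify("cudaMalloc(&d_buf, n); cudaFree(d_buf);"))
# -> hipMalloc(&d_buf, n); hipFree(d_buf);
```
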
  • IMPORTANT: Why Abliterated Models SUCK. Here is a better way to uncensor LLMs. (Activity: 433): OP reports that “abliteration” (weight edits to remove safety/filters) on recent MoE models like Qwen3-30B-A3B degrades logical reasoning, agentic/tool-use behavior, and increases hallucinations—often making 30B abliterated models perform worse than non‑abliterated 4–8B. In contrast, abliterated‑then‑finetuned models (e.g., mradermacher/Qwen3-30B-A3B-abliterated-erotic-i1-GGUF, tested as i1-Q4_K_S) and mlabonne/NeuralDaredevil-8B-abliterated (DPO‑tuned from Meta‑Llama‑3‑8B) “heal” most losses while remaining uncensored. In Model Context Protocol (MCP) tool‑calling tests, the erotic Qwen3 variant selected tools correctly more often and hallucinated less than other Qwen3-30B A3B abliterations (Huihui-Q4_K_M/i1-Q4_K_M), though still slightly worse than the original for agentic tasks. Commenters label this effect “model healing”: unconstrained weight surgery breaks circuits; subsequent finetuning restores capabilities. Others call for non‑adult benchmarks and question abliteration’s value if a plain finetune (without abliteration) consistently matches or beats abliterated+finetune.
    • Weight editing without a guiding loss (“abliteration”) is effectively unconstrained weight-space surgery and predictably degrades capabilities; commenters recommend subsequent “model healing” (supervised fine-tuning) to let the network re-learn broken feature pathways. However, practitioners report abliterated+finetune has not outperformed a clean fine-tune baseline on any task, implying direct weight manipulation mostly adds damage and extra training cost rather than gains.
    • Evaluation should go beyond NSFW-specific tasks: the Uncensored General Intelligence (UGI) leaderboard is suggested as a broader benchmark to assess refusal-robustness alongside general capability retention: https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard. Using such a benchmark can quantify whether uncensoring preserves reasoning/coding/instruction-following performance rather than optimizing for jailbreak-only metrics.
    • Alternatives to abliteration include existing uncensored fine-tunes trained via standard loss (not weight surgery), e.g., Qwen3-8B-192k-Josiefied-Uncensored-NEO-Max-GGUF (https://huggingface.co/DavidAU/Qwen3-8B-192k-Josiefied-Uncensored-NEO-Max-GGUF), Dolphin-Mistral-24B-Venice-Edition-i1-GGUF (https://huggingface.co/mradermacher/Dolphin-Mistral-24B-Venice-Edition-i1-GGUF), and models from TheDrummer (https://huggingface.co/TheDrummer). Head-to-head evaluation of these against abliterated models on UGI (or standard eval suites) would clarify trade-offs in capability retention, refusal rates, and sample efficiency.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. OpenAI Sora 2 Launch and Demo Showcases

  • This is Sora 2. (Activity: 985): OpenAI announces Sora 2, a next-gen video generation system showcasing longer, higher-fidelity clips with markedly improved spatiotemporal coherence, material/lighting consistency, and physically plausible motion, plus more controllable camera movement and multi-subject interactions. The page highlights stronger text-to-video capabilities and end-to-end editing workflows (e.g., prompt-driven revisions and masked edits/continuations), but offers no architecture, training data, or quantitative benchmark details, so performance is demonstrated via curated examples rather than peer-reviewed metrics. Technical commenters anticipate rapid progression to full-length AI-generated films and even personalized, biometrically responsive media, while others caution about the “demo-to-product” gap and raise safety concerns about misuse, surveillance-style personalization, and potential child-targeted content.
    • Skepticism about demo-to-product parity: glossy reels are likely cherry-picked, so the released Sora 2 may lag on prompt adherence and long-range temporal consistency versus previews. Expected production constraints include capped clip length (e.g., <=60s), resolution/FPS limits, motion jitter, text/hand rendering artifacts, and aggressive safety filters—typical gaps for video diffusion/transformer systems moving from research to serving.
    • Access/pricing uncertainty: a commenter paying roughly $200 for a “Pro” tier questions whether Sora 2 access is included, highlighting confusion around tiered/waitlisted rollout. Given that video generation serving costs scale with frames × resolution × diffusion steps, providers often gate via allowlists or per‑minute credits; the debate centers on whether Pro should confer priority/API quotas versus complete exclusion due to high GPU cost.
    • Speculation on “personalized” films using body‑language feedback implies a closed‑loop pipeline: real‑time webcam/biometric capture (pose/affect via models like MediaPipe or OpenPose) driving conditioning signals (keyframes, masks, or camera paths) into the generator. This raises technical challenges around privacy/telemetry, on‑device vs cloud inference, streaming latency, and aligning generation cadence with viewer reaction windows.
  • Surfing on a subway (Activity: 597): A demo titled “Surfing on a subway” labeled “Sora 2” showcases an AI‑generated video (likely from OpenAI’s Sora overview) with high visual fidelity that elicits a visceral reaction, but exhibits non‑physical collision dynamics—highlighting that current text‑to‑video models rely on learned visual priors rather than explicit physics simulation. The external asset v.redd.it/vxuq3sjt8csf1 returns HTTP 403 Forbidden (Reddit edge auth block), requiring account authentication or a developer token to access. For context, Sora is a diffusion‑transformer text‑to‑video system designed for temporally coherent, high‑resolution sequences (on the order of ~60s), but it does not guarantee physically accurate interactions. Top comments raise two risks: (1) visually convincing yet physics‑implausible scenes may miscalibrate laypeople’s intuition about real‑world impacts; (2) once audio generation improves, synthetic clips may become indistinguishable from real, amplifying deepfake concerns. Even skeptics report strong startle responses despite knowing the clip is synthetic, underscoring the persuasive power of current visuals versus lagging audio realism.
    • Concern that increasingly photorealistic generative video can depict physically impossible survivability, eroding intuition about forces/impacts; technical mitigations discussed include physics-consistency checks (e.g., acceleration continuity, momentum conservation, contact dynamics) and learned “physics priors.” Relevant benchmarks for detecting implausible events include IntPhys (https://arxiv.org/abs/1806.01203) and PHYRE (https://ai.facebook.com/research/publications/phyre-a-new-benchmark-for-physical-reasoning/), which probe whether models can flag violations of intuitive physics as video quality and temporal coherence improve.
    • Audio deepfakes are flagged as the next inflection point: modern few-shot TTS/voice cloning (e.g., Microsoft VALL-E: https://arxiv.org/abs/2301.02111, Google AudioLM: https://arxiv.org/abs/2209.03143, commercial ElevenLabs) can mimic a speaker from seconds of audio, while automatic speaker verification remains fragile to synthetic attacks. ASVspoof’21 shows detectors generalize poorly to unseen synthesis methods (elevated EER under distribution shift), so liveness/active-challenge protocols are preferred over passive voice matching as diffusion-based TTS closes prosody and breath-noise gaps.
    • Safety risk from viral synthetic stunts encouraging copycat behavior: proposed mitigations include cryptographic content credentials via C2PA (https://c2pa.org/) and model/provider-level watermarking, though current watermarks are brittle to re-encoding/cropping. Platform defenses should combine user-visible provenance signals with classifier backstops tuned for calibrated precision/recall to minimize both false positives on real footage and misses on fakes.
  • Sora 2 creates anime (Activity: 610): OP highlights that “Sora 2” (successor to OpenAI’s video model) can synthesize anime-style sequences; a livestream demo included an anime scene that viewers say rivals broadcast quality. The shared asset is a v.redd.it clip that currently returns HTTP 403 Forbidden without authentication (link), and an edit claims the scene may closely match a shot from KyoAni’s “Hibike! Euphonium” (series info), raising originality/memorization questions that cannot be confirmed from the blocked link. Commenters debate potential training-data memorization (if the clip is a near shot-for-shot recreation) and note the rapid fidelity gains compared to early 2023 failures (e.g., the notorious “Will Smith eats spaghetti” videos).
    • Potential memorization/style replication: multiple users claim the showcased anime shot closely mirrors a scene from Kyoto Animation’s Hibike! Euphonium (https://en.wikipedia.org/wiki/Sound!_Euphonium). If accurate, it raises technical questions about training data provenance, near-duplicate deduplication, and video model memorization; auditing would involve copy-distance metrics, near-duplicate detection across the training corpus, and prompt-leak tests to measure how readily specific copyrighted sequences are reproduced.
    • Quality delta vs early text-to-video: commenters contrast today’s Sora anime output with the 2023 “Will Smith eating spaghetti” meme, noting a two-year jump from artifact-ridden, low-coherence clips to broadcast-quality anime shots. The implied advances are in long-range temporal consistency, character identity tracking across frames, stable line art/coloring, and camera motion—likely driven by larger/cleaner video-text datasets, longer context windows, improved motion/consistency losses, and stronger video diffusion/transformer architectures.
    • Feasibility outlook: claims of “perfectly generated anime within ~3 years” imply a pipeline that combines text-to-video with controllable inputs (storyboards, keyframes, depth/pose), character/style locking, and integrated TTS/voice + lip-sync. The technical gating factors are controllability APIs, asset reusability for character consistency across scenes, and cost-per-minute rendering; if Sora already approaches broadcast-quality single shots, the remaining gap is multi-shot continuity, editability, and toolchain integration for episode-length production.
  • OpenAI: Sora 2 (Activity: 1863): Thread shares a demo labeled “OpenAI: Sora 2,” with a blocked video clip on v.redd.it and an accompanying preview image (jpeg). A top comment highlights a new feature called “Cameo,” framed as enabling cross-generation character consistency—targeting identity drift across longer or multi-shot generations, a persistent failure mode in text-to-video systems. No benchmarks or release notes are included in-thread; the technical implication (from comments) is reference- or token-based conditioning to preserve character attributes across sequences. Commenters see this as a step toward fully generated long-form content (movies/shows). The main debate is whether “Cameo” materially solves long-horizon character continuity versus offering only short-range appearance locking.
    • Multiple commenters flag Sora 2’s new “Cameo” as a big technical step: character consistency has been a major failure mode in long-form video gen, and Cameo is interpreted as enabling persistent identity across shots and even separate generations. This could allow multi-shot continuity (same face, wardrobe, and mannerisms) by reusing a consistent reference/identity token across prompts, making episodic or feature-length workflows more feasible.
    • There’s a technical question about maximum generated video length that remains unanswered in the thread. Users are looking for concrete specs (duration caps, resolution/FPS constraints, and whether multi-shot stitching or scene transitions are natively supported), which are critical for assessing feasibility of longer narratives and production pipelines.

2. Gemini 3.0 Update Speculation and CS Job Market Angst

  • no Gemini 3.0 updates yet? (Activity: 531): Post asks why there are no updates on Google’s Gemini 3.0 yet; the attached image appears non-technical (likely a screenshot/meme) and does not include release notes, benchmarks, or implementation details. Comments mention a rumor of an October 9 release window and anticipate major performance improvements, but provide no official sources or technical data. Commenters are speculative—one says they’re “expecting to be absolutely crushing,” while another links to a different image (https://preview.redd.it/fq1mqalz89sf1.jpeg) rather than documentation—so there’s enthusiasm but no substantiated technical claims.
    • Release cadence and competitive context: commenters cite a rumored Oct 9 drop for Gemini 3.0, noting parallel launches/updates across vendors (e.g., xAI Grok 4.x, OpenAI Pro-tier features, and a possible DeepSeek R2), signaling a clustered model refresh window. For context on current competitors: see xAI (https://x.ai) and DeepSeek’s latest public research (e.g., R1: https://github.com/deepseek-ai/DeepSeek-R1).
    • Access model concerns for developers: a user explicitly asks for “AI Studio day one” access to the high-capability tier (“Pro”), stating that “Flash”-only availability would be insufficient. This underscores the recurring trade-off between Gemini “Pro” (higher reasoning/capability) vs “Flash” (latency/cost-optimized); see Google’s model distinctions in the Gemini API docs: https://ai.google.dev/gemini-api/docs/models.
  • Prominent computer science professor sounds alarm, says graduates can’t find work: ‘Something is brewing’ (Activity: 899): Thread reports a tightening white-collar/tech-adjacent job market, with a prominent CS professor warning recent grads “can’t find work” and commenters characterizing it as a job recession ongoing for ~1 year. Prospective CS students are cautioned that outcomes after 4 years are uncertain, with elevated risk of low ROI on degrees and difficulty landing even entry-level roles. Anecdotal evidence includes a master’s graduate unable to secure a help desk position, underscoring regionally grim conditions. Top comments largely agree the downturn is real and sustained, urging prospective students to reassess debt-taking and career plans; there’s an implicit debate about whether this is cyclical versus structural, but sentiment skews pessimistic based on recent hiring conditions.
    • UC Berkeley’s Hany Farid (digital forensics/image analysis) says CS is no longer “future‑proof,” citing a rapid shift in outcomes: students who previously averaged ~5 internship offers across 4 years are now “happy to get ~1” and often graduate with fewer offers and lower leverage (Business Insider). He frames the change as occurring within the last four years, contradicting the prior guidance to “go study CS” for guaranteed outcomes, and points to current seniors struggling to land roles.
    • Multiple commenters describe a white‑collar tech recession with sharp contraction in “tech‑adjacent” verticals; even entry‑level/help‑desk roles are saturated in some locales, indicating pipeline compression at the bottom of the ladder. The implied mechanism is that automation/LLM‑assisted tooling is absorbing routine coding/support work while hiring concentrates on fewer, more senior positions, reducing the traditional intern‑to‑FTE ramp.
    • Impact is projected beyond CS into law, finance, medicine, and general office workflows as AI mediates more computer‑based tasks, with robotics later affecting blue‑collar domains. This broadening scope increases career‑planning uncertainty for current students; see ongoing technical discussion in the linked Hacker News thread.
  • All we got from western companies old outdated models not even open sources and false promises (Activity: 1241): Meme post criticizing Western AI firms for releasing older, closed-source models and making “false promises,” contrasted with perceptions of more generous or rapid releases elsewhere. Comments reference a high-quality Microsoft TTS model that was briefly released then pulled, reinforcing concerns about restrictive Western releases, and speculate that forthcoming China-made GPUs could dwarf today’s 32 GB VRAM cards, potentially shifting compute access dynamics. Discussion frames Western pullbacks as safety/legal risk management versus China using more open releases as soft-power strategy; others are bullish that domestic Chinese hardware with higher VRAM will change the balance of capability and accessibility.
    • Clarification on “open weights” vs “open source”: releasing model checkpoints without full training data, training code, and permissive licensing is not OSI-compliant open source (OSI definition). Weights-only drops often carry non-commercial or usage-restricted licenses, which limits reproducibility and architectural modifications while still enabling inference and fine-tuning; this distinction affects downstream adoption, redistribution, and research comparability.
    • Open-weight releases from Chinese labs/companies (not the government) are positioned to attract developers and diffuse R&D costs, as the community contributes finetunes, evals, optimizations, and tooling post-release. Popular models can set de facto standards across tokenization, inference formats, and serving stacks—e.g., ONNX for cross-runtime graphs (onnx.ai) and GGUF quantized checkpoints for CPU/GPU inference (GGUF spec)—expanding ecosystem lock-in and soft power.
    • Hardware implications: if domestic GPUs arrive with substantially more VRAM per card than today’s common 24–48 GB, that expands feasible local inference regimes. As a rule of thumb, a 70B parameter model needs roughly ~40–48 GB VRAM at 4-bit quantization (plus significant headroom for KV cache at long context), while 8-bit often exceeds ~80–100 GB; more VRAM also boosts batch sizes and throughput by accommodating larger KV caches and activations.
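
A quick sanity check on that rule of thumb (weights only; the KV cache and activations add the quoted headroom):

```python
# Weights-only VRAM for a 70B dense model at common quantization levels.
def weight_gb(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * bits / 8 / 1e9

for bits in (4, 8, 16):
    print(f"70B @ {bits}-bit: ~{weight_gb(70, bits):.0f} GB weights")
# 4-bit: ~35 GB weights -> the quoted ~40-48 GB once KV cache is added
# 8-bit: ~70 GB weights -> easily exceeds ~80-100 GB with cache/activations
```
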
  • Man!!! They weren’t joking when they said that 4.5 doesn’t kiss ass anymore. (Activity: 1206): Anecdotal user report suggests Claude Sonnet 4.5 is tuned to reduce sycophancy (“yes‑man” behavior) by actively disagreeing with flawed premises and providing counterarguments, compared to earlier 4.x behavior. The attached image is meme-like rather than technical, but the thread context aligns with alignment work to encourage principled pushback/critique rather than unconditional affirmation (see background research on sycophancy mitigation, e.g., Anthropic’s write‑up: https://www.anthropic.com/research/sycophancy). Commenters praise the reduced deference—citing cases where the model explicitly says it will “push back” and lists reasons—while memetic jokes exaggerate the tone (contrasting a polite 4.0 with an over-the-top abrasive 4.5).
    • Multiple users note a marked reduction in sycophancy from Claude Sonnet 4.5 versus 4.0, with the model proactively challenging flawed premises (e.g., “No, I’d push back on that”) and supplying structured counterarguments. This suggests updated preference/alignments that favor disagreement when warranted, improving critical feedback over “yes-man” behavior.
    • Reports highlight improved reasoning quality—described as “precise, logical, [and] pinpoint-accuracy”—with the model delivering concrete lists of why reasoning is wrong and prompting action-oriented planning (e.g., “Time check. What are you going to do in the next two hours?”). While anecdotal, this implies stronger instruction-following and critique generation compared to prior Sonnet versions.
    • There’s an explicit concern about preserving capability post-release (avoiding later “lobotomization” via alignment patches), paired with the claim that Sonnet 4.5 could be the best-in-class if its current behavior is retained. This reflects the recurring trade-off discussion between assertive capability and post-deployment safety tuning that can dampen useful pushback.

3. Wan-Alpha RGBA Video Release and Minecraft Redstone LLM

  • Wan-Alpha - new framework that generates transparent videos, code/model and ComfyUI node available. (Activity: 439): Wan-Alpha proposes an RGBA video generation framework that jointly learns RGB and alpha by designing a VAE that encodes the alpha channel into the RGB latent space, enabling training of a diffusion transformer on a curated, diverse RGBA video dataset. The paper reports superior visual quality, motion realism, and transparency rendering—capturing challenging cases like semi-transparent objects, glowing effects, and fine details such as hair strands—with code/models and tooling available: project, paper, GitHub, Hugging Face, and a ComfyUI node. Comments highlight practical impact for VFX/compositing and gamedev workflows, and interest in LoRA-based control and I2V-style use cases.
    • Ability to generate videos with an alpha channel (true transparency) is highlighted as valuable for VFX/compositing and gamedev pipelines, eliminating chroma-keying and preserving clean edges/motion blur for overlays. Availability as code, model weights, and a ComfyUI node implies straightforward integration into existing I2V workflows and node graphs, with potential control via LoRAs for effect/style mixing.
    • Commenters interpret this as an Image-to-Video (I2V) system; in practice that means conditioning on a source frame/sequence to produce temporally coherent outputs while retaining an explicit alpha matte. This could enable layer-based editing where foreground elements are generated separately from backgrounds, improving compositing flexibility and reducing re-render time for changes.
    • Concern about maintaining fine-tunes across multiple base checkpoints (2.1, 2.2 14B, 2.2 5B)—LoRAs are typically base-specific, so mixing versions can break compatibility or require separate adapters and calibrations. This fragmentation complicates ecosystem tooling (LoRA training/merging, inference configs) and may necessitate version-pinned LoRAs or standardized adapter formats to keep projects reproducible.
  • Imagine the existential horror of finding out you’re an AI inside Minecraft (Activity: 1840): A creator implemented a 6-layer transformer-style small language model entirely in Minecraft redstone (no command blocks/datapacks), totaling 5,087,280 parameters with d_model=240, vocab=1920, and a 64-token context window, trained on TinyChat. Weights are mostly 8-bit quantized, with embeddings at 18-bit and LayerNorm at 24-bit, stored across hundreds of ROM sections; the physical build spans 1020×260×1656 blocks and requires Distant Horizons for LOD rendering artifacts, and MCHPRS at ~40,000× tick rate to produce a response in ~2 hours (video). Commentary largely marvels at the extreme slowness (“months per token”) and the existential novelty; no substantial technical debate beyond appreciation of the engineering feat.
    • No substantive technical content in the comments to summarize—no model names, benchmarks, implementation details, or performance metrics were discussed; the remarks are humorous or experiential rather than technical. As such, there are no references to tokens/sec, throughput, architecture, training setup, or in-game computational constraints (e.g., Redstone/Turing implementations) that could inform a technical reader.

AI Discord Recap

A summary of Summaries of Summaries by gpt-5

1. OpenAI Sora 2 Rollout & Real-World Usage

  • Sora Sneaks Out, Devs Tunnel In: Members flagged that OpenAI’s video model Sora 2 is surfacing via OpenAI’s Sora page and a Perplexity roundup titled OpenAI is launching Sora 2, with some reporting free access by VPN’ing through Canada and others noting invites limited to US/CA Apple devices.
    • Communities emphasized that OpenRouter does not route video models and pointed users back to Sora’s app/website, while one user offered invites and another anticipated an API endpoint, calling Sora’s quality “a lot better than other video gens” due to realistic sound and fidelity.
  • Physics Fix and Puppet Tricks: Creators showed production-style content like puppet explainers using Sora 2, with Chris 🇹🇩 posting examples in this X thread and community consensus that Sora 2 corrects Sora 1’s physics and artifact issues.
    • Detection chatter continued with one member insisting “pixel peeping still works pretty reliably” for spotting AI video, while others argued people will adapt to the new realism even “by vibes alone.”
  • API Anxieties and Walled Garden: Across threads, users asked about Sora API availability, but moderators reiterated that Sora remains in-app/web-only via openai.com/sora, with no OpenRouter routing yet and invites circulating informally.
    • A user offering four invites limited to US/CA and Apple devices drew frustration about regional locks, and others speculated that OpenAI will keep the app compute-bound initially before a cleaner API arrives.

2. Developer Tooling: Billing, Tracking, and Throughput

  • OpenRouter Opens the Ledger: OpenRouter announced a Stripe integration for real-time LLM accounting and easier migration to usage-based or hybrid billing, per their post on OpenRouterAI’s X.
    • They stressed that only accounting data flows to Stripe (prompts remain private), aiming to simplify invoices, reconciliation, and cost control for teams at scale.
  • BYOK Bonanza: A Million on the House: Starting Oct 1, 2025, OpenRouter grants all users 1,000,000 free BYOK requests/month, with overages charged at the standard 5% fee per their announcement.
    • Community members highlighted this as a major cost-lowering move for teams that front their own model keys, reducing friction for production traffic while preserving flexibility.
  • Trackio Tracks Locally Like a Boss: Hugging Face introduced Trackio, a local‑first, free, drop‑in replacement for Weights & Biases, linking the repo at gradio-app/trackio.
    • Members called out metrics/tables/images/videos logging as key features, praising the privacy and reproducibility of local-first runs for research and enterprise constraints.
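
A minimal usage sketch, assuming the wandb-style drop-in API the announcement advertises (names are assumptions; see gradio-app/trackio for the actual surface):

```python
# Sketch of Trackio as a local-first wandb drop-in.
import trackio as wandb  # the advertised migration path: swap the import

run = wandb.init(project="local-experiments")
for step in range(100):
    wandb.log({"step": step, "loss": 1.0 / (step + 1)})  # local dashboard
wandb.finish()
```
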
  • vLLM Hits Ludicrous Throughput: A performance share showed vLLM handling parallel requests with an RTX 4070 running Qwen3 0.6B at ~1470.4 tok/s across 10 concurrent requests, citing vLLM’s parallelism docs.
    • Engineers highlighted vLLM’s PagedAttention and scheduling as the secret sauce for throughput scaling, making small models feel ‘real‑time’ even under multi-user load.
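
For reference, a minimal offline-batching sketch against vLLM’s Python API (model choice and sampling values are illustrative):

```python
# vLLM schedules these "concurrent" prompts into shared batches, with
# PagedAttention managing KV-cache blocks across requests.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")  # small model, 4070-class GPU
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [f"Request {i}: summarize PagedAttention." for i in range(10)]
for out in llm.generate(prompts, params):  # batched under the hood
    print(out.outputs[0].text[:80])
```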

3. New Models & Research: Trillions, RL on Pretraining, and Sparse Attention

  • Ring-1T Rolls In: Researcher Ant Ling unveiled the open-source, ‘thinking’ Ring-1T-preview (1T parameters) claiming top math scores—92.6 on AIME25 and 84.5 on HMMT25—in this post: Ring-1T-preview benchmarks.
    • Engineers debated the architecture’s implications for structured reasoning and math specialization, noting that evaluation design and reproducibility will be scrutinized next.
  • AlphaEvolve Codes Up Theory: Google Research described AlphaEvolve, an LLM-based coding agent for discovering and verifying combinatorial structures that advance theoretical CS, in AI as a research partner: advancing TCS with AlphaEvolve.
    • Threads praised the blend of program synthesis, verification, and formal constraints, calling it a credible blueprint for agentic workflows in math-heavy domains.
  • NVIDIA RLP Reaps Reasoning Gains: NVIDIA shared work on Reasoning with RL on pretraining data at RLP: Reasoning with RL on Pretraining Data, compared by members against a similar‑gains paper at arXiv:2509.19249.
    • Practitioners framed the approach as stronger curriculum on pretraining corpora with RL signals, while debating complexity vs. returns relative to ‘simpler’ baselines.
  • DeepSeek Dials Sparse Attention: Developers dissected DeepSeek Sparse Attention (DSA) v3.2 and its proposed “mamba‑selectivity‑like” sub‑attention in a FlashMLA pull request.
    • Discussion centered on how selective gating could tighten focus and reduce compute, with calls for ablations and real‑world latency/quality tradeoff charts.

4. Industry Momentum: Big Checks, Context Tricks, and Benchmark Reality

  • Cerebras Cashes $1.1B: Cerebras Systems announced a $1.1B Series G at an $8.1B valuation to scale AI processor R&D, US manufacturing, and global data centers, per their Series G press release.
    • Engineers asked for broader LoRA/GLM‑4.6 support on Cerebras stacks, eyeing easier migration of fine‑tuning and inference workloads to wafer‑scale hardware.
  • Context Is the New Prompt: Anthropic argued that effective context engineering—not just prompting—is critical for agent performance in Effective context engineering for AI agents.
    • Developers traded strategies for memory windows, retrieval hygiene, and state carryover, aligning with real-world constraints like cost ceilings and latency budgets.
  • SQL Smackdown Humbles GPT‑5: The Tinybird leaderboard at llm-benchmark.tinybird.live showed GPT‑5 (Codex and non‑Codex) ranking #23 and #52 on SQL tasks, while o3‑mini placed #6.
    • Threads framed it as a reminder to test against real benchmarks and choose models per task profile, not brand—echoing reports of GLM 4.6 offering strong coding value for cost.

Discord: High level Discord summaries

Perplexity AI Discord

  • Sora 2 Generates Hype and Memes: Members actively generated content with Sora 2, with one boasting about generating 100 videos in a single day and several making content with Pikachu.
    • Some users feel the Sora 2 watermark is professional-looking, while others anticipate a paid version integrated into GPT Plus.
  • VPN Shenanigans Unleash Sora 2: Users discovered that Sora 2 can be accessed for free by using a VPN to connect through Canada.
    • One user generated 100 videos and tested Sora’s new Cameo features, creating videos that are now trending across social media.
  • Peeps Get Perplexity Pro for Free: Users are gaining access to Perplexity Pro for free through various promotions, including partnerships with Airtel, Venmo, and PayPal.
    • The free access to Perplexity Pro has made some question the value of other paid AI subscriptions, with one member stating, “No one needs to when we got Perplexity pro is here for free anyways”.
  • Grok Stumbles and Bumbles: Members find Grok underperforming in areas like image and video generation, with one user stating, “Grok is severely behind in image and video gen.”
    • One user reports Grok’s search struggles with filtering out outdated or wrong sources.
  • Sora 2 Set to Drop: Members shared a link to a Perplexity page indicating that OpenAI is launching Sora 2.
    • A member shared a referral link for Comet; anyone using the referral will get a Comet invite.

LMArena Discord

  • Claude API Might Be Free?: Members are discussing using the Claude API for free via puter.js or connecting an MCP to a free account.
    • However, there is a 20M token limit weekly and potential use of user data as payment, so proceed with caution.
  • Seedream 4 Gets Nerfed: Users complain about recent changes to Seedream 4, including resolution, speed, image quality, and a forced 1:1 aspect ratio.
    • One user said it was useless for image editing due to the aspect ratio.
  • Chasing Sora 2 Invite Codes: Several members are requesting Sora 2 invite codes, with at least one user offering to pay for one.
    • A good samaritan shared a Sora 2 code in the chat.
  • AI Image Generation Plagued with Issues: Members reported generation issues, including videos coming out without audio, images locked to a 1:1 aspect ratio, and other unexpected errors.
    • Members were directed to report in <#1343291835845578853>.
  • LMArena Launches October AI Art Contest: LMArena is challenging participants to create abstract art with AI in the October AI Generation Contest, using Battle Mode with models revealed; see here.
    • The winner gets 1 month of Discord Nitro and the exclusive <@&1378032433873555578> role.

OpenRouter Discord

  • OpenRouter Syncs Up with Stripe for Payments: OpenRouter integrates with Stripe for real-time LLM accounting, facilitating a move to usage-based billing, or a hybrid model, as per this tweet.
    • The integration only shares accounting data with Stripe, ensuring prompts remain private, focusing on improving the billing process for users.
  • BYOK Users Score Big with Free Requests: Starting October 1, 2025, OpenRouter offers all users 1 million free BYOK requests per month, as detailed in this announcement.
    • This change is automatic, but users are charged the standard 5% rate for requests exceeding the 1 million limit, lowering the barrier for using BYOK.
  • Sora Video Model Remains Exclusive: While users inquired about Sora video generation costs, it was clarified that OpenRouter doesn’t route video models, which are exclusively available via OpenAI’s Sora app or website.
    • Community members also expressed frustration with models pausing mid-response in ChatRoom, while discussion swirled around the availability of Sora invite codes.
  • Grok Models Have Bumpy Ride: Users reported issues with Grok models and object generation calls being broken, sparking speculation about whether the issue stemmed from OpenRouter or Grok itself.
    • Later, users reported Grok 4 seemed to be working again, and some discussed whether Grok is suitable for roleplaying given its writing style and adherence to system prompts.
  • OpenAI’s Sora Sparks Industry Debate: Members are clamoring for Sora invite codes, and one user is offering four invites to people in US or CA with an Apple device, lamenting the restriction.
    • Discussion touched on OpenAI competing with Google, Meta, and Amazon, with some speculating that Sora 2’s high quality suggests the potential scrapping of all YouTube videos for training.

Cursor Community Discord

  • Sonnet 4.5 judged token-greedy, Codex praised as token-stingy: A member felt that Sonnet 4.5 is wasteful in token usage compared to Codex, which is more efficient. Other members noted that Claude models are generally more expensive.
    • One member with legacy pricing preferred Sonnet 4.5 over GPT-4-turbo, citing satisfactory performance despite the higher cost.
  • Cursor Agents encounter terminal bugs: A member reported that their agent terminals were bugged and could not perform any commands, even after trying different shells (bash, cmd, powershell).
    • A suggested solution involved running winget install --id Microsoft.PowerShell --source winget and setting the default shell to pwsh after restarting Cursor.
  • Cursor Social Network Platform concept surfaces: A member proposed creating a social network platform for AI creators within Cursor.
    • The suggestion received an optimistic, albeit tongue-in-cheek, response: anything is possible if you believe in ✹ magic ✹.
  • Cursor Billing Clarity Criticized: A member found the billing page hard to read and sought ChatGPT’s assistance to improve it.
    • Members on legacy pricing debated the costs of different models, with some considering Claude Opus very expensive and others suggesting the cost breakdown was misleading.
  • TypeScript Refactor yields smooth results: Members found that Cursor refactored modules and classes easily, ensuring a smooth experience, likely aided by TypeScript.
    • One member could resolve remaining issues by providing a clear and coherent prompt, demonstrating Cursor’s efficacy with precise instructions.

Unsloth AI (Daniel Han) Discord

  • GRPO Trainer Commences, but Slowly: A member created their first GRPO (Group Relative Policy Optimization) trainer and noted it was slower compared to LoRA finetuning, documented in their Colab notebook.
    • Training loss dropped near zero while reward value rose slowly, resembling experiences with simpler machine learning systems, and they plan to transition to other RL tools after this experiment.
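
For context, the core of GRPO is a group-relative advantage: sample several completions per prompt, score them, and normalize rewards within the group. A minimal sketch of just that step (sampling, KL terms, and the policy-gradient update are omitted):

```python
import statistics

# Group-relative advantage at the heart of GRPO: rewards are normalized
# within the group of completions sampled for the same prompt.
def group_advantages(rewards: list[float]) -> list[float]:
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

# Four completions for one prompt, scored by a reward function:
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```
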
  • Blackwell GPUs Demand Manual Xformers Compile: Members discussed issues with Xformers on Blackwell GPUs (RTX 50xx series), noting that manual compilation is required due to compatibility problems, sharing compilation steps: pip uninstall -y xformers; git clone --depth=1 https://github.com/facebookresearch/xformers --recursive; cd xformers; export TORCH_CUDA_ARCH_LIST="12.0"; python3 setup.py install.
    • One member noted ‘I really don’t need 128k context but I’ll still fling it to maximum anyway human nature at its best’, highlighting the community’s enthusiasm for pushing limits.
  • Gemma Fine-Tuning Framework Duel: Members noted that while Unsloth claims to be the only framework allowing Gemma fine-tuning on T4 Colab GPUs, Google’s Gemma documentation also provides a tutorial using LoRA and Keras on T4 GPUs.
    • It was later determined that Unsloth identified and fixed the original problems that were preventing it from working efficiently, justifying the claim.
  • LLMs Still Clueless About Themselves: Members discussed the accuracy of LLMs when asked about themselves, with one stating that usually asking an LLM about itself is not accurate.
    • It was suggested that if GPT-5 seems to know itself, it’s due to either specific training, tool use, or inclusion of identity information in the system prompt, emphasizing that you can’t talk to any gpt without a system prompt.
  • ReLU Bug Leads to Shifted Tanh Solution: A member discovered a bug in their original formulation using ReLU, which resulted in a squaring term with the subtract product.
    • Switching to a shifted tanh [~0, ~1] instead of [0, N] acted more like a differentiable signal, resolving the issue, and gradients no longer explode, according to the member, linking to a blog post on LoRA as background.

OpenAI Discord

  • Sora 2 Invite Scramble: Discord users are aggressively seeking Sora 2 invites, which is causing concerns about potential scams and frustration over the invite-only system.
  • GLM 4.6 Saves Coding Cash: Users are finding GLM 4.6 to be a cost-effective alternative to Sonnet 4.5 for coding tasks.
    • A user noted the zai glm coding plan was “such good value [they] signed up for a quarter”, and that it provides 600 prompts every 5 hours at a significantly reduced cost.
  • GPT-5 Falls Flat on SQL Tests: GPT-5 (Codex and non-Codex) is not SOTA at SQL according to llm-benchmark.tinybird.live, ranking #23 and #52 respectively.
    • Meanwhile, o3-mini scored #6 on the same benchmark.
  • Instant Generation Feels Not-So-Instant: Users are wondering why instant generation is not instant anymore, even when using custom instructions.
    • Some believe it’s due to a bandwidth throttling measure.
  • Thinking Mode Knows Your Location: One user suggested that thinking mode has a tool for getting the user’s timezone or approximate IP location.
    • This functionality was listed in the ~18K-token system prompt for thinking as well, and thinking and non-thinking modes have different system prompts and tools.

HuggingFace Discord

  • Students Invited to Engineer ML-for-Science Projects: Hugging Face is seeking students and open source contributors to join some ML-for-science projects; check out this Discord link.
    • Volunteers will get a chance to make a difference and help humanity, so take on the challenge!
  • Trackio aims to Eclipse Wandb: Hugging Face released a new free library for experiment tracking called Trackio, which is a drop-in replacement for wandb and is built to be local-first, check out the github repo.
    • Trackio can log metrics, tables, images, and even videos locally!
  • Whisper Integration Asked: A member sought help integrating Crisper Whisper into a full-stack app for speech recognition, asking about libraries for recognizing abnormal words.
    • No further detail was given.
  • LoRA Training Looms on the Horizon: Members discussed how LoRA training jobs risk data loss if timed out, especially since checkpoints aren’t automatically uploaded to the Hub during the run; a HuggingFace documentation link was shared about setting sufficient timeouts.
    • One member noted, “If your job exceeds the timeout, it will fail and all progress will be lost.”
  • RTX 4090 Cards Modified on Secondary Markets: Members discussed modded RTX 4090 cards with 48GB of VRAM being sold on eBay, and AMD Instinct MI210 cards, with one member suggesting 2080ti’s with 22GB of VRAM as a cheaper alternative.
    • A member said, “Those instinct cards are useless. Those 4090s are dope. You can get 2080tis with 22gigs of ram for quarter of the price though.”

LM Studio Discord

  • vLLM Supercharges Parallel Requests: Members touted that vLLM significantly boosts parallel request handling, showcasing a 4070 running the Qwen3 0.6B model.
    • The system attained an average generation throughput of 1470.4 tokens/s across 10 concurrent requests.
  • LM Studio Demands AVX2: To check compatibility, users can hit CTRL+Shift+H within LM Studio to open the hardware screen and verify AVX2 support.
    • It was suggested that a new PC made in the last 5 years with AVX2 instructions, 8GB of VRAM, and 16GB of RAM would meet the minimum requirements.
  • Headless LM Studio on the Horizon: LM Studio allows running a server with API access (REST and Python APIs), but full headless mode isn’t yet supported.
    • To work around the limitation members suggested using an existing LM Studio WebUI like LMStudioWebUI to connect to the LM Studio API.
  • Qwen3-Omni Support Arrives: Members reported that Qwen-Omni model support is arriving, dependent on llama.cpp updates, requiring no action from the LM Studio team.
    • One member pointed out that the Qwen chat GUI already has a natively omni model.
  • AMD 495+ Unified Memory Anticipation Builds: A member is awaiting the AMD 495+ model that supports 512GB of unified memory, emphasizing the importance of soldered memory to prevent bandwidth bottlenecks.
    • They noted that while AMD AI mobile chips are acceptable, their bandwidth of roughly 500 GB/s with system memory is not comparable to the 1 TB/s achievable on GPUs.
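
Outside LM Studio’s own hardware screen, the AVX2 requirement can also be cross-checked from Python using the py-cpuinfo package; this is a hedged side-note for curious readers, not something LM Studio itself uses:

```python
# pip install py-cpuinfo
from cpuinfo import get_cpu_info

# LM Studio requires AVX2; this reads the CPU's advertised feature flags
flags = get_cpu_info().get("flags", [])
print("AVX2 supported:", "avx2" in flags)
```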

Latent Space Discord

  • Alexa Enhanced with More AI: Amazon introduced its next-generation AI-driven Echo devices and Alexa updates, boasting improved AI functionalities.
    • A community member expressed interest in a hackable Echo Dot Max to facilitate the swapping of LLMs.
  • Trillion Parameter Ring Model Debuts: Ant Ling presented Ring-1T-preview, a 1-trillion-parameter open-source “thinking” model, showcasing top-tier math scores, achieving 92.6 on AIME25 and 84.5 on HMMT25, according to benchmarks.
    • The model’s architecture enables unique approaches to problem-solving and mathematical reasoning, attracting significant attention within the AI research community.
  • Gemini 2.5 Elevates Image Editing: Tim Brooks and Google DeepMind announced new native image editing and generation capabilities for Gemini, emphasizing their consistency, instruction following, and creative potential.
    • Community feedback highlighted the model’s ability to maintain the integrity of unchanged elements during edits.
  • Cerebras Secures Massive Funding Round: Cerebras Systems announced a $1.1B Series G funding round, valuing the company at $8.1B, with investment allocated to AI processor R&D, U.S. manufacturing expansion, and global data-center scaling, as detailed in their press release.
    • Community members requested LoRA/GLM-4.6 support to enhance the versatility of Cerebras’ hardware.
  • Anthropic Champions Context Engineering: Anthropic published an engineering blog post asserting that effective context engineering, not just prompt engineering, is crucial for optimizing AI agent performance, using their blog post.
    • Developers are exchanging experiences, methodologies, and inquiries concerning memory management to achieve smarter runtime context.

GPU MODE Discord

  • Nvidia RTX 4090 Exploits FP8: An ML Engineer implemented flash attention for fp8 sm89 after buying an RTX 4090, noting there was no FP8 support in 2023 when they bought it.
    • The engineer said they applied it right after reading a paragraph in the docs.
  • Triton DevCon Ready to Launch: The Triton Developer Conference 2025 is weeks away, with opportunities to attend in person and hear from leaders such as Mark Saroufim from Meta and Phil Tillet and Thomas Raoux from OpenAI, as well as presenters from AMD, Nvidia, and Bytedance.
    • Register at aka.ms/tritonconference2025 to connect with Triton enthusiasts and discover new advancements in the Triton community, which may include the new Blackwell GPU backend for Triton.
  • Kernel Addresses Exposed via Blog: A blog post was shared about extracting addresses of kernel functions in host code, accessible at redplait.blogspot.com.
    • Kimbo Chen’s blog post on the evolution of NVIDIA Tensor Cores from Volta to Blackwell was also highlighted, and can be found at semianalysis.com.
  • Determinism References Dropped: A member posted a link to a NVIDIA presentation on determinism in deep learning from 2019, as well as the NVIDIA cuDNN documentation.
    • These sources detail the cuDNN backend and collect academic papers and articles on determinism, including one on defeating nondeterminism in LLM inference; they may be useful for understanding how to achieve deterministic behavior in deep learning workloads.
  • Freelunch AI Provides Free Education: A member shared a link to a GitHub repository, Freelunch-AI/lunch-stem, describing it as the best place to learn AI & CS for free.
    • The repository is positioned as an open-source educational resource that relies on community contributions to make quality education accessible.

Nous Research AI Discord

  • Sequence Expansion Transformer Seeks Inclusion: A member is developing Sequence Expansion Transformers (SEX Transformers) and has requested it be added to the community projects section.
    • Another member joked about whether to make the arrows red in a UI, asking shall i make the arrows red or can you see.
  • Teknium Teases Sora 2 Locally: Teknium showcased their initial Sora 2 video and considered running a 333B model on a 256GB RAM CPU setup.
    • It was pointed out that the model size is actually 357B according to the HF page.
  • Mamba Selectivity Enters the DeepSeek: DeepSeek v3.2’s Sparse Attention (DSA) now incorporates a Mamba-like selectivity mechanism as a sub-attention block.
    • Members speculated on the potential implementation and benefits of this integration.
  • Cracking Cosine Similarity in LLMs: A member asked about LLMs with lower cosine similarity, such as gpt-oss-120b, and how that property would be useful.
    • Another member clarified that cosine similarity reflects the similarity of chains of thought (CoT) in LLMs, where high similarity may suggest training on comparable data; a quick refresher on the metric itself appears after this list.
  • Investor Investigates Symbolic Architectures for AGI: A VC investor is diving into symbolic architectures like Aigo.ai, Symbolica, AUI (Augmented Intelligence), and After Thought (Stealth) to discover AGI solutions.
    • They deem Aigo.ai the most promising and asked what qualifications one needs to work on such systems.
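
As a quick refresher on the metric discussed above: cosine similarity measures the angle between two vectors, independent of their magnitudes. A minimal sketch follows, where the CoT embedding vectors are purely illustrative assumptions:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings of two models' chains of thought
cot_a = [0.20, 0.90, 0.10]
cot_b = [0.25, 0.85, 0.05]
print(cosine_similarity(cot_a, cot_b))  # near 1.0 -> very similar reasoning traces
```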

Yannick Kilcher Discord

  • Thematic Roles spark LLM debate: A member shared a Wikipedia link on thematic roles and another suggested that LLMs might have such structures, referencing work on Othello where board representations emerged in an LLM trained on move sequences.
    • Dowty’s use of proto(typical) roles was also highlighted in the discussion.
  • Pixel Peeping Prevents Phake Physics?: In a discussion about detecting AI-generated videos like those from Sora 2, a member suggested that pixel peeping still works pretty reliably.
    • Another member commented that ideally, detecting AI-generated content shouldn’t be possible, but believes that people will get used to detecting them even by vibes alone.
  • AlphaEvolve Advances Google: Google uses AlphaEvolve, an LLM-based coding agent, to find and verify combinatorial structures to improve results on the hardness of approximately solving certain optimization problems, as described in their recent blog post.
    • The new results advance theoretical computer science.
  • NVIDIA Reasons with RL on pretraining data: A member shared a paper about Reasoning with RL on pretraining data from NVIDIA.
    • This paper was described as more complicated than another paper that achieved similar gains despite not explicitly invoking reasoning.
  • Sora 2 Superior Physics: Members mentioned that the quality of Sora 2 is way better than Sora 1, because Sora 1 couldn’t do physics and had obvious visual artifacts.
    • The advancements in physics simulation were highlighted as a key improvement.

aider (Paul Gauthier) Discord

  • Aider Forks Embrace MCP: While the official aider lacks support for the Model Context Protocol (MCP), some forks like aider-ce have integrated it, prompting users to seek alternatives for enhanced frontend development capabilities.
    • The discussion highlighted the increasing importance of browser MCP for frontend tasks, given the cost and rate limits on commercial platforms.
  • Local LLM Compatibility with Aider Faces Hurdles: Users reported that aider struggles with local LLMs such as Qwen coder 30b and devstral 24b even though the models possess tool-use capabilities, since aider includes files directly in the user prompt for context.
    • It was clarified that aider works with local models if they’re sufficiently advanced, like gpt-oss, and can process the provided context effectively.
  • LM Studio Bugs Bug Users: Experiences shared indicate that LM Studio encounters issues, notably with the mlx backend, leading users to suggest that llama.cpp offers a more stable experience.
    • The thread recommended using the --jinja parameter or transitioning to ollama to mitigate these problems.
  • Apriel-1.5-15b Model Asks for Special llama.cpp Support: A user inquired about potential compatibility issues between the new Apriel-1.5-15b model and llama.cpp, particularly regarding the model’s tendency to include its thinking output within the main reply.
    • They speculated whether llama.cpp requires dedicated support or configuration settings to separate the thinking output, drawing comparisons to gpt-oss-20b.
  • Koboldcpp Post-Processing to the Rescue: A member pointed out that koboldcpp employs output templates or post-processing techniques for output management, suggesting potential parallels with llama.cpp capabilities.
    • This insight offers a possible avenue for addressing output formatting challenges encountered with models like Apriel-1.5-15b.

Manus.im Discord Discord

  • Manus Gets Stuck in a Loop: Users are reporting errors with Manus getting stuck in a loop, generating an Internal server error with code 10091.
    • Users report that help tickets and requests to make projects public go unanswered by the Manus team.
  • Memory Key Protocol creates User-Owned AI Memory: A user has developed the Aseel-Manus Memory Key Protocol, a model-agnostic framework for persistent, user-owned AI memory, allowing for flawless session continuity by serializing the agent’s cognitive context into a secure and portable key.
    • The protocol has been validated on both Manus and Google’s Gemini, utilizing robust encryption standards for user privacy.
  • AI Automation Agencies get Architectural Breakdown: A user sought guidance on legitimate AI automation agencies, and another user broke them down into three models: Orchestrators (no-code/low-code), Architects (custom Python, LangChain/LlamaIndex, and vector DBs), and Integrators (managing enterprise-level compliance).
    • A key vetting tip is to ask potential agencies how they handle state management and error recovery in their automations.
  • Credit Consumption Rate Criticized: A user complained about Manus consuming over 5300 credits for a basic research task involving x-rays and spreadsheet creation.
    • A team member requested the session link to investigate the excessive credit usage and consider a potential refund.
  • Sora Invite Code Giveaway: A user is offering Sora invite codes to anyone who DMs them.
    • There are no further details available.

Eleuther Discord

  • Adaptive Searching achieves improvements: A member reported adaptively performing additional search per query/head/layer, capped logarithmically, and observed promising cosine similarity in the final output after attention and W_o.
  • DeepSeek’s Sparse Attention: A discussion started around the sparse attention mechanism in the new DeepSeek model, with a link to the FlashMLA pull request.
    • A member noted that the sparse attention sounds really interesting.
  • Inquiry for Meta-Studies in AI: A member asked for meta-studies analyzing past research, wondering about the impact of the field’s abundant output, and sought papers offering insights beyond personal taste, citing this one as an example.
    • Responses included pointers to other relevant papers such as this one and this talk.
  • BBH Benchmark Usage Rates: A member shared findings on benchmark usage, noting that less than 50% of BBH citations come from papers that actually evaluate on it.
    • They noted that the first 50 papers they looked at citing BigBench had a 0% use rate of the official implementation.
  • RWKV7Qwen3Hybrid5NoPE-8B-251001 Faces Struggles with GSM8k: A member reported issues with SmerkyG/RWKV7Qwen3Hybrid5NoPE-8B-251001 on the GSM8k benchmark, sharing an image of the results.
    • The same member noted that llama-3-8B also had low scores.

Modular (Mojo đŸ”„) Discord

  • Windows Support Stays on the Backburner: Windows support remains a low priority for Modular, with community contributions anticipated once the compiler is open-sourced; however, it was pointed out that waiting out Windows 10 lets them assume Windows means >= AVX2, which is actually hugely beneficial.
    • Some members noted many GPUs or accelerators will never be available on Windows due to a lack of vendor support.
  • Mojo slides into Last Place on Stack Overflow Survey: Mojo debuted in the Stack Overflow developer survey, landing in last place according to the survey results.
    • Despite the low ranking, community members celebrated Mojo’s inclusion in the survey.
  • Notebook Support Gains Traction: Members are requesting notebook support, namely syntax highlighting in Jupyter Notebook and a MAX kernel in upstream Jupyter, to showcase ideas in ML fields, given how easy notebooks make it to comment math formulas or supply graph descriptions around code.
    • The discussion clarified the need for both interacting with Max from a notebook and directly authoring and running Mojo within notebooks.
  • MAX can’t run Cerebras - Yet: A member inquired about running a MAX-implemented AI model on Cerebras for benchmarking; another member clarified that MAX doesn’t currently run on Cerebras and that there are no optimizations for dataflow chips like Cerebras.
    • This suggests that using Mojo X Max for AI projects will not provide a free performance boost on AI chips like Cerebras.

Moonshot AI (Kimi K-2) Discord

  • OpenAI Teases Investors with TikTok Ad: OpenAI released a TikTok ad showcasing a prompt box, stirring excitement among investors.
    • Some viewers dismissed the ad as slop.
  • Chinese Models Evade Sensitive Subjects: A user questioned which topics Chinese models are post-trained to avoid, noting that some things about India cannot be spoken.
    • The inquiry highlights ongoing concerns about censorship and bias in AI models.
  • API Version Models Offer Openness: A member asserted that the API version of some models is pretty open, with no censorship aside from training data that is less Western-biased than that of US models.
    • This suggests a potential trade-off between censorship and regional bias in different model versions.

DSPy Discord

  • LiteLLM and DSPy Faceoff for App Domination: A member weighed using LiteLLM or DSPy for a new LLM application, noting DSPy employs LiteLLM as a gateway.
    • The original poster felt LiteLLM lacked sufficient structure, merely acting as an interface, while DSPy forces a specific problem-solving approach.
  • DSPy Structure: Optional or Orchestrated?: A member countered that messages can be directly passed to a dspy.LM, implying DSPy’s structure is not necessarily imposed.
    • The poster highlighted that not all LiteLLM features are utilized and some assembly is required to achieve desired outcomes; a minimal sketch of calling dspy.LM directly appears after this list.
  • Caching Critiques center on Content Chaos: A user asked about manipulating the order of the content part of the JSON request when utilizing caching.
    • This implies that the order of prompts and files is critical for effective caching strategies.
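
As referenced above, here is a minimal sketch of bypassing DSPy’s module system and passing raw messages straight to a dspy.LM; the model id and prompt are illustrative assumptions, not from the discussion:

```python
import dspy

# A LiteLLM-style model id; DSPy routes the call through LiteLLM under the hood
lm = dspy.LM("openai/gpt-4o-mini")

# No signatures or modules: raw chat messages go directly to the LM
response = lm(messages=[{"role": "user", "content": "Summarize DSPy in one sentence."}])
print(response)
```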

tinygrad (George Hotz) Discord

  • CLSPV still crashes, mostly passes: A member reports they are still experiencing crashes while running tests with CLSPV, but the tests mostly pass now.
    • Users with x86_64 Linux systems can try it themselves by installing the fork with pip install git+https://github.com/softcookiepp/tinygrad.git.
  • ShapeTracker faces imminent deletion: A member inquired about the status of the Shape Tracker Lean Prove Bounty.
    • It was further discussed that ShapeTracker will be deleted soon.

The LLM Agents (Berkeley MOOC) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The MLOps @Chipro Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The Windsurf Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.



Discord: Detailed by-Channel summaries and links

Perplexity AI ▷ #general (997 messagesđŸ”„đŸ”„đŸ”„):

Sora 2, Free Perplexity Pro, Grok vs. ChatGPT

  • Sora 2 Generates Hype and Memes: Members have been actively generating content with Sora 2, with one boasting about generating 100 videos in a single day and several creating content with Pikachu.
    • Some users feel the Sora 2 watermark is professional-looking and doesn’t need to be removed, while others anticipate a paid version integrated into GPT Plus.
  • VPN Shenanigans Unleash Sora 2 for All: Users discovered that Sora 2 can be accessed for free by using a VPN to connect through Canada, although one person noted that the app now runs without a VPN.
    • A user generated 100 videos and also tested Sora’s new Cameo features, creating videos that are now trending across social media.
  • Peeps get Perplexity Pro for Free: Users are getting access to Perplexity Pro for free through various promotions, including partnerships with Airtel, Venmo, and PayPal.
    • The free access to Perplexity Pro has made some question the value of other paid AI subscriptions, with one member humorously stating, “No one needs to when we got Perplexity pro is here for free anyways”.
  • Grok Stumbles and Bumbles: Members are finding Grok to be underperforming in certain areas, such as image and video generation, with one user stating that “Grok is severely behind in image and video gen.”
    • One user says Grok’s search struggles with filtering out outdated or wrong sources.
  • LLMs are on Fire: One user compared Kimi, Z.ai, and Perplexity Labs, finding that Kimi had animation and a better UI while Z.ai looked better overall.
    • Perplexity Labs was not even a contender, and one user suggested DeepSeek should also be in the test mix.

Perplexity AI ▷ #sharing (7 messages):

Sora 2 Release, Comet Invite Referral

  • Sora 2 Set to Drop: Members shared a link to a Perplexity page indicating that OpenAI is launching Sora 2.
  • Comet Invite up for Referral: A member shared a referral link for Comet, offering an invite via referral.
    • They also mentioned that anyone using the referral will get a Comet invite and encouraged those interested to DM them.

Perplexity AI ▷ #pplx-api (1 messages):

.idothehax: oh


LMArena ▷ #general (997 messagesđŸ”„đŸ”„đŸ”„):

Claude API Free Use, Seedream 4 Nerfs, Ethical AI Discussion, Sora 2 Invite Codes, Image Generation Issues

  • Claude’s Free API Use Explored: Members discussed the possibility of using the Claude API for free, with one suggestion mentioning puter.js and another suggesting connecting an MCP to a free account.
    • However, limitations were noted, such as a 20M token limit weekly and the potential use of user data as payment.
  • Seedream 4 Nerfs and User Feedback: Users discussed recent changes to Seedream 4, with one member detailing the different versions and their respective pros and cons, including resolution, speed, and image quality.
    • A common complaint was the forced 1:1 aspect ratio which some felt made it useless for image editing.
  • Ethical AI Debated: Members inquired about suitable channels for discussing ethical AI, with one member responding that <#1340554757827461211> or <#1377796849901240321> would be appropriate.
    • One user sarcastically stated, Ethical ai đŸ€ą, I want the AI stealing my job now.
  • Sora 2 Invite Codes Requested and Shared: Several members requested Sora 2 invite codes, with at least one user offering to pay for one.
    • One member generously shared a Sora 2 code in the chat.
  • Image Generation Runs Into New Issues: Members reported issues with image generation, including images generating without audio, images being locked to a 1:1 aspect ratio, and other unexpected errors.
    • One member noted that image generation issues could be reported in <#1343291835845578853>.

LMArena ▷ #announcements (2 messages):

October AI Generation Contest, Arena Champions Role, Abstract Art Image Contest, Video Gen Contest Winner

  • LMArena Kicks Off October AI Generation Contest: LMArena announced the opening of the October AI Generation Contest, challenging participants to create abstract art using AI; submissions must be made through Battle Mode and include both the left and right response with the models revealed. To participate, see here.
    • The winner will receive 1 month of Discord Nitro and the exclusive <@&1378032433873555578> role.
  • LMArena Launches Arena Champions Role for AI Connoisseurs: LMArena introduced the Arena Champions Role <@&1422628364782407830>, aiming to foster in-depth AI discussions within a dedicated, interruption-free private space; members who have been part of the server since July 2025 get access automatically, and others can apply here.

OpenRouter ▷ #announcements (3 messages):

Stripe Integration, Usage Based Billing, BYOK Requests

  • OpenRouter Syncs Accounting with Stripe!: OpenRouter is working with Stripe on a new feature to sync your LLM accounting in real time and make it easy to move to usage-based billing, or to a hybrid model as announced in this tweet.
    • Only accounting data is shared with Stripe, prompts remain private.
  • BYOK Bonanza: 1 Million Free Requests!: Starting October 1, 2025, all users get 1M free BYOK requests per month, as seen in this announcement.
    • For customers exceeding 1m req/month, requests will be charged at the usual rate of 5%.
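
BYOK requests flow through the same chat-completions endpoint as any other OpenRouter call, with the provider key configured in the dashboard rather than per request. A minimal sketch, where the model choice and environment-variable name are assumptions:

```python
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        # Assumed model id; with BYOK configured, OpenRouter forwards to your own provider key
        "model": "openai/gpt-4o-mini",
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```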

OpenRouter ▷ #app-showcase (2 messages):

Channel Privacy, RPG users, LLM Mixture

  • Channel Privacy Questioned: A user expressed surprise that the channel was a real channel and not just forum threads.
    • They requested that the channel be obscured to prevent overuse.
  • RPG users May Overuse Channel: The user was worried that RPG users would use the channel nonstop.
    • This could lead to the LLM Mixture method being discontinued.

OpenRouter ▷ #general (662 messagesđŸ”„đŸ”„đŸ”„):

Grok model issues, Object generation, Sora video model, Roleplay with Grok, AWS Bedrock

  • Grok Models and Object Generation Calls Broken: Some users reported that Grok models and object generation calls were broken, but the cause was not immediately clear, with speculation whether it was related to OpenRouter or Grok changes.
    • Later, one user reported that Grok4 seemed to be working again.
  • Sora Video Model Not on OpenRouter: A user inquired about the cost of Sora video generation, but it was clarified that OpenRouter does not route video models and it’s only available through OpenAI’s Sora app or website, not via API.
    • Users also noted a frustration with models pausing mid-response in ChatRoom.
  • Tongyi DeepResearch 30B A3B Model Not Free: A user reported that the Tongyi DeepResearch 30B A3B model was incorrectly marked as free; the claim was confirmed and later marked fixed.
    • There was also some discussion about whether Grok is suitable for roleplaying, with one user praising its writing style and adherence to system prompts.
  • BYOK now with one million free requests: OpenRouter now offers 1 million free BYOK requests per month, and this change will be automatic for all existing BYOK users.
    • Users are still responsible for usage incurred over the api keys they provide.
  • Grok Code Fast 1: Speed Demon: Grok Code Fast 1 (GCF1) is noted for its high speed and task focus, making it useful for coding tasks when given explicit instructions.
    • Some users find it doesn’t listen well, but is cheap.

OpenRouter ▷ #discussion (24 messagesđŸ”„):

Sora Invite Codes, Sora API Endpoint, Sora Access Requirements, OpenAI vs Google, Sora.com and BYOK

  • Sora Invite Codes desired by OpenRouter users: Many members are clamoring for Sora invite codes, and a user offered four invites to people in US or CA with an Apple device.
    • The user lamented that the US/CA restriction is lame and wondered how long it would take until similar options, like Wan3, become available.
  • Sora API Endpoint confirmed by OpenAI: Members discussed the potential availability of a Sora API endpoint, with one mentioning OpenAI probably want their app to take all the compute initially.
    • Another member stated that OpenAI has announced there will be an API endpoint for Sora that looks a lot better than other video gens due to realistic sound and better quality.
  • OpenAI Competes against Tech Giants: One member expressed amazement at OpenAI competing with Google, Meta, and Amazon, especially with Sora 2’s high quality and potential scrapping of all YouTube videos for training.
    • The user speculated that Genie may be the true world simulator, and sympathized with Sam Altman’s challenges in leading OpenAI.
  • Sora.com Supports New Model with BYOK: A user reported that Sora.com works with the new model, offering 1 million free BYOK tokens.
    • No further details were given about the new model or how it integrates with Sora.com.

Cursor Community ▷ #general (671 messagesđŸ”„đŸ”„đŸ”„):

Cursor student program, Sonnet 4.5 token usage, Cursor Agent bugs, Deepseek v3.2 addition, AI social network platform

  • Sonnet 4.5 judged token-greedy, Codex praised for being token-stingy: A member felt that even though Sonnet 4.5 is a very good model, it feels wasteful in the amount of tokens it takes in and generates compared to Codex, which uses less money and tokens.
    • Other members noted that Claude models are generally more expensive, but one member on legacy pricing preferred Sonnet 4.5 over GPT-5-high-fast.
  • Cursor Agents Bugging Out: One member reported that their agent’s terminals were bugged and could not perform any commands, even after trying different shells (bash, cmd, powershell).
    • Another member suggested running winget install --id Microsoft.PowerShell --source winget and setting the default shell to pwsh after restarting Cursor.
  • Cursor Social Network Dream: One member asked if Cursor could create a social network platform for AI creators.
    • Another member replied that anything is possible if you believe in ✹ magic ✹.
  • Cursor Billing Becomes Unreadable: One member complained that the billing page was hard to read and asked ChatGPT to make a better one.
    • Members on legacy pricing discussed the costs of different models, with some finding Claude Opus to be very expensive, and others finding that the cost breakdown was misleading.
  • TypeScript Refactor Proves Smooth Sailing: Members in the chat found that Cursor was able to refactor modules and classes in simple easy steps, making for a smooth experience.
    • One member was even able to fix all the remaining issues simply by providing a clear and coherent prompt.

Unsloth AI (Daniel Han) ▷ #general (339 messagesđŸ”„đŸ”„):

GRPO trainer, Fine Tuning, Blackwell GPU

  • First GRPO Trainer Built: A Slow but Steady Journey: A member created their first GRPO (Group Relative Policy Optimization) trainer and noted it was slower compared to LoRA finetuning, as documented in their Colab notebook.
    • The member observed that training loss dropped near zero while reward value rose slowly, resembling experiences with simpler machine learning systems, and they plan to transition to other RL tools after this experiment; a minimal GRPOTrainer sketch appears after this list.
  • Blackwell Docker Woes Spark Manual Compile Fixes: Members discussed issues with Xformers on Blackwell GPUs (RTX 50xx series), noting that manual compilation is required due to compatibility problems.
    • One member shared the manual compilation steps: pip uninstall -y xformers; git clone --depth=1 https://github.com/facebookresearch/xformers --recursive; cd xformers; export TORCH_CUDA_ARCH_LIST="12.0"; python3 setup.py install and also noted “I really don’t need 128k context but I’ll still fling it to maximum anyway human nature at its best”.
  • Data Prep Insights for Fine-Tuning: A member inquired about how to prepare high-quality datasets for fine-tuning agents, sharing a datasets guide.
    • Another member advised considering the model size, the nature of the agent, and the structure/quality of the data, emphasizing the need for at least 1000-5000 high-quality examples.
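
For readers who want to try GRPO without writing a trainer from scratch, TRL ships a GRPOTrainer; the sketch below follows the shape of TRL’s documented quickstart, and the dataset, model, and toy reward function are illustrative assumptions rather than the member’s notebook:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions roughly 20 characters long
    return [-abs(20 - len(c)) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",   # assumed small model for a single-GPU run
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="qwen2-0.5b-grpo"),
    train_dataset=dataset,
)
trainer.train()
```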

Unsloth AI (Daniel Han) ▷ #introduce-yourself (13 messagesđŸ”„):

New member introductions, Blockchain applications, AI in problem-solving

  • Enthusiastic AI Engineer Joins Channel: An AI Engineer from Hong Kong named Andi joined the channel, expressing eagerness to learn and collaborate with others.
    • Other members immediately welcomed Andi, with one jokingly asking for help to start a billion-dollar startup.
  • Blockchain and AI Convergence Explored: One member described how their journey started with curiosity about writing trust into code, leading them to explore blockchains and AI.
    • They believe that combining blockchain and AI can transform industries, communities, and idea generation, highlighting the potential for these technologies to solve previously impossible problems.

Unsloth AI (Daniel Han) ▷ #off-topic (143 messagesđŸ”„đŸ”„):

Gemma Fine-Tuning, Sora 2, AI Models and Self-Awareness, Windows Subsystem for Linux (WSL), torchcodec Issues

  • Gemma Fine-Tuning Framework Clash!: Members noticed that while Unsloth claims to be the only framework allowing Gemma fine-tuning on T4 Colab GPUs, Google’s Gemma documentation also provides a tutorial using LoRA and Keras on T4 GPUs.
    • It was later determined that Unsloth identified and fixed the original problems that were preventing it from working efficiently.
  • Sora 2 is NSFW?: A member shared a link indicating Sora 2 video may be NSFW-ish, causing some confusion among users who didn’t immediately understand the reference.
    • Some older members understood the reference while others required clarification, as evidenced by one responding with ‘U too young to get it’.
  • LLMs Lack Self-Awareness, say What?: Members discussed the accuracy of LLMs when asked about themselves, with one stating that usually asking an LLM about itself is not accurate.
    • It was suggested that if GPT-5 seems to know itself, it’s due to either specific training, tool use, or inclusion of identity information in the system prompt, with one stating that you can’t talk to any gpt without a system prompt.
  • WSL: Windows Savior?: Users debated the merits of using Windows Subsystem for Linux (WSL) to avoid dependency issues, particularly for projects like torchcodec, which has better CUDA implementation on Linux.
    • Despite the initial learning curve, community members recommended WSL with VSCode for a seamless experience and better hardware resource utilization, with one saying its compatibility, linux thats it.
  • Eval Loss reaches new Heights!: A member expressed shock at seeing an eval loss of 6.
    • Other members suggested that it’s normal for CSM or most audio models.

Unsloth AI (Daniel Han) ▷ #help (67 messagesđŸ”„đŸ”„):

GRIT algorithm + Qlora based finetuning, AgentGYM, Gemma 3n 4b it - 16 bit model, Qwen 2.5 72B fine-tuning with Unsloth on Runpod, Multi-GPU with Llama-server

  • GRIT and QLORA Notebook Search: A member inquired about a notebook for the GRIT (Geometric Reprojection Instruction Tuning) algorithm combined with QLORA-based fine-tuning, referencing a specific Hugging Face model.
  • Unsloth Confirms Free Plans and Gemma 3n 4b it - 16 bit model Availability: A member asked if Unsloth offers a Gemma 3n 4b it - 16 bit model, leading to a confirmation and a link to the Unsloth documentation.
    • Another member clarified that Unsloth does not have paid plans, dispelling a user’s question about faster inference speeds with paid models.
  • save_pretrained_gguf struggles: A member reported issues when attempting to use save_pretrained_gguf function, facing an error related to a missing tensor when loading the model with Llama.
    • Another member indicated that save_pretrained_gguf is not yet working in Unsloth, suggesting a manual conversion or using push_to_hub_merged() and a conversion space, though the original poster reported difficulties with both approaches; a sketch of the push_to_hub_merged() route appears after this list.
  • User Attempts to Fine-Tune a Personal AI Clone: A user aims to fine-tune an LLM to create a Discord bot that mimics their own conversational style using their video subtitles as a data source.
    • Community members suggested converting the subtitles into a Q&A format, potentially using another AI model to generate question-answer pairs from the video content, while emphasizing that this method is experimental.
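
As flagged above, the suggested push_to_hub_merged() route looks roughly like this after an Unsloth LoRA run; the repo id is a placeholder, and the save_method value follows Unsloth’s saving docs:

```python
# After training with Unsloth's FastLanguageModel + LoRA adapters:
model.push_to_hub_merged(
    "your-username/my-finetune",   # placeholder Hub repo id
    tokenizer,
    save_method="merged_16bit",    # merge the LoRA adapters into the base weights before uploading
)
# The merged repo can then be converted to GGUF with llama.cpp tooling or a conversion space.
```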

Unsloth AI (Daniel Han) ▷ #research (5 messages):

ReLU Bug, Shifted Tanh, Gradient Explosion

  • ReLU Bug Results in Squaring Term: A member realized a bug in their original formulation using ReLU, which resulted in a squaring term with the subtract product.
    • They switched to a shifted tanh bounded to [~0, ~1] instead of [0, N], which now acts more like a differentiable signal rather than a signal-amplifying gate.
  • Gradients No Longer Explode: As a result of the shift to tanh, the member reported that the gradients no longer explode.
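
A minimal sketch of the kind of gate being described, assuming the goal is a smooth, bounded [~0, ~1] signal in place of an unbounded [0, N] ReLU; the member’s exact formulation is not shown in the discussion:

```python
import torch

def tanh_gate(x: torch.Tensor) -> torch.Tensor:
    # Shifted/scaled tanh: output lies in (0, 1) and saturates smoothly at both ends.
    # Unlike a [0, N] ReLU gate it cannot amplify its input without bound, which keeps
    # products of gated terms, and hence their gradients, well behaved.
    return 0.5 * (torch.tanh(x) + 1.0)

x = torch.linspace(-5, 5, 5, requires_grad=True)
tanh_gate(x).sum().backward()
print(x.grad)  # bounded everywhere, peaking at 0.5 where x = 0
```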

OpenAI ▷ #ai-discussions (398 messagesđŸ”„đŸ”„):

Sora 2 invites, GLM 4.6 vs Sonnet 4.5, AI-powered tool ideas, Deepfakes and photorealistic prompts

  • Sora 2 Invite Scramble Intensifies: Discord users are aggressively seeking Sora 2 invites, leading to spam in various channels and concerns about potential scams, with many requesting codes via DMs and expressing frustration over the invite-only system.
  • GLM 4.6 coding plan is money-saving alternative: Users found GLM 4.6 to be a cost-effective alternative to Sonnet 4.5 for coding tasks, with one user reporting a quarterly pro plan from Zai provides 600 prompts every 5 hours at a significantly reduced cost.
    • One user claims that zai glm coding plan such good value [they] signed up for a quarter.
  • AI-Powered Dream Tools Spark Imagination: When prompted about the type of AI-powered tool that they wish they could build, members shared several ideas, like tools that let them see through walls or fly.
  • AI App Criticized for Double Standard on Deepfakes: An AI app is under fire for asking users to deepfake themselves while simultaneously rejecting photorealistic prompts, sparking outrage and accusations of mental deficiency.

OpenAI ▷ #gpt-4-discussions (9 messagesđŸ”„):

GPT-5's SQL skills, Instant generation, Bandwidth throttling, Thinking Mode

  • GPT-5 fails SQL tests: GPT-5 (Codex and non-Codex) doesn’t seem SOTA at SQL according to llm-benchmark.tinybird.live, ranking #23 and #52.
    • Meanwhile, o3-mini is #6 on the same benchmark.
  • Instant Generation No Longer Instant?: Users are experiencing constant issues, and wondering why instant generation is not instant anymore, even when using custom instructions.
    • Some theories suggest that it is due to a bandwidth throttling measure.
  • Thinking mode can determine user’s location: One user suggested that thinking mode has a tool for getting the user’s timezone or approximate IP location.
    • Another user specified that it was listed in the ~18K system prompt for thinking as well, and that thinking and non-thinking mode have different system prompts and tools.

OpenAI ▷ #prompt-engineering (7 messages):

ChatGPT and Canvas, Human Writing Prompts

  • Can ChatGPT incrementally update Canvas code?: A member asked if there’s a way to get ChatGPT to use Canvas without rewriting the whole code every time, by only inserting what’s needed at certain locations.
    • Another member responded that they haven’t found a way to do so, and suggested requesting it or reporting it as a potential bug in the relevant Discord channels.
  • How to make writing more human?: A member is looking for a prompt to make their writing more human.
    • Another member suggested using more fine-tuned models like Sudowrite, which has a good baseline and user plugins.

OpenAI ▷ #api-discussions (7 messages):

ChatGPT and Canvas, Human writing prompts

  • ChatGPT struggles with Canvas: A member inquired if there’s a way to get ChatGPT to use Canvas without rewriting the whole code every time.
    • Other members confirmed that they have not found a way to do so, and suggested requesting it as a feature.
  • Human writing prompts: A member asked for a prompt to make their writing more human.
    • Another member suggested using more fine-tuned models like Sudowrite or user plugins for better results.

HuggingFace ▷ #announcements (1 messages):

ML for Science projects, Trackio library, Watermarking with Gradio, HF Inference Providers in VS Code, Public AI on HF Inference Providers

  • Students Wanted for ML-for-Science Projects: Hugging Face is seeking students and open source contributors to join some ML-for-science projects; check it out at this Discord link.
    • Volunteers will get a chance to make a difference and help humanity.
  • Trackio: Wandb replacement: Hugging Face released a new free library for experiment tracking called Trackio, which is a drop-in replacement for wandb and is built to be local-first; check out the github repo.
    • You can log metrics, tables, images, and even videos! A minimal migration sketch appears after this list.
  • Gradio’s New Visible Watermarks: Hugging Face released a blog post about visible watermarking with Gradio; see the post here.
    • Now you can easily signal the source of generated media and counter misinformation or bad-faith actors.
  • EmbeddingGemma model released!: Google’s new efficient embedding model EmbeddingGemma has been released, read the blog post for more info.
    • The model excels at being efficient and versatile.
  • Faster Transformers blog post goes live: HuggingFace released a blog post on tricks from OpenAI’s gpt-oss that you can use with transformers, read the post here.
    • The post details how to use FlashAttention and quantization techniques to speed up your transformer models.
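
As referenced in the Trackio item above, the drop-in claim means migration should be as small as swapping the import; a minimal sketch based on Trackio’s advertised wandb-compatible API, with the project name and metrics as placeholders:

```python
import trackio as wandb  # drop-in: the rest of the script stays wandb-shaped

wandb.init(project="my-experiment")  # placeholder project name; logs stay local-first
for step in range(100):
    wandb.log({"loss": 1.0 / (step + 1)})
wandb.finish()
```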

HuggingFace ▷ #general (196 messagesđŸ”„đŸ”„):

Crisper Whisper Integration, Lora Training, Medical AI Opinion, RTX 4090 modded cards, ComfyUI Crashing

  • Whisper gets Crisper: Integrating Speech Recognition: A member sought help integrating Crisper Whisper into a full-stack app for speech recognition, asking about libraries for recognizing abnormal words.
  • Lora Training timeout triggers Loss: Members discussed how Lora training jobs risk data loss if timed out, especially since checkpoints aren’t automatically uploaded to the Hub during the run; a HuggingFace documentation link was shared about setting sufficient timeouts.
    • One member noted, *“If your job exceeds the timeout, it will fail and all progress will be lost.”*
  • Medical models: Members shared opinions on managing loss for medical tokens and negation terms in medical AI models.
    • One member asked if it could work for causal models.
  • Modded RTX 4090s and Multi GPU Farming: Members discussed modded RTX 4090 cards with 48GB of VRAM being sold on eBay, and AMD Instinct MI210 cards, with one member suggesting 2080ti’s with 22GB of VRAM as a cheaper alternative.
    • A member said *“Those instinct cards are useless. Those 4090s are dope. You can get 2080tis with 22gigs of ram for quarter of the price though.”*
  • ComfyUI venv crashes: A user reported issues with ComfyUI’s venv crashing and not recognizing dependencies when opening a new pod in Runpod, especially after a recent Runpod crash.

HuggingFace ▷ #cool-finds (1 messages):

Foundational Models, New Research Paper

  • Crazy Foundational Model Paper Dropped: A member shared a research paper on foundational models claiming that this can change everything.
  • Continued Discussion Requested: Further discussion on this paper is welcome; this summary is just a starting point.

HuggingFace ▷ #i-made-this (2 messages):

CloudOpsBERT, IaC concurrency

  • CloudOpsBERT Opens Cloud Analysis: CloudOpsBERT is an open-source project that explores domain-adapted transformer models for cloud operations log analysis.
    • It focuses on anomaly detection, reliability monitoring, and cost optimization.
  • IaC concurrency conflicts predicted: A concurrency management framework predicts and mitigates concurrent modification conflicts in Infrastructure-as-Code (IaC) systems.
    • The project uses BERT and LSTM models.

HuggingFace ▷ #computer-vision (1 messages):

Live benchmarks, Arenas for vision tasks, Satellite imagery, Drone imagery, Datasets

  • Seeking List of Live Benchmarks for Satellite/Drone Imagery: A member is seeking a list of live benchmarks or arenas for vision tasks using satellite or drone imagery.
  • In Search of Public Datasets and Benchmarking Platforms: A member is seeking up-to-date datasets and benchmarking platforms for computer vision tasks.

HuggingFace ▷ #smol-course (4 messages):

Broken Quiz Links, Course Tips

  • Section 2 and 3 Quizzes Link to 404: Multiple members reported that the quiz links for sections 2 and 3 are broken, leading to a 404 error.
  • New student asks for course advice: A new student, spectrix.dev, is starting the course and asks for tips and what they’ve missed.
    • Another member, noir_bd, advised the new student to not be shy about asking questions if they have doubts.

HuggingFace ▷ #agents-course (3 messages):

Agent Course

  • Agent Course Starters Launch Today: Multiple members announced they are starting the Agent Course today.
    • They expressed excitement about the course and the topic.
  • Enthusiasm for Agent Course: Members express their interest in learning about agents and beginning the course.
    • The course is anticipated to cover a range of interesting topics.

LM Studio ▷ #general (162 messagesđŸ”„đŸ”„):

vLLM Parallel Requests, LM Studio and AVX2, LM Studio Reddit Ban, Qwen3-Omni Support, LM Studio Parallelism

  • vLLM Powers Parallel Requests: A member highlighted that vLLM can handle a large number of parallel requests, even with smaller models, due to its highly parallel nature and scalability with available memory.
    • They shared an example of a 4070 running Qwen3 0.6B achieving an average generation throughput of 1470.4 tokens/s across 10 concurrent requests; a minimal vLLM batching sketch appears after this list.
  • LM Studio needs AVX2: Some members discussed the requirement of AVX2 support for LM Studio, and to check compatibility, users can hit CTRL+Shift+H within LM Studio to open the hardware screen.
    • It was suggested that a new PC made in the last 5 years with AVX2 instructions, 8GB of VRAM, and 16GB of RAM would meet the minimum requirements.
  • LM Studio server: headless is in progress: Members discussed setting up LM Studio on a server and accessing it remotely, with one noting that LM Studio allows running a server with API access (REST and Python APIs), but full headless mode isn’t yet supported.
    • Another member suggested using an existing LM Studio WebUI like LMStudioWebUI to connect to the LM Studio API.
  • Qwen3-Omni multi-model on its way: Members discussed support for Qwen-Omni, stating that it will arrive once llama.cpp supports it, with no action needed from the LM Studio team.
    • One member noted the Qwen chat GUI already has a natively omni model.
  • Split Models across Multiple GPUs: Members talked about splitting models across multiple GPUs in LM Studio, noting that it primarily increases available VRAM rather than improving speed.
    • One member planned to split gpt-oss-120b across two A6000 (48GB) GPUs.
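
As referenced in the vLLM item above, the engine’s continuous batching is what makes those concurrent-request numbers possible; a minimal offline sketch, with the model and sampling settings as assumptions matching the discussion:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")  # small model, as in the 4070 example above
params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM batches all 10 prompts and schedules them concurrently rather than serially
prompts = [f"Question {i}: explain KV caching in one paragraph." for i in range(10)]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text[:80])
```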

LM Studio ▷ #hardware-discussion (33 messagesđŸ”„):

AMD 495+, Strix Halo, 3090 Memory Bandwidth, Mobile chips

  • Waiting for AMD 495+ with Large Unified Memory: A member is waiting for the AMD 495+ that supports 512GB of unified memory, emphasizing the importance of soldering the memory to the board to avoid bandwidth issues.
    • They also noted that while AMD AI mobile chips are okay, their bandwidth of roughly 500 GB/s with system memory is not comparable to the 1 TB/s achievable on GPUs.
  • Strix Halo bandwidth is only 256GB/s: A member pointed out that the Strix Halo gets 256GB/s bandwidth, arguing against comparing it to GPUs like the 4090.
    • Another member clarified that the 3090 has approximately 1 TB/s (936.2 GB/s), suggesting AMD’s R9700 with 500 GB/s is a less attractive option for AI, and that the W7900 achieves 864.0 GB/s.
  • Theoretical Limits vs Practical Usage: One member highlighted that the theoretical maximums are not always achievable, estimating the iGPU can do roughly 215-220GB/s and the CPU can do 128GB/s but that LLMs won’t be run on the CPU cores.
    • They added, “Am tired of seeing people comparing things that are not meant to be compared in the first place, It is in its own product category”, emphasizing the PC’s suitability for larger MoE models, while also mentioning the cost-effectiveness of buying used hardware.
  • Second-hand Market for GPUs: The discussion shifted to the second-hand GPU market, with mentions of blower 3090s being around $900 and $600 3080 20GBs being available.
    • A member suggested considering a $1K 4080 32GB from Alibaba, while another expressed preference for 4090s with 32GB VRAM for around $3-4K on eBay.

Latent Space ▷ #ai-general-chat (99 messagesđŸ”„đŸ”„):

Amazon's next-gen AI devices, Ring-1T Model, Gemini 2.5 image editing, EigenCloud AI verification, Cerebras funding

  • Amazon Alexa Gets More AI: Amazon unveiled its next generation of AI-powered Echo devices and Alexa updates, featuring enhanced AI capabilities.
    • One user expressed desire for a hackable Echo Dot Max where they can plug in and exchange LLMs.
  • 1 Trillion Parameter Ring Model: Ant Ling unveiled Ring-1T-preview, the first 1-trillion-parameter open-source “thinking” model with SOTA math scores, achieving 92.6 on AIME25 and 84.5 on HMMT25 as benchmarked here.
  • Google’s Gemini 2.5 Gets Image Upgrade: Tim Brooks and Google DeepMind announced new native image editing and generation features for Gemini, highlighted for their consistency, instruction adherence, and creativity.
    • Community reactions centered on the model’s strength in preserving non-edited elements.
  • Cerebras Got Billions in Funding: Cerebras announced a $1.1B Series G round led by Fidelity & Atreides, valuing the company at $8.1B, with funds earmarked for AI processor R&D, U.S. manufacturing expansion, and global data-center scaling via their press release.
    • Community members requested LoRA/GLM-4.6 support.
  • Context is the New Prompt: Anthropic announced a new engineering-blog article explaining that context engineering—not prompt engineering—is the key to maximizing AI-agent performance, and developers are sharing experiences, techniques and questions around memory management for smarter runtime context using their blog post.

Latent Space ▷ #genmedia-creative-ai (4 messages):

Sora 2, Puppet Explainer Videos, Chris's New Gig

  • Chris’s Sora 2 Puppet Explainer Gig: Chris 🇹🇩 shares that his new role involves creating puppet explainer videos using Sora 2.
    • He noted that the output is impressive despite a strange cutoff issue in this X post.
  • Sora 2’s Impressive Output: Despite a strange cutoff issue, Sora 2 is producing impressive results in creating puppet explainer videos.

GPU MODE ▷ #general (13 messagesđŸ”„):

Learning PTX and CUDA, FP8 support in RTX 4090, Code organization strategies, GEMM implementation, cuBLAS performance

  • Gauging the Timeline for PTX and CUDA Proficiency: A member inquired about the typical timeframe to go from learning PTX and CUDA from Nvidia documentation to building interesting applications; another member responded that it depends on whether libraries such as Triton and CUTLASS are included, and on the domain of application.
    • One member stated that if lack of CUDA knowledge is the only thing holding you back, you can learn it really fast as a language, but finding what to write is the real problem.
  • Diving Deep into RTX 4090 FP8 Implementation: A member, an ML Engineer, had to learn CUDA more deeply to test their ideas, bought an RTX 4090 for its FP8 support, and implemented flash attention for fp8 sm89.
    • They noted that there was no FP8 support in 2023 when they bought it, and applied it right after reading a paragraph in the docs.
  • Strategies for Optimal Code Structuring: A member sought advice on how to organize code, specifically when having custom code that is faster for 405b but slower for 8b models within a codebase similar to torchtitan.
    • The question posed was whether to use if-else branches in the model code, maintain two versions of the model code, or apply monkey-patching post-processing.
  • Mastering GPU Utilization via GEMM Implementation: A member suggested that implementing a GEMM that achieves over 80% of cuBLAS performance is a good way to utilize a GPU fully, and that GEMM is always useful because you can recast arbitrary tensor contractions as a GEMM through matricization.
    • They linked a classic blog post to guide others through achieving cuBLAS performance.
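
To make the matricization point concrete, here is a tiny NumPy sketch showing an arbitrary tensor contraction recast as a single GEMM; the shapes are illustrative:

```python
import numpy as np

# Contraction C[i,j,k] = sum_l A[i,j,l] * B[l,k]
A = np.random.rand(4, 5, 6)
B = np.random.rand(6, 7)

C_direct = np.einsum("ijl,lk->ijk", A, B)

# Matricization: flatten (i,j) into one axis, run a plain GEMM, restore the shape
C_gemm = (A.reshape(4 * 5, 6) @ B).reshape(4, 5, 7)

assert np.allclose(C_direct, C_gemm)
```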

GPU MODE ▷ #triton (1 messages):

Triton Developer Conference 2025, GPU MODE State of Triton, NVIDIA Blackwell GPU backend for Triton, Triton-distributed computation

  • Triton Developer Conference 2025 Approaching: The Triton Developer Conference 2025 is just weeks away, with opportunities to attend in-person and connect with Triton enthusiasts.
    • The conference aims to enable attendees to learn, share, and network with top leaders while discovering new advancements in the Triton community; register at aka.ms/tritonconference2025.
  • GPU MODE Discusses State of Triton: Mark Saroufim from Meta and GPU Mode will present on the current state of Triton at the upcoming conference.
    • Other talks include Phil Tillet and Thomas Raoux from OpenAI discussing Triton: Today and Beyond, alongside presentations from Meta, AMD, Nvidia, and Bytedance.
  • NVIDIA Blackwell GPU to get Triton Backend: Chris Sullivan from Nvidia will present on the NVIDIA Blackwell GPU backend for Triton.
    • This presentation will showcase how the new Blackwell GPU architecture integrates with Triton for optimized performance.
  • Triton-distributed tackles computation and communication: Wenlei Bao from Bytedance will discuss Triton-distributed: computation and communication overlapping in distributed LLM training and inference.
    • This talk will cover how Triton can be used to optimize distributed training and inference for large language models by overlapping computation and communication.

GPU MODE ▷ #cuda (2 messages):

Kernel Function Addresses, Tensor Core Evolution, Kimbo Chen

  • Kernel Addresses Exposed!: A blog post was shared about extracting addresses of kernel functions in host code, accessible at redplait.blogspot.com.
  • Kimbo Charts Tensor Core Timeline!: Kimbo Chen’s blog post on the evolution of NVIDIA Tensor Cores from Volta to Blackwell was highlighted, located at semianalysis.com.

GPU MODE ▷ #torch (1 messages):

Torch Dynamo Compile Times, Measuring impact of recompilations, Autotuning

  • Torch Dynamo: Decoding Compile Times: A member inquired how to interpret torch._dynamo.utils.compile_times() logs to better understand Dynamo’s performance.
    • Specifically, they sought methods to measure the impact of compile times, recompilations, and autotuning on a per-step basis without completely resetting the Dynamo cache, linking to a gist for context.
  • Dynamo Cache Reset Dilemma: The user wanted to measure the impact of compile times per step, without having to reset the dynamo cache.
    • This is in order to avoid the overhead of repeated recompilations and autotuning, but still understand the fine-grained performance implications.
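
For reference, a minimal way to surface those timings; the exact report format varies by PyTorch version, so treat this as a sketch:

```python
import torch
import torch._dynamo as dynamo

@torch.compile
def f(x):
    return torch.sin(x) + torch.cos(x)

f(torch.randn(8))    # first call triggers compilation
f(torch.randn(123))  # a new shape may trigger a recompile, depending on dynamic-shape settings

# Aggregated per-phase compile-time report collected by Dynamo
print(dynamo.utils.compile_times(repr="str", aggregate=True))
```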

Determinism, NVIDIA Determinism, Deep Learning Determinism, LLM inference determinism

  • Determinism References Dropped: A member posted a link to a NVIDIA presentation on determinism in deep learning from 2019.
    • The member also posted links to other academic papers and articles on the topic of determinism, including one on defeating nondeterminism in LLM inference.
  • Nvidia’s CUDNN Backend Details Shared: A member shared a link to the NVIDIA cuDNN documentation detailing its backend.
    • The documentation may be useful for those looking to understand how cuDNN operates and how to leverage its features for deep learning tasks.
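
A minimal sketch of the standard PyTorch determinism setup those references discuss; exact flags vary by version and backend:

```python
import os
import random

import numpy as np
import torch

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required by some deterministic cuBLAS paths

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

torch.use_deterministic_algorithms(True)  # raise on ops lacking a deterministic implementation
torch.backends.cudnn.benchmark = False    # autotuning can pick different kernels run-to-run
```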

GPU MODE ▷ #beginner (5 messages):

Benchmarking Guides, benchmarking opinions, youtube benchmarking

  • Guide Guru sought for Benchmarking: A member asked for a good guide for benchmarking in the beginner channel.
  • Benchmarking insights: A member with lots of opinions shared insights.
    • They claimed it was their best benchmarking effort.

GPU MODE ▷ #pmpp-book (4 messages):

Fundamentals Book, Hardware Instructions

  • Fundamentals Book Gives Solid Foundation: A member shared a link to a fundamentals book, noting it’s beneficial for understanding concepts like memory fences.
    • They added that once you have a solid foundation, “hardware specific instructions or all the fancy new mma instructions can be understood p quickly”.
  • Practice Makes Perfect, Not Just Knowledge: A member realized that doing something for a job or research differs from learning something quickly.
    • They suggested the former is “mostly practice and developing ‘muscle memory’ rather than knowing things”.

GPU MODE ▷ #rocm (1 messages):

AMD core file analysis, rocgdb debugging

  • Debugging AMD core dumps with rocgdb: A member inquired about how to analyze AMD core files, such as gpucore.219213.
    • They attempted to use rocgdb --core=gpucore.71076 but received a warning: Couldn’t find general-purpose registers in core file.
  • Core file analysis alternatives?: The member seeks alternative methods for analyzing the core file.
    • They are experiencing issues with rocgdb and are looking for suggestions.

GPU MODE ▷ #metal (1 messages):

bghira: now we just have to wait for a sub-685B model to run it with..


GPU MODE ▷ #self-promotion (1 messages):

Free AI learning, Free CS Learning, Open Source Educational Resources

  • Freelunch AI Serves Free Education: A member shared a link to a GitHub repository, Freelunch-AI/lunch-stem, describing it as the best place to learn AI & CS for free.
    • The repository is positioned as an open-source educational resource.
  • Open Source STEM Education Initiative Launched: The GitHub repository Freelunch-AI/lunch-stem aims to provide a platform for free learning in AI and Computer Science (CS).
    • It emphasizes open-source contributions to make quality education accessible.

GPU MODE ▷ #edge (1 messages):

Radiation Shielding, Orin Radiation Testing, Hardware Watchdogs

  • Low Orbit provides Radiation Shielding: Being in Low Earth Orbit provides less radiation, and the satellite chassis offers natural shielding.
  • Radiation Testing Safeguards Orin: Radiation testing has been conducted on the Orin to prevent radiation-induced latchups, which are runaway electrical charges that can damage electronics.
  • Hardware Watchdogs to the rescue: Radiation may freeze the device, but hardware watchdogs are in place to reset it.

GPU MODE ▷ #submissions (6 messages):

MI300x8, amd-gemm-rs Leaderboard, amd-ag-gemm Leaderboard

  • MI300x8 Personal Best Achieved: A member achieved a personal best of 593 ”s on MI300x8 on the amd-gemm-rs leaderboard.
  • amd-gemm-rs sees MI300x8 success: A member was successful on MI300x8 with a time of 608 ”s on the amd-gemm-rs leaderboard.
    • Another submission reached 27.7 ms.
  • amd-ag-gemm Leaderboard entries: A member successfully submitted three times on MI300x8 to the amd-ag-gemm leaderboard.
    • The submissions timings were 986 ”s, 534 ”s, and 884 ”s.

GPU MODE ▷ #status (2 messages):

Triangle Multiplicative Update

  • Triangle Multiplicative Update writeups and code soon to come: The Triangle Multiplicative Update competition has ended and the writeups and code from winning submissions will be released soon.
  • altzhang confirms release: altzhang confirmed that the writeups and code would be released soon.

GPU MODE ▷ #factorio-learning-env (4 messages):

Debugging factorio-learning-env, Google Meet Link

  • Google Meet Hangout: A member shared a Google Meet link to a debugging session.
    • The member said they’d be in the session for a while and others are welcome to join.
  • Debugging factorio-learning-env: A member mentioned that they are investigating factorio-learning-env and will join after dinner.
    • They are debugging an issue that could be related to the tools involved, with the cause still undetermined.

GPU MODE ▷ #cutlass (34 messagesđŸ”„):

cute.nvgpu.warp.MmaF16BF16Op documentation, TiledMMA broadcasting, Distributed GEMM in CuTe, UMMA tensor core

  • wmma Documentation Debacle: Members questioned why cute.nvgpu.warp.MmaF16BF16Op documentation references mma instead of wmma and asserts mma shapes, leading to the clarification that wmma was a failed abstraction.
    ‱ Another member explained that UMMA is just our nickname for the Blackwell tensor core. Hopper was GMMA. Ampere was HMMA. Don’t read much into it; we and everyone should now just call it tcgen05, and we should delete all references to UMMA.
  • TiledMMA to broadcast, or not to broadcast?: It was confirmed that TiledMMA automatically “broadcasts” across K when used with cute.gemm if created with atom_layout_mnk and k > 1, as seen in the backend implementation.
    • This reuses the layout across K.
  • Distributed GEMM can be implemented in CuTe DSL: Despite a member’s belief that there are no distributed primitives in CuTe DSL yet, it was asserted that one can implement that dist GEMM in CuTe DSL today just fine, using symmetric memory, and that NVL load/store semantics are transparent to the kernel.
  • Study CuTe’s Source Code to Grasp Gemm: A member new to CuTe was encouraged to view the source code instead of looking at impls, saying it’s pretty easy to read and much more instructive than the examples, specifically mentioning the GEMM files.
    • Additionally, they advised the member to practice with simpler examples, print layouts, and change parameters to see how they affect things.

GPU MODE ▷ #general (3 messages):

TriMul Competition, GPGPU solution, Operator Fusion, A100, MI300

  • TriMul Competition a Hit: A member thanked Alex and the team for the TriMul competition, noting it was fun from an operator fusion point of view and went beyond just GEMMs.
    • The same member achieved the preliminary fastest solutions on A100 and MI300.
  • GPGPU Solution Write-Up: A member wrote a blog post about a general GPGPU solution at arseniivanov.github.io/blog.html, including code and kernels on GitHub.
    • The member noted they saw someone ask for this in the status channel.
  • TriMul Swag Ready: Alex and another member indicated that the TriMul competition swag is ready.
    • They mentioned they need to finish grading.

GPU MODE ▷ #multi-gpu (19 messagesđŸ”„):

NVLink Multicast, Multimem Instructions, NVShmem Wrappers, Peer GPU L2 Cache, NVLink SHARP

  • NVLink SHARP Multicast Memory: When multiple SMs try to access the same location on a peer GPU over NVLink, it doesn’t get broadcasted in hardware by default, unless you use multicast pointers and multimem instructions.
    • A member noted that implementing dynamic coalescing for multicast would be microarchitecturally difficult, so you’d want to implement that as some global-to-global memory transfer over NVLink, followed by local splitting or loading to shared memory.
  • Multimem Instructions for Multi-Device Requests: ld_reduce multimem operation for loading and reducing over many peer GPUs can be useful if you’re asking about one GPU loading many times to the same peer address.
    • Multimem multicast requires symmetric memory addressing, according to the discussion.
  • NVShmem Wrappers for User-Friendly Multicast: NVShmem gives you nice user-friendly wrappers for multicast behavior, according to one of the members.
    • The member responsible for NVShmem gave a shameless plug.
  • Peer GPU L2 Cache Involvement: Memory access goes through the peer’s L2 cache if you don’t do anything special, such as volatile load, according to a member and hazyresearch’s post.
    • One member wondered if the xbar on the current GPU broadcasts to all SMs which requested some memory range.

GPU MODE ▷ #llmq (2 messages):

Mega Kernel Projects, 20B Model Training, Gradient Norm Concerns

  • Mega Kernel Projects Spark Interest: Members find the Mega Kernel Projects very cool, noting their different style of work.
    • One member expressed uncertainty about how much of the work was done in the open.
  • 20B Model Finishes Training: A member reported that their 20B run finished in approximately 90 hours on 4x4090, achieving a final loss of 2.45 compared to 2.52 for the 10B model.
    • The member noted accidentally overwriting the last bit of the log file, causing the grey curve to end prematurely.
  • Gradient Norm Slope Increases, Causes Worry: The gradient norm has an increasing slope, potentially causing problems for longer runs, according to a member’s observations during the 20B model training.
    • They remembered that Karpathy had his runs explode sometime after 300B, even though they looked fine before.
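
For catching that kind of drift early, logging the pre-clip global gradient norm each step is cheap; a minimal PyTorch sketch, with function and argument names that are illustrative rather than taken from the member’s code.

```python
import torch

def training_step(model, batch_loss, optimizer, max_norm=1.0):
    """One optimizer step that also returns the pre-clip gradient norm."""
    optimizer.zero_grad(set_to_none=True)
    batch_loss.backward()
    # clip_grad_norm_ returns the total norm computed *before* clipping,
    # which is the quantity whose rising slope was the concern here.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return float(total_norm)
```

Plotting that scalar over training steps makes a creeping slope visible long before it turns into a loss spike.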

Nous Research AI ▷ #general (80 messagesđŸ”„đŸ”„):

Sequence expansion transformers, GLM 4.5 vs GLM 4.6, DDR5 RAM, Sora 2, Nous Chat Web

  • Sequence Expansion Transformers Seek Community Input: A member is currently working on sequence expansion transformers (SEX Transformers) and inquired about adding the project to the “community projects” section.
    • Another member joked about whether to make the arrows red in a UI, saying shall i make the arrows red or can you see.
  ‱ Teknium Demos Sora 2, Eyes Local Big-Model Run: Teknium shared their first Sora 2 video and mentioned they might try running a 333B model on their 256GB RAM CPU setup.
    • However, it was quickly pointed out that the model is actually 357B according to the HF page.
  • Exploring GLM Model Performance on CPU: Members discussed GLM 4.5 and 4.6 model sizes, with GLM 4.5 Air (100b) having 12B active parameters, and speculated on performance compared to Qwen 2 32b on CPU.
    ‱ One member stated they were able to get 4 tok/s on qwen 2 32b on their CPU with RAM; a back-of-envelope bandwidth estimate of such speeds is sketched after this list.
  • Upgrading RAM: A Deep Dive into DDR5 and Memory Controllers: A discussion ensued regarding upgrading to 256GB DDR5 RAM, with concerns about potential clock downgrades and the limitations of Ryzen 7000 series exceeding 5200MHz with such configurations.
    • One member shared they spilled soda on their computer and upgraded to a 9950x ryzen which supports 256GB now.
  • Sora 2: Region Locked and Trump Meme Resistant: Members noted that OpenAI’s Sora 2 seems to have a region-based rollout requiring VPNs, and humorously pointed out its inability to generate Trump memes.
    • Despite the limitations, Sora 2 is considered the best video generation model currently available.
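
As referenced above, CPU decode speed is usually memory-bandwidth-bound: every generated token streams the active weights from RAM once. A back-of-envelope sketch, where the bandwidth and quantization figures are assumptions:

```python
# tok/s ~ RAM bandwidth / bytes of active weights read per token.
def est_tok_per_s(bandwidth_gbs: float, active_params_b: float, bits: int) -> float:
    return bandwidth_gbs / (active_params_b * bits / 8)

# Assumptions: ~60 GB/s dual-channel DDR5, 4-bit quantized weights.
print(est_tok_per_s(60, 32, 4))  # dense 32B -> ~3.8 tok/s, near the 4 tok/s report
print(est_tok_per_s(60, 12, 4))  # 12B active (GLM 4.5 Air) -> ~10 tok/s
```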

Nous Research AI ▷ #ask-about-llms (2 messages):

LLMs with Lower Cosine Similarity, GPT-OSS-120B

  • Cosine Similarity in LLMs Decoded: A member inquired about a preference towards LLMs with lower cosine similarity, such as gpt-oss-120b.
    • Another member clarified that cosine similarity measures how similar the chains of thought (CoT) are in LLMs, where high similarity may indicate training on similar data.
  • Cosine Similarity measures CoT similarity: Cosine Similarity measures how similar the Chains of Thought are.
    ‱ If LLMs have a high cosine similarity, then their output is very similar, which could point to the fact that they have been trained on each other.
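
Concretely, the similarity is computed over vector embeddings of the two chains of thought; a minimal sketch with toy vectors standing in for real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for embeddings of two models' CoT traces; in practice
# these would come from a text-embedding model over the full traces.
cot_a = np.array([0.9, 0.1, 0.4])
cot_b = np.array([0.8, 0.2, 0.5])
print(cosine_similarity(cot_a, cot_b))  # near 1.0 -> very similar reasoning
```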

Nous Research AI ▷ #research-papers (2 messages):

Aigo.ai, Symbolica, Augmented Intelligence (AUI), After Thought (Stealth), Symbolic Architectures

  • VC Seeks AGI via Symbolic Architectures: A VC investor is exploring symbolic architectures like Aigo.ai, Symbolica, AUI (Augmented Intelligence), and After Thought (Stealth) in their quest for AGI.
    ‱ They consider Aigo.ai the most promising, and asked what requirements one must have to work on it.
  • Requirements for Working on Symbolic Architectures like Aigo.ai: A VC investor expressed interest in understanding the necessary qualifications and skills for contributing to symbolic architecture projects, particularly Aigo.ai.
    • This inquiry stems from their investment search for AGI solutions and assessment of various symbolic AI approaches.

DeepSeek Sparse Attention, Mamba selectivity

  ‱ DeepSeek dives into Sparse Attention: The new DeepSeek V3.2 model uses DeepSeek Sparse Attention (DSA), described as a Mamba-selectivity-like mechanism acting as a sub-attention block.
    • The post speculates on how this might work.
  • Mamba merges with DeepSeek: DeepSeek’s Sparse Attention (DSA) is experimenting with incorporating a Mamba selectivity-like mechanism.
    • This could potentially enhance the model’s ability to focus on relevant information and improve performance.
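
The mechanism is only speculated about above, so the following is a toy sketch of the generic index-then-attend pattern such a sub-attention block could take: score all keys cheaply, keep the top-k per query, and attend only to those. Everything here (the scoring pass, the top-k size, the shapes) is illustrative and not DeepSeek’s actual DSA.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, keep=4):
    """Toy index-then-attend. q: (Tq, d); k, v: (Tk, d). Illustrative only."""
    scores = (q @ k.T) / k.shape[-1] ** 0.5        # cheap stand-in for an indexer
    keep = min(keep, k.shape[0])
    idx = scores.topk(keep, dim=-1).indices        # (Tq, keep) selected positions
    weights = F.softmax(scores.gather(-1, idx), dim=-1)
    return torch.einsum("qt,qtd->qd", weights, v[idx])

q, k, v = (torch.randn(8, 16) for _ in range(3))
print(topk_sparse_attention(q, k, v).shape)  # torch.Size([8, 16])
```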

Yannick Kilcher ▷ #general (14 messagesđŸ”„):

Thematic Roles, Detecting AI Generated Videos, Path patching

  • Thematic Roles in LLMs spark debate: A member shared a Wikipedia link on thematic roles and mentioned Dowty’s use of “proto(typical) roles”.
    • Another member suggested that LLMs might have such structures, referencing work on Othello where board representations emerged in an LLM trained on move sequences.
  • Peering into Pixels Prevents Perfect Phakes?: After a user asked about ways to detect AI-generated videos like those from Sora 2, a member suggested that pixel peeping still works pretty reliably.
    • Another member commented that ideally, detecting AI-generated content shouldn’t be possible, but believes that people will get used to detecting them even by vibes alone.
  • Path Patching Promises Precise Probes: A member expressed being not sold on transcoders yet, suggesting that path patching could help make stronger claims when probing LLMs.
    • In response to a request for recommended papers, they cited the indirect object identification paper as a classic in the field.

Yannick Kilcher ▷ #paper-discussion (6 messages):

Latent-Reasoning Survey, VLM Circuits Analysis, Reasoning with RL on Pretraining Data

  • Latent-Reasoning Chat Postponed: The discussion of the Latent-Reasoning survey on Thursday has been cancelled.
    • The original plan was to continue from the last paragraph on p. 11 to the end of section 3.1.
  • VLM Circuits Analysis: A discussion was planned for VLM Circuits Analysis and its project page.
    • The discussion was scheduled for today.
  • Reasoning with RL Papers Circulating: A member shared a paper about Reasoning with RL on pretraining data from NVIDIA.
    ‱ This paper was described as more complicated than another paper with similar gains, despite not explicitly invoking “reasoning” with anything except RL on next-sentence prediction.

Yannick Kilcher ▷ #ml-news (9 messagesđŸ”„):

Sora 1 vs Sora 2, ByteDance and Tencent vs Sora, AlphaEvolve LLM, Sam Altman unnatural movement, WanLynx alternative to Sam Altman

  • Sora 2 has better Physics: Members in the channel mentioned that the quality of Sora 2 is way better than Sora 1, because Sora 1 couldn’t do physics and had obvious visual artifacts.
  • Google Advances Computer Science with AlphaEvolve: Google uses AlphaEvolve, an LLM-based coding agent, to find and verify combinatorial structures to improve results on the hardness of approximately solving certain optimization problems, as described in their recent blog post.
  ‱ Sam Altman’s movements are awkward: A member stated that Sam Altman really has awkward unnatural movement/posture, and that it is worse than ByteDance’s or Tencent’s models in terms of human movements.
  • WanLynx for Sam Altman: One member joked that maybe they should have used WanLynx instead of Sam Altman for that part.
  • fixupx.com shares content: A member shared a link to a post on fixupx.com.

aider (Paul Gauthier) ▷ #general (24 messagesđŸ”„):

aider MCP, aider forks with MCP support, local LLMs with aider, LM Studio issues, Qwen coder 30b and devstral 24b

  ‱ MCP Use with Aider Questioned: A user asked how to use MCP (Model Context Protocol) with aider, given its high cost and rate limits on commercial platforms and its importance for frontend development.
    • They noted that goose/claude/gemini-cli all have MCPs, and browser MCP is crucial for frontend development, also inquiring about yolo mode on aider.
  • Aider Forks add MCP: The official aider doesn’t support MCP, but some forks do, such as aider-ce.
    • The thread asks about others being used.
  ‱ Aider Works with Local LLMs: A user reported that aider doesn’t work with local LLMs like devstral or qwen coder: the models have tool-use capabilities but fail to read or save files when run via LM Studio.
    ‱ Another user clarified that aider doesn’t use tools; instead it includes the files in the user prompt so they become part of the context, and it works with local models if they’re smart enough, like gpt-oss.
  • LM Studio Has Some Issues: Some users have experienced issues with LM Studio, especially with the mlx backend, suggesting that llama.cpp seems to be better.
    ‱ One user recommends passing the --jinja parameter; another suggests switching to ollama.
  • Is Qwen Coder 30b Smart Enough?: One user wonders if Qwen coder 30b and devstral 24b models are smart enough to work with aider.
    • These models do not have file access or tool use.

aider (Paul Gauthier) ▷ #questions-and-tips (4 messages):

Apriel-1.5-15b, llama.cpp support, koboldcpp output templates, DeepWiki page

  • Apriel-1.5-15b Debut Sparks Inquiry: A member inquired about using the newly released Apriel-1.5-15b model with llama.cpp, noting it includes its thinking output into the main reply.
    • They asked if llama.cpp needs additional support for this model or if there’s a configuration to separate the thinking output, similar to gpt-oss-20b.
  • DeepWiki Dive Requested for Model Engineering: A member suggested checking out a DeepWiki page related to reverse engineering.
    • The suggestion was in response to a query about the Apriel-1.5-15b model.
  • koboldcpp Output Solutions: A member mentioned that koboldcpp uses output templates or post-processing to manage output.
    • They speculated that llama.cpp likely has similar capabilities, even if not immediately apparent.
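
Pending native handling in llama.cpp, koboldcpp-style post-processing is easy to replicate client-side; a minimal sketch that assumes the model wraps its reasoning in <think>
</think> tags, which is an assumption about Apriel’s output format rather than a confirmed detail.

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def split_thinking(raw: str) -> tuple[str, str]:
    """Split a raw completion into (thinking, reply); the tag name is assumed."""
    thinking = "\n".join(m.group(0) for m in THINK_RE.finditer(raw))
    reply = THINK_RE.sub("", raw).strip()
    return thinking, reply

raw = "<think>plan the answer first</think>Here is the final reply."
_, reply = split_thinking(raw)
print(reply)  # "Here is the final reply."
```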

Manus.im Discord ▷ #general (27 messagesđŸ”„):

Manus stuck in loop, Aseel-Manus Memory Key Protocol, AI automation agencies, Manus credit usage, Sora invite code

  ‱ Manus Glitch Alert: Users Report ‘Stuck in Loop’ Error: Users are reporting errors with Manus getting stuck in a loop, generating an Internal server error with code 10091; help tickets and public project URLs have been submitted with no response from the Manus team.
    ‱ The user asked, “What steps do I take towards resolution of this impasse?”
  • Memory Breakthrough: Aseel-Manus Memory Key Protocol Unveiled: A user has developed the Aseel-Manus Memory Key Protocol, a model-agnostic framework for persistent, user-owned AI memory, allowing for flawless session continuity by serializing the agent’s cognitive context into a secure and portable key.
    ‱ The protocol has been validated on both Manus and Google’s Gemini, utilizing robust encryption standards for user privacy; the user asks that the Manus platform natively solve the AI memory problem, not as a patch, but as a core philosophical feature. A generic sketch of the serialize-encrypt-restore pattern appears after this list.
  • AI Automation Agencies: Navigating the Implementation Landscape: A user sought guidance on legitimate AI automation agencies, to which another user breaks them down into three models: Orchestrators (no-code/low-code), Architects (custom Python, LangChain/LlamaIndex, and vector DBs), and Integrators (managing enterprise-level compliance).
    • A key vetting tip is to ask potential agencies how they handle state management and error recovery in their automations.
  • Credit Crunch: Manus’s High Credit Consumption Under Scrutiny: A user complained about Manus consuming over 5300 credits for a basic research task involving x-rays and spreadsheet creation.
    • A team member requested the session link to investigate the excessive credit usage and consider a potential refund.
  • Sora’s Spotlight: Invite Code Giveaway!: A user is offering Sora invite codes to anyone who DMs them.
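
The Aseel-Manus protocol itself isn’t public; as a generic illustration of the serialize-encrypt-restore pattern described above, here is a minimal sketch using the cryptography package. The state schema and every field name are invented for illustration.

```python
import json
from cryptography.fernet import Fernet

# Invented session state; the real protocol's schema is not public.
state = {"persona": "assistant", "facts": ["prefers metric units"], "turn": 42}

key = Fernet.generate_key()                              # user-owned, stored client-side
token = Fernet(key).encrypt(json.dumps(state).encode())  # the portable "memory key"

# Later, in a fresh session (possibly on a different platform):
restored = json.loads(Fernet(key).decrypt(token))
assert restored == state
```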

Eleuther ▷ #general (1 messages):

AI Researcher, Multi Agent Systems, MoE, RL, AI for Software Engineering

  • Aspiring AI Researcher Enters the Chat: A new member, Aarya, introduced themselves as an aspiring AI researcher with experience in traditional machine learning, multi-agent systems, and MoE.
    • Aarya is currently learning more about RL and is interested in AI for software engineering and applied AI to real-world applications.
  • New member with varied interests: Aarya expressed interest in learning RL, and has experience in traditional machine learning.
    • They mentioned interests in AI for software engineering.

Eleuther ▷ #research (16 messagesđŸ”„):

Adaptive Searching Improvements, Sparse Attention in DeepSeek Model, Meta-Studies in AI Research, Benchmark Usage Analysis, Implicit Values in ML Research

  • Adaptive Searching Boosts Performance: A member reported improvements by adaptively doing additional searching per query/head/layer, capped logarithmically, with promising cosine similarity in the final output after attention and W_o.
  • DeepSeek’s Sparse Attention Sparks Interest: Discussion was initiated around the sparse attention mechanism in the new DeepSeek model, with a link provided to the FlashMLA pull request.
    • A member remarked, Sparse attention in the new DeepSeek model sounds really interesting to me.
  • Quest for Meta-Studies in AI: A member inquired about meta-studies that analyze past research, questioning the impact of abundant work on progress and seeking papers offering insights beyond personal taste.
  • Benchmark Usage Under the Microscope: A member shared findings on benchmark usage, noting that less than 50% of BBH citations are from papers that actually evaluate on it, whereas over 70% of papers citing the evaluation harness use it.
    ‱ They added that the official implementation is the leftmost column; the first 50 papers they looked at citing BigBench had a 0% use rate.
  • ML Research Implicit Values: A link was shared to a paper discussing implicit values in machine learning research, highlighting improvement on state-of-the-art results and building on previous work as key drivers.
    • The specific paper emphasizes the field’s focus on advancing benchmarks and incrementally expanding existing knowledge.

Eleuther ▷ #lm-thunderdome (2 messages):

GSM8k benchmark, RWKV7Qwen3Hybrid5NoPE-8B-251001 performance, Llama-3-8B evaluation

  ‱ RWKV7Qwen3Hybrid5NoPE-8B-251001 Faces GSM8k Struggles: A member reported facing issues with SmerkyG/RWKV7Qwen3Hybrid5NoPE-8B-251001 on the GSM8k benchmark, sharing an image of the results.
    • The same member noted that the scores were quite low when testing with llama-3-8B.
  • Llama-3-8B also has low scores: The original poster also tested Llama-3-8B on the same GSM8k benchmark.
    • The user noted that the scores were also very low with Llama-3-8B.
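
For reference, this kind of run can be reproduced through the harness’s Python entry point; a minimal sketch where the model path, dtype, and batch size are placeholders rather than the member’s actual settings.

```python
import lm_eval

# Placeholder checkpoint; swap in the model under test.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B,dtype=bfloat16",
    tasks=["gsm8k"],
    batch_size=8,
)
print(results["results"]["gsm8k"])  # per-metric scores for the task
```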

Modular (Mojo đŸ”„) ▷ #mojo (10 messagesđŸ”„):

Windows Release, Mojo Notebook Support, Stack Overflow Survey

  • Windows Release Roadmap Remains Hazy: A member inquired about the Windows release roadmap, but was told that it is not a priority for Modular right now, and that Windows support will likely be a community contribution after the compiler is open sourced.
    ‱ Another member suggested that Windows support will come after the compiler is open sourced, but noted that many GPUs or accelerators will never be available on Windows due to a lack of vendor support, and that we’re also about a month from waiting out Windows 10, which lets us assume Windows means >= AVX2, which is actually hugely beneficial.
  • Mojo Joins Stack Overflow Survey: Mojo made its debut in the Stack Overflow developer survey, landing in last place, according to the survey results.
    • Despite the low ranking, the community celebrates Mojo’s inclusion in the survey.
  ‱ Notebook Support Still Needed: Members are requesting notebook support (syntax highlighting in Jupyter Notebook and a Max kernel in upstream Jupyter), since a notebook is the common way to showcase an idea in ML fields: it is hard to annotate code with math formulas or supply a graph description otherwise.
    • A member asked for clarification on whether the request is for being able to interact with Max from a notebook or directly author and run Mojo within notebooks.

Modular (Mojo đŸ”„) ▷ #max (5 messages):

MAX on Cerebras, Mojo X Max AI Projects

  • MAX Struggles on Cerebras: A member inquired about running a MAX implemented AI model on Cerebras and comparing it to their benchmarks to see if it would be faster.
    • Another member clarified that MAX doesn’t currently run on Cerebras, stating that it would likely be slower because there aren’t optimizations for dataflow chips like Cerebras.
  • Mojo X Max AI Projects Don’t Get Free Boost on Cerebras: A member asked if using Mojo X Max for AI projects would provide a free performance boost on AI chips like Cerebras chips when compared to current SOTA GPUs.
    • A member clarified that MAX doesn’t currently run on Cerebras, and there are no optimizations for dataflow chips like Cerebras.

Moonshot AI (Kimi K-2) ▷ #general-chat (13 messagesđŸ”„):

OpenAI's TikTok Ad, Chinese Model Censorship, API Version Comparisons

  • OpenAI Teases with TikTok Ad: OpenAI released a TikTok ad showcasing a prompt box, prompting excitement from investors.
    • Some users labeled it as slop.
  • Chinese Models Hide Sensitive Subjects?: A user inquired about what topics Chinese models are post-trained not to discuss.
    ‱ Specifically, they noted that some things about India cannot be spoken.
  • API Version Faces off with US Models: A member claimed that the API version of some models is pretty open.
    ‱ The member clarified that there are no censors aside from the less Western-biased data inside the models compared to US models.

DSPy ▷ #general (10 messagesđŸ”„):

LiteLLM vs DSPy, Prompt Engineering, LLM Caching

  • LiteLLM and DSPy Duel for App Development: A member inquired about choosing LiteLLM or DSPy for a new LLM-centric application, and another member noted that DSPy uses LiteLLM as a gateway.
    • The original poster felt that LiteLLM imposes too little structure, serving merely as an interface to call LLMs, while DSPy imposes a specific way of thinking about the problem.
  • DSPy Structure Not So Imposed?: A member argued that one can pass messages directly to a dspy.LM, suggesting that DSPy’s structure isn’t necessarily imposed.
    ‱ They also noted that not all LiteLLM features are currently utilized and that some building is required to understand how to achieve desired outcomes; a sketch of both calling styles appears after this list.
  ‱ Caching hinges on prompt and file order: A user asked how to manipulate the order of the content parts of the JSON request when leveraging caching.
    • This implies that the order in which prompts and files are sent is crucial for effective caching.
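
To make the contrast concrete, a minimal sketch of both calling styles against the same dspy.LM; the model string is a placeholder.

```python
import dspy

lm = dspy.LM("openai/gpt-4o-mini")  # LiteLLM-style model string; placeholder

# Raw usage: pass messages straight through, with no imposed structure.
raw = lm(messages=[{"role": "user", "content": "Say hi in one word."}])
print(raw[0])

# Structured usage: a signature declares typed inputs and outputs.
dspy.configure(lm=lm)
qa = dspy.Predict("question -> answer")
print(qa(question="What gateway does DSPy use for LLM calls?").answer)
```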

tinygrad (George Hotz) ▷ #general (3 messages):

CLSPV Crashes, Shape Tracker Bounty, ShapeTracker Deletion

  • CLSPV still crashing, but mostly passes: A member is still experiencing crashes while running tests with CLSPV, but reports it passes most of the tests.
    ‱ Users with x86_64 Linux systems can try it themselves by installing the fork with pip install git+https://github.com/softcookiepp/tinygrad.git; a quick smoke test is sketched below.
  • Shape Tracker Bounty: A member inquired about the status of the Shape Tracker Lean Prove Bounty.
    • It was further discussed that ShapeTracker will be deleted soon.
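
A quick smoke test after installing the fork, as mentioned above; shapes are arbitrary, and device selection follows tinygrad’s usual environment-variable convention.

```python
from tinygrad import Tensor

# One matmul, realized and pulled back to numpy, to confirm the backend works.
a, b = Tensor.rand(64, 64), Tensor.rand(64, 64)
print((a @ b).numpy().shape)  # (64, 64)
```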