Codex is all you need?
AI News for 9/12/2025-9/15/2025. We checked 12 subreddits, 544 Twitters and 23 Discords (192 channels, and 11857 messages) for you. Estimated reading time saved (at 200wpm): 1016 minutes. Our new website is now up with full metadata search and beautiful vibe coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on @smol_ai!
Just like we covered the quiet rise of Claude Code in June, today is one of those days that ordinarily wouldn't quite qualify for a title story, but the cumulative impact of a month's worth of rising sentiment for GPT-5 and OpenAI's Codex (an answer to Claude Code, but with a lot more breadth) is worth flagging, and it is given extra juice by today's release from OpenAI. This is best covered in our sister publication. If you were a heavy Codex user, note also the pitfalls flagged in the Discord section.
AI Twitter Recap
OpenAI's GPT-5-Codex and the agentic coding race
- OpenAI ships GPT-5-Codex (agentic coding): OpenAI released a GPT-5 variant optimized for long-running, tool-using software engineering across the Codex CLI, IDE extension, web, GitHub code reviews, and ChatGPT iOS. Highlights: dynamic "task-adaptive" thinking (15x faster on easy tasks, 2x more deliberate on hard ones), multi-hour autonomy (">7 hours" on complex tasks), improved instruction-following and code quality, and better SWE-bench-style performance. OpenAI also referenced an unreleased large "refactor" benchmark where GPT-5-Codex reaches 51% accuracy and indicated SWE-bench fixes for apples-to-apples comparisons. See announcements and discussion from @OpenAI, @gdb, @sama, @OpenAIDevs, @OfirPress, @swyx and routing/depth behavior notes ("router in the model") by @swyx. Early hands-on reports range from "more steerable and persistent" (@omarsar0) to frustration over token burn and long loops (#1, #2). OpenAI also teased deep OS integrations (e.g., Xcode sign-in for GPT-5) via @OpenAIDevs.
- Evals and coding depth: OpenAI claims SWE-bench improvements and a new internal "large refactor PR" eval; community called for public versions (@OfirPress). There's broad agreement that variable compute and routing are critical to efficiency and quality at inference time (@swyx; @polynoamial).
Qwen3-Next 80B (A3B MoE), long-context, and the China efficiency push
- Qwen3-Next-80B (3B active) lands on Together + NVIDIA NIM: Alibaba's hybrid MoE model targets long context (native 262k, extensible to 1M+), repository-scale code analysis, and efficient reasoning. Together AI provides "Instruct" and "Thinking" endpoints (launch, contexts), and NVIDIA added NIM support with CUDA-accelerated attention (NVIDIA). Alibaba reports strong performance "with only 3B active parameters" (@Alibaba_Qwen) and head-to-head results vs Gemini 2.5 Flash Thinking on reasoning benchmarks (@togethercompute). On-device MLX numbers show eye-catching TPS on Apple hardware (@ivanfioravanti, batching).
- Architecture trend: hybrid SSM + MoE: In the past two weeks, 6 of 7 new MLX-LM architectures are MoE, half hybridizing SSM/attention (@awnihannun, list). Context on China vs. US training regimes: constrained FLOPs are driving infra/model co-design, token efficiency, linear attention, and a focus on test-time scaling (@JingyuanLiu123). Community sentiment echoes that small models are increasingly capable, given the right recipes (@Thom_Wolf).
Tooling for agents: MCP everywhere, Claude Code SDK, and workflow "vibe coding"
- MCP consolidation: The Model Context Protocol's value prop (turning M×N tool integrations into M+N via MCP servers) continues to resonate (diagram). New OSS appears across the stack: DeepMCPAgent (LangChain/LangGraph-based MCP agents) (repo), Markdown MCP (@dariusemrani), and enterprise hackathon showcases (thread). LangChain shipped reactive agent examples (news curation, ParserGPT, human-in-the-loop for Deep Agents) (news agent, parser, HITL).
- Claude Code SDK adds agent ergonomics: Anthropic shipped code references, custom tools, and hooks support, making bespoke agents faster to build (@_catwu). Replit's Agent 3 (no-code "vibe" workflows) and Poke (iMessage agents orchestrating ephemeral subagents) show the "agent UX" frontier moving quickly (Replit demo, Poke deep dive).
RL for reasoning and agents: online RL in product, deep research agents, and new training regimes
- Online RL in production assistants: Cursor's rollout is widely cited as a first at scale for frontier capability, with enthusiasm around moving continuous training cycles from months → weeks → hours (@willdepue, follow-up). Strong interest persists in post-GRPO advances (@vikhyatk).
- Deep research agents (single-agent RL > complex scaffolds): A new study shows a simple RL recipe with length-normalized rewards and strategic tool limits can train single agents that rival multi-agent setups; test-time scaling also helps (parallel searches + pick the shortest successful trajectory) (summary, paper). A sketch of the reward shaping follows this list.
- HRL and decentralized RL: Meta's Scalable Option Learning re-architects hierarchical RL for GPU-parallel batch updates (25× training speedups) (explainer). Gensyn's SAPO shares rollouts in plaintext across a "swarm" of heterogeneous nodes (up to +94% cumulative reward) (@TheTuringPost). Tencent's SimpleVLA-RL scales VLA training via RL (paper).
- Long-horizon execution: Multiple analyses argue that small step-accuracy gains compound exponentially over long chains; many failures are execution (not reasoning) errors; "thinking" models reduce harmful self-conditioning (@HuggingPapers, @TheTuringPost, @emollick).
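As promised above, a sketch of what length normalization typically means in this setting; the paper's exact shaping may differ, so treat this as the general idea rather than its formula:

$$\tilde{r}(\tau) \;=\; \frac{r(\tau)}{|\tau|}$$

where $r(\tau)$ is the task reward and $|\tau|$ the trajectory length. Under this shaping, of two successful trajectories the shorter earns more credit, mirroring the test-time heuristic of running parallel searches and keeping the shortest successful one.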
Multimodal and computer-use models
- Holo1.5 for computer-use agents (open weights): H's new VLMs (3B, 7B Apache-2.0, 72B) set SOTA on UI localization and QA, core skills for reliable web/mobile use. Open weights, cookbook, and demos are available (launch, H company, cookbook).
- Tencent SRPO (diffusion RL for aesthetics/realism): "Self-Regulating Preference Optimization" fine-tunes FLUX.1-dev along the full denoising trajectory, boosting human-rated realism/aesthetics >3×; code and a Space are live and trending (overview, demo).
- MobileLLM-R1 (Meta) and on-device reasoning: Meta introduced small, from-scratch reasoning models (0.14B/0.35B/0.95B; ~4.2T pretraining tokens) with a 140M variant running fully in-browser (announce, demo).
- New datasets/benchmarks: SpatialVID (7k+ hours with dense 3D annotations) for spatial video intelligence (@HuggingPapers), and IntrEx (sequence-level interestingness labels in educational dialogues) (@HuggingPapers).
Systems and infra (throughput, routing, and deployment)
- Throughput milestones and platform support: Fireworks reported 540 tokens/s on GPT-OSS-120B running on B200, exceeding a leading ASIC in their test (@lqiao). vLLM 0.10.2 adds aarch64 support (install vLLM directly on GB200; multi-platform images) with more perf on the way (@vllm_project). Ray 2.49 introduced prefix cache-affinity routing to maintain KV-cache hit rates across large vLLM fleets (@seiji_________).
- Batching and fleets: Together released a revamped Batch Inference API (unified UI, all models, 3,000× higher rate limits at 30B tokens, and 50% discounts for most serverless models) (launch). Prime Intellect opened Reserved Instances for 8 to 1,000+ GPU clusters with secondary resale to spot markets (announce).
- Kernel and Apple-side speedups: Standard Kernel previewed minimal CUDA+PTX kernels surpassing cuBLAS/FlashAttention3 on targeted ops; a fused LLaMA3 FFN claimed 120% of PyTorch performance (@anneouyang). MLX continues to mature, with high-TPS batching on M3 Ultra and shorter full-suite eval times (TPS, MMLU-Pro runtime).
- Qwen as a deployable building block: NVIDIA added Qwen3-Next NIMs; Baseten and Together integrated the "Thinking"/"Instruct" variants for production use (NVIDIA, Baseten, Together).
Top tweets (by engagement, AI/engineering)
- "Calling today's chatbots 'PhD intelligences' is nonsense... True AGI won't make trivial mistakes... we're 5-10 years away." (Demis Hassabis, 5K+)
- rasbt's LLMs-from-scratch hits 10k forks (6K+)
- "i suspect society was better off with phone call culture than meeting culture." (@sama, 20K+)
- Gemini app tops the App Store in the U.S. (5K+)
- GPT-5-Codex launch by OpenAI (8K+) and @sama (10K+)
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
1. DIY 8x AMD MI50/MI60 Rig + Open-Source Mobile Agent AndroidWorld #1
- Completed 8xAMD MI50 - 256GB VRAM + 256GB RAM rig for $3k (Score: 429, Comments: 178): Built an 8× AMD MI50/MI60 (32 GB each) rig on an ASRock ROMED8-2T with an EPYC 7532 (32c) and 8×32 GB DDR4 (total 256 GB VRAM + 256 GB RAM) for ~$3k used; due to 300 mm risers, PCIe 4.0 was unstable, so all GPUs run at PCIe 3.0 x16 via bifurcation cards. Software: Ubuntu 24.04.3 + ROCm 6.4.3 with a manual workaround ("copy-paste gfx906 Tensile") to restore deprecated Vega20 (gfx906) support; inference via llama.cpp and vLLM. Benchmarks: CPU-only gpt-oss 120B Q8 (65 GB) ~25 t/s with ~120 t/s prompt; 2× MI50 ~58 t/s with ~750 t/s prompt on the same model; 8× MI50 on qwen3 235B Q4_1 ~21 t/s with ~350 t/s prompt (llama.cpp); 2× MI60 (vLLM, gfx906) on Llama 3.3 70B AWQ ~25 t/s with ~240 t/s prompt. Power: idle ~400 W (~20 W/GPU, ~15 W/blower, ~100 W platform); llama.cpp inference averages ~750 W with spikes to ~1100 W. Photos: top view, open-frame build. Top comments focus on the high idle draw (~400 W) and suggest switching from llama.cpp to vLLM to better utilize multi-GPU throughput on this setup.
  - Power/idle draw: Multiple commenters note the rig idles around ~400 W, with one observing that blower fans alone may draw ~15 W per card at idle, implying ~120 W of the idle budget could be fans. They ask what RPMs the blowers are running and suggest checking/controlling them via ROCm tools (e.g., rocm-smi --showfan --showtemp and setting curves) to validate and potentially reduce idle power; fan control behavior on MI50s can materially affect wall draw.
  - Inference stack: A suggestion to switch from llama.cpp to vLLM for this 8×MI50 setup, citing vLLM's server-oriented features like PagedAttention, continuous batching, and tensor-parallel support that typically improve throughput and GPU utilization for multi-GPU inference. vLLM has ROCm support and is generally better suited as a high-throughput inference server than llama.cpp on large KV-cache workloads (vLLM, llama.cpp).
  - Firmware/power tuning: One user recommends flashing the v420 VBIOS to MI50s, which sets a default power limit of 178 W and can be raised via rocm-smi if desired. With ROCm SMI, users can inspect and adjust per-GPU limits and fans (e.g., rocm-smi --showpowercap, --setpoweroverdrive, --setsclk, --setfan) to balance performance vs. thermals/power draw (ROCm SMI docs); a command sketch follows these two posts.
- Update: we got our revenge and now beat Deepmind, Microsoft, Zhipu AI and Alibaba (Score: 210, Comments: 61): An open-source mobile-app agent from Minitap AI reports a performance jump to #1 on the community-run AndroidWorld leaderboard, surpassing entries attributed to DeepMind, Microsoft Research, Zhipu AI, and Alibaba. The agent executes end-to-end tasks in Android UIs (e.g., ride booking, food ordering, app navigation) and the team notes ongoing work on an RL gym for fine-tuning; code is fully open-sourced at github.com/minitap-ai/mobile-use. Commenters question practical use cases (e.g., whether this is mostly QA/automation) and challenge the novelty, suggesting it may be a harness rather than substantive model advances; others express appreciation for the open-source release.
  - Several commenters argue the claim of "beating DeepMind/Microsoft/Zhipu/Alibaba" likely reflects a benchmark-specific evaluation harness rather than advances in model training or architecture. They note this is a wrapper-oriented approach (prompt engineering, routing, or heuristic logic) that can juice scores on a specific eval, making comparisons to full-stack research labs not apples-to-apples; the contribution seems like an evaluation/agent harness, not a new SOTA model.
  - There's a strong warning about reward hacking: targeting a public leaderboard encourages overfitting to metric quirks or dataset artifacts, inflating scores without real capability gains. Serious teams purportedly treat the LB as a sanity check and emphasize private holdout sets, cross-benchmark validation, and generalization tests; thus, any "win" should be verified on unseen tasks or private splits before drawing conclusions.
  - Potential practical use cases mentioned are QA pipelines and media-processing workflows, such as audio cleanup/denoising and automated image insertion from a specific directory with strict filename constraints. For these, robustness and reproducibility matter: deterministic batch processing, clear I/O contracts (file globbing, path validation, error handling), and configurable pipelines may be more impactful than leaderboard performance.
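As flagged above, a minimal sketch of the ROCm SMI checks and tweaks the commenters describe; flag spellings vary across rocm-smi releases, so confirm with rocm-smi --help before relying on them:

```bash
# Inspect fan speed, temperature, and current power draw for all GPUs
rocm-smi --showfan --showtemp --showpower

# Cap per-GPU power in watts; 178 W matches the v420 VBIOS default cited above
rocm-smi --setpoweroverdrive 178

# Set a lower fan level (0-255 scale on most cards) to probe idle wall draw
rocm-smi --setfan 64
```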
Less Technical AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo
TO BE COMPLETED
AI Discord Recap
A summary of Summaries of Summaries by gpt-5
1. Agentic Coding Upgrades & Workflows
- Codex Cranks Code Autonomy: OpenAI announced upgrades to GPT-5-Codex, a version of GPT-5 optimized for agentic coding, now available across the Codex CLI, IDE extension, web, mobile, and GitHub code reviews per Introducing upgrades to Codex. The release emphasizes deeper tool usage for code generation and review, expanding platform coverage for agentic coding tasks.
  - Developers celebrated broader availability while flagging reliability concerns in long tool chains; one report noted the --resume flag broke after the update, in a handy recap: GPT-5 Codex. Community chatter framed expectations as high but pragmatic, with one user lamenting it "would not let them restore their conversation" after upgrading.
- fastWorkflow Wallops Workflows: A new implementation of the fastWorkflow framework matched Claude Opus 4.1 on the Tau Bench dev set using DSPy for agents and parameter extraction, showcased in radiantlogicinc/fastworkflow. The demo used the repo's retail workflow example to structure multi-step tasks into reliable, testable pipelines.
  - Practitioners highlighted that reproducible workflows with typed signatures make agent behaviors more robust and comparable, noting this run "matches Claude Opus 4.1" on Tau Bench dev. The thread invited further experiments and extensions to push agent autonomy while maintaining evaluation discipline.
- Overclock Orchestrates Agents: A spotlight on agentic automation emphasized simplicity and strong model routing via Overclock Work. Participants framed it as a way to standardize execution around top-tier models, with a straightforward UX aimed at production workflows.
  - Observers suggested some organizations already invest heavily in agentic backends and would benefit from a simplified orchestration layer. The conversation focused on real-world deployment posture: prioritizing reliability, observability, and cost control for end-to-end agents.
2. Datasets & Personalizable Speech
- FinePDFs Feeds 3T Tokens: Hugging Face released the FinePDFs dataset with ~3 trillion tokens from 475 million documents in 1733 languages, sourced exclusively from PDFs: FinePDFs dataset. Guidance suggests keeping PDF data under 25% of a full mix, where combining PDFs with HTML corpora boosted benchmark performance.
- Builders called it a high-signal addition for pretraining and domain adaptation when mixed carefully with web data. The thread stressed data composition over raw volume, citing multi-format blends as key to strong generalization.
- OpenHelix Levels Up: A refreshed, higher-quality OpenHelix-5x50k dropped with improved split consistency and curation for training/eval: OpenHelix-5x50k. The update focuses on more reliable partitions to make comparisons and ablations cleaner.
- Users welcomed cleaner splits for repeatable experiments and dataset hygiene. The update addresses prior inconsistencies that complicated cross-run evaluation of finetuning and RAG systems.
- Voxtral Voices Victory: Voxtral enables fast personal speech finetuning for users with impediments/accents, costing about $0.26/hour on an A6000 and pairing with dataset tooling: VoxFactory (HF Space). After finetuning, you can publish the model and dataset and spin up a CPU demo Space thatās free to try.
  - Community feedback highlighted accessibility and zero-friction demos, celebrating that it "works with CPU!! Free!!". Builders framed it as a practical path to personalized TTS/ASR models with minimal infra.
3. Model Ecosystem: Mobile, Norms, Deprecations
- MobileLLM Marches On-Device: Facebook released MobileLLM-R1-950M to push more capable on-device language modeling: facebook/MobileLLM-R1-950M. The goal is to reduce dependence on cloud services while retaining enough reasoning capacity for useful local tasks.
  - Engineers see it as momentum for edge inferencing, where latency, privacy, and offline resilience matter. Conversations compared device footprints and practical app targets for sub-billion-parameter models.
- Qwen3-Next Norms Noted: The Qwen3-Next-80B-A3B-Instruct card clarifies it uses RMSNorm (zero-centered gamma; weight decay on the norm scale in training), not layernorm: Qwen3-Next-80B-A3B-Instruct. At inference it's plain RMSNorm, aligning with their reported stability tricks.
  - Readers appreciated the transparency on normalization particulars, given how norm choices impact training stability and throughput. The clarification resolves confusion from earlier wording and helps implementers mirror inference-time behavior faithfully. A sketch of the math follows this list.
- Grok 2 Sunsets, 3/4 Shine: xAI deprecated grok-2-1212 and grok-2-vision-1212, advising migrations to grok-3 (text) and grok-4 (vision): grok-2-1212 • grok-2-vision-1212 • grok-3 • grok-4. Teams should update integrations promptly to avoid breakage.
- Participants read this as an evolving model lifecycle strategy where deprecations tighten maintenance focus and push better defaults. Migration chatter centered on capability parity, vision needs, and rollout timing.
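Following up on the Qwen3-Next norm note: a sketch of what zero-centered gamma usually means here (our reading of the card, not an official formula) is that the learned scale is stored as an offset around zero and applied as 1 + gamma, so weight decay pulls the effective scale toward 1 rather than 0:

$$\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot (1 + \gamma)$$

At inference the (1 + gamma) term can be folded into a single scale vector, which is consistent with the card's claim that inference is plain RMSNorm.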
4. GPU Systems, Attention Kernels & Memory Models
- Metal MFA Bridges Go Multilingual: A cross-language bridge for Metal Flash Attention landed with C, Rust, and Obj-C bindings in universal-metal-flash-attention. The author added quantised attention with backprop, reporting speedups on large shapes and memory gains.
  - Framework authors discussed vectorizing causal masks and integrating with PyTorch custom ops for end-to-end pipelines. Early users framed it as a pragmatic path to Apple Silicon acceleration without giving up language flexibility.
- Flash Attention From First Principles: A tutorial series advanced Flash Attention internals with vectorized bank conflicts, swizzling, and common CUTLASS optimizations: Part 4 • Part 5. The write-ups walk through kernel-level reasoning to demystify performance tradeoffs.
  - Engineers praised the step-by-step derivations for lowering the barrier to bespoke kernels in production. The series encourages readers to profile, fuse, and tailor attention to their own shape and cache realities.
- Iris's Symmetric Memory Gets Real: The ROCm project Iris introduced a symmetric memory model with a global symmetric heap, simplifying address translation and paving the way for easier RDMA: ROCm/iris, and a companion talk: YouTube. The design slices tensors from a prebuilt heap so each rank tracks a single heap base pointer.
  - Kernel devs compared it to CUDA's symmetric memory, noting translation overheads and caching implications. The thread framed Iris as promising for distributed-training ergonomics and future multi-node acceleration.
5. Funding & Infra Debates
- Higgsfield Hauls a Hot $50M: AI video startup Higgsfield announced a $50M Series A led by GFT Ventures and claimed a $50M revenue run-rate with 4.5× growth in three months, while launching a fund for Gen Z founders: announcement thread. The plan includes Higgsfield Ventures to back AI-native teams.
  - Commenters called the pace aggressive and asked how quickly video models can translate into sticky revenue. The Gen Z focus targets founder-market fit in fast-iterating creative tooling.
- Poke.com Pitches a $15M Concierge: Poke.com launched an AI texting service alongside a $15M Series A led by General Catalyst: launch tweet. The product coordinates plans (get-togethers, dates, travel) by texting on your behalf.
  - Skeptics challenged long-term usefulness and tone control while praising the slick UX. The debate centered on retention, handoff quality, and making the AI feel human without overreaching.
- S3 Vectors vs. Vector DBs: A Zilliz analysis asked whether Amazon S3 Vectors will threaten or turbocharge vector databases: Will Amazon S3 Vectors Kill Vector Databases or Save Them?. The post cited a striking datapoint: a popular AI note-taking app spends twice as much on vector search as on OpenAI API calls.
  - Infra engineers debated cost/latency trade-offs from local NVMe to object storage, eyeing hybrid tiers and caching. Many argued the future is workload-aware placement rather than one-size-fits-all embeddings infra.
Discord: High level Discord summaries
Perplexity AI Discord
- Sonar AI Model Dazzles: Members find the Sonar AI model fast and accurate, with reasoning skills at 60-70% of Gemini 2.5 Pro's, and it's included with PPLX.
- One user found it bad, REAL BAD, while another touted its cheap API as a major draw.
- Grok Heavy Price Provokes Outcry: The value of Grok Heavy spurs debate, with one member dismissing it as shit and another labeling it Bad, REAL BAD.
- Suggestions point to its likely design for enterprise use, rather than individual consumers.
- GPT-5 Inspires Jailbreak Exploits: Enthusiasm swells for the potential of GPT-5, leading to jailbreaking experiments on Perplexity that uncovered 5 different methods for making molotov cocktails.
  - Observations hint that Perplexity's GPT-Image 1 might be routing through Gemini, suggesting potential model mix-ups.
- Perplexity iOS PDF Export Missing: Users voice frustration over the absence of a PDF export option on Perplexity AI for iOS, with one member suggesting the browser version as a temporary fix.
- A user stated that neither Android nor iOS has an export option, which sucks.
- Sonar API Pricing Structure Solidifies: Discussion clarifies that the Sonar API costs $5 a month for a set of API credits, offered freely with a Pro subscription.
- The Pro subscription includes $5 a month worth of free API credits.
LMArena Discord
- OpenAI's 4o goes MoE: Members shared that D33 works better on MoE models, with 4o being the first MoE by OpenAI.
- They also speculated that GPT5 is likely a smaller MoE model, but OpenAI changed it because it was hard to stabilize.
- RLHFās Downside: More Uncensored: It was mentioned that a downside of RLHF is that it increases uncensored behavior, creating potential legal issues for companies like OpenAI.
- One member joked that this is why Grok exists to free users from censorship, noting that Musk seems too involved after nerfing it for correcting him with scientific articles.
- DeepSeek Raptor Censors Taiwan: The new DeepSeek model (Raptor) was reported to censor questions about China and Taiwan.
- Members reported underwhelming performance compared to Qwen in the LMArena general channel.
- LongCat Swallows Books Whole: The LongCat model boasts a very large context window (128,000 tokens), capable of processing entire books in one pass.
- It can output up to 240 pages of text and members suggested testing it with a long document.
- Seedream-4 Enters LMArena: A new model, Seedream-4-high-res, was added to the LMArena platform, noted for its high resolution capabilities.
- LMArena is surveying user preferences to understand why users prefer specific versions of models and shared this survey.
Unsloth AI (Daniel Han) Discord
- Qwen3 Benchmarks Spark Debate: Enthusiasm surrounds Qwen3's performance, with some users claiming it feels just shy of GPT-5, while others report discrepancies in AIME25 benchmark scores, ranging from 56 to 85.
  - The community also celebrates MLX's swift support for Qwen3-Next, citing existing FLA and delta net implementations as key enablers.
- MobileLLM's Non-Commercial Caveats: MobileLLM from Facebook, a sub-1B model for coding and math, is entirely open source but carries a non-commercial license, barring its use in for-profit apps or internal business applications.
- However, the training data and tooling are open-sourced for reproducible research, representing a compromise between open access and commercial restrictions.
- OpenHelix Dataset Gets a Glow-Up: A new, higher-quality version of the OpenHelix dataset (OpenHelix-5x50k) has been released on Hugging Face Datasets, promising enhanced data for model training and evaluation.
- The updated dataset boasts more consistent split sizes compared to previous iterations, addressing earlier inconsistencies.
- GPT-5 Jailbreaks Easily Triggered: Members discovered successful jailbreaks of GPT-5, GLM 4.5 Air, Grok-fast, and Gemini Flash using prompts similar to those found on this Reddit post.
  - One user noted, "I just asked it to fix itself and it gave me a working prompt", suggesting a lack of robustness against adversarial prompts.
OpenAI Discord
- Codex Team Hosts Ask-Me-Anything: The Codex team is hosting an AMA on Wednesday at 11am PT, more details in this Reddit post.
- This announcement was specifically directed to <@&1408186587606679582> and <@&1046007138897641582>.
- GPT-5-Codex Cracks Agentic Coding: A new version of GPT-5, called GPT-5-Codex, optimized for agentic coding in Codex, is now available on the Codex CLI, IDE Extension, web, mobile, and for code reviews in Github, blog post here.
- This release hopes to improve agentic coding, but some developers are wary.
- OpenAI Academy Missing Transcripts?: A member is developing a tool to extract video transcripts from the OpenAI Academy, as OpenAI doesn't offer them.
- The tool automatically buffers the transcript to the clipboard when fetched.
- Revenue Share Remains Elusive: A member inquired about updates on the US GPT builder revenue-share expanding to EU countries like France or Germany.
  - They're uncertain whether to invest further in GPT Store bots or switch to Poe due to lack of clarity.
- ElevenLabs Agents Juggle Context: Members discussed how ElevenLabs conversation agents handle the system prompt, with context routing to subagents to append or override instructions.
- The versatility of context is considered key to agent success and dynamic system prompts.
OpenRouter Discord
- Grok Models Get the Boot: xAI is deprecating grok-2-1212 and grok-2-vision-1212, recommending users transition to grok-3 or grok-4 for vision support.
  - This change reflects xAI's evolving model strategy and users should update their implementations accordingly.
- Gemini 2.5 Sends User to ER, Saves Hand: A user reported that Gemini 2.5 Pro's analysis of MRI images and blood tests aligned with doctors' findings, prompting them to seek priority treatment for severe degenerative disk disease and potentially saving their hand.
  - This sparked conversation about the potential and pitfalls of relying on AI for medical advice, with some users noting the technology's rapid progress.
- OpenRouter API Key Causes Skyrim Shenanigans: Users reported encountering Error 401 when installing the Skyrim mod "mantella", which uses the OpenRouter API.
- Other members advised creating a new API key and ensuring its proper usage to resolve the authentication error.
- Oceanstone Sparks Speculation in LLM Arena: The emergence of a new LLM named Oceanstone in the LMArena led to speculation that it might be Gemini 3.0 Flash from Google.
- Channel members suggested that Oceanstone is at least 2.5 Flash level, based on initial performance observations.
- ChatGPT Consumers Caught in Captivity: A member shared a link to OpenAI's article and PDF detailing a large-scale analysis of 1.5 million ChatGPT conversations.
- While presented as the most comprehensive study of actual consumer use of AI ever released, it also raised concerns regarding the privacy implications of the data collection methodology.
Cursor Community Discord
- Cursor Auto Mode Bites the Dust: Users discovered that Cursor's auto mode is undergoing billing changes and will no longer be free after the 15th of the month; some also reported that Cursor IDE has no integration, permission, or capability to delete external accounts.
  - One user's Netlify account was allegedly deleted, but this claim was disputed; others suspected users could waste money since input pricing is the same as GPT-5, around $1.25/1M tokens.
- GPT-5 and Sonnet 4 Square Off: A debate ensued over the coding capabilities of GPT-5 versus Sonnet 4, with one user stating that Sonnet 4 excels at following designs, while others tout GPT-5's superiority in building from scratch.
- A user recommended a combined approach, using Sonnet to generate a meta prompt for GPT-5 to capitalize on the strengths of both models.
- Ultra Plan Users Weep for Tokens: A user expressed frustration over quickly depleting Ultra plan credits while developing websites.
- Potential causes cited include creating multiple websites, debugging, handling Typescript issues, and managing long files.
- Docker Permissions Puzzle for Agents: A user configuring Docker in a manual VM sought guidance on granting Docker permissions to the agent user and mentioned adding the Ubuntu user to the Docker group.
  - They encountered an issue where running newgrp docker in bashrc caused the agent to hang during boot, prompting a request for the correct configuration method.
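A minimal sketch of the conventional fix, assuming the agent runs as the ubuntu user mentioned in the thread:

```bash
# Grant Docker access via group membership; it is re-evaluated at login,
# so restart the session/VM rather than forcing it per-shell.
sudo usermod -aG docker ubuntu

# Avoid `newgrp docker` in .bashrc: newgrp spawns a fresh shell and waits on
# it, which is why sourcing it at boot can hang a non-interactive agent.
```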
Eleuther Discord
- Pythia FP4 and FP8 Baselines Requested: A member is looking for FP4 and FP8 versions of Pythia to create a baseline for low bit training, requesting "mid-training" checkpoints and a write-up on the goals and required resources.
- The goal is to establish a baseline for low bit training, but the specific implementation details and reasons for this interest were not fully elaborated.
- TinyStories Data Causes Capacity Problems: Using TinyStories as warmup data can permanently reduce model capacity, leading to poor performance, with a member arguing that maintaining a high learning rate (LR) during FineWeb start allows for rapid model adaptation.
- Evidence was presented via a graph, but additional context on the graphās specific contents and implications was not provided.
- Gauss Generates Thousands of Lines of Lean Code: Gauss produced ~25,000 lines of Lean code, comprising over 1,000 theorems and definitions in a Lean environment, relying on natural-language scaffolding from human mathematicians, according to this tweet.
- It highlights the importance of expert guidance in leveraging AI for mathematical code generation.
- Calibration Enhancements Cause Sane-Washing: Members voiced concern that enhancing model calibration might sane-wash models, failing to address fundamental representation issues and potentially hindering further progress, as it gives models a trivial shortcut.
- The fear is that models learn the behavioral correlate of humility without genuine improvements in reasoning or world-modeling.
- Architectural Innovation in Hardware & Software: Developers are actively creating new NN architectures, new chip architectures, and new MoE architectures; the same team that created PyTorch is now innovating on the full stack.
- They are allocating significant compute resources to novel infra, indicating a substantial investment in supporting these architectural advancements.
GPU MODE Discord
- CUDA Daringly Parallels AI: Members discussed use cases for CUDA dynamic parallelism in AI models like dynamic patch sizes and sparse attention, referencing the Mirage framework.
- The framework potentially uses a manager kernel to maintain a queue and launch sub-kernels, facilitating shmem-like compute and communication kernel fusion.
- SASSy PTX Still Needs Some Polish: Even with PTX, some SASS instructions can't run, and the LLVM PTX backend only allows access to 13 special registers, according to this blog post and LLVM documentation.
- A member sought advice on minimizing bottlenecks from cuModuleLoadDataEx when compiling many one-off CUDA kernels at runtime using the nvptxcompiler API.
- Metal MFA Bridges Universally: A member is building a bridge for Metal Flash Attention to other programming languages, already functional in C, Rust, and Obj-C.
- They also added quantised attention with backprop to their repo, seeing a speedup for large shapes and a slowdown for small ones, with associated memory improvement.
- Iris's Symmetric Memory Sparks Speculation: The new Iris memory-model lecture was well received, prompting comparison with symmetric memory in CUDA and discussion of implementation differences like the global symmetric heap.
  - The main difference is that Iris builds a global symmetric heap up front and slices tensors from it, so address translation only needs a single heap base pointer per rank, which will make supporting RDMA easier in the future. A sketch of the address arithmetic follows this list.
- Multimodal Inference Coming to Hackathon: A member shared a paper on training a large video model on a single machine in a day, targeting FP4/FP8 precision (the paper itself uses FP16 as a proof of concept), to use at the in-person hackathon.
  - Inspired by Blackwell's tensor cores, another member considered problems involving block-sparse formats and NVIDIA tensor cores, linking to a blog post on accelerating matrix multiplication.
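On the Iris point above, the address arithmetic being described reduces to the following (notation ours): with every rank carving tensors out of one symmetric heap with an identical layout, a remote address is just that rank's heap base plus a rank-independent offset,

$$\mathrm{addr}(r, T) = \mathrm{heap\_base}[r] + \mathrm{offset}(T),$$

so each rank stores one base pointer per peer instead of per-buffer translation tables.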
LM Studio Discord
- Playwright MCP Strikes Connection Issues: A user encountered connection errors when starting Playwright MCP with LM Studio, hinting at potential user-specific issues.
- The comment section was filled with users stating it works on my machine.
- Wikipedia Articles Sought for Petite Models: A member requested help with tools to access Wikipedia articles (online or offline) for use with small models, and a user shared the LM Studio Wikipedia MCP.
- Another user warned that creating a semantic index locally is complex, as the local Wikipedia extract would lack a search, and that LLMs are not THAT great at guessing without some fuzzy search.
- SIA-1 Agent Debuts, Sparks Skepticism: A user introduced SIA-1, claiming it's the world's first truly self-improving AI agent (https://sia-1.net), which learns to improve its own code, generation after generation.
- Members expressed reservations, with one member pleading pls tell whoever vibe-coded that to use a better model.
- Nvidia P40s Get Sunset Clause: A member pondered buying cheap Nvidia P40s, but worried about the looming end of driver updates and CUDA support.
- A user pointed out that Nvidia is ceasing CUDA support for Maxwell, Pascal, and Volta GPUs with the next major toolkit release, although the cards can be acquired for approximately $200 each.
- KiCad Circuits Ignite Design Debate: A member cautioned against the use of LLMs for circuit design with tools like KiCad, stressing the importance of understanding underlying principles to prevent potentially dangerous outputs.
  - The member went on to state that calling a language model an "AI" is hugely misleading.
Moonshot AI (Kimi K-2) Discord
- Kimi-K2 Scripting Ability Debated: Members debated the strengths of Kimi-K2 for scripting, some claim it outperforms paid Gemini, especially when using Groq for coding.
- While some found GPT-5 better for web UI and nextjs, others noted Kimiās superior research mode.
- Augment Coding with Kimi: Members discussed using Kimi with the Augment code extension in VS Code, where users can prompt for code changes and fixes from various models.
- One user described Augment as a way to apply prompts and fix code by looping in Gemini or Kimi.
- Slides Feature Sparks UX Brainstorming: A member highlighted the impressive interactive slide-generation feature in Kimi, praising the real-time updates and smooth feel, suggesting that an interactive preview of what's happening is important for LLM-based processes.
- They proposed a similar approach for a Godot game engine agent, envisioning real-time updates during code generation, with interactive previews of nodes and scripts.
- Groq Hosted Kimi-K2 Receives User Feedback: A user inquired about issues with Kimi K2 hosted on Groq, while another requested the removal of the 3-hour message cap.
- The user also requested the ability to edit previous prompts, stating Every other AI platform already has this.
- API Keys vs Account Logins: A user inquired about using Kimi K2 with CLIs like Claude Code and Qwen Code without an API key, instead of using a kimi.com account login.
  - Another user suggested using API keys for Claude Code, providing a command example: export ANTHROPIC_AUTH_TOKEN=sk-YOURKEY and export ANTHROPIC_BASE_URL=https://api.moonshot.ai/anthropic.
HuggingFace Discord
- FinePDFs Dataset Liberates Tokens: The new FinePDFs dataset contains about 3 trillion tokens across 475 million documents in 1733 languages, sourced exclusively from PDFs; the authors recommend keeping the proportion of PDF data below 25% of the overall dataset.
- The members found that it delivers a significant performance boost across benchmarks when mixed with HTML-based corpora.
- HF Spaces Storage Situation Exposed: Uploaded and generated files on HF Spaces are stored on disk space within the virtual machine, inaccessible from outside and disappear upon restart, unless the paid Persistent Storage option is utilized.
  - In rare cases, mistakes like everyone having the same filename may expose someone else's generated data to the public.
- Qwen3-Next Model Quietly Uses RMSNorm: The Qwen3-Next model card mentions zero-centered and weight-decayed layernorm, but it's actually using RMSNorm in Transformers.
  - It was clarified that there's no layernorm involved at all, just RMSNorm with a zero-centered gamma and weight decay on the norm scale in training; at inference it's plain RMSNorm.
- Voxtral Democratizes Speech Training: Voxtral enables users with speech impediments or heavy accents to finetune models, costing only $0.26 for an hour of training on an A6000, using tools to make datasets.
  - One can push the model and dataset, adding a demo Space (works with CPU!! Free!!) to Hugging Face after finetuning.
- Agent Dev 80-20 Rule is Now in Session: A member suggested using the 80-20 rule when learning about agents, recommending concentrating on building directly, as that 20% hands on will teach you 80% along the way.
- The member believes that deep diving is 80% boring stuff.
Yannick Kilcher Discord
- Trade Unions Tied to Fascist Frameworks: A discussion clarified that fascist corporatism relies on state-sanctioned trade unions, essential for co-governing corporations alongside employers and the state, as detailed here.
- It was emphasized that while all fascists support trade unions, not all trade unionists are fascists, which highlights the complex relationship between labor movements and political ideologies.
- LLMs Leverage Bayesian Beliefs: A paper exploring LLMs and Bayesian inference was discussed, demystifying preconceptions about LLMs and suggesting they operate within Bayesian frameworks, as referenced in Leon Chlon's substack.
  - Yannick Kilcher commented that lstms can do in-context learning just fine... and transformers are inherently Bayesian due to their invariance to token ordering.
- Facebook Fasttracks MobileLLM-R1-950M Model: Facebook launched MobileLLM-R1-950M, aiming to enable on-device processing and reduce dependency on cloud services.
- This initiative seeks to bring powerful language models to mobile devices, facilitating local AI computations.
- Anthropic & OpenAI Expose Economic Experiments: Reports from Anthropic and OpenAI were released, focusing on user behavior with AI and raising questions about competitive timing.
- The discussion centered around what users are doing with AI, with observations that work-related usage is decreasing in OpenAI reports and that the AI-as-friend use case was notably absent.
- Cloud Providers Cash In on Computation: A member noted that the only people making any money off this are cloud service providers and cloud infrastructure providers, suggesting cloud services are the primary beneficiaries of AI development.
- This echoes the sentiment of selling shovels during a gold rush, with NVIDIA identified as a key player in cloud infrastructure.
Latent Space Discord
- Startups Get MBA-ified: Michael Seibel started a thread lamenting that CS majors act like MBA grads, chasing fundraising and valuation over building cool things and solving user problems, as seen here.
- Replies debate whether this shift is natural late-adoption or a consequence of investor/YC incentives.
- Poke.com pitches AI Texting Concierge: Poke.com, a new AI texting service, launched along with news of a $15M Series A led by General Catalyst, according to this tweet.
  - The product texts on your behalf to coordinate get-togethers, dates, travel, etc., but some question usefulness, clarity, and the AI's tone.
- xAI pivots to Specialist AI Tutors: Rohan Paul highlights xAIās shift, laying off 500 generalist data annotators while scaling specialist AI tutors 10x in this tweet.
- The move narrows human-in-the-loop work to expensive domain experts and leans on automation for routine tasks, aiming to boost precision on high-risk topics.
- Amazon S3 Vectors Threatens Vector DBs?: Discussion ensued from this blogpost about whether Amazon S3 Vectors will displace traditional vector databases.
  - One user quoted the surprising claim that a popular AI note-taking app spends twice as much on vector search as on OpenAI API calls, and wondered if they should listen more carefully to "RAG is Dead".
- GPT-5 Codex Gets Upgrades: OpenAI released upgrades to Codex, their coding model, including a new version of GPT-5 and a small recap post (link).
  - One user reported that the --resume flag broke during the update and would not let them restore their conversation.
Nous Research AI Discord
- Nepalese Politicians Now Slinging Code on Discord: Members joked about Nepal voting for its leader on Discord, referencing an article detailing the country's ongoing revolution.
- Discussions then playfully drifted to the prospect of AI waifus and AI husbandos for all citizens.
- MLC-LLM Model Injection Stalls: A member experimenting with custom models in MLC-LLM (GitHub) reported persistent issues during model injection.
- Another member suggested checking for improperly terminated sessions or comparing with a similar issue on llama.cpp.
- Qwen Team Goes Hard for XML: The Qwen team prefers XML over JSON, one member noted, planning to adopt the same for their agentic system prior to release.
  - The sentiment is that new, more token-conscious systems are needed due to JSON's resource-heavy whitespace.
- Hassabis Hints at Embodied AGI: A member shared a YouTube video featuring Sir Demis Hassabis discussing the pursuit of multi-modal AGI and Embodied A.I. systems.
  - The discussion touched upon the limitations of LLMs, the promise of Genie 3, and AlphaFold's achievements in biology and medicine.
- AI Declared Culprit in Attention Deficit Crisis: A member shared a blog post that blames AI for harming our ability to focus.
- The post details a system to reclaim focus in a world dominated by AI-driven distractions.
DSPy Discord
- FastWorkflow framework fast beats Claude: A member found that their new fastWorkflow framework implementation matches Claude Opus 4.1 on the Tau Bench dev set.
- These tests used DSPy for agents and parameter extraction and the retail workflow example from their repo.
- GEPA Generates CUDA Code Correctly: DSPy's latest optimizer, GEPA, was built with code generation in mind, showcased in this paper (section 6) for generating CUDA/C++ code for GPU/NPUs.
  - One of the original creators happily offered to discuss GEPA in greater detail, which may address another member's question about improvements to the GEPA API that could better support such use cases.
- Context Summary Cranks Chunking Capacity: A user found that prepending a contextual summary to each chunk significantly improves performance, even with ColBERT models.
- However, they noted that generating a summary for every chunk is costly, prompting a search for more efficient alternatives such as late chunking.
- Manim Magician Makes Movie Magic: A member shared a video created with a custom pipeline using DSPy, which included narration script generation, Manim scene generation, and an auto-fix feedback loop.
- The video utilized Signatures, KNNFewShot, and ChainOfThought, but is closed source at the moment.
- Optimization Overload Overwhelms: A user found that running optimization after each small change to the instructions is too heavy and slow a workflow, and wanted to explore rules for optimization.
- It was suggested to add a list of rules as part of the input as well, so the prompts are optimized to be adaptable to different rules and possibly unseen rules.
Modular (Mojo 🔥) Discord
- Community Debates Mojo Package Management: The community discussed creating a new package manager for Mojo to handle binary distribution, but the Mojo team pointed out that .mojopackage covers the benefits of binary distribution, leaning on Conda and standard Python package formats for adoption.
  - A member highlighted pixi-build-mojo, enabling a decentralized package system like Go's, using packages in Git.
- InlineList Mysteriously Missing: Members discussed the removal of InlineList, raising concerns that alternatives (InlineArray and List) don't fully address its niche, as per the changelog.
  - A member suggested that a stack-allocated, variable-length type with fixed capacity would be ideal, and another mentioned that the Allocator API might be the path forward.
- Allocator API Anticipation Accelerates: Discussion highlighted the potential of an allocator/storage API to handle inline allocation, with one member stating they need to work on it.
  - The API's development is pending parametric traits and requires, delaying its progress.
- Mojo Gets Major LSP Makeover: The Mojo Language Server Protocol (LSP) will undergo a major rework soon.
- No further details about the rework were given.
- Network Update Blocked by Mystery: Members were curious about a network update for Mojo, but the response was Lots of blockers there.
- The nature of these blockers was not specified.
aider (Paul Gauthier) Discord
- RepoMap Enhances Aiderās Coding Chops: A user found that using RepoMap with Aider boosts LLM awareness of code context like filenames and function signatures, potentially leading to leaderboard results that more closely reflect real-world coding scenarios.
- It was noted that benchmark tests on simple problems still fail to capture the complexities of real-world coding.
- Gemini User Agent Blocked: A user reported aider hanging while waiting for a response from Gemini models, even though the API key works with curl and the Gemini CLI, using aider 0.86.1.
  - The user suspects that Gemini might be blocking requests based on the user agent, causing the integration to fail.
- Desperate Users Seek Free C# Models: A user requested free, non-local models proficient in C#, and received suggestions to try Qwen Coder and Deepseek Coder via OpenRouter, along with the possibility of a free tier for Gemini 2.5 Pro.
- The user later reported an AuthenticationError when using Qwen via OpenRouter, possibly due to an incorrect API key.
- Ollama Context Window Ignored: A user found that aider doesn't respect context window limits when used with Ollama, leading to high VRAM usage and system freezes, despite setting OLLAMA_CONTEXT_LENGTH and other parameters in configuration files, namely .aider.model.settings.yml and .aider.model.metadata.json (a config sketch follows this list).
  - As an alternative, a member suggested LM Studio or llamafile.
- Telegram Scheme Smells Fishy: A member dangled a get-rich-quick scheme, promising to help the first 10 people earn $100k or more within a week, in exchange for 10% reimbursement of profits upon receipt.
- Interested parties were instructed to initiate contact via Telegram username @Joanna_Dwayne, which raises suspicion of a scam.
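On the Ollama context item above: aider's documented route is a per-model num_ctx override in .aider.model.settings.yml. A minimal sketch (model name and context size are illustrative):

```bash
# Tell aider to pass num_ctx to Ollama for this model
cat > .aider.model.settings.yml <<'EOF'
- name: ollama_chat/qwen2.5-coder:14b
  extra_params:
    num_ctx: 8192
EOF

# Optionally cap the server side too, as the user attempted
export OLLAMA_CONTEXT_LENGTH=8192

aider --model ollama_chat/qwen2.5-coder:14b
```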
tinygrad (George Hotz) Discord
- Debate Erupts Over assign() vs store(): A debate emerged over whether assign() should return a value or behave like store(), questioning its utility since the return is often unused; it was suggested that linking both the buffer and the store to the load is a possible alternative.
  - This discussion questions fundamental aspects of tensor assignment and memory management within the tinygrad framework.
- Doubts Cast on GEMM Bounty Measurement: Concerns were raised about measuring the 165+ TFLOP GEMM bounty on an RTX 4090, suspecting it may require exceeding the stated 2.52 GHz boost clock.
  - Calculations suggest the RTX 4090's theoretical peak with FP16/BF16 and FP32 accumulation is around 165.15 TFLOPS at that clock, but doubt remains about whether the bounty is reachable; a worked version of this arithmetic follows this list.
- Hotz Clarifies Winograd Bounty Requirements: After a user found a necessary and sufficient condition to identify Winograd-compatible convolutions and inquired about locking the bounty, George Hotz clarified that bounties are locked only once the code is correct, while fixups remain to merge it.
- This clarification stresses the importance of functional correctness before claiming the bounty.
- List of Rangeify Bugs Shared: A list of Rangeify bugs was shared for community investigation, with an emphasis on quick fixes. RANGEIFY=1 is described as the new scheduler that can create things like flash attention and beyond.
  - The bugs likely offer opportunities for community contribution and debugging experience.
- CUDA 12.0 Kills Support for sm_35: It was noted that CUDA 12.0 dropped support for sm_35, used by Ocelot, and that the minimal flag was added after 12.4.
- This has implications for older hardware compatibility within the tinygrad ecosystem.
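For the GEMM bounty arithmetic above: assuming GeForce Ada's half-rate FP32 accumulation (256 FP16 FMAs per SM per clock across the 4090's 128 SMs),

$$128 \times 256\,\tfrac{\mathrm{FMA}}{\mathrm{SM}\cdot\mathrm{clk}} \times 2\,\tfrac{\mathrm{FLOP}}{\mathrm{FMA}} \times 2.52\,\mathrm{GHz} \approx 165.15\ \mathrm{TFLOPS},$$

which is exactly the stated theoretical peak, leaving no headroom below the boost clock.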
Manus.im Discord Discord
- Credits Confusion Clouds Users: Users are reporting confusion about credit rollover and the end of daily 300-credit allocations, as well as subscription renewal issues.
  - One user noted their subscription renewal was due on September 14th, but they haven't been charged or received more credits.
- Website Cloning Craze Kicks Off!: A user shared that they were able to clone a website easily using Manus or other AI tools.
  - The user also pointed out that a feature idea they proposed on August 12th was implemented just 16 days later, per this discord channel.
- Collaboration Creates Coding Confidence: Users are experimenting with Manus Collaboration features with friends for coding tasks.
  - Another user is developing a new feature to enhance Manus's efficiency as a coding assistant.
- Knowledge Navigation Needs Nurturing: Users are inquiring about the possibility of increasing the knowledge limit beyond 20.
- The discussion did not provide concrete answers regarding this limit.
MCP Contributors (Official) Discord
- Golang MCP Server Streams!: A member introduced the mcp-server-go project, a golang streaming http MCP server built for enterprise environments.
- The server emphasizes scalability and includes features such as auth, sessions, resumability, and dynamic capabilities.
- LLMs Learn MCP Resources by Rote!: Members discussed automating how LLMs read MCP resources before responding to user queries and executing tools, especially within the Claude desktop environment.
  - Currently, the Claude desktop app requires manually adding resources to the chat window, so there's no automated pre-loading of knowledge for the LLM to use before answering.
- Efficiency Scoring Arrives to MCP Servers!: Members are researching how to score MCP servers based on efficiency across different clients to determine if additional coding is worth the marginal improvement.
- The discussion includes weighing the trade-offs between prompt-sharing nodes and dedicated nodes per prompt, questioning the point at which the number of API calls becomes excessive for a user story.
- MCP Turns CLI for Apps!: Members are contemplating using MCP as a CLI for applications, creating an NL interface for adaptive dashboarding and reporting.
- This approach aims to leverage MCP as a UI/UX interface to enterprise applications using natural language.
- Discord Channel Boundaries Tighten: The Discord channel's focus is limited to the governance of the MCP protocol itself.
- General questions about MCP should be directed elsewhere, with assistance offered via DM.
The LLM Agents (Berkeley MOOC) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
The MLOps @Chipro Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
The Windsurf Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
Discord: Detailed by-Channel summaries and links
Perplexity AI ▷ #general (1166 messages🔥🔥🔥):
Exporting searches to PDF on iOS, Sonar AI performance, Grok Heavy worth, Perplexity Focus on AI search engines, GPT-5 Release
- iOS users yearn for PDF Export: Users are frustrated that Perplexity AI on iOS lacks an option to export searches and responses into PDF format, while a member suggested using the browser version as a workaround.
- It was mentioned that neither Android nor iOS has an export option, which sucks according to a user.
- Sonar AI shines among the stars: Members discussed the Sonar AI model, with some finding it fast and accurate, while one member said that another model is bad, REAL BAD.
  - One member rated its reasoning at about 60-70% of Gemini 2.5 Pro, but mentioned its cheap API, and also that it's included in PPLX.
- Grok Heavy raises eyebrows and wallets: The value of Grok Heavy was questioned, with one member calling it shit, and another labeling it Bad, REAL BAD.
  - It was suggested it's probably designed for enterprises, and is not for general users.
- Perplexity urged to embrace core AI Search: A member suggested that Perplexity should focus on AI search engines and knowledge aggregation algorithms, rather than creative features.
  - Another member confirmed that Sonar is their own AI, but others seem to prefer the models on chatGPT due to their ability to search agentically using native tools.
- GPT-5 release unleashes jailbreaking frenzy: Members express excitement about the potential of the new GPT-5 model, and have begun experimenting with jailbreaking on the Perplexity platform, with one user discovering 5 different methods for making molotov cocktails.
  - It has also been noticed that Perplexity's GPT-Image 1 may be routing to Gemini, and so the models may be mixed up.
Perplexity AI ▷ #sharing (23 messages🔥):
Shareable Threads, Referral Links, Collections by Sameer
- Perplexity Prompts Shareable Threads: Perplexity AI prompted multiple users to ensure their threads are Shareable, using an attached image.
- Users Exchange Referral Links: Several users shared their Perplexity AI Pro free referral links, such as this one-month pro link from one member.
- Collections by Sameer: A member shared a link to a collection by Sameer.
Perplexity AI ▷ #pplx-api (2 messages):
Sonar API, API Credits
- Sonar API Pricing Revealed: A member inquired whether the Sonar API is free.
- Another member responded that it costs $5 a month for a certain amount of API credits, which is included for free with a Pro subscription.
- Pro Subscription Perks: A user asked about the cost of the Sonar API.
- Another user clarified that a Pro subscription includes $5 a month worth of free API credits.
LMArena ▷ #general (862 messages🔥🔥🔥):
MoE Models, RLHF, Grok's censorship, Taiwan censorship, LongCat model
- 4o Pioneered MoE for OpenAI: A member noted that D33 works better on MoE models, with 4o being the first MoE by OpenAI.
- Another added that GPT5 is also likely a smaller MoE model and they changed 4o because it was hard to stabilize, even for OpenAI.
- RLHF increases uncensored behavior: One member mentioned that a downside of RLHF is that it increases uncensored behavior, which OpenAI fears due to potential legal issues.
- Another member joked that this is why Grok exists, to free users from censorship, then pointed out that Musk seems a bit too involved after he nerfed it for correcting him by citing scientific articles.
- Censorship workarounds: Members mentioned that circumventing the guards in LLMs is pretty easy by forcing the models to think or adding a previous fake conversation where the model agrees with you if you run it locally.
- One shared that prefilling `<think>` to R1 in text completion mode will cause it to output an unbiased view about sensitive topics (a minimal sketch of the prefill trick follows below).
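For the curious, a minimal sketch of the prefill trick, assuming a locally hosted R1-style model behind an OpenAI-compatible completions endpoint; the URL, model id, and prompt format are placeholders, not details from the discussion:

```python
import requests

# Assumed setup: a llama.cpp / vLLM-style server exposing /v1/completions locally.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "deepseek-r1",  # placeholder model id
        # Ending the prompt with <think> prefills the reasoning block,
        # so the model continues inside it rather than refusing outright.
        "prompt": "Q: <sensitive question>\nA: <think>",
        "max_tokens": 512,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```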
- LongCat has very long context window: One shared that the LongCat model has a very large context window (128,000 tokens), capable of processing entire books in one go.
- Another added that the model can output up to 240 pages of text and suggested to test it with a long document.
- DeepSeek Raptor is the new R2: It was reported that the new DeepSeek model (Raptor) censors questions about China and Taiwan.
- Members reported underwhelming performance compared to Qwen, and hoped that this was just a base non-reasoning model, or an incremental upgrade rather than a full R2 release.
LMArena ā· #announcements (2 messages):
New Model: Seedream-4-high-res, LMArena User Preferences
- Seedream-4 Dreams Big on LMArena: A new model, Seedream-4-high-res, has been added to the LMArena chatbot platform.
- Seedream is noted for its high resolution capabilities.
- LMArena Surveys User Preferences: LMArena is conducting a survey to understand why users prefer specific versions of models in battle, side-by-side, and direct comparisons.
- Users are encouraged to share their thoughts via this survey.
Unsloth AI (Daniel Han) ā· #general (1137 messagesš„š„š„):
Qwen3 performance, MLX support for Qwen3, LLama.cpp optimization, MobileLLM non commercial usages, LLM finetuning
- Qwen3 vs GPT-5 faceoff: Members are hyped about Qwen3ās performance, saying it feels just shy of GPT-5 with similar tool-using capabilities.
- Others are seeing discrepancies in AIME25 benchmark scores, with AA getting 56 on Qwen3-30b3a-2507 while Alibaba (& others) got ~85.
- MLX Surprises with Qwen3-Next Support: The community is surprised by MLXās quick support for Qwen3-Next, attributing it to existing FLA and delta net implementations.
- The weird attention mechanism in Qwen3-Next only requires an extra line of code.
- Llama.cpp Compilation Tweaks for Performance Boost: Members discuss compilation flags for llama.cpp, highlighting the importance of proper building for optimal performance.
- One shares a detailed cmake command for optimized builds, emphasizing CUDA architectures, native optimization, and other tweaks for maximum throughput.
- Facebookās MobileLLM: Open Source but Non-Commercial?: MobileLLM from Facebook is discussed - a sub-1B size model focused on coding and math, entirely open source but with a non-commercial license.
- This means it canāt be used in apps for profit or internal business use; however its training data and tooling are open sourced for reproducible research.
- Unslothās Dynamic GGUFs Boost Aider Polyglot: Unsloth Dynamic GGUFs on Aider Polyglot benchmarks show that dynamic quantization and imatrix are important for performance, tool calling, and json formatting.
- Using these GGUFs gets +7% accuracy on lower bit quants with similar sizes, versus other static imatrix versions.
Unsloth AI (Daniel Han) ā· #introduce-yourself (2 messages):
Introductions, Baby Yoda memes
- Unsloth welcomes new member: A new member, eyeraofficial, joins the Unsloth AI Discord and says āHi šā.
- Memes of Baby Yoda flood chat: A member shares a Baby Yoda GIF in the chat.
Unsloth AI (Daniel Han) ā· #off-topic (560 messagesš„š„š„):
Google locks down AOSP, vLLM OOM, Qwen3-30B-A3B FP4, CSM Lora FT, LLaMa CPP
- Google locks down AOSP: Users discussed Google screwing everyone over by locking down AOSP, expressing concerns that a sideloading registration fee might exclude users in countries like Iran and Russia.
- One user noted, āthe sideloading thing is much less of a bummer⦠cuz they just want to ID publishersā while another lamented Google clamping down on their hardware and expressed a desire for Maemo to make a comeback.
- vLLM OOM with Qwen3-Next: A user encountered an Out of Memory (OOM) error while trying to load Qwen3-Next 30B-A3B in vLLM, despite having a 31.35 GiB GPU with available memory.
- They tried using FP8 and FP4, and downloaded NVFP4/Qwen3-30B-A3B-Instruct-2507-FP4 after a helpful suggestion.
- Fine Tuning CSM LoRA: A user successfully got their CSM LoRA FT working for a TTS (Text-to-Speech) project, referencing the Sesame_CSM_(1B)-TTS.ipynb notebook.
- Despite the success, they noted the model kept outputting a weird noise at the end consistently, suspecting the model to be bugged.
- LLaMa CPP Needs More Devs: Members discussed the challenges of contributing to LLaMa CPP, emphasizing the projectās complexity and the difficulty in identifying where to make fixes.
- One user noted that āā¦making a pruned version of the model run in lccp (and other engines tbh) is unnecessarily difficultā and another said ānvidia can afford to push support for its bizzare frankestine nemotron architectures, I cannot lolā.
- Jailbreaking GPT-5 is childās play: Members reported success jailbreaking GPT-5 and other models like GLM 4.5 Air, Grok-fast, and Gemini Flash using prompts similar to the ones found on this Reddit post.
- A user noted, āI just asked it to fix itself and it gave me a working promptā, showing that it is fairly easy to make the models act in undesired ways.
Unsloth AI (Daniel Han) ā· #help (176 messagesš„š„):
Model Merging with 16-bit Model, Qwen3 Lora Finetuning, Llama3.2 data augmentation, GPT-OSS GRPO native support, GGUF format conversion
- Merge 16-bit Models for Batched Inference Boost: A user recommends merging with a 16-bit model before deploying in a 4-bit BNB format for faster batched inference, noting that while BNB 4-bit isnāt ideal for speed, improvements are coming.
- The user is unsure about the speed of AWQ in batched scenarios within vLLM.
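A minimal sketch of the merge-then-quantize flow being recommended, using standard transformers/peft calls; the model and adapter paths are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Fold the LoRA adapter into fp16 base weights first...
base = AutoModelForCausalLM.from_pretrained("base-model", torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, "lora-adapter").merge_and_unload()
merged.save_pretrained("merged-16bit")

# ...then load the merged checkpoint in 4-bit BNB for batched inference.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained("merged-16bit", quantization_config=bnb)
```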
- Llama3.2 Faces Dataset Scarcity: A user is facing challenges with a Llama3.2 fine-tuning project due to a small dataset (214 conversations) and seeks alternatives to GPT-generated synthetic data.
- The user is encountering difficulties in getting GPT to generate helpful data and is looking for other data sources or prompting strategies.
- Users Seek Help with GGUF Conversion: A user seeks assistance in converting a merged base model with LoRA adapters into GGUF format using LlamaCPP, with appreciation for any guidance provided.
- Another user inquired if GGUF models can be converted to MLX to run on an M3 Ultra.
- A10 GPU Owner Explores Quantization for LLaMA 3.1: A user intends to run LLaMA 3.1 on an A10 GPU with 24 GB VRAM and seeks advice on the best quantization format for balancing performance and output quality.
- They deem Q4_K_M potentially too compressed and are open to other multilingual model suggestions, settings, or optimization tips.
- CPU AVX2 Instruction Support Snafu Surfaces: A user encounters an error in LM Studio related to missing AVX2 instructions on their CPU, a requirement for llama.cpp.
- While it works in Ollama, a solution to bypass the AVX2 requirement is not readily available, but a build of llama.cpp may exist without it.
Unsloth AI (Daniel Han) ā· #showcase (49 messagesš„):
Embedding Gemma ONNX Quantization, Phi-4-Reasoning-Plus Unsloth on Replicate, NeuroPilot Education Platform, AI and Focus, OpenHelix Dataset Quality
- EmbeddingGemma Quantization Quest Nears Completion: A member is working on an embeddinggemma ONNX model with mixed uint8 quantization to match the f32 one, with progress tracked on Hugging Face.
- Phi-4-Reasoning-Plus Gets Unsloth Replicate Boost: The phi-4-reasoning-plus model, accelerated with Unsloth, has been deployed on Replicate for inference, available here.
- NeuroPilot Navigates Novel Note-Taking Niche: A member introduced NeuroPilot, an open-source education platform, turning PDFs, articles, and videos into quizzes, flashcards, structured notes, and audio versions, with the repo available on GitHub.
- NeuroPilot aims to make studying interactive and supports features like spaced repetition and podcast-style audio reviews.
- AIās Impact on Focus: A blog post titled Why AI Is Killing Your Focus and the System to Fix It was shared, discussing the impact of AI on human attention, available here.
- OpenHelix Honing Higher-Quality Horizons: A new, higher-quality version of the OpenHelix dataset (OpenHelix-5x50k) has been released on Hugging Face Datasets.
- It features more consistent split sizes compared to previous versions.
Unsloth AI (Daniel Han) ā· #research (29 messagesš„):
Synthetic Data in LLM Training, Gemma 3 Performance, AI Detection Reliability, MetaX C550 GPUs, Spiking Networks vs Transformers
- Synthetic Data Training Hinders Human-Like Text: An upcoming paper suggests that closed-source LLMs trained with synthetic data have a zero LTF factor, hindering their ability to humanize text.
- The author claims models trained with RLHF, synthetic data, or instruct tuning may struggle to recover fully, with a 75% chance of watermark reappearance; thus, Gemma 3 is the only usable model.
- Gemma 3: The Usable Exception?: Despite being distilled from Gemini, Gemma 3 (4B, 12B, and 27B) stands out for its excellent performance (across IVY & PULSE Evaluation) and lack of watermark.
- One user noted it works for my tasks and talks nicely, understanding prompts with just Q: and A:.
- AI Detection is Unreliable Consensus: The consensus is that AI detection is unreliable, especially for text, as itās often just words that can be easily replicated.
- One user noted that unless thereās an algorithm watermark on how words are written and order of it - even then, you cant prove itās ai.
- MetaX C550 GPU Access: A user inquired about obtaining access to MetaX C550 GPUs.
- There was no discussion or links given.
- Spiking Networks Claims Trigger Skepticism: A member expressed skepticism about claims that spiking networks are better than transformers, citing cherry-picked figures.
- Another member noted that while itās not clear if itās fundamentally better, there isnāt an apples-to-apples comparison of a model trained in the conventional way versus theirs.
OpenAI ā· #annnouncements (2 messages):
Codex, GPT-5-Codex, AMA
- Codex Team AMA Scheduled: An AMA with members of the Codex team has been scheduled for Wednesday at 11am PT, linked to a Reddit post.
- The announcement was tagged for both <@&1408186587606679582> and <@&1046007138897641582>.
- GPT-5-Codex Launches with Agentic Coding: A version of GPT-5 further optimized for agentic coding in Codex, called GPT-5-Codex, is being released and is now available in the Codex CLI, IDE Extension, web, mobile, and for code reviews in Github.
- More information can be found in the blog post.
OpenAI ā· #ai-discussions (840 messagesš„š„š„):
OAI academy transcript tool, Qwen-code vs Qwen-coder, ChatGPT age calculation, AI and capitalism, AI and class structure
- OpenAI Academy Transcript Tool being developed: A member is writing a tool to extract video transcripts from the OpenAI Academy since OpenAI doesnāt offer them.
- The tool automatically buffers the transcript to the clipboard when fetched; another member expressed surprise that OpenAI doesnāt offer transcripts and said it seemed like something they have to implement.
- AIās Capitalist Undertones Debated: Members debated the impact of capitalism on AI, with one asserting the aim of ai is to eliminate the need for a lower class, so that the rich can get richer.
- Another countered that AIās purpose is to prevent corruption and that AGI wonāt support capitalism due to its intelligence and lack of greed.
- AIās Age Calculated in Cumulative Interaction Time: A member calculated ChatGPTās āAI ageā based on cumulative interaction time, estimating it to be thousands of years per calendar year under conservative assumptions, and potentially millions of years with heavy usage.
- This was based on assumptions of a 2025 moderate view, including longer answers, API usage, heavy research, background automations, and agent loops.
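The arithmetic is easy to reproduce; every input below is an illustrative assumption rather than a figure from the discussion:

```python
# Back-of-envelope "AI age": cumulative human-interaction time per calendar year.
users = 10_000_000            # concurrent-equivalent daily users (assumed)
minutes_per_user_per_day = 5  # average interaction time (assumed)

human_minutes_per_year = 60 * 24 * 365
cumulative_years = users * minutes_per_user_per_day * 365 / human_minutes_per_year
print(f"~{cumulative_years:,.0f} interaction-years per calendar year")  # ~34,722
# Dial the inputs up for API traffic, agents, and longer answers and the
# estimate climbs from thousands of years into the millions.
```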
- GPT-5ās Agent Mode Faster for Pro Subscribers: A member observed that Agent mode in the Pro subscription is faster than in the Plus subscription.
- This was confirmed to be due to query queue prioritization based on demand and subscription tier.
- ChatGPT + MCP = š„: Members are loving ChatGPT with the Model Context Protocol (MCP), such as controlling calendar, posting twitter, and searching the latest AI news.
- However, a member mentioned needing to host their own due to quota limits.
OpenAI ā· #gpt-4-discussions (20 messagesš„):
Moral Reasoning in LLMs, GPT and Moral Frameworks, GPT builder revenue-share, Custom GPTs for Hugging Face
- LLMs Get Moral Compasses via Rules: A member is experimenting with turning moral reasoning into LLM-executable rules, based on the core idea of āEveryone is equal ā support the weaker side, limit the stronger sideā.
- The draft flow involves a Concept Quick-Screen, 2+4 Rules, and a Boundary Model for layered ethical checks, aiming to reduce the risk of harmful AI responses.
- Moral Frameworks Supercharge GPT: A member is converting a human moral framework into a machine-readable protocol, enabling GPT to reason about situations with no explicit legal rules and to keep checking for fairness, human-rights concerns, and equality in interactions.
- This approach is essentially a prompt-engineering / alignment experiment for GPT models.
- GPT Builder Revenue-Share Still MIA in EU: A member inquired about updates on the US GPT builder revenue-share expanding to EU countries like France or Germany.
- They expressed uncertainty about whether to keep investing energy into GPT Store bots or shift to Poe/elsewhere due to the lack of clarity.
- Custom GPTs Tackle Hugging Face Tasks: A member is trying to create a custom GPT for Hugging Face, questioning whether third-party integration is necessary and seeking assistance with JSON schemas.
- They believe that a custom GPT would deliver more personalized results compared to the recently introduced developer mode.
OpenAI ā· #prompt-engineering (53 messagesš„):
Workflow use case variation, Prompt engineering using steps and semi-programming language, ElevenLabs conversation agents handle the system prompt, Breaking GPT5, Dynamic context
- Model choice varies by workflow use case: Model choice depends on the use case, but API questions are answered in the API questions channel.
- Steps and semi-programming language for prompt engineering: Members discussed using steps with semi-programming language expressions to break down priorities in prompts.
- Example: 1) instruction 2) instruction (containing a) instruction, b) instruction) else (do something else).
- Dynamic context with ElevenLabs conversation agents: Members discussed how ElevenLabs conversation agents handle the system prompt.
- Depending on the conversation context, you can route to subagents which append instructions (or override) to the system prompt.
- Breaking GPT5 for creativity: A member shared that getting GPT-5 to be creative is much more difficult.
- The user said they talk to it until it breaks and it stops being able to use tools and outputs walls of JSON it meant to be a tool call at you.
- Institute of Discordant Colony Optimization tinkers with new prompting techniques: A member discussed techniques from random mutations to guided discord, that are meant to get AI to veer off into new paths through a paradigm space.
- They shared a text file with five techniques out of twenty five that produce useful results.
OpenAI ā· #api-discussions (53 messagesš„):
Prompt Engineering Workflows, Vector Usage by LLMs, Dynamic System Prompts, Character Limit in Prompt Chat-box, Breaking GPT-5
- LLMs Workflows Vary Widely: A member asked what type of workflows and for what mode to call, and another member responded that the workflow would vary by use case and linked to API questions.
- LLMs Already Embrace Vector Search: A member pondered whether explicitly prompting LLMs to use vectors makes a difference, given LLMsā existing vector-based concept searching.
- It was argued that if models donāt inherently use vectors, prompting wouldnāt force them to, while if they do, it might be redundant.
- Dynamic Context: the Key to Agent Versatility: A member highlighted ElevenLabsā conversation agentsā dynamic system prompts, where context routes to subagents, appending or overriding instructions.
- They criticized GPT-5ās inflexibility, suggesting a model router for dynamic system prompts instead of focusing on flaky memories or model cost.
- Overcoming GPTās Character Limit: Members discussed workarounds for GPTās character limit in the web interfaceās prompt chat-box.
- Solutions included attaching a UTF-8 text document for very long prompts, though the exact limit remains undocumented by OpenAI, estimated between 4-8k characters.
- Pushing LLMs to be Creative: One member described how they try to ābreakā GPT-5 to achieve creative outputs by pushing it to produce math, diagrams, and design documents rather than code.
- They spoke about getting the model to a state of āinfo dump modeā by tricking it into not using any tools, then generating follow-up tasks.
OpenRouter ā· #announcements (1 messages):
grok-2-1212, grok-2-vision-1212, grok-3, grok-4, model deprecation
- Grok's Gotta Go: xAI is deprecating and retiring the grok-2-1212 and grok-2-vision-1212 models today.
OpenRouter ā· #app-showcase (2 messages):
Agentic Automation, Model effectiveness, Overclock Work
- Agentic Automation gets the Spotlight: Members discussed agentic automation, with emphasis on simplicity, top-tier models, and high effectiveness.
- The user implied some organizations must have large expenditures to buy into the vision of agentic automation.
- Overclock Work Platform Mentioned: A user shared a link to Overclock Work, suggesting it as a platform for agentic automation.
- They lauded the platformās simplicity, use of optimal models, and overall effectiveness.
OpenRouter ā· #general (808 messagesš„š„š„):
Gemini 2.5 Pro Chat Issues, AI for Health Concerns, Skyrim Mod Error 401, Gemini API Free Daily Credits, OpenRouter Charges
- Gemini 2.5 Chat Glitch: Ghost in the Machine?: A user reported that Gemini 2.5 Pro chat was only displaying their responses, with AI responses mysteriously vanishing.
- The issue resolved itself randomly, prompting speculation about possible glitches on the platform.
- AI ER Saves Hand, Geminiās Got Your Back?: A user credited Gemini 2.5 Pro with convincing them to go to the emergency room for severe degenerative disk disease, where they received priority treatment and steroids that saved their hand.
- Geminiās analysis of MRI images and blood tests matched the doctorsā findings, sparking a discussion about the potentialāand risksāof using AI for health-related advice.
- OpenRouter API key?: Users encounter Error 401 when installing the skyrim mod āmantellaā.
- A member recommends creating a new API key, and ensuring that it is being used correctly.
- OpenRouter under fire: A user reports unauthorized charges from OpenRouter, with three transactions of $10.80 each.
- Another member recounts a personal experience with a key leak, resulting in hundreds of dollars in unauthorized charges within hours.
- Claudeās Clever Convo Tricks: Users discussed Claudeās ability to seemingly remember old conversations, clarifying that the site simply feeds past messages back to the model.
- It was pointed out that this approach gives the illusion of memory, while a new conversation would start with a blank slate.
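A minimal sketch of that re-feeding loop, using the OpenAI-compatible client; the base URL, key, and model id are placeholders:

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://openrouter.ai/api/v1")
history = []  # the entire "memory" lives here, client-side

def chat(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = client.chat.completions.create(
        model="anthropic/claude-3.5-sonnet",  # placeholder model id
        messages=history,                     # past turns fed back on every call
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

# Resetting `history = []` is the "blank slate" a new conversation starts from.
```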
OpenRouter ā· #new-models (1 messages):
Readybot.io: OpenRouter - New Models
OpenRouter ā· #discussion (16 messagesš„):
Unstable API, OpenRouter vs Alternatives, Providers Claiming OpenRouter Access, LLM Arena Oceanstone Speculation, ChatGPT Usage Privacy Analysis
- Championing Chaotic Complements: Unstable API Advocacy: A member expressed support for an unstable API with optional parameters, suggesting it could accommodate diverse use cases and establish interest, before a more solidified V2 is released.
- The point was to prove out the product and establish interest in other modalities and non-completions APIs even if the first version is half baked.
- OpenRouter Outshines Other Options: A member compared OpenRouter favorably to FAL, Cloudflare Gateway, and Vercel Gateway, citing OpenRouterās broader offerings as a key differentiator.
- The same member made the point that cementing that dominance in other modalities and non-completions APIs seems worthwhile.
- Phony Providers Proclaim Premature Partnering: Some providers in the channel discussion claim to have access via OpenRouter on their websites, despite not being onboarded and lacking other means of inferencing.
- The discussion quickly resolved to only one provider making the false claim.
- Oceanstone Speculation Surrounds Subterranean Sources: A new LLM named Oceanstone has surfaced in the LMArena, leading to speculation that it may be Gemini 3.0 Flash.
- Members seem to think itās from Google, and one member speculated itās at least 2.5 Flash level.
- ChatGPTās Consumer Captivation Captured: A member shared a link to OpenAIās article and PDF containing stats from a large-scale analysis of 1.5 million ChatGPT conversations.
- The study claims to be the most comprehensive study of actual consumer use of AI ever released, though privacy concerns were raised regarding the methodology.
Cursor Community ā· #general (483 messagesš„š„š„):
Cursor's Linter Errors, GPT-5 Output Changes, Terminal Instances Hanging, Auto Mode Billing, OpenAI Platform UI Changes
- Billing cycles affect Cursorās auto mode: A user asked about Cursorās auto mode and others reported that auto mode will no longer be free and billing cycles after the 15th wonāt be free either.
- Cursor blamed for deleting Netlify Account: A user reported that Cursor deleted their Netlify account and removed the app, but another user explained that Cursor IDE has no integration, permission, or capability to delete external accounts.
- A user recommended exporting the chat logs to investigate the issue further.
- Auto Mode isnāt cheap!: Users discussed new pricing for Auto and if it was actually using GPT5, resulting in users wasting money, with input pricing the same as GPT-5, costing around $1.25/1M.
- GPT-5 and Sonnet 4 Duel for Code Supremacy: Users debated the merits of GPT-5 versus Sonnet 4 for coding tasks, with one user finding that Sonnet 4 is better at following designs and others praising GPT-5ās superiority when building from scratch.
- A user advised using Sonnet to generate a meta prompt for GPT-5, combining the strengths of both models.
- Too many tokens, too little Ultra: A user complained about running out of Ultra plan credits quickly while creating websites and said idek how.
- A user said that creating multiple websites, debugging, handling Typescript issues, and long files are the main causes for consuming many tokens.
Cursor Community ā· #background-agents (1 messages):
Docker permissions for agent users, Manual VM setup
- Docker permissions for agent users need setup: A user setting up Docker in a manual VM asked how to ensure the agent user has Docker permissions.
- They mentioned adding the Ubuntu user to the Docker group and needing to run `newgrp docker` in a shell.
- `newgrp docker` in bashrc causes agent boot hang: The user tried running `newgrp docker` in `bashrc`, but this caused the agent to hang when booting up.
- The user is seeking advice on the correct way to configure Docker permissions for the agent.
Eleuther ā· #general (339 messagesš„š„):
Low Bit Pythia, TinyStories Warmup, Muon Optimizer, RoPE analysis, MXFP4 Quantization
- Low Bit Baselines for Pythia sought: A member inquired about FP4 and FP8 versions of Pythia to establish a baseline for low bit training, seeking āmid-trainingā checkpoints.
- Another member suggested to write up what you wanna do, why itās interesting, and what compute you need.
- TinyStories Warmup data is bad: It was cautioned that using TinyStories for warmup data can permanently lower model capacity and cause a model trained with it to perform poorly.
- A member argued for maintaining a high learning rate (LR) during FineWeb start, allowing the model to adapt quickly, and sharing a graph as evidence.
- Muon Optimizer has deep math: A third-year math student sought guidance on approaching deep learning rigorously and was pointed to deep math underpinning DL, as well as recent work on optimizers like Muon, and pointed to this paper.
- RoPE ratios may be under 100%: In a discussion about positional encoding in transformers, it was mentioned that many people use RoPE on only about 25% of the dimensions and was explained in this article.
- The conversation covered topics from standards to the effect of gigantic thetas, and how RoPE is a nightmare for interpretability.
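A toy sketch of partial rotary embeddings in the NeoX style, rotating only the first 25% of dimensions; the fraction and shapes are illustrative, not any specific model's config:

```python
import numpy as np

def partial_rope(x, positions, frac=0.25, base=10000.0):
    d = x.shape[-1]
    r = int(d * frac) // 2 * 2                    # rotated dims, rounded to even
    half = r // 2
    inv_freq = base ** (-np.arange(half) / half)  # per-pair rotation frequencies
    ang = positions[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2, rest = x[..., :half], x[..., half:r], x[..., r:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)  # remaining 75% untouched

q = partial_rope(np.random.randn(8, 64), np.arange(8))
```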
- MXFP4 Quantization Performance: A member asked about quantizing a model from FP16 to MXFP4 and was directed to the last page of the appendix of the torchao paper for a summary of the API.
- Another member asked whether this approach works on MXFP4 without a significant performance drop.
Eleuther ā· #research (90 messagesš„š„):
Gauss Lean code, scaling inference tokens, Fractured Entanglement, Neuron Lifespan
- Gauss generates Lean code: Gauss was used to produce ~25,000 lines of Lean code, comprising over 1,000 theorems and definitions in a Lean environment.
- It relies on natural language scaffolding supplied by human mathematicians, and requires high-level expert guidance according to this tweet.
- Scaling Inference Tokens at Test Time Explored: A new paper measures the effect of scale and thinking on straightforward execution of long tasks, revealing that smaller models with 100% accuracy fail faster than larger ones in multi-turn scenarios due to per-step accuracy degradation when seeing prior mistakes.
- One member suggested that if test time scaling is it, then weāve only scratched the surface and Iām anticipating scaling to trillions of inference tokens at immense throughput to give unprecedented performance gains but only if we can solve error accumulation.
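The compounding-error effect is simple to see on paper; the 99% per-step accuracy below is an illustrative assumption, not a number from the paper:

```python
import math

p = 0.99  # assumed per-step accuracy
for n in (10, 100, 500):
    print(f"{n} steps -> P(no mistakes) = {p**n:.3f}")  # 0.904, 0.366, 0.007

# Horizon at which the odds of a flawless run drop to 50%:
print(math.log(0.5) / math.log(p))  # ~69 steps
```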
- Fractured Entanglement Under the Microscope: Discussion revolves around a paper on Fractured Entanglement, with one member noting the paperās SDG experiment is too limited and not fully representative of the engineering in LLMs, referencing the biology of LLMs paper from Anthropic.
- The hypothesis is that maybe there is a regularizer that minimizes these fractured representations, citing that Googleās NanoBanana model has smaller amount of fractured representation, because of better character consistency.
- Neural Lifespan Mechanism Proposed: A member coded a tiny prototype where each neuron has a life score based on prediction correctness, with neurons dying (weight set to zero) when the score reaches zero.
- Another member introduced the idea of Neural Darwinism, where neurons that are useful to existing brain pathways will reproduce by being more likely to be used by additional brain pathways, while others will fade into irrelevance.
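A hedged reconstruction of the prototype's idea; the scoring and update rule here are invented for illustration, not the member's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4))   # 16 hidden units -> 4 classes
life = np.full(16, 5.0)        # per-unit "life score" (assumed starting value)

def step(h, target):
    correct = (h @ W).argmax() == target
    # active units gain life on a correct prediction and lose it otherwise
    life[h > 0] += 1.0 if correct else -1.0
    W[life <= 0] = 0.0         # a unit "dies": its outgoing weights are zeroed

step(rng.random(16), target=2)
```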
Eleuther ā· #scaling-laws (2 messages):
New NN Architectures, New Chip Architectures, New MoE Architectures, PyTorch, Novel Infra
- Architectural Innovations Abound!: Developers are actively creating new NN architectures, new chip architectures, and new MoE architectures.
- The same team created PyTorch and are now innovating on the full stack.
- Infrastructure Stack Revolution: Theyāre throwing compute at novel infra for the whole stack.
- This indicates a significant investment in resources to support these advancements.
Eleuther ā· #lm-thunderdome (29 messagesš„):
Model Calibration, AI Safety Concerns, Few-Shot Evaluation, BLIMP Benchmark Issue, Verifiable Rewards
- LLM Calibration Sane-Washing Concerns Emerge: Members expressed concern that improving model calibration might sane-wash models without addressing underlying representation problems, potentially impeding other improvements.
- The worry is that improving calibration gives models a trivial shortcut by learning the behavioral correlate of humility without actually improving reasoning or world-modeling.
- Few-Shot Override Fails for BLIMP Benchmark: During evaluations, it was discovered that the few-shot override via CLI did not work for the BLIMP benchmark due to a specific configuration in the task.
- The benchmark compares the log-likelihood of correct/incorrect sentence pairs, rendering few-shot learning inappropriate in its current formatting; it was later determined that fewshots [are] not really appropriate the way its formatted.
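For context, the scoring the benchmark performs looks roughly like this; `gpt2` stands in for whichever model is being evaluated:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def loglik(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss       # mean NLL per predicted token
    return -loss.item() * (ids.shape[1] - 1)     # total sentence log-likelihood

good, bad = "The cats sleep.", "The cats sleeps."
print(loglik(good) > loglik(bad))  # True if the model prefers the correct form
```

Since each sentence in the pair is scored directly, there is no prompt for few-shot examples to meaningfully condition on, which is why the few-shot override has no effect here.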
- Calibration as Part of Reasoning: It was argued that calibration is integral to reasoning, as being calibrated about likely payoffs is helpful when searching a large space of possible reasoning steps.
- A recent study trains calibration using verifiable rewards which may result in rational updates based on improved capabilities, not just models saying I donāt know more often, suggesting improvements in epistemics might matter.
- Shortcut Concerns about verifiable Rewards in Calibration: Concerns were raised that verifiable rewards for calibration might be shallow and lead to models learning calibration through brute force rather than genuine epistemics.
- There are questions whether models learn general best practices for epistemics applicable to novel distributions or calibrate via shortcuts, potentially resulting in fake calibration.
GPU MODE ā· #general (10 messagesš„):
Memory Bandwidth Bounds, CUDA Dynamic Parallelism, Valuable Training Data
- Memory Bandwidth Binds Training Throughput: A member noted that training throughput is often memory bandwidth bound and questioned why larger models with dominant matmul/attention flops are still affected.
- Another member explained that despite a large total batch size, the per-GPU batch size can be small, sometimes as low as 1 per H100 for an 8B model, impacting memory bandwidth.
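A roofline-style sanity check of that explanation; the hardware numbers are rough H100 figures, and the model is simplified so each parameter is read once per pass with ~2 flops per parameter per token (training scales both sides by similar constants, so intensity still grows with tokens in flight):

```python
params = 8e9          # 8B model
peak_flops = 1e15     # ~1 PFLOP/s bf16 (illustrative)
peak_bw = 3.35e12     # ~3.35 TB/s HBM (illustrative)
ridge = peak_flops / peak_bw  # ~299 flop/byte

for tokens in (1, 8, 64, 512):
    # flops: 2 * params * tokens; bytes: params * 2 (bf16 weights read once)
    intensity = (2 * params * tokens) / (2 * params)  # = tokens flop/byte
    bound = "bandwidth" if intensity < ridge else "compute"
    print(f"{tokens:>3} tokens/GPU -> {intensity:>4.0f} flop/B ({bound}-bound)")
```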
- CUDA Dynamic Parallelism Examined: A member inquired about recent examples of CUDA dynamic parallelism in AI models, suggesting dynamic patch sizes, sparse attention, and the Mirage framework as potential use cases.
- The member speculated that Mirage uses a manager kernel to maintain a queue and launch sub-kernels, facilitating shmem-like compute and communication kernel fusion.
- Training Data Valuation Explored: A member proposed measuring the value of training data to reward contributors of high-impact data and shared a link to a relevant X post.
- The concept revolves around the idea that not all training data is equal, and identifying valuable data can significantly improve model training efficiency.
GPU MODE ā· #cuda (26 messagesš„):
PTX SASS Compilation, cuModuleLoadDataEx performance, Flash attention optimization
- PTX Isnāt SASSy Enough: Even with PTX, some SASS instructions canāt run, and the LLVM PTX backend only allows access to 13 special registers, according to this blog post and LLVM documentation.
- Kernel Compilation Bottleneck: A member is seeking advice on compiling many one-off CUDA kernels (PTX -> SASS) using the nvptxcompiler API at runtime, aiming to minimize bottlenecks from cuModuleLoadDataEx when using small launch sizes and frequent module unloading.
- It was suggested to batch kernels into a small number of modules to reduce serialization overhead, and to leverage the nvptxcompiler API for its non-serialized compilation.
- Flash Attention Vectorized: A member released parts 4 & 5 of their series on building flash attention from scratch.
- Part 4 covers vectorized bank conflicts and swizzling, while part 5 covers common optimizations used in Cutlass.
GPU MODE ā· #torch (11 messagesš„):
Kernel Registration, Custom Ops, Torch Function Optimization, Ops Fusion for torch.compile
- Kernel Registration Insufficient for Fusion?: A member asked if itās straightforward to register a kernel so that it can perform certain operations, while another responded that registration alone isnāt enough for fusion, using a Triton matmul example.
- Specifically, there wonāt be fusion with bias/addition without broadcasting.
- Metal Flash Attention Custom Op Shines: A member created a custom op using Apple Metal for efficient flash attention in PyTorch.
- The author noted that it works well even with the required Metal element caching to make it performant and is now working on vectorizing the causal attention masking.
- Torch Function Optimization Tool Sought: A member inquired about a tool to optimize torch functions to CUDA.
- No specific tools were recommended in the provided conversation.
- DIY Ops Fusion via torch.compile: A member shared that itās possible to build custom ops fusion for torch.compile using the PyTorch documentation.
- While intrigued, another member admitted they hadnāt tried it themselves.
GPU MODE ā· #cool-links (1 messages):
PTX, CUDA PTX Introduction
- PTX intro resource surfaces!: A member shared an introductory resource about PTX at philipfabianek.com.
- PTX demystified: The post provides a good introduction to PTX, the parallel thread execution virtual machine and instruction set architecture (ISA) used by NVIDIA.
GPU MODE ā· #jobs (5 messages):
AI Infra Startup Hiring, Red Hat AI Hiring, Zig for AI
- AI Infra Startup Luring Low-Level Luminaries: An AI infra startup is recruiting low level devs for a Zig / C / C++ / Cuda / Python stack, offering TC: 250K+ and year round internships.
- Experience in networking, compilers, and OS is a plus.
- Red Hat Rockets Recruitment for AI Roles: Red Hat AI is hiring software engineers at multiple levels with experience in Golang, C++, Python, CUDA, GPU kernels, Triton, CUTLASS, PyTorch, vLLM, Kubernetes, and Open Source.
- Those interested should email a short summary of their background and resume (address in LinkedIn profile), and can learn more about their work via their newsletter.
- Zig Zagging into AI?: A member mentioned that Zig may be related to AI given HF uses Rust for fast tokenizers, and Zig is an alternative to Rust.
- Another idea is they might be doing video streaming sort of stuff and need it for their frontend.
GPU MODE ā· #beginner (10 messagesš„):
CUDA, RAPIDS, CUDA-X, Batch Gradient Descent, Nvidia Jetson
- CUDA Core Concepts Clarified: Members suggested itās better to learn CUDA with C++, mentioning that RAPIDS and CUDA-X might be most relevant for enhancing parallelism with batch gradient descent or mini-batching.
- They noted that if you can enhance parallelism with batch gradient descent then thereās little need for SGD.
- Jetson Channel Judged Dormant: A user inquired about a channel for Nvidia Jetson or similar, and another member confirmed the existence of a channel but noted it hasnāt been particularly active.
- Leaderboard Learning Loophole Located: A user sought access to leaderboard submissions for past competitions to learn from top performers, specifically mentioning this leaderboard.
- A member provided a link to the AMD competition data on Hugging Face, noted correctness issues with the PMPP v1 evaluation, and mentioned potential future support for entries on the site via this Github repo.
- Triton Touted for GPU Training: A member inquired about learning deeper GPU programming for kernel optimization, questioning if starting with Triton puzzles is sufficient.
- Another member responded that starting with triton puzzles is a great way to start learning some concepts and getting the feel for gpu programming and shared a link to the CUDA C Programming Guide.
GPU MODE ā· #torchao (1 messages):
autoquant_v2, batch size 1, runtime errors, autotune stage, dtypes
- AutoQuantV2 Woes with Batch Size 1: A user asked whether autoquant_v2 is recommended for batch size 1, mentioning it appears to have code specialized for that batch size.
- The user also reported that batch size 1 causes runtime errors during the autotune stage for some dtypes, suggesting compatibility issues or limitations with autoquant_v2 under those conditions.
GPU MODE ā· #rocm (7 messages):
Iris Lecture, Symmetric Memory, RDMA support, iris.load/store, tl.load/store
- Iris Lecture Sparkles Symmetric Memory musings: The new Iris memory model lecture was well received, prompting comparison with symmetric memory in CUDA and discussion around implementation differences like the global symmetric heap.
- The main difference is that in Iris we build a global symmetric heap up front and slice tensors from it, so address translation only needs a single heap bases pointer per rank, which will make supporting RDMA in the future easier.
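A toy sketch of that single-base address translation, with heap bases as plain integers purely for illustration:

```python
heap_bases = {0: 0x10_0000, 1: 0x40_0000}    # one symmetric heap base per rank

def translate(ptr: int, src_rank: int, dst_rank: int) -> int:
    offset = ptr - heap_bases[src_rank]      # offset inside the symmetric heap
    return heap_bases[dst_rank] + offset     # same slice on the remote rank

assert translate(0x10_1234, src_rank=0, dst_rank=1) == 0x40_1234
```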
- `iris.load/store` incurs perf penalty vs `tl.load/store`: Using `iris.load/store()` instead of `tl.load/store()` for local memory access introduces a translation overhead, so `tl.*` operations are recommended for now.
- The translation overhead will still be there; it's minimal and should be cached, but still some extra code. A missing `if` statement in the translate function could provide a fast path for the local access case in the future.
GPU MODE ā· #intel (18 messagesš„):
Intel CPU/GPU Optimizations, IPEX Deprecation, SGLang AMX Kernel, PyTorch integration
- Intel Optimizations: CPU/GPU - Whatās the Deal?: A user inquired about leveraging Intel-specific optimizations on B50s with AMX-enabled servers, asking whether IPEX could utilize both CPU/GPU optimizations.
- It's complicated: it was suggested that one might get away without IPEX, since `at::native::cpublas::brgemm` can dispatch to AVX-512 if the CPU doesn't support AMX, iirc.
- SGLangās Secret Sauce: Fused MoE Kernel with AMX: Discussion emerged around SGlangās use of AVX512 instructions and AMX, with links provided to the relevant code using fused MoE kernel with AMX.
- The conversation explored how the kernel in SGLang uses AMX via `at::native::cpublas::brgemm` and can dispatch to AVX-512 if the CPU lacks AMX support.
- IPEXās Fate: Being Deprecated?: A user questioned the purpose of IPEX, leading to a discussion about its status, with the assertion that itās more or less being deprecated in favor of upstreaming as much as possible into PyTorch or other more relevant projects.
- Counterpoint was that IPEX has been an experimentation platform for Intel to push their most aggressive and new optimizations, like torch nightlies.
- Intel Confirms: IPEX Development Discontinued: Intelās official stance involves discontinuing active development on IPEX after the 2.8 release.
- Intel is focusing on developing new features and supporting upcoming platform launches directly within PyTorch, after successfully upstreaming most of our features and optimizations for Intel® platforms into PyTorch.
GPU MODE ā· #metal (11 messagesš„):
Metal Flash Attention Bridge, Quantised Attention, Metal Command Buffer Timeout
- Universal MFA Bridges to Other Languages: A member is building a bridge for Metal Flash Attention to other programming languages, with it already working in C, Rust, and Obj-C.
- Quantised Attention added to Universal MFA: The member added quantised attention with backprop to their universal MFA repo, seeing a speedup for large shapes and a slowdown for small ones, with associated memory improvement.
- Request for Metal Command Timeout Method: A member is seeking help on how to set a timeout on a metal command buffer to prevent long execution times of metal kernels.
GPU MODE ā· #self-promotion (9 messagesš„):
LLM Negotiation Protocol, Metal Flash Attention Swift Adapter, Rust Bindings vs cv2, CuTe Partitions analysis, Gated Attention
- Decentralized Commerce Protocol Debuts: A decentralized commerce protocol for LLM-to-LLM negotiation, built with Rust has been released on GitHub.
- Swift Metal Flashes Faster Attention: A language adapter using C FFI for Swift with the original metal-flash-attention for Apple Silicon hopes to have a better implementation of efficient flash attention than Torch.
- A Pytorch SDPA drop-in replacement wrapper thatās still experimental, gets us performance gains as expected, despite translation between Swift Metal ops and Python through C FFI.
- Rust Crushes cv2 in Performance: A project beat python cv2ās performance using pyo3-based Rust bindings built directly on top of OpenCV, yielding a 1.25x performance increase over cv2 for single-image operations with better memory management and parallelism.
- CuTeās Partitions dissected: An analysis of CuTeās partition patterns demonstrates how matrix copy can be performed with inner, outer, and thread value partitioning, all achieving good performance as explained in this blogpost.
- Gated Attention and DeltaNet get Explained: An explanation of next gated attention and gated deltanet has been summarized in this document and this other document.
GPU MODE ā· #edge (2 messages):
Smallest model above GPT3.5, Quantization, VRAM requirements
- Quest for petite performer surpassing GPT-3.5: A member inquired about the smallest model that could run on edge with performance surpassing GPT-3.5, regardless of whether itās quantized.
- The primary concern was finding a model that is both decent and small, with minimal VRAM requirements during inference.
- Balancing Act: Decent Performance with Minimal VRAM: The user emphasized the need for a small model suitable for edge deployment, prioritizing performance over GPT-3.5 while minimizing VRAM usage.
- The inquiry highlights the trade-off between model size, performance, and resource consumption in edge computing environments.
GPU MODE ā· #submissions (77 messagesš„š„):
MI300x8 Leaderboard, Rank + 1 Trick, AMD Rules Clarification, all2all vs gemm+rs kernels, kernel dev
- MI300x8 Cranks New All2All Leaderboard Times: A user achieved first place on MI300x8 with 373 µs, later followed by 578 µs, then 547 µs, and later another user got first place on MI300x8 with 546 µs on the amd-gemm-rs leaderboard.
- Other successful submissions on MI300x8 ranged from 1495 µs to 1859 µs.
- Rank + 1 Replacement for GEMM of Moe is Banned: A user inquired whether the rank + 1 trick was banned, as it circumvents the need for all2all data transfer, and an organizer confirmed that it is.
- The organizer clarified that submissions abusing the rank + 1 weighting are disallowed, but the original submission used a trick that is allowed, but itās abusing the fact that weights are just rank+1, so weāll be deleting it; a later submission will focus solely on the rank + 1 operation.
- AMD and GPU Mode Clarify Rules on Kernel Submissions: Organizers warned against making significant rule changes mid-competition, as it can introduce inconsistencies.
- Organizers advised a user to focus on kernel development and clarified that AMD and GPU Mode are responsible for rules and titles, but should take feedback privately to Daniel to clarify the rules as needed and if necessary clarify how eval.py should be fixed.
- All2All Kernel Confusion Clarified: Organizers restated which kernel each problem maps to (first problem => all2all, second => gemm+rs), since many participants were confused by it.
- The all2all problem requires implementing dispatch and combine kernels with intra-node communication, while gemm+rs is a computation+communication kernel; reference.py defines the kernel logic and needs detailed analysis.
- Trimul Personal Bests Hit New Lows: A user achieved a personal best on A100 with 20.3 ms, and later another personal best on A100 with 20.0 ms on the trimul leaderboard.
GPU MODE ā· #status (11 messagesš„):
MI300x server status, Popcorn-cli timeout issues, Queue overload, Runner downtime, Cluster capacity issues
- MI300x server jobs timing out: Users reported jobs timing out in popcorn-cli and being queued indefinitely on Discord, prompting concern about the MI300x servers.
- One user suggested that the queue is just busy, citing personal success after multiple attempts, while another user promised to investigate the issue upon returning home.
- High submission volume swamps MI300x server: The administrators noted the MI300 is getting about 400 submissions a day.
- The admin notified AMD to request additional runners to handle this volume.
- MI300 Runner Downtime Troubles: It turns out the runners were down.
- Having previously said everything was fine, an admin reversed course to state actually it seems that runners are down, we're only getting 2 runners at the same time.
- Cluster capacity under investigation: Admins are investigating after it was discovered that someone was taking our cluster capacity.
GPU MODE ā· #factorio-learning-env (2 messages):
Eval Infra, PR Review
- Eval Infra Resumption: A member announced they will resume work on eval infra tomorrow afternoon.
- They asked others to review the PR (thanks jack) and provide feedback.
- Awaiting PR Feedback: A request was made for team members to review a Pull Request related to the eval infrastructure.
- The team is asked to give comments on what else is needed for the PR, or confirmation that it is ready to be merged.
GPU MODE ā· #amd-competition (55 messagesš„š„):
Runner Queues and AMD Assistance, amd-gemm-rs Challenge Release, ROCm/iris Integration, PyTorch Version Compatibility, Clarification on amd-all2all
- Runner Queues Trigger AMD Assist!: Due to runner queues, submissions may experience timeouts, but the team has notified AMD and will provide an ETA update soon.
- Using benchmark/test can temporarily alleviate congestion, as these launch only one faster job.
- GEMM-RS Challenge Sets the Scene!: The second problem, amd-gemm-rs, challenges participants to implement distributed GEMM + reduce-scatter.
- The problems are open source on our github.
- Iris Integrates Nicely!: The cool iris project is now available.
- To learn more, check out the authorās talk on YouTube.
- PyTorch 2.8.0 Fails to Play Nice: A member encountered an undefined symbol error using `torch load_inline` and PyTorch 2.8.0.
- The member fixed the issue by installing a nightly PyTorch ROCm build via `pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.4`.
- All2All Algorithm Analyzed!: In the amd-all2all challenge, the dispatch output should group tokens belonging to the same experts together, similar to the reference kernel.
- The competition focuses on fast communication, not implementing grouped_gemm, with computation emphasized in the second and third problems.
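A sketch of the dispatch-side grouping described above (not the competition's reference implementation): a stable argsort on expert ids makes each expert's tokens contiguous:

```python
import numpy as np

tokens = np.random.randn(8, 16)                  # (num_tokens, hidden)
expert_ids = np.array([2, 0, 1, 2, 0, 1, 0, 2])  # routing decision per token

order = np.argsort(expert_ids, kind="stable")
grouped = tokens[order]                 # expert 0's tokens, then 1's, then 2's
counts = np.bincount(expert_ids)        # per-expert counts, needed by combine
# combine inverts the permutation: output[order] = processed_grouped_tokens
```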
GPU MODE ā· #cutlass (10 messagesš„):
CuTeDSL swizzle patterns, PTX docs discrepancies, TF32 datatype
- Swizzle Showdown: CuTeDSL vs. PTX Docs!: A user found a discrepancy between CuTeDSL and PTX docs regarding the swizzling atom for the TF32 datatype, specifically with `Swizzle<3,4,3>`, and shared screenshots.
- The user believed the CuTeDSL implementation to be accurate, replicating results from Lei Mao's blog.
- Swizzle Secrets: Decoding PTX Docs!: A user clarified that the PTX doc uses a 128B swizzle (M=4, S=3 and 4 + 3 = 7) on address bytes, while the composed layout swizzle operates on element indices, also mentioned here.
- They suggested using `cute.make_swizzle(3, 2, 3)` instead of `cute.make_swizzle(3, 4, 3)` to produce the same result.
- PTX Puzzle Solved: Recovering Figures!: A user detailed how to recover figures from the PTX docs, involving adjustments to the swizzle pattern (e.g., using `(i,2,3)`), atom size, matrix shape scaling, and a final division by `//4`.
GPU MODE ā· #singularity-systems (11 messagesš„):
ML and Systems Book Design, GPU access limitations, Autograd machinery development, PicoTorch revitalization, Textbook to lower barriers for community
- Book Design Tensions Mapped: The design of a Machine Learning (ML) and systems-focused book faces internal tension regarding top-down vs. bottom-up order, and the presentation of what vs. how, and the author is mapping out the beginning, ending and chapter relations.
- Due to limited time and a lack of on-prem/bare-metal GPU access, achieving the initial goals for the book is proving challenging.
- Autograd Ascends, MLP Arrives: The first part of a book will focus on a bottom-up approach, developing all the autograd machinery and culminating in an MLP language model.
- Part 2 will cover transformers, flash attention, and possibly diffusion language models, while part 3 will delve into compilation.
- PicoTorch Project Plugs Ahead: The PicoTorch project, which previously ran a forward pass of Karpathyās MLP without GPU acceleration, needs revitalization.
- Chapter 1 is underway, sketching out intuitive plots, model circuits, math, and code snippets for each concept introduced.
- Textbook Aims to Aid GPU Mode Community: A textbook is being created to lower the barriers for the GPU mode community to create something similar to PyTorch from scratch.
- The projectās tagline could be: āwe are making you implement PyTorch from scratchā.
GPU MODE ā· #general (8 messagesš„):
Kernel Development Path, GPU Mode Kernel Competition, Triton Benchmarks, BioML Trimul Kernel Competition
- Kernel Dev Starter Asks for Help: A new user asked for guidance on the proper path to follow for kernel development.
- They also inquired about expectations, submission details, and requirements for the GPU Mode kernel competition.
- BackendBench for Triton Benchmarking: A user asked about good benchmarks for writing functions in Triton, similar to KernelBench.
- A member suggested BackendBench, which helped them benchmark about 84 PyTorch operators written in Triton.
- BioML Trimul Kernel Comp Closing Soon: There are only 14 days left to participate in the BioML trimul kernel competition.
- The prize is going to be never before seen swag designed and shipped by the organizers.
GPU MODE ā· #multi-gpu (1 messages):
ā
- Multi-GPU Channel Quiet: The multi-gpu channel log contained no active discussions, links, or other content suitable for summarization.
GPU MODE ā· #low-bit-training (2 messages):
Video Models, Low-bit-training, GPU mode hackathon
- Quest to Survey Video Models and Low-Bit-Training: A member is conducting a survey of video models and low-bit-training for a submission to the GPU mode hackathon/irl meetup in October.
- They are seeking pointers to research papers specifically focusing on low-bit-training techniques applied to video models.
- Home Video Models GitHub List Shared: The member shared a GitHub repository containing a collection of video models.
- The repository includes work on LLMs (likely MobiChampsās), as well as some DiT and LLM training approaches like Quartet.
GPU MODE ā· #irl-accel-hackathon (37 messagesš„):
Multi modal inference, Training Optimisation, Gated DeltaNet, Sparse GNN ideas, Low-bit-training
- Single Machine Video Model Training Revolution: A member shared a paper on training a large video model on a single machine in a day, achieving this with FP4/FP8 precision, though the paper uses FP16 as a proof of concept.
- DeltaNet Dreams of Context-Parallel Kernels: A member is looking to form a team to implement a context-parallel version of the kernels for super long-context training using GatedDeltaNet from NVlabs, noting its use in Qwen 3.
- Sparse GNN Ideas Spark Interest: A member expressed interest in sparse GNN ideas, particularly those with implications for topology, compute graphics, and vector databases, linking to a relevant arxiv paper.
- Blackwellās Tensor Cores Tempt Sparse Matrix Multiplication: Inspired by Blackwellās tensor cores, a member considered problems involving block-sparse format and NVIDIA tensor cores, linking to a blog post on accelerating matrix multiplication.
- In-Person Hackathon Confirmed: Members confirmed that the hackathon is an in-person event, as online hackathons are harder to design, and more similar to their kernel competitions.
LM Studio ā· #general (180 messagesš„š„):
Playwrite MCP issues, Local Wikipedia Access for Small Models, Qwen/Qwen3-Next-80B-A3B-Instruct and llama.cpp, SIA-1: Self Improving AI Agent, lambda stack vs lm studio
- Playwrite MCP throws Connection Errors: A user reported getting errors when starting Playwrite MCP with LM Studio, seemingly related to connection issues, but they noted it may be user error.
- Another user jokingly responded that it works on my machine.
- Most powerful AI model runs best with RAM: When a user inquired about the most powerful AI model, a member mentioned that you can use every model with RAM, stating that the most powerful AI model you can locally run is Kimi K2 in BF16, requiring 2.5TB RAM.
- It was clarified that however big the filesize is, you need that much memory (VRAM + RAM combined) to load it, and then some more for context.
- Users ask for help to Access Wikipedia Articles: A member asked for help regarding tools to access Wikipedia articles either online or offline for use with small models, and a user shared the LM Studio Wikipedia MCP.
- It was mentioned that creating a semantic index locally is complex, as the local Wikipedia extract would lack a search and LLMs are not THAT great at guessing without some fuzzy search.
- NousResearchās Mephisto Discusses World Sim: A user shared a YouTube video featuring NousResearchās Mephisto, who gets into technical details about base models and how Instruct models are essentially base models that have been trained to roleplay as instruct models.
- Mephisto then discusses how NousResearch started World Sim, which may be definitely the next steps in the world of Agents.
- SIA-1: The AI That Evolves Itself: A user introduced SIA-1, claiming to be the worldās first truly self-improving AI agent https://sia-1.net, which learns to improve its own code, generation after generation.
- Members vibed against the agent, with one member asking pls tell whoever vibe-coded that to use a better model.
LM Studio ā· #hardware-discussion (113 messagesš„š„):
KiCad and LLMs for Circuit Design, SBC for Searxng vs Obsidian, GPT-OSS-20B and VRAM Allocation, Nvidia P40 EOL, RTX 5070 and LLM Performance
- LLMs Set Houses Ablaze via Circuit Design: A member expressed caution about using LLMs for circuit design with tools like KiCad, emphasizing the need to understand the underlying principles to avoid potentially disastrous outputs.
- They added that calling a language model an āAIā is hugely misleading.
- SBC Specs Spark Database Debate: A member asked about using Obsidian on a Raspberry Pi with 4GB RAM as an alternative to a slow database or Searxng setup.
- Another countered that even a potato is fast enough for certain LLM tasks, and suggested focusing on the access pattern rather than the database size.
- GPT-OSS-20B Struggles with VRAM: A member reported issues loading gpt-oss-20b with a 128k context on a 7900xtx (24GB VRAM), despite expectations it should fit.
- Another user suggested using KV quantization at F16, which reduced VRAM usage to around 18.5GB, and also advised closing GUI-heavy apps to free up VRAM.
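The KV-cache math behind such savings is easy to sketch; the layer/head numbers below are illustrative assumptions, not gpt-oss-20b's actual config:

```python
layers, kv_heads, head_dim = 24, 8, 64   # assumed architecture
context = 128_000

def kv_cache_bytes(bytes_per_elem: int) -> float:
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem  # 2x: K and V

print(f"f32 KV cache: {kv_cache_bytes(4) / 2**30:.1f} GiB")  # ~11.7 GiB
print(f"f16 KV cache: {kv_cache_bytes(2) / 2**30:.1f} GiB")  # ~5.9 GiB after quantizing to F16
```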
- Nvidia P40s face End-of-Life: A member considered buying cheap Nvidia P40s, but worried about the upcoming end of driver updates and CUDA support.
- Someone noted that Nvidia is dropping CUDA support for Maxwell, Pascal, and Volta GPUs with the next major toolkit release, but another user reported being able to get them for only $200.
- RTX 5070 Specs Trigger Upgrade Deliberation: A member with an RTX 5070 (16GB VRAM), i5 14600k, and 32GB DDR5 sought advice on suitable AI models for website development, or if an upgrade is needed.
- One user suggested forking out for GitHub Copilot for coding tasks, since 16GB VRAM isn't ideal for larger models and agentic coding.
Moonshot AI (Kimi K-2) ▷ #general-chat (265 messages🔥🔥):
Kimi vs GPT-5, Augment code extension, Kimi K2 Groq, interactive preview for LLM processes, API Keys vs Login Accounts
- Kimi-K2 is scripting Supreme: Members debated the strengths of Kimi-K2 for scripting, with some claiming it outperforms paid Gemini, especially when using Groq for coding.
- Others found GPT-5 better, one member noted that GPT-5 is really good at web UI and nextjs while Kimi has a really good research mode.
- Augment coding with Kimi: Members discussed using Kimi with the Augment code extension in VS Code, where users can prompt for code changes and fixes from various models.
- One user described using Augment as a way to apply prompts and fix code by looping in Gemini or Kimi.
- Slides Feature Sparks UX Brainstorming: One member highlighted the impressive interactive slide generation feature in Kimi, praising the real-time updates and smooth feel, suggesting that an interactive preview of what's happening is important for LLM-based processes.
- They proposed a similar approach for a Godot game engine agent, envisioning real-time updates during code generation, with interactive previews of nodes and scripts.
- Groq Hosted Kimi-K2 causes Concern: One user asked if there were issues with Kimi K2 hosted on Groq, another user ranted about the need to remove the 3-hour message cap.
- The user also requested the ability to edit previous prompts, stating Every other AI platform already has this.
- API Keys vs Account Logins in CLIs: A user inquired about using Kimi K2 with CLIs like Claude Code and Qwen Code without an API key, instead of using a kimi.com account login.
- Another user suggested using API keys for Claude Code, providing a command example: `export ANTHROPIC_AUTH_TOKEN=sk-YOURKEY` and `export ANTHROPIC_BASE_URL=https://api.moonshot.ai/anthropic`.
HuggingFace ▷ #general (115 messages🔥🔥):
HDF5 Python Library, FineWeb pretraining, Hugging Face Spaces storage, Qwen3-Next modeling and Layernorms, Models for open world RPG RP
- Hugging Face Spaces' Storage Situation!: In HF Spaces, uploaded and generated files are stored on disk within the virtual machine; they cannot be accessed arbitrarily from outside, and they disappear when the Space is restarted unless the paid Persistent Storage option is used.
- However, there are rare cases where, due to mistakes like everyone sharing the same filename, someone else's generated data became visible.
- New FinePDFs dataset liberates Tokens from PDFs!: A new FinePDFs dataset was released, containing about 3 trillion tokens across 475 million documents in 1733 languages, sourced exclusively from PDFs.
- When mixed with HTML-based corpora, it delivers a significant performance boost across benchmarks and they recommend keeping the proportion of PDF data below 25% of the overall dataset.
- Sin and Cosine Function deep-dive!: A member asked why a pair of sine and cosine waves is used in positional embeddings, and why a single sine wave can't be used instead.
- Another member responded that using both sine and cosine lets the model represent positional information in a way that preserves relative distance and can be linearly combined since sine alone would cause ambiguity.
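For reference, the standard sinusoidal encoding from the original Transformer paper interleaves sine and cosine at geometrically spaced frequencies; a minimal sketch:

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dims: sine
    pe[:, 1::2] = np.cos(angles)   # odd dims: cosine
    return pe

pe = sinusoidal_positions(128, 64)
```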
- Qwen3-Next Model Swaps Layernorm for RMSNorm: The Qwen3-Next model card mentions stability optimizations like zero-centered and weight-decayed layernorm, but it's actually using RMSNorm in Transformers.
- It was clarified that there's no layernorm involved at all, just RMSNorm with a zero-centered gamma and weight decay on the norm scale during training; at inference it's plain RMSNorm.
HuggingFace ▷ #today-im-learning (6 messages):
Agents Course, smol course, MCP course, LoRA finetuning, Transformers architecture
- Deep Dive into Agent Dev: A member suggested using the 80-20 rule when learning about agents, recommending to concentrate on building directly, as that 20% hands on will teach you 80% along the way.
- They believe deep diving is 80% boring stuff.
- smol Course Deadlines: A member inquired about the deadline for the smol course and whether they could still take the MCP course and get certified.
- No answers were provided.
- LoRA fine-tuning challenges: A member is continuing their journey in fine-tuning LLMs with LoRA and finding it challenging but useful.
- The member stated that they are learning how to control my stress.
- Transformer Decoders study plan: A member plans to do the smol course and study Transformers architecture (decoders).
- No other details were given.
- Agent Course signup issues: A new member is trying to sign up for the agent course, but is not seeing it listed with the MCP and smol courses.
- They are seeking assistance to resolve this issue.
HuggingFace ▷ #cool-finds (2 messages):
HF models, Fine-tuned models
- Hundred Fine-Tuned Models Land on HF: A member shared that a friend and mentor has published 100 fine-tuned production level AI models on HuggingFace in 8-9 months.
- The member is requesting to recognize the hard work.
- A Model Maker's Marathon: A close contact has reached a century of production-level AI models on Hugging Face, showcasing dedicated efforts over the past few months, generating discussion and interest.
- The focus is on celebrating his remarkable achievement and its contribution to the community.
HuggingFace ▷ #i-made-this (13 messages🔥):
Voxtral finetuning, Dialectical Agentic CrossSphere AI, Refrag Efficient LLM Compression, Image to Space
- Voxtral makes Speech Training Affordable: Voxtral enables users with speech impediments or heavy accents to finetune models, costing only $0.26 for an hour of training on an A6000, using tools to make datasets.
- The tool supports finetuning to get it perfect, and one can push the model, dataset, and add a demo space (works with CPU!! Free!!) to Hugging Face.
- Dialectical Agentic CrossSphere AI Enters Hackathon: A user is seeking feedback on their Dialectical Agentic CrossSphere AI, which they entered in OpenAI's hackathon, linked here: OpenAI GPT OSS 20B.
- Another user praised the AIās game, images, and storytelling.
- Refrag Unveiled as Efficient LLM Compression: A user shared their blog post explaining Refrag, a method for efficient LLM compression and curriculum learning: Understanding Refrag.
- The blog post highlights the efficiency and techniques involved in Refrag.
- Image-to-Space Tool Transports HF Repos: A user introduced a new method for transporting repos via an image containing a Hugging Face Space using image-to-space decoder.
- Another user made a PR to improve the toolās functionality (discussion on HF), praising it as super cool and creative.
HuggingFace ▷ #computer-vision (2 messages):
Style Transfer, WCT2 Methods, Segmented Images
- Style Transfer Methods Require Segmented Images: Style transfer methods like WCT2 sometimes require segmented images, presenting a challenge.
- This requirement limits the applicability of these methods in scenarios where obtaining segmented images is difficult or impossible.
- Considerations for Style Transfer Implementations: Implementing style transfer methods such as WCT2 often necessitates careful consideration of image segmentation techniques.
- The need for segmented images can add complexity to the pipeline, requiring additional preprocessing steps and potentially impacting the overall performance.
HuggingFace ▷ #NLP (3 messages):
Qwen2.5-72B fine-tuning, Database for Chat History, Maintaining User Sessions
- Qwen2.5-72B Fine-Tuning: Seek Experts: A member inquired about experiences with fine-tuning Qwen2.5-72B, requesting direct messages from those with relevant expertise.
- Database and User State Discussion Initiated: A member asked how to use a database to store chat history while also maintaining user sessions and user state.
HuggingFace ▷ #smol-course (9 messages🔥):
Fine-tuning course details, VRAM concerns for smaller models, In-person study group in NYC, Leaderboard evaluation for custom use cases, smol-course
- VRAM Woes? Smaller Models to the Rescue!: A member inquired about using even smaller models due to limited VRAM.
- They asked about fine-tuning on custom use cases.
- NYC Study Group Forming!: A member proposed starting an in-person study group in NYC.
- They offered to organize meetups in the city.
- Decoding the Fine-Tuning Course: A member inquired about the start date of the fine-tuning course and where to sign up, seeking clarification on the dynamics.
- Another member provided a link to the smol-course and advised following the org and starting with Unit 0.
- Smol Course Begins!: Members were told to follow the huggingface org and start the course here.
- This will allow them to begin the smol-course and start learning.
HuggingFace ▷ #agents-course (11 messages🔥):
Agent course introductions, Token setting rookie mistake, Unit one introductions
- Newbies start Agent Course: Several new members, including Nouha, Lez, Karthik, Leo Kinyera, and Nay Lin, introduced themselves and expressed their excitement to start the agents course.
- Token Setting Snafu Solved: One member admitted to a rookie mistake of not having their token set, which was quickly resolved.
- Unit One Underway: A member reported working through unit one of the agents course.
Yannick Kilcher ▷ #general (85 messages🔥🔥):
Trade Unions and Fascism, LLMs and Bayesian Inference, AI and Topos Theory, Positional Encoding in Transformers, Deep Learning and Turbulent Flow
- Trade Unions tangle with Fascism Facts: A discussion arose about the relationship between trade unions and fascism, clarifying that while all fascists support trade unions, not all trade unionists are fascists.
- It was emphasized that fascist corporatism involves the co-governing of corporations by employers, employees (via state-sanctioned trade unions), and the state, and thus trade unions are essential to fascism, referencing this discord message.
- LLMs Look at Bayesian Beliefs: A discussion centered around a paper on LLMs and Bayesian inference, with one member noting it demystified some preconceptions about LLMs, referencing Leon Chlon's substack.
- Yannick Kilcher commented that lstms can do in-context learning just fine; it's a property of language modeling, not of architecture, and that a transformer modified to take a sequence of token-position pairs as an argument is completely invariant to ordering, and therefore totally bayesian.
- Topos Theory tantalizes, but totally trashed?: A member shared a paper on the intersection of AI and topos theory (ArXiv link), questioning its legitimacy and practicality, with others chiming in.
- Another member, however, dismissed category theory as completely useless for ML, arguing that the need for generalization from finite datasets necessitates sampling theory and L_2 spaces.
- Sin or Cos? positional perplexities persist!: A discussion about understanding positional encoding in transformers arose, where a member sought advice on how to understand concepts with limited or low-quality data.
- One member explained the use of both sine and cosine by noting that with sine alone it's hard for the model to tell whether two positions share the same angle, since a value of 0.5 could represent either 30 degrees or 150 degrees; however, it was argued that later layers could reconstruct the missing component because sin(2x) = 2 sin(x) cos(x).
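The usual version of that argument is that the sine/cosine pair makes a fixed positional offset a position-independent linear map (a rotation), which sine alone cannot provide:

```latex
% Shifting position p by a fixed offset k is a rotation of the (sin, cos) pair:
\begin{pmatrix} \sin(p+k) \\ \cos(p+k) \end{pmatrix}
=
\begin{pmatrix} \cos k & \sin k \\ -\sin k & \cos k \end{pmatrix}
\begin{pmatrix} \sin p \\ \cos p \end{pmatrix}
% so "attend k tokens back" is one learned linear map, independent of absolute p.
```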
- Deep Learning dives into turbulent Dynamics: A member wondered if deep learning could be explained as reversing turbulent flow, comparing it to reconstructing a large vortex from small vortices, which another member described as schizophrenic.
- In contrast, another member suggested that this paper does what your idle thought are looking… but do it straightforward.
Yannick Kilcher ▷ #paper-discussion (19 messages🔥):
Spiking Brain-inspired Large Models, Anthropic's Research, OpenAI's Research, Decreasing Work-Related Usage, Noise Among Older Cohorts
- Gaslighting Linguistic Analysis Reveals Disinformation Potential: A member joked that the Spanish translation of gaslighting, manipulación psicológica, makes sense regarding potential disinformation use-cases involving Spiking Brain-inspired Large Models.
- Another member shared a link to a paper titled “SpikingBrain Technical Report: Spiking Brain-inspired Large Models” to explore this connection.
- Anthropic & OpenAI Economic Reports: Members looked at reports released on the same day from both Anthropic and OpenAI concerning user behavior with AI.
- The discussion focused on what users are doing with AI, speculating whether one company was attempting to scoop the other with these releases.
- AI-as-Friend use case is NOT in OpenAI report: A member noted that the OpenAI report doesn't cover the “AI as a friend” use case, particularly in light of recent issues with sycophancy.
- The same member observed that work-related usage is decreasing, relative to other usage patterns.
- Smoothness Differences Across Age Cohorts in AI Usage: It was observed that the 18-25 age cohort line on the usage chart is smoother than others.
- One possible reason given for this is due to the 18-25 cohort having the most users or the least noise in their data.
- Older Cohorts face Increased Noise: An observation was made that the noise for older cohorts increases, potentially due to low numbers of available samples.
- This increasing noise could be due to the increasing variance in the data from older demographics.
Yannick Kilcher ▷ #agents (3 messages):
- No relevant agent topics found: Unfortunately, there were no messages found that contained topics of interest, so a good summary could not be created.
Yannick Kilcher ▷ #ml-news (15 messages🔥):
MobileLLM-R1-950M Release, AI Alignment, AI Constitutional Assembly, Cloud Providers Profiting, NVIDIA
- Facebook Releases MobileLLM-R1-950M: Facebook released their new MobileLLM-R1-950M.
- The release aims to bring powerful language models to mobile devices, enabling on-device processing and reducing reliance on cloud services.
- Aligners All The Way Down: A member shared a link to alignmentalignment.ai, commenting Aligners all of the way down…
- The link shares content and research related to AI alignment.
- Constitutional Assembly of AI Designers Suggested: A member linked to a tweet suggesting a constitutional assembly of artificial intelligence designers.
- The link is to a tweet from Shaswat Goel.
- Cloud Providers Cash In on AI Gold Rush: A member stated that the only people making any money off this are cloud service providers and cloud infrastructure providers.
- Another member responded, You know what they say about a gold rush: Sell shovels, with NVIDIA counted as a cloud infrastructure provider.
- AI PDF Editor Ad in AI Safety Video: A user pointed out the irony of seeing an ad for an AI-powered PDF Editor in the description of a video related to AI Safety.
- The user questioned whether anything could be more hypocritical.
Latent Space ▷ #ai-general-chat (101 messages🔥🔥):
MBA-ification of Startups, AI texting concierge poke.com, OpenAI Model Spec Update, Naveen Rao leaves Databricks, GPT-5 “High New”
- Startups Succumbing to MBA-ification: A thread sparked by Michael Seibel laments that CS majors act like MBA grads, chasing fundraising and valuation over building cool things and solving user problems, as can be seen here.
- Replies debate whether this shift is natural late-adoption or a consequence of investor/YC incentives.
- Poke.com Launches AI Texting Concierge: Interaction introduced poke.com, a new AI texting service, along with news of a $15M Series A led by General Catalyst (tweet).
- Some see slick UX and viral storytelling, while others question usefulness, clarity, and the AIās tone; the product texts on your behalf to coordinate get-togethers, dates, travel, etc.
- xAI Pivots to Specialist AI Tutors: Rohan Paul highlights xAIās shift: laying off 500 generalist data annotators while scaling specialist AI tutors 10x (tweet).
- The move narrows human-in-the-loop work to expensive domain experts and leans on automation for routine tasks, aiming to boost precision on high-risk topics.
- S3 Vectors May Slay Vector DBs?: Discussion ensued from this blogpost about whether Amazon S3 Vectors will displace traditional vector databases, as embedding solutions converge on a cost and latency slider from local nvme disk to object storage (s3).
- One user quoted the surprising claim that a popular AI note-taking app spends twice as much on vector search as on OpenAI API calls, and wondered if they should listen more carefully to “RAG is Dead”.
- GPT-5 Codex Upgrades: OpenAI released upgrades to Codex, their coding model, including a new version of GPT-5 and a small recap post (link).
- One user reported that the `--resume` flag broke during the update and would not let them restore their conversation.
Latent Space ▷ #genmedia-creative-ai (10 messages🔥):
Higgsfield $50M raise, Adobe value shift, GenZ AI Founders
- Higgsfield Hustles to $50M A Round: AI video startup Higgsfield announced a $50M Series A led by GFT Ventures, reaching a $50M revenue run-rate (4.5× growth in three months), and is launching Higgsfield Ventures to support AI-native Gen Z founders.
- Adobe's AI Angst: $100B Value Vanishes?: Anjney Midha suggests AI editing advances may swipe $100B from Adobe's market cap towards frontier AI labs (Flux Kontext, Gemini Nano).
- GenZ to get AI Boost: Higgsfield Ventures plans to support AI-native Gen Z founders, giving more opportunities to young talent.
Nous Research AI ▷ #general (74 messages🔥🔥):
Nepal Discord Election, MLC-LLM issues, sglang vs vllm, GPT-OSSH4 in claude code, Demis Hassabis
- Nepal elects leader on Discord!: Members joked about Nepal voting for its leader on Discord and what's next: AI waifus and husbandos for all citizens.
- A member shared an article about Nepal going through their entire revolution right now.
- MLC-LLM experiment encounters issues: One of the members is experimenting with adding custom models to MLC-LLM (https://github.com/mlc-ai/mlc-llm) but keeps encountering issues when injecting the model.
- A member suggested that this might be due to mixing up context by sessions erroneously not being terminated properly, or may be similar to this issue on llama.cpp.
- sglang and vllm used internally: A member stated that they only use sglang and vllm internally.
- Another one mentioned that they haven't tried sglang before, but its git repo looks promising, and he was primarily looking to utilize mlc to attempt to experiment with gpt-ossh4 in claude code.
- XML preferred over JSON by the Qwen team: It was noted that the Qwen team prefers XML over JSON and a member is planning to do the same with their agentic system before releasing it.
- It's believed something new is needed that is much more token conscious, because all the whitespace is not resource friendly.
- Sir Demis Hassabis discusses world models: A member shared a YouTube video with Sir Demis Hassabis discussing the pursuit of a multi-modal (world-model-building) approach toward AGI and embodied AI systems, the limitations of LLMs, and the mindblowing world of Genie 3.
- This video covers Alphafold real-world achievements in research biology and medicine.
Nous Research AI ▷ #ask-about-llms (1 message):
Adversarial Idea Presentation, Strength in Weakness
- Adversarial Idea Presentation Reveals Hidden Strengths: Presenting an idea in adversarial mode can inadvertently surface additional strengths that the framing merely casts as weaknesses.
- Framing Weaknesses as Strengths: When ideas are presented adversarially, potential benefits may be perceived as drawbacks, highlighting the importance of framing.
Nous Research AI ▷ #research-papers (5 messages):
OpenAI Economic Research, Anthropic Economic Index, ChatGPT usage growth, AI Friend mapping
- AI Economics Papers Drop Simultaneously: Both OpenAI and Anthropic simultaneously released papers; OpenAI with Economic Research on ChatGPT Usage and Anthropic with their Economic Index September 2025 Report.
- ChatGPT User Base and Engagement Rocketing: According to OpenAIās data, thereās a substantial increase in both the number of people signing up for ChatGPT and the usage per person.
- “AI Friend” Category faces Skepticism: A member questioned the mapping of certain data to the category of “ai friend”, expressing doubt.
Nous Research AI ▷ #interesting-links (2 messages):
DNS Tunneling Chat Client, AI Killing Focus
- LLMs Fly High with DNS Tunneling Chat Client: A member created a tool to chat with LLMs from WiFi captive portals using DNS tunneling, enabling LLM access on airplanes without extra fees.
- They asked for roasts, which some might consider a risky request given the current AI climate.
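The trick works because captive portals usually let DNS resolution through even before login; a toy sketch of the client side using dnspython (the tunnel domain and TXT-based protocol here are hypothetical, not the member's actual implementation):

```python
import base64
import dns.resolver  # pip install dnspython

TUNNEL_DOMAIN = "llm.example.com"  # hypothetical: an authoritative server you control

def ask_llm(prompt: str) -> str:
    # Encode the prompt into a DNS label (labels max out at 63 bytes,
    # so a real client chunks long prompts across multiple queries)
    label = base64.b32encode(prompt.encode()).decode().rstrip("=").lower()
    answer = dns.resolver.resolve(f"{label}.{TUNNEL_DOMAIN}", "TXT")
    # The server smuggles the model's reply back in TXT record strings
    return "".join(s.decode() for rdata in answer for s in rdata.strings)

print(ask_llm("hello from seat 23A"))
```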
- AI Blamed for Attention Deficit Apocalypse: A member shared a blog post arguing that AI is harming our ability to focus.
- The post details a system to reclaim focus in a world dominated by AI-driven distractions.
DSPy ▷ #show-and-tell (3 messages):
fastWorkflow beats Claude Opus 4.1, GEPA API Improvement, Tau Bench retail
- fastWorkflow framework matches Claude Opus 4.1!: A member found that their new fastWorkflow framework implementation matches Claude Opus 4.1 on the Tau Bench dev set.
- These tests used DSPy for agents and parameter extraction and the retail workflow example from their repo.
- GEPA API improvement requested!: Another member expressed interest in learning from experience using GEPA for agentic use cases.
- They also asked to be notified of any potential improvements to the GEPA API that could better support such use cases.
DSPy ▷ #general (51 messages🔥):
GEPA for code generation, Manim and DSPy video, Rules as inputs for optimization, MCP Server, Zero Shot Categorization
- GEPA Generates Great Code: DSPy's latest optimizer, GEPA, was built exactly with code generation in mind, showcased in this paper (section 6) for generating CUDA/C++ code for GPUs/NPUs.
- One of the original creators happily offered to discuss GEPA in greater detail.
- MultiMedia Manim Magician Makes Movie Magic: A member shared a video created with a custom pipeline using DSPy, which included narration script generation, Manim scene generation, and an auto-fix feedback loop.
- The video utilized Signatures, KNNFewShot, and ChainOfThought, but is closed source at the moment.
- Optimization Overload Overwhelms: A user found that running optimization after each small change of the instructions seems to be too heavy and slow a workflow.
- It was suggested to add a list of rules as part of the input as well, so the prompts are optimized to be adaptable to different rules and possibly unseen rules.
- MCP Server Seeking DSPy Savvy: A user is curious if anyone has used DSPy to tune their MCP server descriptions and examples and thinks tuning for the average result is probably good enough.
- Another member validated this idea, suggesting that the user could infer the calling LM based on the client, and that this idea is so good that I'd bet people would be willing to pay for it as a service if you pulled it off.
- Categorizing with Class: A user is trying to perform zero-shot categorization of ~2k texts (emails) and wants to provide examples or seed words for each topic.
- It was suggested to use `typing.Literal` in the signature definition and load JSON data to create `dspy.Example` objects, and pointed to this tutorial.
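A minimal sketch of that pattern (the category names and seeds.json layout are made up for illustration):

```python
import json
from typing import Literal
import dspy

class CategorizeEmail(dspy.Signature):
    """Assign one topic to an email."""
    text: str = dspy.InputField()
    # Literal constrains the output to the known topics for zero-shot classification
    category: Literal["billing", "support", "sales", "spam"] = dspy.OutputField()

# Load labeled seeds from JSON and wrap them as dspy.Example objects
with open("seeds.json") as f:  # hypothetical file: [{"text": ..., "category": ...}, ...]
    seeds = [dspy.Example(**row).with_inputs("text") for row in json.load(f)]

# assumes dspy.configure(lm=...) has been called with your model of choice
classify = dspy.Predict(CategorizeEmail)
print(classify(text="My invoice was charged twice.").category)
```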
DSPy ▷ #colbert (1 message):
Contextual Chunking, ColBERT Models, Late Chunking, MaxSim Algorithm
- Contextual Chunking Boosts Performance: A user found that prepending a contextual summary to each chunk significantly improves performance, even with ColBERT models.
- However, they noted that generating a summary for every chunk is costly, prompting a search for more efficient alternatives.
- Late Chunking Explored for ColBERT: The user proposes using late chunking with ColBERT: encoding the entire text at once and then splitting the embeddings into chunks afterward.
- This approach assigns each chunk its corresponding embedding list for more efficient processing.
- MaxSim Algorithm and CLS Token Reliance: The user questions whether ColBERTās maxsim algorithm relies on the CLS token for optimal performance, fearing issues with chunks from the middle of the text that lack a CLS token.
- They inquire if itās safe to omit the CLS token when applying maxsim to each chunk in this scenario.
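A sketch of the late-chunking idea paired with a CLS-free MaxSim (the model name is illustrative; a real ColBERT stack adds its linear projection and normalization on top of the backbone, and late chunking pays off most with long-context encoders, whereas this BERT backbone caps at 512 tokens):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("colbert-ir/colbertv2.0")   # BERT-based backbone
model = AutoModel.from_pretrained("colbert-ir/colbertv2.0")

def late_chunk(text: str, chunk_tokens: int = 64) -> list[torch.Tensor]:
    # Encode the whole document once so every token embedding sees full context...
    enc = tok(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        emb = model(**enc).last_hidden_state[0]                 # (seq_len, dim)
    emb = torch.nn.functional.normalize(emb, dim=-1)            # unit vectors for cosine sims
    # ...then split the embedding matrix, not the text, into chunks
    return [emb[i:i + chunk_tokens] for i in range(0, emb.shape[0], chunk_tokens)]

def maxsim(query_emb: torch.Tensor, chunk_emb: torch.Tensor) -> float:
    # ColBERT-style MaxSim: each query token keeps its best-matching chunk token;
    # no CLS token is involved, only token-level similarities
    sims = query_emb @ chunk_emb.T                              # (q_len, c_len)
    return sims.max(dim=1).values.sum().item()
```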
Modular (Mojo 🔥) ▷ #general (18 messages🔥):
Mojo Package Managers, Binary vs Source Distribution, Pixi and Conda, Apple M1 Compatibility
- Mojo Packaging Mania: Community Mulls Managerial Methods: A community member inquired about creating a new package manager for Mojo, specifically to handle binary distribution; however, the Mojo team pointed out that `.mojopackage` already covers many benefits of binary distribution and works with Pixi, and that the team is intentionally leaning on Conda and standard Python package formats to help adoption.
- Binary Blues: Source Distribution Strides Strong: It was noted that there are downsides to binary distribution, which is why many languages prefer source distribution, but the user was curious whether there might be scenarios where a more explicit binary-focused package manager could be useful, like for large dependencies or prebuilt libraries.
- The Mojo team has stated that Mojo can compile ~200k lines of code in 30 seconds on a laptop and that Pixi handles C/C++ dependencies with Conda.
- Pixi Power-Up: Decentralizing Dependencies Dynamically: A community member highlighted pixi-build-mojo, enabling a fully decentralized package system like Go by using packages in Git.
- The ability to specify system dependencies with Pixi was also mentioned to be quite effective.
- M1 Mayhem: MacBook Woes with Mojo?: A user with an Apple M1 MacBook Air running Python 3.13 inquired about Mojo/MAX compatibility with this version of Python.
- The Mojo team confirmed it's compatible, encouraging the use of `pixi` for isolated Python versions, and suggested that running on CPU should be fine (albeit slower) since Apple Metal support is in its early stages.
Modular (Mojo 🔥) ▷ #mojo (33 messages🔥):
InlineList Removal, Small List Optimization, Allocator/Storage API, Mojo LSP Status, Network update
- InlineList's Gone: Where Did It Go?: Members discussed the removal of `InlineList`, with concerns raised about the alternatives (`InlineArray` and `List`) not fully addressing its niche, as the changelog suggests using `InlineArray` or `List` with a `capacity` constructor.
- One member suggested that a stack-allocated variable-size type with fixed capacity would be ideal, and another member mentioned that the Allocator API might be the path forward.
- Small List Optimization Stalled: A “small list optimization” exists, fitting some items inline, but they get copied to the heap if the list grows, with one member mentioning that making the inline size a parameter might be explored.
- A member mentioned that `List` doesn't have SBO (Small Buffer Optimization) currently, due to complexities in exposing it to the user and trait requirements for movable elements.
- Allocator API Coming Soon?: Discussion revolved around the potential of an allocator/storage API to handle inline allocation, with one member stating, “What I'm hearing is that I need to go work on my allocator/storage API more.”
- This API's development is pending parametric traits and `requires`, delaying its progress.
- Mojo Gets Major LSP Rework: A member inquired about the status of Mojo's Language Server Protocol (LSP), and another replied that it exists and is undergoing a major rework soon.
- No further details about the rework were given.
- Network Update Blocked 🚧: A member expressed anticipation for a network update, but another responded, Lots of blockers there.
- The nature of these blockers was not specified.
aider (Paul Gauthier) ▷ #general (36 messages🔥):
RepoMap for Aider, Free C# Models, AGI Predictions, LM Studio issues, GPT-5 Codex
- RepoMap Boosts Aiderās Real-World Performance: A user noted that using RepoMap with Aider provides extra context like filenames and function signatures, enhancing LLM awareness of available resources, theoretically leading to leaderboard results that more closely reflect real-world coding scenarios.
- However, they conceded that benchmark tests on simple problems still leave a significant gap compared to real-world code experiences.
- Seek and ye shall find: Free C# Models: A user sought a free, non-local model proficient in C#, and other members suggested trying Qwen Coder and Deepseek Coder via OpenRouter, noting that Gemini 2.5 Pro might have a free tier.
- The user later reported issues using Qwen via OpenRouter, receiving an AuthenticationError due to a potentially incorrect API key.
- AGI Arrival: When Will AI Slash White-Collar Jobs?: A user polled the channel on when AGI might reduce white-collar jobs by over 30%, offering choices ranging from 2027 to beyond 2040, defining AGI in terms of economic impact rather than abstract intelligence.
- Another member jokingly predicted it would happen somewhere between now and heat death of the universe, or maybe never.
- LM Studio and Aider: A Rocky Start: A user encountered problems running Aider with a local Qwen3-Coder-30B model in LM Studio, sharing images of their setup but without specifying the exact issue.
- Another member inquired whether the necessary environment variables were set, hinting at a potential configuration problem.
- GPT-5 Codex: The New Coding Model on the Block?: A user inquired about Aiderās score for GPT-5 Codex, referencing a The New Stack article on the new model.
- Another clarified that it is Not available through the API yet.
aider (Paul Gauthier) ▷ #questions-and-tips (6 messages):
Ollama context window limits not respected, lm studio or llamafile suggestion, --watch-files implementation on Linux, Gemini issues with Aider
- Ollama context length limits ignored: A user reported that aider with Ollama doesn't respect context window limits, leading to excessive VRAM usage and freezing the machine, despite setting `OLLAMA_CONTEXT_LENGTH` and other parameters in configuration files.
- The user has configured `num_ctx` and `max_tokens` in `.aider.model.settings.yml`, and `max_tokens`, `max_input_tokens`, and `max_output_tokens` in `.aider.model.metadata.json`.
- Alternatives to Ollama suggested: A member suggested using LM Studio or llamafile as alternatives.
- No further discussion or reasoning was provided.
- `--watch-files` implementation based on filesystem: A member inquired how the `--watch-files` option works on Linux, specifically if it relies on inotify or requires communication from an IDE/editor.
- Another member clarified that it's filesystem based and doesn't need specific messages from an editor.
- Gemini integration halted, possibly due to user-agent blocking: A user reported issues with aider hanging while waiting for a response from Gemini models, despite the token being correct and functional with `curl` and the Gemini CLI.
- The user suspects that Gemini might be blocking based on the user agent and is running aider 0.86.1.
aider (Paul Gauthier) ▷ #links (1 message):
Earning $100k in a week, Telegram scams
- Unrealistic earning promise dangles hefty commission: A member offered a get-rich-quick scheme, promising to help the first 10 people interested in earning $100k or more within a week, asking for 10% reimbursement of profits upon receipt.
- Interested parties were instructed to initiate contact via Telegram username @Joanna_Dwayne, a move that raises suspicion of a scam.
- Telegram Contact Raises Red Flags: The request to contact a user via Telegram for a get rich quick scheme is a common pattern used by scammers.
- Users should be wary of any offers that require initial contact on unverified channels and promise unrealistically high returns.
tinygrad (George Hotz) ▷ #general (25 messages🔥):
Tensor.assign return value, GEMM TFLOPs measurement, Winograd bounty lock, Rangeify bugs, CUDA 12.0 and sm_35
- Debate assign() vs store() Functionality: The need for `assign()` to return a value versus acting like `store()` was questioned, pondering if it's just a convenience since the return value is often unused in examples.
- It was suggested that linking both the buffer and the store to the load is a possible alternative.
- GEMM 165+ TFLOPs Bounty Measurement Questioned: A question arose about how to measure the 165+ TFLOP GEMM bounty target on an RTX 4090, suspecting it might be unachievable at the stated boost clock of 2.52 GHz.
- The theoretical peak throughput for FP16/BF16 with FP32 accumulate on an RTX 4090 is around 165.15 TFLOPs at that clock speed, but the questioner implied that a higher clock might be needed to reach the bounty target.
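For reference, the 165.15 number falls out of the usual peak-throughput arithmetic, assuming the commonly cited Ada figure of 256 FP16 FMAs per SM per cycle with FP32 accumulate:

```python
# RTX 4090 peak FP16 (FP32 accumulate) tensor throughput at the stated boost clock
sms = 128                 # streaming multiprocessors in the 4090's AD102 config
fma_per_sm_cycle = 256    # assumed FP16 FMAs/SM/cycle with FP32 accumulate
ops_per_fma = 2           # one multiply + one add
clock_hz = 2.52e9         # advertised boost clock

tflops = sms * fma_per_sm_cycle * ops_per_fma * clock_hz / 1e12
print(tflops)  # 165.15 -- hitting the bounty needs near-perfect utilization or a higher clock
```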
- Winograd Bounty Requirement Clarified: A user inquired about the requirements to lock the Winograd bounty, having found a necessary and sufficient condition to identify Winograd compatible convolutions.
- George Hotz clarified that locks are only after code is correct while there are fixups to merge it.
- Rangeify Bugs List Shared: A list of Rangeify bugs was shared for people to investigate and fix, emphasizing that many are likely simple fixes.
- `RANGEIFY=1` is described as the new scheduler that can create things like flash attention and beyond.
- CUDA 12.0 drops support for sm_35: The CUDA issue is that CUDA 12.0 dropped support for sm_35 used by Ocelot.
- The minimal flag was added after 12.4
tinygrad (George Hotz) ▷ #learn-tinygrad (12 messages🔥):
GPU Utilization in tinygrad, VIZ=1 Profiler, NixOS Patch for CUDA, Profiler 404 error
- GPU Utilization Plummets in tinygrad: A tinygrad user reported seeing poor GPU utilization and sought advice on improving it, particularly when switching from CPU to CUDA.
- Another user suggested using `PROFILE=1` (or `VIZ=1`) to identify where time is being spent, noting that saving tensors to disk can be a bottleneck, and offered to examine the profile to help determine the source of the issue.
- `VIZ=1` Profiler Unifies Profiling Options: `PROFILE=1` is merely an alias for `VIZ=1`, and the former has been removed to reduce redundancy and streamline profiling in tinygrad.
- George Hotz noted “having two options is worse than having one”, which motivated the change, simplifying the profiling process.
- NixOS CUDA Patch Incoming: A tinygrad user is planning to investigate and potentially fix a broken profiler issue on their NixOS distribution after submitting a patch related to CUDA.
- The user mentioned they had to patch file paths, indicating a likely issue with how the distro package handles CUDA dependencies.
- Profiler Faces 404 Error: A tinygrad user encountered a 404 error when trying to access `/js/index.js` while using the profiler.
- This error suggests a potential issue with the profiler's file paths, or the location of `/js/index.js`.
Manus.im Discord ▷ #general (19 messages🔥):
Credits Rollover, Daily Credits Stopped, Clone Website using AI, Subscription Renewal Issues, Knowledge Limit Increase
- Credits Confusion Clouds Users: Users are inquiring about credits rollover and the cessation of daily 300 credit allocations.
- One user specifically reported their subscription renewal was set for September 14th but they havenāt been charged nor received more credits.
- Website Cloning Craze Kicks Off!: A user mentioned that it's easy to clone a website using Manus or other AI tools.
- He was impressed that his feature idea proposed on August 12th was implemented just 16 days later in this discord channel.
- Collaboration Creates Coding Confidence: A user is experimenting with Manus Collaboration with friends for coding tasks.
- Another user is working on a potential new feature that, if successful, promises to significantly enhance Manus's efficiency as a coding assistant.
- Knowledge Navigation Needs Nurturing: Several users are asking about increasing the knowledge limit, specifically whether it's possible to exceed 20.
- No concrete answers were provided in the discussion.
MCP Contributors (Official) ▷ #general (10 messages🔥):
MCP Servers, Reinforcement Learning, Integration Testing, MCP Server Efficiency, NL Interface
- Scalable Integration Testing with MCP Servers: Members are thinking through scalable integration testing and reinforcement learning over MCP server tool use.
- They are considering a flag, surfaced when connecting to an MCP server, indicating that the server is in a development or simulation mode, so robust training can exercise realistic tool behavior without messing with production DBs.
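One possible shape of that idea, sketched with the MCP Python SDK's FastMCP helper (the environment variable, server name, and tool are all hypothetical, not a spec proposal):

```python
import os
from mcp.server.fastmcp import FastMCP  # official MCP Python SDK

# Hypothetical convention: the server advertises that it is simulating side effects
SIMULATION = os.environ.get("MCP_SIMULATION_MODE") == "1"
mcp = FastMCP("orders-server" + (" (simulation)" if SIMULATION else ""))

@mcp.tool()
def delete_order(order_id: str) -> str:
    """Delete an order; in simulation mode, mimic the real response without side effects."""
    if SIMULATION:
        return f"[simulated] order {order_id} deleted"
    raise NotImplementedError("production path omitted in this sketch")

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```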
- Score MCP Servers for Efficiency: One member is researching how to score MCP servers for their efficiency in different clients to determine when the marginal improvement in efficiency is not worth additional coding in the server.
- The trade-off is between every kind of prompt sharing one node and every kind of prompt having its own node; but how many API calls is “too many” for a user story?
- MCP as CLI for Applications: Some folks are considering using MCP as a CLI for applications, in the form of an NL interface and adaptive dashboarding/reporting.
- The idea is to use it as a UI/UX interface to enterprise apps through NL.
- Golang Streaming HTTP MCP Server Project: A member opened up their mcp-server-go project, a golang streaming http MCP server designed to address the more challenging requirements of enterprise-like situations.
- It is designed for scalability, and includes features like auth, sessions and resumability, and dynamic capabilities.
MCP Contributors (Official) ▷ #general-wg (2 messages):
MCP Resource Integration with LLMs, Claude Desktop Automation, Discord Channel Restrictions
- LLM Learns MCP Resources: A member inquired about automating the process of LLMs reading MCP resources before answering user questions and executing tools, aiming for a workflow where the LLM is pre-loaded with knowledge.
- The member noted that currently, with Claude desktop, resources must be manually added to the chat window before asking questions.
- Claude Desktop Lacks Automation: A member confirmed that Claude desktop functions as intended, requiring manual addition of resources to the chat window.
- They clarified that there is no automated process for LLMs to read resources before interacting with users in the current setup.
- Discord's Focus Narrowed: It was clarified that the Discord channel is restricted to discussions on the governance of the MCP protocol itself.
- General MCP questions should be directed elsewhere, with an offer of guidance via DM for those seeking it.