Whisper is all you need.
AI News for 5/13/2025-5/14/2025. We checked 9 subreddits, 449 Twitters and 29 Discords (214 channels, and 4313 messages) for you. Estimated reading time saved (at 200wpm): 428 minutes. Our new website is now up with full metadata search and beautiful vibe coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on @smol_ai!
We try to keep coverage to model- and code-specific news that we're pretty sure engineers will someday use at work, but occasionally smaller product launches are interesting fodder for commentary on the broader AI landscape, especially if the launches involve highly regarded work products like Notion or Granola.
There's an ongoing joke in biology that everything evolves into crab. The same is happening in AI wrapper land: just because wrappers are now recognized as valuable doesn't stop them from being easy to clone. Bolt inspires Figma Make, Claude Code inspires OpenAI Codex, Deep Research inspires Deep Research inspires Research inspires DeepSearch, and on and on. Ideas are worth nothing; may the best distribution + execution win.
The occasion of Granola's $43m Series B (at a $250m valuation) is their moment to launch "Granola 2.0", their collaborative version with a surprisingly… Notion-y UI.
This is a day after Ivan Zhao launched… an interesting Granola-lite feature.
AI Twitter Recap
Language Models and Releases
- GPT-4.1 availability: @OpenAI announced that GPT-4.1, which specializes in coding tasks and instruction following, will be directly available in ChatGPT for Plus, Pro, and Team users, with Enterprise and Education users gaining access in the coming weeks. @kevinweil noted that GPT 4.1 mini is replacing GPT 4o mini everywhere in ChatGPT, including for free users.
- Claude models: @scaling01 expressed excitement about the upcoming Claude Opus, anticipating further models like Ultra and GPT-4.5-based reasoning models. @steph_palazzolo shared information on Anthropic's upcoming Claude Sonnet and Claude Opus releases, noting their different reasoning models. However, @andersonbcdefg criticized that Claude is braindead now, with O3 making random stuff up and sending you down rabbit holes of hallucinations.
- Qwen models: @Alibaba_Qwen shared the Qwen3 Technical Report, detailing model specifics and complete assessments. @iScienceLuvr highlighted that Seed1.5-VL delivers state-of-the-art results on 38 out of 60 public VLM benchmarks. @reach_vb congratulated the team, and @Yuchenj_UW commended the Qwen team's great work. @qtnx_ also expressed respect for the Qwen team's impressive (and hilarious) move of throwing thirty-six TRILLION tokens at a 600M model.
- Meta's AI efforts: @AIatMeta announced new releases from Meta FAIR, including models, benchmarks, and datasets for molecular property prediction, language processing, and neuroscience. However, @Yuchenj_UW criticized Meta's AI, particularly Llama 4, noting issues with ignoring attached pictures and login failures.
- @_akhaliq announced that AM-Thinking-v1 just dropped on Hugging Face, calling it an advancement on the Frontier of Reasoning at 32B Scale (a minimal loading sketch follows this list).
- @RisingSayak announced a new Diffusers-compatible training script for SANA Sprint.
- Gemini 2.0 Flash Preview: @ArtificialAnlys reported that Gemini 2.0 Flash Preview image generation delivers a modest upgrade over the 2.0 Flash Experimental release but remains well below the state-of-the-art threshold. @HamelHusain said that Gemini one-shotted these chapter summaries with amazing accuracy.
- Stability AI just dropped Stable Audio Open Small on Hugging Face. @_akhaliq noted that it offers fast text-to-audio generation with adversarial post-training.
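For readers who want to poke at one of these drops, here is a minimal, hedged sketch of loading a chat model from the Hub with transformers; the `a-m-team/AM-Thinking-v1` repo id is assumed from the tweet, so verify it against the model card.

```python
# Minimal sketch: trying one of the day's Hub drops with transformers.
# The repo id below is assumed from the announcement, not verified.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "a-m-team/AM-Thinking-v1"  # assumed repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Why is the sky blue?"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```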
Agent Development and Tooling
- LangChain Interrupt: @LangChainAI provided updates from Interrupt 2025, focused on evals, quality, and reliability, emphasizing that quality is still the biggest blocker of bringing agents to production. @LangChainAI also introduced the Open Agent Platform, an open-source, no-code agent building platform. @LangChainAI announced that the LangGraph Platform is now generally available, designed for deploying, scaling, and managing agents.
- LlamaIndex Memory Component: @llama_index introduced a new, flexible Memory API that blends short-term chat history and long-term memory via plug-and-play blocks (a toy sketch of the block idea follows this list).
- Runway References Update: @c_valenzuelab shared the cool use cases emerging with the latest References update.
- @LiorOnAI announced a debugging tool from @PatronusAI that scans full execution traces, detects 60+ failure types, and suggests prompt fixes, working with Langchain, CrewAI, OpenAI SDKs, and more.
- Model Context Protocol: @AndrewYNg announced a new course on MCP with Anthropic, focusing on building rich-context AI apps. @DeepLearningAI announced a new course with Anthropic on MCP. @jerryjliu0 introduced a new abstraction for agentic memory, modeling it as a set of "blocks" in a waterfall architecture.
- @nerdai introduced FedRAG, a framework for fine-tuning RAG systems, highlighting its focus on simplifying fine-tuning across centralized and federated architectures.
- @LiorOnAI noted OpenAI quietly released their GPT-4.1 Prompting Guide, saying it's a must-read if you're using agents or LLMs.
- @steph_palazzolo observed that coding assistants are moving towards always-on agents that constantly search for bugs and vulnerabilities in the background.
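To make the memory-blocks idea from the LlamaIndex and @jerryjliu0 items concrete, here is a toy "waterfall" sketch: new messages land in a small short-term block, and overflow spills into progressively larger long-term tiers. This illustrates the concept only; it is not the actual LlamaIndex Memory API.

```python
# Toy waterfall memory: short-term block overflows into long-term tiers.
from dataclasses import dataclass, field

@dataclass
class Block:
    capacity: int
    items: list = field(default_factory=list)

class WaterfallMemory:
    def __init__(self, capacities=(4, 16, 64)):
        self.blocks = [Block(c) for c in capacities]

    def add(self, message: str) -> None:
        carry = message
        for block in self.blocks:
            block.items.append(carry)
            if len(block.items) <= block.capacity:
                return
            carry = block.items.pop(0)  # oldest item spills down a tier
        # past the last tier a real system might summarize; this toy drops it

    def context(self) -> str:
        # oldest tiers first, so the freshest messages sit nearest the prompt end
        return "\n".join(m for b in reversed(self.blocks) for m in b.items)

mem = WaterfallMemory(capacities=(2, 4, 8))
for i in range(10):
    mem.add(f"msg {i}")
print(mem.context())
```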
AI Infrastructure and Tools
- Hugging Face and Integrations: @reach_vb noted that you can now use any model from Hugging Face directly in Kaggle notebooks. @ClementDelangue said it is very cool to see @PyTorch contributing on @huggingface. @_akhaliq reported blazingly fast Whisper transcriptions with Inference Endpoints.
- vLLM Enhancements: @ClementDelangue reported an 8x faster/cheaper @openai Whisper API thanks to Hugging Face Inference Endpoints & @vllm_project (a client sketch follows this list). @vllm_project congratulated FlashInfer, and @danielhanchen shared a new GRPO notebook for Qwen3 Base, saying vLLM 0.8.5 is also supported now with Unsloth!
- Keras Updates: @fchollet discussed creating KerasHub pretrained components straight from the base classes.
- Model Context Protocol: MCP makes AI development less fragmented and standardizes connections between AI applications and external data sources, explained @AndrewYNg.
- @DeepLearningAI launched course 4 of the Data Analytics Professional Certificate, which includes Data I/O and Preprocessing with Python and SQL; throughout the course, you'll learn how to use generative AI to help debug and optimize your data pipeline.
- @skypilot_org reported on spinning up Qwen3 @Alibaba_Qwen + SGLang @lmsysorg on H100 in one command.
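As a rough illustration of the fast Whisper serving mentioned above, a hedged sketch using huggingface_hub's InferenceClient follows; point `model` at your own Inference Endpoint URL for a vLLM-backed deployment (the serverless model id shown is just a stand-in).

```python
# Hedged sketch: transcribing audio against a Whisper deployment.
# Swap the model id for your Inference Endpoint URL if you have one.
from huggingface_hub import InferenceClient

client = InferenceClient(model="openai/whisper-large-v3")  # or an endpoint URL
result = client.automatic_speech_recognition("meeting.wav")  # local audio file
print(result.text)
```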
AI and Research Concepts
- AlphaEvolve for Algorithm Discovery: @GoogleDeepMind introduced AlphaEvolve, a Gemini-powered coding agent for algorithm discovery, capable of designing faster matrix multiplication algorithms, finding new solutions to open math problems, and making data centers, chip design, and AI training more efficient. @GoogleDeepMind further noted that in 75% of cases, it rediscovered the best solution known so far.
- Auto-regression critiques: @francoisfleuret shared a hot take that auto-regression sucks and is impressive only as a parlor trick, saying any spark of intelligence from an LLM reflects that it has moved beyond auto-regression and built a factorized model with meaningful latents.
- Evaluation methods: @BorisMPower stressed that creating evaluations is the most effective way to improve model performance in any domain.
- Importance of Implementation: @hyhieu226 highlighted that deep learning is ~10% idea and ~90% implementation.
- LLMs and Grammar: @LiorOnAI explained why LLMs trained on 90% English still perform incredibly well in other languages: they learn shared grammatical concepts rather than just memorizing word-level patterns.
- @shaneguML shared Bruce Lee's famous LLM researcher quote: "I fear not the LLM who has practiced 10,000 questions once, but I fear the LLM who has practiced one question 10,000 times."
- Type-constrained Code Generation: @mathemagic1an shared "Type-constrained Code Generation with Large Language Models", saying it uses the LSP/type system to constrain valid output tokens during code generation and reduces compilation errors by >50% with 30B models. (A sketch of the masking trick follows.)
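A minimal sketch of the core idea, under the assumption that a type checker can be queried as a prefix oracle: at each decoding step, candidate tokens that cannot extend a well-typed program prefix are masked out. `is_valid_prefix` stands in for the paper's LSP/type-system machinery.

```python
# Sketch: mask next-token logits so only type-valid continuations survive.
import torch

def constrained_step(logits: torch.Tensor, prefix: str, tokenizer, is_valid_prefix) -> int:
    """Pick the best next token whose decoded text keeps the prefix well-typed."""
    mask = torch.full_like(logits, float("-inf"))
    # Only re-check the top candidates; a real system would cache checker state.
    for tid in torch.topk(logits, k=64).indices.tolist():
        if is_valid_prefix(prefix + tokenizer.decode([tid])):
            mask[tid] = 0.0  # keep this token eligible
    return int(torch.argmax(logits + mask).item())
```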
Industry, Business, and Economic Impacts
- AI in Enterprises: @AIatMeta shared their study on CATransformers, a carbon-driven neural architecture and system hardware co-design framework, which discovers greener CLIP models achieving an average 9.1% reduction potential in total lifecycle carbon emissions.
- AI skillsets: @NandoDF said that, if you're a good data engineer, or an engineer who loves looking at data and creating datasets for games, video, images, audio, text… "please send me a message."
- @ID_AA_Carmack believes that more of the world than many might imagine could run on outdated hardware if software optimization was truly a priority.
- Overcoming conscientiousness as an entrepreneur is a common theme: @scottastevenson thinks that conscientious people are constantly drawn to easy, structured dopamine rewards like cleaning their desk or running an errand.
- Importance of Demand: @rishdotblog summarized @ejames_c's post, noting that the inability to find authentic demand kills startups.
- Impacts on Employment: @cto_junior believes that most "software engineers" here are only code monkeys with no insight into how the overall system works, and that they will for sure get replaced without upskilling.
Humor and Miscellaneous
- @sama declared that brian is the most auteur founder of this generation, and it really shines through in how he does launches!
- @typedfemale said "I'm in my later 20s now (and female btw). And this will sound weird, but I really think God put me on this earth to bring warmth to the lives of mildly autistic men".
- @arankomatsuzaki said "I'm glad you're keeping track of my Ls".
- @victkyatk wrote "being the greatest ML researcher of all time must be really annoying".
- @francoisfleuret declared "I am worth whatever salary just for my enthusiasm."
AI Reddit Recap
/r/LocalLlama Recap
1. Benchmarking AMD Strix Halo and Qwen3 Models for Local LLM Inference
- AMD Strix Halo (Ryzen AI Max+ 395) GPU LLM Performance (Score: 104, Comments: 43): The post benchmarks the AMD Strix Halo (Ryzen AI Max+ 395) GPU, featuring 40 RDNA3.5 CUs and a peak of 59.4 FP16/BF16 TFLOPS, for LLM inference on Linux using llama.cpp and other frameworks. Raw compute efficiency with hipBLASLt reaches 36.9 TFLOPS (>60% of theoretical), but llama.cpp's HIP backend underperforms (e.g., 348.96 tokens/sec for Llama-2-7B Q4_0), drastically below expected efficiency versus Vulkan (881.71 t/s), Apple M-series, and both 780M and 7900 XTX GPUs. The Vulkan backend, with recent Flash Attention (FA) support, delivers the best prompt and token generation speeds (e.g., 884 t/s for Llama-2-7B Q4_0), while HIP+rocWMMA+FA excels for long contexts (almost no perf drop at 8K context). Testing also includes Qwen3-30B/109B and Llama 4 (up to 57.9 GiB models), showing that Vulkan delivers high tg128 for massive models and that ROCm and software-stack maturity (esp. PyTorch FA) remain bottlenecks. ROCm 6.4, AOTriton, and Composable Kernel are confirmed to build and work, but PyTorch Flash Attention still fails on this hardware. Useful reference: Strix Halo benchmarking results. Commenters highlight that for Llama-2-7B Q4_0, the GPU achieves 79% of theoretical memory bandwidth and 87% for Qwen3 32B Q8, higher efficiency than most conventional systems per synthetic benchmarks. Others request testing with higher-precision models (e.g., Qwen 32B Q8 at large context) and follow ongoing ROCm and PyTorch development threads (ROCm#4499, ROCm/TheRock#244).
- Ongoing efforts to improve PyTorch support for AMD GPUs are highlighted, with direct reference to active ROCm development discussions and issue tracking. Technical readers are pointed to ROCm/ROCm issue #4499 and ROCm/TheRock discourse #244, indicating a focus on library and compatibility optimizations for PyTorch users on AMD hardware.
- Benchmark results show that the Llama-2-7B-GGUF Q4_0 model achieves throughput at 79% of theoretical memory bandwidth, while Qwen3 32B Q8 reaches 87%, significantly higher than most conventional systems, where synthetic benchmarks often perform worse (see the back-of-envelope sketch after this list). Reference provided to memory bandwidth benchmarks and discussion.
- There is an interest in RPC latency tests for Strix Halo systems, comparing the potential value proposition of using these new devices as single RPC servers versus scaling multiple cheaper systems. The inquiry seeks technical details on RPC performance testing, particularly for LLM inference deployments, and whether such benchmarks were conducted using one or several units as hosts/clients.
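A back-of-envelope check on those efficiency figures, assuming Strix Halo's 256-bit LPDDR5X-8000 interface (~256 GB/s peak) and an approximate Q4_0 weight size; the measured speed below is a hypothetical value chosen to show how a ~79% figure falls out.

```python
# Token generation is memory-bound: each token reads roughly the full weights,
# so tokens/sec ceiling ≈ peak bandwidth / model size.
theoretical_bw_gbs = 256.0   # assumed Strix Halo peak memory bandwidth
model_size_gb = 3.6          # ~Llama-2-7B Q4_0 weight size (approximate)
measured_tps = 56.0          # hypothetical measured tg speed

ceiling_tps = theoretical_bw_gbs / model_size_gb
print(f"ceiling ≈ {ceiling_tps:.0f} t/s, efficiency ≈ {measured_tps / ceiling_tps:.0%}")
# → efficiency ≈ 79%, the kind of figure quoted in the comments
```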
- Qwen3-30B-A6B-16-Extreme is fantastic (Score: 120, Comments: 44): The Qwen3-30B-A6B-16-Extreme model (Hugging Face link) is a MoE (Mixture of Experts) LLM variant which increases the active experts from 8 to 16 (out of 128 total) compared to the original Qwen 30B-A3B specification. The model is not actually finetuned but instead has had its expert count changed via configuration, as clarified on the model card, which can impact inference depth and potentially accuracy. There is also GGUF quantization support (link) and a 128k context-length variant is available. Technical debate in the comments centers on the impact of increasing the number of experts without retraining: some users question whether simply activating more experts yields performance gains or requires retraining, and call for benchmarks to quantify the improvement. One commenter points out that contrary to the model card, escalating the number of experts does not constitute a proper "finetune" but is just a configuration change.
- Discussion centers on model architecture for Qwen3-30B-A6B-16-Extreme, specifically increasing active MoE (Mixture-of-Experts) experts from 8 to 16 out of 128 without retraining. Technical users confirm you can change expert count via configuration (not weights), e.g., `--override-kv qwen3moe.expert_used_count=int:24` in llama.cpp, or through LM Studio settings (a loading sketch follows these comments).
- The SHA256 checksum for the safetensors file remains unchanged, indicating only the config file is altered to use more experts per token and the model weights themselves are unmodified. This suggests increased expert count is simply a runtime configuration, not a genuine finetune, despite some model cards erroneously describing it as such.
- Questions remain about the performance impact of increasing experts. Commenters request benchmark comparisons between different expert counts and debate whether simply activating more experts yields better results, or if further training is necessary to maximize gains.
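For anyone who wants to reproduce the configuration-only change locally, here is a hedged sketch using llama-cpp-python's `kv_overrides`, which mirrors the `--override-kv` CLI flag quoted above; the GGUF filename is a placeholder, and whether more active experts actually helps is exactly what commenters want benchmarked.

```python
# Hedged sketch: raise the active-expert count at load time (config, not weights).
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-30B-A3B-Q4_K_M.gguf",           # placeholder path
    kv_overrides={"qwen3moe.expert_used_count": 16},   # default config uses 8
    n_ctx=8192,
)
out = llm("Q: What is 17*24? A:", max_tokens=16)
print(out["choices"][0]["text"])
```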
- Embrace the jank (2x5090) (Score: 101, Comments: 48): The OP upgraded a mining rig by adding a second NVIDIA RTX 5090 GPU to an existing 4x3090 setup, noting improved availability and reduced pricing of the 5090. They report compatibility challenges due to the physical length of the Gigabyte 5090, but observed that the ROPs are robust (indicating late-batch cards) and that cable/power thermals remain safe with power limits set (400W for 5090s, 250W for 3090s). Use-case includes simultaneous LoRA training on one 5090 and image generation on another via ComfyUI, with inference planned via vllm or sglang on the 3090s. Commenters highlight the prohibitive cost of high-VRAM cards like the 5090 in some regions, and suggest further technical analysis such as system noise measurement.
- A user discusses simultaneous workload capability by running a LoRA training session on one RTX 5090 and image generation in ComfyUI on another 5090, with TabbyAPI operating on 4x3090s. The workload is described as mild, and the user intends to test higher-demand scenarios with vllm or sglang inference later, pointing to interest in assessing real multi-GPU performance under more intensive AI serving tasks.
- Thermal management concerns are highlighted, specifically regarding the risk of connector melting when running high-end GPUs like the 5090. The user queries about the type and quality of thermal camera used for monitoring, suggesting attention to hardware safety and reliability in extreme or enterprise GPU setups.
2. MAESTRO Local-First AI Research App Release and Benchmarks
- Announcing MAESTRO: A Local-First AI Research App! (Plus some benchmarks) (Score: 149, Comments: 34): MAESTRO is a modular, local-first AI research app supporting document ingestion, hybrid-search RAG pipelines, and a multi-agent system (planning, research, reflection, writing), configurable for both local and cloud/API-based LLMs. Benchmarks, available in the repository's `VERIFIER_AND_MODEL_FINDINGS.md`, use a panel of LLM "verifiers" to assess and match local and remote models for agent roles, reporting per-task performance (e.g., note-taking, synthesis) and providing deployment recommendations. The system supports both Streamlit UI and CLI interaction, tracks resource/cost usage, and is actively evolving towards enhanced UIs and agentic frameworks. See code and details. Notable commenter questions include surprise at certain benchmark outcomes (e.g., "qwen3 8b performs better than 32b?"), requests for broader websearch API support (e.g., SearxNG, DuckDuckGo, Google), and critiques of example outputs as conventional, with suggestions to test agents on more novel research domains for true value assessment.
- The OP is questioned whether Qwen3 8B truly outperforms 32B models, reflecting skepticism and interest in the posted benchmarks. This highlights community attention to surprising performance results, especially since conventional wisdom would expect significantly larger models (32B) to outperform smaller ones (8B) unless there are dramatic efficiency or instruction-tuning improvements in the model architecture.
- A critical bug report references a PyTorch error relating to custom class instantiation. The user receives: `RuntimeError: Tried to instantiate class '__path__._path', but it does not exist! Ensure that it is registered via torch::class_`. This suggests a potential packaging or extension issue where a required TorchScript/PyTorch custom class was not registered or is missing.
- Wan-AI/Wan2.1-VACE-14B · Hugging Face (Apache-2.0) (Score: 118, Comments: 6): Wan2.1 (Hugging Face repo) is an open-source, state-of-the-art video foundation model suite covering text-to-video, image-to-video, video editing, text-to-image, and video-to-audio, with models ranging from 1.3B to 14B parameters. It demonstrates SOTA performance (outperforming both open and commercial peers), supports consumer-grade GPUs (5-second 480p render on RTX 4090 in 4 minutes with the 1.3B model), and features robust bilingual (Chinese/English) text generation, a high-quality, temporal-preserving video VAE (Wan-VAE), and tight integration with Diffusers and ComfyUI, supporting advanced distributed/multi-GPU inference via FSDP and Ulysses/Ring strategies. All code and weights are released under Apache 2.0, optimized for LoRA/finetunable workflows, with quantization and speed optimizations supplied. Commenters request a MoE (Mixture of Experts) 14B variant to significantly improve inference speed for practical deployment (potentially achieving 10× speedup with ~90% performance retention), and request clarification on naming and feature distinctions between Wan2.1's previous and current variants (ITV/TTV vs. VTV components). (A Diffusers sketch follows the comments below.)
- A user discusses the potential impact of a MoE (Mixture of Experts) 14B version, noting that even at 90% of the original modelâs performance, a speedup by 10x would drastically improve practical inference times, especially for consumer use-cases (e.g., reducing a 20-minute render to 2 minutes, or 10 minutes to 1 minute on an RTX 4090 with optimizations).
- Technical highlights from the model card are cited: the T2V-1.3B model operates on just 8.19GB VRAM, enabling compatibility with consumer GPUs, and can generate a 5-second 480p video in ~4 minutes on an RTX 4090 without quantization. Wan2.1 reportedly outperforms other open and commercial models across benchmarks, excels in multi-modal tasks, and is the first video model to robustly generate both Chinese and English text, with a highly efficient Wan-VAE for 1080p video handling.
- There is confusion about the naming conventions and versions in the Wan series, with users questioning the transition from ITV/TTV to potentially VTV, indicating the need for clearer documentation or changelog on model progressions and architecture changes.
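A hedged sketch of running the 1.3B text-to-video variant through Diffusers; the `WanPipeline` class, repo id, and default resolution reflect the Diffusers Wan integration as I understand it, so verify all of them against the model card before relying on this.

```python
# Hedged sketch: Wan2.1 1.3B text-to-video via its Diffusers integration.
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",  # assumed Diffusers-format repo id
    torch_dtype=torch.bfloat16,
).to("cuda")

frames = pipe(
    prompt="A cat surfing a wave at sunset, cinematic",
    num_frames=81,            # ~5 s of video at 16 fps
    height=480, width=832,    # 480p-class default for the 1.3B model
).frames[0]
export_to_video(frames, "cat.mp4", fps=16)
```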
3. BitNet R1 Ternary Model Finetune and Community Tools
- BitNet Finetunes of R1 Distills (Score: 274, Comments: 65): A novel method enables direct fine-tuning of existing FP16 Llama and Qwen checkpoints into ternary BitNet (weights limited to {-1, 0, 1}), by inserting an input-side RMSNorm before every linear layer. Models bitnet-r1-llama-8B (converged in ~300M tokens) and bitnet-r1-qwen-32B (~200M tokens) were trained using BF16 AdamW on 8ĂH100 GPUs with all linear weights quantized (including lm_head for this release). PyTorch/Transformers support is available via a PR (repo), allowing use and further fine-tuning with only a quant_config change; checkpoints are hosted on Hugging Face (bitnet-r1-llama-8b, bitnet-r1-qwen-32b). This approach reduces memory requirements and training costs, achieving competitive loss trends, with a roadmap including convergence, keeping the output head in full precision, and RMS patch upstreaming. Expert comments highlight that this method enables BitNet weights to be achieved with minimal additional trainingâcheaper than retraining from scratchâand express interest in whether performance surpasses 4-bit quantization, with requests for further benchmarks and broader hardware evaluations.
- The core innovation detailed is the addition of an input-side RMSNorm layer before each linear operation in existing FP16 Llama or Qwen models, allowing direct fine-tuning into the highly compressed 1-bit BitNet format. This method enables rapid adaptation (convergence in roughly 200-300M tokens) at a fraction of the original full training cost, with minimal runtime impact as the extra RMSNorm can be fused into the quantization process post-training.
- All linear weights, including the critical `lm_head`, were quantized in these experiments to stress-test stability, a choice expected to produce suboptimal perplexity versus approaches keeping `lm_head` in full precision. The authors note that future iterations will retain a full-precision `lm_head` and aim for better convergence and compatibility with original model weights, eventually supporting drop-in replacement for standard checkpoints.
- Training was performed using BF16 AdamW and DeepSpeed ZeRO-3 on 8x H100 GPUs. While BitNet weight packing reduces memory and can offer faster inference in memory-bound scenarios, some hardware may incur overhead due to de-quantization. The checkpoints are experimental and a slight perplexity gap is expected until further training continues; code modifications are available in a custom fork of Hugging Face Transformers for early adopters. (A minimal sketch of the quantization recipe follows.)
- I updated the SmolVLM llama.cpp webcam demo to run locally in-browser on WebGPU. (Score: 205, Comments: 14): The post announces an update to the SmolVLM/llama.cpp webcam demo: it now runs completely in-browser using WebGPU and Transformers.js, eliminating the need for local installations or a server backend. The demo leverages client-side WebGPU acceleration for real-time inference and is deployed on Hugging Face Spaces, with minimal implementation (single index.html file) available in the files section (demo link: https://huggingface.co/spaces/webml-community/smolvlm-realtime-webgpu). Technical discussion in comments is minimal; most feedback is anecdotal and focused on demo output rather than on implementation details or performance characteristics.
- A user inquired about the size of the 500M SmolVLM model, prompting discussion around its storage requirements for local/in-browser execution. While the exact figure isn't stated in the thread, a 500M-parameter model is roughly 1 GB in FP16 precision (and ~2 GB in FP32), which is crucial for those considering in-browser deployment on resource-constrained devices (quick arithmetic below).
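The quick arithmetic behind that estimate: weights ≈ parameter count × bytes per parameter (quantized formats carry some packing overhead, so the Q4 figure is rough).

```python
# Rough storage estimates for a 500M-parameter model at various precisions.
params = 500e6
for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1), ("q4 approx.", 0.6)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.2f} GB")
# fp16 → ~1.00 GB, matching the figure quoted above
```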
- US issues worldwide restriction on using Huawei AI chips (Score: 177, Comments: 166): The US has implemented a worldwide export restriction on the use of Huawei AI chips, expanding its extraterritorial controls beyond domestic borders to curb access to advanced semiconductor technology. The move is intended to prevent Huawei from supplying AI chips for AI and HPC applications that could compete with leading US-based suppliers such as Nvidia, citing national security and competitive advantage concerns. This action further extends previous restrictions on chip manufacturing equipment and semiconductor design tools beyond the US, impacting global supply chains and non-US entities (see Reuters coverage). Top comments note the implicit technological threat posed by Huawei to Nvidia, reflect skepticism about enforceability outside US jurisdiction, and interpret the restriction as a form of technological endorsement for Huaweiâs AI chips.
- Comments reference the potential technical competitiveness of Huaweiâs AI chips, with speculation that US restrictions indicate these chips could surpass Nvidiaâs offerings in terms of price and performance in certain global markets. This inference aligns with recent industry analysis suggesting that Huaweiâs new Ascend series chips are becoming viable alternatives for AI workloads, especially where Nvidiaâs high costs or supply constraints are prohibitive.
Other AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo
1. AlphaEvolve and DeepMind Breakthroughs in Coding and Science AI
- DeepMind introduces AlphaEvolve: a Gemini-powered coding agent for algorithm discovery (Score: 1103, Comments: 274): DeepMind announced AlphaEvolve, an automated coding agent leveraging Gemini LLM ensembles (Flash for exploration, Pro for depth) for novel algorithm discovery and optimization (DeepMind blog). AlphaEvolve iteratively generates and tests code solutions, yielding performance gains, e.g., a 23% speedup in Gemini's matrix multiplication kernel (yielding a 1% overall training time reduction), 0.7% compute recovery in data center scheduling, and hardware improvements for TPUs. On mathematical tasks, it rediscovered state-of-the-art solutions in 75% of 50+ open problems and outperformed the prior best in 20%, including improvements to the kissing number problem. The agent reduces kernel optimization from weeks to days via automated, unsupervised search. Technical debate centers on implications for LLM-based scientific discovery, with some viewing AlphaEvolve as a counterexample to claims that LLMs cannot autonomously discover new algorithms (contrasting with talks from experts like Yann LeCun). Commenters also anticipate that such advances signal near-term breakthroughs in unsupervised self-improvement for algorithmic discovery. (A toy sketch of the evolve-and-evaluate loop follows these comments.)
- AlphaEvolve, when tested on over 50 open problems in mathematical domains including analysis, geometry, combinatorics, and number theory (e.g., the kissing number problem), was able to rediscover the best-known solutions in 75% of cases and surpass previous solutions in 20% of cases, resulting in verifiable new discoveries. (source)
- The system used Gemini-powered approaches to optimize matrix multiplication kernels, which accelerated this key operation by 23% and led to a measurable 1% decrease in the training time for Gemini models. This efficiency gain translates into reduced computational expense, and AlphaEvolve's automation reduces kernel optimization cycles from weeks of manual tuning to days of automated runs, thus speeding research cycles.
- AlphaEvolve's optimization strategies are directly applied to core infrastructure, including Google's data centers and AI chip design, as well as the very model architectures (such as those powering AlphaEvolve) that it is intended to improve, making for a self-optimizing feedback loop within the AI development stack.
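As promised above, a toy sketch of the evolve-and-evaluate loop AlphaEvolve is described as running: an LLM proposes program mutations, an automated evaluator scores them, and the best candidates seed the next generation. `propose_mutation` stands in for a Gemini call and the evaluator is a stub, so treat every detail as illustrative.

```python
# Toy evolutionary program search: propose, evaluate, select, repeat.
import random

random.seed(0)

def evaluate(program: str) -> float:
    return -len(program)  # stub; real systems run benchmarks or checkers

def propose_mutation(parent: str) -> str:
    return parent + random.choice(["", " ", "#"])  # stand-in for an LLM edit

def evolve(seed: str, generations: int = 10, population: int = 8) -> str:
    pool = [seed]
    for _ in range(generations):
        children = [propose_mutation(random.choice(pool)) for _ in range(population)]
        pool = sorted(pool + children, key=evaluate, reverse=True)[:population]
    return pool[0]

print(evolve("def f(x): return x"))
```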
- Meet AlphaEvolve, the Google AI that writes its own code and just saved millions in computing costs (Score: 455, Comments: 52): Google AI has introduced AlphaEvolve, an AI system ostensibly capable of autonomously generating novel computer algorithms, with claims of significant cost savings (millions) in computing expenses. The announcement further asserts AlphaEvolve is able to make new discoveries within computing and mathematics, echoing the ambition of prior breakthroughs like AlphaFold and AlphaGo. Technical readers flag the need for concrete evidence supporting these claims, such as reproducible benchmarks, accessible datasets, or details about algorithmic innovation. Commenters express skepticism about the magnitude of the claims, specifically demanding evidence for AlphaEvolve's ability to invent entirely new algorithms. Others note historical precedent: AlphaFold and AlphaGo also achieved unconventional results by extensive search and self-play, but substantive empirical results validated their significance.
- AlphaEvolve reportedly combines Gemini Flash and Gemini Pro models as its core framework, allowing modular upgrades as new SOTA models become available. This design enables adaptability and suggests sustained improvements in efficiency and capabilities as underlying models advance.
- The impact of AlphaEvolve on complex Google infrastructure (chip design, networking, DC deployment, cloud compute) is highlighted: AI-driven optimization here could lead to significant efficiency gains across the companyâs extensive tech stack, potentially outpacing competitors who lack similar vertical integration.
- Commenters debate the legitimacy of Googleâs claims, noting that Google/DeepMind historically shares less-inflated performance metrics compared to some competitors. There is demand for concrete benchmarks or evidence, especially regarding claims about algorithmic invention and mathematical discoveries.
- DeepMind unveils "spectacular" general-purpose science AI (Score: 246, Comments: 29): DeepMind's newly announced AlphaEvolve system integrates large language models (LLMs) with automated algorithmic evaluators to autonomously evolve novel, high-performance algorithms across scientific domains. In benchmarking, AlphaEvolve discovered matrix multiplication routines surpassing the long-standing Strassen's algorithm, as well as improved designs for tensor processing units (TPUs) and cloud resource allocation. The architecture uniquely combines LLM-driven proposal generation with evolutionary algorithm selection, enabling domain-general problem-solving capabilities (for details, see the Nature announcement). Commenters highlight AlphaEvolve's ability to outperform classic algorithms (e.g., Strassen's for matrix multiplication) and speculate that such advances demonstrate DeepMind's leadership toward artificial general intelligence (AGI), linking this progress to DeepMind's recent "after AGI" specialist hires.
- DeepMind's new AI, AlphaEvolve, develops matrix multiplication algorithms that can surpass the speed of Strassen's algorithm, a method that has remained the fastest-known since 1969. This demonstrates the AI's ability to discover novel optimization strategies in computational mathematics, a key benchmarking domain in computer science.
- A critical observation is that AlphaEvolve's capabilities are inherently tied to domains with scalable and cheap validation, such as program analysis where computational correctness and speed are straightforward to measure. For fields like astronomy, particle physics, medicine, or business, where validation is costly or limited, the impact of these AI-driven discoveries remains minimal; the limiting factor shifts from idea generation to experimental validation.
- While AI-driven improvements in computational tasks like matrix multiplication suggest a compounding effect (flywheel effect) across technically tractable domains, the broader implication is a proof-of-concept for AI systems that could theoretically drive rapid and recursive innovation if applied across interconnected foundational technologies. This is relevant to discussions about the feasibility of a technological singularity driven by general-purpose science AIs.
2. Anthropic Claude Sonnet/Opus Model Release Anticipation and OpenAI Model Rollout
- Looks like we can expect an Anthropic release in the coming weeks (Score: 218, Comments: 61): The image depicts a formal presentation, likely by an Anthropic representative, highlighting the upcoming release of the Claude Sonnet and Claude Opus models. These models are noted for their distinctive reasoning abilities, suggesting significant advancements in Anthropic's AI architecture. Both the post and comments indicate that credible sources (specifically The Information) have reported on imminent releases, putting Anthropic in direct competition with OpenAI and Google for major model announcements in the coming weeks. One commenter emphasizes the historical accuracy of The Information's reporting on model releases, lending credibility to the news. Another commenter jokes about server capacity issues with previous Anthropic models, hinting at scalability concerns that the community hopes will be addressed in the new releases.
- Discussion references The Informationâs credibility in accurately predicting major AI model release timelines, highlighting their prior success in preempting industry news about upcoming models from companies like Anthropic and OpenAI.
- There is anticipation around OpenAI attempting to upstage Google in the near term, referencing previous years where major announcements were closely timed, underscoring the ongoing competitive push among top companies (OpenAI, Google, Anthropic) for model releases and attention.
- A user expresses particular interest in an improved version with greater server capacity, alluding to past server-side bottlenecks with releases like Sonnet 3.7, which have affected model accessibility and user experience.
- Damn ok now this will be interesting (Score: 193, Comments: 41): The image is a tweet highlighting new models from Anthropic (Claude Sonnet and Claude Opus) that can dynamically switch modes between reasoning, tool/database usage, and self-correction. They reportedly possess enhanced code generation capabilities that allow them to test and fix their own outputs. The announcement signals coming releases expected within weeks. A main technical discussion in the comments is concern over prompt length potentially harming model performance, given that more complex mode switching could require much larger system prompts. One user shares anecdotal evidence of dynamic code editing and rapid artifact previewing with what may have been early access, calling it surprisingly powerful.
- A commenter raises concerns about system prompt length and token usage, noting that the introduction of new Anthropic models might lead to significantly larger system prompts ("8000 more tokens"), which could impact model performance or context retention. There's a hope expressed that these models maintain their capabilities even with increased prompt size.
- Another user details their experience with a new code-assist feature, observing artifact previews and graphical glitches occurring during iterative UI changes before the final result is committed. The granular update and commit cycle, including artifact previews, is described as feeling powerful, suggesting a technically advanced or novel implementation that improves user feedback during development.
- There are technical remarks on token consumption, with one user emphasizing that new features are likely to increase token usage significantly, potentially impacting operating costs and efficiency ("token costs gone go wild," "cline is already eating tokens").
- 4.1 model now appearing in web browser under "More models" (Score: 109, Comments: 64): The image documents the rollout of new model variants in the OpenAI ChatGPT web UI, specifically under the "More models" menu. It confirms the presence of "GPT-4.1" and its mini variant ("GPT-4.1-mini"), alongside other models like "GPT-4o" and "o4-mini"; notably, "GPT-4.1" is explicitly labeled as optimal for "quick coding and analysis." The image provides evidence of active deployment and new model differentiation for users, indicating backend updates and an evolving model lineup in OpenAI's product. Commenters note that "4.1-mini" appears to be replacing "4o-mini," and one user highlights that "4.1-mini" is already accessible and reportedly performs well for coding tasks.
- Several users note that "4.1-mini" appears to be replacing the "4o-mini" model in the web interface, suggesting an update or shift in available lightweight models. This impacts users seeking the fastest, most cost-effective options for everyday or embedded use cases.
- Specific feedback highlights that the 4.1 model excels at coding tasks: one user reports successful integration with the Roo platform, indicating immediate developer interest and swift experimentation with new model capabilities.
- Device and app differences are mentioned: Android users may need to update their app to access both 4.1 and 4.1-mini, while some web users only see 4.1 mini so far, suggesting phased rollout or platform-dependent availability.
3. ChatGPT as New Internet Interface and Its Societal Impact
- Last year ChatGPT was the 15th most visited site. Now it's #5, while every other top-10 site is losing traffic (Wikipedia fell 6%). People aren't surfing the web anymore; they're heading straight to ChatGPT. It's not just a tool; it's become the new internet interface, quietly replacing the old web. (Score: 278, Comments: 117): The provided image displays a table ranking the most visited websites, highlighting ChatGPT.com climbing to the #5 position globally in traffic and showing a 13.04% month-over-month increase, in contrast to declines for traditional sites like Wikipedia (-6%). The data indicate a marked behavioral shift where users increasingly bypass conventional search engines or content aggregators, using conversational AI as their primary interface to online information. This underscores ChatGPT's rapid emergence not just as a tool but as a gateway supplanting traditional web navigation. A technically relevant debate in the comments notes the growing user preference for AI assistants over traditional search, motivated by poor web experiences (e.g., intrusive ads, SEO manipulation) and friction in content discovery. Concerns are also raised about the future monetization of ChatGPT (e.g., advertising).
- Several commenters point to the technical decline of traditional search engines, such as Google, attributing the shift to ChatGPT to increased web clutter from aggressive SEO tactics and ad overlays that degrade the actual information retrieval experience. This trend has resulted in users preferring ChatGPT for direct answers, as it bypasses pop-ups and extended irrelevant narratives that plague standard recipes or information sites.
- Discussion highlights the risk of future monetization changes impacting ChatGPT, such as introducing more ads or reduced usability as traffic increases. This could mirror the historical degradation seen in other web platforms once they prioritized advertising revenue streams over user experience.
- Something Strange Is Happening To The Internet (Score: 1945, Comments: 417): The post discusses a significant shift in global web traffic, as shown by a table (see image) ranking top websites by traffic. Google.com leads but is declining (3.18% MoM), while ChatGPT has surged to #5 with +13.04% MoM growth, outpacing Reddit, Amazon, and Whatsapp, and is now the only domain in the top 10 with positive growth. This suggests users increasingly rely on ChatGPT as a primary "interface", potentially bypassing traditional search engines, blogs, or forums, signaling not just growth but a possible paradigm shift in how people access information online. Commenters are skeptical about the novelty, likening the trend to previous surges by Facebook and Google, and joking that AI likely authored the post, hinting at the growing omnipresence and debate over AI's role in reshaping internet consumption, rather than seeing it as truly unprecedented.
- Multiple commenters express concerns about the proliferation of content generated by large language models (LLMs) like ChatGPT, observing that distinctive patterns in posts and comments (e.g., formulaic phrasings such as "it isn't X, it's Y") are strong indicators of AI-generated text and contribute to a perceived decline in authenticity and quality online.
- One participant argues that if LLMs and generative AI contribute to reducing internet traffic and the prevalence of click-driven content economies, it could potentially improve the internet's utility, moving from outrage and clickbait to a model that favors functional, purposeful interactions reminiscent of earlier internet eras.
- The discussion draws parallels between current shifts (including the rise of AI content) and previous changes in major platforms like Facebook, Google, and Twitter, suggesting a pattern where technological or algorithmic shifts fundamentally reshape traffic, engagement, and how online communities form and persist.
- Discussion references The Informationâs credibility in accurately predicting major AI model release timelines, highlighting their prior success in preempting industry news about upcoming models from companies like Anthropic and OpenAI.
- There is anticipation around OpenAI attempting to upstage Google in the near term, referencing previous years where major announcements were closely timed, underscoring the ongoing competitive push among top companies (OpenAI, Google, Anthropic) for model releases and attention.
- A user expresses particular interest in an improved version with greater server capacity, alluding to past server-side bottlenecks with releases like Sonnet 3.7, which have affected model accessibility and user experience.
- DeepMind introduces AlphaEvolve: a Gemini-powered coding agent for algorithm discovery (Score: 1103, Comments: 274): DeepMind announced AlphaEvolve, an automated coding agent leveraging Gemini LLM ensembles (Flash for exploration, Pro for depth) for novel algorithm discovery and optimization (DeepMind blog). AlphaEvolve iteratively generates and tests code solutions, yielding performance gainsâe.g., a
-
Damn ok now this will be interesting (Score: 193, Comments: 41): The image is a tweet highlighting new models from AnthropicâClaude Sonnet and Claude Opusâthat can dynamically switch modes between reasoning, tool/database usage, and self-correction. They reportedly possess enhanced code generation capabilities that allow them to test and fix their own outputs. The announcement signals coming releases expected within weeks. A main technical discussion in the comments is concerns over prompt length potentially harming model performance, given more complex mode switching could require much larger system prompts. One user shares anecdotal evidence of dynamic code editing and rapid artifact previewing with what may have been early access, calling it surprisingly powerful.
- A commenter raises concerns about system prompt length and token usage, noting that the introduction of new Anthropic models might lead to significantly larger system prompts (â8000 more tokensâ), which could impact model performance or context retention. Thereâs a hope expressed that these models maintain their capabilities even with increased prompt size.
- Another user details their experience with a new code-assist feature, observing artifact previews and graphical glitches occurring during iterative UI changes before the final result is committed. The granular update and commit cycle, including artifact previews, is described as feeling powerful, suggesting a technically advanced or novel implementation that improves user feedback during development.
- There are technical remarks on token consumption, with one user emphasizing that new features are likely to increase the token usage significantly, potentially impacting operating costs and efficiency (âtoken costs gone go wild,â âcline is already eating tokensâ).
-
4.1 model now appearing in web browser under âMore modelsâ (Score: 109, Comments: 64): The image documents the rollout of new model variants in the OpenAI ChatGPT web UI, specifically under the âMore modelsâ menu. It confirms the presence of âGPT-4.1â and its mini variant (âGPT-4.1-miniâ), alongside other models like âGPT-4oâ and âo4-miniâ; notably, âGPT-4.1â is explicitly labeled as optimal for âquick coding and analysis.â The image provides evidence of active deployment and new model differentiation for users, indicating backend updates and evolving model lineup in OpenAIâs product. Commenters note that â4.1-miniâ appears to be replacing â4o-mini,â and one user highlights that the â4.1-miniâ is already accessible and reportedly performs well for coding tasks.
- Several users note that â4.1-miniâ appears to be replacing the â4o-miniâ model in the web interface, suggesting an update or shift in available lightweight models. This impacts users seeking the fastest, most cost-effective options for everyday or embedded use cases.
- Specific feedback highlights that the 4.1 model excels at coding tasksâone user reports successful integration with the Roo platform, indicating immediate developer interest and swift experimentation with new model capabilities.
- Device and app differences are mentioned: Android users may need to update their app to access both 4.1 and 4.1-mini, while some web users only see 4.1 mini so far, suggesting phased rollout or platform-dependent availability.
3. ChatGPT as New Internet Interface and Its Societal Impact
- Last year ChatGPT was the 15th most visited site. Now itâs #5, while every other top-10 site is losing traffic (Wikipedia fell 6%). People arenât surfing the web anymoreâtheyâre heading straight to ChatGPT. Itâs not just a tool; itâs become the new internet interface, quietly replacing the old web. (Score: 278, Comments: 117): The provided image displays a table ranking the most visited websites, highlighting ChatGPT.com climbing to the #5 position globally in traffic and showing a 13.04% month-over-month increase, in contrast to declines for traditional sites like Wikipedia (-6%). The data indicate a marked behavioral shift where users increasingly bypass conventional search engines or content aggregators, using conversational AI as their primary interface to online information. This underscores ChatGPTâs rapid emergence not just as a tool but as a gateway supplanting traditional web navigation. A technically relevant debate in the comments notes the growing user preference for AI assistants over traditional search, motivated by poor web experiences (e.g., intrusive ads, SEO manipulation) and friction in content discovery. Concerns are also raised about the future monetization of ChatGPT (e.g., advertising).
- Several commenters point to the technical decline of traditional search engines, such as Google, attributing the shift to ChatGPT to increased web clutter from aggressive SEO tactics and ad overlays that degrade the actual information retrieval experience. This trend has resulted in users preferring ChatGPT for direct answers, as it bypasses pop-ups and extended irrelevant narratives that plague standard recipes or information sites.
- Discussion highlights the risk of future monetization changes impacting ChatGPT, such as introducing more ads or reduced usability as traffic increases. This could mirror the historical degradation seen in other web platforms once they prioritized advertising revenue streams over user experience.
- Something Strange Is Happening To The Internet (Score: 1945, Comments: 417): The post discusses a significant shift in global web traffic, as shown by a table (see image) ranking top websites by traffic. Google.com leads but is declining (
3.18%
MoM), while ChatGPT has surged to #5 with a+13.04%
MoM growth, outpacing Reddit, Amazon, and Whatsapp, and is now the only domain in the top 10 with positive growth. This suggests users increasingly rely on ChatGPT as a primary âinterfaceâ, potentially bypassing traditional search engines, blogs, or forums, signaling not just growth but a possible paradigm shift in how people access information online. Commenters are skeptical about the novelty, likening the trend to previous surges by Facebook and Google, and joking that AI likely authored the post, hinting at the growing omnipresence and debate over AIâs role in reshaping internet consumption, rather than seeing it as truly unprecedented.- Multiple commenters express concerns about the proliferation of content generated by large language models (LLMs) like ChatGPT, observing that distinctive patterns in posts and comments (e.g., formulaic phrasings such as âit isnât X, itâs Yâ) are strong indicators of AI-generated text and contribute to a perceived decline in authenticity and quality online.
- One participant argues that if LLMs and generative AI contribute to reducing internet traffic and the prevalence of click-driven content economies, it could potentially improve the internet's utility, moving from outrage and clickbait to a model that favors functional, purposeful interactions reminiscent of earlier internet eras.
- The discussion draws parallels between current shifts (including the rise of AI content) and previous changes in major platforms like Facebook, Google, and Twitter, suggesting a pattern where technological or algorithmic shifts fundamentally reshape traffic, engagement, and how online communities form and persist.
AI Discord Recap
A summary of Summaries of Summaries by gpt-4.1-2025-04-14
1. Model Benchmark Showdowns and Coding Performance
- Sonar Models Sweep Benchmarks, GPT-4.1 Steals Coding Crown: Sonar Pro Low crushed Claude 3.5 Sonnet on BrowseComp with 4.0% accuracy (nearly 50% higher) and boasted up to 3x faster, more consistent latency, while Qwen 3 8B and GPT-4.1 received widespread praise for coding and reasoning tasks (Perplexity AI, Unsloth AI).
- Community consensus across multiple Discords is that GPT-4.1 is the new coding king, with users trashing O3 for code but lauding it for planning/research, and Qwen 3 models outperforming Gemma 3 at similar sizes, especially after fine-tuning.
- Gemini 2.5 Pro and O4 Mini High Ignite Coding Rivalry: Gemini 2.5 Pro wowed users with C++ coding prowess, described as a dream come true, while O4 Mini High was called a coding beast for its fast, high-quality completions across large codebases (LMArena, OpenAI).
- Despite some hallucination complaints, users consider Gemini 2.5 Pro and GPT-4.1 as top-tier for coding, with Claude 4 + O3 Pro anticipated as an insane combo once released.
2. Distributed and Decentralized Training/Inference
- Psyche Network Powers Decentralized LLM Training: Nous Research launched the Psyche Network, a decentralized training platform coordinating global GPUs via custom peer-to-peer networking and DisTrO optimizers, aiming to pretrain a 40B parameter LLM on a dataset mixing FineWeb (14T), FineWeb-2 (4T), and The Stack v2 (1T).
- The testnet quickly filled 500k slots in 40 minutes, and users can contribute USDC for compute, with open forums driving model design and a GitHub repo available for community contributions.
- Lost in Conversation: LLMs Tank on Multi-Turn Tasks: The Lost in Conversation paper found LLMs suffer a 39% performance drop in multi-turn conversations versus single-turn, with unreliability stemming from premature solution attempts and poor error recovery (GitHub repo).
- This exposes a major weakness for distributed agentic systems and highlights the need for improved error correction and conversational memory mechanisms.
3. Hardware and Performance Optimizations
- PCIE 5.0, CUDA Strides, and PyTorch Nightly Perks: Upgrading to PCIE 5.0 boosted token generation speed from 26 tkps to 38 tkps on a 50-series GPU, while PyTorch devs recommend `needs_exact_strides` over `needs_fixed_stride_order` for more reliable tensor ops in nightly builds (LM Studio, GPU MODE).
- CUTLASS 4.0, CuTe DSL, and Kernel Kung Fu: CUTLASS 4.0 and CuTe DSL are out (`pip install nvidia-cutlass-dsl`), with Jupyter notebook examples and Python 3.12 support, though the release versioning appears "borked" and the MLIR compiler is not yet open source (CUTLASS Notebooks).
- Custom CuTe kernels outperformed PyTorch by 60x on large problems, new kernel debugging tips (Nsight Compute, nsys, ncu) were shared, and reference kernel PRs improved leaderboard runtimes (PR #31).
4. Prompt Engineering, Tokenization, and Memory Mishaps
- Tokenization Woes: Gemma, BOS Tokens, and PromptTemplates: GemmaTokenizer in Torchtune was caught mismatching output tokens with HFModelTokenizer due to missing PromptTemplates and multiple BOS tokens from config errors, tracing the blame back to HF/Google's tokenizer config (Torchtune).
- Discussions stressed the need for template and config alignment, and that even technically "correct" implementations can be functionally flawed for real-world LLM usage.
- LlamaIndex Memory Gets a DB Makeover: LlamaIndex launched a Memory component for agentic workflows, supporting in-memory and scalable DB backends (SQLite, PostgreSQL), with debate over context serialization vs. DB for long chat histories.
- For large-scale or structured history, a DB is recommended over plain serialization, with users comparing the tradeoffs for persistent context in LLM-powered agents.
Discord: High level Discord summaries
Perplexity AI Discord
- Sonar Models Demolish Benchmarks: Sonar Pro Low outperformed Claude 3.5 Sonnet on BrowseComp, with Sonar Low also outperforming Claude 3.7 Sonnet, according to recent benchmark evaluations.
- Sonar Pro Low achieved 4.0% accuracy on BrowseComp, almost 50% higher than Claude 3.5 sonnet, and both Sonar and Sonar Pro delivered up to 3x faster response times with more consistent latency.
- Deep Research Launch Timing Speculated: Members speculated whether Perplexity AI's new deep research feature will be released before or after the Comet project, one stating that deep research should be first.
- Another member hoped that Comet would be more than just a browser with a copilot style side bar, referencing a thinking batman gif.
- Merlin AI's Pricing Displeases Users: Users discussed Merlin AI's pricing, with one noting its shady pricing due to unclear rate limits and another sharing that support gave a shitty response when asked, while praising its web search quality.
- It was also noted that for standard paid accounts on Merlin AI, any usage that surpasses $16 per day will also lead to the immediate termination of service for that day.
- Perplexity Pro Unlocks API Access: Perplexity Pro actually includes $5/month in API credits!
- You can find the API docs and get started at Perplexity AI docs; registering via a credit card is required, but users will not be charged if they keep their usage at or below $5.
- Perplexity Projects Sparks Vibing Concerns: Speculation arose around Perplexity's new project feature, with users sharing screenshots and videos, like this one, and some expressing concerns about its potential misuse for vibe coding.
- There were also questions about how certain users gained early access to the feature, sparking discussion on testing catalogs and potential connections within the AI community.
LM Studio Discord
- Embedding Modules Hit File Size Limit: Users are encountering a "Maximum file size per conversation reached" error with embedding modules in LM Studio, capped at 31.46 MB when using the default nomic-ai/nomic-embed-text-v1.5-GGUF model.
- A user seeks to process files of a few hundred MB for audio-to-lyrics generation, suggesting a need for larger embedding module sizes.
- LM Studio's Log Spam is Termed Benign: A reported log spam issue in LM Studio is considered benign and has been addressed in the 0.3.16 build 2.
- This fix should alleviate concerns about potential issues when running embedding models.
- LM Studio JIT Loading Gets Glitchy: LM Studio's JIT model loading via the Continue VS Code plugin sometimes serves clients with the wrong model if another is already loaded.
- Users were advised to check model identifiers using `lms ls` and remove 8bit from the configuration to resolve mismatches.
- Devs Advised Against Building LLMs From Scratch: Members debated building LLMs from scratch, with most advising against it due to massive compute costs and data needs, recommending fine-tuning instead, and sharing the "Build a Large Language Model from Scratch" Manning book for theoretical knowledge.
- A member expressed interest in building a model for various use cases.
- PCIE 5.0 Gives Performance Jump: A user reports a performance increase after upgrading to a 50 series card, going from 26 tkps to 38 tkps using qwen3-14b-q4km with a 4096 context, attributing it to PCIE 5.0 benefits.
- The user highlights the advantage of installing the card in the bottom slot without interference due to the shorter PCIE connector design.
LMArena Discord
- Gemma "Cutiepie" Incoming?: Members speculated about a new Gemma model, tentatively named Cutiepie, possibly timed for Google I/O.
- Some members are trying to determine whether Gemma is free without requiring a login.
- DeepSeek R2 still under wraps: Speculation around DeepSeek R2 has cooled, but it remains in development using Huawei GPUs and Chinese funding.
- There's internal pressure to release something impressive, especially given the current models' strength in reasoning traces and multilingual chain of thought.
- Gemini 2.5 Pro Impresses with C++: Members laud Gemini 2.5 Pro's coding prowess, especially with C++, with some describing it as a dream come true.
- Despite the enthusiasm, some users report models are suffering from hallucinations, while others consider it superior to o3.
- O3 Pro stuck in development?: The release of o3 Pro is delayed, and some suggest OpenAI is strategically waiting for other labs to reveal their cards to take the lead.
- It is internally believed that OpenAI already has o4 but is holding it back for strategic reasons, and there is an expectation of an insane combo with Claude 4 + o3 pro.
- GPT 4.1 hailed as coding genius: Members highlighted GPT 4.1's coding capabilities, noting its quick compilation and error-fixing on large codebases, alongside instant response times.
- One member found GPT 4.1 a significant upgrade from GPT 4o mini for free users, but felt that 4.1 is solely for coding.
Unsloth AI (Daniel Han) Discord
- File Usage Fixes are Frustrating: Users wrestled with an `AttributeError` when using file:// for image loading, initially suspecting an issue with how the image variable was being set to None.
- Despite path corrections, the error persisted, prompting consideration of URLs as an alternative, with one user exclaiming image finetuning gives me a headache.
- Qwen3 Inference Speed Slowdown: A user reported slow inference speeds with Qwen/Qwen3-0.6B on an A100, achieving only 30 tokens/second with a batch size of 1, using Unsloth's `FastLanguageModel.from_pretrained()`.
- Others suggested that the small model size and batch size might negate the benefits of `load_in_8bit=True`, while another user noted the base model does not contain a chat template in the `tokenizer_config.json`.
- O3 Outperforms, Except on Coding: Members trashed O3 for outputting garbage code, suggesting it's better for planning and research with tool calls while GPT 4.1 is the best coding model, available in Github Copilot for those with the edu account.
- It was suggested that O4 mini high is better for coding, with one member stating O3 is fucking garbage for coding.
- Qwen3: Benchmarks Boss Gemma: It was observed that the Qwen 3 models outperform Gemma 3 at similar sizes, with the Qwen 3 8b model achieving near SOTA with fine tuning.
- It's considered a perfect sweet spot for local llms; one member declared that the Qwen 3 8b model is CRAZY good.
- Repo2txt vs RAG: A member advocated for repo2txt.com as a superior alternative to RAG for code injection, selecting files from a repo and injecting it directly into the prompt.
- They argued that models can't read all code automatically in github and make mistakes.
OpenAI Discord
- GPT-4.1 Debuts and Impresses in Coding: GPT-4.1, a model specialized for coding and instruction following, is now available in ChatGPT for Plus, Pro, and Team users, and will be available for Enterprise & Edu users in the coming weeks.
- GPT-4.1 mini is replacing GPT-4o mini in ChatGPT for all users, with safety evaluation details available in the Safety Evaluations Hub.
- O3 Unlimited Tempts Users: Users rave about the unlimited O3 model, calling it a legendary model for solving issues, with new support for deep research via GitHub Repos, OneDrive, and SharePoint, and 20 file uploads per Chat/Project.
- Despite the Teams plan offering desirable internal knowledge features, members suggest sticking with the Pro plan because of the benefits of unlimited O3.
- PII Data Guardrails Provoke Probes: Members reported challenges with PII guardrails blocking home address requests from HR data-connected apps, especially those with access to HR data.
- They suggested that users should contact OpenAI support for guidance on handling sensitive data requests and adhering to PII policies.
- AI Universe Simulator Takes Shape: A member is building a 1:1 simulation of the universe using AI to explore thinking and create automated ecosystems, aiming to scale it into a web browser.
- The focus is on open communication lines between models, prioritizing efficiency over stacking models.
- Ollama Gains Traction Over Windsurf: Users discussed local model inference with Ollama as an alternative to services like Windsurf, cautioning against paying for such services for AI app development.
- The recommendation is to focus on learning prompting to avoid API costs and use Ollama with VS Code extensions like Continue and Roo Code, while also pursuing LLM and Agentic courses on Hugging Face Learn.
Cursor Community Discord
- GPU Power Peer-to-Peer Considered: Members pondered open-source options for utilizing GPU power in peer-to-peer or decentralized systems.
- The discussion yielded no definitive solutions.
- Cursor's 20% Markup Debated: Users debated the justification of Cursor's 20% surcharge over actual API pricing.
- Some considered it a good deal whereas another claimed to save $600 per month by using Claude Max outside of Cursor.
- Gemini 2.5 Pro Joins the Model Lineup: The community noticed the addition of Gemini 2.5 Pro to Cursorâs available models on May 6th.
- Some noted that the new model finally fixed the "i will code now (stops coding)" issue.
- Mono-Repo Methodology Maneuvers: Users discussed managing multi-repo projects, with one suggesting consolidating into a single monorepo to avoid context fragmentation.
- Another user mentioned using separate workspaces within the same parent folder in Cursor 0.50.x.
- Background Agent Begs for Confirmation: Users voiced frustration that the background agent requires excessive confirmation, increasing fast request usage.
- One user complained the background agent just keeps asking for confirmation to do things and its just eating fast request up faster.
Yannick Kilcher Discord
- AI Regulation Faces Decade-Long Freeze: House Republicans inserted language into the Budget Reconciliation bill that would block all state and local governments from regulating AI for 10 years, potentially impacting privacy regulations (source).
- The provision, introduced by Representative Brett Guthrie of Kentucky, broadly prohibits state or local enforcement of laws regulating AI models or systems.
- AlphaEvolve Rediscovered State-of-the-Art: DeepMind's AlphaEvolve, a Gemini-powered coding agent, combines LLM creativity with automated evaluators to evolve algorithms for math and practical applications (DeepMind Blog).
- The system rediscovered state-of-the-art solutions in roughly 75% of cases and improved the previously best known solutions in 20% of cases, even advancing the kissing number problem.
- RL-Diffusion Sparks Patentability Skepticism: Members discussed developing RL-Diffusion, combining forward and backward processes into one controlled by RL, but some expressed skepticism about its novelty and practical implementation.
- They emphasized that transformation into a patent-eligible application requires more than simply stating the abstract idea while adding the words "apply it."
- GSM8K Benchmark Nears Perfection: Language models are approaching near-perfect accuracy on grade-school level math benchmarks like GSM8K, with detailed analysis available in this paper and this blogpost.
- The paper seeks to determine whether language models truly develop reasoning skills or are simply memorizing templates.
- Debate Brews on LLM Planning Abilities: Members debated whether LLMs formulate plans before generating solutions, based on claims that models avoid unnecessary computations.
- One member countered that the model learns that unnecessary things are random noise which by definition have no signal for the model to learn, so it ignores them.
OpenRouter (Alex Atallah) Discord
- Personality Platform Invites Chatbot Customization: A new chatbot platform called Personality has launched, aiming to provide more customization and less filtering than existing solutions like c.ai, offering users the ability to create and roleplay with multiple characters at personality.gg.
- The platform also provides free image generation at personality.gg/playground, though it's important to note that this feature is not powered by OpenRouter.
- OpenAI Reasoning Model Names Spark Confusion: Users are requesting naming consistency in OpenAI's reasoning models (e.g., `/openai/o4-mini-high`) to include reasoning level variants for all models, as documented in the OpenAI documentation.
- The primary goal is to ease evaluation across different reasoning models and reduce confusion about which models can do what.
- Free Google Models throttled: Users are reporting extremely low or nonexistent rate limits with free Google models, even with available credits, prompting recommendations for alternatives like DeepSeek V3.
- Concerns have also surfaced about the potential removal of free routes for Gemini following a change shared on Twitter.
- Claude's System Prompt Causes OpenRouter Differences: Discrepancies in helpfulness between using Claude via OpenRouter versus the native Anthropic website are attributed to the extensive system prompts used by Anthropic, comprising around 16000 tokens.
- Users can manually implement the system prompt, available on GitHub, which includes tools.
- "Always Use This Key" option defaults to specified key: A new option labeled "Always use this key" was introduced, causing confusion because of a similar but different option labeled "Use this key as a fallback".
- The new option exclusively uses the specified key and prevents fallback to OpenRouter, which represents a change from the behavior of the older fallback setting.
Manus.im Discord Discord
- Manus Credits Glitch and Time Zone Tick Tock: Several users reported their daily 300 credits on Manus not refreshing at 00:00 GMT, likely due to time zone processing inconsistencies.
- One user noted their credits refresh at 8:00pm in their timezone, highlighting the timing discrepancies.
- Invitation Code Frenzy and origins: A user boasted a glitched account with 100 invitation codes and shared multiple invitation links, triggering discussion about their origin.
- Speculation arose that the codes came from paid subscriptions, while others questioned their value to existing members since new users get 500 credits.
- Refund Woes for Failed Jobs: A user expressed frustration over failing to get a refund after a job failure consumed 800 credits on Manus.
- Other users stated that refunds are generally not provided, even when the service malfunctions, with one suggestion to dispute the charge.
- Facebook Marketplace Scams get Rapped: A user requested Manus to generate a rap about Facebook Marketplace lowballing, employing slang such as best price, last price, and mates rates.
- The user clarified that the request was not for advertising purposes but to rap about scenarios and experiences related to the online marketplace.
GPU MODE Discord
- needs_exact_strides trumps needs_fixed_stride_order!: In PyTorch nightly, `at::Tag::needs_exact_strides` is superior to `needs_fixed_stride_order` due to the latter's occasional inaccuracies (see the sketch after this list).
- One developer moved the `.contiguous` calls in the C++ code so `torch.compile` can't interfere.
- Matrix Magic with PTX Instructions: A member shared a blogpost detailing how to efficiently load and store matrices within a warp using PTX instructions, including the `ldmatrix` instruction.
- They also linked the associated code, PTX documentation, and a LinkedIn post explaining the `stmatrix` instruction.
- Reference Kernel Gets a Jolt of Lightning!: A pull request has been merged to mitigate the long reference time issue, aiming to improve run times when the main bot is updated.
- The update addresses concerns about the reference implementation taking too long, especially when faster implementations require numerous runs to meet termination criteria.
- CUTLASS 4.0 and CuTe DSL released, but Borked?!: CUTLASS 4.0 and CuTe DSL are now released, accessible via `pip install nvidia-cutlass-dsl`, and NVIDIA recommends starting with the Jupyter notebooks.
- Members noted the `nvidia-cutlass-dsl` is version `0.0.0....` which was released like 2 months ago according to PyPI, so something seems borked with the release.
- Factorio Blueprints Drafted by Genetic Algorithm: A member is planning to create a genetic algorithm that generates Factorio blueprints based on specified requirements like building materials, input/output locations, and area constraints, and found a paper on genetic programming for dynamic path-finding.
- The algorithm aims to enable LLMs to provide constants for fulfilling these requirements, serving as a tool for dynamic factory design.
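Referencing the strides discussion above, here is a minimal sketch of how a custom op can be tagged so the compiler respects its stride requirements. The op itself is hypothetical, and `torch.Tag.needs_exact_strides` is assumed to be exposed in recent nightlies as the summary suggests; `needs_fixed_stride_order` is the older, looser tag.

```python
# Hypothetical custom op tagged to require exact input strides, so
# torch.compile cannot silently permute layouts behind its back.
import torch

torch.library.define(
    "mylib::scale",
    "(Tensor x) -> Tensor",
    tags=[torch.Tag.needs_exact_strides],  # nightly tag discussed above (assumed available)
)

@torch.library.impl("mylib::scale", "CompositeExplicitAutograd")
def scale(x: torch.Tensor) -> torch.Tensor:
    # Can now assume x arrives with exactly the strides the caller produced.
    return x * 2.0
```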
aider (Paul Gauthier) Discord
- Gemini 2.5 Pro Rewrites Huge Files: Users find that Gemini 2.5 Pro, used via OpenRouter with `--edit-format diff-fenced`, sometimes rewrites entire large files for minor changes, while others report that AI Studio provides faster results.
- Some prefer Sonnet 3.7 for their workflows, using cheaper models for simple tasks and Sonnet 3.7 for more complex architecture.
- Common Lisp Gets Modern AI Tooling: Users discussed leveraging existing models to enhance development in languages like Common Lisp, planning to utilize books and data sources to generate datasets and prompts for in-context learning, with one planning to use a Lisp DSL to build a compiler/interpreter.
- The approach involves LoRA-ing small models and employing semantic retrieval to integrate programming book knowledge into the context window.
- Gemini Adds Comments and Stupid Ideas: A user observed that Gemini was adding excessive comments and unwanted ideas directly into the code, which it then executed.
- The user suggested enforcing strict code changes without incorporating unsolicited suggestions.
- Aider Upgrade Still Troublesome: Users are still facing problems when upgrading Aider, where the version number fails to update even when the upgrade process seems successful.
- The SSL warning during the upgrade is likely unrelated and has been a persistent issue since January.
- Aider Gets an Aussie Chat Makeover: A user discovered a method to improve the readability of Aider's replies by modifying the `~/.aider.conf.yml` file.
- They recommended using `chat-language: English (Australia, use headings, bullet points, concise sentence fragments)` to achieve a more concise output.
Eleuther Discord
- LM-Eval-Harness Gets Datasets Quicker: Users can now download datasets for specific tasks in lm-eval-harness with `python3 lm_eval --model dummy --tasks [task_list] --limit 1` without immediately evaluating a model.
- The `dummy` model, which is defined here and returns random numbers, is used for testing purposes.
- R1-distill Gets Prompted Formally: Members debated whether to prompt R1-distill models with a "user: What is xyz? assistant:" format versus a plain "What is xyz?".
- The debate ended without resolution.
- LLMs Facing Regulation Heat: Members discussed regulations concerning bias standards in algorithms, particularly in the context of LLMs, mentioning regulatory agencies like the NCUA, EEOC, FDIC, HHS, and FTC.
- Regulators may view algorithms that can't be studied as infringing, based on the archived EEOC guidance.
- Skywork's Reasoning Abilities in Focus: The Skywork model and its techniques were lauded after its release, with a link provided to the Skywork Open Reasoner Series.
- One member noted that Skywork normalizes by the total tokens in the training batch rather than per sequence.
- lm-eval's Multi-GPU Woes: A member reported uneven GPU utilization when using `parallelize=True` in lm-eval, with GPU 0 at 100% and GPU 1 at 0%.
- Another member suggested using vLLM tensor parallel as it is more reliable, and suggested `accelerate launch -m lm_eval ...` for running multiple replicas instead of `parallelize`, which uses naive pipeline parallelism (a sketch of the vLLM route follows below).
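For readers who want to try the vLLM tensor-parallel route, here is a minimal sketch using lm-eval-harness's Python API. The model name is a placeholder, and the argument names follow the harness's documented conventions rather than anything quoted in the thread.

```python
# Minimal sketch: evaluating with the vLLM backend split across 2 GPUs,
# avoiding the naive pipeline parallelism of parallelize=True.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=meta-llama/Llama-3.1-8B,tensor_parallel_size=2",  # placeholder model
    tasks=["hellaswag"],
)
print(results["results"])
```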
Nous Research AI Discord
- Atropos Gets Axolotl Support: Nous Research released v0.2.0 of Atropos, featuring integration with Axolotl as an official trainer partner and a usage guide.
- The update includes new environments, updated API handling, and better TRL support.
- Psyche Network Democratizes Training: Nous Research launched the Psyche Network, a decentralized training network intended to democratize AI development by bringing together distributed compute resources for training large-scale models.
- The testnet launch involves pre-training a 40B parameter LLM using an MLA Architecture and a dataset comprising FineWeb (14T), FineWeb-2 (4T), and The Stack v2 (1T).
- DisTrO Optimizer Shatters Bandwidth Ceiling: The Psyche network utilizes Nous's DisTrO optimizers and a custom peer-to-peer networking stack to coordinate globally distributed GPUs, overcoming previous bandwidth constraints in AI training.
- Members can contribute USDC for compute.
- Multi-Turn Conversations Trip Up LLMs: The Lost in Conversation paper and its corresponding GitHub repo analyze LLM performance in multi-turn conversations versus single-turn settings, revealing a 39% average performance drop across six generation tasks in multi-turn scenarios.
- The paper attributes this to a minor loss in aptitude and a significant increase in unreliability, concluding that when LLMs take a wrong turn in a conversation, they get lost and do not recover.
- Benchmarks Flounder, Frontier Models Fumble: A member finds it very hard to feel out which frontier model is better at different tasks because benchmarks are not granular or diverse enough, and shares a link.
- They state that the "best" coding model might still be terrible at front end, or xyz framework, data viz etc.
Notebook LM Discord
- NotebookLM Courts UX Feedback: The NotebookLM team seeks user input on multilingual Audio Overviews, inviting users to participate in User Experience studies to help improve the feature.
- Users are encouraged to provide feedback to improve the multilingual audio features within NotebookLM.
- Invisible Sun TTRPG Shines with NotebookLM: A user is learning the Invisible Sun TTRPG by Monte Cook Gaming, using NotebookLM and ChatGPT Projects for rules lookup and has cited the shareability factor as a reason to prefer NotebookLM.
- They are planning to test NotebookLMâs insights on a new book coming via Backerkit.
- Google User Jitters Over Potential NotebookLM Sunset: A user voiced concerns about NotebookLM being discontinued, drawing on Google's history of product shutdowns and fearing a shutdown at an inconvenient time.
- Others argued that NotebookLM's unique value makes its sunset unlikely, suggesting a potential rebrand or marketing initiative instead.
- PDF Upload Restrictions Irk Users: Multiple users reported experiencing account restrictions preventing them from uploading PDFs to NotebookLM.
- The discussion provided no clear resolution to the problem.
- Podcast Length Hack Surfaces: A user inquired about extending podcast length, and another suggested padding the podcast with repeated links or documents on the same topic to reach a 22-minute duration.
- It remains unconfirmed whether this strategy is universally effective.
Latent Space Discord
- OAI Launch Stories Shared: A member shared very wholesome stories of OpenAI launches from andrewmayne.com.
- The author reminisced about the early days and scaling challenges.
- ChatGPT Scaling Deconstructed: The community shared a link to a newsletter article titled Building, launching, and scaling ChatGPT by the Pragmatic Engineer.
- The article goes over the history and tech stack of the ChatGPT launch.
- AI Founder in Residence Seeks Role: A member asked about Founder in Residence programs focused on AI, seeking advice on how to position themselves, as they have experience building AI systems for Analytics use cases in Amazon ads and want to build Self-Serve Agents in the same analytics space.
- No additional details were given.
- Gemini Powers Algorithm Design with AlphaEvolve: Google DeepMind introduced AlphaEvolve, a coding agent powered by Gemini designed for creating advanced algorithms.
- This could be a pivotal moment in algorithm design as it shows how coding agents can be harnessed.
- Prof Tom Yeh to walk through the Evolution of Llama 1/2/3/4: Prof Tom Yeh will walk through the Evolution of Llama 1/2/3/4 in one session at a special event.
- The event is organized by a member of the community.
HuggingFace Discord
- Qwen Model Distillation Remains Elusive: A member sought resources for distilling the Qwen family of models but no specific notebooks or references were shared.
- Community members may have leads on this topic, so further exploration may be useful.
- Perceptron Visualizers Captivate Community: A member shared a perceptron visualizer for educational purposes, showcasing its functionality in attached videos My_Video_0079.mp4 and My_Video_0080.mp4.
- Another member contributed to the visualization collection from darkspark.dev.
- Stable Diffusion Spins Locally!: Community members explored running Stable Diffusion locally using combinations of Diffusers and TGI, or with WebUI Forge (GitHub link) or reForge (GitHub link).
- Links to Diffusers documentation (huggingface.co, huggingface.co/learn) were also helpful for setup.
- PDF Format Ranks Low in Popularity Contest: Users voiced strong criticism of the PDF format, with one describing it as the worst format ever seen.
- A user proposed adding a markdown output option to improve semantic relationships for RAG ingestion, but others raised concerns about full categorization issues, particularly with tables.
- Smolagents Framework Falls Flat: A user reported using the smolagents framework with Qwen and tools used in the course, citing terrible results.
- This may reflect a need for further refinement of the framework, or alternative frameworks for similar tasks.
MCP (Glama) Discord
- Authpython APIs Lag Typescript's: The community noted that Authpython generally lags behind Typescript by about 1-2 months in API updates, but a link to a Go-MCP client was shared for reference.
- This could impact development timelines and integration efforts for projects relying on the latest features.
- Debug Smithery MCP Servers with ithena-cli: Members suggested debugging an MCP server running on Smithery using the ithena-cli tool.
- The tool stores all input and output for debugging purposes, providing a detailed log of interactions to help identify issues in Claude Desktop.
- Tiny Agents Embrace Remote MCP Support: Hugging Face Tiny Agents now features remote MCP Support, enabling connections to both SSE and Streaming HTTP servers from the command line.
- This enhancement provides a versatile approach to agent development and management, extending the capabilities of Tiny Agents in networked environments.
- Chatter: A Web-Hosted, MCP-Enabled Client Emerges: A new LLM-provider-agnostic, MCP enabled, web-hosted chat client named chatter has been open sourced and hosted at moopoint.io.
- Aimed as a web alternative to Claude Desktop, the client promises a free tier, memory, MCP server hosting, image handling, file uploads, and voice interaction soon.
- Yarr MCP Servers make landfall: New community implementations of ARRs MCP servers have landed on GitHub.
- This was mentioned on X (formerly Twitter) by a community member
Torchtune Discord
- Torchtune Models Hitch a Ride with vLLM: A member reported running a custom Torchtune model with vLLM in their internal version of GRPO.
- They are considering making their implementation public, after a user inquired about enabling vLLM support for their model.
- vLLM Integration Gets Synchronized: A member suggested creating a synchronous GRPO recipe with vLLM, in addition to an asynchronous version.
- They stated a strong preference for the vLLM version, saying they genuinely don't see any reason not to.
- Gemma Tokenizer Strays from HFModelTokenizer: A member found that the HFModelTokenizer with the Gemma chat template produces output tokens that do not match the torchtune GemmaTokenizer tokens.
- This indicates that torchtune's GemmaTokenizer may not be correctly applying the chat template.
- Gemma PromptTemplate Goes MIA: It was noted that a specific PromptTemplate for Gemma is missing, which leads to incorrect tokenization and potential problems with the `system` role.
- While the default might be to use the Alpaca template, a correct Gemma-specific template is crucial.
- BOS Tokens Error Injected from HF/Google's Config: The HF tokenizer adds multiple beginning-of-sequence (BOS) tokens because the configuration has `"add_bos_token": true` alongside a BOS token in the chat template.
- This issue comes directly from HF/Google's tokenizer config, making the implementation technically "correct" but functionally flawed (a quick check for the duplicated BOS follows below).
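As an illustrative check (not from the thread itself) for the duplicated-BOS behavior described above, one can tokenize a chat-templated message with a Hugging Face tokenizer and inspect the leading ids; the model name here is just an example.

```python
# Illustrative check for the double-BOS issue: if the chat template already
# emits <bos> and the config also sets add_bos_token=true, the first two ids
# can both be the BOS token.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2b-it")
ids = tok.apply_chat_template([{"role": "user", "content": "hi"}])
print(ids[:3], "bos_token_id =", tok.bos_token_id)
# Two leading bos_token_id values would reproduce the bug described above.
```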
Modular (Mojo 🔥) Discord
- Variant SIMD Bug Triggers Segfaults in Mojo: A user discovered a crash with `Variant` when employing `SIMD` types in Mojo, with a segfault occurring between print statements when a `Variant[T](simd)` is used; the issue appears to stem from insufficient space allocation within `Variant` or a lifetime issue.
- A reproducible example was shared on GitHub issue 4578, showcasing the bug's erratic behavior relative to print statement locations.
- Register Passable Types Face Scrutiny: Doubts have surfaced regarding the use of `register_passable` types exceeding system register sizes in Mojo, potentially leading to miscompilations due to LLVM's limitations.
- The current `Variant` implementation may be flawed for register passable types `T` where `sizeof[T]()` surpasses system register sizes, suggesting replacement with various `Trivial` versions.
- Mojo Integrates with Colab: It is now simpler to compile and execute Mojo code within a Colab notebook cell using `import max.support.notebook`, which introduces a `%%mojo` magic command.
- The announcement was detailed on the Modular forums.
tinygrad (George Hotz) Discord
- WebGPU Backend Catches a Bug: The WebGPU backend has a bug where the generated kernel doesn't have consecutive DEFINE_GLOBAL args, causing issues with `bufs_from_lin`, details here.
- claude reportedly fixed it.
- BEAM Parameter Busts WebGPU Performance: Setting the BEAM parameter negatively impacts WebGPU backend performance: running at 30ms with no beam, but 150ms with BEAM=1.
- It runs at 100ms with BEAM=2.
- Tinybox UI goes Minimalist: A minimalist UI concept for tinybox, featuring no login, no cloud, no fluff, emphasizing fast, local hardware control was built and showcased here.
- An HTTP settings page for tinybox is generally supported, given it maintains 0 deps and absolute minimal line count.
- Blake3 Bountiful for Tensor Storage: There is a bounty for a high-performance blake3 implementation to use for content-addressable tensor storage in the cloud.
- The implementation should be general-purpose.
LlamaIndex Discord
- LlamaIndexâs Memory Component Boosts AI: LlamaIndex rolled out a new Memory component that gives AI agents short and long-term memory, improving context in conversations and enabling static memory blocks (link).
- A user reported challenges with the Memory component in workflows, particularly that memory clears with each workflow call when `user_id` sets the `session_id`.
- LlamaExtract Adds Citations and Reasoning: @tuanacelik's new code walkthrough shows how to add citations and reasoning to LlamaExtract (link).
- The walkthrough shows how to make a schema that tells the LLM what to pull from complex data.
- Memory Defaults to DB, DB Reco'd for Scale: The Memory component defaults to an in-memory SQLite DB, but a local SQLite or PostgreSQL DB is recommended for scalability, set by changing the database URI (a minimal sketch follows this list).
- For long chat histories, a database is better than `memory.to_dict()` serialization as a JSON blob.
- Context Serialization vs. DB Debated: A user questioned if a database connection is better than serializing the context with the Memory component, since context restore recovers chat history.
- The clarification came that serialization is fine for defaults but databases rock for large chat histories or needing structured history saving, noting python dict vs redis is the same problem.
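A minimal sketch of wiring the Memory component to a persistent database, based on the summary above; the exact parameter names (notably the database URI argument) are assumptions and should be checked against the current LlamaIndex docs.

```python
# Sketch: Memory with a persistent Postgres backend instead of the default
# in-memory SQLite; parameter names are assumed, not verified.
from llama_index.core.memory import Memory

memory = Memory.from_defaults(
    session_id="support-chat-42",  # keep stable across workflow calls to avoid the clearing issue above
    token_limit=40_000,
    async_database_uri="postgresql+asyncpg://user:pass@localhost:5432/chat_history",
)
```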
Cohere Discord
- Cohere Users Get Charged: A user shared a link to the billing dashboard for checking the number of API calls made to Cohere.
- However, a user noted that the trial key only displays tokens and not the raw number of requests, suggesting that Cohere may not explicitly count API calls.
- Users Consider Cohereâs Value: Members discussed use cases for Cohere compared to models like ChatGPT and Anthropic.
- This discussion highlights the ongoing evaluation of Cohereâs positioning in the competitive landscape of AI models.
- Cohere's Command A still elicits questions: A member sought guidance on suggested generation parameters for Command A.
- The request underscores the importance of understanding and optimizing parameters for specific models to achieve desired results with Cohere.
LLM Agents (Berkeley MOOC) Discord
- Medium Article or X Post unlocks Course Certificate: Members clarified that earning a course certificate requires writing a Medium article or an X post summarizing one of the lectures.
- Interested members must submit their work via this form to receive credit.
- Submitting Coursework is critical for Certificate: To get the certificate, the coursework must be submitted via the provided Google Forms link after completing a Medium article or X Post.
- The submission ensures that the work is properly credited towards the course certificate.
The MLOps @Chipro Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
The Codeium (Windsurf) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
The Nomic.ai (GPT4All) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
The Gorilla LLM (Berkeley Function Calling) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
The AI21 Labs (Jamba) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
Discord: Detailed by-Channel summaries and links
Perplexity AI ▷ #general (748 messages🔥🔥🔥):
Android app custom model selection, Deep Research release date, Merlin AI pricing and web search quality, Perplexity AI Sonar vs GPT models, AI Studio for multimodal utility
- Users Inquire about Custom Model Choice on Android App: A user inquired about the possibility of selecting a custom model, specifically Grok3, within the Perplexity AI Android application, attaching a screenshot.
- No immediate solution or workaround was provided in the discussion.
- "Deep Research" Expected Soon, "Comet" Speculated: Members discussed the timing of Perplexity AI's new deep research feature and speculated whether it would be released before or after the Comet project, with one member stating that, according to their communications, deep research should be first.
- Another member expressed hope that Comet would be more than just a browser with a copilot style side bar, referencing a thinking batman gif.
- Merlin AI's Murky Rate Limits Draw Ire: Users discussed Merlin AI's pricing, with one noting its shady pricing due to unclear rate limits and another sharing that support gave a shitty response when asked, while praising its web search quality.
- It was also noted that for standard paid accounts on Merlin AI, any usage that surpasses $16 per day will also lead to the immediate termination of service for that day.
- AI Studio Hailed as Superior Multimodal Marvel: A user championed AI Studio for its multimodal utility, particularly for supporting audio and video input, which is unmatched by major LLM chats; they further added a key detail: AI Studio is our lord and savior for true multimodal utility.
- Comparisons were made to ChatGPT and other platforms, with users emphasizing AI Studio's capabilities being free.
- Projects Feature Sparks Speculation and Sneak Peeks: Speculation arose around Perplexity's new project feature, with users sharing screenshots and videos, like this one, of its functionality, and some expressing concerns about its potential misuse for vibe coding.
- There were also questions about how certain users gained early access to the feature, sparking discussion on testing catalogs and potential connections within the AI community.
Perplexity AI ▷ #sharing (2 messages):
Token Minimization, Sustain
- Token Minimization techniques explored: A user shared a Perplexity AI search result about token minimization for sustain.
- Another user then shared a direct link to Perplexity AI's page on the same topic: Token Minimization for Sustain.
- Sustain and token optimization: The discussion revolved around methods to reduce token usage while maintaining model performance, crucial for sustainable AI practices.
- Resources shared included techniques for efficient token encoding and strategies to minimize input length without sacrificing essential information.
Perplexity AI ▷ #pplx-api (12 messages🔥):
Sonar Model Benchmarks, Perplexity Pro API Access, New Developer Relations Resident, Sharepoint integration
- Sonar Models Beat Benchmarks!: Sonar Pro Low outperformed Claude 3.5 Sonnet on BrowseComp, with Sonar Low (the cheapest model) also outperforming Claude 3.7 Sonnet, according to recent benchmark evaluations.
- Sonar Pro Low achieved 4.0% accuracy on BrowseComp, almost 50% higher than Claude 3.5 sonnet, and both Sonar and Sonar Pro delivered up to 3x faster response times with more consistent latency.
- Perplexity Pro Users Discover API Access: Perplexity Pro actually includes $5/month in API credits! You can find the API docs and get started at Perplexity AI docs.
- Registering via a credit card is the only way to access the API, but users will not be charged if they keep their usage at or below $5 (a minimal call sketch follows this list).
- New DevRel Resident is a Robinhood Champ!: A new Developer Relations Resident (<@543991504922738688>) has joined the team, who won a Robinhood competition using the Perplexity API and will be helping developers build with Sonar.
- Sharepoint Integration into Perplexity: A user reported that Sharepoint integration into Perplexity Enterprise Pro works well in UI with relevant responses.
- However, they noted that they cannot receive any useful responses using Perplexity API service and are seeking advice.
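Since Pro includes the $5/month API credits noted above, here is a minimal sketch of calling a Sonar model through Perplexity's OpenAI-compatible endpoint; the base URL and model name follow Perplexity's public docs, and the key is a placeholder.

```python
# Minimal Sonar call via Perplexity's OpenAI-compatible API; replace the key
# with one generated from the Perplexity dashboard.
from openai import OpenAI

client = OpenAI(api_key="pplx-...", base_url="https://api.perplexity.ai")
response = client.chat.completions.create(
    model="sonar-pro",
    messages=[{"role": "user", "content": "What changed in AI news today?"}],
)
print(response.choices[0].message.content)
```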
LM Studio ▷ #general (176 messages🔥🔥):
Embedding Modules Issue, Benign Log Spam, LM Studio JIT, Building Models from Scratch, LM Studio Autoload Issues
- Embedding Issue Happening Across All Models: A user reported issues with embedding modules, encountering a "Maximum file size per conversation reached" error, capped at 31.46 MB with the default nomic-ai/nomic-embed-text-v1.5-GGUF model.
- The user seeks to overcome this limitation to process files of a few hundred MB for an audio-to-lyrics generation application.
- LM Studio Gets Log Spam Fix: A user was informed that the log spam issue they encountered is benign and has been fixed in the 0.3.16 build 2 or will be fixed shortly.
- This resolves concerns about potential problems when running the embedding model.
- LM Studio Has JIT Loading Issues: Users reported issues with LM Studio's JIT model loading feature, particularly when using it via the Continue VS Code plugin.
- The server sometimes serves clients with the wrong model if another model is already loaded; it was suggested to check the identifiers of the models being used, using the `lms ls` command to ensure they match between the config and LM Studio, as well as removing 8bit from it to fix it.
- Build A LLM From Scratch?: Members discussed the feasibility of building LLMs from scratch, with one member expressing interest in building a model for various use cases.
- However, others emphasized the impracticality due to the massive compute costs and data requirements, recommending fine-tuning existing open-source models instead and suggesting the "Build a Large Language Model from Scratch" Manning book for theoretical understanding.
- Problems Loading LM Studio Model: A user reported that LM Studio's model autoload feature stopped working, with models failing to load even after restarting the app; these models were being loaded using the JIT system via API requests.
- It was mentioned that this could be due to a mismatch in the model identifier used by the client, which could be resolved by running `lms ls`.
LM Studio ▷ #hardware-discussion (450 messages🔥🔥🔥):
Gigabyte RTX 5060 Ti, PCIE 5.0 Benefits, qwen3-14b-q4km performance, Dual GPUs, ROCm on Linux vs Windows
- Gigabyteâs Thrifty RTX 5060 Ti Design: A user showcases a new Gigabyte GeForce RTX 5060 Ti card, noting its ultra-short PCB design with only an x8 physical connector, despite the chip supporting x8, as seen in this videocardz.com article.
- The user praises the short board and large flow-through design for future-proofing and ease of installation, while humorously criticizing PNY's design choices for prioritizing aesthetics over functionality, obstructing airflow.
- PCIE 5.0 Boosts Card Performance: A user reports performance gains after upgrading to a 50 series card, specifically going from 26 tkps to 38 tkps using qwen3-14b-q4km with a 4096 context, due to PCIE 5.0 benefits on a board that limits the top slot to x8 when M.2 slots are occupied.
- The user also appreciates the ability to install the card in the bottom slot without interference, highlighting the advantages of the shorter PCIE connector design.
- Dual GPUs Capped by Slower Card: When questioned about using two GPUs simultaneously, one user explains that the speed gets capped at the slower card's performance plus overhead, indicating that the system will only perform as fast as the slowest component.
- They then shared an image of a PNY GPU, clarifying that they returned it.
- GMK Strix Halo Mini PC: Users are eagerly awaiting their GMK Strix Halo mini PCs, with one user having ordered two and planning to post LM Studio performance results upon arrival, as well as hoping that they can get it to run models split across multiple computers.
- Another user noted that GMKtec order shipping is now underway for orders placed May 7-13.
- ROCm on Linux for AMD GPUs: One user shared it wasn't that difficult to switch to Linux for ROCm support, while another reported issues with ROCm detection on Linux, noting no performance difference compared to Vulkan on Windows.
- A user stated that Vulkan can be faster but has a bug with flash attention and that ROCm is a lot more consistent.
LMArena ▷ #general (530 messages🔥🔥🔥):
DeepSeek R2 release, New Gemma models, Claude Neptune / 3.8 leaks, GPT-4.1 vs GPT-4o, O3 Pro release delays
- Gemma gets a new Model named Cutiepie: Members discussed a new Gemma model named Cutiepie, and some believe it might be slated for Google I/O if the timeline aligns.
- Others are trying to find strong anonymous models in the arena, and are wondering whether Gemma is free without logging in.
- DeepSeek R2: Still MIA?: The community noted that the constant speculation around a new DeepSeek R2 release has finally subsided, yet the model is still in development using Huawei's latest GPUs with Chinese funding.
- Internally, they are likely feeling pressure to release something impressive and solve server problems; current models excel in reasoning traces and multi-lingual chain of thought.
- Gemini 2.5 Pro vs The Competition: Members discuss Gemini 2.5 Pro and express it is superior to the competition, especially when coding, with some citing coding in C++ with it as a dream come true.
- Users find current models are suffering from hallucinations like crazy, though others find it better than o3 and are starting to feel threatened by Gemini.
- O3 Pro Delayed, O4 On the Horizon?: Speculation continues about the delayed release of o3 Pro, with some suggesting OpenAI is waiting for other labs to reveal their cards to claim the top spot, with a likely Google event looming.
- Internally, OpenAI already has o4, but is holding it back for strategic reasons relating to other labs; it is believed Claude 4 + o3 pro gonna be an insane combo.
- GPT 4.1: Is it a Coding Prodigy?: Members discussed GPT 4.1's coding capabilities, highlighting its quick compilation and error-fixing on large codebases, with instant response times.
- A member notes it is a big upgrade from GPT 4o mini to 4.1 mini for free users, but the same member feels that the 4.1 is solely for coding, not rlly other things tbh.
LMArena ▷ #announcements (1 message):
Server Updates, Forum Category, Roles Creation, Moderation Improvements, Future Events
- LMArena Server undergoes Updates: The LMArena server has implemented a few changes to make a more engaging and protected space, with community feedback driving these adjustments; members can provide input via this form.
- These changes include server structure adjustments, moderation improvements, and plans for regular events.
- New Forum Category introduced: A new Forum Category has been added to gather feedback, troubleshoot issues, and handle model requests, replacing the existing <#1347792466203381830> and <#1346566622927655033> channels for better organization.
- Users may need to enable these channels in `Channels & Roles -> Browse Channels` if they are not visible.
- Roles Creation for Targeted Announcements: New Roles have been created in the `Channels & Roles` section via auto-assigned questions, allowing for more targeted announcements to ensure members receive the most relevant information.
- The Server Guide, located at the top of the channel list, now contains the <#1343285970375540839> and <#1340554757349179416> channels.
- Moderation Improved via ModMail Bot: For immediate needs, the <@&1349916362595635286> role can be pinged, and for private issues, members can now send a Direct Message to the ModMail bot found at the top of the `Member List`.
- The <#1343285970375540839> has been updated to keep discussions on-topic and foster a more inclusive space.
- More regular events in the Future: Plans are in place to host events on a more regular basis, including Staff AMAs, contests, and casual game/activity nights.
- Members are encouraged to stay tuned for updates on these upcoming activities.
Unsloth AI (Daniel Han) ▷ #general (379 messages🔥🔥):
File:// usage, Qwen3 inference speed, llama 3.2 vision fine tuning, mergekit and frankenmerge, Qwen3 GRPO notebook
- Fix attempts for 'NoneType' error with file usage: Users debugged an `AttributeError: 'NoneType' object has no attribute 'startswith'` when attempting to use file:// for image loading, suspecting an issue with how the image variable was being set to None initially in the code, and suggested adding additional forward slashes in the path.
- After some debugging, the user reported that the error persisted, even after the path correction, leading to the consideration of using URLs as an alternative.
- Users Question Qwen3's Extremely Slow Inference: A user reported slow inference speeds with Qwen/Qwen3-0.6B on an A100, getting around 30 tokens/second with a batch size of 1, seeking advice on whether this is typical (a minimal load sketch follows this list).
- Others suggested that the small model size and batch size might negate the benefits of using `load_in_8bit=True`.
- Help is requested to finetune Llama 3.2 Vision Model: A user requested assistance setting up scripts for finetuning the Llama 3.2 vision model.
- They expressed frustration with image finetuning, stating it gives them a headache.
- Discussion on the use of entropy loss in GRPOTrainer: A user inquired whether GRPOTrainer implements entropy loss, referencing a paper that found it helpful.
- Another user shared an ablation table and noted that without entropy loss, the model still converges, but performs 2-4 points worse and added, I could see an argument why the entropy loss would matter more with a single example than if you have a more traditionally-sized dataset, though!
- Discussion on mergekit and frankenmerge: A user requested beginner-friendly guides, blogs, videos, or courses on mergekit and frankenmerge.
- Someone stated Mergekit apparently lets you merge different llms together
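As context for the Qwen3 inference discussion above, here is a minimal sketch of the Unsloth load path in question; the model name and `load_in_8bit` flag come from the thread, while the remaining arguments are illustrative.

```python
# Sketch of the load path discussed above; max_seq_length is an arbitrary choice.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-0.6B",
    max_seq_length=2048,
    load_in_8bit=True,  # reportedly of little benefit at this model size and batch size 1
)
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference mode
```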
Unsloth AI (Daniel Han) ▷ #off-topic (49 messages🔥):
O3 evaluation, GPT-4.1 coding, Qwen models, NEFTune
- O3 Model Called Out for Coding Garbage: Members complained that O3 outputs garbage code and that OpenAI models have become borderline unusable, while another member suggested that O3 is fucking garbage for coding and it's better to use it for planning and research with tool calls and use O4 mini high for coding.
- GPT-4.1 Declared Best Coding Model: One member stated that GPT 4.1 is the best coding model, as available in Github Copilot for those with the edu account.
- Another member agreed that 4.1 is much better than Sonnet and even O1.
- Qwen Models Outperform Gemma in Benchmarks: It was noted that the Qwen 3 models are better than Gemma 3, beating all of them at similar sizes.
- Specifically, the Qwen 3 8b model is apparently CRAZY good at reaching SOTA in almost any domain with good fine tuning and is the perfect sweet spot for local llms.
- Repo2txt.com Suggested as Superior to RAG for Code Injection: A member recommended repo2txt.com for selecting files from a repo and generating text, which is then injected directly into the prompt as it's better than allowing the model to do RAG (a sketch of the approach follows this list).
- They claim that models can't read all code automatically in github and make mistakes.
- Qwen Deep Research Discovers NEFTune: Members discussed Qwen's deep research which revealed NEFTune, or injecting noise in embedding during finetuning, so it acts like regularization.
- One member favored it over Gemini deep research and ChatGPT since it's very specific to the instruction, doesn't hallucinate, and told them about NEFTune.
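For reference, the "inject the whole repo into the prompt" approach mentioned above can be approximated in a few lines of Python; the file extensions and formatting here are arbitrary choices, not how repo2txt.com itself works internally.

```python
# Rough approximation of the repo-to-prompt idea: concatenate selected source
# files with path headers so the model sees the full code in context.
from pathlib import Path

def repo_to_prompt(root: str, exts: tuple = (".py", ".md")) -> str:
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            parts.append(f"### {path}\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts)

prompt = "Review this codebase:\n\n" + repo_to_prompt("my_repo")
```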
Unsloth AI (Daniel Han) ▷ #help (83 messages🔥🔥):
Vocabulary Size, Chat Templates and Base Models, Unsloth Performance Issues, GGUF compatibility, GRPO and Qwen3
- Vocabulary Size Prevents OOV Errors: A member noted that the vocabulary size is the same and uses byte-level encoding, so there's no chance of out-of-vocabulary (OOV) errors.
- The config adds the tool and think tokens, the chat template, and increases the `model_max_length` slightly.
- Chat Template Flexibility with Base Models: It was mentioned that in the base model, you can use any template as you want and it doesn't matter at all, even Alpaca or Gemma templates.
- However, you need to stick to it and use your own code to wrap the data into the template.
- Unsloth Inference Speed Slow for Qwen3: A user reported slow inference speeds with Qwen3-0.6B on an A100 (40GB), getting around 30 tokens/second with a batch size of 1, using Unsloth's `FastLanguageModel.from_pretrained()`.
- The base model does not contain a chat template in the `tokenizer_config.json`.
- Vision Model Merging Fix Pushed: A fix was pushed to address issues with `save_pretrained_merged()` and `push_to_hub_merged` for vision models; make sure to update the unsloth-zoo and unsloth installations from the main repos using `pip install --force-reinstall git+https://github.com/unslothai/unsloth-zoo.git` and `pip install --force-reinstall git+https://github.com/unslothai/unsloth.git`.
- The specific issue was detailed in this pull request.
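For context, a hedged usage sketch of the merge-and-save path that was fixed; the model id and the `save_method` value are assumptions based on Unsloth's documentation, not details from the discussion.

```python
from unsloth import FastVisionModel

# Illustrative checkpoint; any Unsloth-supported vision model works similarly.
model, tokenizer = FastVisionModel.from_pretrained("unsloth/Llama-3.2-11B-Vision-Instruct")
# ... LoRA finetuning happens here ...

# The patched call: merges LoRA adapters into the base weights and saves them.
model.save_pretrained_merged("merged_model", tokenizer, save_method="merged_16bit")
```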
- New Feature in llama.cpp Requires mmproj File: When using the new feature in `llama.cpp` to run multimodal models, GGUF models may require an `mmproj` file.
- This file can be created by converting the model twice with `llama.cpp`, once normally and once with the `--mmproj` command-line argument, per the updated `llama.cpp` documentation.
Unsloth AI (Daniel Han) ▷ #research (5 messages):
Med Palm 2, QLoRA memory, modernBERT context length
- Google's Med-PaLM 2 and stabgan: A member noted that stabgan's approach is almost similar to what Google's Med-PaLM 2 paper did, though the paper used the concept for normal generation, not reasoning.
- He thanked them for sharing the paper.
- QLoRA memory remains low: A member reported the use of QLoRA kept the memory low.
- He used this to play with the full modernBERT context length of 8k, and to use massive batch sizes to get better and more diverse in-batch sampling of negatives during training.
OpenAI ▷ #annnouncements (2 messages):
Safety Evaluations Hub, GPT-4.1, GPT-4.1 Mini
- Safety Evaluations Hub debuts: OpenAI introduced the Safety Evaluations Hub, a resource to explore safety results for their models.
- While system cards share safety metrics at launch, the Hub will be updated periodically as part of their efforts to communicate proactively about safety.
- GPT-4.1 lands in ChatGPT: GPT-4.1, a specialized model excelling at coding tasks & instruction following, is now available directly in ChatGPT for Plus, Pro, & Team users via the "more models" dropdown.
- Enterprise & Edu users will get access in the coming weeks, and it is a faster alternative to OpenAI o3 & o4-mini for everyday coding needs.
- GPT-4.1 mini supersedes GPT-4o mini: GPT-4.1 mini is replacing GPT-4o mini in ChatGPT for all users.
- GPT-4.1 and GPT-4.1 mini underwent standard safety evaluations, with detailed results available in the newly launched Safety Evaluations Hub.
OpenAI ▷ #ai-discussions (151 messages🔥🔥):
Sentient AI conversation, ChatGPT models for coding, O3 model intelligence, ChatGPT Enterprise plan, AI-generated images on Instagram
- O4-Mini-High is a Coding Beast: Members raved about O4-mini-high for coding, with one user exclaiming, "Never seen such good and fast performance out of a coding model in A WHILE", noting it quickly solved a problem and improved the code for a calculator in just 22 seconds.
- Despite some users noting GPT-4.1 being made for coding, the general sentiment leaned towards O4-mini-high delivering superior performance compared to Claude 3.7 Sonnet and Gemini 2.5 Pro.
- O3 Unlimited is a dealbreaker: A user considering switching from a Pro subscription to a Teams subscription was concerned about losing unlimited O3, a model they consider "a legendary model to resolve most of my issues," despite the Teams plan offering desirable internal knowledge features.
- Members suggested sticking with the Pro plan due to the benefits of unlimited O3, which now supports deep research through personal GitHub Repos, OneDrive, and SharePoint, with 20 uploaded files allowed in a Chat / Project.
- ChatGPT Enterprise Trains on User Data for Enhanced Performance: A user shared their experience with a corporate version of ChatGPT that, initially "pretty trash", improved significantly after training on thousands of users' daily interactions, now utilizing GPT-4o and offering more secure and uncapped usage.
- Another user inquired about the message cap limit of O3 in the Enterprise plan, clarifying that while the plan uses GPT-4 Turbo, they were specifically interested in the message cap limit for the O3 model, later found to be 100 messages per week per user.
- GPT-4.1 Mini replaces GPT-4o Mini: The community noted that GPT-4.1 Mini has replaced GPT-4o Mini in ChatGPT for all users as of May 14, 2025, touted for significant improvements in instruction-following, coding, and overall intelligence.
- Members discussed the shift from GPT-4o to GPT-4.1 on the free plan, weighing the pros and cons of each with some members believing GPT-4o is better for therapeutic purposes while GPT-4.1 excels in other areas such as front-end coding.
OpenAI ▷ #gpt-4-discussions (12 messages🔥):
GPT-4o for web app coding, Structured outputs for Azure OpenAI assistants, Node ID errors, Chat delays on PC vs. mobile, Flagged chats due to long output
- GPT-4o Aids Web App Dev: Members discussed using GPT-4o for coding web apps (Vue, Express.js, MongoDB), emphasizing the need to specify tooling, OS, IDE, languages, frameworks, and preferred dependencies.
- Clarity in detailing requirements helps the model provide expected solutions; simply "Tell it exactly what you want".
- Azure OpenAI Assistants face Structure Output Woes: A user reported issues working with structured outputs for assistants in Azure OpenAI.
- Another user reported getting "getNodeByIdOrMessageId - no node found by id: placeholder-request-" type messages all the time now, indicating ongoing problems with the platform.
- Typing Lag surfaces on PCs: A user experienced typing lag and delayed message loading on their PC, while the same chat session worked fine on their phone on the same network.
- They further isolated the issue by testing on a separate Win11 work computer, confirming the problem was specific to the PC setup.
- Flagged Chats Trigger Crashes: A member suggested that the system might have flagged a chat due to an extremely long output, potentially longer than the output limit.
- The user theorized that "if you're feeding it files with 5000+ lines of code and it's writing fixes and refactors the code, but makes an error the system could flag it, then the whole chat is broken and unable to load on any device".
OpenAI ▷ #prompt-engineering (70 messages🔥🔥):
GPTs for coding, PII data guardrails, AI for universe simulation, Mimicking writing style with AI, Ollama vs Windsurf
- Users seek advice using GPT-4o for coding Vue, Express, and MongoDB: A member sought guidance on using GPT-4o for coding, specifically with Vue, Express.js, and MongoDB, and inquired about integrating it with Visual Studio.
- Another member recommended starting with tutorials using HTML, CSS, and JavaScript with ChatGPT to build basic applications such as a calculator, notes app, or weather app, after which the user should learn TypeScript, React, Vite, and Electron.
- Members discuss PII data guardrails challenges: A member reported encountering issues with PII guardrails when their application, which accesses HR data, blocked requests for home addresses.
- Another member suggested consulting OpenAI directly to discuss appropriate use and obtain guidance on handling sensitive data requests while adhering to their policies, especially concerning PII.
- User builds AI Universe Simulator for Companion Prompt Generation: One member is using AI to build a 1:1 simulation of the universe to think about thinking and create automated ecosystems within AI models.
- The goal is to scale the simulation into a web browser and push for open communication lines between models, promoting efficiency over stacking models.
- Users explore mimicking writing styles with ChatGPT: A member asked about using ChatGPT to mimic their writing style, and another member suggested sharing samples and iterating with guidance to refine the output.
- It was pointed out that a general prompt will most likely not yield quality results; a strong, coherent set of rules and constraints is needed, and training the model on 1000 pages can help.
- Users discuss using local model inference with Ollama, instead of services like Windsurf: A user questioned whether Ollama is inferior to Windsurf, and a member advised against paying for services like Windsurf, Lovable, or Replit when using AI to build apps.
- It was suggested to learn prompting to avoid API costs, and use local model inference with Ollama and VS Code extensions like Continue and Roo Code, while doing LLM and Agentic courses from Hugging Face.
OpenAI ▷ #api-discussions (70 messages🔥🔥):
ChatGPT for Web App Development, GPT-4o Coding Assistance, HR Data Guardrails and PII, Mimicking Writing Style, Prompt Engineering and Agentic Frameworks
- Navigating ChatGPT for Web App Coding Assistance: A user sought advice on leveraging GPT-4o for web app development, specifically with Vue, Express.js, and MongoDB; another member suggested using it like conversing with a human, providing specific details about challenges, sharing code snippets, and stating background.
- The member recommended starting with a bare-bones prototype, testing it, fixing errors iteratively, and adding features one at a time, ensuring both the user and the model remain aligned.
- Circumventing PII Guardrails for HR Data: A user reported issues with PII guardrails when requesting information like home addresses from an HR data-connected application; while OpenAI support might provide tailored guidance, community members are limited in offering specific workaround advice due to OpenAI's policies.
- The suggestion was to contact OpenAI support directly to discuss the specific use case and seek appropriate guidance for handling sensitive PII data within the application.
- Mastering Mimicry: Emulating Writing Style with ChatGPT: A user inquired about the best approach for getting ChatGPT to mimic their writing style, and a member suggested providing samples, requesting emulation for a specific goal, and iteratively correcting and refining the output based on the modelâs feedback.
- Another member emphasized specifying details about structure and the elements used to clarify writing style for the bot.
- Prompt Engineering for Stellar AI Results: A member recommended learning basic HTML, CSS, and JavaScript to better understand and debug AI-generated code and suggested completing LLM and Agentic courses on Hugging Face Learn to grasp prompting, context management, and roles.
- They suggested that one can evaluate their prompting skills by asking ChatGPT to rate the prompt engineering and provide feedback, which is key to improving results, and described agentic frameworks as an LLM plus MCP servers, modes, and prompts guiding it across multiple modes and tools.
- Unveiling Personalized ChatGPT Engagement Metadata: A user requested a detailed breakdown of their ChatGPT usage metadata, including metrics like message length, role usage, and conversation depth, to gain insights into their interaction patterns.
- A follow-up prompt of "are there any other stats you can share i had not covered?" was suggested to generate even more stats.
Cursor Community ▷ #general (271 messages🔥🔥):
GPU power in decentralized systems, Cursor pricing vs API pricing, Claude Max in Cursor, Multi-repo projects in Cursor, Cursor's Git changes sync issues
- GPU Power P2P options pondered: A member inquired about open-source options for utilizing GPU power in peer-to-peer or decentralized systems.
- The discussion did not yield specific solutions, highlighting a potential area for exploration.
- Cursor's 20% markup debated: Users debated whether Cursor's 20% surcharge over actual API pricing is justified, with some arguing it makes perfect business sense while others find it a bad value.
- One user stated, "I feel cursor are the absolut best value for the money currently, compared to what you get for the $20 bucks", while another claimed to save $600 per month by using Claude Max outside of Cursor.
- Gemini 2.5 joins Cursor's Model Mix: The community spotted the addition of Gemini 2.5 Pro on May 6th to Cursor's selection of available models.
- Some noted that the new model finally fixed the "i will code now (stops coding)" issue.
- Mono-Repo Method Maneuvers: Users discussed managing multi-repo projects, with one suggesting consolidating into a single monorepo to avoid context fragmentation.
- Another user mentioned using separate workspaces within the same parent folder in Cursor 0.50.x.
- Background Agent Begs for Confirmation: Users expressed frustration that the background agent requires excessive confirmation, increasing fast request usage.
- One user complained that the background agent "just keeps asking for confirmation to do things and its just eating fast request up faster".
Yannick Kilcher ▷ #general (197 messages🔥🔥):
Patents in AI, RL-Diffusion, Generator Paradigm, Evolutionary Algorithms, Hamiltonian Neural Networks and Transformers
- RL-Diffusion Debate Sparks Patent Concerns: A member discussed developing RL-Diffusion, combining forward and backward processes into one controlled by RL, leading to a debate about patentability and prior art, with some members skeptical about its novelty and practical implementation.
- Some members encouraged practical implementation and benchmarking before pursuing patents, emphasizing that transformation into a patent-eligible application requires "more than simply stat[ing] the [abstract idea] while adding the words 'apply it.'"
- Google's AlphaEvolve Sparks Excitement and Skepticism: Members discussed Google's AlphaEvolve, which pairs Gemini models with evolutionary algorithms to improve underlying models, with opinions divided on whether it's a meaningful advancement or just "brute force with an LLM."
- One member noted its potential significance in accelerating multiplication, while another linked it to existing work like AlphaTensor and AlphaCode, viewing it as a small step in neural net-driven search.
- Hamiltonian Neural Networks and Transformers Get Attention: A member shared their idea of integrating transformers into Hamiltonian neural networks, referencing a paper on a Hamiltonian neural network (HNN)-Transformer architecture for modeling physical systems [https://ieeexplore.ieee.org/document/10316909].
- Discussion touched on whether attention mechanisms align with the history-independent nature of Hamiltonian systems, with another member suggesting a transformer that learns system Hamiltonian dynamics from a single trajectory.
- Diffusion vs. Autoregression: Continuous vs. Discrete?: Members discussed the fundamental differences between diffusion models and autoregressive models, highlighting that diffusion models work with continuous distributions while autoregressive models work with discrete sequences of symbols.
- The discussion extended to how VQVAE can be used to transform images into discrete tokens for autoregressive models like Parti [https://sites.research.google/parti/], enabling transformers to operate on latents.
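To make the VQVAE point concrete, here is a minimal sketch of the quantization step that turns continuous encoder outputs into the discrete token ids an autoregressive model consumes; the codebook size and dimensions are illustrative.

```python
import torch

def vq_tokenize(latents: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map continuous latents (N, D) to discrete ids (N,) via the nearest codebook entry (K, D)."""
    dists = torch.cdist(latents, codebook)  # (N, K) pairwise L2 distances
    return dists.argmin(dim=-1)             # index of the closest code

codebook = torch.randn(8192, 64)            # illustrative: 8192 codes of dim 64
latents = torch.randn(256, 64)              # e.g. flattened image patches
token_ids = vq_tokenize(latents, codebook)  # discrete symbols for the AR model
```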
Yannick Kilcher ▷ #paper-discussion (23 messages🔥):
Grade School Math Benchmarks, ML systems rabbit hole, Data Loading and Preprocessing, LLMs are like humans, Model formulates a plan
- GSM8K Benchmark: Near-Perfect Accuracy Achieved: Language models have shown their capability to solve mathematical reasoning problems, achieving near-perfect accuracy on grade-school level math benchmarks like GSM8K, as discussed in this paper and detailed in this blogpost.
- The paper explores whether language models truly develop reasoning skills or simply memorize templates.
- Dive Deep into the ML Systems Rabbit Hole: It was shared that a significant portion of training time, around 65%, is spent on data loading and preprocessing, referencing this paper.
- LLMs: Human-like or Not?: One member suggests that a paper's findings about LLMs might be flawed because "They look to prove LLMs are like humans instead of trying to disprove it."
- The member feels the paper misinterprets its results by claiming LLMs formulate plans before generating solutions.
- Model Avoision of Unnecessary Computations: One member quotes the paper as saying "the model can learn to generate shortest solutions, almost always avoiding unnecessary computations", arguing "This suggests that the model formulates a plan before it generates, in order to avoid computing any quantities that are not needed towards solving the underlying math problem."
- Another member countered that "What the model learns is that unnecessary things are random noise which by definition have no signal for the model to learn, so it ignores them".
Yannick Kilcher ▷ #ml-news (12 messages🔥):
AI Regulation Ban, AlphaEvolve, Budget Reconciliation Bill
- GOP Blocks AI Regulation for a Decade: House Republicans added language to the Budget Reconciliation bill that would block all state and local governments from regulating AI for 10 years (source).
- The provision, introduced by Representative Brett Guthrie of Kentucky, vaguely states that no state or local entity can enforce laws regulating AI models or systems for a decade, potentially impacting privacy regulations.
- DeepMind's AlphaEvolve Cracks Open Problems: DeepMind's AlphaEvolve, a Gemini-powered coding agent, evolves algorithms for math and practical applications, combining LLM creativity with automated evaluators (DeepMind Blog).
- The system rediscovered state-of-the-art solutions in roughly 75% of cases and improved the previously best known solutions in 20% of cases, even advancing the kissing number problem.
OpenRouter (Alex Atallah) ▷ #app-showcase (5 messages):
New Chatbot Platform, Customization and Models, Image Generation in Chat
- Personality Launched: New Chatbot Platform Emerges: A member introduced Personality, a new chatbot platform enabling users to create and roleplay with multiple characters and use non-role-play assistants.
- The platform aims to offer more customization, less filtering, and a wider selection of models compared to existing solutions like c.ai.
- Personality Platform Offers Free Image Generation: The platform's playground at personality.gg/playground offers free image generation, though it's noted that this feature is not powered by OpenRouter.
- Users are invited to try the platform for free at personality.gg and provide feedback.
- Big Updates Coming to Personality Platform This Week: A major update is expected this week including the ability to generate images directly within chats and a better user interface.
- This aims to enhance the user experience and expand the platformâs capabilities.
OpenRouter (Alex Atallah) ▷ #general (177 messages🔥🔥):
OpenAI Reasoning Models Naming, Free Google Models, Gemini Rate Limits, Claude on OpenRouter vs Native, Corvid Befriending
- OpenAI's Reasoning Model Names Need a Revamp: A user inquired about the naming inconsistency in OpenAI's reasoning models, noting that some have reasoning-level variants (e.g., `openai/o4-mini-high`) while others don't, and requested consistency in offering reasoning levels for all models to aid evaluation.
- Free Google Models Getting the Squeeze: Users reported issues with free Google models despite having credits, with some confirming extremely low rate limits.
- Alternatives like DeepSeek V3 were recommended, while concerns were raised about potential removal of free routes for Gemini following a change shared on Twitter.
- Claude System Prompt Differences Explained: Users noticed a difference in helpfulness when using Claude via OpenRouter compared to the native Anthropic website due to the extensive system prompts used on the latter.
- It was suggested that users manually implement the system prompt, available on GitHub, which comprises around 16000 tokens and includes tools.
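A minimal sketch of supplying such a system prompt yourself through OpenRouter's OpenAI-compatible endpoint; the file name and model slug are illustrative, and the prompt text would be the roughly 16000-token prompt published on GitHub.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

# Hypothetical local copy of the published Anthropic system prompt.
system_prompt = open("claude_system_prompt.txt").read()

resp = client.chat.completions.create(
    model="anthropic/claude-3.7-sonnet",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Summarize the repo layout for me."},
    ],
)
print(resp.choices[0].message.content)
```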
- Become One With Murder Birds: A user shared their ongoing journey of befriending corvids (crows and magpies), detailing their feeding routine and the development of trust.
- They recounted stories of crows following them after being fed and anticipated building a crow army, ending on an anticlimactic note with their location in Germany.
- "Always Use This Key" is Actually New: A new "Always use this key" option was introduced, causing confusion as it was initially mistaken for the existing "Use this key as a fallback" setting.
- The new feature exclusively uses the specified key and prevents fallback to OpenRouter, which represents a change from the behavior of the older fallback setting.
Manus.im Discord ▷ #general (165 messages🔥🔥):
Manus credits not refreshing, best use cases for manus, Manus invitation codes, Manus refunds, Gemini Developer API's Function Calling feature
- Users Reporting Manus Daily Credits Not Refreshing Properly: Several users reported issues with their daily 300 credits not refreshing at 00:00 GMT, potentially due to time zone processing problems.
- A user mentioned their credits refresh at 8:00pm in their timezone, indicating inconsistencies in the credit refresh timing.
- Unleashing Genius with Glitched Invitation Codes: A user claimed to have a glitched account with 100 invitation codes and shared numerous invitation links, leading to discussions about their origin and purpose.
- Some users speculated the codes were from a paid subscription, while others questioned their usefulness to existing members; new users get 500 credits with the codes.
- Manus Credit Refunding Difficulties: A user reported a job failure that consumed 800 credits and expressed frustration about not being able to get a refund from Manus.
- Other users chimed in, stating that refunds are no longer provided, even if the service doesn't work as expected, with one suggestion to dispute the charge.
- Rapping about Facebook Marketplace Scams: A user requested Manus to generate a rap about Facebook Marketplace lowballing, using slang terms like best price, last price, and mates rates.
- The user clarified that the request was not an advertisement and involved rapping about scenarios and experiences related to the online marketplace.
- Beta tester boasts about a secret Music project: A user mentioned being accepted into a new beta trial related to Music, but couldn't disclose details due to NDAs.
- The user later promoted their social media accounts (TikTok, Instagram, LinkedIn, Threads, YouTube, & X) featuring Manus content.
GPU MODE ▷ #general (2 messages):
torch.compile performance, layernorm vs rmsnorm
- PyTorch Implementations Compared: A member suggested that comparing "basic" implementations with separate kernels for each operation on GitHub could explain performance improvements.
- Another member confirmed finding similar results when comparing `torch.compile` of PyTorch's layernorm and rmsnorm implementations, noting they seem to have basically the same performance.
- Surprising GPU Profiles Reported: A member mentioned that some colleagues have posted surprising GPU profiles, prompting further investigation.
- They plan to follow up with them to understand the underlying factors contributing to these unexpected results.
GPU MODE ▷ #cuda (1 messages):
CUDA Shared Buffers, PyTorch Tensors, RAPIDSAI/rmm Library
- Seek Simpler C++/CUDA Library for Shared Buffers: A member inquired about a simple C++ library, potentially with pybind, to demonstrate multiple processes reading/writing to shared CUDA buffers.
- They're also interested in wrapping PyTorch tensors on top, noting that RAPIDSAI/rmm might be too extensive for their needs.
- Further CUDA/PyTorch Interoperability: The user is seeking guidance on efficiently managing shared CUDA memory between multiple processes.
- They are particularly interested in a streamlined approach that integrates well with PyTorch tensors, possibly as an alternative to the more comprehensive RAPIDSAI/rmm library.
GPU MODE ▷ #torch (2 messages):
PyTorch nightly, at::Tag, needs_exact_strides, C++ code, torch.compile
- needs_exact_strides better than needs_fixed_stride_order: Members discussed that if you're on a PyTorch nightly, `at::Tag::needs_exact_strides` is better because `needs_fixed_stride_order` sometimes lies.
- One member mentioned reading the code and believes the answer is no, while another thanked them and mentioned moving the `.contiguous` calls in the C++ code so `torch.compile` can't mess with them.
- Contiguous calls moved in C++ code: A developer moved the `.contiguous` calls in the C++ code to prevent `torch.compile` from interfering.
- This adjustment was made to address a recurring issue, and the developer appreciated the suggestion to use `needs_exact_strides` for better stride handling in PyTorch nightly builds.
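For context, a hedged sketch of how such a tag can be attached when defining a custom op from Python; it assumes a nightly that exposes `torch.Tag.needs_exact_strides`, and the op itself is made up for illustration.

```python
import torch

# Declare a custom op and tag it so torch.compile must pass exactly-strided inputs.
torch.library.define(
    "mylib::row_op",
    "(Tensor x) -> Tensor",
    tags=(torch.Tag.needs_exact_strides,),  # assumption: available on recent nightlies
)

@torch.library.impl("mylib::row_op", "default")
def row_op(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for a kernel that assumes contiguous, row-major input.
    return x.contiguous().sum(dim=-1)
```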
GPU MODE ▷ #beginner (5 messages):
Arithmetic Intensity of Kernels, TMA Utilization Metrics, Triton Performance Debugging, Nsight Compute for Kernel Debugging
- Kernel Arithmetic Intensity Question Arises: A member inquired about the best way to compute the arithmetic intensity of kernels and metrics for assessing TMA utilization in Hopper and Blackwell architectures (a back-of-envelope sketch follows this list).
- Another member suggested using `tma__inst_executed.sum` for TMA on Hopper, referencing an NVIDIA forum post, and pointed out that Nsight Compute has a built-in roofline tool to estimate arithmetic intensity.
- Nsight Systems (nsys) Debugs Triton Performance: A member asked if using `nsys` and `nsys-ui` is a typical workflow for debugging GPU performance while learning Triton.
- Another member confirmed that this is a typical workflow for whole-program performance analysis, especially on headless servers, and recommended Nsight Compute (ncu) and ncu-ui for debugging specific kernels.
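Before reaching for the roofline tool, arithmetic intensity can be estimated by hand; the sketch below does this for a GEMM under illustrative assumptions (fp16 operands, each matrix touched once, no cache modeling).

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, dtype_bytes: int = 2) -> float:
    """FLOPs per byte for C = A @ B with A (m, k), B (k, n), C (m, n)."""
    flops = 2 * m * n * k                                # one multiply + one add per MAC
    bytes_moved = dtype_bytes * (m * k + k * n + m * n)  # read A and B, write C
    return flops / bytes_moved

# A 4096^3 fp16 GEMM: ~1365 FLOPs/byte, firmly compute-bound on Hopper.
print(gemm_arithmetic_intensity(4096, 4096, 4096))
```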
GPU MODE ▷ #self-promotion (10 messages🔥):
Weight Pruning, PTX Instructions for Matrix Load/Store, CohereAI Talk Recording
- Pruning Weights for Performance?: A member inquired about weight pruning, specifically random block-weight pruning, clarified to be done with program IDs rather than zeroing weights.
- This technique relates to efficiently loading and storing matrices using PTX instructions.
- PTX Boosts Matrix Manipulations!: A member shared a blogpost detailing how to efficiently load and store matrices within a warp using PTX instructions, including the `ldmatrix` instruction.
- They also linked the associated code, PTX documentation, and a LinkedIn post explaining the `stmatrix` instruction.
- CohereAI Talk slides released!: A member shared a Google Meet link for a talk, and subsequently shared the slides.
- When asked about a recording, another member suggested it would be available on the CohereAI YouTube channel.
GPU MODE ▷ #🍿 (1 messages):
c.3.p.1: This looks potentially interesting: https://arxiv.org/abs/2504.09246
GPU MODE ▷ #submissions (47 messages🔥):
AMD MI300, AMD fp8-mm, VectorAdd, Leaderboard Submissions
- MI300 Leaderboard Domination: Numerous submissions were made to the `amd-fp8-mm` leaderboard on MI300, showcasing various performance improvements.
- Submissions ranged from 162 µs to 26.3 ms, indicating a wide spectrum of optimization levels.
- Mixture of Experts on AMD: One successful submission was recorded on the `amd-mixture-of-experts` leaderboard on MI300 with a time of 7574 ms.
- VectorAdd on T4: One submission achieved 8th place on the `vectoradd` leaderboard on T4 with a time of 6.41 ms.
- New Personal Bests Abound: Several members achieved personal bests on the `amd-fp8-mm` leaderboard, reflecting ongoing optimization efforts.
GPU MODE ▷ #status (1 messages):
Competition delayed, Ironing out details, Problem #3
- Competition Delayed Due To Ironing: Problem #3 for the <#1359640791525490768> competition will be delayed by a few days, and here's why.
- The team is ironing out a few details to ensure the problem is as fun as possible, so your patience is appreciated.
- Fun Details Being Ironed: The competition team is taking extra time to iron out a few details to ensure the problem is as fun as possible.
- The problem in question is problem #3.
GPU MODE ▷ #factorio-learning-env (15 messages🔥):
Factorio Genetic Algorithm, Cutting Down Tokens, Nearest buildable tool
- FactorioGP Genetic Algorithm for Blueprint Generation: A member is planning to create a genetic algorithm that generates Factorio blueprints based on specified requirements like building materials, input/output locations, and area constraints, and found a paper on genetic programming for dynamic path-finding.
- The algorithm aims to enable LLMs to provide constants for fulfilling these requirements, serving as a tool for dynamic factory design.
- Cutting Tokens Saves Dough: The group noted that it cost about $1000 to evaluate 6 models across 24 tasks, with 8 runs each, with one member suggesting a potential 90% reduction in token usage through intelligent context pulling and a RAG implementation.
- A RAG implementation could cut 90% of the tokens used
- Nearest Buildable tool is Imperfect: Current strategy uses a nearest_buildable tool for identifying appropriate places to put things, which is imperfect, which they can create a thread on discord to discuss work streams.
- Recurring meetings might be established to discuss work streams.
GPU MODE ▷ #amd-competition (23 messages🔥):
Reference Kernel Times, Application Timeout Errors, fp8 gemm VGPR usage, Leaderboard Submission Issues, HIP Kernel .s File Access
- Reference Kernel gets Speed Boost: A pull request has been merged to mitigate the long reference time issue, aiming to improve run times when the main bot is updated.
- The update addresses concerns about the reference implementation taking too long, especially when faster implementations require numerous runs to meet termination criteria.
- Application timeout turns out to be a Blip: Members experienced intermittent application timeout errors, which were temporarily resolved by retrying submissions.
- A newline character within an `asm volatile` statement was identified as a potential cause, though the issue appeared to resolve itself.
- Users Seek fp8 GEMM VGPR Insights: A member writing a HIP kernel for fp8 gemm inquired about how to determine VGPR usage.
- Another member suggested using ROCm's amd_matrix_instruction_calculator to check.
- Leaderboard Submission from CLI fails!: A user reported experiencing timeouts when submitting to the leaderboard for amd-mixture-of-experts via the command line interface (CLI).
- Submitting via Discord worked, but the CLI submissions consistently timed out.
- Hunting HIP Kernel .s Assembly Files: A user sought a method to obtain the `.s` file (assembly code) for a HIP kernel, mentioning the use of `hipcc` and `extra_cuda_cflags`.
- One suggestion was to pass `-save-temps` to `hipcc`, but accessing the file during execution proved difficult; compiling locally was suggested as an alternative.
GPU MODE ▷ #cutlass (25 messages🔥):
CUTLASS 4.0 Release, CuTe DSL for Python, MLIR Compiler, PTX Dumping, Custom Kernel Performance
- CUTLASS 4.0 and CuTe DSL Debut!: CUTLASS 4.0 and CuTe DSL are now released, accessible via `pip install nvidia-cutlass-dsl`, and NVIDIA recommends starting with the Jupyter notebooks.
- Members noted the `nvidia-cutlass-dsl` package is version `0.0.0....`, which was released roughly two months ago according to PyPI, so something seems borked with the release.
- CuTe DSL Requires Python 3.12: A user reported issues installing and running the examples, which was resolved by using Python 3.12, as required by the documentation.
- CuTe DSL Achieves Blazing Fast Kernel Performance: A member implemented a custom CuTe kernel in C++ for fractional norms, achieving 67ms for a 30,000x30,000x1000 problem with p=1.0, outperforming `torch.cdist` at 4,000ms.
- Replacing `cute.gemm` in the `sgemm.py` example with a custom implementation yielded similar performance, compiling in 0.5 seconds and running in 62ms, beating PyTorch by 60x!
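For reference, the PyTorch baseline being beaten is a one-liner; the sizes below are scaled down from the 30,000x30,000x1000 problem so the snippet runs on modest hardware.

```python
import torch

x = torch.randn(3_000, 1_000, device="cuda")
y = torch.randn(3_000, 1_000, device="cuda")

d = torch.cdist(x, y, p=1.0)  # fractional/Minkowski norms via the p argument
print(d.shape)                # torch.Size([3000, 3000])
```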
- MLIR Compiler Not Open Source (Yet?): A user inquired about building from source and whether the MLIR src files are open sourced, but developers confirmed that the dialect compiler is not OSS.
- Users can install with pip and just use it.
- Dumping PTX from CuTe DSL: A user asked if there's a way to dump the generated PTX code, similar to Triton's MLIR_ENABLE_DUMP, but currently, setting `CUTE_DSL_PRINT_IR=1` only dumps the MLIR file.
- This feature does not yet exist.
GPU MODE ▷ #mojo (2 messages):
Mojo PyTorch backend, Autograd Implementation, Micrograd, Pytorch internals
- Mojo ❤️ PyTorch Backend?: A member expressed enthusiasm for the idea of Mojo becoming a PyTorch backend, while also hoping for a more accessible codebase, especially for those less familiar with C++.
- He inquired about the implementation of the backward pass, specifically asking about the need for separate kernels and how fusion would be handled.
- Micrograd as Pytorch Inspiration: One member mentioned that Micrograd was based on PyTorch, with links to the Micrograd video and the PyTorch paper provided for context.
- This suggests that the principles and implementation details of PyTorchâs autograd system could offer insights into how Mojo might handle its own backward pass.
aider (Paul Gauthier) ▷ #general (49 messages🔥):
Gemini 2.5 Pro, Model Performance, Common Lisp, AI Studio, Repomap
- Gemini 2.5 Pro performance and use cases: Users are experimenting with Gemini 2.5 Pro via OpenRouter, using `--edit-format diff-fenced`, and observing that it sometimes rewrites huge files for small changes, raising questions about its behavior.
- Some users find AI Studio provides results faster, and report that Sonnet 3.7 works best for their workflow, while others use cheaper models for ask mode and Sonnet 3.7 for architect mode.
- Discussing Common Lisp and Modern AI Tooling: Users are discussing using existing models to develop in less popular languages like Common Lisp, planning to use books and data sources to create datasets and prompts for in-context learning.
- The idea involves LoRA-ing small models and using semantic retrieval to add programming book wisdom to the context window, and a member suggests creating a Lisp DSL to build a compiler/interpreter.
- Addressing Google AI Studio Redirect Issues: A user reports being redirected from Google AI Studio after briefly seeing the UI, despite their country being on the allowed list, seeking potential solutions or explanations.
- Repomap Issues and Solutions: A user questions why repomap sometimes underperforms, even with a high map multiplier, indicating potential issues with file mapping in certain projects.
- They noted that "getting the perfect snake is no easy task!"
- Gemini Adds Comments Into the Code and Stupid Ideas for Later: A user found that Gemini adds many comments into the code, and even writes "stuiped ideas" for later directly inside the code, which it then implements because they were present in the code.
- They argued there should be strict code changes without embedded coding ideas.
aider (Paul Gauthier) ▷ #questions-and-tips (48 messages🔥):
Gemini rate limits, Aider upgrades, Aider models, Aider configuration, Aider file navigation issues
- Gemini Free Tier Runs into Rate Limits: Users reported experiencing sudden rate limits with Gemini's free tier, even after periods of inactivity, with one user noting "probably they just deprioritize those clusters so when load goes up elsewhere it 429s".
- Aider Upgrade Woes Persist: Users are encountering issues when upgrading Aider, with the upgrade process failing to update the version number despite appearing to complete successfully.
- The SSL warning during the upgrade is likely unrelated, as it has been a recurring issue since January.
- Experimental Gemini Models Disabled, Confusion Ensues: Users faced errors indicating that the free experimental Gemini model was disabled, leading to confusion and the suggestion to switch to the preview model.
- One user reported unexpected charges, questioning whether they were actually using the preview model; checking the Aider announce lines at startup later clarified that the pro-preview wasn't being used.
- Aider Gets a Concise Aussie Chat Makeover: A user discovered a way to make Aider's replies easier to read by modifying the `~/.aider.conf.yml` file.
- They suggested using `chat-language: English (Australia, use headings, bullet points, concise sentence fragments)`.
- Aider Struggles with Large File Navigation: A user reported issues with Aider navigating a 1600 line file, experiencing difficulties with line numbers and debugging unrelated code.
- It was suggested to try different models and to consider that the repo map might be contributing to the issue.
Eleuther ▷ #general (22 messages🔥):
lm-eval-harness dataset download, R1-distill models prompt format, Regulatory bias standards and LLMs, Open Science Conference call for papers, ODSC vs OSC conference confusion
- LM-Eval-Harness Simplifies Dataset Downloads: A user inquired about downloading datasets for specific tasks in lm-eval-harness without immediately evaluating a model, and a solution was found by using `python3 lm_eval --model dummy --tasks [task_list] --limit 1` to download the datasets.
- The `dummy` model defined here is used for testing and returns random numbers, while `--limit n` restricts evaluation to the first `n` rows.
- R1-distill Model Prompt Formatting Explored: A member asked if it's common practice to prompt R1-distill models with a `user: What is xyz? assistant:` format rather than just directly doing `What is xyz?`.
- Unfortunately, the thread was cut off here.
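Rather than hand-writing role prefixes, the usual approach is to let the tokenizer's chat template produce the format; a minimal sketch with an illustrative R1-distill checkpoint:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is xyz?"}],
    tokenize=False,
    add_generation_prompt=True,  # appends the assistant-turn marker
)
print(prompt)
```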
- LLMs Face Bias Regulation Scrutiny: Discussion revolved around regulations concerning bias standards in algorithms, particularly in the context of LLMs, citing examples of regulatory agencies like the NCUA, EEOC, FDIC, HHS, and FTC.
- The archived EEOC guidance was mentioned, emphasizing that regulators require proof of non-discrimination and may view algorithms that can't be studied as infringing.
- Open Science Conference Announces Call for Papers: The Open Science Conference is accepting calls for papers, potentially suitable for interdisciplinary work, with submissions due in one week.
- Further details on the call can be found on the Open Science Conference website.
- ODSC and OSC Conferences: Avoid Confusion!: It was clarified that ODSC is distinct from OSC and that there are a lot of conferences with very similar names/abbreviations, with a warning that some of these might be scams spread through old google groups.
- One member confirmed ODSC is legitimate (since they were a speaker there), and OSC appears legitimate, but less popular.
Eleuther ▷ #research (57 messages🔥🔥):
Model of Mind AI, Falsifiable Hypothesis, Sparse Gradients, Qwen 3, Skywork Model
- Modeling AI after the Mind: A member modeled an AI after the concepts of a conscious, subconscious and unconscious mind with a higher level behavioral system, based on the psychological model.
- A member noted that the channel discusses a rather narrow range of specific ML topics.
- Falsifiable hypothesis: It was stated there is no need for a degree here, but there must be some adherence to a falsifiable hypothesis, or mathematical description of the process.
- The channel is for discussing research topics or specific papers and results rather than oneâs own partially formed research ideas.
- Qwen Cooks Up a Storm: Members noticed that Qwen is cooking hard as evidenced by a linked image.
- A question arose as to what the actual mechanism for controlling entropy is in Qwen 3.
- Skyworkâs Shorter Reasoning: A member said that the Skywork model and techniques are very good, especially given that a full release just came out and linked to Skywork Open Reasoner Series.
- They normalized by the total tokens in the training batch rather than per sequence, which is basically Dr. GRPO; a sketch of the difference follows.
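The distinction is easy to show in code; a minimal sketch of the two normalizations, where `per_token_loss` is a (batch, seq) tensor and `mask` marks real (non-padding) tokens.

```python
import torch

def per_sequence_norm(per_token_loss, mask):
    # GRPO-style: average within each sequence, then across sequences.
    seq_loss = (per_token_loss * mask).sum(-1) / mask.sum(-1)
    return seq_loss.mean()

def per_batch_token_norm(per_token_loss, mask):
    # Dr. GRPO / Skywork-style: average over all tokens in the batch,
    # so long sequences are not down-weighted token-for-token.
    return (per_token_loss * mask).sum() / mask.sum()
```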
Eleuther ▷ #lm-thunderdome (7 messages):
Multi-GPU lm-eval, vllm Tensor Parallel
- Multi-GPU lm-eval utilization issues surface: A member reported that when using `parallelize=True` in lm-eval, GPU 0 has 100% utilization, while GPU 1 has 0% utilization.
- Another member explained that `parallelize` uses naive pipeline parallelism where it splits the model layers, so no more than one rank is used at a time, suggesting `accelerate launch -m lm_eval ...` for running multiple replicas.
- vllm Tensor Parallelism recommended for stability: When other multi-GPU solutions failed, a member suggested using vllm tensor parallel, noting that it's more reliable.
- The original poster was unaware of using vllm with lm-eval, and expressed that they had been using the HuggingFace implementation.
Nous Research AI ▷ #announcements (2 messages):
Atropos v0.2.0 Release, Psyche Network Launch, Decentralized AI Training, Large Language Model Training, Open Source AI Development
- Atropos v0.2.0 drops with Axolotl support: Nous Research has released v0.2.0 of Atropos, their RL environments project, featuring new environments, updated API handling, better TRL support, and integration with Axolotl as an official trainer partner, with usage guide here.
- Psyche Network Launches to Democratize AI Training: Nous Research launched the Psyche Network, a decentralized training network aimed at democratizing AI development by bringing together distributed compute resources for training large-scale models.
- Psyche Testnet Trains a 40B Parameter LLM: The testnet launch of Psyche involves pre-training a 40B parameter LLM using an MLA Architecture and a dataset comprising FineWeb (14T), FineWeb-2 (4T), and The Stack v2 (1T).
- DisTrO Optimizer Breaks Bandwidth Constraints on Psyche: The Psyche network utilizes Nous's DisTrO optimizers and a custom peer-to-peer networking stack to coordinate globally distributed GPUs, overcoming previous bandwidth constraints in AI training.
- Open Source Community Drives Psyche Development: Nous Research encourages community involvement through forums and Discord to gather model ideas, aiming to foster innovation in model creation and design within the open source community, with code available on GitHub.
Nous Research AI ▷ #general (78 messages🔥🔥):
Frontier Models, smolvlm-realtime-webcam, 3 GPUs, latex2sympy2_extended math_verify, Atropos
- Benchmarks fail to capture model nuances: A member finds it very hard to feel out which frontier model is better at different tasks because benchmarks are not granular or diverse enough, and shares a link to illustrate that the "best" coding model might still be terrible at front end, or xyz framework, data viz, etc.
- Try smolvlm realtime webcam project: One member shared a link to smolvlm-realtime-webcam project.
- Troubleshooting Atropos dependencies: A member ran into a problem running the `examples/gsm8k_server.py` file, which requires the `math_verify` and `latex2sympy2_extended` modules; another member suggested using `pip install latex2sympy2_extended math_verify` to fix the problem.
to fix the problem. - Nousâ Psyche is distributed GPU training!: The Psyche Network mining pool filled up to 500k in 40 minutes, prompting one to state that itâs almost as if making AI training open to everyone is a good idea and the project link- was shared by a member.
- Contribute USDC for Compute, it is a donation: Members discussed donating USDC to contribute compute to the Psyche project, with the funds going to Nous; a member confirmed that any capital contributed to this pool is purely a donation and for testing purposes only.
Nous Research AI ▷ #ask-about-llms (1 messages):
princepolka: Is 05-06 worse at instruction-following than the previous 2.5 Pro?
Nous Research AI ▷ #research-papers (1 messages):
LLMs in multi-turn conversations, LLM performance degradation, Lost in Conversation paper, Premature Solution Generation by LLMs, LLM Recovery from Conversational Errors
- LLMs Struggle with Multi-Turn Conversations: A member shared the Lost in Conversation paper and its corresponding GitHub repo, analyzing LLM performance in multi-turn conversations versus single-turn settings.
- The paper reveals a 39% average performance drop across six generation tasks in multi-turn scenarios, attributing it to a minor loss in aptitude and a significant increase in unreliability, concluding that when LLMs take a wrong turn in a conversation, they get lost and do not recover.
- LLMsâ Premature Solution Attempts Lead to Unreliability: The study indicates that LLMs often make assumptions early in conversations and prematurely attempt to generate final solutions, leading to unreliability.
- This behavior suggests that LLMs may benefit from improved error correction mechanisms to recover from incorrect turns in a conversation.
Nous Research AI ▷ #interesting-links (2 messages):
Finetuning to 1.58 Bits, Cody S Tweet
- WandB Report: Finetuning to 1.58 Bits: A WandB report discusses finetuning to 1.58 bits.
- The report likely contains details on techniques and results related to achieving such low-bit finetuning.
- Cody S Posts on X: Cody S tweeted something on X.
- Without more context, the tweet's contents and relevance to AI research are unclear.
Nous Research AI ▷ #research-papers (1 messages):
LLMs in Multi-Turn Conversations, Lost In Conversation paper, LLM Unreliability
- LLMs Get Lost in Multi-Turn Conversations: A member shared the Lost In Conversation paper and its GitHub repo, which finds that LLMs perform significantly worse in multi-turn conversations compared to single-turn interactions, with an average performance drop of 39% across six generation tasks.
- LLMs Prone to Premature Solution Attempts: The paperâs analysis of 200,000+ simulated conversations revealed that LLMs often make assumptions early and prematurely attempt to generate final solutions, leading to unreliability.
- In simpler terms, the authors discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.
Notebook LM ▷ #announcements (1 messages):
User Experience studies, Multilingual Audio Overviews, NotebookLM Feedback
- NotebookLM Users Invited to UX Studies: A friendly reminder to opt-in to participate in User Experience studies was posted.
- The NotebookLM team is currently looking for feedback on their multilingual Audio Overviews feature.
- Feedback Wanted on Multilingual Audio Overviews: NotebookLM users are encouraged to provide feedback on multilingual Audio Overviews through user experience studies.
- This initiative aims to improve the user experience for those utilizing the multilingual audio features within NotebookLM.
Notebook LM ▷ #use-cases (19 messages🔥):
Invisible Sun TTRPG, Shareability Factor, Google Product Discontinuation, NotebookLM and OneNote Sync, Podcast Feature ToS
- Invisible Sun TTRPG Gamified with NotebookLM: A member has been teaching themself a new TTRPG called Invisible Sun by Monte Cook Gaming, using NotebookLM and ChatGPT Projects for rules lookup.
- They like NotebookLM for the shareability factor and clear citations but prefer ChatGPT audio reviews; they look forward to testing NotebookLM's insights on a new book coming via Backerkit.
- XDA Developers Hail NotebookLM for its Unique Features: XDA-developers published an article on six use cases/features where NotebookLM excels, prompting agreement among users.
- Another article shows how to use NotebookLM with OneNote.
- Google User Fears NotebookLM's Eventual Sunset: A user expressed concern that NotebookLM might be discontinued at an inconvenient time, citing Google's history of sunsetting good products.
- Another countered that NotebookLM's uniqueness and potential make it unlikely to be discontinued, suggesting a possible rebrand and marketing push.
- OneNote Sync Dream Sparks Discussion: A member proposed linking a OneNote notebook to NotebookLM for synchronization, so that changes in OneNote would update the NotebookLM source.
- This idea sparked discussion about the potential integration and benefits of such a feature.
- Podcast Feature Usage Questions Arise: A user inquired about the terms of service for using NotebookLMâs podcast feature, specifically regarding using the audio in platforms like YouTube.
- Another user suggested checking the T&C for clarity and advised that disclaimers about accuracy and links to original sources are important.
Notebook LM ▷ #general (32 messages🔥):
Podcast Length, Audio Upload and Transcription, Account Restrictions on PDF Uploads, Adding Information to System Prompt, Early Access Installation Issues
- Pad Podcast Length with Repeated Links: A user inquired about increasing podcast length in NLM, and one member suggested adding several links or documents on the same topic, even if repeated, to extend the podcast to 22 minutes.
- It was not specified if this strategy would work for everyone.
- Audio Upload Transcribes; AI Studio Enhances Subtitles: One member suggested uploading audio as a source for transcription.
- Another member recommended using 2.5 flash on AI Studio for timecoded subtitles.
- Restrictions on PDF Uploads Plague Some Users: Several users are experiencing restrictions on their accounts, preventing them from uploading PDFs.
- There was no resolution for this problem in the discussion.
- System Prompt Customization Craving Human Touch: A user expressed dissatisfaction with podcasters using too many letters to refer to things and sought a way to add info to the system prompt to prefer human language instead.
- No solutions were offered during this discussion.
- NotebookLM Beta Installation Glitches: A user reported being stuck on "installing" after receiving the "early access" notification for the app.
- There was discussion around the user's region, but no specific fix was offered.
Latent Space ▷ #ai-general-chat (41 messages🔥):
GPT-4 Launch, ChatGPT Scaling, AI Founder in Residence, AI in Ohio Courts, AlphaEvolve
- OAI Launch Retrospective: Personal Observations: A member shared some very wholesome stories of OpenAI launches from andrewmayne.com.
- Scaling ChatGPT: Building and Launching: The community shared a link to a newsletter article titled Building, launching, and scaling ChatGPT by the Pragmatic Engineer.
- The article goes over the history and tech stack of the ChatGPT launch.
- Founder In Residence: AI Edition: A member asked about Founder in Residence programs focused on AI, seeking advice on how to position themselves.
- They have experience building AI systems for Analytics use cases in Amazon ads and want to build Self-Serve Agents in the same analytics space.
- Gemini Powers Algorithm Design with AlphaEvolve: Google DeepMind introduced AlphaEvolve, a coding agent powered by Gemini designed for creating advanced algorithms.
- Turbopuffer hits General Availability: Turbopuffer announced they are GA (Generally Available).
Latent Space ▷ #ai-announcements (3 messages):
Tom Yeh, Llama 1/2/3/4, LLM Paper Club
- Prof Tom Yeh to walk through Llama 1/2/3/4: Prof Tom Yeh will walk through the Evolution of Llama 1/2/3/4 in one session at a special event.
- The event is organized by a member of the community.
- LLM Paper Club Notification Channels: A member directed users to look top left for Channels & Roles to be tagged in the relevant role for notifications for LLM Paper Club.
- No additional details were given.
HuggingFace ▷ #general (15 messages🔥):
Qwen Model Distillation, MiniCPM-V-2_6, Perceptron Visualizers, Local Stable Diffusion Hosting, Langfuse Deployment with Smolagents
- Qwen's Quintessence: Distilling Knowledge?: A member inquired about notebooks or references for the distillation of the Qwen family of models.
- No resources were directly shared in the provided context.
- MiniCPM-V-2_6: A Trending Model?: A member asked if anyone has tried using openbmb/MiniCPM-V-2_6, noting that it's trending and has a high number of downloads.
- No responses were provided in the given context.
- Visualizing Vectors: Perceptron Visualizer Sparks Joy!: A member shared a perceptron visualizer for educational purposes, as shown in the attached videos My_Video_0079.mp4 and My_Video_0080.mp4.
- Another member then shared another visualizer to enjoy from darkspark.dev.
- Stable Diffusion, Served Locally: Forge Your Own Images!: Several members inquired about locally hosting Stable Diffusion.
- It was suggested to combine Diffusers and TGI, or use WebUI Forge (GitHub link) or reForge (GitHub link); links to Diffusers documentation (huggingface.co, huggingface.co/learn) were also shared.
- Langfuse Local Launch: Telemetry Tango!: A member asked for help getting a local Langfuse deployment working with smolagents.
- They were directed to the dedicated channel and advised to get the docker-compose.yml from official docs and use opentelemetry-sdk.
HuggingFace ▷ #today-im-learning (4 messages):
Assistance Offered, Hugging Face Transformers, EleutherAI Suggestion, Diffusion Course from MIT
- AI Engineer Volunteers Expertise: An AI engineer offered assistance on interesting projects, particularly for researchers or professors working on papers, especially related to LLM research and reinforcement learning.
- The engineer is happy to contribute to anything, from brainstorming and coding to experimentation, implementation, or even the less glamorous parts like paperwork or debugging.
- Transformer Familiarity Questioned: A member asked whether the engineer was familiar with huggingface transformers.
- Another member suggested checking out EleutherAI.
- MIT Diffusion Course Recommended: A member inquired about research papers the engineer has been looking at.
- The member shared the MIT diffusion course focused on image generation.
HuggingFace ▷ #i-made-this (7 messages):
pdf2tex vs 12GB ram, PDF format criticism, Markdown output suggestion, Civitai censorship
- pdf2tex RAM Usage Impresses!: A user noted that pdf2tex uses only 1GB of RAM while auto-detecting and extracting figures using OpenCV, contrasting with a project using 12GB of RAM for parallel processing.
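A rough sketch of that figure-detection idea: find large contours on a rendered page with OpenCV and crop them out. The thresholds are illustrative, not pdf2tex's actual parameters.

```python
import cv2

page = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)  # a pre-rendered PDF page
_, binary = cv2.threshold(page, 250, 255, cv2.THRESH_BINARY_INV)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

figures = []
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w * h > 50_000:  # keep only figure-sized regions
        figures.append(page[y:y + h, x:x + w])
```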
- PDFs: The Bane of Existence!: Users expressed strong dislike for the PDF format, with one calling it the worst format ever seen and another joking that calling PDF a format is a loose term.
- One user converts PDFs to TGA or BMP in memory for easier processing and expressed a desire for a pdfToSrc solution.
- Markdown Output Proposed!: A user suggested adding a markdown output option to improve semantic relationships for RAG ingestion.
- The developer acknowledged the suggestion, noting that while markdown is better than plain text, it may not fully address categorization issues for embedders, particularly with tables.
- Civitai Censors Celebrity Content!: A user reported that Civitai has muted all celebrity content, raising concerns about censorship.
- They linked to a Civitai model and shared a quote from Jeri Ryan (Seven of Nine) regarding the use of AI to generate nudes.
HuggingFace ▷ #reading-group (3 messages):
Simulation-Based Inference, AI Reading Group session
- Reading Group Explores Decision-Making with Simulation-Based Inference: The AI Reading Group session will discuss using Simulation-Based Inference for modeling decision-making, relevant to understanding human behavior.
- A Medium post provides additional information about the paper.
- Reading Group Keeps Consistent Schedule: The AI Reading Group session will be held at 9am PDT / 12pm EDT / 6pm CEST.
- This is consistent with the previous meeting time.
HuggingFace ▷ #NLP (3 messages):
Emotion detection limitations, Transformers tokenizer context length
- Emotion detection faces benchmark quality woes: Emotion detection doesn't work very well due to low-quality benchmarks, because scholars have difficulty defining what they want to predict and encoder models tend to learn heuristics.
- This is mostly caused by the difficulty of reaching agreement on gold-standard labels.
- Transformers tokenizers limited by context length: All models have a context length, according to the Transformers tokenizer documentation.
- If you pass too much context to your model, you will simply get an error.
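In practice the fix is to truncate at tokenization time; a minimal sketch with an illustrative encoder model:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative model
enc = tok(
    "some very long document ... " * 1000,
    truncation=True,
    max_length=tok.model_max_length,  # 512 for BERT-style encoders
    return_tensors="pt",
)
print(enc["input_ids"].shape)  # capped at the model's context length
```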
HuggingFace ▷ #smol-course (2 messages):
Agent blocked sites, Smolagents framework
- Agent wastes time on blocked sites: A user reported that their agent wastes time going to a blocked site (universetoday.com).
- No solutions were provided in the given messages.
- Smolagents framework yields terrible results: A user reported using the smolagents framework with Qwen and some of the tools used in the course (Google Search).
- The user complained that the results were terrible.
HuggingFace ▷ #agents-course (10 messages🔥):
HF Inference Provider Credits, HF SPACE_ID and SPACE_HOST ENV vars, Unit 1 code execution, InferenceClient Model Selection, Llama models text_generation
- Credits Crunch Prompts Assignment Submission Solutions: A user inquired about submitting the final assignment after exceeding the monthly included credits for the HF Inference Provider, while developing unit 4 locally using Ollama.
- Another user suggested adding HF SPACE_ID and SPACE_HOST as ENV variables and running the app locally.
- Unit 1's Home Turf: Where Does Code Roam?: A user asked where to run the code for Unit 1, specifically mentioning the HF space duplication.
- Another user recommended using Google Colab.
- InferenceClient Model Choice Causes Pope Age Quandary: A user reported that when running the Unit 1 agent in the space, with no changes and giving it a "hello", the agent tried to calculate the age of the Pope.
- He also shared that he tried using `client = InferenceClient("meta-llama/Llama-3.3-70B-Instruct")`.
- Text Generation Troubles in Llama Land: A user suggested using this model instead: `client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")`.
- This is happening because the text_generation function is not supported in any of the Llama models (a sketch contrasting the two call styles follows this list).
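A sketch contrasting the two call styles on the serverless Inference API; per the thread, raw `text_generation` may not be served for the Llama endpoints, while chat completion is:

```python
# Minimal sketch: text_generation vs. chat_completion on the HF Inference API.
from huggingface_hub import InferenceClient

# Raw text generation against the suggested Mixtral endpoint.
client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")
print(client.text_generation("Hello!", max_new_tokens=32))

# For Llama models, use the chat-completion task instead.
chat = InferenceClient("meta-llama/Llama-3.3-70B-Instruct")
resp = chat.chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```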
MCP (Glama) ▷ #general (40 messages🔥):
Typescript vs Authpython Lag, Debugging MCP servers on Smithery, Scalable MCP with Streamable HTTP, User Confirmation for AI Agent MCP Tools, Revolutionary idea for MCP Security
- Authpython Lags Typescript APIs: It was mentioned that Authpython generally lags behind Typescript by around 1-2 months in terms of API updates.
- A member suggested checking a specific channel for examples, with a link to a Go-MCP client.
- Debugging MCP Server Deployed to Smithery: A user sought advice on debugging an MCP server running on Smithery due to an error encountered in Claude Desktop.
- Another member recommended using the ithena-cli tool to store all input and output for debugging, prefixing the run command.
- MCP Servers go Streamable HTTP?: A user inquired about using MCP servers with streamable HTTP instead of stdio for scalability, noting that most open-source servers use stdio.
- They were unsure if they needed to reconfigure every open-source MCP server from stdio to streamable HTTP (a minimal server sketch appears after this list).
- AI Agents now ask for user Confirmation: A user asked how to ensure their AI Agent explicitly asks for user confirmation before triggering updates via MCP tools, similar to Claude Desktop.
- The author of fast-agent chimed in noting there is a pre_tool_call check hook that can be used to add an approval flow, similar to the existing human input tool.
- MCP Inspector Images Evaporate in Claude?: A user reported that while images are visible in the resource section of MCP Inspector after invoking a tool, Claude Desktop does not show the image in the resource section.
- Another user clarified that Claude only shows images in the tool response view.
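On the streamable-HTTP question above: a minimal sketch, assuming the MCP Python SDK's `FastMCP` helper and its transport names (worth verifying against the SDK version in use), showing that the same server definition can be exposed over stdio or streamable HTTP:

```python
# Minimal sketch: one MCP server, switchable between stdio and streamable HTTP.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

if __name__ == "__main__":
    # "stdio" is the default; "streamable-http" serves the same tools over HTTP,
    # which scales better than spawning one process per client.
    mcp.run(transport="streamable-http")
```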
MCP (Glama) ▷ #showcase (3 messages):
Yarr MCP Servers, Tiny Agents Remote MCP Support, LLM-provider-agnostic, MCP enabled Chat Client
- Yarr MCP Servers on GitHub: A member shared a link to a GitHub repository containing several MCP servers for the *arr apps.
- A member also shared a link to X (formerly Twitter) further discussing this topic.
- Tiny Agents Gets Remote MCP Support: Hugging Face Tiny Agents now has remote MCP Support and can connect to both SSE and Streaming HTTP servers from the command line.
- Tiny Agents offers a versatile approach to agent development and management.
- New Web-Hosted Chat Client is MCP Enabled: A member introduced a new LLM-provider-agnostic, MCP enabled, web-hosted chat client open sourced at chatter and hosted at moopoint.io.
- The client aims to replace Claude Desktop with a web interface for interacting with LLM providers and MCP servers, with features like a free tier, memory, MCP server hosting, image handling, file uploads, and voice interaction coming soon.
Torchtune ▷ #general (5 messages):
Custom Torchtune Models with vLLM, Synchronous GRPO recipe with vLLM
- Torchtune Model's vLLM Voyage: A member confirmed running a custom Torchtune model with vLLM in their internal version of GRPO.
- They hinted at potentially making their implementation public, after being asked how to enable vLLM support for their model.
- vLLM Integration Gets Synchronized: A member proposed creating a synchronous GRPO recipe with vLLM, suggesting both synchronous and asynchronous versions should exist.
- They expressed a strong preference for using the vLLM version, stating they genuinely don't see any reason not to.
Torchtune ▷ #dev (37 messages🔥):
HFModelTokenizer vs GemmaTokenizer, Gemma PromptTemplate, Tokenizer configurations, Masking assistant tokens
- Gemma Tokenizer Faces Discrepancy with HFModelTokenizer: A member reported that the HFModelTokenizer with the Gemma chat template produces output tokens that donât match the torchtune GemmaTokenizer tokens.
- This discrepancy suggests that torchtune's GemmaTokenizer may not be applying the chat template correctly.
- Gemma PromptTemplate Missing, Alpaca to the Rescue?: It was noted that there isn't a specific PromptTemplate for Gemma, which leads to incorrect tokenization and potential issues with the `system` role.
- The default might be to use the Alpaca template, but it's crucial to have a correct Gemma-specific template.
- Multiple BOS tokens Error Inherited from HF/Google's Config: The HF tokenizer is adding multiple beginning-of-sequence (BOS) tokens due to the configuration having `"add_bos_token": true` alongside a BOS token in the chat template.
- This issue is inherited from HF/Google's tokenizer config, making the implementation technically "correct" but functionally flawed; a reproduction sketch follows this list.
- Navigating the Maze of Jinja Tricks for Masking Assistant Tokens: A discussion emerged around masking, specifically how Hugging Face provides an option to return an assistant mask.
- The conversation highlights the complexity of maintaining the masking process, with potential solutions involving Jinja tricks.
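A quick way to observe the double-BOS behavior described above, assuming a Gemma instruct checkpoint (gated on the Hub) and the stock HF tokenizer config:

```python
# Minimal sketch: count BOS tokens after applying the chat template.
# Duplicates appear when add_bos_token is true AND the template emits <bos>.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-7b-it")  # example checkpoint
messages = [{"role": "user", "content": "hello"}]
ids = tok.apply_chat_template(messages, tokenize=True)
print(ids.count(tok.bos_token_id))  # > 1 reproduces the reported issue
```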
Modular (Mojo 🔥) ▷ #mojo (25 messages🔥):
Variant bug with SIMD, register_passable types, Mojo in Google Colab
- Variant Bug causing segfaults with SIMD found: A user reported a crash with `Variant` when using `SIMD` types in Mojo, specifically a segfault occurring between print statements when a `Variant[T](simd)` is used; the issue seems related to insufficient space allocation within `Variant` or a lifetime issue.
- A minimal, reproducible example was provided which can be found on GitHub issue 4578, along with other code snippets demonstrating the bug's erratic behavior, which includes the location of print statements affecting the crash.
- Concerns with register_passable Types Arise: Concerns were voiced about using `register_passable` types that exceed the size of system registers in Mojo, as this may be causing miscompilations, because LLVM does not handle it well.
- It was suggested that the current implementation of `Variant` might be flawed for register-passable types `T` where `sizeof[T]()` is larger than any register on the system, and should be replaced with various versions of `Trivial`.
- Colab Mojo Integration launches!: It is now slightly easier to compile and run Mojo code in a Colab notebook cell via a new import, `import max.support.notebook`, which gives a `%%mojo` magic command (sketched below).
- The announcement was posted on the Modular forums.
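What the two notebook cells might look like, going only by the import path and magic name from the announcement (unverified beyond that):

```python
# Cell 1: register the %%mojo cell magic (import path from the announcement).
import max.support.notebook

# Cell 2 would then be a separate notebook cell, e.g.:
# %%mojo
# def main():
#     print("Hello from Mojo in Colab!")
```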
tinygrad (George Hotz) ▷ #general (15 messages🔥):
WebGPU bug, BEAM parameter, tinybox-ui, high performance blake3 implementation
- WebGPU backend suffers from a bug: The generated kernel does not have consecutive DEFINE_GLOBAL args, but `bufs_from_lin` assumes DEFINE_GLOBAL has consecutive args, according to this message.
- Claude allegedly managed to fix it.
- BEAM parameter impacts WebGPU performance: Setting BEAM to anything results in the WebGPU backend suffering in performance; it runs at 30ms with no beam and 150ms with BEAM=1.
- It runs at 100ms with BEAM=2 (a minimal sketch of setting BEAM appears after this list).
- Minimalist Tinybox UI concept emerges: A user built a minimalist UI concept for tinybox, with no login, no cloud, no fluff, focusing on fast, local control for people who touch hardware, which can be found here.
- It was stated that an HTTP settings page for tinybox is generally supported, with the caveat that it needs to have 0 deps and absolute minimal line count.
- Blake3 for tensor storage: A bounty exists for a high performance blake3 implementation to use for content addressable tensor storage for the cloud.
- As such, the implementation should be general purpose, "or something," according to a user.
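On the BEAM item above: a minimal timing sketch; BEAM is read from the environment, so it must be set before tinygrad is imported (the workload here is arbitrary):

```python
# Minimal sketch: compare BEAM settings on the same workload.
import os
os.environ["BEAM"] = "2"  # try "0" (off), "1", "2" and time the difference

import time
from tinygrad import Tensor

x = Tensor.rand(1024, 1024)
start = time.perf_counter()
(x @ x).realize()  # kernel search honors the BEAM setting
print(f"{(time.perf_counter() - start) * 1000:.1f} ms")
```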
tinygrad (George Hotz) ▷ #learn-tinygrad (1 messages):
cookiecrumbs3808: Or offloaded to CPU, I guess.
LlamaIndex ▷ #blog (3 messages):
LlamaIndex Memory component, LlamaExtract citation implementation
- LlamaIndex Memory Component Augments AI Agents: LlamaIndex introduces a new Memory component to enhance AI agents with both short-term and long-term memory capabilities for context-aware conversations.
- The new memory component allows developers to implement static memory blocks (link) to their chatbot agents.
- LlamaExtract Gets Citations and Reasoning: A new code walkthrough by @tuanacelik demonstrates how to implement citations and reasoning in LlamaExtract.
- The walkthrough details how to define a custom schema that instructs the LLM on what to extract from complex data sources (link).
LlamaIndex ▷ #general (6 messages):
LlamaIndex Memory Component, Memory Session Management, Database Integration for Memory, Serialization vs. Database for Context, Memory vs Redis
- Memory Component Stumper for Workflows: A user is facing issues with the new Memory component in LlamaIndex workflows, noting that the memory is empty on each workflow call when using `user_id` to set the `session_id`.
- The user also inquired about Redis integration with the Memory component.
- Memory Defaults to In-Memory DB, but DB Connection Recommended for Scalability: By default, the `Memory` component uses an in-memory SQLite database, but it can be configured to use a local SQLite database or a PostgreSQL database by changing the database URI.
- For large chat histories, using a database is recommended over serializing to a JSON blob via `memory.to_dict()` for scalability (a configuration sketch appears after this list).
- Context Serialization vs. DB for Chat History: A user questioned the benefit of using a database connection with the `Memory` component versus serializing the context, as restoring the context also restores chat history.
- The response clarified that serializing the context is fine by default, but a database is preferable for large chat histories or when a structured way to save the history is needed; choosing between a Python dict and Redis is the same tradeoff.
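Pulling the thread's advice together, a hedged sketch of the configuration being discussed; the `Memory.from_defaults` parameter names (notably `async_database_uri`) are assumptions based on the conversation, not verified API:

```python
# Minimal sketch, assuming the new LlamaIndex Memory API discussed above.
from llama_index.core.memory import Memory

# Default: in-memory SQLite, keyed by session_id; reusing the same id
# across workflow calls is what restores the same history.
memory = Memory.from_defaults(session_id="user-123", token_limit=40_000)

# For large chat histories, point at a real database instead of
# serializing via memory.to_dict() (URI and parameter name assumed).
durable = Memory.from_defaults(
    session_id="user-123",
    async_database_uri="postgresql+asyncpg://user:pass@localhost:5432/chat",
)
```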
Cohere ▷ #💬-general (3 messages):
Generation Parameters, Use cases for Cohere, Cohere vs ChatGPT and Anthropic
- Guidance on Generation Parameters Requested: A member asked for guidance on suggested generation parameters for Command A.
- Interest in Cohere's Use Cases: Others were curious about use cases for Cohere vs. other models like ChatGPT and Anthropic.
Cohere ▷ #🔌-api-discussions (5 messages):
Cohere API Calls, Cohere Billing, Cohere Trial Key
- Cohere User Asks about API Call Count: A Cohere user asked how to check the number of API calls made.
- Another user provided a link to the billing dashboard.
- Cohere Trial Key Doesn't Show Number of API Calls: A Cohere user stated that the trial key only shows tokens and not the number of API calls made.
- They added, "I don't think there is a raw number of requests being counted."
LLM Agents (Berkeley MOOC) ▷ #mooc-questions (2 messages):
Course certificate requirements, Medium article or X post for certificate
- Medium Article or X Post unlocks Course Certificate: Members clarified that earning a course certificate requires writing a Medium article or an X post summarizing one of the lectures.
- Interested members must submit their work via this form to receive credit.
- Submitting Coursework for Certificate: To get the certificate, the coursework must be submitted via the provided Google Forms link after completing a Medium article or X Post.
- The submission ensures that the work is properly credited towards the course certificate.