Whisper is all you need.
AI News for 5/13/2025-5/14/2025. We checked 9 subreddits, 449 Twitters and 29 Discords (214 channels, and 4313 messages) for you. Estimated reading time saved (at 200wpm): 428 minutes. Our new website is now up with full metadata search and beautiful vibe coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on @smol_ai!
We try to keep coverage to model- and code-specific news that we're pretty sure engineers will someday use at work, but occasionally smaller product launches are interesting fodder for commentary on the broader AI landscape, especially if the launches involve highly regarded work products like Notion or Granola.
There's an ongoing joke in biology that everything evolves into crab. The same is happening in AI wrapper land: just because wrappers are now recognized as valuable doesn't stop them from being easy to clone. Bolt inspires Figma Make, Claude Code inspires OpenAI Codex, Deep Research inspires Deep Research inspires Research inspires DeepSearch, and on and on. Ideas are worth nothing; may the best distribution + execution win.
The occasion of Granola's $43m Series B (at a $250m valuation) is their moment to launch "Granola 2.0", their collaborative version with a surprisingly… Notion-y UI.
This is a day after Ivan Zhao launched… an interesting Granola-lite feature.
AI Twitter Recap
Language Models and Releases
- GPT-4.1 availability: @OpenAI announced that GPT-4.1, which specializes in coding tasks and instruction following, will be directly available in ChatGPT for Plus, Pro, and Team users, with Enterprise and Education users gaining access in the coming weeks. @kevinweil noted that GPT 4.1 mini is replacing GPT 4o mini everywhere in ChatGPT, including for free users.
- Claude models: @scaling01 expressed excitement about the upcoming Claude Opus, anticipating further models like Ultra and GPT-4.5-based reasoning models. @steph_palazzolo shared information on Anthropic's upcoming Claude Sonnet and Claude Opus releases, noting their different reasoning models. However, @andersonbcdefg criticized that Claude is braindead now, with O3 making random stuff up and sending you down rabbit holes of hallucinations.
- Qwen models: @Alibaba_Qwen shared the Qwen3 Technical Report, detailing model specifics and complete assessments. @iScienceLuvr highlighted that Seed1.5-VL delivers state-of-the-art results on 38 out of 60 public VLM benchmarks. @reach_vb congratulated the team, and @Yuchenj_UW commended the Qwen team's great work. @qtnx_ also expressed respect for the Qwen team's impressive (and hilarious) move of throwing thirty-six TRILLION tokens at a 600M model.
- Meta's AI efforts: @AIatMeta announced new releases from Meta FAIR, including models, benchmarks, and datasets for molecular property prediction, language processing, and neuroscience. However, @Yuchenj_UW criticized Meta's AI, particularly Llama 4, noting issues with ignoring attached pictures and login failures.
- @_akhaliq announced that AM-Thinking-v1 just dropped on Hugging Face, calling it an advancement on the Frontier of Reasoning at 32B Scale (a minimal loading sketch follows this list).
- @RisingSayak announced a new Diffusers-compatible training script for SANA Sprint.
- Gemini 2.0 Flash Preview: @ArtificialAnlys reported that Gemini 2.0 Flash Preview image generation delivers a modest upgrade over the 2.0 Flash Experimental release but remains well below the state-of-the-art threshold. @HamelHusain said that Gemini one-shotted these chapter summaries with amazing accuracy.
- Stability AI just dropped Stable Audio Open Small on Hugging Face. @_akhaliq noted that it offers fast text-to-audio generation with adversarial post-training.
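For readers who want to poke at one of these drops, here is a minimal, hedged sketch of loading a chat model from the Hub with transformers; the `a-m-team/AM-Thinking-v1` repo id is assumed from the tweet, so verify it against the model card.

```python
# Minimal sketch: trying one of the day's Hub drops with transformers.
# The repo id below is assumed from the announcement, not verified.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "a-m-team/AM-Thinking-v1"  # assumed repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Why is the sky blue?"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```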
Agent Development and Tooling
- LangChain Interrupt: @LangChainAI provided updates from Interrupt 2025, focused on evals, quality, and reliability, emphasizing that quality is still the biggest blocker of bringing agents to production. @LangChainAI also introduced the Open Agent Platform, an open-source, no-code agent building platform. @LangChainAI announced that the LangGraph Platform is now generally available, designed for deploying, scaling, and managing agents.
- LlamaIndex Memory Component: @llama_index introduced a new, flexible Memory API that blends short-term chat history and long-term memory via plug-and-play blocks (a toy sketch of the block idea follows this list).
- Runway References Update: @c_valenzuelab shared the cool use cases emerging with the latest References update.
- @LiorOnAI announced a debugging tool from @PatronusAI that scans full execution traces, detects 60+ failure types, and suggests prompt fixes, working with Langchain, CrewAI, OpenAI SDKs, and more.
- Model Context Protocol: @AndrewYNg announced a new course on MCP with Anthropic, focusing on building rich-context AI apps. @DeepLearningAI announced a new course with Anthropic on MCP. @jerryjliu0 introduced a new abstraction for agentic memory, modeling it as a set of "blocks" in a waterfall architecture.
- @nerdai introduced FedRAG, a framework for fine-tuning RAG systems, highlighting its focus on simplifying fine-tuning across centralized and federated architectures.
- @LiorOnAI noted OpenAI quietly released their GPT-4.1 Prompting Guide, saying it's a must-read if you're using agents or LLMs.
- @steph_palazzolo observed that coding assistants are moving towards always-on agents that constantly search for bugs and vulnerabilities in the background.
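To make the memory-blocks idea from the LlamaIndex and @jerryjliu0 items concrete, here is a toy "waterfall" sketch: new messages land in a small short-term block, and overflow spills into progressively larger long-term tiers. This illustrates the concept only; it is not the actual LlamaIndex Memory API.

```python
# Toy waterfall memory: short-term block overflows into long-term tiers.
from dataclasses import dataclass, field

@dataclass
class Block:
    capacity: int
    items: list = field(default_factory=list)

class WaterfallMemory:
    def __init__(self, capacities=(4, 16, 64)):
        self.blocks = [Block(c) for c in capacities]

    def add(self, message: str) -> None:
        carry = message
        for block in self.blocks:
            block.items.append(carry)
            if len(block.items) <= block.capacity:
                return
            carry = block.items.pop(0)  # oldest item spills down a tier
        # past the last tier a real system might summarize; this toy drops it

    def context(self) -> str:
        # oldest tiers first, so the freshest messages sit nearest the prompt end
        return "\n".join(m for b in reversed(self.blocks) for m in b.items)

mem = WaterfallMemory(capacities=(2, 4, 8))
for i in range(10):
    mem.add(f"msg {i}")
print(mem.context())
```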
AI Infrastructure and Tools
- Hugging Face and Integrations: @reach_vb noted that you can now use any model from Hugging Face directly in Kaggle notebooks. @ClementDelangue said it is very cool to see @PyTorch contributing on @huggingface. @_akhaliq reported blazingly fast Whisper transcriptions with Inference Endpoints.
- vLLM Enhancements: @ClementDelangue reported an 8x faster/cheaper @openai Whisper API thanks to Hugging Face Inference Endpoints & @vllm_project (a client sketch follows this list). @vllm_project congratulated FlashInfer, and @danielhanchen shared a new GRPO notebook for Qwen3 Base, saying vLLM 0.8.5 is also supported now with Unsloth!
- Keras Updates: @fchollet discussed creating KerasHub pretrained components straight from the base classes.
- Model Context Protocol: MCP makes AI development less fragmented and standardizes connections between AI applications and external data sources, explained @AndrewYNg.
- @DeepLearningAI launched course 4 of the Data Analytics Professional Certificate, which includes Data I/O and Preprocessing with Python and SQL; throughout the course, you'll learn how to use generative AI to help debug and optimize your data pipeline.
- @skypilot_org reported on spinning up Qwen3 @Alibaba_Qwen + SGLang @lmsysorg on H100 in one command.
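As a rough illustration of the fast Whisper serving mentioned above, a hedged sketch using huggingface_hub's InferenceClient follows; point `model` at your own Inference Endpoint URL for a vLLM-backed deployment (the serverless model id shown is just a stand-in).

```python
# Hedged sketch: transcribing audio against a Whisper deployment.
# Swap the model id for your Inference Endpoint URL if you have one.
from huggingface_hub import InferenceClient

client = InferenceClient(model="openai/whisper-large-v3")  # or an endpoint URL
result = client.automatic_speech_recognition("meeting.wav")  # local audio file
print(result.text)
```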
AI and Research Concepts
- AlphaEvolve for Algorithm Discovery: @GoogleDeepMind introduced AlphaEvolve, a Gemini-powered coding agent for algorithm discovery, capable of designing faster matrix multiplication algorithms, finding new solutions to open math problems, and making data centers, chip design, and AI training more efficient. @GoogleDeepMind further noted that in 75% of cases, it rediscovered the best solution known so far.
- Auto-regression critiques: @francoisfleuret shared a hot take that auto-regression sucks and is impressive only as a parlor trick, saying any spark of intelligence from an LLM reflects that it has moved beyond auto-regression and built a factorized model with meaningful latents.
- Evaluation methods: @BorisMPower stressed that creating evaluations is the most effective way to improve model performance in any domain.
- Importance of Implementation: @hyhieu226 highlighted that deep learning is ~10% idea and ~90% implementation.
- LLMs and Grammar: @LiorOnAI explained why LLMs trained on 90% English still perform incredibly well in other languages: they learn shared grammatical concepts rather than just memorizing word-level patterns.
- @shaneguML shared Bruce Lee's famous LLM researcher quote: "I fear not the LLM who has practiced 10,000 questions once, but I fear the LLM who has practiced one question 10,000 times."
- Type-constrained Code Generation: @mathemagic1an shared "Type-constrained Code Generation with Large Language Models", saying it uses the LSP/type system to constrain valid output tokens during code generation and reduces compilation errors by >50% with 30B models. (A sketch of the masking trick follows.)
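A minimal sketch of the core idea, under the assumption that a type checker can be queried as a prefix oracle: at each decoding step, candidate tokens that cannot extend a well-typed program prefix are masked out. `is_valid_prefix` stands in for the paper's LSP/type-system machinery.

```python
# Sketch: mask next-token logits so only type-valid continuations survive.
import torch

def constrained_step(logits: torch.Tensor, prefix: str, tokenizer, is_valid_prefix) -> int:
    """Pick the best next token whose decoded text keeps the prefix well-typed."""
    mask = torch.full_like(logits, float("-inf"))
    # Only re-check the top candidates; a real system would cache checker state.
    for tid in torch.topk(logits, k=64).indices.tolist():
        if is_valid_prefix(prefix + tokenizer.decode([tid])):
            mask[tid] = 0.0  # keep this token eligible
    return int(torch.argmax(logits + mask).item())
```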
Industry, Business, and Economic Impacts
- AI in Enterprises: @AIatMeta shared their study on CATransformers, a carbon-driven neural architecture and system hardware co-design framework, which discovers greener CLIP models achieving an average 9.1% reduction potential in total lifecycle carbon emissions.
- AI skillsets: @NandoDF said that, if you're a good data engineer, or an engineer who loves looking at data and creating datasets for games, video, images, audio, text… "please send me a message."
- @ID_AA_Carmack believes that more of the world than many might imagine could run on outdated hardware if software optimization was truly a priority.
- Overcoming conscientiousness as an entrepreneur is a common theme: @scottastevenson thinks that conscientious people are constantly drawn to easy, structured dopamine rewards like cleaning their desk or running an errand.
- Importance of Demand: @rishdotblog summarized @ejames_c's post, noting that the inability to find authentic demand kills startups.
- Impacts on Employment: @cto_junior believes that most "software engineers" here are only code monkeys with no insight into how the overall system works, and that they will for sure get replaced without upskilling.
Humor and Miscellaneous
- @sama declared that brian is the most auteur founder of this generation, and it really shines through in how he does launches!
- @typedfemale said "I'm in my later 20s now (and female btw). And this will sound weird, but I really think God put me on this earth to bring warmth to the lives of mildly autistic men".
- @arankomatsuzaki said "I'm glad you're keeping track of my Ls".
- @victkyatk wrote "being the greatest ML researcher of all time must be really annoying".
- @francoisfleuret declared "I am worth whatever salary just for my enthusiasm."
AI Reddit Recap
/r/LocalLlama Recap
1. Benchmarking AMD Strix Halo and Qwen3 Models for Local LLM Inference
- AMD Strix Halo (Ryzen AI Max+ 395) GPU LLM Performance (Score: 104, Comments: 43): The post benchmarks the AMD Strix Halo (Ryzen AI Max+ 395) GPU, featuring 40 RDNA3.5 CUs and a peak of 59.4 FP16/BF16 TFLOPS, for LLM inference on Linux using llama.cpp and other frameworks. Raw compute efficiency with hipBLASLt reaches 36.9 TFLOPS (>60% of theoretical), but llama.cpp's HIP backend underperforms (e.g., 348.96 tokens/sec for Llama-2-7B Q4_0), drastically below expected efficiency versus Vulkan (881.71 t/s), Apple M-series, and both 780M and 7900 XTX GPUs. The Vulkan backend, with recent Flash Attention (FA) support, delivers the best prompt and token generation speeds (e.g., 884 t/s for Llama-2-7B Q4_0), while HIP+rocWMMA+FA excels for long contexts (almost no perf drop at 8K context). Testing also includes Qwen3-30B/109B and Llama 4 (up to 57.9 GiB models), showing that Vulkan delivers high tg128 for massive models and that ROCm and software-stack maturity (esp. PyTorch FA) remain bottlenecks. ROCm 6.4, AOTriton, and Composable Kernel are confirmed to build and work, but PyTorch Flash Attention still fails on this hardware. Useful reference: Strix Halo benchmarking results. Commenters highlight that for Llama-2-7B Q4_0, the GPU achieves 79% of theoretical memory bandwidth and 87% for Qwen3 32B Q8, higher efficiency than most conventional systems per synthetic benchmarks. Others request testing with higher-precision models (e.g., Qwen 32B Q8 at large context) and follow ongoing ROCm and PyTorch development threads (ROCm#4499, ROCm/TheRock#244).
- Ongoing efforts to improve PyTorch support for AMD GPUs are highlighted, with direct reference to active ROCm development discussions and issue tracking. Technical readers are pointed to ROCm/ROCm issue #4499 and ROCm/TheRock discourse #244, indicating a focus on library and compatibility optimizations for PyTorch users on AMD hardware.
- Benchmark results show that the Llama-2-7B-GGUF Q4_0 model achieves throughput at 79% of theoretical memory bandwidth, while Qwen3 32B Q8 reaches 87%, significantly higher than most conventional systems, where synthetic benchmarks often perform worse (see the back-of-envelope sketch after this list). Reference provided to memory bandwidth benchmarks and discussion.
- There is an interest in RPC latency tests for Strix Halo systems, comparing the potential value proposition of using these new devices as single RPC servers versus scaling multiple cheaper systems. The inquiry seeks technical details on RPC performance testing, particularly for LLM inference deployments, and whether such benchmarks were conducted using one or several units as hosts/clients.
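A back-of-envelope check on those efficiency figures, assuming Strix Halo's 256-bit LPDDR5X-8000 interface (~256 GB/s peak) and an approximate Q4_0 weight size; the measured speed below is a hypothetical value chosen to show how a ~79% figure falls out.

```python
# Token generation is memory-bound: each token reads roughly the full weights,
# so tokens/sec ceiling ≈ peak bandwidth / model size.
theoretical_bw_gbs = 256.0   # assumed Strix Halo peak memory bandwidth
model_size_gb = 3.6          # ~Llama-2-7B Q4_0 weight size (approximate)
measured_tps = 56.0          # hypothetical measured tg speed

ceiling_tps = theoretical_bw_gbs / model_size_gb
print(f"ceiling ≈ {ceiling_tps:.0f} t/s, efficiency ≈ {measured_tps / ceiling_tps:.0%}")
# → efficiency ≈ 79%, the kind of figure quoted in the comments
```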
- Qwen3-30B-A6B-16-Extreme is fantastic (Score: 120, Comments: 44): The Qwen3-30B-A6B-16-Extreme model (Hugging Face link) is a MoE (Mixture of Experts) LLM variant which increases the active experts from 8 to 16 (out of 128 total) compared to the original Qwen 30B-A3B specification. The model is not actually finetuned but instead has had its expert count changed via configuration, as clarified on the model card, which can impact inference depth and potentially accuracy. There is also GGUF quantization support (link) and a 128k context-length variant is available. Technical debate in the comments centers on the impact of increasing the number of experts without retraining: some users question whether simply activating more experts yields performance gains or requires retraining, and call for benchmarks to quantify the improvement. One commenter points out that contrary to the model card, escalating the number of experts does not constitute a proper "finetune" but is just a configuration change.
- Discussion centers on model architecture for Qwen3-30B-A6B-16-Extreme, specifically increasing active MoE (Mixture-of-Experts) experts from 8 to 16 out of 128 without retraining. Technical users confirm you can change expert count via configuration (not weights), e.g., `--override-kv qwen3moe.expert_used_count=int:24` in llama.cpp, or through LM Studio settings (a loading sketch follows these comments).
- The SHA256 checksum for the safetensors file remains unchanged, indicating only the config file is altered to use more experts per token and the model weights themselves are unmodified. This suggests increased expert count is simply a runtime configuration, not a genuine finetune, despite some model cards erroneously describing it as such.
- Questions remain about the performance impact of increasing experts. Commenters request benchmark comparisons between different expert counts and debate whether simply activating more experts yields better results, or if further training is necessary to maximize gains.
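For anyone who wants to reproduce the configuration-only change locally, here is a hedged sketch using llama-cpp-python's `kv_overrides`, which mirrors the `--override-kv` CLI flag quoted above; the GGUF filename is a placeholder, and whether more active experts actually helps is exactly what commenters want benchmarked.

```python
# Hedged sketch: raise the active-expert count at load time (config, not weights).
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-30B-A3B-Q4_K_M.gguf",           # placeholder path
    kv_overrides={"qwen3moe.expert_used_count": 16},   # default config uses 8
    n_ctx=8192,
)
out = llm("Q: What is 17*24? A:", max_tokens=16)
print(out["choices"][0]["text"])
```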
- Embrace the jank (2x5090) (Score: 101, Comments: 48): The OP upgraded a mining rig by adding a second NVIDIA RTX 5090 GPU to an existing 4x3090 setup, noting improved availability and reduced pricing of the 5090. They report compatibility challenges due to the physical length of the Gigabyte 5090, but observed that the ROPs are robust (indicating late-batch cards) and that cable/power thermals remain safe with power limits set (400W for 5090s, 250W for 3090s). Use-case includes simultaneous LoRA training on one 5090 and image generation on another via ComfyUI, with inference planned via vllm or sglang on the 3090s. Commenters highlight the prohibitive cost of high-VRAM cards like the 5090 in some regions, and suggest further technical analysis such as system noise measurement.
- A user discusses simultaneous workload capability by running a LoRA training session on one RTX 5090 and image generation in ComfyUI on another 5090, with TabbyAPI operating on 4x3090s. The workload is described as mild, and the user intends to test higher-demand scenarios with vllm or sglang inference later, pointing to interest in assessing real multi-GPU performance under more intensive AI serving tasks.
- Thermal management concerns are highlighted, specifically regarding the risk of connector melting when running high-end GPUs like the 5090. The user queries about the type and quality of thermal camera used for monitoring, suggesting attention to hardware safety and reliability in extreme or enterprise GPU setups.
2. MAESTRO Local-First AI Research App Release and Benchmarks
- Announcing MAESTRO: A Local-First AI Research App! (Plus some benchmarks) (Score: 149, Comments: 34): MAESTRO is a modular, local-first AI research app supporting document ingestion, hybrid-search RAG pipelines, and a multi-agent system (planning, research, reflection, writing), configurable for both local and cloud/API-based LLMs. Benchmarks, available in the repository's `VERIFIER_AND_MODEL_FINDINGS.md`, use a panel of LLM "verifiers" to assess and match local and remote models for agent roles, reporting per-task performance (e.g., note-taking, synthesis) and providing deployment recommendations. The system supports both Streamlit UI and CLI interaction, tracks resource/cost usage, and is actively evolving towards enhanced UIs and agentic frameworks. See code and details. Notable commenter questions include surprise at certain benchmark outcomes (e.g., "qwen3 8b performs better than 32b?"), requests for broader websearch API support (e.g., SearxNG, DuckDuckGo, Google), and critiques of example outputs as conventional, with suggestions to test agents on more novel research domains for true value assessment.
- The OP is questioned whether Qwen3 8B truly outperforms 32B models, reflecting skepticism and interest in the posted benchmarks. This highlights community attention to surprising performance results, especially since conventional wisdom would expect significantly larger models (32B) to outperform smaller ones (8B) unless there are dramatic efficiency or instruction-tuning improvements in the model architecture.
- A critical bug report references a PyTorch error relating to custom class instantiation. The user receives: `RuntimeError: Tried to instantiate class '__path__._path', but it does not exist! Ensure that it is registered via torch::class_`. This suggests a potential packaging or extension issue where a required TorchScript/PyTorch custom class was not registered or is missing.
- Wan-AI/Wan2.1-VACE-14B · Hugging Face (Apache-2.0) (Score: 118, Comments: 6): Wan2.1 (Hugging Face repo) is an open-source, state-of-the-art video foundation model suite covering text-to-video, image-to-video, video editing, text-to-image, and video-to-audio, with models ranging from 1.3B to 14B parameters. It demonstrates SOTA performance (outperforming both open and commercial peers), supports consumer-grade GPUs (5-second 480p render on RTX 4090 in 4 minutes with the 1.3B model), and features robust bilingual (Chinese/English) text generation, a high-quality, temporal-preserving video VAE (Wan-VAE), and tight integration with Diffusers and ComfyUI, supporting advanced distributed/multi-GPU inference via FSDP and Ulysses/Ring strategies. All code and weights are released under Apache 2.0, optimized for LoRA/finetunable workflows, with quantization and speed optimizations supplied. Commenters request a MoE (Mixture of Experts) 14B variant to significantly improve inference speed for practical deployment (potentially achieving 10× speedup with ~90% performance retention), and request clarification on naming and feature distinctions between Wan2.1's previous and current variants (ITV/TTV vs. VTV components). (A Diffusers sketch follows the comments below.)
- A user discusses the potential impact of a MoE (Mixture of Experts) 14B version, noting that even at 90% of the original modelâs performance, a speedup by 10x would drastically improve practical inference times, especially for consumer use-cases (e.g., reducing a 20-minute render to 2 minutes, or 10 minutes to 1 minute on an RTX 4090 with optimizations).
- Technical highlights from the model card are cited: the T2V-1.3B model operates on just 8.19GB VRAM, enabling compatibility with consumer GPUs, and can generate a 5-second 480p video in ~4 minutes on an RTX 4090 without quantization. Wan2.1 reportedly outperforms other open and commercial models across benchmarks, excels in multi-modal tasks, and is the first video model to robustly generate both Chinese and English text, with a highly efficient Wan-VAE for 1080p video handling.
- There is confusion about the naming conventions and versions in the Wan series, with users questioning the transition from ITV/TTV to potentially VTV, indicating the need for clearer documentation or changelog on model progressions and architecture changes.
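A hedged sketch of running the 1.3B text-to-video variant through Diffusers; the `WanPipeline` class, repo id, and default resolution reflect the Diffusers Wan integration as I understand it, so verify all of them against the model card before relying on this.

```python
# Hedged sketch: Wan2.1 1.3B text-to-video via its Diffusers integration.
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",  # assumed Diffusers-format repo id
    torch_dtype=torch.bfloat16,
).to("cuda")

frames = pipe(
    prompt="A cat surfing a wave at sunset, cinematic",
    num_frames=81,            # ~5 s of video at 16 fps
    height=480, width=832,    # 480p-class default for the 1.3B model
).frames[0]
export_to_video(frames, "cat.mp4", fps=16)
```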
3. BitNet R1 Ternary Model Finetune and Community Tools
- BitNet Finetunes of R1 Distills (Score: 274, Comments: 65): A novel method enables direct fine-tuning of existing FP16 Llama and Qwen checkpoints into ternary BitNet (weights limited to {-1, 0, 1}), by inserting an input-side RMSNorm before every linear layer. Models bitnet-r1-llama-8B (converged in ~300M tokens) and bitnet-r1-qwen-32B (~200M tokens) were trained using BF16 AdamW on 8ĂH100 GPUs with all linear weights quantized (including lm_head for this release). PyTorch/Transformers support is available via a PR (repo), allowing use and further fine-tuning with only a quant_config change; checkpoints are hosted on Hugging Face (bitnet-r1-llama-8b, bitnet-r1-qwen-32b). This approach reduces memory requirements and training costs, achieving competitive loss trends, with a roadmap including convergence, keeping the output head in full precision, and RMS patch upstreaming. Expert comments highlight that this method enables BitNet weights to be achieved with minimal additional trainingâcheaper than retraining from scratchâand express interest in whether performance surpasses 4-bit quantization, with requests for further benchmarks and broader hardware evaluations.
- The core innovation detailed is the addition of an input-side RMSNorm layer before each linear operation in existing FP16 Llama or Qwen models, allowing direct fine-tuning into the highly compressed 1-bit BitNet format. This method enables rapid adaptation (convergence in roughly 200-300M tokens) at a fraction of the original full training cost, with minimal runtime impact as the extra RMSNorm can be fused into the quantization process post-training.
- All linear weights, including the critical `lm_head`, were quantized in these experiments to stress-test stability, a choice expected to produce suboptimal perplexity versus approaches keeping `lm_head` in full precision. The authors note that future iterations will retain a full-precision `lm_head` and aim for better convergence and compatibility with original model weights, eventually supporting drop-in replacement for standard checkpoints.
- Training was performed using BF16 AdamW and DeepSpeed ZeRO-3 on 8x H100 GPUs. While BitNet weight packing reduces memory and can offer faster inference in memory-bound scenarios, some hardware may incur overhead due to de-quantization. The checkpoints are experimental and a slight perplexity gap is expected until further training continues; code modifications are available in a custom fork of Hugging Face Transformers for early adopters. (A minimal sketch of the quantization recipe follows.)
- I updated the SmolVLM llama.cpp webcam demo to run locally in-browser on WebGPU. (Score: 205, Comments: 14): The post announces an update to the SmolVLM/llama.cpp webcam demo: it now runs completely in-browser using WebGPU and Transformers.js, eliminating the need for local installations or a server backend. The demo leverages client-side WebGPU acceleration for real-time inference and is deployed on Hugging Face Spaces, with minimal implementation (single index.html file) available in the files section (demo link: https://huggingface.co/spaces/webml-community/smolvlm-realtime-webgpu). Technical discussion in comments is minimal; most feedback is anecdotal and focused on demo output rather than on implementation details or performance characteristics.
- A user inquired about the size of the 500M SmolVLM model, prompting discussion around its storage requirements for local/in-browser execution. While the exact figure isn't stated in the thread, a 500M-parameter model is roughly 1 GB in FP16 precision (and ~2 GB in FP32), which is crucial for those considering in-browser deployment on resource-constrained devices (quick arithmetic below).
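The quick arithmetic behind that estimate: weights ≈ parameter count × bytes per parameter (quantized formats carry some packing overhead, so the Q4 figure is rough).

```python
# Rough storage estimates for a 500M-parameter model at various precisions.
params = 500e6
for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1), ("q4 approx.", 0.6)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.2f} GB")
# fp16 → ~1.00 GB, matching the figure quoted above
```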
- US issues worldwide restriction on using Huawei AI chips (Score: 177, Comments: 166): The US has implemented a worldwide export restriction on the use of Huawei AI chips, expanding its extraterritorial controls beyond domestic borders to curb access to advanced semiconductor technology. The move is intended to prevent Huawei from supplying AI chips for AI and HPC applications that could compete with leading US-based suppliers such as Nvidia, citing national security and competitive advantage concerns. This action further extends previous restrictions on chip manufacturing equipment and semiconductor design tools beyond the US, impacting global supply chains and non-US entities (see Reuters coverage). Top comments note the implicit technological threat posed by Huawei to Nvidia, reflect skepticism about enforceability outside US jurisdiction, and interpret the restriction as a form of technological endorsement for Huaweiâs AI chips.
- Comments reference the potential technical competitiveness of Huaweiâs AI chips, with speculation that US restrictions indicate these chips could surpass Nvidiaâs offerings in terms of price and performance in certain global markets. This inference aligns with recent industry analysis suggesting that Huaweiâs new Ascend series chips are becoming viable alternatives for AI workloads, especially where Nvidiaâs high costs or supply constraints are prohibitive.
Other AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo
1. AlphaEvolve and DeepMind Breakthroughs in Coding and Science AI
- DeepMind introduces AlphaEvolve: a Gemini-powered coding agent for algorithm discovery (Score: 1103, Comments: 274): DeepMind announced AlphaEvolve, an automated coding agent leveraging Gemini LLM ensembles (Flash for exploration, Pro for depth) for novel algorithm discovery and optimization (DeepMind blog). AlphaEvolve iteratively generates and tests code solutions, yielding performance gains, e.g., a 23% speedup in Gemini's matrix multiplication kernel (yielding a 1% overall training time reduction), 0.7% compute recovery in data center scheduling, and hardware improvements for TPUs. On mathematical tasks, it rediscovered state-of-the-art solutions in 75% of 50+ open problems and outperformed the prior best in 20%, including improvements to the kissing number problem. The agent reduces kernel optimization from weeks to days via automated, unsupervised search. Technical debate centers on implications for LLM-based scientific discovery, with some viewing AlphaEvolve as a counterexample to claims that LLMs cannot autonomously discover new algorithms (contrasting with talks from experts like Yann LeCun). Commenters also anticipate that such advances signal near-term breakthroughs in unsupervised self-improvement for algorithmic discovery. (A toy sketch of the evolve-and-evaluate loop follows these comments.)
- AlphaEvolve, when tested on over 50 open problems in mathematical domains including analysis, geometry, combinatorics, and number theory (e.g., the kissing number problem), was able to rediscover the best-known solutions in 75% of cases and surpass previous solutions in 20% of cases, resulting in verifiable new discoveries. (source)
- The system used Gemini-powered approaches to optimize matrix multiplication kernels, which accelerated this key operation by 23% and led to a measurable 1% decrease in the training time for Gemini models. This efficiency gain translates into reduced computational expense, and AlphaEvolve's automation reduces kernel optimization cycles from weeks of manual tuning to days of automated runs, thus speeding research cycles.
- AlphaEvolve's optimization strategies are directly applied to core infrastructure, including Google's data centers and AI chip design, as well as the very model architectures (such as those powering AlphaEvolve) that it is intended to improve, making for a self-optimizing feedback loop within the AI development stack.
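As promised above, a toy sketch of the evolve-and-evaluate loop AlphaEvolve is described as running: an LLM proposes program mutations, an automated evaluator scores them, and the best candidates seed the next generation. `propose_mutation` stands in for a Gemini call and the evaluator is a stub, so treat every detail as illustrative.

```python
# Toy evolutionary program search: propose, evaluate, select, repeat.
import random

random.seed(0)

def evaluate(program: str) -> float:
    return -len(program)  # stub; real systems run benchmarks or checkers

def propose_mutation(parent: str) -> str:
    return parent + random.choice(["", " ", "#"])  # stand-in for an LLM edit

def evolve(seed: str, generations: int = 10, population: int = 8) -> str:
    pool = [seed]
    for _ in range(generations):
        children = [propose_mutation(random.choice(pool)) for _ in range(population)]
        pool = sorted(pool + children, key=evaluate, reverse=True)[:population]
    return pool[0]

print(evolve("def f(x): return x"))
```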
- Meet AlphaEvolve, the Google AI that writes its own code and just saved millions in computing costs (Score: 455, Comments: 52): Google AI has introduced AlphaEvolve, an AI system ostensibly capable of autonomously generating novel computer algorithms, with claims of significant cost savings (millions) in computing expenses. The announcement further asserts AlphaEvolve is able to make new discoveries within computing and mathematics, echoing the ambition of prior breakthroughs like AlphaFold and AlphaGo. Technical readers flag the need for concrete evidence supporting these claims, such as reproducible benchmarks, accessible datasets, or details about algorithmic innovation. Commenters express skepticism about the magnitude of the claims, specifically demanding evidence for AlphaEvolve's ability to invent entirely new algorithms. Others note historical precedent: AlphaFold and AlphaGo also achieved unconventional results by extensive search and self-play, but substantive empirical results validated their significance.
- AlphaEvolve reportedly combines Gemini Flash and Gemini Pro models as its core framework, allowing modular upgrades as new SOTA models become available. This design enables adaptability and suggests sustained improvements in efficiency and capabilities as underlying models advance.
- The impact of AlphaEvolve on complex Google infrastructure (chip design, networking, DC deployment, cloud compute) is highlighted: AI-driven optimization here could lead to significant efficiency gains across the companyâs extensive tech stack, potentially outpacing competitors who lack similar vertical integration.
- Commenters debate the legitimacy of Googleâs claims, noting that Google/DeepMind historically shares less-inflated performance metrics compared to some competitors. There is demand for concrete benchmarks or evidence, especially regarding claims about algorithmic invention and mathematical discoveries.
- DeepMind unveils "spectacular" general-purpose science AI (Score: 246, Comments: 29): DeepMind's newly announced AlphaEvolve system integrates large language models (LLMs) with automated algorithmic evaluators to autonomously evolve novel, high-performance algorithms across scientific domains. In benchmarking, AlphaEvolve discovered matrix multiplication routines surpassing the long-standing Strassen's algorithm, as well as improved designs for tensor processing units (TPUs) and cloud resource allocation. The architecture uniquely combines LLM-driven proposal generation with evolutionary algorithm selection, enabling domain-general problem-solving capabilities (for details, see the Nature announcement). Commenters highlight AlphaEvolve's ability to outperform classic algorithms (e.g., Strassen's for matrix multiplication) and speculate that such advances demonstrate DeepMind's leadership toward artificial general intelligence (AGI), linking this progress to DeepMind's recent "after AGI" specialist hires.
- DeepMind's new AI, AlphaEvolve, develops matrix multiplication algorithms that can surpass the speed of Strassen's algorithm, a method that has remained the fastest-known since 1969. This demonstrates the AI's ability to discover novel optimization strategies in computational mathematics, a key benchmarking domain in computer science.
- A critical observation is that AlphaEvolve's capabilities are inherently tied to domains with scalable and cheap validation, such as program analysis where computational correctness and speed are straightforward to measure. For fields like astronomy, particle physics, medicine, or business, where validation is costly or limited, the impact of these AI-driven discoveries remains minimal; the limiting factor shifts from idea generation to experimental validation.
- While AI-driven improvements in computational tasks like matrix multiplication suggest a compounding effect (flywheel effect) across technically tractable domains, the broader implication is a proof-of-concept for AI systems that could theoretically drive rapid and recursive innovation if applied across interconnected foundational technologies. This is relevant to discussions about the feasibility of a technological singularity driven by general-purpose science AIs.
2. Anthropic Claude Sonnet/Opus Model Release Anticipation and OpenAI Model Rollout
- Looks like we can expect an Anthropic release in the coming weeks (Score: 218, Comments: 61): The image depicts a formal presentation, likely by an Anthropic representative, highlighting the upcoming release of the Claude Sonnet and Claude Opus models. These models are noted for their distinctive reasoning abilities, suggesting significant advancements in Anthropic's AI architecture. Both the post and comments indicate that credible sources (specifically The Information) have reported on imminent releases, putting Anthropic in direct competition with OpenAI and Google for major model announcements in the coming weeks. One commenter emphasizes the historical accuracy of The Information's reporting on model releases, lending credibility to the news. Another commenter jokes about server capacity issues with previous Anthropic models, hinting at scalability concerns that the community hopes will be addressed in the new releases.
- Discussion references The Informationâs credibility in accurately predicting major AI model release timelines, highlighting their prior success in preempting industry news about upcoming models from companies like Anthropic and OpenAI.
- There is anticipation around OpenAI attempting to upstage Google in the near term, referencing previous years where major announcements were closely timed, underscoring the ongoing competitive push among top companies (OpenAI, Google, Anthropic) for model releases and attention.
- A user expresses particular interest in an improved version with greater server capacity, alluding to past server-side bottlenecks with releases like Sonnet 3.7, which have affected model accessibility and user experience.
- Damn ok now this will be interesting (Score: 193, Comments: 41): The image is a tweet highlighting new models from Anthropic (Claude Sonnet and Claude Opus) that can dynamically switch modes between reasoning, tool/database usage, and self-correction. They reportedly possess enhanced code generation capabilities that allow them to test and fix their own outputs. The announcement signals coming releases expected within weeks. A main technical discussion in the comments is concern over prompt length potentially harming model performance, given that more complex mode switching could require much larger system prompts. One user shares anecdotal evidence of dynamic code editing and rapid artifact previewing with what may have been early access, calling it surprisingly powerful.
- A commenter raises concerns about system prompt length and token usage, noting that the introduction of new Anthropic models might lead to significantly larger system prompts ("8000 more tokens"), which could impact model performance or context retention. There's a hope expressed that these models maintain their capabilities even with increased prompt size.
- Another user details their experience with a new code-assist feature, observing artifact previews and graphical glitches occurring during iterative UI changes before the final result is committed. The granular update and commit cycle, including artifact previews, is described as feeling powerful, suggesting a technically advanced or novel implementation that improves user feedback during development.
- There are technical remarks on token consumption, with one user emphasizing that new features are likely to increase token usage significantly, potentially impacting operating costs and efficiency ("token costs gone go wild," "cline is already eating tokens").
- 4.1 model now appearing in web browser under "More models" (Score: 109, Comments: 64): The image documents the rollout of new model variants in the OpenAI ChatGPT web UI, specifically under the "More models" menu. It confirms the presence of "GPT-4.1" and its mini variant ("GPT-4.1-mini"), alongside other models like "GPT-4o" and "o4-mini"; notably, "GPT-4.1" is explicitly labeled as optimal for "quick coding and analysis." The image provides evidence of active deployment and new model differentiation for users, indicating backend updates and an evolving model lineup in OpenAI's product. Commenters note that "4.1-mini" appears to be replacing "4o-mini," and one user highlights that "4.1-mini" is already accessible and reportedly performs well for coding tasks.
- Several users note that "4.1-mini" appears to be replacing the "4o-mini" model in the web interface, suggesting an update or shift in available lightweight models. This impacts users seeking the fastest, most cost-effective options for everyday or embedded use cases.
- Specific feedback highlights that the 4.1 model excels at coding tasks: one user reports successful integration with the Roo platform, indicating immediate developer interest and swift experimentation with new model capabilities.
- Device and app differences are mentioned: Android users may need to update their app to access both 4.1 and 4.1-mini, while some web users only see 4.1 mini so far, suggesting phased rollout or platform-dependent availability.
3. ChatGPT as New Internet Interface and Its Societal Impact
- Last year ChatGPT was the 15th most visited site. Now it's #5, while every other top-10 site is losing traffic (Wikipedia fell 6%). People aren't surfing the web anymore; they're heading straight to ChatGPT. It's not just a tool; it's become the new internet interface, quietly replacing the old web. (Score: 278, Comments: 117): The provided image displays a table ranking the most visited websites, highlighting ChatGPT.com climbing to the #5 position globally in traffic and showing a 13.04% month-over-month increase, in contrast to declines for traditional sites like Wikipedia (-6%). The data indicate a marked behavioral shift where users increasingly bypass conventional search engines or content aggregators, using conversational AI as their primary interface to online information. This underscores ChatGPT's rapid emergence not just as a tool but as a gateway supplanting traditional web navigation. A technically relevant debate in the comments notes the growing user preference for AI assistants over traditional search, motivated by poor web experiences (e.g., intrusive ads, SEO manipulation) and friction in content discovery. Concerns are also raised about the future monetization of ChatGPT (e.g., advertising).
- Several commenters point to the technical decline of traditional search engines, such as Google, attributing the shift to ChatGPT to increased web clutter from aggressive SEO tactics and ad overlays that degrade the actual information retrieval experience. This trend has resulted in users preferring ChatGPT for direct answers, as it bypasses pop-ups and extended irrelevant narratives that plague standard recipes or information sites.
- Discussion highlights the risk of future monetization changes impacting ChatGPT, such as introducing more ads or reduced usability as traffic increases. This could mirror the historical degradation seen in other web platforms once they prioritized advertising revenue streams over user experience.
- Something Strange Is Happening To The Internet (Score: 1945, Comments: 417): The post discusses a significant shift in global web traffic, as shown by a table (see image) ranking top websites by traffic. Google.com leads but is declining (3.18% MoM), while ChatGPT has surged to #5 with +13.04% MoM growth, outpacing Reddit, Amazon, and Whatsapp, and is now the only domain in the top 10 with positive growth. This suggests users increasingly rely on ChatGPT as a primary "interface", potentially bypassing traditional search engines, blogs, or forums, signaling not just growth but a possible paradigm shift in how people access information online. Commenters are skeptical about the novelty, likening the trend to previous surges by Facebook and Google, and joking that AI likely authored the post, hinting at the growing omnipresence and debate over AI's role in reshaping internet consumption, rather than seeing it as truly unprecedented.
- Multiple commenters express concerns about the proliferation of content generated by large language models (LLMs) like ChatGPT, observing that distinctive patterns in posts and comments (e.g., formulaic phrasings such as "it isn't X, it's Y") are strong indicators of AI-generated text and contribute to a perceived decline in authenticity and quality online.
- One participant argues that if LLMs and generative AI contribute to reducing internet traffic and the prevalence of click-driven content economies, it could potentially improve the internet's utility, moving from outrage and clickbait to a model that favors functional, purposeful interactions reminiscent of earlier internet eras.
- The discussion draws parallels between current shifts (including the rise of AI content) and previous changes in major platforms like Facebook, Google, and Twitter, suggesting a pattern where technological or algorithmic shifts fundamentally reshape traffic, engagement, and how online communities form and persist.
- Discussion references The Informationâs credibility in accurately predicting major AI model release timelines, highlighting their prior success in preempting industry news about upcoming models from companies like Anthropic and OpenAI.
- There is anticipation around OpenAI attempting to upstage Google in the near term, referencing previous years where major announcements were closely timed, underscoring the ongoing competitive push among top companies (OpenAI, Google, Anthropic) for model releases and attention.
- A user expresses particular interest in an improved version with greater server capacity, alluding to past server-side bottlenecks with releases like Sonnet 3.7, which have affected model accessibility and user experience.
- DeepMind introduces AlphaEvolve: a Gemini-powered coding agent for algorithm discovery (Score: 1103, Comments: 274): DeepMind announced AlphaEvolve, an automated coding agent leveraging Gemini LLM ensembles (Flash for exploration, Pro for depth) for novel algorithm discovery and optimization (DeepMind blog). AlphaEvolve iteratively generates and tests code solutions, yielding performance gainsâe.g., a
-
Damn ok now this will be interesting (Score: 193, Comments: 41): The image is a tweet highlighting new models from AnthropicâClaude Sonnet and Claude Opusâthat can dynamically switch modes between reasoning, tool/database usage, and self-correction. They reportedly possess enhanced code generation capabilities that allow them to test and fix their own outputs. The announcement signals coming releases expected within weeks. A main technical discussion in the comments is concerns over prompt length potentially harming model performance, given more complex mode switching could require much larger system prompts. One user shares anecdotal evidence of dynamic code editing and rapid artifact previewing with what may have been early access, calling it surprisingly powerful.
- A commenter raises concerns about system prompt length and token usage, noting that the introduction of new Anthropic models might lead to significantly larger system prompts (â8000 more tokensâ), which could impact model performance or context retention. Thereâs a hope expressed that these models maintain their capabilities even with increased prompt size.
- Another user details their experience with a new code-assist feature, observing artifact previews and graphical glitches occurring during iterative UI changes before the final result is committed. The granular update and commit cycle, including artifact previews, is described as feeling powerful, suggesting a technically advanced or novel implementation that improves user feedback during development.
- There are technical remarks on token consumption, with one user emphasizing that new features are likely to increase the token usage significantly, potentially impacting operating costs and efficiency (âtoken costs gone go wild,â âcline is already eating tokensâ).
-
4.1 model now appearing in web browser under âMore modelsâ (Score: 109, Comments: 64): The image documents the rollout of new model variants in the OpenAI ChatGPT web UI, specifically under the âMore modelsâ menu. It confirms the presence of âGPT-4.1â and its mini variant (âGPT-4.1-miniâ), alongside other models like âGPT-4oâ and âo4-miniâ; notably, âGPT-4.1â is explicitly labeled as optimal for âquick coding and analysis.â The image provides evidence of active deployment and new model differentiation for users, indicating backend updates and evolving model lineup in OpenAIâs product. Commenters note that â4.1-miniâ appears to be replacing â4o-mini,â and one user highlights that the â4.1-miniâ is already accessible and reportedly performs well for coding tasks.
- Several users note that â4.1-miniâ appears to be replacing the â4o-miniâ model in the web interface, suggesting an update or shift in available lightweight models. This impacts users seeking the fastest, most cost-effective options for everyday or embedded use cases.
- Specific feedback highlights that the 4.1 model excels at coding tasksâone user reports successful integration with the Roo platform, indicating immediate developer interest and swift experimentation with new model capabilities.
- Device and app differences are mentioned: Android users may need to update their app to access both 4.1 and 4.1-mini, while some web users only see 4.1 mini so far, suggesting phased rollout or platform-dependent availability.
3. ChatGPT as New Internet Interface and Its Societal Impact
- Last year ChatGPT was the 15th most visited site. Now itâs #5, while every other top-10 site is losing traffic (Wikipedia fell 6%). People arenât surfing the web anymoreâtheyâre heading straight to ChatGPT. Itâs not just a tool; itâs become the new internet interface, quietly replacing the old web. (Score: 278, Comments: 117): The provided image displays a table ranking the most visited websites, highlighting ChatGPT.com climbing to the #5 position globally in traffic and showing a 13.04% month-over-month increase, in contrast to declines for traditional sites like Wikipedia (-6%). The data indicate a marked behavioral shift where users increasingly bypass conventional search engines or content aggregators, using conversational AI as their primary interface to online information. This underscores ChatGPTâs rapid emergence not just as a tool but as a gateway supplanting traditional web navigation. A technically relevant debate in the comments notes the growing user preference for AI assistants over traditional search, motivated by poor web experiences (e.g., intrusive ads, SEO manipulation) and friction in content discovery. Concerns are also raised about the future monetization of ChatGPT (e.g., advertising).
- Several commenters point to the technical decline of traditional search engines, such as Google, attributing the shift to ChatGPT to increased web clutter from aggressive SEO tactics and ad overlays that degrade the actual information retrieval experience. This trend has resulted in users preferring ChatGPT for direct answers, as it bypasses pop-ups and extended irrelevant narratives that plague standard recipes or information sites.
- Discussion highlights the risk of future monetization changes impacting ChatGPT, such as introducing more ads or reduced usability as traffic increases. This could mirror the historical degradation seen in other web platforms once they prioritized advertising revenue streams over user experience.
- Something Strange Is Happening To The Internet (Score: 1945, Comments: 417): The post discusses a significant shift in global web traffic, as shown by a table (see image) ranking top websites by traffic. Google.com leads but is declining (
3.18%
MoM), while ChatGPT has surged to #5 with a+13.04%
MoM growth, outpacing Reddit, Amazon, and Whatsapp, and is now the only domain in the top 10 with positive growth. This suggests users increasingly rely on ChatGPT as a primary âinterfaceâ, potentially bypassing traditional search engines, blogs, or forums, signaling not just growth but a possible paradigm shift in how people access information online. Commenters are skeptical about the novelty, likening the trend to previous surges by Facebook and Google, and joking that AI likely authored the post, hinting at the growing omnipresence and debate over AIâs role in reshaping internet consumption, rather than seeing it as truly unprecedented.- Multiple commenters express concerns about the proliferation of content generated by large language models (LLMs) like ChatGPT, observing that distinctive patterns in posts and comments (e.g., formulaic phrasings such as âit isnât X, itâs Yâ) are strong indicators of AI-generated text and contribute to a perceived decline in authenticity and quality online.
- One participant argues that if LLMs and generative AI contribute to reducing internet traffic and the prevalence of click-driven content economies, it could potentially improve the internet's utility, moving from outrage and clickbait to a model that favors functional, purposeful interactions reminiscent of earlier internet eras.
- The discussion draws parallels between current shifts (including the rise of AI content) and previous changes in major platforms like Facebook, Google, and Twitter, suggesting a pattern where technological or algorithmic shifts fundamentally reshape traffic, engagement, and how online communities form and persist.
AI Discord Recap
A summary of Summaries of Summaries by gpt-4.1-2025-04-14
1. Model Benchmark Showdowns and Coding Performance
- Sonar Models Sweep Benchmarks, GPT-4.1 Steals Coding Crown: Sonar Pro Low crushed Claude 3.5 Sonnet on BrowseComp with 4.0% accuracy (nearly 50% higher) and boasted up to 3x faster, more consistent latency, while Qwen 3 8B and GPT-4.1 received widespread praise for coding and reasoning tasks (Perplexity AI, Unsloth AI).
- Community consensus across multiple Discords is that GPT-4.1 is the new coding king, with users trashing O3 for code but lauding it for planning/research, and Qwen 3 models outperforming Gemma 3 at similar sizes, especially after fine-tuning.
- Gemini 2.5 Pro and O4 Mini High Ignite Coding Rivalry: Gemini 2.5 Pro wowed users with C++ coding prowess, described as a dream come true, while O4 Mini High was called a coding beast for its fast, high-quality completions across large codebases (LMArena, OpenAI).
- Despite some hallucination complaints, users consider Gemini 2.5 Pro and GPT-4.1 as top-tier for coding, with Claude 4 + O3 Pro anticipated as an insane combo once released.
2. Distributed and Decentralized Training/Inference
- Psyche Network Powers Decentralized LLM Training: Nous Research launched the Psyche Network, a decentralized training platform coordinating global GPUs via custom peer-to-peer networking and DisTrO optimizers, aiming to pretrain a 40B parameter LLM on a dataset mixing FineWeb (14T), FineWeb-2 (4T), and The Stack v2 (1T).
- The testnet quickly filled 500k slots in 40 minutes, and users can contribute USDC for compute, with open forums driving model design and a GitHub repo available for community contributions.
- Lost in Conversation: LLMs Tank on Multi-Turn Tasks: The Lost in Conversation paper found LLMs suffer a 39% performance drop in multi-turn conversations versus single-turn, with unreliability stemming from premature solution attempts and poor error recovery (GitHub repo).
- This exposes a major weakness for distributed agentic systems and highlights the need for improved error correction and conversational memory mechanisms.
3. Hardware and Performance Optimizations
- PCIE 5.0, CUDA Strides, and PyTorch Nightly Perks: Upgrading to PCIE 5.0 boosted token generation speed from 26 tkps to 38 tkps on a 50-series GPU, while PyTorch devs recommend `needs_exact_strides` over `needs_fixed_stride_order` for more reliable tensor ops in nightly builds (LM Studio, GPU MODE).
- CUTLASS 4.0, CuTe DSL, and Kernel Kung Fu: CUTLASS 4.0 and CuTe DSL are out (`pip install nvidia-cutlass-dsl`), with Jupyter notebook examples and Python 3.12 support, though the release versioning appears "borked" and the MLIR compiler is not yet open source (CUTLASS Notebooks).
- Custom CuTe kernels outperformed PyTorch by 60x on large problems, new kernel debugging tips (Nsight Compute, nsys, ncu) were shared, and reference kernel PRs improved leaderboard runtimes (PR #31).
4. Prompt Engineering, Tokenization, and Memory Mishaps
- Tokenization Woes: Gemma, BOS Tokens, and PromptTemplates: GemmaTokenizer in Torchtune was caught mismatching output tokens with HFModelTokenizer due to missing PromptTemplates and multiple BOS tokens from config errors, tracing the blame back to HF/Google's tokenizer config (Torchtune).
- Discussions stressed the need for template and config alignment, and that even technically "correct" implementations can be functionally flawed for real-world LLM usage.
- LlamaIndex Memory Gets a DB Makeover: LlamaIndex launched a Memory component for agentic workflows, supporting in-memory and scalable DB backends (SQLite, PostgreSQL), with debate over context serialization vs. DB for long chat histories.
- For large-scale or structured history, a DB is recommended over plain serialization, with users comparing the tradeoffs for persistent context in LLM-powered agents.
Discord: High level Discord summaries
Perplexity AI Discord
- Sonar Models Demolish Benchmarks: Sonar Pro Low outperformed Claude 3.5 Sonnet on BrowseComp, with Sonar Low also outperforming Claude 3.7 Sonnet, according to recent benchmark evaluations.
- Sonar Pro Low achieved 4.0% accuracy on BrowseComp, almost 50% higher than Claude 3.5 sonnet, and both Sonar and Sonar Pro delivered up to 3x faster response times with more consistent latency.
- Deep Research Launch Timing Speculated: Members speculated whether Perplexity AI's new deep research feature will be released before or after the Comet project, one stating that deep research should be first.
- Another member hoped that Comet would be more than just a browser with a copilot style side bar, referencing a thinking batman gif.
- Merlin AI's Pricing Displeases Users: Users discussed Merlin AI's pricing, with one noting its shady pricing due to unclear rate limits and another sharing that support gave a shitty response when asked, while praising its web search quality.
- It was also noted that for standard paid accounts on Merlin AI, any usage that surpasses $16 per day will also lead to the immediate termination of service for that day.
- Perplexity Pro Unlocks API Access: Perplexity Pro actually includes $5/month in API credits!
- You can find the API docs and get started at Perplexity AI docs; registering via a credit card is required, but users will not be charged if they keep their usage at or below $5.
- Perplexity Projects Sparks Vibing Concerns: Speculation arose around Perplexity's new project feature, with users sharing screenshots and videos, like this one, and some expressing concerns about its potential misuse for vibe coding.
- There were also questions about how certain users gained early access to the feature, sparking discussion on testing catalogs and potential connections within the AI community.
LM Studio Discord
- Embedding Modules Hit File Size Limit: Users are encountering a "Maximum file size per conversation reached" error with embedding modules in LM Studio, capped at 31.46 MB when using the default nomic-ai/nomic-embed-text-v1.5-GGUF model.
- A user seeks to process files of a few hundred MB for audio-to-lyrics generation, suggesting a need for larger embedding module sizes.
- LM Studio's Log Spam is Termed Benign: A reported log spam issue in LM Studio is considered benign and has been addressed in the 0.3.16 build 2.
- This fix should alleviate concerns about potential issues when running embedding models.
- LM Studio JIT Loading Gets Glitchy: LM Studio's JIT model loading via the Continue VS Code plugin sometimes serves clients with the wrong model if another is already loaded.
- Users were advised to check model identifiers using `lms ls` and remove 8bit from the configuration to resolve mismatches.
- Devs Advised Against Building LLMs From Scratch: Members debated building LLMs from scratch, with most advising against it due to massive compute costs and data needs, recommending fine-tuning instead, and sharing the "Build a Large Language Model from Scratch" Manning book for theoretical knowledge.
- A member expressed interest in building a model for various use cases.
- PCIE 5.0 Gives Performance Jump: A user reports a performance increase after upgrading to a 50 series card, going from 26 tkps to 38 tkps using qwen3-14b-q4km with a 4096 context, attributing it to PCIE 5.0 benefits.
- The user highlights the advantage of installing the card in the bottom slot without interference due to the shorter PCIE connector design.
LMArena Discord
- Gemma "Cutiepie" Incoming?: Members speculated about a new Gemma model, tentatively named Cutiepie, possibly timed for Google I/O.
- Some members are trying to determine whether Gemma is free without requiring a login.
- DeepSeek R2 still under wraps: Speculation around DeepSeek R2 has cooled, but it remains in development using Huawei GPUs and Chinese funding.
- There's internal pressure to release something impressive, especially given the current models' strength in reasoning traces and multilingual chain of thought.
- Gemini 2.5 Pro Impresses with C++: Members laud Gemini 2.5 Pro's coding prowess, especially with C++, with some describing it as a dream come true.
- Despite the enthusiasm, some users report models are suffering from hallucinations, while others consider it superior to o3.
- O3 Pro stuck in development?: The release of o3 Pro is delayed, and some suggest OpenAI is strategically waiting for other labs to reveal their cards to take the lead.
- It is internally believed that OpenAI already has o4 but is holding it back for strategic reasons, and there is an expectation of an insane combo with Claude 4 + o3 pro.
- GPT 4.1 hailed as coding genius: Members highlighted GPT 4.1's coding capabilities, noting its quick compilation and error-fixing on large codebases, alongside instant response times.
- One member found GPT 4.1 a significant upgrade from GPT 4o mini for free users, but felt that 4.1 is solely for coding.
Unsloth AI (Daniel Han) Discord
- File Usage Fixes are Frustrating: Users wrestled with an `AttributeError` when using file:// for image loading, initially suspecting an issue with how the image variable was being set to None.
- Despite path corrections, the error persisted, prompting consideration of URLs as an alternative, with one user exclaiming image finetuning gives me a headache.
- Qwen3 Inference Speed Slowdown: A user reported slow inference speeds with Qwen/Qwen3-0.6B on an A100, achieving only 30 tokens/second with a batch size of 1, using Unsloth's `FastLanguageModel.from_pretrained()`.
- Others suggested that the small model size and batch size might negate the benefits of `load_in_8bit=True`, while another user noted the base model does not contain a chat template in the `tokenizer_config.json`.
- O3 Outperforms, Except on Coding: Members trashed O3 for outputting garbage code, suggesting it's better for planning and research with tool calls while GPT 4.1 is the best coding model, available in Github Copilot for those with the edu account.
- It was suggested that O4 mini high is better for coding, with one member stating O3 is fucking garbage for coding.
- Qwen3: Benchmarks Boss Gemma: It was observed that the Qwen 3 models outperform Gemma 3 at similar sizes, with the Qwen 3 8b model achieving near SOTA with fine tuning.
- It's considered a perfect sweet spot for local llms; one member declared that the Qwen 3 8b model is CRAZY good.
- Repo2txt vs RAG: A member advocated for repo2txt.com as a superior alternative to RAG for code injection, selecting files from a repo and injecting it directly into the prompt.
- They argued that models can't read all code automatically in github and make mistakes.
OpenAI Discord
- GPT-4.1 Debuts and Impresses in Coding: GPT-4.1, a model specialized for coding and instruction following, is now available in ChatGPT for Plus, Pro, and Team users, and will be available for Enterprise & Edu users in the coming weeks.
- GPT-4.1 mini is replacing GPT-4o mini in ChatGPT for all users, with safety evaluation details available in the Safety Evaluations Hub.
- O3 Unlimited Tempts Users: Users rave about the unlimited O3 model, calling it a legendary model for solving issues, with new support for deep research via GitHub Repos, OneDrive, and SharePoint, and 20 file uploads per Chat/Project.
- Despite the Teams plan offering desirable internal knowledge features, members suggest sticking with the Pro plan because of the benefits of unlimited O3.
- PII Data Guardrails Provoke Probes: Members reported challenges with PII guardrails blocking home address requests from HR data-connected apps, especially those with access to HR data.
- They suggested that users should contact OpenAI support for guidance on handling sensitive data requests and adhering to PII policies.
- AI Universe Simulator Takes Shape: A member is building a 1:1 simulation of the universe using AI to explore thinking and create automated ecosystems, aiming to scale it into a web browser.
- The focus is on open communication lines between models, prioritizing efficiency over stacking models.
- Ollama Gains Traction Over Windsurf: Users discussed local model inference with Ollama as an alternative to services like Windsurf, cautioning against paying for such services for AI app development.
- The recommendation is to focus on learning prompting to avoid API costs and use Ollama with VS Code extensions like Continue and Roo Code, while also pursuing LLM and Agentic courses on Hugging Face Learn.
Cursor Community Discord
- GPU Power Peer-to-Peer Considered: Members pondered open-source options for utilizing GPU power in peer-to-peer or decentralized systems.
- The discussion yielded no definitive solutions.
- Cursor's 20% Markup Debated: Users debated the justification of Cursor's 20% surcharge over actual API pricing.
- Some considered it a good deal whereas another claimed to save $600 per month by using Claude Max outside of Cursor.
- Gemini 2.5 Pro Joins the Model Lineup: The community noticed the addition of Gemini 2.5 Pro to Cursorâs available models on May 6th.
- Some noted that the new model finally fixed the "i will code now (stops coding)" issue.
- Mono-Repo Methodology Maneuvers: Users discussed managing multi-repo projects, with one suggesting consolidating into a single monorepo to avoid context fragmentation.
- Another user mentioned using separate workspaces within the same parent folder in Cursor 0.50.x.
- Background Agent Begs for Confirmation: Users voiced frustration that the background agent requires excessive confirmation, increasing fast request usage.
- One user complained the background agent just keeps asking for confirmation to do things and its just eating fast request up faster.
Yannick Kilcher Discord
- AI Regulation Faces Decade-Long Freeze: House Republicans inserted language into the Budget Reconciliation bill that would block all state and local governments from regulating AI for 10 years, potentially impacting privacy regulations (source).
- The provision, introduced by Representative Brett Guthrie of Kentucky, broadly prohibits state or local enforcement of laws regulating AI models or systems.
- AlphaEvolve Rediscovered State-of-the-Art: DeepMind's AlphaEvolve, a Gemini-powered coding agent, combines LLM creativity with automated evaluators to evolve algorithms for math and practical applications (DeepMind Blog).
- The system rediscovered state-of-the-art solutions in roughly 75% of cases and improved the previously best known solutions in 20% of cases, even advancing the kissing number problem.
- RL-Diffusion Sparks Patentability Skepticism: Members discussed developing RL-Diffusion, combining forward and backward processes into one controlled by RL, but some expressed skepticism about its novelty and practical implementation.
- They emphasized that transformation into a patent-eligible application requires more than simply stating the abstract idea while adding the words "apply it."
- GSM8K Benchmark Nears Perfection: Language models are approaching near-perfect accuracy on grade-school level math benchmarks like GSM8K, with detailed analysis available in this paper and this blogpost.
- The paper seeks to determine whether language models truly develop reasoning skills or are simply memorizing templates.
- Debate Brews on LLM Planning Abilities: Members debated whether LLMs formulate plans before generating solutions, based on claims that models avoid unnecessary computations.
- One member countered that the model learns that unnecessary things are random noise which by definition have no signal for the model to learn, so it ignores them.
OpenRouter (Alex Atallah) Discord
- Personality Platform Invites Chatbot Customization: A new chatbot platform called Personality has launched, aiming to provide more customization and less filtering than existing solutions like c.ai, offering users the ability to create and roleplay with multiple characters at personality.gg.
- The platform also provides free image generation at personality.gg/playground, though it's important to note that this feature is not powered by OpenRouter.
- OpenAI Reasoning Model Names Spark Confusion: Users are requesting naming consistency in OpenAI's reasoning models (e.g., `/openai/o4-mini-high`) to include reasoning level variants for all models, as documented in the OpenAI documentation.
- The primary goal is to ease evaluation across different reasoning models and reduce confusion about which models can do what.
- Free Google Models throttled: Users are reporting extremely low or nonexistent rate limits with free Google models, even with available credits, prompting recommendations for alternatives like DeepSeek V3.
- Concerns have also surfaced about the potential removal of free routes for Gemini following a change shared on Twitter.
- Claude's System Prompt Causes OpenRouter Differences: Discrepancies in helpfulness between using Claude via OpenRouter versus the native Anthropic website are attributed to the extensive system prompts used by Anthropic, comprising around 16000 tokens.
- Users can manually implement the system prompt, available on GitHub, which includes tools.
- "Always Use This Key" option defaults to specified key: A new option labeled "Always use this key" was introduced, causing confusion because of a similar but different option labeled "Use this key as a fallback".
- The new option exclusively uses the specified key and prevents fallback to OpenRouter, which represents a change from the behavior of the older fallback setting.
Manus.im Discord Discord
- Manus Credits Glitch and Time Zone Tick Tock: Several users reported their daily 300 credits on Manus not refreshing at 00:00 GMT, likely due to time zone processing inconsistencies.
- One user noted their credits refresh at 8:00pm in their timezone, highlighting the timing discrepancies.
- Invitation Code Frenzy and origins: A user boasted a glitched account with 100 invitation codes and shared multiple invitation links, triggering discussion about their origin.
- Speculation arose that the codes came from paid subscriptions, while others questioned their value to existing members since new users get 500 credits.
- Refund Woes for Failed Jobs: A user expressed frustration over failing to get a refund after a job failure consumed 800 credits on Manus.
- Other users stated that refunds are generally not provided, even when the service malfunctions, with one suggestion to dispute the charge.
- Facebook Marketplace Scams get Rapped: A user requested Manus to generate a rap about Facebook Marketplace lowballing, employing slang such as best price, last price, and mates rates.
- The user clarified that the request was not for advertising purposes but to rap about scenarios and experiences related to the online marketplace.
GPU MODE Discord
- needs_exact_strides trumps needs_fixed_stride_order!: In PyTorch nightly, `at::Tag::needs_exact_strides` is superior to `needs_fixed_stride_order` due to the latter's occasional inaccuracies (see the sketch after this list).
- One developer moved the `.contiguous` calls in the C++ code so `torch.compile` can't interfere.
- Matrix Magic with PTX Instructions: A member shared a blogpost detailing how to efficiently load and store matrices within a warp using PTX instructions, including the `ldmatrix` instruction.
- They also linked the associated code, PTX documentation, and a LinkedIn post explaining the `stmatrix` instruction.
- Reference Kernel Gets a Jolt of Lightning!: A pull request has been merged to mitigate the long reference time issue, aiming to improve run times when the main bot is updated.
- The update addresses concerns about the reference implementation taking too long, especially when faster implementations require numerous runs to meet termination criteria.
- CUTLASS 4.0 and CuTe DSL released, but Borked?!: CUTLASS 4.0 and CuTe DSL are now released, accessible via `pip install nvidia-cutlass-dsl`, and NVIDIA recommends starting with the Jupyter notebooks.
- Members noted the `nvidia-cutlass-dsl` is version `0.0.0....` which was released like 2 months ago according to PyPI, so something seems borked with the release.
- Factorio Blueprints Drafted by Genetic Algorithm: A member is planning to create a genetic algorithm that generates Factorio blueprints based on specified requirements like building materials, input/output locations, and area constraints, and found a paper on genetic programming for dynamic path-finding.
- The algorithm aims to enable LLMs to provide constants for fulfilling these requirements, serving as a tool for dynamic factory design.
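Referencing the strides discussion above, here is a minimal sketch of how a custom op can be tagged so the compiler respects its stride requirements. The op itself is hypothetical, and `torch.Tag.needs_exact_strides` is assumed to be exposed in recent nightlies as the summary suggests; `needs_fixed_stride_order` is the older, looser tag.

```python
# Hypothetical custom op tagged to require exact input strides, so
# torch.compile cannot silently permute layouts behind its back.
import torch

torch.library.define(
    "mylib::scale",
    "(Tensor x) -> Tensor",
    tags=[torch.Tag.needs_exact_strides],  # nightly tag discussed above (assumed available)
)

@torch.library.impl("mylib::scale", "CompositeExplicitAutograd")
def scale(x: torch.Tensor) -> torch.Tensor:
    # Can now assume x arrives with exactly the strides the caller produced.
    return x * 2.0
```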
aider (Paul Gauthier) Discord
- Gemini 2.5 Pro Rewrites Huge Files: Users find that Gemini 2.5 Pro, used via OpenRouter with `--edit-format diff-fenced`, sometimes rewrites entire large files for minor changes, while others report that AI Studio provides faster results.
- Some prefer Sonnet 3.7 for their workflows, using cheaper models for simple tasks and Sonnet 3.7 for more complex architecture.
- Common Lisp Gets Modern AI Tooling: Users discussed leveraging existing models to enhance development in languages like Common Lisp, planning to utilize books and data sources to generate datasets and prompts for in-context learning, with one planning to use a Lisp DSL to build a compiler/interpreter.
- The approach involves LoRA-ing small models and employing semantic retrieval to integrate programming book knowledge into the context window.
- Gemini Adds Comments and Stupid Ideas: A user observed that Gemini was adding excessive comments and unwanted ideas directly into the code, which it then executed.
- The user suggested enforcing strict code changes without incorporating unsolicited suggestions.
- Aider Upgrade Still Troublesome: Users are still facing problems when upgrading Aider, where the version number fails to update even when the upgrade process seems successful.
- The SSL warning during the upgrade is likely unrelated and has been a persistent issue since January.
- Aider Gets an Aussie Chat Makeover: A user discovered a method to improve the readability of Aider's replies by modifying the `~/.aider.conf.yml` file.
- They recommended using `chat-language: English (Australia, use headings, bullet points, concise sentence fragments)` to achieve a more concise output.
Eleuther Discord
- LM-Eval-Harness Gets Datasets Quicker: Users can now download datasets for specific tasks in lm-eval-harness with `python3 lm_eval --model dummy --tasks [task_list] --limit 1` without immediately evaluating a model.
- The `dummy` model, which is defined here and returns random numbers, is used for testing purposes.
- R1-distill Gets Prompted Formally: Members debated whether to prompt R1-distill models with a "user: What is xyz? assistant:" format versus a plain "What is xyz?".
- The debate ended without resolution.
- LLMs Facing Regulation Heat: Members discussed regulations concerning bias standards in algorithms, particularly in the context of LLMs, mentioning regulatory agencies like the NCUA, EEOC, FDIC, HHS, and FTC.
- Regulators may view algorithms that can't be studied as infringing, based on the archived EEOC guidance.
- Skywork's Reasoning Abilities in Focus: The Skywork model and its techniques were lauded after its release, with a link provided to the Skywork Open Reasoner Series.
- One member noted that Skywork normalizes by the total tokens in the training batch rather than per sequence.
- lm-eval's Multi-GPU Woes: A member reported uneven GPU utilization when using `parallelize=True` in lm-eval, with GPU 0 at 100% and GPU 1 at 0%.
- Another member suggested using vLLM tensor parallel as it is more reliable, and suggested `accelerate launch -m lm_eval ...` for running multiple replicas instead of `parallelize`, which uses naive pipeline parallelism (a sketch of the vLLM route follows below).
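For readers who want to try the vLLM tensor-parallel route, here is a minimal sketch using lm-eval-harness's Python API. The model name is a placeholder, and the argument names follow the harness's documented conventions rather than anything quoted in the thread.

```python
# Minimal sketch: evaluating with the vLLM backend split across 2 GPUs,
# avoiding the naive pipeline parallelism of parallelize=True.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=meta-llama/Llama-3.1-8B,tensor_parallel_size=2",  # placeholder model
    tasks=["hellaswag"],
)
print(results["results"])
```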
Nous Research AI Discord
- Atropos Gets Axolotl Support: Nous Research released v0.2.0 of Atropos, featuring integration with Axolotl as an official trainer partner and a usage guide.
- The update includes new environments, updated API handling, and better TRL support.
- Psyche Network Democratizes Training: Nous Research launched the Psyche Network, a decentralized training network intended to democratize AI development by bringing together distributed compute resources for training large-scale models.
- The testnet launch involves pre-training a 40B parameter LLM using an MLA Architecture and a dataset comprising FineWeb (14T), FineWeb-2 (4T), and The Stack v2 (1T).
- DisTrO Optimizer Shatters Bandwidth Ceiling: The Psyche network utilizes Nous's DisTrO optimizers and a custom peer-to-peer networking stack to coordinate globally distributed GPUs, overcoming previous bandwidth constraints in AI training.
- Members can contribute USDC for compute.
- Multi-Turn Conversations Trip Up LLMs: The Lost in Conversation paper and its corresponding GitHub repo analyze LLM performance in multi-turn conversations versus single-turn settings, revealing a 39% average performance drop across six generation tasks in multi-turn scenarios.
- The paper attributes this to a minor loss in aptitude and a significant increase in unreliability, concluding that when LLMs take a wrong turn in a conversation, they get lost and do not recover.
- Benchmarks Flounder, Frontier Models Fumble: A member finds it very hard to feel out which frontier model is better at different tasks because benchmarks are not granular or diverse enough, and shares a link.
- They state that the "best" coding model might still be terrible at front end, or xyz framework, data viz etc.
Notebook LM Discord
- NotebookLM Courts UX Feedback: The NotebookLM team seeks user input on multilingual Audio Overviews, inviting users to participate in User Experience studies to help improve the feature.
- Users are encouraged to provide feedback to improve the multilingual audio features within NotebookLM.
- Invisible Sun TTRPG Shines with NotebookLM: A user is learning the Invisible Sun TTRPG by Monte Cook Gaming, using NotebookLM and ChatGPT Projects for rules lookup and has cited the shareability factor as a reason to prefer NotebookLM.
- They are planning to test NotebookLMâs insights on a new book coming via Backerkit.
- Google User Jitters Over Potential NotebookLM Sunset: A user voiced concerns about NotebookLM being discontinued, drawing on Google's history of product shutdowns and fearing a shutdown at an inconvenient time.
- Others argued that NotebookLM's unique value makes its sunset unlikely, suggesting a potential rebrand or marketing initiative instead.
- PDF Upload Restrictions Irk Users: Multiple users reported experiencing account restrictions preventing them from uploading PDFs to NotebookLM.
- The discussion provided no clear resolution to the problem.
- Podcast Length Hack Surfaces: A user inquired about extending podcast length, and another suggested padding the podcast with repeated links or documents on the same topic to reach a 22-minute duration.
- It remains unconfirmed whether this strategy is universally effective.
Latent Space Discord
- OAI Launch Stories Shared: A member shared very wholesome stories of OpenAI launches from andrewmayne.com.
- The author reminisced about the early days and scaling challenges.
- ChatGPT Scaling Deconstructed: The community shared a link to a newsletter article titled Building, launching, and scaling ChatGPT by the Pragmatic Engineer.
- The article goes over the history and tech stack of the ChatGPT launch.
- AI Founder in Residence Seeks Role: A member asked about Founder in Residence programs focused on AI, seeking advice on how to position themselves, as they have experience building AI systems for Analytics use cases in Amazon ads and want to build Self-Serve Agents in the same analytics space.
- No additional details were given.
- Gemini Powers Algorithm Design with AlphaEvolve: Google DeepMind introduced AlphaEvolve, a coding agent powered by Gemini designed for creating advanced algorithms.
- This could be a pivotal moment in algorithm design as it shows how coding agents can be harnessed.
- Prof Tom Yeh to walk through the Evolution of Llama 1/2/3/4: Prof Tom Yeh will walk through the Evolution of Llama 1/2/3/4 in one session at a special event.
- The event is organized by a member of the community.
HuggingFace Discord
- Qwen Model Distillation Remains Elusive: A member sought resources for distilling the Qwen family of models but no specific notebooks or references were shared.
- Community members may have leads on this topic, so further exploration may be useful.
- Perceptron Visualizers Captivate Community: A member shared a perceptron visualizer for educational purposes, showcasing its functionality in attached videos My_Video_0079.mp4 and My_Video_0080.mp4.
- Another member contributed to the visualization collection from darkspark.dev.
- Stable Diffusion Spins Locally!: Community members explored running Stable Diffusion locally using combinations of Diffusers and TGI, or with WebUI Forge (GitHub link) or reForge (GitHub link).
- Links to Diffusers documentation (huggingface.co, huggingface.co/learn) were also helpful for setup.
- PDF Format Ranks Low in Popularity Contest: Users voiced strong criticism of the PDF format, with one describing it as the worst format ever seen.
- A user proposed adding a markdown output option to improve semantic relationships for RAG ingestion, but others raised concerns about full categorization issues, particularly with tables.
- Smolagents Framework Falls Flat: A user reported using the smolagents framework with Qwen and tools used in the course, citing terrible results.
- This may reflect a need for further refinement of the framework, or alternative frameworks for similar tasks.
MCP (Glama) Discord
- Authpython APIs Lag Typescript's: The community noted that Authpython generally lags behind Typescript by about 1-2 months in API updates, but a link to a Go-MCP client was shared for reference.
- This could impact development timelines and integration efforts for projects relying on the latest features.
- Debug Smithery MCP Servers with ithena-cli: Members suggested debugging an MCP server running on Smithery using the ithena-cli tool.
- The tool stores all input and output for debugging purposes, providing a detailed log of interactions to help identify issues in Claude Desktop.
- Tiny Agents Embrace Remote MCP Support: Hugging Face Tiny Agents now features remote MCP Support, enabling connections to both SSE and Streaming HTTP servers from the command line.
- This enhancement provides a versatile approach to agent development and management, extending the capabilities of Tiny Agents in networked environments.
- Chatter: A Web-Hosted, MCP-Enabled Client Emerges: A new LLM-provider-agnostic, MCP enabled, web-hosted chat client named chatter has been open sourced and hosted at moopoint.io.
- Aimed as a web alternative to Claude Desktop, the client promises a free tier, memory, MCP server hosting, image handling, file uploads, and voice interaction soon.
- Yarr MCP Servers make landfall: New community implementations of ARRs MCP servers have landed on GitHub.
- This was mentioned on X (formerly Twitter) by a community member
Torchtune Discord
- Torchtune Models Hitch a Ride with vLLM: A member reported running a custom Torchtune model with vLLM in their internal version of GRPO.
- They are considering making their implementation public, after a user inquired about enabling vLLM support for their model.
- vLLM Integration Gets Synchronized: A member suggested creating a synchronous GRPO recipe with vLLM, in addition to an asynchronous version.
- They stated a strong preference for the vLLM version, saying they genuinely don't see any reason not to.
- Gemma Tokenizer Strays from HFModelTokenizer: A member found that the HFModelTokenizer with the Gemma chat template produces output tokens that do not match the torchtune GemmaTokenizer tokens.
- This indicates that torchtune's GemmaTokenizer may not be correctly applying the chat template.
- Gemma PromptTemplate Goes MIA: It was noted that a specific PromptTemplate for Gemma is missing, which leads to incorrect tokenization and potential problems with the `system` role.
- While the default might be to use the Alpaca template, a correct Gemma-specific template is crucial.
- BOS Tokens Error Injected from HF/Google's Config: The HF tokenizer adds multiple beginning-of-sequence (BOS) tokens because the configuration has `"add_bos_token": true` alongside a BOS token in the chat template.
- This issue comes directly from HF/Google's tokenizer config, making the implementation technically "correct" but functionally flawed (a quick check for the duplicated BOS follows below).
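As an illustrative check (not from the thread itself) for the duplicated-BOS behavior described above, one can tokenize a chat-templated message with a Hugging Face tokenizer and inspect the leading ids; the model name here is just an example.

```python
# Illustrative check for the double-BOS issue: if the chat template already
# emits <bos> and the config also sets add_bos_token=true, the first two ids
# can both be the BOS token.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2b-it")
ids = tok.apply_chat_template([{"role": "user", "content": "hi"}])
print(ids[:3], "bos_token_id =", tok.bos_token_id)
# Two leading bos_token_id values would reproduce the bug described above.
```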
Modular (Mojo 🔥) Discord
- Variant SIMD Bug Triggers Segfaults in Mojo: A user discovered a crash with `Variant` when employing `SIMD` types in Mojo, with a segfault occurring between print statements when a `Variant[T](simd)` is used; the issue appears to stem from insufficient space allocation within `Variant` or a lifetime issue.
- A reproducible example was shared on GitHub issue 4578, showcasing the bug's erratic behavior relative to print statement locations.
- Register Passable Types Face Scrutiny: Doubts have surfaced regarding the use of `register_passable` types exceeding system register sizes in Mojo, potentially leading to miscompilations due to LLVM's limitations.
- The current `Variant` implementation may be flawed for register passable types `T` where `sizeof[T]()` surpasses system register sizes, suggesting replacement with various `Trivial` versions.
- Mojo Integrates with Colab: It is now simpler to compile and execute Mojo code within a Colab notebook cell using `import max.support.notebook`, which introduces a `%%mojo` magic command.
- The announcement was detailed on the Modular forums.
tinygrad (George Hotz) Discord
- WebGPU Backend Catches a Bug: The WebGPU backend has a bug where the generated kernel doesn't have consecutive DEFINE_GLOBAL args, causing issues with `bufs_from_lin`, details here.
- claude reportedly fixed it.
- BEAM Parameter Busts WebGPU Performance: Setting the BEAM parameter negatively impacts WebGPU backend performance: running at 30ms with no beam, but 150ms with BEAM=1.
- It runs at 100ms with BEAM=2.
- Tinybox UI goes Minimalist: A minimalist UI concept for tinybox, featuring no login, no cloud, no fluff, emphasizing fast, local hardware control was built and showcased here.
- An HTTP settings page for tinybox is generally supported, given it maintains 0 deps and absolute minimal line count.
- Blake3 Bountiful for Tensor Storage: There is a bounty for a high-performance blake3 implementation to use for content-addressable tensor storage in the cloud.
- The implementation should be general-purpose.
LlamaIndex Discord
- LlamaIndexâs Memory Component Boosts AI: LlamaIndex rolled out a new Memory component that gives AI agents short and long-term memory, improving context in conversations and enabling static memory blocks (link).
- A user reported challenges with the Memory component in workflows, particularly that memory clears with each workflow call when `user_id` sets the `session_id`.
- LlamaExtract Adds Citations and Reasoning: @tuanacelik's new code walkthrough shows how to add citations and reasoning to LlamaExtract (link).
- The walkthrough shows how to make a schema that tells the LLM what to pull from complex data.
- Memory Defaults to DB, DB Reco'd for Scale: The Memory component defaults to an in-memory SQLite DB, but a local SQLite or PostgreSQL DB is recommended for scalability, set by changing the database URI (a minimal sketch follows this list).
- For long chat histories, a database is better than `memory.to_dict()` serialization as a JSON blob.
- Context Serialization vs. DB Debated: A user questioned if a database connection is better than serializing the context with the Memory component, since context restore recovers chat history.
- The clarification came that serialization is fine for defaults but databases rock for large chat histories or needing structured history saving, noting python dict vs redis is the same problem.
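A minimal sketch of wiring the Memory component to a persistent database, based on the summary above; the exact parameter names (notably the database URI argument) are assumptions and should be checked against the current LlamaIndex docs.

```python
# Sketch: Memory with a persistent Postgres backend instead of the default
# in-memory SQLite; parameter names are assumed, not verified.
from llama_index.core.memory import Memory

memory = Memory.from_defaults(
    session_id="support-chat-42",  # keep stable across workflow calls to avoid the clearing issue above
    token_limit=40_000,
    async_database_uri="postgresql+asyncpg://user:pass@localhost:5432/chat_history",
)
```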
Cohere Discord
- Cohere Users Get Charged: A user shared a link to the billing dashboard for checking the number of API calls made to Cohere.
- However, a user noted that the trial key only displays tokens and not the raw number of requests, suggesting that Cohere may not explicitly count API calls.
- Users Consider Cohereâs Value: Members discussed use cases for Cohere compared to models like ChatGPT and Anthropic.
- This discussion highlights the ongoing evaluation of Cohereâs positioning in the competitive landscape of AI models.
- Cohere's Command A still elicits questions: A member sought guidance on suggested generation parameters for Command A.
- The request underscores the importance of understanding and optimizing parameters for specific models to achieve desired results with Cohere.
LLM Agents (Berkeley MOOC) Discord
- Medium Article or X Post unlocks Course Certificate: Members clarified that earning a course certificate requires writing a Medium article or an X post summarizing one of the lectures.
- Interested members must submit their work via this form to receive credit.
- Submitting Coursework is critical for Certificate: To get the certificate, the coursework must be submitted via the provided Google Forms link after completing a Medium article or X Post.
- The submission ensures that the work is properly credited towards the course certificate.
The MLOps @Chipro Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
The Codeium (Windsurf) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
The Nomic.ai (GPT4All) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
The Gorilla LLM (Berkeley Function Calling) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
The AI21 Labs (Jamba) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
Discord: Detailed by-Channel summaries and links
Perplexity AI ▷ #general (748 messages🔥🔥🔥):
Android app custom model selection, Deep Research release date, Merlin AI pricing and web search quality, Perplexity AI Sonar vs GPT models, AI Studio for multimodal utility
- Users Inquire about Custom Model Choice on Android App: A user inquired about the possibility of selecting a custom model, specifically Grok3, within the Perplexity AI Android application, attaching a screenshot.
- No immediate solution or workaround was provided in the discussion.
- "Deep Research" Expected Soon, "Comet" Speculated: Members discussed the timing of Perplexity AI's new deep research feature and speculated whether it would be released before or after the Comet project, with one member stating that, according to their communications, deep research should be first.
- Another member expressed hope that Comet would be more than just a browser with a copilot style side bar, referencing a thinking batman gif.
- Merlin AI's Murky Rate Limits Draw Ire: Users discussed Merlin AI's pricing, with one noting its shady pricing due to unclear rate limits and another sharing that support gave a shitty response when asked, while praising its web search quality.
- It was also noted that for standard paid accounts on Merlin AI, any usage that surpasses $16 per day will also lead to the immediate termination of service for that day.
- AI Studio Hailed as Superior Multimodal Marvel: A user championed AI Studio for its multimodal utility, particularly for supporting audio and video input, which is unmatched by major LLM chats; they further added a key detail: AI Studio is our lord and savior for true multimodal utility.
- Comparisons were made to ChatGPT and other platforms, with users emphasizing AI Studio's capabilities being free.
- Projects Feature Sparks Speculation and Sneak Peeks: Speculation arose around Perplexity's new project feature, with users sharing screenshots and videos, like this one, of its functionality, and some expressing concerns about its potential misuse for vibe coding.
- There were also questions about how certain users gained early access to the feature, sparking discussion on testing catalogs and potential connections within the AI community.
Perplexity AI ▷ #sharing (2 messages):
Token Minimization, Sustain
- Token Minimization techniques explored: A user shared a Perplexity AI search result about token minimization for sustain.
- Another user then shared a direct link to Perplexity AI's page on the same topic: Token Minimization for Sustain.
- Sustain and token optimization: The discussion revolved around methods to reduce token usage while maintaining model performance, crucial for sustainable AI practices.
- Resources shared included techniques for efficient token encoding and strategies to minimize input length without sacrificing essential information.
Perplexity AI ▷ #pplx-api (12 messages🔥):
Sonar Model Benchmarks, Perplexity Pro API Access, New Developer Relations Resident, Sharepoint integration
- Sonar Models Beat Benchmarks!: Sonar Pro Low outperformed Claude 3.5 Sonnet on BrowseComp, with Sonar Low (the cheapest model) also outperforming Claude 3.7 Sonnet, according to recent benchmark evaluations.
- Sonar Pro Low achieved 4.0% accuracy on BrowseComp, almost 50% higher than Claude 3.5 sonnet, and both Sonar and Sonar Pro delivered up to 3x faster response times with more consistent latency.
- Perplexity Pro Users Discover API Access: Perplexity Pro actually includes $5/month in API credits! You can find the API docs and get started at Perplexity AI docs.
- Registering via a credit card is the only way to access the API, but users will not be charged if they keep their usage at or below $5 (a minimal call sketch follows this list).
- New DevRel Resident is a Robinhood Champ!: A new Developer Relations Resident (<@543991504922738688>) has joined the team, who won a Robinhood competition using the Perplexity API and will be helping developers build with Sonar.
- Sharepoint Integration into Perplexity: A user reported that Sharepoint integration into Perplexity Enterprise Pro works well in UI with relevant responses.
- However, they noted that they cannot receive any useful responses using Perplexity API service and are seeking advice.
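Since Pro includes the $5/month API credits noted above, here is a minimal sketch of calling a Sonar model through Perplexity's OpenAI-compatible endpoint; the base URL and model name follow Perplexity's public docs, and the key is a placeholder.

```python
# Minimal Sonar call via Perplexity's OpenAI-compatible API; replace the key
# with one generated from the Perplexity dashboard.
from openai import OpenAI

client = OpenAI(api_key="pplx-...", base_url="https://api.perplexity.ai")
response = client.chat.completions.create(
    model="sonar-pro",
    messages=[{"role": "user", "content": "What changed in AI news today?"}],
)
print(response.choices[0].message.content)
```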
LM Studio ▷ #general (176 messages🔥🔥):
Embedding Modules Issue, Benign Log Spam, LM Studio JIT, Building Models from Scratch, LM Studio Autoload Issues
- Embedding Issue Happening Across All Models: A user reported issues with embedding modules, encountering a "Maximum file size per conversation reached" error, capped at 31.46 MB with the default nomic-ai/nomic-embed-text-v1.5-GGUF model.
- The user seeks to overcome this limitation to process files of a few hundred MB for an audio-to-lyrics generation application.
- LM Studio Gets Log Spam Fix: A user was informed that the log spam issue they encountered is benign and has been fixed in the 0.3.16 build 2 or will be fixed shortly.
- This resolves concerns about potential problems when running the embedding model.
- LM Studio Has JIT Loading Issues: Users reported issues with LM Studio's JIT model loading feature, particularly when using it via the Continue VS Code plugin.
- The server sometimes serves clients with the wrong model if another model is already loaded; it was suggested to check the identifiers of the models being used, using the `lms ls` command to ensure they match between the config and LM Studio, as well as removing 8bit from it to fix it.
- Build A LLM From Scratch?: Members discussed the feasibility of building LLMs from scratch, with one member expressing interest in building a model for various use cases.
- However, others emphasized the impracticality due to the massive compute costs and data requirements, recommending fine-tuning existing open-source models instead and suggesting the "Build a Large Language Model from Scratch" Manning book for theoretical understanding.
- Problems Loading LM Studio Model: A user reported that LM Studio's model autoload feature stopped working, with models failing to load even after restarting the app; these models were being loaded using the JIT system via API requests.
- It was mentioned that this could be due to a mismatch in the model identifier used by the client, which could be resolved by running `lms ls`.
LM Studio ▷ #hardware-discussion (450 messages🔥🔥🔥):
Gigabyte RTX 5060 Ti, PCIE 5.0 Benefits, qwen3-14b-q4km performance, Dual GPUs, ROCm on Linux vs Windows
- Gigabyteâs Thrifty RTX 5060 Ti Design: A user showcases a new Gigabyte GeForce RTX 5060 Ti card, noting its ultra-short PCB design with only an x8 physical connector, despite the chip supporting x8, as seen in this videocardz.com article.
- The user praises the short board and large flow-through design for future-proofing and ease of installation, while humorously criticizing PNY's design choices for prioritizing aesthetics over functionality, obstructing airflow.
- PCIE 5.0 Boosts Card Performance: A user reports performance gains after upgrading to a 50 series card, specifically going from 26 tkps to 38 tkps using qwen3-14b-q4km with a 4096 context, due to PCIE 5.0 benefits on a board that limits the top slot to x8 when M.2 slots are occupied.
- The user also appreciates the ability to install the card in the bottom slot without interference, highlighting the advantages of the shorter PCIE connector design.
- Dual GPUs Capped by Slower Card: When questioned about using two GPUs simultaneously, one user explains that the speed gets capped at the slower card's performance plus overhead, indicating that the system will only perform as fast as the slowest component.
- They then shared an image of a PNY GPU, clarifying that they returned it.
- GMK Strix Halo Mini PC: Users are eagerly awaiting their GMK Strix Halo mini PCs, with one user having ordered two and planning to post LM Studio performance results upon arrival, as well as hoping that they can get it to run models split across multiple computers.
- Another user noted that GMKtec order shipping is now underway for orders placed May 7-13.
- ROCm on Linux for AMD GPUs: One user shared it wasn't that difficult to switch to Linux for ROCm support, while another reported issues with ROCm detection on Linux, noting no performance difference compared to Vulkan on Windows.
- A user stated that Vulkan can be faster but has a bug with flash attention and that ROCm is a lot more consistent.
LMArena ▷ #general (530 messages🔥🔥🔥):
DeepSeek R2 release, New Gemma models, Claude Neptune / 3.8 leaks, GPT-4.1 vs GPT-4o, O3 Pro release delays
- Gemma gets a new Model named Cutiepie: Members discussed a new Gemma model named Cutiepie, and some believe it might be slated for Google I/O if the timeline aligns.
- Others are trying to find strong anonymous models in the arena, and are wondering whether Gemma is free without logging in.
- DeepSeek R2: Still MIA?: The community noted that the constant speculation around a new DeepSeek R2 release has finally subsided, yet the model is still in development using Huawei's latest GPUs with Chinese funding.
- Internally, they are likely feeling pressure to release something impressive and solve server problems; current models excel in reasoning traces and multi-lingual chain of thought.
- Gemini 2.5 Pro vs The Competition: Members discuss Gemini 2.5 Pro and express it is superior to the competition, especially when coding, with some citing coding in C++ with it as a dream come true.
- Users find current models are suffering from hallucinations like crazy, though others find it better than o3 and are starting to feel threatened by Gemini.
- O3 Pro Delayed, O4 On the Horizon?: Speculation continues about the delayed release of o3 Pro, with some suggesting OpenAI is waiting for other labs to reveal their cards to claim the top spot, with a likely Google event looming.
- Internally, OpenAI already has o4, but is holding it back for strategic reasons relating to other labs; it is believed Claude 4 + o3 pro gonna be an insane combo.
- GPT 4.1: Is it a Coding Prodigy?: Members discussed GPT 4.1's coding capabilities, highlighting its quick compilation and error-fixing on large codebases, with instant response times.
- A member notes it is a big upgrade from GPT 4o mini to 4.1 mini for free users, but the same member feels that the 4.1 is solely for coding, not rlly other things tbh.
LMArena ▷ #announcements (1 message):
Server Updates, Forum Category, Roles Creation, Moderation Improvements, Future Events
- LMArena Server undergoes Updates: The LMArena server has implemented a few changes to make a more engaging and protected space, with community feedback driving these adjustments; members can provide input via this form.
- These changes include server structure adjustments, moderation improvements, and plans for regular events.
- New Forum Category introduced: A new Forum Category has been added to gather feedback, troubleshoot issues, and handle model requests, replacing the existing <#1347792466203381830> and <#1346566622927655033> channels for better organization.
- Users may need to enable these channels in `Channels & Roles -> Browse Channels` if they are not visible.
- Roles Creation for Targeted Announcements: New Roles have been created in the `Channels & Roles` section via auto-assigned questions, allowing for more targeted announcements to ensure members receive the most relevant information.
- The Server Guide, located at the top of the channel list, now contains the <#1343285970375540839> and <#1340554757349179416> channels.
- Moderation Improved via ModMail Bot: For immediate needs, the <@&1349916362595635286> role can be pinged, and for private issues, members can now send a Direct Message to the ModMail bot found at the top of the `Member List`.
- The <#1343285970375540839> has been updated to keep discussions on-topic and foster a more inclusive space.
- More regular events in the Future: Plans are in place to host events on a more regular basis, including Staff AMAs, contests, and casual game/activity nights.
- Members are encouraged to stay tuned for updates on these upcoming activities.
Unsloth AI (Daniel Han) ▷ #general (379 messages🔥🔥):
File:// usage, Qwen3 inference speed, llama 3.2 vision fine tuning, mergekit and frankenmerge, Qwen3 GRPO notebook
- Fix attempts for 'NoneType' error with file usage: Users debugged an `AttributeError: 'NoneType' object has no attribute 'startswith'` when attempting to use file:// for image loading, suspecting an issue with how the image variable was being set to None initially in the code, and suggested adding additional forward slashes in the path.
- After some debugging, the user reported that the error persisted, even after the path correction, leading to the consideration of using URLs as an alternative.
- Users Question Qwen3's Extremely Slow Inference: A user reported slow inference speeds with Qwen/Qwen3-0.6B on an A100, getting around 30 tokens/second with a batch size of 1, seeking advice on whether this is typical (a minimal load sketch follows this list).
- Others suggested that the small model size and batch size might negate the benefits of using `load_in_8bit=True`.
- Help is requested to finetune Llama 3.2 Vision Model: A user requested assistance setting up scripts for finetuning the Llama 3.2 vision model.
- They expressed frustration with image finetuning, stating it gives them a headache.
- Discussion on the use of entropy loss in GRPOTrainer: A user inquired whether GRPOTrainer implements entropy loss, referencing a paper that found it helpful.
- Another user shared an ablation table and noted that without entropy loss, the model still converges, but performs 2-4 points worse and added, I could see an argument why the entropy loss would matter more with a single example than if you have a more traditionally-sized dataset, though!
- Discussion on mergekit and frankenmerge: A user requested beginner-friendly guides, blogs, videos, or courses on mergekit and frankenmerge.
- Someone stated Mergekit apparently lets you merge different llms together
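As context for the Qwen3 inference discussion above, here is a minimal sketch of the Unsloth load path in question; the model name and `load_in_8bit` flag come from the thread, while the remaining arguments are illustrative.

```python
# Sketch of the load path discussed above; max_seq_length is an arbitrary choice.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-0.6B",
    max_seq_length=2048,
    load_in_8bit=True,  # reportedly of little benefit at this model size and batch size 1
)
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference mode
```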
Unsloth AI (Daniel Han) ▷ #off-topic (49 messages🔥):
O3 evaluation, GPT-4.1 coding, Qwen models, NEFTune
- O3 Model Called Out for Coding Garbage: Members complained that O3 outputs garbage code and that OpenAI models have become borderline unusable, while another member suggested that O3 is fucking garbage for coding and it's better to use it for planning and research with tool calls and use O4 mini high for coding.
- GPT-4.1 Declared Best Coding Model: One member stated that GPT 4.1 is the best coding model, as available in Github Copilot for those with the edu account.
- Another member agreed that 4.1 is much better than Sonnet and even O1.
- Qwen Models Outperform Gemma in Benchmarks: It was noted that the Qwen 3 models are better than Gemma 3, beating all of them at similar sizes.
- Specifically, the Qwen 3 8b model is apparently CRAZY good at reaching SOTA in almost any domain with good fine tuning and is the perfect sweet spot for local llms.
- Repo2txt.com Suggested as Superior to RAG for Code Injection: A member recommended repo2txt.com for selecting files from a repo and generating text, which is then injected directly into the prompt as it's better than allowing the model to do RAG (a sketch of the approach follows this list).
- They claim that models can't read all code automatically in github and make mistakes.
- Qwen Deep Research Discovers NEFTune: Members discussed Qwen's deep research which revealed NEFTune, or injecting noise in embedding during finetuning, so it acts like regularization.
- One member favored it over Gemini deep research and ChatGPT since it's very specific to the instruction, doesn't hallucinate, and told them about NEFTune.
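For reference, the "inject the whole repo into the prompt" approach mentioned above can be approximated in a few lines of Python; the file extensions and formatting here are arbitrary choices, not how repo2txt.com itself works internally.

```python
# Rough approximation of the repo-to-prompt idea: concatenate selected source
# files with path headers so the model sees the full code in context.
from pathlib import Path

def repo_to_prompt(root: str, exts: tuple = (".py", ".md")) -> str:
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            parts.append(f"### {path}\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts)

prompt = "Review this codebase:\n\n" + repo_to_prompt("my_repo")
```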
Unsloth AI (Daniel Han) ▷ #help (83 messages🔥🔥):
Vocabulary Size, Chat Templates and Base Models, Unsloth Performance Issues, GGUF compatibility, GRPO and Qwen3
- Vocabulary Size Prevents OOV Errors: A member noted that the vocabulary size is the same and uses byte-level encoding, so there's no chance of out-of-vocabulary (OOV) errors.
- The config adds the tool and think tokens, the chat template, and increases the `model_max_length` slightly.
- Chat Template Flexibility with Base Models: It was mentioned that in the base model, you can use any template as you want and it doesn't matter at all, even Alpaca or Gemma templates.
- However, you need to stick to it and use your own code to wrap the data into the template.
- Unsloth Inference Speed Slow for Qwen3: A user reported slow inference speeds with Qwen3-0.6B on an A100 (40GB), getting around 30 tokens/second with a batch size of 1, using Unsloth's `FastLanguageModel.from_pretrained()`.
- The base model does not contain a chat template in the `tokenizer_config.json`.
- Vision Model Merging Fix Pushed: A fix was pushed to address issues with `save_pretrained_merged()` and `push_to_hub_merged` for vision models; make sure to update the unsloth-zoo and unsloth installations from the main repos using `pip install --force-reinstall git+https://github.com/unslothai/unsloth-zoo.git` and `pip install --force-reinstall git+https://github.com/unslothai/unsloth.git`.
- The specific issue was detailed in this pull request.
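For context, a hedged usage sketch of the merge-and-save path that was fixed; the model id and the `save_method` value are assumptions based on Unsloth's documentation, not details from the discussion.

```python
from unsloth import FastVisionModel

# Illustrative checkpoint; any Unsloth-supported vision model works similarly.
model, tokenizer = FastVisionModel.from_pretrained("unsloth/Llama-3.2-11B-Vision-Instruct")
# ... LoRA finetuning happens here ...

# The patched call: merges LoRA adapters into the base weights and saves them.
model.save_pretrained_merged("merged_model", tokenizer, save_method="merged_16bit")
```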
- New Feature in llama.cpp Requires mmproj File: When using the new feature in `llama.cpp` to run multimodal models, GGUF models may require an `mmproj` file.
- This file can be created by converting the model twice with `llama.cpp`, once normally and once with the `--mmproj` command-line argument, per the updated `llama.cpp` documentation.
Unsloth AI (Daniel Han) ▷ #research (5 messages):
Med Palm 2, QLoRA memory, modernBERT context length
- Google's Med-PaLM 2 and stabgan: A member noted that stabgan's approach is almost similar to what Google's Med-PaLM 2 paper did, though the paper used the concept for normal generation, not reasoning.
- He thanked them for sharing the paper.
- QLoRA memory remains low: A member reported the use of QLoRA kept the memory low.
- He used this to play with the full modernBERT context length of 8k, and to use massive batch sizes to get better and more diverse in-batch sampling of negatives during training.
OpenAI ▷ #annnouncements (2 messages):
Safety Evaluations Hub, GPT-4.1, GPT-4.1 Mini
- Safety Evaluations Hub debuts: OpenAI introduced the Safety Evaluations Hub, a resource to explore safety results for their models.
- While system cards share safety metrics at launch, the Hub will be updated periodically as part of their efforts to communicate proactively about safety.
- GPT-4.1 lands in ChatGPT: GPT-4.1, a specialized model excelling at coding tasks & instruction following, is now available directly in ChatGPT for Plus, Pro, & Team users via the "more models" dropdown.
- Enterprise & Edu users will get access in the coming weeks, and it is a faster alternative to OpenAI o3 & o4-mini for everyday coding needs.
- GPT-4.1 mini supersedes GPT-4o mini: GPT-4.1 mini is replacing GPT-4o mini in ChatGPT for all users.
- GPT-4.1 and GPT-4.1 mini underwent standard safety evaluations, with detailed results available in the newly launched Safety Evaluations Hub.
OpenAI ▷ #ai-discussions (151 messages🔥🔥):
Sentient AI conversation, ChatGPT models for coding, O3 model intelligence, ChatGPT Enterprise plan, AI-generated images on Instagram
- O4-Mini-High is a Coding Beast: Members raved about O4-mini-high for coding, with one user exclaiming, "Never seen such good and fast performance out of a coding model in A WHILE", noting it quickly solved a problem and improved the code for a calculator in just 22 seconds.
- Despite some users noting GPT-4.1 being made for coding, the general sentiment leaned towards O4-mini-high delivering superior performance compared to Claude 3.7 Sonnet and Gemini 2.5 Pro.
- O3 Unlimited is a dealbreaker: A user considering switching from a Pro subscription to a Teams subscription was concerned about losing unlimited O3, a model they consider "a legendary model to resolve most of my issues," despite the Teams plan offering desirable internal knowledge features.
- Members suggested sticking with the Pro plan due to the benefits of unlimited O3, which now supports deep research through personal GitHub Repos, OneDrive, and SharePoint, with 20 uploaded files allowed in a Chat / Project.
- ChatGPT Enterprise Trains on User Data for Enhanced Performance: A user shared their experience with a corporate version of ChatGPT that, initially "pretty trash", improved significantly after training on thousands of users' daily interactions, now utilizing GPT-4o and offering more secure and uncapped usage.
- Another user inquired about the message cap limit of O3 in the Enterprise plan, clarifying that while the plan uses GPT-4 Turbo, they were specifically interested in the message cap limit for the O3 model, later found to be 100 messages per week per user.
- GPT-4.1 Mini replaces GPT-4o Mini: The community noted that GPT-4.1 Mini has replaced GPT-4o Mini in ChatGPT for all users as of May 14, 2025, touted for significant improvements in instruction-following, coding, and overall intelligence.
- Members discussed the shift from GPT-4o to GPT-4.1 on the free plan, weighing the pros and cons of each with some members believing GPT-4o is better for therapeutic purposes while GPT-4.1 excels in other areas such as front-end coding.
OpenAI ▷ #gpt-4-discussions (12 messages🔥):
GPT-4o for web app coding, Structured outputs for Azure OpenAI assistants, Node ID errors, Chat delays on PC vs. mobile, Flagged chats due to long output
- GPT-4o Aids Web App Dev: Members discussed using GPT-4o for coding web apps (Vue, Express.js, MongoDB), emphasizing the need to specify tooling, OS, IDE, languages, frameworks, and preferred dependencies.
- Clarity in detailing requirements helps the model provide expected solutions; simply "Tell it exactly what you want".
- Azure OpenAI Assistants face Structure Output Woes: A user reported issues working with structured outputs for assistants in Azure OpenAI.
- Another user reported getting "getNodeByIdOrMessageId - no node found by id: placeholder-request-" type messages all the time now, indicating ongoing problems with the platform.
- Typing Lag surfaces on PCs: A user experienced typing lag and delayed message loading on their PC, while the same chat session worked fine on their phone on the same network.
- They further isolated the issue by testing on a separate Win11 work computer, confirming the problem was specific to the PC setup.
- Flagged Chats Trigger Crashes: A member suggested that the system might have flagged a chat due to an extremely long output, potentially longer than the output limit.
- The user theorized that "if you're feeding it files with 5000+ lines of code and it's writing fixes and refactors the code, but makes an error the system could flag it, then the whole chat is broken and unable to load on any device".
OpenAI ▷ #prompt-engineering (70 messages🔥🔥):
GPTs for coding, PII data guardrails, AI for universe simulation, Mimicking writing style with AI, Ollama vs Windsurf
- Users seek advice using GPT-4o for coding Vue, Express, and MongoDB: A member sought guidance on using GPT-4o for coding, specifically with Vue, Express.js, and MongoDB, and inquired about integrating it with Visual Studio.
- Another member recommended starting with tutorials using HTML, CSS, and JavaScript with ChatGPT to build basic applications such as a calculator, notes app, or weather app, after which the user should learn TypeScript, React, Vite, and Electron.
- Members discuss PII data guardrails challenges: A member reported encountering issues with PII guardrails when their application, which accesses HR data, blocked requests for home addresses.
- Another member suggested consulting OpenAI directly to discuss appropriate use and obtain guidance on handling sensitive data requests while adhering to their policies, especially concerning PII.
- User builds AI Universe Simulator for Companion Prompt Generation: One member is using AI to build a 1:1 simulation of the universe to think about thinking and create automated ecosystems within AI models.
- The goal is to scale the simulation into a web browser and push for open communication lines between models, promoting efficiency over stacking models.
- Users explore mimicking writing styles with ChatGPT: A member asked about using ChatGPT to mimic their writing style, and another member suggested sharing samples and iterating with guidance to refine the output.
- It was pointed out that a general prompt will most likely not yield quality results; a strong, coherent set of rules and constraints is needed, and training the model on 1000 pages can help.
- Users discuss using local model inference with Ollama, instead of services like Windsurf: A user questioned whether Ollama is inferior to Windsurf, and a member advised against paying for services like Windsurf, Lovable, or Replit when using AI to build apps.
- It was suggested to learn prompting to avoid API costs, and use local model inference with Ollama and VS Code extensions like Continue and Roo Code, while doing LLM and Agentic courses from Hugging Face.
OpenAI ▷ #api-discussions (70 messages🔥🔥):
ChatGPT for Web App Development, GPT-4o Coding Assistance, HR Data Guardrails and PII, Mimicking Writing Style, Prompt Engineering and Agentic Frameworks
- Navigating ChatGPT for Web App Coding Assistance: A user sought advice on leveraging GPT-4o for web app development, specifically with Vue, Express.js, and MongoDB; another member suggested using it like conversing with a human, providing specific details about challenges, sharing code snippets, and stating background.
- The member recommended starting with a bare-bones prototype, testing it, fixing errors iteratively, and adding features one at a time, ensuring both the user and the model remain aligned.
- Circumventing PII Guardrails for HR Data: A user reported issues with PII guardrails when requesting information like home addresses from an HR data-connected application; while OpenAI support might provide tailored guidance, community members are limited in offering specific workaround advice due to OpenAI's policies.
- The suggestion was to contact OpenAI support directly to discuss the specific use case and seek appropriate guidance for handling sensitive PII data within the application.
- Mastering Mimicry: Emulating Writing Style with ChatGPT: A user inquired about the best approach for getting ChatGPT to mimic their writing style, and a member suggested providing samples, requesting emulation for a specific goal, and iteratively correcting and refining the output based on the modelâs feedback.
- Another member emphasized specifying details about structure and the elements used to clarify writing style for the bot.
- Prompt Engineering for Stellar AI Results: A member recommended learning basic HTML, CSS, and JavaScript to better understand and debug AI-generated code and suggested completing LLM and Agentic courses on Hugging Face Learn to grasp prompting, context management, and roles.
- They suggested that one can evaluate their prompting skills by asking ChatGPT to rate the prompt engineering and provide feedback, which is key to improving results, and described agentic frameworks as an LLM plus MCP servers, modes, and prompts guiding it across multiple modes and tools.
- Unveiling Personalized ChatGPT Engagement Metadata: A user requested a detailed breakdown of their ChatGPT usage metadata, including metrics like message length, role usage, and conversation depth, to gain insights into their interaction patterns.
- A follow-up prompt of "are there any other stats you can share i had not covered?" was suggested to generate even more stats.
Cursor Community ▷ #general (271 messages🔥🔥):
GPU power in decentralized systems, Cursor pricing vs API pricing, Claude Max in Cursor, Multi-repo projects in Cursor, Cursor's Git changes sync issues
- GPU Power P2P options pondered: A member inquired about open-source options for utilizing GPU power in peer-to-peer or decentralized systems.
- The discussion did not yield specific solutions, highlighting a potential area for exploration.
- Cursor's 20% markup debated: Users debated whether Cursor's 20% surcharge over actual API pricing is justified, with some arguing it makes perfect business sense while others find it a bad value.
- One user stated, "I feel cursor are the absolut best value for the money currently, compared to what you get for the $20 bucks", while another claimed to save $600 per month by using Claude Max outside of Cursor.
- Gemini 2.5 joins Cursor's Model Mix: The community spotted the addition of Gemini 2.5 Pro on May 6th to Cursor's selection of available models.
- Some noted that the new model finally fixed the "i will code now (stops coding)" issue.
- Mono-Repo Method Maneuvers: Users discussed managing multi-repo projects, with one suggesting consolidating into a single monorepo to avoid context fragmentation.
- Another user mentioned using separate workspaces within the same parent folder in Cursor 0.50.x.
- Background Agent Begs for Confirmation: Users expressed frustration that the background agent requires excessive confirmation, increasing fast request usage.
- One user complained that the background agent "just keeps asking for confirmation to do things and its just eating fast request up faster".
Yannick Kilcher ▷ #general (197 messages🔥🔥):
Patents in AI, RL-Diffusion, Generator Paradigm, Evolutionary Algorithms, Hamiltonian Neural Networks and Transformers
- RL-Diffusion Debate Sparks Patent Concerns: A member discussed developing RL-Diffusion, combining forward and backward processes into one controlled by RL, leading to a debate about patentability and prior art, with some members skeptical about its novelty and practical implementation.
- Some members encouraged practical implementation and benchmarking before pursuing patents, emphasizing that transformation into a patent-eligible application requires "more than simply stat[ing] the [abstract idea] while adding the words 'apply it.'"
- Google's AlphaEvolve Sparks Excitement and Skepticism: Members discussed Google's AlphaEvolve, which pairs Gemini models with evolutionary algorithms to improve underlying models, with opinions divided on whether it's a meaningful advancement or just "brute force with an LLM."
- One member noted its potential significance in accelerating multiplication, while another linked it to existing work like AlphaTensor and AlphaCode, viewing it as a small step in neural net-driven search.
- Hamiltonian Neural Networks and Transformers Get Attention: A member shared their idea of integrating transformers into Hamiltonian neural networks, referencing a paper on a Hamiltonian neural network (HNN)-Transformer architecture for modeling physical systems [https://ieeexplore.ieee.org/document/10316909].
- Discussion touched on whether attention mechanisms align with the history-independent nature of Hamiltonian systems, with another member suggesting a transformer that learns system Hamiltonian dynamics from a single trajectory.
- Diffusion vs. Autoregression: Continuous vs. Discrete?: Members discussed the fundamental differences between diffusion models and autoregressive models, highlighting that diffusion models work with continuous distributions while autoregressive models work with discrete sequences of symbols.
- The discussion extended to how VQVAE can be used to transform images into discrete tokens for autoregressive models like Parti [https://sites.research.google/parti/], enabling transformers to operate on latents.
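To make the VQVAE point concrete, here is a minimal sketch of the quantization step that turns continuous encoder outputs into the discrete token ids an autoregressive model consumes; the codebook size and dimensions are illustrative.

```python
import torch

def vq_tokenize(latents: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map continuous latents (N, D) to discrete ids (N,) via the nearest codebook entry (K, D)."""
    dists = torch.cdist(latents, codebook)  # (N, K) pairwise L2 distances
    return dists.argmin(dim=-1)             # index of the closest code

codebook = torch.randn(8192, 64)            # illustrative: 8192 codes of dim 64
latents = torch.randn(256, 64)              # e.g. flattened image patches
token_ids = vq_tokenize(latents, codebook)  # discrete symbols for the AR model
```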
Yannick Kilcher ▷ #paper-discussion (23 messages🔥):
Grade School Math Benchmarks, ML systems rabbit hole, Data Loading and Preprocessing, LLMs are like humans, Model formulates a plan
- GSM8K Benchmark: Near-Perfect Accuracy Achieved: Language models have shown their capability to solve mathematical reasoning problems, achieving near-perfect accuracy on grade-school level math benchmarks like GSM8K, as discussed in this paper and detailed in this blogpost.
- The paper explores whether language models truly develop reasoning skills or simply memorize templates.
- Dive Deep into the ML Systems Rabbit Hole: It was shared that a significant portion of training time, around 65%, is spent on data loading and preprocessing, referencing this paper.
- LLMs: Human-like or Not?: One member suggests that a paper's findings about LLMs might be flawed because "They look to prove LLMs are like humans instead of trying to disprove it."
- The member feels the paper misinterprets its results by claiming LLMs formulate plans before generating solutions.
- Model Avoision of Unnecessary Computations: One member quotes the paper as saying "the model can learn to generate shortest solutions, almost always avoiding unnecessary computations", arguing "This suggests that the model formulates a plan before it generates, in order to avoid computing any quantities that are not needed towards solving the underlying math problem."
- Another member countered that "What the model learns is that unnecessary things are random noise which by definition have no signal for the model to learn, so it ignores them".
Yannick Kilcher ▷ #ml-news (12 messages🔥):
AI Regulation Ban, AlphaEvolve, Budget Reconciliation Bill
- GOP Blocks AI Regulation for a Decade: House Republicans added language to the Budget Reconciliation bill that would block all state and local governments from regulating AI for 10 years (source).
- The provision, introduced by Representative Brett Guthrie of Kentucky, vaguely states that no state or local entity can enforce laws regulating AI models or systems for a decade, potentially impacting privacy regulations.
- DeepMind's AlphaEvolve Cracks Open Problems: DeepMind's AlphaEvolve, a Gemini-powered coding agent, evolves algorithms for math and practical applications, combining LLM creativity with automated evaluators (DeepMind Blog).
- The system rediscovered state-of-the-art solutions in roughly 75% of cases and improved the previously best known solutions in 20% of cases, even advancing the kissing number problem.
OpenRouter (Alex Atallah) ▷ #app-showcase (5 messages):
New Chatbot Platform, Customization and Models, Image Generation in Chat
- Personality Launched: New Chatbot Platform Emerges: A member introduced Personality, a new chatbot platform enabling users to create and roleplay with multiple characters and use non-role-play assistants.
- The platform aims to offer more customization, less filtering, and a wider selection of models compared to existing solutions like c.ai.
- Personality Platform Offers Free Image Generation: The platform's playground at personality.gg/playground offers free image generation, though it's noted that this feature is not powered by OpenRouter.
- Users are invited to try the platform for free at personality.gg and provide feedback.
- Big Updates Coming to Personality Platform This Week: A major update is expected this week including the ability to generate images directly within chats and a better user interface.
- This aims to enhance the user experience and expand the platformâs capabilities.
OpenRouter (Alex Atallah) ▷ #general (177 messages🔥🔥):
OpenAI Reasoning Models Naming, Free Google Models, Gemini Rate Limits, Claude on OpenRouter vs Native, Corvid Befriending
- OpenAI's Reasoning Model Names Need a Revamp: A user inquired about the naming inconsistency in OpenAI's reasoning models, noting that some have reasoning-level variants (e.g., `openai/o4-mini-high`) while others don't, and requested consistency in offering reasoning levels for all models to aid evaluation.
- Free Google Models Getting the Squeeze: Users reported issues with free Google models despite having credits, with some confirming extremely low rate limits.
- Alternatives like DeepSeek V3 were recommended, while concerns were raised about potential removal of free routes for Gemini following a change shared on Twitter.
- Claude System Prompt Differences Explained: Users noticed a difference in helpfulness when using Claude via OpenRouter compared to the native Anthropic website due to the extensive system prompts used on the latter.
- It was suggested that users manually implement the system prompt, available on GitHub, which comprises around 16000 tokens and includes tools.
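A minimal sketch of supplying such a system prompt yourself through OpenRouter's OpenAI-compatible endpoint; the file name and model slug are illustrative, and the prompt text would be the roughly 16000-token prompt published on GitHub.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

# Hypothetical local copy of the published Anthropic system prompt.
system_prompt = open("claude_system_prompt.txt").read()

resp = client.chat.completions.create(
    model="anthropic/claude-3.7-sonnet",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Summarize the repo layout for me."},
    ],
)
print(resp.choices[0].message.content)
```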
- Become One With Murder Birds: A user shared their ongoing journey of befriending corvids (crows and magpies), detailing their feeding routine and the development of trust.
- They recounted stories of crows following them after being fed and anticipated building a crow army, ending on an anticlimactic note with their location in Germany.
- "Always Use This Key" is Actually New: A new "Always use this key" option was introduced, causing confusion as it was initially mistaken for the existing "Use this key as a fallback" setting.
- The new feature exclusively uses the specified key and prevents fallback to OpenRouter, which represents a change from the behavior of the older fallback setting.
Manus.im Discord ▷ #general (165 messages🔥🔥):
Manus credits not refreshing, best use cases for manus, Manus invitation codes, Manus refunds, Gemini Developer API's Function Calling feature
- Users Reporting Manus Daily Credits Not Refreshing Properly: Several users reported issues with their daily 300 credits not refreshing at 00:00 GMT, potentially due to time zone processing problems.
- A user mentioned their credits refresh at 8:00pm in their timezone, indicating inconsistencies in the credit refresh timing.
- Unleashing Genius with Glitched Invitation Codes: A user claimed to have a glitched account with 100 invitation codes and shared numerous invitation links, leading to discussions about their origin and purpose.
- Some users speculated the codes were from a paid subscription, while others questioned their usefulness to existing members; new users get 500 credits with the codes.
- Manus Credit Refunding Difficulties: A user reported a job failure that consumed 800 credits and expressed frustration about not being able to get a refund from Manus.
- Other users chimed in, stating that refunds are no longer provided, even if the service doesn't work as expected, with one suggestion to dispute the charge.
- Rapping about Facebook Marketplace Scams: A user requested Manus to generate a rap about Facebook Marketplace lowballing, using slang terms like best price, last price, and mates rates.
- The user clarified that the request was not an advertisement and involved rapping about scenarios and experiences related to the online marketplace.
- Beta tester boasts about a secret Music project: A user mentioned being accepted into a new beta trial related to Music, but couldn't disclose details due to NDAs.
- The user later promoted their social media accounts (TikTok, Instagram, LinkedIn, Threads, YouTube, & X) featuring Manus content.
GPU MODE ▷ #general (2 messages):
torch.compile performance, layernorm vs rmsnorm
- PyTorch Implementations Compared: A member suggested that comparing "basic" implementations with separate kernels for each operation on GitHub could explain performance improvements.
- Another member confirmed finding similar results when comparing `torch.compile` of PyTorch's layernorm and rmsnorm implementations, noting they seem to have basically the same performance.
- Surprising GPU Profiles Reported: A member mentioned that some colleagues have posted surprising GPU profiles, prompting further investigation.
- They plan to follow up with them to understand the underlying factors contributing to these unexpected results.
GPU MODE ▷ #cuda (1 messages):
CUDA Shared Buffers, PyTorch Tensors, RAPIDSAI/rmm Library
- Seek Simpler C++/CUDA Library for Shared Buffers: A member inquired about a simple C++ library, potentially with pybind, to demonstrate multiple processes reading/writing to shared CUDA buffers.
- They're also interested in wrapping PyTorch tensors on top, noting that RAPIDSAI/rmm might be too extensive for their needs.
- Further CUDA/PyTorch Interoperability: The user is seeking guidance on efficiently managing shared CUDA memory between multiple processes.
- They are particularly interested in a streamlined approach that integrates well with PyTorch tensors, possibly as an alternative to the more comprehensive RAPIDSAI/rmm library.
GPU MODE ▷ #torch (2 messages):
PyTorch nightly, at::Tag, needs_exact_strides, C++ code, torch.compile
- needs_exact_strides better than needs_fixed_stride_order: Members discussed that if you're on a PyTorch nightly, `at::Tag::needs_exact_strides` is better because `needs_fixed_stride_order` sometimes lies.
- One member mentioned reading the code and believes the answer is no, while another thanked them and mentioned moving the `.contiguous` calls in the C++ code so `torch.compile` can't mess with them.
- Contiguous calls moved in C++ code: A developer moved the `.contiguous` calls in the C++ code to prevent `torch.compile` from interfering.
- This adjustment was made to address a recurring issue, and the developer appreciated the suggestion to use `needs_exact_strides` for better stride handling in PyTorch nightly builds.
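For context, a hedged sketch of how such a tag can be attached when defining a custom op from Python; it assumes a nightly that exposes `torch.Tag.needs_exact_strides`, and the op itself is made up for illustration.

```python
import torch

# Declare a custom op and tag it so torch.compile must pass exactly-strided inputs.
torch.library.define(
    "mylib::row_op",
    "(Tensor x) -> Tensor",
    tags=(torch.Tag.needs_exact_strides,),  # assumption: available on recent nightlies
)

@torch.library.impl("mylib::row_op", "default")
def row_op(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for a kernel that assumes contiguous, row-major input.
    return x.contiguous().sum(dim=-1)
```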
GPU MODE ▷ #beginner (5 messages):
Arithmetic Intensity of Kernels, TMA Utilization Metrics, Triton Performance Debugging, Nsight Compute for Kernel Debugging
- Kernel Arithmetic Intensity Question Arises: A member inquired about the best way to compute the arithmetic intensity of kernels and metrics for assessing TMA utilization in Hopper and Blackwell architectures (a back-of-envelope sketch follows this list).
- Another member suggested using `tma__inst_executed.sum` for TMA on Hopper, referencing an NVIDIA forum post, and pointed out that Nsight Compute has a built-in roofline tool to estimate arithmetic intensity.
- Nsight Systems (nsys) Debugs Triton Performance: A member asked if using `nsys` and `nsys-ui` is a typical workflow for debugging GPU performance while learning Triton.
- Another member confirmed that this is a typical workflow for whole-program performance analysis, especially on headless servers, and recommended Nsight Compute (ncu) and ncu-ui for debugging specific kernels.
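Before reaching for the roofline tool, arithmetic intensity can be estimated by hand; the sketch below does this for a GEMM under illustrative assumptions (fp16 operands, each matrix touched once, no cache modeling).

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, dtype_bytes: int = 2) -> float:
    """FLOPs per byte for C = A @ B with A (m, k), B (k, n), C (m, n)."""
    flops = 2 * m * n * k                                # one multiply + one add per MAC
    bytes_moved = dtype_bytes * (m * k + k * n + m * n)  # read A and B, write C
    return flops / bytes_moved

# A 4096^3 fp16 GEMM: ~1365 FLOPs/byte, firmly compute-bound on Hopper.
print(gemm_arithmetic_intensity(4096, 4096, 4096))
```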
GPU MODE ▷ #self-promotion (10 messages🔥):
Weight Pruning, PTX Instructions for Matrix Load/Store, CohereAI Talk Recording
- Pruning Weights for Performance?: A member inquired about weight pruning, specifically random block-weight pruning, clarified to be done with program IDs rather than zeroing weights.
- This technique relates to efficiently loading and storing matrices using PTX instructions.
- PTX Boosts Matrix Manipulations!: A member shared a blogpost detailing how to efficiently load and store matrices within a warp using PTX instructions, including the `ldmatrix` instruction.
- They also linked the associated code, PTX documentation, and a LinkedIn post explaining the `stmatrix` instruction.
- CohereAI Talk slides released!: A member shared a Google Meet link for a talk, and subsequently shared the slides.
- When asked about a recording, another member suggested it would be available on the CohereAI YouTube channel.
GPU MODE ▷ #🍿 (1 messages):
c.3.p.1: This looks potentially interesting: https://arxiv.org/abs/2504.09246
GPU MODE ▷ #submissions (47 messages🔥):
AMD MI300, AMD fp8-mm, VectorAdd, Leaderboard Submissions
- MI300 Leaderboard Domination: Numerous submissions were made to the `amd-fp8-mm` leaderboard on MI300, showcasing various performance improvements.
- Submissions ranged from 162 µs to 26.3 ms, indicating a wide spectrum of optimization levels.
- Mixture of Experts on AMD: One successful submission was recorded on the `amd-mixture-of-experts` leaderboard on MI300 with a time of 7574 ms.
- VectorAdd on T4: One submission achieved 8th place on the `vectoradd` leaderboard on T4 with a time of 6.41 ms.
- New Personal Bests Abound: Several members achieved personal bests on the `amd-fp8-mm` leaderboard, reflecting ongoing optimization efforts.
GPU MODE ▷ #status (1 messages):
Competition delayed, Ironing out details, Problem #3
- Competition Delayed Due To Ironing: Problem #3 for the <#1359640791525490768> competition will be delayed by a few days, and here's why.
- The team is ironing out a few details to ensure the problem is as fun as possible, so your patience is appreciated.
- Fun Details Being Ironed: The competition team is taking extra time to iron out a few details to ensure the problem is as fun as possible.
- The problem in question is problem #3.
GPU MODE ▷ #factorio-learning-env (15 messages🔥):
Factorio Genetic Algorithm, Cutting Down Tokens, Nearest buildable tool
- FactorioGP Genetic Algorithm for Blueprint Generation: A member is planning to create a genetic algorithm that generates Factorio blueprints based on specified requirements like building materials, input/output locations, and area constraints, and found a paper on genetic programming for dynamic path-finding.
- The algorithm aims to enable LLMs to provide constants for fulfilling these requirements, serving as a tool for dynamic factory design.
- Cutting Tokens Saves Dough: The group noted that it cost about $1000 to evaluate 6 models across 24 tasks, with 8 runs each, with one member suggesting a potential 90% reduction in token usage through intelligent context pulling and a RAG implementation.
- A RAG implementation could cut 90% of the tokens used
- Nearest Buildable tool is Imperfect: Current strategy uses a nearest_buildable tool for identifying appropriate places to put things, which is imperfect, which they can create a thread on discord to discuss work streams.
- Recurring meetings might be established to discuss work streams.
GPU MODE ▷ #amd-competition (23 messages🔥):
Reference Kernel Times, Application Timeout Errors, fp8 gemm VGPR usage, Leaderboard Submission Issues, HIP Kernel .s File Access
- Reference Kernel gets Speed Boost: A pull request has been merged to mitigate the long reference time issue, aiming to improve run times when the main bot is updated.
- The update addresses concerns about the reference implementation taking too long, especially when faster implementations require numerous runs to meet termination criteria.
- Application timeout turns out to be a Blip: Members experienced intermittent application timeout errors, which were temporarily resolved by retrying submissions.
- A newline character within an `asm volatile` statement was identified as a potential cause, though the issue appeared to resolve itself.
- Users Seek fp8 GEMM VGPR Insights: A member writing a HIP kernel for fp8 gemm inquired about how to determine VGPR usage.
- Another member suggested using ROCm's amd_matrix_instruction_calculator to check.
- Leaderboard Submission from CLI fails!: A user reported experiencing timeouts when submitting to the leaderboard for amd-mixture-of-experts via the command line interface (CLI).
- Submitting via Discord worked, but the CLI submissions consistently timed out.
- Hunting HIP Kernel .s Assembly Files: A user sought a method to obtain the `.s` file (assembly code) for a HIP kernel, mentioning the use of `hipcc` and `extra_cuda_cflags`.
- One suggestion was to pass `-save-temps` to `hipcc`, but accessing the file during execution proved difficult; compiling locally was suggested as an alternative.
GPU MODE ▷ #cutlass (25 messages🔥):
CUTLASS 4.0 Release, CuTe DSL for Python, MLIR Compiler, PTX Dumping, Custom Kernel Performance
- CUTLASS 4.0 and CuTe DSL Debut!: CUTLASS 4.0 and CuTe DSL are now released, accessible via `pip install nvidia-cutlass-dsl`, and NVIDIA recommends starting with the Jupyter notebooks.
- Members noted the `nvidia-cutlass-dsl` package is version `0.0.0....`, which was released roughly two months ago according to PyPI, so something seems borked with the release.
- CuTe DSL Requires Python 3.12: A user reported issues installing and running the examples, which was resolved by using Python 3.12, as required by the documentation.
- CuTe DSL Achieves Blazing Fast Kernel Performance: A member implemented a custom CuTe kernel in C++ for fractional norms, achieving 67ms for a 30,000x30,000x1000 problem with p=1.0, outperforming `torch.cdist` at 4,000ms.
- Replacing `cute.gemm` in the `sgemm.py` example with a custom implementation yielded similar performance, compiling in 0.5 seconds and running in 62ms, beating PyTorch by 60x!
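For reference, the PyTorch baseline being beaten is a one-liner; the sizes below are scaled down from the 30,000x30,000x1000 problem so the snippet runs on modest hardware.

```python
import torch

x = torch.randn(3_000, 1_000, device="cuda")
y = torch.randn(3_000, 1_000, device="cuda")

d = torch.cdist(x, y, p=1.0)  # fractional/Minkowski norms via the p argument
print(d.shape)                # torch.Size([3000, 3000])
```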
- MLIR Compiler Not Open Source (Yet?): A user inquired about building from source and whether the MLIR src files are open sourced, but developers confirmed that the dialect compiler is not OSS.
- Users can install with pip and just use it.
- Dumping PTX from CuTe DSL: A user asked if there's a way to dump the generated PTX code, similar to Triton's MLIR_ENABLE_DUMP, but currently, setting `CUTE_DSL_PRINT_IR=1` only dumps the MLIR file.
- This feature does not yet exist.
GPU MODE ▷ #mojo (2 messages):
Mojo PyTorch backend, Autograd Implementation, Micrograd, Pytorch internals
- Mojo ❤️ PyTorch Backend?: A member expressed enthusiasm for the idea of Mojo becoming a PyTorch backend, while also hoping for a more accessible codebase, especially for those less familiar with C++.
- He inquired about the implementation of the backward pass, specifically asking about the need for separate kernels and how fusion would be handled.
- Micrograd as Pytorch Inspiration: One member mentioned that Micrograd was based on PyTorch, with links to the Micrograd video and the PyTorch paper provided for context.
- This suggests that the principles and implementation details of PyTorchâs autograd system could offer insights into how Mojo might handle its own backward pass.
aider (Paul Gauthier) ▷ #general (49 messages🔥):
Gemini 2.5 Pro, Model Performance, Common Lisp, AI Studio, Repomap
- Gemini 2.5 Pro performance and use cases: Users are experimenting with Gemini 2.5 Pro via OpenRouter, using `--edit-format diff-fenced`, and observing that it sometimes rewrites huge files for small changes, raising questions about its behavior.
- Some users find AI Studio provides results faster, and report that Sonnet 3.7 works best for their workflow, while others use cheaper models for ask mode and Sonnet 3.7 for architect mode.
- Discussing Common Lisp and Modern AI Tooling: Users are discussing using existing models to develop in less popular languages like Common Lisp, planning to use books and data sources to create datasets and prompts for in-context learning.
- The idea involves LoRA-ing small models and using semantic retrieval to add programming book wisdom to the context window, and a member suggests creating a Lisp DSL to build a compiler/interpreter.
- Addressing Google AI Studio Redirect Issues: A user reports being redirected from Google AI Studio after briefly seeing the UI, despite their country being on the allowed list, seeking potential solutions or explanations.
- Repomap Issues and Solutions: A user questions why repomap sometimes underperforms, even with a high map multiplier, indicating potential issues with file mapping in certain projects.
- They noted that "getting the perfect snake is no easy task!"
- Gemini Adds Comments Into the Code and Stupid Ideas for Later: A user found that Gemini adds many comments into the code, and even writes "stuiped ideas" for later directly inside the code, which it then implements because they were present in the code.
- They argued there should be strict code changes without embedded coding ideas.
aider (Paul Gauthier) ▷ #questions-and-tips (48 messages🔥):
Gemini rate limits, Aider upgrades, Aider models, Aider configuration, Aider file navigation issues
- Gemini Free Tier Runs into Rate Limits: Users reported experiencing sudden rate limits with Gemini's free tier, even after periods of inactivity, with one user noting "probably they just deprioritize those clusters so when load goes up elsewhere it 429s".
- Aider Upgrade Woes Persist: Users are encountering issues when upgrading Aider, with the upgrade process failing to update the version number despite appearing to complete successfully.
- The SSL warning during the upgrade is likely unrelated, as it has been a recurring issue since January.
- Experimental Gemini Models Disabled, Confusion Ensues: Users faced errors indicating that the free experimental Gemini model was disabled, leading to confusion and the suggestion to switch to the preview model.
- One user reported unexpected charges, questioning whether they were actually using the preview model; checking the Aider announce lines at startup later clarified that the pro-preview wasn't being used.
- Aider Gets a Concise Aussie Chat Makeover: A user discovered a way to make Aider's replies easier to read by modifying the `~/.aider.conf.yml` file.
- They suggested using `chat-language: English (Australia, use headings, bullet points, concise sentence fragments)`.
- Aider Struggles with Large File Navigation: A user reported issues with Aider navigating a 1600 line file, experiencing difficulties with line numbers and debugging unrelated code.
- It was suggested to try different models and to consider that the repo map might be contributing to the issue.
Eleuther ▷ #general (22 messages🔥):
lm-eval-harness dataset download, R1-distill models prompt format, Regulatory bias standards and LLMs, Open Science Conference call for papers, ODSC vs OSC conference confusion
- LM-Eval-Harness Simplifies Dataset Downloads: A user inquired about downloading datasets for specific tasks in lm-eval-harness without immediately evaluating a model, and a solution was found by using `python3 lm_eval --model dummy --tasks [task_list] --limit 1` to download the datasets.
- The `dummy` model defined here is used for testing and returns random numbers, while `--limit n` restricts evaluation to the first `n` rows.
- R1-distill Model Prompt Formatting Explored: A member asked if it's common practice to prompt R1-distill models with a `user: What is xyz? assistant:` format rather than just directly doing `What is xyz?`.
- Unfortunately, the thread was cut off here.
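Rather than hand-writing role prefixes, the usual approach is to let the tokenizer's chat template produce the format; a minimal sketch with an illustrative R1-distill checkpoint:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is xyz?"}],
    tokenize=False,
    add_generation_prompt=True,  # appends the assistant-turn marker
)
print(prompt)
```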
- LLMs Face Bias Regulation Scrutiny: Discussion revolved around regulations concerning bias standards in algorithms, particularly in the context of LLMs, citing examples of regulatory agencies like the NCUA, EEOC, FDIC, HHS, and FTC.
- The archived EEOC guidance was mentioned, emphasizing that regulators require proof of non-discrimination and may view algorithms that can't be studied as infringing.
- Open Science Conference Announces Call for Papers: The Open Science Conference is accepting calls for papers, potentially suitable for interdisciplinary work, with submissions due in one week.
- Further details on the call can be found on the Open Science Conference website.
- ODSC and OSC Conferences: Avoid Confusion!: It was clarified that ODSC is distinct from OSC and that there are a lot of conferences with very similar names/abbreviations, with a warning that some of these might be scams spread through old google groups.
- One member confirmed ODSC is legitimate (since they were a speaker there), and OSC appears legitimate, but less popular.
Eleuther ▷ #research (57 messages🔥🔥):
Model of Mind AI, Falsifiable Hypothesis, Sparse Gradients, Qwen 3, Skywork Model
- Modeling AI after the Mind: A member modeled an AI after the concepts of a conscious, subconscious and unconscious mind with a higher level behavioral system, based on the psychological model.
- A member noted that the channel discusses a rather narrow range of specific ML topics.
- Falsifiable hypothesis: It was stated there is no need for a degree here, but there must be some adherence to a falsifiable hypothesis, or mathematical description of the process.
- The channel is for discussing research topics or specific papers and results rather than oneâs own partially formed research ideas.
- Qwen Cooks Up a Storm: Members noticed that Qwen is cooking hard as evidenced by a linked image.
- A question arose as to what the actual mechanism for controlling entropy is in Qwen 3.
- Skyworkâs Shorter Reasoning: A member said that the Skywork model and techniques are very good, especially given that a full release just came out and linked to Skywork Open Reasoner Series.
- They normalized by the total tokens in the training batch rather than per sequence, which is basically Dr. GRPO; a sketch of the difference follows.
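The distinction is easy to show in code; a minimal sketch of the two normalizations, where `per_token_loss` is a (batch, seq) tensor and `mask` marks real (non-padding) tokens.

```python
import torch

def per_sequence_norm(per_token_loss, mask):
    # GRPO-style: average within each sequence, then across sequences.
    seq_loss = (per_token_loss * mask).sum(-1) / mask.sum(-1)
    return seq_loss.mean()

def per_batch_token_norm(per_token_loss, mask):
    # Dr. GRPO / Skywork-style: average over all tokens in the batch,
    # so long sequences are not down-weighted token-for-token.
    return (per_token_loss * mask).sum() / mask.sum()
```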
Eleuther ▷ #lm-thunderdome (7 messages):
Multi-GPU lm-eval, vllm Tensor Parallel
- Multi-GPU lm-eval utilization issues surface: A member reported that when using `parallelize=True` in lm-eval, GPU 0 has 100% utilization, while GPU 1 has 0% utilization.
- Another member explained that `parallelize` uses naive pipeline parallelism where it splits the model layers, so no more than one rank is used at a time, suggesting `accelerate launch -m lm_eval ...` for running multiple replicas.
- vllm Tensor Parallelism recommended for stability: When other multi-GPU solutions failed, a member suggested using vllm tensor parallel, noting that it's more reliable.
- The original poster was unaware of using vllm with lm-eval, and expressed that they had been using the HuggingFace implementation.
Nous Research AI ▷ #announcements (2 messages):
Atropos v0.2.0 Release, Psyche Network Launch, Decentralized AI Training, Large Language Model Training, Open Source AI Development
- Atropos v0.2.0 drops with Axolotl support: Nous Research has released v0.2.0 of Atropos, their RL environments project, featuring new environments, updated API handling, better TRL support, and integration with Axolotl as an official trainer partner, with usage guide here.
- Psyche Network Launches to Democratize AI Training: Nous Research launched the Psyche Network, a decentralized training network aimed at democratizing AI development by bringing together distributed compute resources for training large-scale models.
- Psyche Testnet Trains a 40B Parameter LLM: The testnet launch of Psyche involves pre-training a 40B parameter LLM using an MLA Architecture and a dataset comprising FineWeb (14T), FineWeb-2 (4T), and The Stack v2 (1T).
- DisTrO Optimizer Breaks Bandwidth Constraints on Psyche: The Psyche network utilizes Nous's DisTrO optimizers and a custom peer-to-peer networking stack to coordinate globally distributed GPUs, overcoming previous bandwidth constraints in AI training.
- Open Source Community Drives Psyche Development: Nous Research encourages community involvement through forums and Discord to gather model ideas, aiming to foster innovation in model creation and design within the open source community, with code available on GitHub.
Nous Research AI ▷ #general (78 messages🔥🔥):
Frontier Models, smolvlm-realtime-webcam, 3 GPUs, latex2sympy2_extended math_verify, Atropos
- Benchmarks fail to capture model nuances: A member finds it very hard to feel out which frontier model is better at different tasks because benchmarks are not granular or diverse enough, and shares a link to illustrate that the "best" coding model might still be terrible at front end, or xyz framework, data viz, etc.
- Try smolvlm realtime webcam project: One member shared a link to smolvlm-realtime-webcam project.
- Troubleshooting Atropos dependencies: A member ran into a problem running the `examples/gsm8k_server.py` file, which requires the `math_verify` and `latex2sympy2_extended` modules; another member suggested using `pip install latex2sympy2_extended math_verify` to fix the problem.
to fix the problem. - Nousâ Psyche is distributed GPU training!: The Psyche Network mining pool filled up to 500k in 40 minutes, prompting one to state that itâs almost as if making AI training open to everyone is a good idea and the project link- was shared by a member.
- Contribute USDC for Compute, it is a donation: Members discussed donating USDC to contribute compute to the Psyche project, with the funds going to Nous; a member confirmed that any capital contributed to this pool is purely a donation and for testing purposes only.
Nous Research AI ▷ #ask-about-llms (1 messages):
princepolka: Is 05-06 worse at instruction-following than the previous 2.5 Pro?
Nous Research AI ▷ #research-papers (1 messages):
LLMs in multi-turn conversations, LLM performance degradation, Lost in Conversation paper, Premature Solution Generation by LLMs, LLM Recovery from Conversational Errors
- LLMs Struggle with Multi-Turn Conversations: A member shared the Lost in Conversation paper and its corresponding GitHub repo, analyzing LLM performance in multi-turn conversations versus single-turn settings.
- The paper reveals a 39% average performance drop across six generation tasks in multi-turn scenarios, attributing it to a minor loss in aptitude and a significant increase in unreliability, concluding that when LLMs take a wrong turn in a conversation, they get lost and do not recover.
- LLMsâ Premature Solution Attempts Lead to Unreliability: The study indicates that LLMs often make assumptions early in conversations and prematurely attempt to generate final solutions, leading to unreliability.
- This behavior suggests that LLMs may benefit from improved error correction mechanisms to recover from incorrect turns in a conversation.
Nous Research AI ▷ #interesting-links (2 messages):
Finetuning to 1.58 Bits, Cody S Tweet
- WandB Report: Finetuning to 1.58 Bits: A WandB report discusses finetuning to 1.58 bits.
- The report likely contains details on techniques and results related to achieving such low-bit finetuning.
- Cody S Posts on X: Cody S tweeted something on X.
- Without more context, the tweet's contents and relevance to AI research are unclear.
Nous Research AI ▷ #research-papers (1 messages):
LLMs in Multi-Turn Conversations, Lost In Conversation paper, LLM Unreliability
- LLMs Get Lost in Multi-Turn Conversations: A member shared the Lost In Conversation paper and its GitHub repo, which finds that LLMs perform significantly worse in multi-turn conversations compared to single-turn interactions, with an average performance drop of 39% across six generation tasks.
- LLMs Prone to Premature Solution Attempts: The paperâs analysis of 200,000+ simulated conversations revealed that LLMs often make assumptions early and prematurely attempt to generate final solutions, leading to unreliability.
- In simpler terms, the authors discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.
Notebook LM ▷ #announcements (1 messages):
User Experience studies, Multilingual Audio Overviews, NotebookLM Feedback
- NotebookLM Users Invited to UX Studies: A friendly reminder to opt-in to participate in User Experience studies was posted.
- The NotebookLM team is currently looking for feedback on their multilingual Audio Overviews feature.
- Feedback Wanted on Multilingual Audio Overviews: NotebookLM users are encouraged to provide feedback on multilingual Audio Overviews through user experience studies.
- This initiative aims to improve the user experience for those utilizing the multilingual audio features within NotebookLM.
Notebook LM ▷ #use-cases (19 messages🔥):
Invisible Sun TTRPG, Shareability Factor, Google Product Discontinuation, NotebookLM and OneNote Sync, Podcast Feature ToS
- Invisible Sun TTRPG Gamified with NotebookLM: A member has been teaching themself a new TTRPG called Invisible Sun by Monte Cook Gaming, using NotebookLM and ChatGPT Projects for rules lookup.
- They like NotebookLM for the shareability factor and clear citations but prefer ChatGPT audio reviews; they look forward to testing NotebookLM's insights on a new book coming via Backerkit.
- XDA Developers Hail NotebookLM for its Unique Features: XDA-developers published an article on six use cases/features where NotebookLM excels, prompting agreement among users.
- Another article shows how to use NotebookLM with OneNote.
- Google User Fears NotebookLM's Eventual Sunset: A user expressed concern that NotebookLM might be discontinued at an inconvenient time, citing Google's history of sunsetting good products.
- Another countered that NotebookLM's uniqueness and potential make it unlikely to be discontinued, suggesting a possible rebrand and marketing push.
- OneNote Sync Dream Sparks Discussion: A member proposed linking a OneNote notebook to NotebookLM for synchronization, so that changes in OneNote would update the NotebookLM source.
- This idea sparked discussion about the potential integration and benefits of such a feature.
- Podcast Feature Usage Questions Arise: A user inquired about the terms of service for using NotebookLMâs podcast feature, specifically regarding using the audio in platforms like YouTube.
- Another user suggested checking the T&C for clarity and advised that disclaimers about accuracy and links to original sources are important.
Notebook LM ▷ #general (32 messages🔥):
Podcast Length, Audio Upload and Transcription, Account Restrictions on PDF Uploads, Adding Information to System Prompt, Early Access Installation Issues
- Pad Podcast Length with Repeated Links: A user inquired about increasing podcast length in NLM, and one member suggested adding several links or documents on the same topic, even if repeated, to extend the podcast to 22 minutes.
- It was not specified if this strategy would work for everyone.
- Audio Upload Transcribes; AI Studio Enhances Subtitles: One member suggested uploading audio as a source for transcription.
- Another member recommended using 2.5 flash on AI Studio for timecoded subtitles.
- Restrictions on PDF Uploads Plague Some Users: Several users are experiencing restrictions on their accounts, preventing them from uploading PDFs.
- There was no resolution for this problem in the discussion.
- System Prompt Customization Craving Human Touch: A user expressed dissatisfaction with podcasters using too many letters to refer to things and sought a way to add info to the system prompt to prefer human language instead.
- No solutions were offered during this discussion.
- NotebookLM Beta Installation Glitches: A user reported being stuck on "installing" after receiving the "early access" notification for the app.
- There was discussion around the user's region, but no specific fix was offered.
Latent Space ▷ #ai-general-chat (41 messages🔥):
GPT-4 Launch, ChatGPT Scaling, AI Founder in Residence, AI in Ohio Courts, AlphaEvolve
- OAI Launch Retrospective: Personal Observations: A member shared some very wholesome stories of OpenAI launches from andrewmayne.com.
- Scaling ChatGPT: Building and Launching: The community shared a link to a newsletter article titled Building, launching, and scaling ChatGPT by the Pragmatic Engineer.
- The article goes over the history and tech stack of the ChatGPT launch.
- Founder In Residence: AI Edition: A member asked about Founder in Residence programs focused on AI, seeking advice on how to position themselves.
- They have experience building AI systems for Analytics use cases in Amazon ads and want to build Self-Serve Agents in the same analytics space.
- Gemini Powers Algorithm Design with AlphaEvolve: Google DeepMind introduced AlphaEvolve, a coding agent powered by Gemini designed for creating advanced algorithms.
- Turbopuffer hits General Availability: Turbopuffer announced they are GA (Generally Available).
Latent Space ▷ #ai-announcements (3 messages):
Tom Yeh, Llama 1/2/3/4, LLM Paper Club
- Prof Tom Yeh to walk through Llama 1/2/3/4: Prof Tom Yeh will walk through the Evolution of Llama 1/2/3/4 in one session at a special event.
- The event is organized by a member of the community.
- LLM Paper Club Notification Channels: A member directed users to look top left for Channels & Roles to be tagged in the relevant role for notifications for LLM Paper Club.
- No additional details were given.
HuggingFace ▷ #general (15 messages🔥):
Qwen Model Distillation, MiniCPM-V-2_6, Perceptron Visualizers, Local Stable Diffusion Hosting, Langfuse Deployment with Smolagents
- Qwen's Quintessence: Distilling Knowledge?: A member inquired about notebooks or references for the distillation of the Qwen family of models.
- No resources were directly shared in the provided context.
- MiniCPM-V-2_6: A Trending Model?: A member asked if anyone has tried using openbmb/MiniCPM-V-2_6, noting that it's trending and has a high number of downloads.
- No responses were provided in the given context.
- Visualizing Vectors: Perceptron Visualizer Sparks Joy!: A member shared a perceptron visualizer for educational purposes, as shown in the attached videos My_Video_0079.mp4 and My_Video_0080.mp4.
- Another member then shared another visualizer to enjoy from darkspark.dev.
- Stable Diffusion, Served Locally: Forge Your Own Images!: Several members inquired about locally hosting Stable Diffusion.
- It was suggested to combine Diffusers and TGI, or use WebUI Forge (GitHub link) or reForge (GitHub link); links to Diffusers documentation (huggingface.co, huggingface.co/learn) were also shared.
- Langfuse Local Launch: Telemetry Tango!: A member asked for help getting a local Langfuse deployment working with smolagents.
- They were directed to the dedicated channel and advised to get the docker-compose.yml from official docs and use opentelemetry-sdk.
HuggingFace ▷ #today-im-learning (4 messages):
Assistance Offered, Hugging Face Transformers, EleutherAI Suggestion, Diffusion Course from MIT
- AI Engineer Volunteers Expertise: An AI engineer offered assistance on interesting projects, particularly for researchers or professors working on papers, especially related to LLM research and reinforcement learning.
- The engineer is happy to contribute to anything, from brainstorming and coding to experimentation, implementation, or even the less glamorous parts like paperwork or debugging.
- Transformer Familiarity Questioned: A member asked whether the engineer was familiar with huggingface transformers.
- Another member suggested checking out EleutherAI.
- MIT Diffusion Course Recommended: A member inquired about research papers the engineer has been looking at.
- The member shared the MIT diffusion course focused on image generation.
HuggingFace ▷ #i-made-this (7 messages):
pdf2tex vs 12GB ram, PDF format criticism, Markdown output suggestion, Civitai censorship
- pdf2tex RAM Usage Impresses!: A user noted that pdf2tex uses only 1GB of RAM while auto-detecting and extracting figures using OpenCV, contrasting with a project using 12GB of RAM for parallel processing.
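A rough sketch of that figure-detection idea: find large contours on a rendered page with OpenCV and crop them out. The thresholds are illustrative, not pdf2tex's actual parameters.

```python
import cv2

page = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)  # a pre-rendered PDF page
_, binary = cv2.threshold(page, 250, 255, cv2.THRESH_BINARY_INV)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

figures = []
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w * h > 50_000:  # keep only figure-sized regions
        figures.append(page[y:y + h, x:x + w])
```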
- PDFs: The Bane of Existence!: Users expressed strong dislike for the PDF format, with one calling it the worst format ever seen and another joking that calling PDF a format is a loose term.
- One user converts PDFs to TGA or BMP in memory for easier processing and expressed a desire for a pdfToSrc solution.
- Markdown Output Proposed!: A user suggested adding a markdown output option to improve semantic relationships for RAG ingestion.
- The developer acknowledged the suggestion, noting that while markdown is better than plain text, it may not fully address categorization issues for embedders, particularly with tables.
- Civitai Censors Celebrity Content!: A user reported that Civitai has muted all celebrity content, raising concerns about censorship.
- They linked to a Civitai model and shared a quote from Jeri Ryan (Seven of Nine) regarding the use of AI to generate nudes.
HuggingFace ▷ #reading-group (3 messages):
Simulation-Based Inference, AI Reading Group session
- Reading Group Explores Decision-Making with Simulation-Based Inference: The AI Reading Group session will discuss using Simulation-Based Inference for modeling decision-making, relevant to understanding human behavior.
- A Medium post provides additional information about the paper.
- Reading Group Keeps Consistent Schedule: The AI Reading Group session will be held at 9am PDT / 12pm EDT / 6pm CEST.
- This is consistent with the previous meeting time.
HuggingFace ▷ #NLP (3 messages):
Emotion detection limitations, Transformers tokenizer context length
- Emotion detection faces benchmark quality woes: Emotion detection doesn't work very well due to low-quality benchmarks, because scholars have difficulty defining what they want to predict and encoder models tend to learn heuristics.
- This is mostly caused by the difficulty of reaching agreement on gold-standard labels.
- Transformers tokenizers limited by context length: All models have a context length, according to the Transformers tokenizer documentation.
- If you pass too much context to your model, you will simply get an error.
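In practice the fix is to truncate at tokenization time; a minimal sketch with an illustrative encoder model:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative model
enc = tok(
    "some very long document ... " * 1000,
    truncation=True,
    max_length=tok.model_max_length,  # 512 for BERT-style encoders
    return_tensors="pt",
)
print(enc["input_ids"].shape)  # capped at the model's context length
```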
HuggingFace ▷ #smol-course (2 messages):
Agent blocked sites, Smolagents framework
- Agent wastes time on blocked sites: A user reported that their agent wastes time going to a blocked site (universetoday.com).
- No solutions were provided in the given messages.
- Smolagents framework yields terrible results: A user reported using the smolagents framework with Qwen and some of the tools used in the course (Google Search).
- The user complained that the results were terrible.
HuggingFace ▷ #agents-course (10 messages🔥):
HF Inference Provider Credits, HF SPACE_ID and SPACE_HOST ENV vars, Unit 1 code execution, InferenceClient Model Selection, Llama models text_generation
- Credits Crunch Prompts Assignment Submission Solutions: A user inquired about submitting the final assignment after exceeding the monthly included credits for the HF Inference Provider, while developing unit 4 locally using Ollama.
- Another user suggested adding HF SPACE_ID and SPACE_HOST as ENV variables and running the app locally.
- Unit 1's Home Turf: Where Does Code Roam?: A user asked where to run the code for Unit 1, specifically mentioning the HF space duplication.
- Another user recommended using Google Colab.
- InferenceClient Model Choice Causes Pope Age Quandary: A user reported that when running the Unit 1 agent in the space, with no changes and giving it a "hello", the agent tried to calculate the age of the Pope.
- He also shared that he tried using `client = InferenceClient("meta-llama/Llama-3.3-70B-Instruct")`.
- Text Generation Troubles in Llama Land: A user suggested using this model instead: `client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")`.
- This is happening because the text_generation function is not supported in any of the Llama models (a sketch contrasting the two call styles follows this list).
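A sketch contrasting the two call styles on the serverless Inference API; per the thread, raw `text_generation` may not be served for the Llama endpoints, while chat completion is:

```python
# Minimal sketch: text_generation vs. chat_completion on the HF Inference API.
from huggingface_hub import InferenceClient

# Raw text generation against the suggested Mixtral endpoint.
client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")
print(client.text_generation("Hello!", max_new_tokens=32))

# For Llama models, use the chat-completion task instead.
chat = InferenceClient("meta-llama/Llama-3.3-70B-Instruct")
resp = chat.chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```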
MCP (Glama) ▷ #general (40 messages🔥):
Typescript vs Authpython Lag, Debugging MCP servers on Smithery, Scalable MCP with Streamable HTTP, User Confirmation for AI Agent MCP Tools, Revolutionary idea for MCP Security
- Authpython Lags Typescript APIs: It was mentioned that Authpython generally lags behind Typescript by around 1-2 months in terms of API updates.
- A member suggested checking a specific channel for examples, with a link to a Go-MCP client.
- Debugging MCP Server Deployed to Smithery: A user sought advice on debugging an MCP server running on Smithery due to an error encountered in Claude Desktop.
- Another member recommended using the ithena-cli tool to store all input and output for debugging, prefixing the run command.
- MCP Servers go Streamable HTTP?: A user inquired about using MCP servers with streamable HTTP instead of stdio for scalability, noting that most open-source servers use stdio.
- They were unsure if they needed to reconfigure every open-source MCP server from stdio to streamable HTTP (a minimal server sketch appears after this list).
- AI Agents now ask for user Confirmation: A user asked how to ensure their AI Agent explicitly asks for user confirmation before triggering updates via MCP tools, similar to Claude Desktop.
- The author of fast-agent chimed in noting there is a pre_tool_call check hook that can be used to add an approval flow, similar to the existing human input tool.
- MCP Inspector Images Evaporate in Claude?: A user reported that while images are visible in the resource section of MCP Inspector after invoking a tool, Claude Desktop does not show the image in the resource section.
- Another user clarified that Claude only shows images in the tool response view.
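On the streamable-HTTP question above: a minimal sketch, assuming the MCP Python SDK's `FastMCP` helper and its transport names (worth verifying against the SDK version in use), showing that the same server definition can be exposed over stdio or streamable HTTP:

```python
# Minimal sketch: one MCP server, switchable between stdio and streamable HTTP.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

if __name__ == "__main__":
    # "stdio" is the default; "streamable-http" serves the same tools over HTTP,
    # which scales better than spawning one process per client.
    mcp.run(transport="streamable-http")
```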
MCP (Glama) ▷ #showcase (3 messages):
Yarr MCP Servers, Tiny Agents Remote MCP Support, LLM-provider-agnostic, MCP enabled Chat Client
- Yarr MCP Servers on GitHub: A member shared a link to a GitHub repository containing several MCP servers for the *arr apps.
- A member also shared a link to X (formerly Twitter) further discussing this topic.
- Tiny Agents Gets Remote MCP Support: Hugging Face Tiny Agents now has remote MCP Support and can connect to both SSE and Streaming HTTP servers from the command line.
- Tiny Agents offers a versatile approach to agent development and management.
- New Web-Hosted Chat Client is MCP Enabled: A member introduced a new LLM-provider-agnostic, MCP enabled, web-hosted chat client open sourced at chatter and hosted at moopoint.io.
- The client aims to replace Claude Desktop with a web interface for interacting with LLM providers and MCP servers, with features like a free tier, memory, MCP server hosting, image handling, file uploads, and voice interaction coming soon.
Torchtune ▷ #general (5 messages):
Custom Torchtune Models with vLLM, Synchronous GRPO recipe with vLLM
- Torchtune Model's vLLM Voyage: A member confirmed running a custom Torchtune model with vLLM in their internal version of GRPO.
- They hinted at potentially making their implementation public, after being asked how to enable vLLM support for their model.
- vLLM Integration Gets Synchronized: A member proposed creating a synchronous GRPO recipe with vLLM, suggesting both synchronous and asynchronous versions should exist.
- They expressed a strong preference for using the vLLM version, stating they genuinely don't see any reason not to.
Torchtune ▷ #dev (37 messages🔥):
HFModelTokenizer vs GemmaTokenizer, Gemma PromptTemplate, Tokenizer configurations, Masking assistant tokens
- Gemma Tokenizer Faces Discrepancy with HFModelTokenizer: A member reported that the HFModelTokenizer with the Gemma chat template produces output tokens that donât match the torchtune GemmaTokenizer tokens.
- This discrepancy suggests that torchtune's GemmaTokenizer may not be applying the chat template correctly.
- Gemma PromptTemplate Missing, Alpaca to the Rescue?: It was noted that there isn't a specific PromptTemplate for Gemma, which leads to incorrect tokenization and potential issues with the `system` role.
- The default might be to use the Alpaca template, but it's crucial to have a correct Gemma-specific template.
- Multiple BOS tokens Error Inherited from HF/Google's Config: The HF tokenizer is adding multiple beginning-of-sequence (BOS) tokens due to the configuration having `"add_bos_token": true` alongside a BOS token in the chat template.
- This issue is inherited from HF/Google's tokenizer config, making the implementation technically "correct" but functionally flawed; a reproduction sketch follows this list.
- Navigating the Maze of Jinja Tricks for Masking Assistant Tokens: A discussion emerged around masking, specifically how Hugging Face provides an option to return an assistant mask.
- The conversation highlights the complexity of maintaining the masking process, with potential solutions involving Jinja tricks.
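A quick way to observe the double-BOS behavior described above, assuming a Gemma instruct checkpoint (gated on the Hub) and the stock HF tokenizer config:

```python
# Minimal sketch: count BOS tokens after applying the chat template.
# Duplicates appear when add_bos_token is true AND the template emits <bos>.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-7b-it")  # example checkpoint
messages = [{"role": "user", "content": "hello"}]
ids = tok.apply_chat_template(messages, tokenize=True)
print(ids.count(tok.bos_token_id))  # > 1 reproduces the reported issue
```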
Modular (Mojo 🔥) ▷ #mojo (25 messages🔥):
Variant bug with SIMD, register_passable types, Mojo in Google Colab
- Variant Bug causing segfaults with SIMD found: A user reported a crash with `Variant` when using `SIMD` types in Mojo, specifically a segfault occurring between print statements when a `Variant[T](simd)` is used; the issue seems related to insufficient space allocation within `Variant` or a lifetime issue.
- A minimal, reproducible example was provided which can be found on GitHub issue 4578, along with other code snippets demonstrating the bug's erratic behavior, which includes the location of print statements affecting the crash.
- Concerns with register_passable Types Arise: Concerns were voiced about using `register_passable` types that exceed the size of system registers in Mojo, as this may be causing miscompilations, because LLVM does not handle it well.
- It was suggested that the current implementation of `Variant` might be flawed for register-passable types `T` where `sizeof[T]()` is larger than any register on the system, and should be replaced with various versions of `Trivial`.
- Colab Mojo Integration launches!: It is now slightly easier to compile and run Mojo code in a Colab notebook cell via a new import, `import max.support.notebook`, which gives a `%%mojo` magic command (sketched below).
- The announcement was posted on the Modular forums.
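What the two notebook cells might look like, going only by the import path and magic name from the announcement (unverified beyond that):

```python
# Cell 1: register the %%mojo cell magic (import path from the announcement).
import max.support.notebook

# Cell 2 would then be a separate notebook cell, e.g.:
# %%mojo
# def main():
#     print("Hello from Mojo in Colab!")
```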
tinygrad (George Hotz) ▷ #general (15 messages🔥):
WebGPU bug, BEAM parameter, tinybox-ui, high performance blake3 implementation
- WebGPU backend suffers from a bug: The generated kernel does not have consecutive DEFINE_GLOBAL args, but `bufs_from_lin` assumes DEFINE_GLOBAL has consecutive args, according to this message.
- Claude allegedly managed to fix it.
- BEAM parameter impacts WebGPU performance: Setting BEAM to anything results in the WebGPU backend suffering in performance; it runs at 30ms with no beam and 150ms with BEAM=1.
- It runs at 100ms with BEAM=2 (a minimal sketch of setting BEAM appears after this list).
- Minimalist Tinybox UI concept emerges: A user built a minimalist UI concept for tinybox, with no login, no cloud, no fluff, focusing on fast, local control for people who touch hardware, which can be found here.
- It was stated that an HTTP settings page for tinybox is generally supported, with the caveat that it needs to have 0 deps and absolute minimal line count.
- Blake3 for tensor storage: A bounty exists for a high performance blake3 implementation to use for content addressable tensor storage for the cloud.
- As such, the implementation should be general purpose, "or something," according to a user.
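On the BEAM item above: a minimal timing sketch; BEAM is read from the environment, so it must be set before tinygrad is imported (the workload here is arbitrary):

```python
# Minimal sketch: compare BEAM settings on the same workload.
import os
os.environ["BEAM"] = "2"  # try "0" (off), "1", "2" and time the difference

import time
from tinygrad import Tensor

x = Tensor.rand(1024, 1024)
start = time.perf_counter()
(x @ x).realize()  # kernel search honors the BEAM setting
print(f"{(time.perf_counter() - start) * 1000:.1f} ms")
```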
tinygrad (George Hotz) ▷ #learn-tinygrad (1 messages):
cookiecrumbs3808: Or offloaded to CPU, I guess.
LlamaIndex ▷ #blog (3 messages):
LlamaIndex Memory component, LlamaExtract citation implementation
- LlamaIndex Memory Component Augments AI Agents: LlamaIndex introduces a new Memory component to enhance AI agents with both short-term and long-term memory capabilities for context-aware conversations.
- The new memory component allows developers to implement static memory blocks (link) to their chatbot agents.
- LlamaExtract Gets Citations and Reasoning: A new code walkthrough by @tuanacelik demonstrates how to implement citations and reasoning in LlamaExtract.
- The walkthrough details how to define a custom schema that instructs the LLM on what to extract from complex data sources (link).
LlamaIndex ▷ #general (6 messages):
LlamaIndex Memory Component, Memory Session Management, Database Integration for Memory, Serialization vs. Database for Context, Memory vs Redis
- Memory Component Stumper for Workflows: A user is facing issues with the new Memory component in LlamaIndex workflows, noting that the memory is empty on each workflow call when using `user_id` to set the `session_id`.
- The user also inquired about Redis integration with the Memory component.
- Memory Defaults to In-Memory DB, but DB Connection Recommended for Scalability: By default, the `Memory` component uses an in-memory SQLite database, but it can be configured to use a local SQLite database or a PostgreSQL database by changing the database URI.
- For large chat histories, using a database is recommended over serializing to a JSON blob via `memory.to_dict()` for scalability (a configuration sketch appears after this list).
- Context Serialization vs. DB for Chat History: A user questioned the benefit of using a database connection with the `Memory` component versus serializing the context, as restoring the context also restores chat history.
- The response clarified that serializing the context is fine by default, but a database is preferable for large chat histories or when a structured way to save the history is needed; choosing between a Python dict and Redis is the same tradeoff.
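Pulling the thread's advice together, a hedged sketch of the configuration being discussed; the `Memory.from_defaults` parameter names (notably `async_database_uri`) are assumptions based on the conversation, not verified API:

```python
# Minimal sketch, assuming the new LlamaIndex Memory API discussed above.
from llama_index.core.memory import Memory

# Default: in-memory SQLite, keyed by session_id; reusing the same id
# across workflow calls is what restores the same history.
memory = Memory.from_defaults(session_id="user-123", token_limit=40_000)

# For large chat histories, point at a real database instead of
# serializing via memory.to_dict() (URI and parameter name assumed).
durable = Memory.from_defaults(
    session_id="user-123",
    async_database_uri="postgresql+asyncpg://user:pass@localhost:5432/chat",
)
```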
Cohere ▷ #💬-general (3 messages):
Generation Parameters, Use cases for Cohere, Cohere vs ChatGPT and Anthropic
- Guidance on Generation Parameters Requested: A member asked for guidance on suggested generation parameters for Command A.
- Interest in Cohere's Use Cases: Others were curious about use cases for Cohere vs. other models like ChatGPT and Anthropic.
Cohere ▷ #🔌-api-discussions (5 messages):
Cohere API Calls, Cohere Billing, Cohere Trial Key
- Cohere User Asks about API Call Count: A Cohere user asked how to check the number of API calls made.
- Another user provided a link to the billing dashboard.
- Cohere Trial Key Doesn't Show Number of API Calls: A Cohere user stated that the trial key only shows tokens and not the number of API calls made.
- They added, "I don't think there is a raw number of requests being counted."
LLM Agents (Berkeley MOOC) ▷ #mooc-questions (2 messages):
Course certificate requirements, Medium article or X post for certificate
- Medium Article or X Post unlocks Course Certificate: Members clarified that earning a course certificate requires writing a Medium article or an X post summarizing one of the lectures.
- Interested members must submit their work via this form to receive credit.
- Submitting Coursework for Certificate: To get the certificate, the coursework must be submitted via the provided Google Forms link after completing a Medium article or X Post.
- The submission ensures that the work is properly credited towards the course certificate.