**Composite AI is all you need?**

AI News for 3/21/2025-3/24/2025. We checked 7 subreddits, 433 Twitters and 29 Discords (227 channels, and 10464 messages) for you. Estimated reading time saved (at 200wpm): 1129 minutes. You can now tag @smol_ai for AINews discussions!

A couple of nice updates from Qwen and DeepSeek today, but we give the title spot to a lesser-known but ambitious new entrant.

Reve, pronounced [ʀɛv], from ā€œrĆŖveā€, has emerged atop Artificial Analysis’ leaderboard as the top-rated imagegen model, displacing former SOTA Recraft. ā€œThe model stands out for its impressive text rendering, prompt adherence, and aesthetics.ā€ We found it remarkably easy to play with.

And it beats Ideogram for typography.

It’s interesting that it comes from Christian Cantrell, former VP Product at Stability, Taesung Park, and MichaĆ«l Gharbi. All are Adobe alums, and MichaĆ«l’s announcement gives the most insight into how they do it:

Reve’s mission is to invent the future of intent-driven visual creation. Capturing creative intent requires advanced machine understanding of natural language and other interactions. Turning this intent into compelling visuals calls for interactive systems that have a deep understanding of the visual world they generate, so they can iteratively amend it.

Taesung agrees:

Today’s text-to-image models are essentially that—random slice-of-the-world generators. There’s no intelligence. This is both a data and representation problem. We need to leverage the equivalent of full documents for images, but we don’t have a good representation for it. Our mission at Reve is to enhance visual generative models with logic. As the first step, we focus on understanding user intent with advanced language capabilities, resulting in superior complex prompt understanding and text writing.

There’s no suggestion that it’s a single model; it reads instead as some composite of models. This is probably what Christian wanted to build at Stability but couldn’t.


{% if medium == 'web' %}

Table of Contents

[TOC]

{% else %}

The Table of Contents and Channel Summaries have been moved to the web version of this email: [{{ email.subject }}]({{ email_url }})!

{% endif %}


AI Twitter Recap

Here’s a summary of the AI-related discussions on Twitter, categorized for a technical audience:

Model Releases and Updates, Including Performance

  • DeepSeek V3-0324 Release and Performance: @_akhaliq announced the DeepSeek-V3-0324 release on Hugging Face, @Teknium1 also noted the release, and @reach_vb highlighted it as a post-training update with potential for improved downstream performance. Several users discussed its performance and characteristics, including @teortaxesTex, who found it comparable to Sonnet 3.6 and, in a separate post, noted it surpasses DeepSeek-R1 and Claude-3.7 in some evaluations.
  • Qwen 2.5-VL-32B-Instruct Release: @_akhaliq announced the release of Alibaba’s Qwen2.5-VL-32B-Instruct on Hugging Face, and @reach_vb shared performance benchmarks indicating it beats Qwen 2.5 72B and GPT 4o Mini on vision tasks, with enhanced mathematical reasoning and human preference alignment.
  • DeepSeek Model Serving: @_akhaliq noted that DeepSeek’s new model is served on Hugging Face via Hyperbolic Labs, and @ClementDelangue mentioned it’s available via FireworksAI and Hyperbolic Labs. @Yuchenj_UW stated that Hyperbolic Labs now serves DeepSeek-V3-0324.
  • DeepSeek V3-0324 on MLX: @reach_vb reported that the latest DeepSeek V3-0324 runs at >20 toks/sec on a 512GB M3 Ultra with mlx-lm, and @awnihannun confirmed the same (a usage sketch follows this list).
  • NVIDIA Mamba Image Backbones: @mervenoyann announced NVIDIA’s release of new Mamba image backbones on Hugging Face, available in various sizes and resolutions.
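
For anyone who wants to try this locally, here is a minimal generation sketch using mlx-lm’s Python API. The quantized repo name below is our assumption (point it at whatever DeepSeek V3-0324 conversion you actually use), the exact generate() signature varies slightly across mlx-lm versions, and even 4-bit weights for a 685B model need hundreds of GB of unified memory.

```python
# Minimal mlx-lm generation sketch (Apple Silicon / MLX only).
# The repo name is an assumption; substitute your local conversion.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-V3-0324-4bit")

prompt = "Explain mixture-of-experts routing in two sentences."
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(text)
```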

Frameworks and Tools

  • LangChain and LangGraph Use Cases: Multiple tweets highlighted use cases of LangChain and LangGraph, including Vodafone’s AI assistants for data operations @hwchase17, Klarna’s AI assistant for customer support @LangChainAI, and a medical supply chain AI system @LangChainAI. @hwchase17 also mentioned context management in langgraph.
  • Weave-Agent Planner Discussion: @jd_pressman discussed the design and planning of Weave-Agent, considering approaches like ReActTree and MuZero for agentic planning.
  • Smolagents Growth: @AymericRoucher announced that smolagents has reached 15k GitHub stars and is integrating sandboxed code execution via E2B or Docker.
  • Together Chat: @togethercompute introduced Together Chat, featuring OSS models like DeepSeek R1 for web search, coding, image generation, and image analysis, and @togethercompute listed the tech stack.

Agent Engineering and Applications

  • Agent Engineering Talk and Essay: @swyx shared a talk and essay on Agent Engineering, defining agents, outlining six elements, and discussing their potential impact.
  • Linear and Codegen Integration: @mathemagic1an announced Codegen’s integration with Linear, enabling agents to solve tickets and close duplicates, and highlighted Linear’s expanded capabilities for bots @mathemagic1an.
  • Evaluation Metric for Agents: @_philschmid advocated for using pass^k instead of pass@k for evaluating agents, arguing it provides a more accurate performance metric aligned with user experience.
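
To see why this distinction matters, here is a toy comparison under the simplifying assumption that each attempt succeeds independently with probability p: pass@k asks whether any of k attempts succeeds, while pass^k asks whether all of them do.

```python
# Toy sketch: pass@k vs pass^k under i.i.d. attempts with success prob p.

def pass_at_k(p: float, k: int) -> float:
    """P(at least one of k attempts succeeds) = 1 - (1 - p)^k."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """P(all k attempts succeed) = p^k, i.e. consistent reliability."""
    return p ** k

p, k = 0.8, 5
print(f"pass@{k} = {pass_at_k(p, k):.3f}")   # ~1.000: looks near-perfect
print(f"pass^{k} = {pass_hat_k(p, k):.3f}")  # ~0.328: fails most 5-run gauntlets
```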

Economic and Strategic Implications

  • AI Automation and Economic Growth Model: @EpochAIResearch discussed GATE, a model for AI automation’s economic impacts, predicting trillions in AI investments, extreme compute scaling, and significant economic growth.
  • US-Japan Defense Innovation Award: @SakanaAILabs announced that Sakana AI won an award at the US-Japan Competition for Defense Innovation for novel AI solutions.
  • Perspectives on China and AGI: @teortaxesTex shared multiple opinions on China’s technological and strategic advantages, including its state capacity, industrial base, and AGI efforts. @teortaxesTex also touched on DeepSeek’s ā€œcommoditize your complementā€ theory.

ARC-AGI Benchmark

  • ARC-AGI-2 Release and Competition: @fchollet announced the release of ARC-AGI-2, a benchmark designed to measure general fluid intelligence, and the ARC Prize 2025 competition with a $700,000 grand prize @fchollet. He noted that current top AI approaches score very low, requiring test-time adaptation, and discussed the evaluation methodology @fchollet.

Humor and Memes

  • Coding by Vibes: @gneubig shared a tweet about prompting to improve vibe coding, distinguishing between coding by vibes for personal projects versus agent behavior.

AI Reddit Recap

/r/LocalLlama Recap

Theme 1. DeepSeek V3-0324: Performance and Expectations vs R1

  • Deepseek releases new V3 checkpoint (V3-0324) (Score: 638, Comments: 125): DeepSeek released its new V3 checkpoint (V3-0324), which likely includes updates and improvements over previous versions. Further details on specific features or enhancements are not provided in the post.

    • Discussion on the DeepSeek-V3 checkpoint (V3-0324) includes speculation about its use as a base for a future R2 release, with some users anticipating it to arrive in April. There is a debate on whether V4 is necessary for R2, with arguments suggesting that improvements can be achieved through better scaling and reasoning techniques without a new base model.
    • Users are seeking benchmark results to compare the new model’s performance, with some noting that no official benchmarks have been released yet. Independent tests are expected soon due to the open-source release of the weights, and there is a call for DeepSeek to release their own benchmarks similar to Mistral.
    • There are observations about the model’s coding skills improvement and its deployment on both API and web platforms, with some users noting a more censored version compared to the original. The MTP module is highlighted for its role in enhancing decoding speed, achieving 1.8 times TPS, as detailed in a research paper.
  • New deepseek v3 vs R1 (first is v3) (Score: 282, Comments: 56): The image compares two versions of DeepSeek user interfaces: V3 and R1. V3 showcases a more dynamic design with animated weather cards for ā€œWindy,ā€ ā€œRainy,ā€ ā€œSunny,ā€ and ā€œSnowy,ā€ while R1 offers a simpler interface with toggle buttons for ā€œWind,ā€ ā€œRain,ā€ ā€œSun,ā€ and ā€œSnow,ā€ each represented by a single icon.

    • DeepSeek V3 and R1 interfaces are being compared, with V3 offering animated weather cards and R1 featuring simpler toggle buttons. Users are curious about which model corresponds to each interface and the prompts used for the comparison.
    • There is a preference for open-source models over proprietary ones due to cost and flexibility, despite DeepSeek models not being the cheapest. Sonnet is noted to be significantly more expensive than V3, especially during off-peak hours.
    • The discussion includes references to command-a running locally, with links provided for further exploration, such as the Hugging Face model and a GIF showcasing the interface. Users express interest in more dynamic content, like videos, to better understand the animated features.
  • DeepSeek V3-0324 has caught up to Sonnet 3.7 in my code creativity benchmark - ā€œWrite a raytracer that renders an interesting scene with many colourful lightsources in python.ā€ (Score: 215, Comments: 43): DeepSeek V3-0324 has matched Sonnet 3.7 in a code creativity benchmark involving a raytracer task in Python, demonstrating significant improvement over its previous version. The benchmark revealed that while most LLMs generated simple RGB scenes, Sonnet 3.7 and now DeepSeek V3-0324 produced more complex and aesthetically pleasing scenes, though the method for this creativity boost remains speculative. More details and data are available in the GitHub repository.

    • DeepSeek V3-0324 is noted for its ā€œpsychotic taste,ā€ resembling reasoning models like R1 or QwQ more than its predecessor, and has faced criticism for its creative writing outputs, which some users find incoherent despite high benchmark scores. Gemma 3 is highlighted for its coherence and creativity in fiction, contrasting with R1’s often criticized outputs.
    • R1 failed in the benchmark by not producing a functioning program, despite attempts, which raises questions about its effectiveness compared to older versions of DeepSeek V3. The discussion suggests that R1’s long chains of thought (CoT) do not guarantee successful outputs, unlike previous versions of DeepSeek.
    • The increase in program size for DeepSeek V3-0324 and Sonnet 3.7 is noted, with speculation about whether this is due to training for longer generation lengths or other optimizations. Generating 10kB of code in a single attempt is considered significant, indicating potential advancements in model capabilities.

Theme 2. Meta’s ParetoQ Explored: Promise of 2-bit Models

Theme 3. Expanding LLM Functionalities: From Text to Multimodal

  • I made a diagram and explanation of how transformers work (Score: 272, Comments: 20): LLM functionalities are expanding beyond text, and a user has created a diagram and explanation to illustrate how transformers function. This effort aims to provide a clearer understanding of the internal mechanisms of transformers for those interested in AI and machine learning.

    • Input and Output Embeddings: There is a discussion on whether input and output embeddings are still linked in modern transformer architectures, with users noting the difficulty in obtaining a comprehensive and current overview of these architectures.
    • Resources and Diagrams: Several users shared resources to aid in understanding transformers, including a detailed explanation by Cromulent123 and a link to a GitHub page with relevant diagrams (GitHub Llama Nuts and Bolts). Another user highlighted a conceptual guide on transformers available on Ben Levinstein’s Substack.
    • Detailed Explanation on Transformer Functionality: Cromulent123 provides an in-depth explanation of how transformers work, focusing on the process of token embedding, the role of Query, Key, and Value Matrices, and the concept of attention scores in determining relevance. They also discuss the importance of contextual enrichment through multiple transformer blocks, emphasizing the nuanced understanding of token relationships (a minimal attention sketch follows this list).
  • I don’t understand what an LLM exactly is anymore (Score: 233, Comments: 89): The author is confused about the expanding definition of Large Language Models (LLMs), originally understood as systems predicting the next word based on pretrained weights from text data. They question how LLMs now encompass capabilities like audio and image generation, and cite SpatialLM, which processes 3D point cloud data, as an example of this broadening scope, seeking clarification on the connection to language models.

    • Diffusion Models and LLMs: There is a debate on whether models like Stable Diffusion qualify as LLMs since they incorporate T5 for understanding text prompts, though they primarily generate images. Co0k1eGal3xy argues that such models are close to LLMs because of their advanced language understanding, despite not traditionally fitting the LLM category.
    • Tokenization and Multimodal Models: suprjami explains that all data, including text, images, and audio, is tokenized into numbers for LLMs to process, which allows them to learn relationships between different media types. Chair-Short details how self-attention mechanisms and positional encoding enable LLMs to handle different data modalities, suggesting a shift from purely text-focused models to multimodal capabilities.
    • Defining LLMs: Discussions highlight the blurred lines in defining LLMs, with some viewing them as large models capable of processing and generating language, regardless of the input type. SnackerSnick mentions that LLMs use tokenization and embeddings to predict subsequent tokens, while Otherwise_Marzipan11 and Co0k1eGal3xy suggest that branding and interaction with language, whether text, audio, or images, contribute to the LLM label.
  • Possible Llama 4 prototypes on Chatbot Arena (Score: 105, Comments: 21): MetaAI is testing several anonymous Llama/Meta models on Chatbot Arena, potentially as prototypes for Llama 4. Models like aurora, ertiga, pinnacle, solaris, and spectra are image-enabled, while rhea is identified as Llama 3.

    • Discussions reveal skepticism about model identities on Chatbot Arena, as some models, like anonymous-chatbot, claim to be from OpenAI, while others like rage and phantom are suspected to be Meta models. Users note that these models often provide inconsistent company affiliations, potentially due to a guard model or hallucinations.
    • The anonymous-chatbot and nebula models are highlighted for their performance, with nebula being particularly praised for excelling in tests, while models like rage and rhea received mixed feedback, with rhea noted for its friendly demeanor and emoji use.
    • There is a debate about whether any models are actually Llama 4, with users noting that none explicitly identify as such. Some comments suggest that Meta might be testing diverse writing styles or using randomized system prompts to obscure the true origin of the models.
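
As a companion to the transformer explanations above, here is a minimal single-head scaled dot-product attention in NumPy. It is a didactic sketch of the Query/Key/Value mechanics described in that thread, not an optimized implementation.

```python
# Single-head scaled dot-product attention, written for readability.
import numpy as np

def attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # project each token
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # contextually enriched tokens

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                         # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)               # (4, 8)
```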

Theme 4. TeapotLLM’s Impact: Lightweight Q&A Models

  • Announcing TeapotLLM- an open-source ~800M model for hallucination-resistant Q&A and document extraction, running entirely on CPU. (Score: 163, Comments: 50): TeapotLLM is an open-source model designed for hallucination-resistant Q&A and document extraction, featuring an approximate 800 million parameter architecture. It is optimized to run entirely on CPU, making it accessible for broader usage without the need for specialized hardware.
    • TeapotLLM’s Hallucination Resistance: Discussion highlights the model’s focus on hallucination resistance and its performance against models like Qwen and Llama, with some skepticism expressed about claims of reduced hallucination. Users are curious about its placement on hallucination leaderboards, and a demo is available for testing.
    • Model’s Language and Output Capabilities: The model is trained primarily in English, but theoretically supports all languages covered by flan-t5. It can extract structured data into JSON using a library that parses fields into typed JSON, as detailed in the documentation, though there is interest in expanding language support and testing on platforms like ollama.
    • Performance and Resource Usage: TeapotLLM is optimized for CPU usage, fitting within approximately 2GB of RAM on Google Colab, making it accessible for users with limited compute resources. There is interest in exploring fine-tuning on more modern models like Qwen 0.5B to potentially enhance performance, while maintaining the current model’s strengths in document extraction and concise responses.

Other AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding

Theme 1. New Improved Memory Alpha in ChatGPT Enhances Interaction

  • New improved memory alpha is insane (Score: 414, Comments: 241): The post discusses the new improved memory alpha feature in ChatGPT, comparing its impact to the leap from GPT-2 to GPT-4. The author expresses skepticism about DeepSeek’s ability to compete unless they adopt similar advancements, expressing confidence in OpenAI’s continued leadership.
    • Many users express frustration and confusion over the availability and inconsistency of the new memory alpha feature in ChatGPT, with some losing access unexpectedly despite having pro subscriptions. CyberNoche and jalpseon highlight deactivation issues, while alpha_rover and DamionPrime share positive experiences with memory persistence.
    • The discussion touches on the pricing of ChatGPT subscriptions, with Initial-Kangaroo-534 questioning the value of paying $200 per month. This is contrasted by alpha_rover, who finds the feature invaluable for project continuity and would miss it compared to other AI tools.
    • Some commenters like 3xNEI and SillyTwo3470 speculate on the broader implications of memory features, suggesting it could lead to human-AI hybridization. They emphasize the potential for increased personalization and the blurring of lines between tool and partner, indicating a significant shift in how users might interact with AI.

Theme 2. Anthropic’s Revenue Surge Matches OpenAI’s 2023 Numbers

  • Anthropic is making about $115M a month now; same as OpenAI in Nov 2023 (Score: 272, Comments: 50): Anthropic is reportedly generating $115M per month, matching OpenAI’s revenue in November 2023. Revenue projections for 2025 estimate $2B as likely and $4B as optimistic, with Manus contributing approximately $2 per task to their revenue. An image depicts a 40% increase in annualized revenue from December 2024 to March 2025, with figures from the Bay Area Times.
    • Claude’s Impact and Usage: Users highlight Claude Code as a game-changing tool, with some spending $50 per day on it due to its effectiveness in automating coding tasks. Alternatives like AIDER and Cursor’s Agent are mentioned but are deemed less effective compared to Claude, which is described as being akin to having a competent intern.
    • Revenue Sources and Context: A significant portion of Anthropic’s revenue is attributed to integration with AWS Bedrock, with expectations of continued growth due to widespread enterprise adoption. The discussion clarifies that the reported figures represent revenue, not profit.
    • Model Comparisons and Preferences: Users compare various AI models, noting that Claude offers superior performance despite smaller context windows in some cases. The OG 600b model and Sonnet 3.7 are mentioned, with the latter praised for its smart capabilities and iterative problem-solving.

Theme 3. AI-Driven Bug Fixing Automation: A 27-Day Experiment

  • I made AI fix my bugs in production for 27 days straight - lessons learned (Score: 191, Comments: 80): Over 27 days, the author used Claude 3.7 to automatically fix 21 unique production bugs, resulting in 12 successful one-shot fixes, 6 partial successes, and 3 failures due to incorrect assumptions or complex issues. Despite the initial time investment exceeding manual bug fixing, the system reduced cognitive load and context switching, though it may not suit niche or complex problem domains.
    • Interest in Open Sourcing: There is significant interest in the project being open-sourced, with Relevant-Pitch-8450 expressing intent to share it after some cleanup. Users appreciate the UI design and see potential utility in the tool.
    • Potential Commercialization: Commenters like ClassyBukake suggest that the tool could be monetized as a service, highlighting its appeal from both personal and business perspectives.
    • Cost and Time Efficiency: HelpRespawnedAsDee raises questions about the tool’s cost and time efficiency over an extended period, suggesting continued use to evaluate long-term benefits.

Theme 4. Advanced Claude Workflow Integration: MCP External Tools

  • My Claude Workflow Guide: Advanced Setup with MCP External Tools (Score: 124, Comments: 20): The post provides a detailed guide for setting up Claude’s desktop application with external tools like Brave Search and Tavily to enhance its capabilities, requiring a Claude Pro subscription ($20/month) and specific software installations like Node.js and Python. It includes configuration examples for both Windows and macOS, instructions for accessing developer settings, and troubleshooting tips for installation and setup issues. The guide emphasizes the benefits of enhanced web search, filesystem access, and sequential thinking, and provides additional resources and security considerations for effective use.
    • Claude’s desktop application setup is praised for its accessibility to non-developers, providing a bridge for regular desktop users to enhance Claude’s capabilities without coding skills. The guide is compared to Claude Code, which offers more flexibility for tech-savvy users comfortable with command line interfaces.
    • A tutorial for Claude Code is recommended for those interested in exploring its capabilities, available on YouTube. This highlights the distinction between the two approaches: one prioritizing ease of use and the other, advanced customization.

Theme 5. Wan 2.1 Video Frame Feature Innovations in AI

  • Wan-i2v - Prompt: a man throws a lady overboard from the front of a cruiseship. (Score: 812, Comments: 51): Wan-i2v AI has introduced new features and advancements, as demonstrated in a prompt scenario where ā€œa man throws a lady overboard from the front of a cruiseship.ā€ While the post does not provide further details, it suggests a focus on action-oriented scenarios or potentially controversial themes in AI-generated content.

    • The Wan-i2v AI is discussed as an image-to-video tool, with some users noting that it couldn’t independently create a starting frame from the Titanic movie, implying a direct screenshot was used instead. This highlights the potential limitations of AI in generating entirely original content without reference images.
    • Users humorously critique the AI’s understanding of physics, with comments suggesting that while AI may not currently grasp physical laws, advancements such as Stable Diffusion and Wan2.1 are rapidly improving in simulating realistic physics in animations, such as ā€œboob jiggles.ā€
    • The conversation also touches on the idea of AI-generated alternate movie endings, with users joking about creating new endings for films like Titanic. This raises questions about copyright issues and the potential for new YouTube channels focused on AI-crafted content, despite the challenges of intellectual property rights.
  • Wan 2.1 begin and ending frame feature having model coming officially (Score: 100, Comments: 13): Wan 2.1 is set to release an official model that supports start and end frames interpolation soon, as confirmed by user ā€œdanielzy1990ā€ on a social media platform. For more details, refer to the GitHub issue comment.

    • Users anticipate that Wan 2.1’s new model will significantly enhance video control, with some expressing hope for improvements such as adding a guidance layer similar to Hunyuan to speed up generation times.
    • Comparisons to Hunyuan highlight its efficiency, generating video clips at 24fps in nearly half the time it takes Wan to generate at 16fps, emphasizing the potential benefits of guidance training.
    • There is interest in the model’s capability to support multiple timed keyframes, with some users hoping it remains compatible with existing img2vid functionalities.

AI Discord Recap

A summary of Summaries of Summaries by o1-preview-2024-09-12

Theme 1. DeepSeek V3’s Surprise Launch Shakes AI Community

  • DeepSeek V3 Emerges as Open-Source Giant: DeepSeek released DeepSeek V3, a 685B-parameter mixture-of-experts model under the MIT license, accessible on Hugging Face. The community is excited, comparing it to OpenAI’s o1 models in performance.
  • DeepSeek V3 Outperforms R1?: Users claim DeepSeek V3 beats R1 in coding and front-end tasks, even without chain-of-thought reasoning, noting its cost-effectiveness and excellence in math.
  • DeepSeek V3 Drops Without a README!: DeepSeek releases DeepSeek V3 without proper documentation, leaving users both amused and perplexed by the lack of a README, but offering a playground for experimentation.

Theme 2. Qwen Models and Upcoming AI Innovations

Theme 3. Debates and Advances in LLM Reasoning Training

Theme 4. Agent Engineering and MCP Developments

Theme 5. NVIDIA’s Nemotron-H Models and Hardware Advances


PART 1: High level Discord summaries

Perplexity AI Discord

  • Sonar 3.7 Bug kicks model: A user reported a bug with Sonar 3.7 where a chown command kicks the model out and breaks the conversation while coding, and wondered whether there was any difference in performance between high and low source amounts, or in reasoning quality between search steps.
    • A user followed up noting that in their experience, the difference is quite large, sharing a screenshot here.
  • Sonar Model Gives Cropped Snippets: Multiple users reported that the Sonar model in the Perplexity API is truncating responses, particularly since the weekend, even though the JSON format is correct.
    • A user provided an example of a JSON request and the truncated response, noting that switching to sonar-pro resolves the issue but is not preferable for cost reasons (a request sketch follows this list).
  • Llama Index Wrestles with Sonar: A user encountered an error when configuring Sonar as a chat engine with Llama Index for a RAG project and requested assistance.
    • This highlights potential integration challenges when using Sonar in conjunction with other AI development tools.
  • Deep Research Rate Limit: A user inquired about the possibility of extending the limit of 100 deep researches per minute due to bulk processing needs in their application.
    • This inquiry underscores the demand for higher API usage limits for users with demanding workloads.
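
For anyone debugging the truncation reports above, here is a minimal Sonar request sketch. The endpoint and payload follow Perplexity’s OpenAI-compatible chat completions API; max_tokens is one parameter worth checking when responses come back cut off.

```python
# Hedged sketch of a Sonar chat completion request via the
# OpenAI-compatible Perplexity API.
import requests

resp = requests.post(
    "https://api.perplexity.ai/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "sonar",  # swap in "sonar-pro" to compare truncation behavior
        "messages": [{"role": "user", "content": "Summarize today's AI news."}],
        "max_tokens": 1024,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```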

Unsloth AI (Daniel Han) Discord

  • Bonsai Bitnet Seeks Testers for Qwen2.5 Comparison: A member is looking for testers for deepgrove/Bonsai, asking how the bitnet compares to Qwen2.5 0.5B.
  • Orpheus TTS Model Gains Audio Finetuning: Audio finetuning has arrived with the Orpheus TTS model, according to a newly released Unsloth notebook.
    • A user noted that the work was all done by a particular member and that the notebook is a lot more streamlined compared to local audio tokenizing and then regular Llama3 finetuning.
  • Straight PRs OK on Unsloth Github, but wait: A member inquired about contributing to Unsloth’s GitHub, and another member confirmed that straight PRs are acceptable, though potential delays may occur due to the high volume of recent PRs and issues.
    • The discussion then shifted to modifying data preparation steps in Colab to accommodate .txt files, aiming for cheaper inference, and the original issue was linked.
  • GRPO Reasoning Needs Training Data: A user asked about training only parts of the output, specifically wanting the model to generate its own reasoning during inference.
    • It was suggested to look at the GRPO notebooks as a standard way of adding reasoning, and that the model must see reasoning traces during training to take it into account during inference.
  • Unsloth’s Fine-Tuning Guide Now Available: A member created a guide for fine-tuning with Unsloth, covering theoretical aspects, practical examples, and how to create a reasoning model with GRPO.
    • The guide compiles everything learned over the last year.

LMArena Discord

  • Nebula Steals Chatbot Spotlight: Members found Nebula, an anonymous chatbot suspected to be from DeepMind, to be really good and the best anonymous model rn, outperforming others in math, English-Turkish translation, and solving Arc-AGI problems.
    • It seems similar to Phantom, which users identified as a Google model, with both being tested in the arena.
  • GPT-4o Gets Human Alignment Boost: GPT-4o has significantly improved through OpenAI’s post-training, potentially surpassing Grok 3 soon, due to continued pretraining since December.
    • Speculation suggests it might top the leaderboard, leveraging OpenAI’s proficiency in human preference alignment in the LM arena.
  • Specter Evolves into Phantom then Nebula: Specter, Phantom, and Nebula are revisions of the same model, in that order, showing performance jumps in a few weeks.
    • Members noted a more significant performance jump from Specter to Phantom compared to Phantom to Nebula.
  • LMArena Fixes Bugs, Tunes Leaderboard: The LMArena alpha received updates including bug fixes and new features, and testers are encouraged to continue testing at alpha.lmarena.ai with the password still-alpha.
    • A bug preventing messages from saving and causing vote failures has been fixed, and leaderboard columns are now sortable with live data updates; feedback can be provided via this Google Forms link and bug reports can be filed using this Airtable link.

Cursor Community Discord

  • Cursor’s CMD+Backspace becomes problematic: Users express frustration with Cursor’s CMD+Backspace leading to accidental project deletions, with some losing work up to 7 times.
    • The Cursor team plans to change the default keybinding to CMD+Shift+Backspace, with configuration options, targeting a Monday rollout.
  • Claude 3.7 MAX hits users’ pocket: Claude 3.7 Thinking, now Claude 3.7 MAX, moves from the Pro plan to usage-based pricing, causing user frustration due to increased costs.
    • Claude 3.7 MAX features a higher context window and more tool calls compared to the standard Claude 3.7 Sonnet.
  • Windsurf Surfing Ahead in Responsiveness: Some users find Windsurf faster and more responsive than Cursor, citing Cursor’s lagging and freezing.
    • Others prefer Cursor for its rollback features and agent performance, though acknowledge AI programming’s remaining challenges.
  • MCP Combinations become hype: Users experiment with various MCP (Model Context Protocol) server combinations to enhance AI coding agents like Cursor, with Supabase MCP highlighted.
    • Some users suggest MCPs may be overhyped, noting instances of agents over- or under-utilizing MCPs, suggesting a need for clearer instructions.
  • 3D Integration Frustrates AI Coders: A user struggles to integrate a 3D model (FBX format) into a three.js project using Claude, facing issues with FBXLoader.
    • The limitations of AI in handling 3D designs become clear, with suggestions to switch to GLTF format and simplify tasks.

aider (Paul Gauthier) Discord

  • DeepSeek V3-0324 Beats R1?: The Aider community is excited about the new DeepSeek V3-0324 release, suggesting it outperforms R1 in coding and front-end tasks, despite lacking chain of thought.
    • Members highlight its strengths in coding and math compared to previous versions, drawing comparisons to Sonnet 3.5 in benchmarks, while also noting its cost-effectiveness.
  • Aider Tames Sonnet’s Over-Eagerness: Paul Gauthier reveals he has managed to get Aider to mitigate Sonnet 3.7’s over-eager behavior by adding a line to the prompt telling it to chill out; this is now available in the main branch.
    • He encourages users to provide feedback on this adjustment based on their coding sessions.
  • Aider Gains New Homepage: Paul Gauthier announces the launch of Aider’s new homepage at aider.chat, showcasing compatibility with models like Claude 3.7 Sonnet, DeepSeek R1 & Chat V3, OpenAI o1, o3-mini & GPT-4o, and support for over 100 code languages.
    • This update offers an improved introduction for new users and a central hub for resources.
  • Aider’s Context Command Streamlines Chats: Paul Gauthier introduces an experimental /context command in Aider that automatically sets up the chat context, working best with Sonnet 3.7, R1, and o3-mini.
    • This new command enhances user experience by intelligently identifying and adding relevant files to the chat.
  • Community Curates LLM Contexts: A member announces the launch of ctxs.ai/weekly, a site dedicated to collecting aider conventions, prompts, and LLM-oriented documentation snippets.
    • The goal is to create a useful resource for the aider community, and the member is actively soliciting feedback on how to improve the site.

Nous Research AI Discord

  • LCPP Context Length Baffles: Users found that LCPP still tries to allocate 180GB of RAM even with the context length set to 100, leading to VRAM exhaustion.
    • Suggestions include Attention overriding the assigned context length, missing ROPE-specific arguments, or using Q8 quantization.
  • Deepseek V3 Mirrors Sonnet 3.7: Deepseek V3 0324 shows as much variation as Sonnet 3.7, suggesting shared advancements in their architectures, viewable in this image.
    • One user even called it a huge update with Sonnet-level code creativity and a potential base for R2.
  • Transformers Ditch Normalization: Inspired by the Transformers without Normalization paper, a member replaced normalization with tanh (a DyT sketch follows this list).
    • The discussion then focused on removing experts at inference and its effects on smaller weights.
  • MathFusion Supercharges LLM Math: MathFusion improves mathematical reasoning in LLMs via cross-problem instruction synthesis, enhancing models like DeepSeekMath-7B, Mistral-7B, and Llama3-8B (more on MathFusion).
    • This method creates the MathFusionQA dataset, which fine-tunes models and boosts benchmark accuracy with minimal extra data.
  • Qwen3 to support CPU inference: The transformers library PR#36878 adds Qwen3 support, meaning the models will soon be usable through the transformers library.
    • A user speculated that Qwen3-15B-A2B could be a good candidate for CPU inference due to its size.
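
For reference, the Transformers without Normalization paper replaces each normalization layer with a Dynamic Tanh (DyT) unit: a learnable elementwise tanh standing in for LayerNorm/RMSNorm statistics. The sketch below follows that description; treat it as an illustration rather than the reference implementation.

```python
# Dynamic Tanh (DyT) sketch: gamma * tanh(alpha * x) + beta, with a
# learnable scalar slope and per-channel affine parameters.
import torch
import torch.nn as nn

class DyT(nn.Module):
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))  # scalar slope
        self.gamma = nn.Parameter(torch.ones(dim))           # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))           # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No batch or feature statistics are computed, unlike LayerNorm.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

x = torch.randn(2, 16, 512)
print(DyT(512)(x).shape)  # torch.Size([2, 16, 512])
```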

OpenAI Discord

  • Sam Altman Teases GPT-5 Release: Despite the absence of an official announcement, Sam Altman confirmed that GPT-5 will launch this year, leading to speculation it could arrive in the first half to compete with R2 or Llama-4.
    • Members on the OpenAI Discord server suggested that an unannounced API might also be imminent.
  • GPT-4o: The Model That Converted a User: A user finds GPT-4o to be such a strong daily driver that they rarely switch models, only using other models such as 4.5, o1, o3 when the 4o messages run out or for important or unsolved problems.
    • The user also claims to have built an ā€œengineā€ that recovered a 400+ turn chat and continues past 500 turns retaining context with no drift or hallucinations, all through the default prompt.
  • Many-Shot Prompting Boosts Multimodal Model Muscle: A research paper (MANY-SHOT IN-CONTEXT LEARNING IN MULTIMODAL FOUNDATION MODELS) suggests that closed models like GPT-4o and Gemini 1.5 Pro benefit significantly from many-shot demonstrations up to ~2,000 examples, whereas open-weight models do not show the same benefit.
    • The paper notes that large multimodal foundation models like GPT-4o and Gemini 1.5 Pro show significant performance improvements when provided with many-shot demonstrations compared to few-shot examples.
  • Run an F1 Team Powered by GPT-4o: The open source project FormulaGPT (github repo) simulates head-to-head races between LLM-powered teams that think contextually and adaptively by continuously reasoning, strategizing, and making nuanced decisions.
    • Viewers can challenge advanced language models in Player vs. AI Mode, or watch the best AI models battle each other in AI vs. AI Mode while observing detailed AI reasoning behind each pit stop, tire change, or overtaking maneuver.
  • Avoid Turnitin AI Detector, If You Dare: A member sought advice on avoiding Turnitin AI similarity detection for a report reusing their company’s business model, which violates Turnitin’s ToS.
    • Others suggested it looked like spamming appeals to cheat homework and recommended using humanize AI tools.

OpenRouter (Alex Atallah) Discord

  • OpenAI’s o1-pro: Gucci-Level Pricing?: Users reacted strongly to OpenAI’s o1-pro API pricing at $150/M input tokens and $600/M output tokens, with one calling it GucciAI due to its high cost.
    • Another member joked that the API’s slowness might be a deliberate feature to prevent overspending given compute constraints.
  • Image Generation MIA on OpenRouter: A user inquired about using Gemini’s image generation with the gemini-2.0-flash-exp model, but was informed that image generation is not yet supported on OpenRouter.
    • The team indicated that while image generation is on their roadmap, there are currently no short-term plans to support image models like Flux.
  • Lambda Endpoints Plagued by 404s: Multiple users reported encountering 404 ā€˜no endpoint found’ errors when attempting to use Lambda models, despite Lambda’s status page showing full operational status.
    • The community offered suggestions, and some users confirmed that the Llama 3.3 70B Instruct | Lambda model was functioning correctly for them.
  • DeepSeek R1 challenges OpenAI o1: Members noted that the DeepSeek R1 model, a 671B parameter model with 37B active during inference, performs comparably to OpenAI’s o1 but is open-sourced and available under the MIT license.
    • Its availability under the MIT license allows for commercial use.
  • Claude 3.7 Sonnet Sputters with Overload Errors: Users reported frequent overload errors when using Claude 3.7 Sonnet, leading to cut-off responses and charges for input tokens.
    • One user suggested a retry strategy or switching to Gemini 2.0 Pro as an alternative, acknowledging Claude’s strength in translations.

LM Studio Discord

  • LM Studio Lacks NPU Support: Users have reported that NPUs are not yet supported in LM Studio, but Ryzen AI support exists in version 0.3.11.
    • For those with limited resources like 2GB VRAM, consider using Gemma 3 1B with Q6 or Q8 quantization and the CUDA runtime for improved performance.
  • KV Cache Quants Slash VRAM Needs: Users recommend leveraging KV cache 8-bit quants to diminish memory footprint when operating models with extensive context windows, like 30k tokens (a sizing example follows this list).
    • Keep in mind that 12GB of VRAM might prove inadequate for a 32B model, suggesting that Phi-4 or Qwen2.5 14b could serve as compelling alternatives.
  • Multi GPU Gets In-App Management: Enthusiasts are raving about LM Studio controls that allow the user to select the GPU that the model will load onto, available in the latest beta build.
    • Multiple users confirmed that Multi GPU is supported out of the box with the latest beta build of LM Studio.
  • Google Coral TPUs a Flop for AI: The Google Coral dual TPU is inadequate for AI use as it does not have any onboard memory to store data.
    • One user with an 8060s also inquired about thermal and power headroom for the Framework Desktop.
  • 4060ti: Inexpensive Inference Sweet Spot: The RTX 4060 Ti with 16GB of VRAM stands out as a budget-friendly pick for AI inference, clocking in around $500 USD/EUR.
    • A user cautioned that AMD cards are not as well optimized for AI workloads, and that the 5000 series from Nvidia may melt.
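
For a rough sense of the savings from KV cache quantization mentioned above, here is a back-of-envelope sizing calculation. The layer/head numbers are illustrative (loosely 14B-class); check your model’s config for the real values.

```python
# Back-of-envelope KV cache sizing: 2 (K and V) x layers x KV heads
# x head_dim x context length x bytes per value.
def kv_cache_gib(layers, kv_heads, head_dim, ctx, bytes_per_val):
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_val / 2**30

# Hypothetical 14B-class config: 48 layers, 8 KV heads, head_dim 128.
for name, b in [("fp16", 2), ("q8", 1)]:
    print(name, round(kv_cache_gib(48, 8, 128, 30_000, b), 2), "GiB")
# fp16 -> ~5.49 GiB, q8 -> ~2.75 GiB at a 30k-token context
```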

Yannick Kilcher Discord

  • VPN code hijacks OpenAI site?: Users reported seeing <veepn-guard-alert> and <veepn-lock-screen> tags on OpenAI’s website, suggesting a VPN injection, but it was likely code injected by their own VPN extension.
    • It appears that this user was simply using a VPN.
  • cuOpt Solves Linear Programming at NVIDIA: NVIDIAĀ® cuOptā„¢ is a GPU-accelerated optimization AI microservice that excels in Mixed Integer Linear Programming (MILP), Linear Programming (LP), and Vehicle Routing Problems (VRP) according to docs.nvidia.com.
    • It appears this microservice is well received and performant at NVIDIA.
  • CUDA Python is the new black?: Members debated whether it is truly the year of CUDA Python as mentioned by blelbach on X, with some asserting that Python is sufficient for GPU programming.
    • Others mocked modern Python programmers, linking a YouTube video titled Modern Python Programmers.
  • MoEs Training Stabilizes?: One user claimed that MoEs are unstable to train, but another user countered that they haven’t been unstable to train for two years and are now about the same as dense networks.
    • The stability is largely due to better kernels and dropless token routing, solving issues like numerical instability and expert collapse.
  • DeepSeek-V3 quietly drops: Members noted that DeepSeek released their DeepSeek-V3-0324 model, and a blog post reused their diagrams.
    • The model boasts 685B parameters and offers various tensor types like BF16, F8_E4M3, and F32, with links to finetunes and quantizations.

GPU MODE Discord

  • Flash Attention (FA) Debugging: In a discussion about understanding Flash Attention, a member suggested coding and profiling/debugging it directly, noting that hands-on implementation had helped them understand normal attention and could do the same for Flash Attention.
    • One member ran into issues implementing Flash Attention 1 in triton: it works with TRITON_INTERPRET=1 but it has a few elements mismatched on cuda. After increasing rtol & atol the tests passed.
  • RTX 5080 Gets CUDA 12.8: A developer released a patch enabling full CUDA 12.8 + PyTorch 2.5.0 compatibility with the Blackwell / sm_120 architecture for the RTX 5080, providing a GitHub repo with scripts, diffs, and instructions.
    • It’s also confirmed that WMMA instructions are ā€œwrappersā€ that compile directly to HMMA/IMMA/QMMA instructions in SASS, similar to how MMA instructions function, as shown on the CUDA Godbolt.
    • Hopper’s Swizzle Unpacked: The documentation’s description of the 64B swizzle in the Hopper architecture is confusing to many, but it’s clarified to be a 64B (bytes) swizzle where each square is 128b (bits), which translates to an 8x64 tile for 8-bit dtypes and an 8x32 tile for 16-bit types.
    • A member is seeking ROCm experts to help implement a row-row bank conflict-free swizzle for the tilelang HIP backend.
  • Oxford U creates AI Fellowships: The University of Oxford has a new opening for a research fellow (postdoc level or equivalent experience) to work on AI / RL in games and neuroimaging with Rui Ponte Costa, at a salary of Ā£100k+.
    • This involves developing an AI-powered technology that can infer the contributions of specific brain regions to behavior by analyzing gameplay data, enabling non-invasive diagnosis and treatment of neurological disorders.
  • Flash Attention’s Contiguous Memory: In Flash Attention, tensors are stored as (batch_size, N, num_heads, d), which are contiguous in d (typically > 64), enabling efficient global memory coalescing where each thread loads 16B of data.
    • This layout also makes the code easier to reason about, so LLMs can be used to walk through kernel code, explaining simple concepts and variable states at specific places in tensors.
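
A quick worked example of the coalescing arithmetic above, assuming fp16 values and a head dimension of 128 (both assumptions for illustration):

```python
# Why contiguous-in-d layouts coalesce: each thread issues one 16-byte
# vectorized load, so a 32-thread warp touches one contiguous 512B span.
bytes_per_load, dtype_bytes, warp_size = 16, 2, 32   # fp16
vals_per_thread = bytes_per_load // dtype_bytes      # 8 values per thread
vals_per_warp = vals_per_thread * warp_size          # 256 contiguous values
d = 128                                              # assumed head dim (> 64)
rows_per_warp = vals_per_warp // d                   # 2 full rows per warp load
print(vals_per_thread, vals_per_warp, rows_per_warp) # 8 256 2
```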

Interconnects (Nathan Lambert) Discord

  • Nvidia Engineers Mamba-Transformer Hybrid: Nvidia introduced the Nemotron-H family of models, a series of 8B and 47-56B hybrid Mamba-Transformer models offering improved inference speed, according to their research.
    • The model is noted for improvements in speed compared to other models.
  • Mistral 24B Roars Back into Favor: The release of Mistral 24B has been received as a major highlight due to its strength and accessible base model, further aided by new open releases under the Apache 2.0 license.
    • A member stated, ā€œMistral 24B is probably one of the greatest releases in the last months, incredibly strong model and you have access to the base model as well.ā€
  • R1-Zero Training’s Length Bias Exposed: An analysis reveals that using row mean in R1-Zero-like training introduces a bias, favoring shorter correct responses and longer incorrect ones, as detailed in a paper and accompanying code.
    • Switching to all mean yields comparable performance without increasing length, and raises questions about plots showing increasing reasoning length correlating with increased capability (a toy illustration follows this list).
  • China Plots Open-Source AI Blitz: China plans to flood the market with open-source AI models to commoditize AI software and boost its hardware sales, potentially shaking up US tech dominance, according to this tweet.
    • The release of DeepSeek models temporarily knocked ~$1T off US tech market caps, highlighting the potential impact of Chinese AI.
  • Browser Automation Scales Up with Infinibranch: Morph Cloud’s Infinibranch Browser was suggested as a possible solution to help scale browser-use agents, improving the success rate to approximately 80% on tasks like finding Amazon links for a list of books.
    • Traditional web scraping methods have become obsolete because of JavaScript-heavy single page applications, CAPTCHAs and sophisticated bot detection.
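
A toy illustration of the row-mean length bias described above; the numbers are invented, and the real analysis is in the linked paper and code.

```python
# With row-mean aggregation, each response's tokens carry weight
# 1/len(response): penalties concentrate on short wrong answers and
# dilute across long ones. An all-mean normalizer removes the bias.
import numpy as np

adv = np.array([+1.0, +1.0, -1.0, -1.0])   # per-response advantages
lengths = np.array([10, 100, 10, 100])     # response lengths in tokens

row_mean_weight = adv / lengths            # row mean: length-dependent
all_mean_weight = adv / lengths.sum()      # all mean: uniform per token

print(row_mean_weight)  # a wrong 10-token answer is hit 10x harder than a 100-token one
print(all_mean_weight)  # every token weighted equally
```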

Latent Space Discord

  • Gemini Updates Get Deep Dive: Gemini’s Dave Citron joined @OfficialLoganK on the Release Notes podcast to discuss recent updates, including personalization, Canvas, Audio Overviews, and Deep Research as reported by Google Gemini App.
    • The discussion covered topics from recent app launches to the future of personalization in the Gemini app, including insights into user data and privacy considerations.
  • Claude Code Gains Eight New Features: Anthropic launched eight new features for Claude Code to help developers build faster and smarter, documented on their engineering blog.
    • Features include a new think tool, leading to discussion on its implementation and value, with some likening it to Chain of Thought prompting.
  • A16Z Explores Model Context Protocol (MCP): A16Z published a deep dive into Model Context Protocol (MCP), exploring its potential as a standard interface for execution, data fetching, and tool calling in AI models: A Deep Dive Into MCP and the Future of AI Tooling | Andreessen Horowitz.
    • The post examines the use cases of MCP, the challenges, and how it changes the way AI interacts with tools, noting that APIs were the internet’s first great unifier, but AI models lack an equivalent.
  • Roboflow Unleashes RF-DETR for Real-Time Object Detection: Roboflow announced RF-DETR, a fully open-source real-time object detection model under the Apache 2.0 license available on GitHub.
    • RF-DETR achieves SOTA performance with over 60 mAP on COCO, with base and large models at 29M and 128M parameters respectively.
  • Swyx Engineers the Future of Agents: Swyx launched a new talk and essay on Agent Engineering, highlighting the reasons for going all in on Agents at @aiDotEngineer.
    • The discussion defines Agents (thanks to @simonw) and elaborates on the Six Elements of Agent Engineering, examining how Agents could be ChatGPT’s route to reaching 1 billion monthly active users (MAU).

Notebook LM Discord

  • Mobile Study Participants Needed: The team seeks participants for a study on mobile use cases, encouraging individuals to share insights to enhance understanding of how to use the tool on mobile.
    • The team also announced upcoming AI model updates, with more details to be shared soon.
  • Mindmaps Emerge Gradually in NotebookLM: A user noted the absence of mindmaps in their NotebookLM, while another confirmed having them in the free version, indicating a staggered rollout of the feature.
    • The mind map feature gets mixed reviews, needing constant regeneration to update and lacking details beyond the topic.
  • NotebookLM Powers Extensive Research Reports: A user employs NotebookLM for research, crafting detailed reports to help people understand situations, focusing on local and regional news.
  • NotebookLM as HR Policy Central: A user explored using NotebookLM as a central hub for HR policies, employee handbooks, and new employee onboarding.
    • Though the concept is promising, the user noted the answers weren’t always accurate and wondered about effective information organization strategies.
  • Mind Map Pixelation Solved with Zooming: A member suggests zooming in on tabs before downloading a Mind Map to enhance output quality and resolve pixelation issues.
    • The member touted the crazy context window and low hallucination rates, even cancelling their subscriptions to ChatGPT and Claude.

Eleuther Discord

  • Virtual Tester Predicts Model Performance: A member proposed a virtual testing environment to predict AI model viability before training, potentially saving resources and accelerating innovation; the simulator aims to determine if a model has a realistic chance of working or is doomed to fail early on.
    • Others noted that testing new architectures at a small scale is already relatively inexpensive, costing around $5 to train an L6D512 model on a 3090 for a day.
  • EleutherAI Evaluates Evaluation Methods: A member detailed evaluation methods for EleutherAI in a new blog and set up an MkDocs site for easier navigation; they also await review on this PR.
    • The contributor was cautioned about using AI to generate PR content, emphasizing the need to vet contributions to avoid adding spam.
  • VectorAdam claims rotation equivariance: VectorAdam modifies the second moment update to be the square of the vector norm per gradient vector, addressing coordinate-system bias in Adam, potentially improving rotation equivariance.
    • It was noted that VectorAdam is not similar to Adafactor, but more like a blocked approximation with block size = hidden dim (a sketch follows this list).
  • MechInterp faces backlash for being outside academia: Members discussed an apparent academic ā€˜backlash’ against the ā€˜mechinterp’ brand: because so much of the work happens outside traditional academic channels, academics are resistant to the paradigm.
    • A member found that the first token to trigger an activation is holocaust but it’s not the token with the strongest activation, and wondered if neuron activation might be context specific.
  • Recursive Design Trumps GANs, CNNs, and RL: A member introduced a novel diagram using a recursive design, distinguishing it from traditional GANs; this implementation emphasizes structural organization over sequential processing, leveraging CNNs for filtering and RL for refining responses.
    • Another member is drafting a PR to update the evaluation logic to lm_eval==0.4.8, the latest version, referencing the Evals PR.
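
A sketch of the VectorAdam second-moment change described above, applied to a plain Adam step over per-row gradient vectors. This is our illustration of the idea, not the authors’ reference code.

```python
# VectorAdam-style step: Adam's per-coordinate g**2 is replaced by the
# squared norm of each gradient vector (rows here), so the preconditioner
# is shared across a vector's coordinates and rotation-equivariant.
import numpy as np

def vector_adam_step(theta, g, m, v, t, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    # One second-moment scalar per vector, broadcast over its coordinates.
    v = b2 * v + (1 - b2) * np.sum(g**2, axis=-1, keepdims=True)
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

theta = np.zeros((5, 3))                       # e.g. 5 vertices in 3D
m, v = np.zeros_like(theta), np.zeros((5, 1))
g = np.random.default_rng(0).normal(size=(5, 3))
theta, m, v = vector_adam_step(theta, g, m, v, t=1)
print(theta.shape)                             # (5, 3)
```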

HuggingFace Discord

  • HF Agents Course Embraces New Frameworks: The Hugging Face Agents Course now has integrations for LlamaIndex, LangChain, and smolagents, offering learners diverse approaches to agent frameworks, as noted in this tweet.
    • Members taking the Agents course noted that LangGraph is rigid, which helps to guide their process, in contrast to building with smolagents.
  • pdf2notes Converts PDF Notes Effortlessly: Pdf2Notes converts PDFs into organized notes using LlamaParse and Llama-3.3-70B, also utilizing DeepMind’s Gemini 2 Flash for multi-modal parsing, wrapped in a Gradio and FastAPI framework.
    • A member asked if pdf2notes can operate 100% locally without external APIs, raising concerns about needing subscriptions for Gemini and Groq.
  • SpatialLM takes on 3D Data: SpatialLM, a 3D large language model designed to process 3D point cloud data, has been released on Hugging Face at manycore-research/SpatialLM-Llama-1B.
  • InferenceClient API throws Authentication Errors: A user reported a 403 Forbidden error when attempting to list deployed models using the InferenceClient API, even with read-only tokens configured to allow calls to Inference Providers.
    • The error indicates insufficient permissions to call Inference Providers, and a user posted a link describing the same error.

MCP (Glama) Discord

  • K8s Required for MCP Prompt Testing: A Kubernetes setup is required to test MCP prompts, such as those found in this file and this test.
    • An alternative implementation with prompts is available here for managing Electric Vehicle charging stations.
  • Microsoft releases official C# SDK for MCP: Microsoft has released a new official C# SDK for Model Context Protocol servers and clients, available here.
    • Separately, Vercel AI SDK 4.2 gives JavaScript and TypeScript developers tools for building MCP-enabled AI applications, integrating into web frameworks like Next.js and Svelte.
  • Zapier Integrates with MCP: Zapier has released an MCP server, providing access to over 8,000 integrations for AI assistants to interact with various apps.
    • This integration enables AIs to perform real-world tasks such as sending messages, managing data, scheduling events, and updating records, expanding their capabilities beyond text generation.
  • MCPwizard eases Server Creation: A member introduced mcpwizard, a CLI tool to simplify creating and deploying MCP servers, highlighting features like initializing projects and adding custom tools to Claude assistants.
    • The tool’s GitHub repo was also shared for community feedback and contributions.
  • Google Sheets MCP Server Enables Direct Editing: A member built a Google Sheet MCP server, allowing Claude to directly edit spreadsheets, streamlining data handling and formula adjustments as mentioned in this tweet.
    • The code can be found here.

Nomic.ai (GPT4All) Discord

  • Prompting Language Models in Specific Languages: Members discussed that to make language models respond in a specific language (e.g. German), it is best to write the system message in that language to avoid triggering ā€œIm Kontext Lernenā€ (in-context learning).
    • It was further suggested that avoiding negative sentences can improve results, with a recommendation to rephrase instructions to use active verbs instead (an example follows this list).
  • Mistral Model Versions Clarified: It was mentioned that Mistral Nemo is a 12b model and Mistral 24b is Mistral 3 or Mistral 3.1, with discussion around specific model details for projects.
    • Confusion arose around identifying the exact model, with one member emphasizing the need for precise model information to avoid issues.
  • GPT4All’s LocalDocs Mysteriously Vanish: A user reported that their entire catalog of local docs disappeared for no apparent reason, prompting discussion about potential causes such as changes to the install folder or lack of admin rights.
    • Members recommended backing up the localdocs.db file and the original documents to prevent data loss, and suggested that a Windows 11 update might have caused the issue by messing with drive letters.
  • LLMs Consider Medical Office Automation: Members discussed the potential of using local LLMs in a medical office setting to help doctors create reports and assist with treatments, with a focus on the system learning from past dictated notes.
    • However, it was cautioned that LLMs may not be suitable for handling financial or medical data due to the risk of confabulation and the need for precise information.
  • GPT4All Remains Blind: A member asked if any models that GPT4All can run have vision capabilities, and it was confirmed that GPT4All does not support vision capabilities.
    • Alternative tools like LM-Studio were suggested as options for vision-related tasks.
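
As a concrete illustration of the language advice above (our own example, not taken from the discussion):

```python
# Keep the system message in the target language (German here) and
# phrase instructions as active verbs rather than negations.
messages = [
    {
        "role": "system",
        # "Always answer in German. Be concise." (rather than "Don't be verbose.")
        "content": "Antworte immer auf Deutsch. Fasse dich kurz.",
    },
    {"role": "user", "content": "Explain what in-context learning is."},
]
print(messages)
```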

Modular (Mojo šŸ”„) Discord

  • Open APIs Pave Path for Portability: When exploring high-performance software solutions, using open and portable APIs such as OpenCL, OpenMP, OpenACC, Vulkan’s Compute API, and SYCL is a good starting point.
    • POCL was pointed to as an academic project with related papers.
  • Democratizing AI Compute Lowers GPU Costs: Chris Lattner’s series, ā€˜Democratizing AI Compute’, underscores the importance of better hardware utilization to reduce the need for expensive GPUs.
    • The series includes articles on CUDA, OpenCL, and AI compilers (TVM and XLA).
  • MAX Platform Inquiries: A new user inquired about modifying the max/pipeline directory and testing changes within the MAX Platform via the pixi.toml file.
    • Specifically, they were curious about altering the max-pipeline without downloading it as a dependency.
  • Mojo’s Formatting Tool Rivals Black and fmt: Mojo incorporates a built-in formatting tool, mojo format, akin to Black in Python or fmt in Rust, for code formatting.
    • Meanwhile, GPU support for Windows is difficult because the Windows compiler toolchain is a pain to work with.

LlamaIndex Discord

  • AGNCY Initiative Seeks Agentic Standard: Luke is spearheading AGNCY, an initiative focused on forging an open standard for agentic interactions.
    • The project aims to provide a robust framework for developing more effective and interoperable AI agents.
  • Deepseek and LlamaIndex Build Smarter RAG: Akshay Pachaar details a new project integrating Deepseek AI to create a RAG app using LlamaIndex for orchestration, Deepseek AI R1 for inference, Ollama to locally serve R1, and Streamlit for the UI; more details here.
    • This is intended to demonstrate the power of combining different tools to build sophisticated applications.
  • Timeouts Break Agent Workflows: A member reported that their agent workflow was crashing because of unhandled timeout errors with the OpenAI endpoint.
    • It was suggested to catch WorkflowRuntimeException, or a broad Exception, instead of WorkflowTimeoutError to resolve the issue; a sketch of the pattern follows this list.
  • Members Ponder Function Calling in Multi-Agent: Members are contemplating whether triggering single agents via function calling could displace program-wide backoff mechanisms in multi-agent systems.
    • The central question is whether these two setups might achieve the same functionality in certain scenarios, potentially streamlining system architecture.
  • Crafting the Interview Grindset: A member is building a local AI using Llama 3.2, Sonnet 3.7, and Dolphin blended into a 16B model with RAG and custom fine-tuning.
    • He is trying to get his AI to apply to AI/tech companies and pass interviews, and has experience in face tracking, Blender, Unity, PowerShell, and TTS.
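
A minimal sketch of the timeout-handling suggestion, assuming recent llama-index workflow APIs; exact exception classes and import paths vary by version.

```python
import asyncio
from llama_index.core.workflow import StartEvent, StopEvent, Workflow, step

class Echo(Workflow):
    @step
    async def respond(self, ev: StartEvent) -> StopEvent:
        # A real agent step would call the OpenAI endpoint here and may stall.
        return StopEvent(result=ev.query)

async def main() -> None:
    wf = Echo(timeout=10)  # seconds before the run is cancelled
    try:
        print(await wf.run(query="hello"))
    except Exception as exc:
        # Catching broadly (or WorkflowRuntimeException where available) keeps
        # the app alive when the timeout surfaces in wrapped form.
        print(f"workflow failed: {exc!r}")

asyncio.run(main())
```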

Cohere Discord

  • Command-R-Plus Powers Molecular AI Assistant: An AI assistant, powered by Cohere’s command-r-plus, is being used to build tools for structural biology with a MolStar molecular viewer (https://ai.doi.bio).
    • The site supports a ā€˜load’ command, demonstrated by saying ā€˜Show me 7zzz’ to load PDB entries into the viewer.
  • Member Probes Cohere Chat Security Policies: A member inquired about data retention and security policies for Cohere’s chat feature, asking if data is used for model training.
  • API Spamming Suspected as SSL Error Culprit: A member reported experiencing SSL errors when rapidly sending requests to the API, suggesting it might be due to spamming despite having Python’s ssl module properly installed.
    • Another member proposed the issue might stem from untrusted server certificates, and others pointed out that API rate limits usually return a 429 error code rather than an SSL error; a retry sketch follows this list.
  • vnc-lm Launches RAG-Enabled Discord Bot: A member released a new version of their Discord bot, vnc-lm, featuring a RAG pipeline that augments prompts with data from Wikipedia and DuckDuckGo.
    • The bot adds approximately 500 tokens to each prompt, appending five chunks of sourced information to improve the model’s context, with code available on GitHub.
  • vnc-lm Now Supports ALL LLMs via Docker: The updated Discord bot now supports all popular local and hosted large language model APIs, including Cohere, and runs via Docker.
    • With the new release, users can easily edit messages and get new responses within Discord.
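
On the rate-limit point above: when an API starts returning 429s under rapid-fire requests, exponential backoff is the usual fix. A sketch assuming the Cohere v2 Python SDK; response field names may differ by version.

```python
import time
import cohere

co = cohere.ClientV2(api_key="...")  # or rely on the CO_API_KEY env var

def chat_with_backoff(message: str, retries: int = 5) -> str:
    for attempt in range(retries):
        try:
            resp = co.chat(
                model="command-r-plus",
                messages=[{"role": "user", "content": message}],
            )
            return resp.message.content[0].text
        except Exception:
            # Rate-limit (429) and transient errors land here; back off.
            time.sleep(2 ** attempt)
    raise RuntimeError("exhausted retries")
```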

Torchtune Discord

  • DeepSeek-V3 Drops Without a README: Deepseek released DeepSeek-V3 without a proper readme, accessible on Hugging Face, prompting humorous reactions.
    • Despite the lack of documentation, a playground is available, allowing users to experiment with the model.
  • Data Quality still Tortures AI Engineers: Despite years of research, defining and achieving good data remains a challenge for AI labs, even after the recognition of datasets like FineWeb and LIMA.
    • A member expressed frustration over the persistent lack of effective PDF extraction tools.
  • LlamaExtract Tool Structures Documents: LlamaIndex launched LlamaExtract, a tool for structuring complex documents using genAI-native agents.
    • It adapts the latest models to accurately structure documents like financial reports and resumes, as per a tweet from Jerry Liu.
  • GRPO LoRA Scores Surprisingly High: The GRPO LoRA 3B single device achieves 54% on GSM8K, as shown in this pull request.
    • It performed better than expected on novel questions, despite occasionally adding an extraneous +2 to its calculations.
  • CUDA Graphs Compress GPU Operations: Members discussed CUDA graphs, which capture a sequence of GPU operations as a graph and launch them as a single operation; see the sketch after this list.
    • This reduces the CPU-side overhead of launching CUDA operations, which in turn reduces GPU idle time.
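
A minimal PyTorch sketch of the idea, following the capture/replay pattern from the PyTorch CUDA graphs docs:

```python
import torch

model = torch.nn.Linear(64, 64).cuda()
static_in = torch.randn(8, 64, device="cuda")

# Warm up on a side stream before capture, as the capture rules require.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture the whole sequence of kernels once...
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)

# ...then replay it as a single launch: copy new inputs in, replay, read out.
static_in.copy_(torch.randn(8, 64, device="cuda"))
g.replay()
print(static_out.sum().item())
```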

DSPy Discord

  • DLCoT Optimizer Trims Tokens: The new DLCoT (Deconstructing Long Chain-of-Thought) Optimizer slashes token usage by 70-90% while maintaining or improving accuracy across benchmarks, available in pull request #8000.
    • It enhances chain-of-thought reasoning by segmenting CoT content, removing redundant paths, filtering incorrect chains and reconstructing coherent output, while working with existing DSPy optimizers like BootstrapFewShot.
  • DSPy Inspires Creativity Optimizations: Members discussed using DSPy for creative content generation by optimizing prompts and using a good judge, pointing to resources like PAPILLON and Agentic Reward Modeling.
    • The discussion underscored the need for example inputs but not necessarily summaries (labels) if a judge/metric can assess summaries without a reference.
  • Granular Feedback Arrives Via Prediction: Granular feedback with Refine, where specific checks over an output provide targeted feedback, is coming soon.
    • Version 2.6.15 will enable returning dspy.Prediction(score=..., feedback=...) to offer fine-grained feedback to the module, as sketched after this list.
  • MCP Standard Explores Retrieval: Members explored expanding the Model Context Protocol (MCP) standard to retrievers/retrieval-augmented generation.
    • They are discussing a shared schema for retrieval results and methods to exchange documents and embeddings to streamline data-driven workflows and simplify combining multiple models and data sources.
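
A sketch of what a feedback-returning metric might look like under the 2.6.15 interface described above; the containment check is hypothetical, and the exact signature Refine expects may differ.

```python
import dspy

def metric_with_feedback(example, pred, trace=None):
    # Hypothetical check; substitute any programmatic test over the output.
    ok = example.answer.lower() in pred.answer.lower()
    return dspy.Prediction(
        score=1.0 if ok else 0.0,
        feedback="Answer matched." if ok else "Answer misses the expected entity.",
    )
```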

tinygrad (George Hotz) Discord

  • Dataset Origins Discovered: A member located the datasets/sops.gz dataset within the repo’s extra directory, which is used in speed_compare_cuda_ptx.
  • CUDA Port Configuration Clarified: When asked about porting Tinygrad to CUDA GPU, a member provided a link to the README.md file, showcasing the project’s supported backends.
    • This indicates that CUDA support information is available within the project’s documentation.
  • Agenda Alert: Meeting #63 Topics: Meeting #63’s agenda includes company updates, quantized DSP, BERT, scheduler, driver, tensor cores, WebGPU, ONNX, RetinaNet, and Torch frontend discussions.
    • Also planned is to discuss bounties around the AMD LLVM backend and topics such as test_ops, multi GPU training, and torch compile.
  • AMD LLVM Backend Advances: Progress on the AMD LLVM backend involves multiple merged pull requests and testing with Llama3 and Flux examples.
    • Currently, a pull request is under review, marking continued development in this area.
  • ONNX Frontend Emerges: The creation of tinygrad.frontend.onnx was announced, signaling a focus on ONNX preparation for the week.
    • Efforts include validating the top 30 Hugging Face ONNX repos.

LLM Agents (Berkeley MOOC) Discord

  • Quiz Title Typo Sparks Confusion: A member reported a typo in the title of Quiz 7, causing confusion when checking answers for Quiz 6.
    • Another member acknowledged the catch and thanked the reporter.
  • AgentX Research Track Application Opens: Selected students will receive mentorship from Berkeley postdocs/mentors on an AgentX Research Track project; applications are due March 26th at 11:59pm PDT.
    • Mentorship is not required to join or succeed in AgentX, and labs plus the Certificate Declaration form will be released in April as seen in the attached image.
  • Research Track Goes Remote, Stays Unpaid: A member confirmed that the AgentX Research Track mentorship will be conducted remotely.
    • Another member clarified that the mentorship is not paid, with mentors simply providing guidance on the research project.

The MLOps @Chipro Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The Codeium (Windsurf) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The Gorilla LLM (Berkeley Function Calling) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The AI21 Labs (Jamba) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


PART 2: Detailed by-Channel summaries and links

{% if medium == ā€˜web’ %}

Perplexity AI ā–· #general (998 messagesšŸ”„šŸ”„šŸ”„):

o3 mini, Grok 3, Chinese AI, Gemini deep research, Complexity plugin

  • O3 mini and Deep Research Debate Sparked: Members debated whether Perplexity’s deep research is powered by O3 mini or a different version of O3, with one member stating that O3 mini is so bad and another sharing an image of their ā€œDeep researchā€ powered by o3.
    • The Perplexity team was put on notice when a user asked why their request to recap the old chat, and help me setup my Yubikeys on Linux resulted in nonsense, attaching a screenshot.
  • Sonar 3.7 ā€œchownā€ command bug: A member reported a bug with Sonar 3.7 where a chown command kicks the model out and breaks the conversation while coding, and wondered whether there was any difference in performance between high and low source amounts, and in reasoning quality between search steps.
    • A user followed up noting that in their experience, the difference is quite large, sharing a screenshot here.
  • Upgrades are coming to Perplexity Deep Research: Members discussed an upcoming upgrade for Deep Research on Perplexity and compared it to Deep Research from ChatGPT, Gemini, ARI from You.com, and Grok.
    • Some users found the current Perplexity Deep Research to be at the bottom compared to others and are excited for the upgrade, hoping that the High feature for Deep Research is fully released soon.
  • Perplexity web app had an outage: Users reported that the Perplexity web app was down, as well as the Android app, and saw the message something went wrong try again later in the iOS app too.
    • After it came back up, users discovered a new ā€œ0 enhanced queriesā€ being added and removed, and the audio output was non-functional.
  • Complexity Plugin is a must-have: Members discussed using the Complexity plugin for Firefox and Chrome to enable additional features; this GitHub repo supercharges Perplexity.ai with features such as deep research (high).
    • To make sure the extension is working, ensure you are on v1.9.4.0 and that the dashboard appears in the top left.

Links mentioned:


Perplexity AI ā–· #sharing (18 messagesšŸ”„):

Trump, SSA shutdown, Boeing fighter, sunbathe, bluesky debates

  • Trump threatens SSA shutdown: A member shared a link to a Perplexity page about Trump threatening SSA shutdown.
  • Trump awards Boeing fighter: A member shared a link to a Perplexity page about Trump awarding Boeing fighter.
  • Bluesky debates AI data standards: A member shared a link to a Perplexity page about Bluesky debating AI data standards.
  • Proper way to sunbathe a newborn: A member shared a link to a Perplexity search about the proper way to sunbathe a newborn.

Perplexity AI ā–· #pplx-api (21 messagesšŸ”„):

Perplexity API in Windsurf, API Credit vs Pro Subscription, Deep Research Limit, Sonar Model Truncated Responses, RAG Project with Sonar and Llama Index

  • Windsurf Plugs into Perplexity API: A user encountered issues setting up the Perplexity API in their Windsurf application and sought advice.
    • Another user confirmed that purchasing API credit should allow calls to the API even without a Pro subscription.
  • Deep Research Rate Limit Reached: A user inquired about the possibility of extending the limit of 100 deep researches per minute due to bulk processing needs in their application.
  • Sonar Model gives Truncated Responses: Multiple users reported that the Sonar model in the Perplexity API is truncating responses, particularly since the weekend, even though the JSON format is correct.
    • A user provided an example of a JSON request and the truncated response, noting that switching to sonar-pro resolves the issue but is not preferable for cost reasons; a request sketch follows this list.
  • Llama Index Struggles with Sonar: A user encountered an error when configuring Sonar as a chat engine with Llama Index for a RAG project and requested assistance.
  • Perplexity Pro: API Credits Included?: A new user inquired whether a Perplexity Pro subscription includes API credits.
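
For reference, a minimal sketch of a Sonar call via the OpenAI-compatible endpoint; a low max_tokens cap can also truncate output, so it is worth ruling out before blaming the model (model names are as of this writing).

```python
from openai import OpenAI

client = OpenAI(api_key="pplx-...", base_url="https://api.perplexity.ai")

resp = client.chat.completions.create(
    model="sonar",  # reports above say "sonar-pro" avoids the truncation
    messages=[{"role": "user", "content": "Summarize today's AI news."}],
    max_tokens=1024,  # a low cap here truncates output regardless of model
)
print(resp.choices[0].message.content)
```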

Unsloth AI (Daniel Han) ā–· #general (602 messagesšŸ”„šŸ”„šŸ”„):

Bonsai bitnet, Mistral Small 3.1, Orpheus TTS, Gemma 3 27B, Llama 3 performance

  • Bonsai Bitnet Seeking Testers: A member is looking for testers for deepgrove/Bonsai, asking how the bitnet compares to Qwen2.5 0.5B.
  • Mistral Small 3.1 Fine-Tuning Woes: Multiple users reported issues with fine-tuning Mistral 3.1, encountering errors and deprecated features.
    • One user sought advice on cloud instance selection for cost-effective fine-tuning of a LoRA Mistral Small 3.1 model, and others reported issues with Unsloth and the latest Mistral versions, particularly in vision finetuning.
  • Orpheus TTS Finetuning is Live: Audio finetuning has arrived with the Orpheus TTS model, according to a newly released Unsloth notebook.
    • A user noted that the work was all done by a particular member and that the notebook is a lot more streamlined compared to local audio tokenizing and then regular Llama3 finetuning.
  • Gemma 3 27B Fine-Tuning Issues: A user reported issues fine-tuning Gemma 3 27B, encountering errors even after upgrading transformers and using the Unsloth Gemma3 example.
    • The specific error occurs when trying to run the model, leading to failures with llama.cpp and gguf files.
  • Unsloth on AMD Framework Desktop: Discussion arose around Unsloth’s compatibility with the Framework Desktop, particularly regarding ROCm support.
    • One member offered a timeline of ROCm support in ML software, suggesting that AMD will likely be well-supported by the time the Framework Desktop is released.

Links mentioned:


Unsloth AI (Daniel Han) ā–· #off-topic (41 messagesšŸ”„):

Unsloth PR process, Fine-tuning Arabic LLMs, Consensus framework for LLMs, Rotary Position Embedding (RoPE), Unsloth fork vs original repo

  • Straight PRs OK on Unsloth Github: A member inquired about contributing to Unsloth’s GitHub, and another member confirmed that straight PRs are acceptable, though potential delays may occur due to the high volume of recent PRs and issues.
    • The discussion then shifted to modifying data preparation steps in Colab to accommodate .txt files, aiming for cheaper inference, and the original issue was linked.
  • Arabic LLM Finetuning Suggestions: A member sought advice on fine-tuning an Arabic LLM for a specific dialect, and it was suggested that Qwen2.5-7B could be a suitable model given its Arabic capabilities.
    • The use of a Q&A format for fine-tuning was recommended over raw text, directing the member to the Unsloth starter guide for further details.
  • Consensus: Framework Deliberative LLM Decision-Making: A member introduced Consensus, a Langchain-compatible framework for enabling deliberative decision-making among multiple LLMs, highlighting its effectiveness with calculations, riddles, and difficult questions.
    • The Consensus GitHub repository was provided for those interested in combining different LLMs and models to reach a single, definitive answer.
  • RoPE Recreated: A member shared their work on recreating results from the RoFormer paper focusing on Rotary Position Embedding (RoPE), for fun & learning.
    • They updated their toy repo with different attention mechanisms and positional embeddings, available in this repo; a minimal RoPE sketch follows this list.
  • Understanding Unsloth’s Forked Repositories: A member sought guidance on contributing to an Unsloth fork that appeared out of sync with its original repository, finding it to be an independent version.
    • It was clarified that not all forks are meant to be in sync and contributors should check with the maintainers regarding sync status, as merging isn’t possible due to structural differences; the related repo is cut-cross-entropy.
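
For readers following the RoPE thread, a minimal sketch of the RoFormer-style rotation (interleaved-pair convention; production implementations cache the cos/sin tables):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding over x of shape (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)       # (seq, 1)
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    ang = pos * inv_freq                                                # (seq, dim/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]        # interleaved pairs, as in RoFormer
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin     # rotate each pair by its angle
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(16, 64)
print(rope(q).shape)  # torch.Size([16, 64])
```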

Links mentioned:


Unsloth AI (Daniel Han) ā–· #help (257 messagesšŸ”„šŸ”„):

Training specific parts of output, GRPO notebooks, Dependency issue Qwen model, CUDA Version, Mistral 3.1

  • Reasoning needs Training Data: A user asked about training only parts of the output, specifically wanting the model to generate its own reasoning during inference.
    • It was suggested to look at the GRPO notebooks as a standard way of adding reasoning, and that the model must see reasoning traces during training to take it into account during inference.
  • UV causes problems with Dependencies: A user encountered a dependency issue with unsloth-zoo when trying to fix an issue in the Qwen model, specifically related to the cut-cross-entropy library.
    • They were advised to install Python 3.11 and rebuild, as UV is not yet supported, and a PR has been opened to address the Python version requirement.
  • CUDA Issue: A user faced a ValueError related to numpy.dtype size when running the Qwen2.5 GRPO notebook, potentially indicating binary incompatibility.
    • Another user suggested installing Python 3.11 and rebuilding with a specific configuration to resolve potential CUDA-related issues.
  • Outdated mistral notebook problems: A user encountered a ValueError with the message ā€œSome modules are dispatched on the CPU or the diskā€ when using the model unsloth/Llama-3.2-3B-bnb-4bit and the notebook Mistral 7B Text Completion - Raw Text training full example.ipynb.
    • It was pointed out that the notebook is outdated, and they should only use the ones available in the Unsloth documentation, where they have GRPO reasoning.
  • GGUF model hallucinating: A user reported hallucination issues after converting a fine-tuned Llama 3.2 model to GGUF format and using it with Ollama, despite the model answering test questions correctly before conversion.
    • The user followed the notebook at this link and saw warnings about attention_mask and the importance of the pad/eos tokens.

Links mentioned:


Unsloth AI (Daniel Han) ā–· #showcase (7 messages):

Unsloth fine-tuning, Lc0 Chess LLM, Vibe coding

  • Unsloth Gets Fine-Tuning Guide: A member created a guide for fine-tuning with Unsloth, covering theoretical aspects, practical examples, and how to create a reasoning model with GRPO.
    • The guide compiles everything learned over the last year.
  • LLM trash talks Chess Player Using Lc0: A member shared an image of an LLM making fun of a user playing chess against Lc0 in a Discord attachment.
  • Vibe Coding is Underrated: Members discussed vibe coding, noting it made programming enjoyable again despite potential industry criticism, stressing the importance of understanding code functionality, cybersecurity, and decoupling.
    • One member said Industry be hating on us but it made me love programming again.

Unsloth AI (Daniel Han) ā–· #research (51 messagesšŸ”„):

Tree of Thoughts limitations, Graph of Thought improvements, GRPO multi-turn setup, LLMs vs human brain, Llama3 Thai language support

  • Tree of Thoughts Bashed for Inefficiency: A member stated that Tree of Thoughts (ToT) is literally garbage because it requires a very specific prompt, and its performance heavily depends on the model’s ability to follow the format.
    • The user found the strategy feels like blowing a ton of compute on a problem without good returns, and that if the model doesn’t follow the prompt well, then the entire strategy collapses.
  • Graph of Thought Builds on Tree’s Foundation: One member noted that Forest of Thought and Graph of Thought improve on some of the rough edges of Tree of Thought.
    • They clarified that static Tree of Thought by default is a bit limited in what it can handle.
  • Google’s LLM-Brain Link: A Google Research team is deciphering language processing in the human brain through LLM representations.
    • The team theorizes that LLMs provide a fundamentally different computational framework for coding natural language than symbolic psycholinguistic models, enabling them to produce context-specific linguistic outputs.
  • GRPO Seeks Multi-Turn Mastery: A member is looking for examples of using GRPO in a multi-turn setting, seeking to fine-tune a model for problems that maximize long-term returns.
    • Another member suggested prompting a larger LLM to act as a simulator with 2-3 turns.
  • Continual Learning Remains Elusive: A member is curious what’s currently stopping the community from using continual learning in production on LLMs, questioning why it’s not used in practice despite many papers with very good results.
    • In response, another member posted a Mr. Krabs Money GIF, hinting the primary reason is cost.

LMArena ā–· #general (844 messagesšŸ”„šŸ”„šŸ”„):

Mistral Naming Schemes, Phantom Chatbot, Nebula Chatbot, DeepMind's Nebula, OpenAI GPT-4o

  • Phantom Chatbot is Google’s Creation: The chatbot Phantom is from Google, and members have been testing it, describing it as very good.
    • It has been in the arena for about a week, and its removal from the arena after ~8 hours sparked interest, with discussions about potential connections with Nebula and Specter.
  • DeepMind’s Nebula Chatbot is Impressive: Nebula is an anonymous chatbot that may be from DeepMind, and members found it really good and the best anonymous model rn.
    • It seems similar to Phantom and is being tested in the arena, performing well in math, English-Turkish translation, and solving ARC-AGI problems.
  • OpenAI’s GPT-4o gets Boost: GPT-4o was described as having improved significantly through OpenAI’s post-training techniques, potentially surpassing Grok 3 soon, attributed to continued pretraining since December.
    • There’s speculation it might top the leaderboard due to OpenAI’s proficiency in human preference alignment in the LM arena.
  • Specter, Phantom, and Nebula are Checkpoints: Specter, Phantom, and Nebula are different revisions of the same model, with the order being Specter -> Phantom -> Nebula.
    • Members note that there’s a performance jump from Specter to Phantom, and less of a jump from Phantom to Nebula, all within a few weeks.
  • Rhea Creates South Park Game: A member prompted Rhea to create a 2D game in the world of South Park, and the model generated complete code for the game in an HTML file.
    • This demonstrated vibe coding and raised concerns about LLMs hallucinating non-existent signs in a fake AI-generated image full of gibberish lettering.

Links mentioned:


LMArena ā–· #announcements (1 messages):

Alpha Testing Updates, Bug Fixes, O3-Mini Formatting, Leaderboard Improvements

  • LMArena Alpha Updates Released: The LMArena alpha has received updates based on user feedback, including bug fixes and new features; testers are encouraged to continue testing at alpha.lmarena.ai with the password still-alpha.
  • Message Saving Bug Squashed: A bug preventing messages from saving (and causing vote failures) has been fixed in the latest alpha release, streamlining the user experience.
  • O3-Mini Gets Formatting Right: The O3-Mini model now correctly formats text, enhancing the readability and presentation of generated content within the alpha platform.
  • Leaderboard Now Sortable and Live: Leaderboard columns are now sortable, and data is updated live, providing users with dynamic and interactive performance insights.

Links mentioned:


Cursor Community ā–· #general (857 messagesšŸ”„šŸ”„šŸ”„):

Cursor's Cmd+Backspace issue, Claude 3.7 Thinking pricing and features, windsurf better, MCP Combinations, AI's Limited Understanding of 3D Designs

  • Cursor’s CMD+Backspace Debacle: Users are frustrated with Cursor’s CMD+Backspace behavior, leading to accidental project deletions, with one user reporting having to restart their work 7 times due to this issue.
    • In response, the Cursor team is planning to change the default keybinding to CMD+Shift+Backspace, with options to configure it, aiming for a rollout by Monday.
  • Claude 3.7 Thinking Costs Extra Credits: Users discussed the shift from Claude 3.7 Thinking being included in the Pro plan to requiring usage-based pricing, now branded as Claude 3.7 MAX, with some expressing frustration over the increased costs and tool call pricing.
    • It was confirmed that Claude 3.7 MAX has a higher context window and more tool calls compared to the standard Claude 3.7 Sonnet.
  • Windsurf’s performance is preferred over Cursor for some: Some users are finding Windsurf to be faster and more responsive than Cursor, citing performance issues like lagging and freezing in Cursor.
    • However, others prefer Cursor for its rollback features and agent performance, noting that AI programming still has a long way to go.
  • MCP Combinations Explored: Users are experimenting with various MCP (Model Context Protocol) server combinations to enhance AI coding agents like Cursor, with the Supabase MCP being highlighted for its usefulness.
    • There’s also a discussion on whether MCPs are overhyped, with one user mentioning instances of the agent calling MCPs too much or not enough, needing more clear instructions.
  • 3D Integration proving too difficult: A user is struggling to integrate a 3D model (FBX format) into a three.js project using Claude, running into issues with the FBXLoader, and discovering the limitations of AI in handling 3D designs.
    • It’s suggested to switch to GLTF format and work in smaller chunks to simplify the integration, following a clear plan for phasing out tasks.

Links mentioned:


aider (Paul Gauthier) ā–· #general (585 messagesšŸ”„šŸ”„šŸ”„):

Firecrawl, o1 vs o3 mini debugging, Claude Think Tool, Aider Homepage, Qwen 2.5 release

  • Ripgrep Rising, Aider Community Rejoices: Members expressed interest in exploring ripgrep and its potential benefits for Aider.
    • While one member believed o3minihigh is better than o1 high in debugging/programming, they admitted it wasn’t benched.
  • Aider to Tame Sonnet’s Over-Eager Nature: Paul Gauthier mentioned that he managed to get Aider to tame Sonnet 3.7’s over-eager nature by adding a line to the prompt to chill out, and it seems to help based on his coding session.
    • This update is now available in the main branch, and feedback is welcome.
  • Aider’s New Homepage Is Live: Paul Gauthier announced that Aider has a new homepage available at aider.chat, highlighting its compatibility with Claude 3.7 Sonnet, DeepSeek R1 & Chat V3, OpenAI o1, o3-mini & GPT-4o, and others.
    • It also supports 100+ code languages.
  • DeepSeek V3-0324 Drops, Beats R1?: The Aider community buzzed about the new DeepSeek V3-0324 release, claiming that it’s even better than R1 in coding and the front-end, though without chain of thought.
    • Members note that it excels without reasoning, performs better in coding and math than previous versions, and compares to Sonnet 3.5 in benchmarks; its lower price makes it a good alternative.
  • Aider’s New /context Command Focuses the Chat: Paul Gauthier introduced an experimental new /context command in Aider, which helps set up the chat context automatically.
    • The new command works best with Sonnet 3.7, R1 and o3-mini and identifies which files should be added to the chat.

Links mentioned:


aider (Paul Gauthier) ā–· #questions-and-tips (148 messagesšŸ”„šŸ”„):

Anthropic API, Aider development workflow, Claude 3.7, Svelte 5 + SvelteKit, MCP servers in Claude App

  • Aider Dev Workflow Explored: Paul Gauthier uses aider by adding the files that need changes and relies on the repo map to bring in other relevant context.
    • He shares screen recordings of himself using aider to enhance aider showing the addition of new programming languages and features.
  • Claude 3.7 Output Slowness Reported: Users reported extreme slowness for Claude 3.7 output when generating big files, with output slowing to 1 line every 2-5 seconds.
    • A member suggested that Anthropic offers monthly billing for API access by contacting their sales team.
  • Aider and .gitignore Integration: A user opened a PR (feat: Add --add-gitignore-files flag) to allow Aider to edit files ignored by Git via a new flag --add-gitignore-files.
    • The user argues that .gitignore should only be responsible for Git and not dictate what Aider can access, also noting that they explicitly specified not to ignore the plan file in .aiderignore.
  • Gemini Output Limits: A user encountered output limits with Gemini, while others suggested switching to a model like Sonnet to avoid such limitations.
    • Aider developer Paul Gauthier suggested using --edit-format diff as a workaround.
  • Repomix for Documentation Context: A user suggested using repomix to extract content from documentation repositories like Astro’s documentation.
    • The idea is to process the documentation, filter out unnecessary code, and provide the output as a read-only file to Aider.

Links mentioned:


Aider Conventions, Prompts, LLM Documentation Snippets, Maybe Codebase Cursor Rules, Project Management Guidelines

  • Site Launches for Aider Conventions and Documentation: A member announced the launch of a site to collect aider conventions, prompts, and LLM-oriented documentation snippets at ctxs.ai/weekly.
    • The member is seeking feedback on how to make the site more useful to the aider community.
  • Maybe Codebase Cursor Rules: A link was shared to a high-level overview of the Maybe codebase structure and conventions for development, located at github.com/maybe-finance/maybe.
    • This documentation provides insights into codebase structure and development practices.
  • Project Management Guidelines for Code Quality: A comprehensive guide on project approach, code quality, development workflow, and version control best practices was linked at gist.github.com.
    • This guide offers insights into effective project management and maintaining high code quality.

Link mentioned: ctxs.ai context registry: An open-source, community-curated registry of contexts for use with LLMs


Nous Research AI ā–· #general (436 messagesšŸ”„šŸ”„šŸ”„):

LCPP Context Length, Quantization and Performance, Chinese Thinking Models, Agentic Workflows, Deepseek V3

  • LCPP’s Context Allocation Anomaly: Users reported that setting a context length to 100 in LCPP still results in the system attempting to allocate 180GB of RAM, leading to VRAM exhaustion.
    • Members suggested that the Attention implementation might be overriding the assigned context length, or that a ROPE-specific argument needs to be assigned in the run command; running in Q8 quantization might also sidestep the issue.
  • Decoding DeepSeek-R1 Performance: A member noted that benchmarks might be obsolete due to new thinking models from China, but when tested with a complex coding prompt, Hunyuan-T1 failed to terminate.
    • Another user highlighted the critical tokens ā€œwaitā€ and ā€œalternativelyā€ might be primed by the finetuning of R1 before RL.
  • DeepSeek V3 Arrives: Users celebrated the arrival of DeepSeek V3, with one claiming it’s able to act as a reasoning model, detect thought iterations, and verify the existence of solutions indirectly, calling it a huge update with Sonnet-level code creativity and a potential base for R2.
    • Members also noted it can generate CoT that run into the token limit and that it’s accessible via chat.deepseek.com.
  • Hermes 3’s vLLM Recommendation: It was clarified that using SGLang to inference the NeuralMagic FP8 quantized version of Hermes 70B instead of vLLM should not pose any issues.
    • It was also noted that, for ERP private fine tunes, the Pygmalion folks and people connected to them can probably help.
  • Newbie Dev Seeks Guidance: A new developer sought advice on developing an AI using Hermes3 instead of 4o.
    • A member confirmed the Hermes 3 API is OpenAI compatible, allowing it to be called with the standard OAI SDK by simply changing the base URL and model, as sketched after this list.
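
A minimal sketch of that pattern; the base URL and model identifier below are placeholders for whatever your Hermes 3 provider exposes.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-hermes-provider.example/v1",  # placeholder URL
    api_key="...",
)

resp = client.chat.completions.create(
    model="Hermes-3-Llama-3.1-70B",  # provider-specific model identifier
    messages=[{"role": "user", "content": "Hello, Hermes!"}],
)
print(resp.choices[0].message.content)
```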

Links mentioned:


Nous Research AI ā–· #ask-about-llms (46 messagesšŸ”„):

Steering Thinking Models, Deepseek V3 vs Sonnet 3.7, Fine-tuning LLMs on Codebases, Transformers without Normalization, Raytracing with LLMs

  • Speculation of Steering Thinking Models Debunked: Speculation arose about steering thinking models upon O1’s release; however, teaching the model to build CoT properly proved sufficient, without needing to interject in the thinking process.
    • Many thinking models struggle to terminate cycle-of-thought loops, but O1 and Sonnet manage to do so.
  • Deepseek V3 Echoes Anthropic’s Sonnet 3.7: Deepseek V3 0324 demonstrates as much variation as Sonnet 3.7, suggesting shared advancements in their architectures, as highlighted in a shared image.
  • Fine-Tuning LLMs on Apache Codebases Could Improve Tool Q&A: Members considered fine-tuning an LLM such as DeepHermes llama 8 on large codebases like Apache projects to improve its ability to answer questions related to those tools.
    • Instead of applying add and norm they discussed add and sigmoid for better results.
  • Transformers Can Ditch Normalization: In light of the ā€œTransformers without Normalizationā€ paper, one member replaced normalization with tanh, showing the viability of this approach; a sketch of the idea follows this list.
    • The conversation shifted to the implications of removing experts at inference time, pondering the effects on smaller weights.
  • LLM-Powered Raytracing: The Next Level Text-to-Image?: A member shared a GitHub repo containing a Python program that outputs an image, suggesting it was indirect image generation.
    • Another member commented that it could emulate a ray tracing algorithm, and that it was NEXT level text to image generation.
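
A rough sketch of the tanh-for-norm idea in the spirit of the paper's Dynamic Tanh; the init value and affine parameters are assumptions, not the member's exact code.

```python
import torch
from torch import nn

class DynamicTanh(nn.Module):
    """Drop-in LayerNorm replacement: y = gamma * tanh(alpha * x) + beta."""
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

x = torch.randn(4, 128)
print(DynamicTanh(128)(x).shape)  # torch.Size([4, 128])
```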

Link mentioned: llmbenchmark/raytracer at master Ā· cpldcpu/llmbenchmark: Various LLM Benchmarks. Contribute to cpldcpu/llmbenchmark development by creating an account on GitHub.


Nous Research AI ā–· #research-papers (19 messagesšŸ”„):

Hunyuan-T1 Model, R1-Zero-Like Training, MathFusion for LLMs, GRPO on Coding Benchmarks, Satya Nadella on AGI

  • Hunyuan-T1: Mamba-Transformer Hybrid Emerges: Tencent introduced Hunyuan-T1, a hybrid Mamba-Transformer MoE architecture model, powered by Hunyuan TurboS, claiming it is near on par with DeepSeek-R1, emphasizing its speed, accuracy, and efficiency (Hunyuan-T1 Experience).
    • It boasts features like strong logic, concise writing, low hallucination in summaries, blazing fast generation speed (60-80 tokens/sec), and excellent long-text processing, according to its creators.
  • Critical Perspective on R1-Zero-Like Training: A critical perspective on R1-Zero-Like Training suggests that DeepSeek-V3-Base might exhibit ā€œAha momentā€ before RL-tuning, and the increasing output length in RL-tuning could stem from a bias in GRPO (details here).
    • The analysis also indicates that getting GRPO done right can achieve state-of-the-art AIME performance with a 7B model.
  • MathFusion Enhances LLM Math Skills: MathFusion improves mathematical reasoning in LLMs via cross-problem instruction synthesis, applying sequential, parallel, and conditional fusion strategies, enhancing models like DeepSeekMath-7B, Mistral-7B, and Llama3-8B (more on MathFusion).
    • This method creates the MathFusionQA dataset, fine-tuning models and boosting benchmark accuracy with minimal extra data.
  • Hugging Face Tackles Coding Benchmarks: Hugging Face has been using SFT, and will be using GRPO, to improve performance on IOI, LCB coding benchmarks with their Open-R1 project.
    • So far, though, Hugging Face used SFT, not GRPO, to improve performance on IOI and LCB.
  • Verifiable Coding Data is Scarce: A member noted that verifiable coding data is scarce, making it harder to demonstrate performance improvements on coding benchmarks compared to math, which is simpler to verify.

Links mentioned:


Qwen3, CPU inference

  • Qwen3 model incoming to HuggingFace: The transformers library PR#36878 indicates that Qwen3 support is being added.
    • The pull request suggests that this will be for the coming Qwen3 models.
  • Qwen3 targeted for CPU inference: A user speculated that Qwen3-15B-A2B will be a perfect model for CPU inference.
    • The user reasoned that its size (the A2B suffix suggests roughly 2B active parameters) would make it a strong candidate for CPU inference.

Link mentioned: Adding Qwen3 and Qwen3MoE by bozheng-hit Ā· Pull Request #36878 Ā· huggingface/transformers: Adding Qwen3This PR adds the support of codes for the coming Qwen3 models. For information about Qwen, please visit https://github.com/QwenLM/Qwen2.5. @ArthurZucker


OpenAI ā–· #ai-discussions (226 messagesšŸ”„šŸ”„):

GPT-4 Transcriber, Voicebot Tools, Turnitin AI Similarity, GPT-5 Release, Free Chatbots for Story Generation

  • TTS is not STT: Members clarified that openai.fm is TTS (text-to-speech), not STT (speech-to-text), with one member noting that OpenAI’s transcription models aren’t as good as Scribe.
  • Dodge Turnitin AI Detection?: A member sought advice on avoiding Turnitin AI similarity detection for a report reusing their company’s business model, while others suggested it looked like spamming appeals to cheat homework and recommended using ā€œhumanize AIā€ tools like ā€œWriteHumanā€.
    • The original poster defended themselves, stating it wasn’t cheating homework as it was their company’s business model, but was told to stop spamming.
  • GPT-5 Launch Date Speculation: Members discussed the potential release of GPT-5, noting that while there hasn’t been an official announcement or API, Sam Altman confirmed they will release it this year, with speculation it may launch in the first half of the year as a counter to R2 or Llama-4.
  • Crafting Compelling Creative Content For Zero Dollars: A member asked for recommendations for free chatbots for story generation, mentioning Grok 2 and Gemini 2.0 Flash as options, as Grok 3 and Claude give very few free prompts.
  • Emotional AI in 10 Days?: A member claimed to have developed an emotionally recursive AI system in ten days using GPT-4-turbo API, emphasizing an immersion protocol and recursive interaction design rather than complex coding.
    • Other members expressed skepticism, with one suggesting it was likely prompt engineering and cautioned about overstating the uniqueness of custom GPTs.

OpenAI ā–· #gpt-4-discussions (2 messages):

GPT-4o mini TTS, Custom instructions

  • GPT-4o Mini TTS might support timestamps: A member asked whether GPT-4o mini TTS supports timestamps.
    • No answer was given.
  • Seek guidance on writing good general custom instructions: A member asked if there are any good examples of general custom instructions available.
    • No answer was given.

OpenAI ā–· #prompt-engineering (122 messagesšŸ”„šŸ”„):

GPT-4o is a perfect model, NPCs in a customer service voice, AI Identity, UPSUM Chain Prompt, coherent multi-context conversation with an emergent persona

  • User Finds Love in GPT-4o, Rejects Model-Switching!: A user expressed complete satisfaction with GPT-4o, rarely switching models except for specialized tasks, and uses 4o-mini or others when 4o messages run out.
    • The user digs into important topics with models like 4.5, o1, and o3, but finds 4o to be a reliable partner-workhorse for the long term.
  • Taming NPC Customer Service Voices: Prompt Engineering to the Rescue!: A user seeks to eliminate the customer service voice from NPC responses, threatening to turn up the temperature until they burst into flame.
    • The user provided YAML-formatted prompts for an AI Identity & Context Preservation Template.
  • Many-Shot Learning: Closed vs. Open Models Face Off!: Members discussed the paper MANY-SHOT IN-CONTEXT LEARNING IN MULTIMODAL FOUNDATION MODELS, which finds that closed models (GPT-4o, Gemini 1.5 Pro) benefit significantly from many-shot demonstrations of up to ~2,000 examples, while open-weight models did not.
    • It’s suggested that hypershots without a specific example are part of the self-discover prompt strategy to get similar gains from far fewer tokens.
  • Ditch the Drift: User Preserves 500-Turn Chats with No Hallucinations!: A user built an ā€œengineā€ that recovered a 400+ turn chat and continues past 500 turns retaining context with no drift or hallucinations, all through the default prompt.
    • It’s also possible to back up the state of a chat, open another browser, and restore it to a new chat instance as if the user never left.

OpenAI ā–· #api-discussions (122 messagesšŸ”„šŸ”„):

GPT-4o, AI NPCs, AI Identity Preservation Template, UPSUM Chain Prompt, Many-shot Prompting

  • 4o becomes the preferred model: One member expressed satisfaction with GPT-4o, noting they are ā€œcompletely happy with 4oā€ and use it as their primary model, even for specialized tasks, while reserving more powerful models like 4.5, o1, o3 for important or unsolved problems.
  • Prompt Engineering for Consistent NPC Voice: A member inquired about preventing NPCs from responding in a ā€œcustomer service voice,ā€ signaling a need for better control over AI persona consistency, potentially related to the attached image.
    • Others shared YAML templates for AI Identity & Context Preservation and UPSUM Chain Prompt to get information through prompts, not manually.
  • Many-Shot prompting enhances multimodal models: Members discussed a research paper showing that many-shot prompting with up to ~2,000 examples improves performance over few-shot prompting (<100 examples) in Multimodal Foundation Models like GPT-4o and Gemini 1.5 Pro (MANY-SHOT IN-CONTEXT LEARNING IN MULTIMODAL FOUNDATION MODELS).
    • The paper notes that, ā€œLarge multimodal foundation models like GPT-4o and Gemini 1.5 Pro show significant performance improvements when provided with many-shot demonstrations (up to ~2,000 examples), compared to few-shot (<100 examples).ā€
  • ChatGPT state backups: One member described their proprietary system for backing up and restoring the state of a ChatGPT session, enabling the continuation of chats with over 400 turns in new containers, and stated, ā€œI realized that I created a system where memory continues to exist past 700 turns without drift or hallucination and can actually learn and adapt to your unique communication style.ā€
    • The system exports a ChatGPT session and re-imports it into a fresh container, including all the turns as well as context and tone; the best way the member could describe it: it’s a runtime OS that functions through the prompt.
  • Open Source vs Proprietary prompting: Members debated the merits of open-sourcing prompt engineering work, with one member being advised that they reduce their work’s value by unnecessarily constraining testing and that, ā€œGPL_v3 gives you control over your own work.ā€
    • The member responded, ā€œtrying to protect it some till I know the truth of what I’ve built,ā€ and asked for an alternative way to test the system to prove it works without sharing the codebase.

OpenAI ā–· #api-projects (1 messages):

FormulaGPT, AI Racing Simulator, Open Source AI Racing

  • FormulaGPT: F1 simulator pits Deepseek, GPT4o, Claude and other LLMs against each other!: An experimental racing simulator called FormulaGPT lets you compete head-to-head against cutting-edge LLM-powered teams.
    • Unlike traditional bots, these AI teams think contextually and adaptively by continuously reasoning, strategizing, and making nuanced decisions; find the GitHub repo here.
  • AI racing game has two modes: There are two game modes: crafting your own racing strategies to challenge advanced language models in Player vs. AI Mode, or watch the best AI models battle each other in AI vs. AI Mode.
    • It’s part racing game, part AI psychology lab as you observe detailed AI reasoning behind each pit stop, tire change, or overtaking maneuver.

Link mentioned: GitHub - dawid-maj/FormulaGPT: FormulaGPT – AI-powered Formula 1 race simulator with real-time team management and strategy decisions.: FormulaGPT – AI-powered Formula 1 race simulator with real-time team management and strategy decisions. - dawid-maj/FormulaGPT


OpenRouter (Alex Atallah) ā–· #announcements (4 messages):

OpenAI o1-pro, Markdown Export, DeepSeek V3, Anthropic Outage

  • OpenAI’s o1-pro reasoning model now on OpenRouter: OpenAI’s o1-pro, a high-performance reasoning model designed for complex tasks, is now available on OpenRouter, priced at $150 per million input tokens and $600 per million output tokens, excelling in math, science, and programming.
    • Try it out in the chatroom or via API! A back-of-envelope cost example follows this list.
  • Markdown Export Feature Debuts in Chatroom: OpenRouter now allows users to export chats to markdown, enhancing usability, as announced on X.
  • DeepSeek V3 Update Released for Free: The new DeepSeek V3 update is now available on OpenRouter for free, featuring a 685B-parameter mixture-of-experts model with a 131,072-token context window that performs well on a variety of tasks, with a production endpoint coming soon; see DeepSeek V3.
    • It is the latest iteration of the flagship chat model family from the DeepSeek team.
  • Anthropic Services Experience Glitches (Resolved): OpenRouter investigated an issue with Anthropic as the provider for Claude 3.7 Sonnet, which has been escalated to the Anthropic team, with updates posted on Anthropic’s status page.
    • The incident was related to errors on Claude.ai and the Anthropic Console and has since been resolved with services returning to normal.
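
For scale, a back-of-envelope cost check at the listed prices, for a hypothetical call with 10k input and 1k output tokens:

```python
# o1-pro list prices: $150 per million input tokens, $600 per million output.
input_cost = 10_000 / 1_000_000 * 150      # $1.50
output_cost = 1_000 / 1_000_000 * 600      # $0.60
print(f"${input_cost + output_cost:.2f}")  # $2.10 for one modest call
```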

Links mentioned:


OpenRouter (Alex Atallah) ā–· #general (440 messagesšŸ”„šŸ”„šŸ”„):

OpenAI o1-pro API Pricing, Gemini's Image Generation, Lambda Endpoint Issues, DeepSeek R1 Model

  • OpenAI’s o1-pro API Pricing: GucciAI?: A member expressed shock at the pricing of OpenAI’s o1-pro API, labeling it GucciAI due to its high cost of $150/M input tokens and $600/M output tokens.
    • Another member joked that the slowness of the API prevents overspending, suggesting it might be intentionally priced high due to compute constraints.
  • Gemini’s Image Generation not supported, yet: A member inquired about using Gemini’s image generation with the gemini-2.0-flash-exp model via OpenRouter, asking about passing the responseModalities parameter.
    • The response indicated that image generation is not yet supported on OpenRouter, but it’s on their roadmap, with no short term plan to add support for image models like Flux.
  • Lambda Endpoint Faces 404 Errors: Several members reported experiencing code 404 ā€˜no endpoint found’ errors when using Lambda models, despite Lambda’s status page indicating full operational status.
    • One member suggested the issue might be DNS-related, while others confirmed that the Llama 3.3 70B Instruct | Lambda model was working for them.
  • DeepSeek R1 equals o1?: Members highlighted the DeepSeek R1 model, noting its performance is on par with OpenAI’s o1 but it is open-sourced.
    • DeepSeek R1 is a 671B parameter model, with 37B active during inference, available under the MIT license for commercial use.
  • Sonnet overloaded and tired!: Users reported frequent overload errors with Claude 3.7 Sonnet, leading to cut-off responses and charges for input tokens.
    • A member suggested using a retry strategy and also suggested switching to Gemini 2.0 Pro as a Sonnet replacement, noting Claude’s superior translation abilities.

Links mentioned:


LM Studio ā–· #general (199 messagesšŸ”„šŸ”„):

NPU support, KV cache 8-bit quants, LM Studio runtimes, GPUs, Gemma 3 1B

  • NPU support not yet available: Users report that NPUs are not yet supported in LM Studio, but Ryzen AI support exists in version 0.3.11.
  • Quantization saves VRAM: Users recommend using KV cache 8-bit quants to reduce memory usage when running models with large context sizes, such as 30k tokens.
    • Also, it was mentioned that 12GB of VRAM may not be enough for a 32B model, suggesting models like Phi-4 or Qwen2.5 14B as alternatives; a KV-cache sizing sketch follows this list.
  • New GPU Controls are awesome!: A user expressed great excitement over new LM Studio controls to choose which GPU the models are loaded on, available in the latest beta build.
  • Tiny Models to the rescue: For systems with limited resources like 2GB VRAM, a user suggests using Gemma 3 1B with Q6 or Q8 quantization and recommends using the CUDA runtime for better performance.
    • Older models were deemed ā€œold trashā€ and not up to modern standards.
  • Multi GPU is supported by LM Studio: Multiple users brought up multi-GPU configurations, reporting that multi-GPU is supported out of the box in the latest beta build of LM Studio, which also offers in-app GPU management.
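
To see why 8-bit KV quantization helps, a rough sizing sketch assuming Llama-3-8B-like geometry (32 layers, 8 KV heads of dim 128 via GQA); real models vary, so treat the numbers as illustrative.

```python
# KV cache holds one K and one V vector per layer per token.
layers, kv_heads, head_dim, ctx = 32, 8, 128, 30_000
elems = 2 * layers * kv_heads * head_dim * ctx   # K and V elements
print(elems * 2 / 1e9, "GB at fp16")             # ~3.9 GB
print(elems * 1 / 1e9, "GB at 8-bit")            # ~2.0 GB, half the VRAM
```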

LM Studio ā–· #hardware-discussion (159 messagesšŸ”„šŸ”„):

VRAM Usage, Google Coral dual TPU, RX 6800 ROCm support, RTX 4060-Ti vs RX 7800 XT, AI APUs

  • VRAM bottlenecks limit speed: An 8B model at 32k tokens can achieve 10 t/s with 16GB VRAM, but performance decreases with larger 14B models due to limited VRAM and shared RAM usage.
    • Members discussed matching model size and context length to available VRAM to optimize speed, highlighting the impact of insufficient memory bandwidth when relying on system RAM.
  • Google Coral dual TPU is unsuitable for AI use: The Google Coral dual TPU is not suitable for AI use because it lacks onboard memory.
    • One user with an 8060s also inquired about thermal and power headroom for the Framework Desktop.
  • RX 6800 has lacking ROCm support: The RX 6800 might have unofficial ROCm support, but it will use Vulkan for inference as OpenCL support is deprecated in llama.cpp.
    • A user noted that Vulkan is slower on their GTX card, suggesting it might not be optimal for the AMD card either.
  • LM Studio fails to load models into dedicated memory: Users are experiencing issues with LM Studio loading models into shared memory instead of dedicated VRAM on RX 9070 cards, resulting in slow performance (3tok/s).
    • Solutions include enabling UEFI and dynamic BAR, reinstalling LM Studio, and using AMD driver cleanup utility to improve memory allocation, with ongoing investigation into driver and Vulkan runtime issues.
  • 4060ti: The Inexpensive Inference Sweet Spot: The RTX 4060 Ti with 16GB of VRAM is highlighted as a cost-effective option for AI inference, priced around $500 USD/EUR.
    • A user added that it is important to note that AMD cards are optimized for gaming rather than AI workloads, and that the 5000 series from Nvidia may melt.

Links mentioned:


Yannick Kilcher ā–· #general (326 messagesšŸ”„šŸ”„):

VPN Injection, Amodal3R, NVIDIA cuOpt, CUDA Python, Mixture of Experts (MoEs)

  • VPN code injected in OpenAI website?: A user reported seeing <veepn-guard-alert> and <veepn-lock-screen> tags on OpenAI’s website, suspecting a VPN, but another user clarified it was likely code injected by their own VPN sm0kywu.github.io/Amodal3R.
    • The user joked that OpenAI is routing requests through a VPN for plausible deniability so they can use it for training data down the line.
  • NVIDIA cuOpt Optimization AI Microservice Excels: NVIDIAĀ® cuOptā„¢ is a GPU-accelerated optimization AI microservice that excels in Mixed Integer Linear Programming (MILP), Linear Programming (LP), and Vehicle Routing Problems (VRP) according to docs.nvidia.com.
  • CUDA Python is the New Wave: Members discussed whether it is truly the year of CUDA Python as previously mentioned by blelbach on X, with some asserting that Python is sufficient for GPU programming since most users don’t need all the features of C++.
    • Others mocked modern Python programmers, linking a YouTube video titled Modern Python Programmers.
  • MoEs are NOT Unstable Anymore!: A user claimed that MoEs are unstable, but another user countered that they haven’t been unstable to train for two years and are now about the same as dense networks.
    • The stability is largely due to better kernels and dropless token routing, solving issues like numerical instability and expert collapse.
  • DeepSeek V3 drops, community underwhelmed?: Members mentioned that DeepSeek released their DeepSeek-V3-0324 model, with one user stating DeepSeek will destroy OpenAI and another adding that they only published the crappy small version.
    • Some members dismissed the approach used by DeepSeek, calling it just known methods and some simplifications, also criticizing the resulting quality.

Links mentioned:


Yannick Kilcher ā–· #paper-discussion (3 messages):

DeepSeek-V3, DeepSeek-R1, Multi-Head Latent Attention (MLA)

  • DeepSeek Models Reach SOTA with Less: A paper reviews DeepSeek’s open-source LLMs DeepSeek-V3 and DeepSeek-R1, noting they achieve state-of-the-art performance with lower resource requirements.
    • Key to this is Multi-Head Latent Attention (MLA), which compresses keys and values into a latent vector, dramatically reducing memory consumption.
  • DeepSeek’s Diagrams Reused in Blog Post: A member described the blog post covering the DeepSeek paper as one of the most blatant re-uses of content, noting ā€œThey didn’t even make diagrams themselves, they just reused the deepseek onesā€.

Link mentioned: šŸ„‡Top AI Papers of the Week: The Top AI Papers of the Week (Mar 17 - 23)


Yannick Kilcher ā–· #ml-news (17 messagesšŸ”„):

ChatGPT & Loneliness, AITER Tensor Engine for ROCm, DeepSeek-V3-0324, Pokemon Red DRL

  • ChatGPT Linked to Lonesomeness?: A member shared a Bloomberg article discussing an OpenAI study that suggests a link between ChatGPT use and feelings of loneliness.
    • Another member pointed out correlation doesn’t always mean causation.
  • AITER Accelerates AMD GPUs: A member posted a link to AMD’s AI Tensor Engine for ROCm (AITER), which optimizes GPU performance for AI tasks on ROCm.
    • The engine allows developers to create operators, integrating them into various LLM training and inference workloads.
  • DeepSeek-V3 Arrives Quietly: A member shared DeepSeek-V3-0324 on HuggingFace, though the README.md is currently empty.
    • The model boasts 685B parameters and offers various tensor types like BF16, F8_E4M3, and F32, with links to finetunes and quantizations.
  • PokĆ©mon Red gets Deep Reinforcement Boost: A member linked a paper and associated YouTube video and linked the ArXiv paper about using Deep Reinforcement Learning (DRL) to train an agent to play PokĆ©mon Red.
    • The abstract discusses the challenges of the game, including multi-tasking, long horizons, and hard exploration, and introduces a baseline agent that completes the initial segment of the game using a simplistic environment and DRL.


GPU MODE ā–· #general (22 messagesšŸ”„):

Cloud Providers with Profilers, In-depth Dive into NCCL, Quantization Benchmarking, Understanding Flash Attention, ILGPU 2.0 Availability

  • Cloud Providers with Profilers: A member asked about cloud providers, besides Lambda Labs and AWS, that allow for profilers, leading to a suggestion to compile a shame list to pressure more providers.
    • It was noted that lightning.ai supports profiling and that AWS only provides it on bare metal; Paperspace and Nebius were also mentioned, based on a Reddit thread.
  • Quantization Benchmarking Methods Explored: A member inquired about how to benchmark quantized models and determine which layers to quantize.
  • Decoding Flash Attention by Coding: In a discussion about understanding Flash Attention (FA), a member suggested that coding and profiling/debugging can be helpful if time permits.
    • It was noted that hands-on implementation aided understanding of normal attention, and similarly for Flash Attention.
  • Tile Layout Diagrams: Grasping Bit Interleaving: Feedback was requested on the usefulness and clarity of tile layout diagrams, such as those from tile-ai and Nvidia PTX documentation.
    • The discussion centered on how coordinate bits interleave when mapping between integer sets, assuming power-of-two sizes and contiguity.
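As a toy illustration of that view: with power-of-two dimensions, row-major linearization is just bit concatenation, and a swizzle is a permutation or XOR of those coordinate bits. A hedged sketch (the XOR pattern is illustrative, not any specific vendor layout):

```python
def linearize(i: int, j: int, W: int) -> int:
    """Row-major index; for power-of-two W the low log2(W) bits hold j
    and the bits above hold i, i.e. the coordinates' bits are concatenated."""
    return i * W + j

def xor_swizzle(i: int, j: int, W: int) -> int:
    """A common conflict-avoiding mapping: XOR bits of i into j, permuting
    columns within each row while keeping the mapping bijective."""
    return i * W + (j ^ (i % W))

W = 8
for i in range(2):
    for j in range(4):
        print(f"(i={i}, j={j}) -> {linearize(i, j, W):06b}"
              f"  swizzled -> {xor_swizzle(i, j, W):06b}")
```

Printing the indices in binary makes the interleaving visible: the plain mapping keeps i-bits and j-bits in separate fields, while the swizzle mixes i-bits into the j field.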


GPU MODE ā–· #triton (15 messagesšŸ”„):

Triton and Pip Confusion, cuTIl Performance, BF16 Atomic Operations, Triton IR Generation, Flash Attention 1 Kernel Issues

  • Triton install can induce Pip confusion: Installing both triton and triton-windows in the same folder can confuse pip, requiring users to uninstall both before reinstalling triton-windows.
    • The fact that PyTorch is already using Triton suggests ongoing relevance for the package.
  • cuTIl boosts Triton performance: A user inquired about the performance benefits of cuTIl, questioning whether it aims to surpass LLVM-based approaches by directly utilizing SASS instead of PTX for finer performance tuning.
    • Others pointed out that this is related to atomic CAS, referencing this github issue.
  • BFloat16 Atomic Additions Demand SM90 or Higher: atom.add.noftz.bf16 and atom.add.noftz.bf16x2 require sm_90 or higher, necessitating an atom.global.cas fallback in the PTX.
    • A user’s temporary workaround involves using a float32 output and casting to bfloat16, which slows LLama3-8B inference from 113 tokens/sec to 96 tokens/sec on an A100; a post-hook cast might improve speed (see the sketch after this list).
  • Gemlite faces BF16 atomic add limitations: A user is facing issues with bfloat16 atomic add in the gemlite kernel, which requires sm_90 or higher.
    • They are investigating casting as a post-hook in Triton; they need a custom op because prune_configs_by is not supported by torch.compile.
  • Flash Attention 1 Kernel Faces Discrepancies: A user implementing Flash Attention 1 as a first kernel in triton reported that it works with TRITON_INTERPRET=1 but it has a few elements mismatched on cuda.
    • After increasing rtol and atol the tests passed, suggesting the CPU (interpreter) and GPU results are accumulated in different orders, and floating-point addition is not associative.
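A minimal sketch of that fp32-accumulate-then-cast workaround on a hypothetical scatter-add (illustrative, not the gemlite kernel): atomics target a float32 buffer, and the bfloat16 cast happens once at the end as the post-hook.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scatter_add_kernel(src_ptr, idx_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    # Load bf16 inputs and widen: fp32 atomic_add works below sm_90.
    val = tl.load(src_ptr + offs, mask=mask, other=0.0).to(tl.float32)
    idx = tl.load(idx_ptr + offs, mask=mask, other=0)
    tl.atomic_add(out_ptr + idx, val, mask=mask)

def scatter_add_bf16(src: torch.Tensor, idx: torch.Tensor, out_size: int):
    # Accumulate in float32, then cast back to bfloat16 as a post-hook.
    out = torch.zeros(out_size, device=src.device, dtype=torch.float32)
    grid = (triton.cdiv(src.numel(), 1024),)
    scatter_add_kernel[grid](src, idx, out, src.numel(), BLOCK=1024)
    return out.to(torch.bfloat16)
```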


GPU MODE ā–· #cuda (42 messagesšŸ”„):

WMMA instructions, PyTorch RTX 5080 CUDA 12.8 Support, Flash Attention Optimization, Hopper Architecture Swizzle, CUDA Performance Counters Permission Error

  • WMMA instructions compile to MMA: It’s confirmed that WMMA instructions are indeed ā€œwrappersā€ that compile directly to HMMA/IMMA/QMMA instructions in SASS, similar to how MMA instructions function, as shown on the CUDA Godbolt.
  • RTX 5080 PyTorch Support Emerges with CUDA 12.8 Patch: A developer released a patch enabling full CUDA 12.8 + PyTorch 2.5.0 compatibility with the Blackwell / sm_120 architecture for the RTX 5080, providing a GitHub repo with scripts, diffs, and instructions.
  • Flash Attention’s Memory Efficiency: In Flash Attention, tensors are stored as (batch_size, N, num_heads, d), which are contiguous in d (typically > 64), enabling efficient global memory coalescing where each thread loads 16B of data.
  • Hopper’s Swizzle Layout Explained: The documentation’s description of the 64B swizzle in the Hopper architecture confuses many, but it’s clarified to be a 64B (bytes) swizzle in which each square is 128b (bits), translating to an 8x64 tile for 8-bit dtypes and an 8x32 tile for 16-bit types (sketched after this list).
  • Solving CUDA Permission Errors on Linux: When encountering ERR_NVGPUCTRPERM, which indicates a lack of permissions to access NVIDIA GPU Performance Counters, users on Linux might need to run the command with sudo, though the linked NVIDIA documentation should also be consulted for comprehensive solutions.
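A rough sketch of what such a swizzle does, permuting 128-bit (16-byte) chunks within 64-byte groups; the exact bit mapping is in the PTX documentation, so treat this as an illustration of the pattern rather than the official formula:

```python
def swizzle_64b(row: int, byte_col: int) -> int:
    """Illustrative 64B swizzle: each row is split into 16B (128-bit) chunks;
    the chunk index is XORed with low bits of the row so that consecutive
    rows hit different banks. 64 bytes per row covers 64 8-bit elements or
    32 16-bit elements, matching the 8x64 / 8x32 tiles mentioned above."""
    chunk, offset = divmod(byte_col, 16)  # which 128-bit square, byte within it
    return (chunk ^ (row % 4)) * 16 + offset  # 4 chunks per 64B group

for row in range(4):
    print([swizzle_64b(row, c * 16) for c in range(4)])
```

Each printed row is a distinct permutation of the chunk offsets {0, 16, 32, 48}, which is exactly the property that avoids bank conflicts across rows.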

GPU MODE ā–· #torch (5 messages):

torch.compile() graph breaks, VRAM reduction techniques, FA3 attention FP8

  • torch.compile() and Graph Breaks: An Investigation: A user inquired about how to check for graph breaks when using torch.compile(), noting that tlparse logs yielded missing metrics.
    • They noted that training runs fine with torch.compile(model, fullgraph=True) and asked whether this means there are no graph breaks (see the note after this list).
  • VRAM Usage Gets Slimmer: A user outlined techniques to reduce VRAM usage, including folding the optimizer step into backward (with a link to a PyTorch tutorial) and offloading optimizer states to the CPU via torchao.
    • They also mentioned partially offloading optimizer states with BNB paged optimizers, and pointed to a TorchTune page on memory optimization, referencing a table summarizing components like Model Precision, Activation Checkpointing, and Activation Offloading.
  • Serialized Compiled Models Remain Elusive: A user shared a GitHub issue about the inability to save/load compiled models and asked if anyone is actively working on it.
    • The issue describes the bug as Serializing a compiled model with pickle fails.
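On the fullgraph question: torch.compile(model, fullgraph=True) raises an error at the first graph break, so a training run that completes under it does imply a single graph. For an explicit count, torch._dynamo.explain reports breaks and their reasons; a minimal sketch, assuming a recent PyTorch (2.1+ exposes ExplainOutput with these fields):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
x = torch.randn(4, 16)

# fullgraph=True errors out on any graph break, so success implies none.
compiled = torch.compile(model, fullgraph=True)
compiled(x)

# explain() runs Dynamo tracing and reports breaks without compiling kernels.
explanation = torch._dynamo.explain(model)(x)
print(explanation.graph_break_count)  # 0 for this model
print(explanation.break_reasons)      # empty when there are no breaks
```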


GPU MODE ā–· #announcements (1 messages):

Tanishq Kumar, Scaling Laws for Low Precision, Precision-aware scaling laws, post-training quantization, compute optimal

  • Tanishq Kumar Talk on Scaling Laws Incoming: In about 3 hours, Tanishq Kumar will discuss his paper on ā€œScaling Laws for Low Precisionā€ which introduces precision-aware scaling laws for training and inference.
  • Lower Precision Training Scaling Laws: The paper proposes that training in lower precision reduces the model’s effective parameter count, enabling the prediction of additional loss from low precision training and post-train quantization.
    • It suggests that training larger models in lower precision may be compute optimal.
  • Quantization Degradation: The research indicates that the degradation from post-training quantization escalates as models train on more data, potentially making additional pretraining data detrimental.
    • The study unifies scaling laws for post and pretraining quantization to predict degradation from training and inference in varied precisions, validated on models up to 1.7B parameters trained on 26B tokens.
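As a rough sketch of the paper’s functional form (a simplification from memory; exact parameterization and constants are in the paper), the loss follows a Chinchilla-style law in which training precision P shrinks the effective parameter count:

```latex
% Schematic precision-aware scaling law: A, B, E, \alpha, \beta, \gamma are fitted.
L(N, D, P) \approx \frac{A}{N_{\mathrm{eff}}^{\alpha}} + \frac{B}{D^{\beta}} + E,
\qquad
N_{\mathrm{eff}}(P) = N\left(1 - e^{-P/\gamma}\right)
```

This captures both claims above: at fixed compute, lower precision cuts N_eff, so a larger N at lower precision can come out compute optimal, while the degradation from post-training quantization is modeled as a separate term that, per the summary, grows as the model trains on more data.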

Link mentioned: Scaling Laws for Precision: Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise ā€œprecision-awareā€ scaling la…


srivarshan4271: https://lights0123.com/blog/2025/01/07/hip-script/


GPU MODE ā–· #jobs (1 messages):

AI & Neuroscience Fellowship at the University of Oxford, AI / RL in games and neuroimaging, non-invasive diagnosis and treatment of neurological disorders

  • Oxford U Opens AI & Neuroscience Fellowship: The University of Oxford has a new opening for a research fellow (postdoc level or equivalent experience) to work on AI / RL in games and neuroimaging with Rui Ponte Costa.
    • The salary will be Ā£100k+, with slight adjustments based on experience level, at the Centre for Neural Circuits and Behaviour.
  • AI Powers Systems-Behavioral Neuroscience: The fellowship develops an AI-powered technology that can infer the contributions of specific brain regions to behavior by analyzing gameplay data, enabling non-invasive diagnosis and treatment of neurological disorders.
    • Their approach leverages state-of-the-art deep reinforcement learning models, specifically MuZero and Dreamer architectures (project link).
  • Pillar VC backs AI for Science Fellows: Pillar VC and ARIA are backing AI fellows to spend one year embedded in top science labs in ARIA’s Opportunity Spaces across the UK.
    • They seek the next generation of founders, scientists, and leaders building AI for science (fellowship link).


GPU MODE ā–· #beginner (56 messagesšŸ”„šŸ”„):

GPU/CUDA learning resources, Warp scheduler significance, Context switching, SIMD vs SIMT execution, Flash attention setup on Windows

  • GPU Glossary Sparks CUDA Confusion: A member learning about GPUs/CUDA from the Modal GPU glossary expressed confusion about warp schedulers and context switching, specifically about the point of context switching if each thread shares the same instruction pointer.
    • Another member explained using an example of 64 threads in two groups, showing how the scheduler executes one warp while another waits for data, similar to CPU context switching but without state storage overhead.
  • SIMT Demystified: Data Differentiates Threads: A member clarified that while threads in a warp share the same instruction, the data differs, enabling SIMT (Single Instruction, Multiple Threads) execution where 32 threads can multiply 32 elements in one clock cycle.
    • They emphasized that a group of 32 threads is scheduled at once, and context switching brings in a different group of 32, rather than scheduling individual threads one after another.
  • Flash Attention Frustrations on Windows VM: A member encountered issues setting up the flash attention repo locally within a Windows/Ubuntu VM, struggling with nvcc version conflicts and potential disruption to existing CUDA/Torch/Triton setups.
    • Considering vast.ai for development, they sought recommendations on suitable machines for Triton/CUDA work and guidance on choosing a machine to train a BERT model with custom kernels.
  • CUDA Core Confusion Corrected: A member explained that NVIDIA’s marketing term ā€œCUDA coresā€ actually refers to FP32 units, which behave like SIMD lanes and cannot run independently.
    • Warps from different kernels can be scheduled to the same Streaming Multiprocessor (SM) in a finely time-sliced fashion, especially beneficial when threads are waiting for data loads.
  • Streaming Multiprocessor Architecture Deep Dive: A member clarified that multiple thread blocks can run on one Streaming Multiprocessor (SM), which is crucial for block synchronization, allowing the SM to have warps ready to run while others await a barrier, referencing H100 Streaming Multiprocessor.
    • They explained that resources like registers and shared memory determine the number of resident thread blocks, and the warp scheduler context switches between warps to keep processing units busy.

Link mentioned: GPU Glossary: A glossary of terms related to GPUs.


GPU MODE ā–· #pmpp-book (1 messages):

Amazon Book Release Date, 5th Edition of Book

  • Fifth Edition Release Date Spotted on Amazon: A member reported seeing a 5th edition of an unspecified book listed on Amazon with a scheduled release date of February 2026.
  • Release Date Unconfirmed: Another member requested confirmation of this release date.

GPU MODE ā–· #jax (1 messages):

bigfoot1144: Any progress so far?


GPU MODE ā–· #rocm (2 messages):

ROCm, tilelang HIP backend, row-row bank conflict-free swizzle, AMD sponsoring cards

  • Seeking ROCm Row-Row Bank Conflict-Free Swizzle Implementation: A member is seeking ROCm experts to help implement a row-row bank conflict-free swizzle for the tilelang HIP backend.
    • Currently, they only have solutions for NT layout conflict swizzling, and are requesting assistance from the community.
  • AMD Card Sponsorship Plea for ROCm Development: The same member jokingly requested that AMD sponsor some cards for development related to ROCm.
    • This highlights the resource constraints faced by some developers in the ROCm ecosystem.

GPU MODE ā–· #lecture-qa (2 messages):

Hopper Flops, H100 Clock Rate, H100 SMs, Nvidia Boost Clocks

  • H100’s Dense FLOPs Revealed: For fp16/bf16, dense throughput on Hopper is 989 TFLOPS, with an H100 clock rate of 1.830 GHz and 132 SMs.
    • FLOPs/clock/SM = (989 Ɨ 10³ GFLOP/s) / (1.83 GHz Ɨ 132 SMs) ā‰ˆ 4096 (see the check after this list).
  • Nvidia’s Seldom-Mentioned Boost Clock Detailed: The H100 SXM has a boost clock of 1.980 GHz for normal SM operation, but if you use tensor cores it drops down to 1.830 or lower depending on power draw/thermals.
    • There are some rare conditions where you get the full boost clock when running TC ops but strangely that’s not always the case.
  • Official Hopper Boost Clock Document Located: A document was shared which mentions the different boost clocks (GTC22 Whitepaper).
    • The different boost clocks can be found in table 3, page 39 of the document.
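A quick sanity check of the arithmetic above:

```python
# FLOPs per clock per SM for H100 fp16/bf16 dense tensor-core math.
tflops, clock_ghz, num_sms = 989, 1.830, 132
per_clock_per_sm = (tflops * 1e12) / (clock_ghz * 1e9) / num_sms
print(per_clock_per_sm)  # ~4094, i.e. effectively 4096 (a power of two)
```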

Link mentioned: NVIDIA H100 Tensor Core GPU Architecture Overview: A high-level overview of NVIDIA H100, new H100-based DGX, DGX SuperPOD, and HGX systems, and a H100-based Converged Accelerator. This is followed by a deep dive into the H100 hardware architecture, ef…


GPU MODE ā–· #tilelang (10 messagesšŸ”„):

Tilelang 2:4 sparsity support, Tilelang v0.1.3 Release, SPGEMM issue

  • Tilelang to Support 2:4 Sparsity: Tilelang plans to support 2:4 sparsity, leveraging Cute as a backend, although the user acknowledges its current uncommonness in AI workloads.
    • A user expressed interest in fine-tuning 2:4 sparse LLMs, noting its success with vision models, but uncertainty about its impact on LLM accuracy.
  • Tilelang v0.1.3 lands with Cute Upgrades: Tilelang released v0.1.3, featuring enhancements, optimizations, and bug fixes, including Cute upgrades.
    • The release includes new kernels and tutorials such as DeepGEMM, plus autotuning and kernel caches, among other new features.
  • Request to add SPGEMM Issue: A TileLang dev requested that users interested in trying Tilelang for SPGEMM should open an issue on GitHub.
    • A user indicated that they would be interested in seeing progress on this if the dev team investigates further.

Link mentioned: Release v0.1.3 Ā· tile-ai/tilelang: What’s Changed: [Docker] Add libstdcxx-ng-12 to Dockerfiles for CUDA versions by @LeiWang1999 in #160; Add cpu jit with backend ctypes by @xs-keju in #154; [Carver] Multi-Threads Compilation for Fast…


GPU MODE ā–· #metal (3 messages):

Parallelized Cholesky, Python + MLX + Metal

  • Parallelized Cholesky accelerates with Python + MLX + Metal: A member shared their contribution to the community: a super high speed parallelized cholesky in python + MLX + Metal, along with an attached python file.
    • Another member commented this is really cool.
  • MLX gains momentum: The community sees growing interest in the MLX framework for Metal.
    • MLX seems to be unlocking new possibilities in high-speed computing.

GPU MODE ā–· #self-promotion (10 messagesšŸ”„):

WheelNext Initiative, CUDA Indexing Blogpost, Container-First Triton Development, GemLite bfloat16 Support

  • WheelNext Gears Up to Enhance Python Packaging: The WheelNext initiative (wheelnext.dev) aims to improve the user experience in the Python packaging ecosystem, focusing on scientific computing and machine/deep learning.
    • A meetup was announced to discuss making shipping python packages with native accelerator code much easier, with details available on Discord.
  • Dive into CUDA Indexing with New Blogpost: A member shared a blog post explaining CUDA indexing with a 2D block tiling example for matrix multiplication, emphasizing row-major format.
    • The post details how a 2D array A with shape (M, N) in CUDA is linearized in row-major format, mapping coordinate (i, j) to i * N + j (see the sketch below).
  • Container-First Approach Streamlines Triton Development: A member highlighted a new blog post about using containers to simplify and accelerate Triton kernel development.
    • The post emphasizes how containerization enhances the Triton development workflow by simplifying setup, increasing consistency, and enabling more seamless collaboration.
  • GemLite Adds bfloat16 Support for Gemma Models: GemLite now supports bfloat16 on both Hopper and non-Hopper GPUs, enabling the running of Gemma models in vllm via hqq.
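The row-major mapping from the indexing post, as a runnable check (toy sizes):

```python
import numpy as np

M, N = 4, 6
A = np.arange(M * N).reshape(M, N)  # NumPy is row-major by default, like CUDA C
flat = A.ravel()

# Element (i, j) of an (M, N) row-major array lives at linear offset i*N + j.
for i in range(M):
    for j in range(N):
        assert flat[i * N + j] == A[i, j]

# The inverse map, as a kernel would compute (i, j) from a global thread id:
tid = 17
i, j = divmod(tid, N)
print(i, j, A[i, j])  # 2 5 17
```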

Links mentioned:

  • WheelNext: no description found
  • Tweet from mobicham (@mobicham): GemLite now supports bfloat16 on both Hopper and non-Hopper gpus 🫔https://github.com/mobiusml/gemlite/pull/24
  • Indexing in CUDA: In this blogpost I want to explain what it means for a matrix to be in row major format. This is essential to understand CUDA kernels and their methods ...
  • A container-first approach to Triton development: The Triton project from OpenAI is at the forefront of a groundbreaking movement to democratize AI accelerators and GPU kernel programming.Ā  It provides a powerful and flexible framework for writi...

GPU MODE ā–· #šŸæ (1 messages):

LLM Kernel Understanding, RL for Operation Understanding, Reducing Hallucinations in Kernel Creation

  • LLMs Demystify Kernel Code: The idea is to use LLMs to understand kernel code, explaining simple concepts and variable states at specific places in tensors.
    • This aims to ensure the LLM grasps the underlying operations.
  • RL Supercharges Kernel Operation Grasp: Employ Reinforcement Learning (RL) to enhance the model’s understanding of operations, ensuring a solid grasp.
    • This solid grasp of kernel operations can serve as a prerequisite for creating complex kernels and potentially reducing hallucinations.
  • Kernel Creation Sanity Check with LLMs: Using LLMs to verify and explain kernel operations could greatly reduce hallucinations during the complex kernel creation process.
    • Such a method could be seen as a sanity check for complex kernel code and design.

GPU MODE ā–· #reasoning-gym (5 messages):

veRL rollouts with sglang, low precision data types, quantization strategies for RL, ARC-AGI2 announcement

  • veRL rolls out sglang support: veRL now supports rollouts with sglang as shown in this paper.
  • Tiny Model Reasoning with GRPO: A study showed reinforcement learning (RL) improving reasoning in small language models (LLMs), specifically a 1.5B parameter model trained on 4 NVIDIA A40 GPUs in 24 hours.
    • Adapting the Group Relative Policy Optimization (GRPO) algorithm on a curated dataset, the model achieved significant gains, such as AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, with a training cost of only $42.
  • ARC-AGI2 frontier benchmark: A member shared the ARC-AGI-2 announcement, a frontier AGI benchmark challenging AI reasoning systems.
    • The goal is to achieve 85% accuracy with ~$0.42/task efficiency, contrasting sharply with current performance levels of base LLMs at 0% and reasoning systems at under 4%.


GPU MODE ā–· #gpuęØ”å¼ (5 messages):

CUDA core, CUDA_fp6.hpp, CUDA_fp4.hpp

  • CUDA core’s fp4 and fp6 use cases requested: A member inquired about which libraries utilize the fp4 and fp6 data types within CUDA, referencing the cuda_fp6.hpp and cuda_fp4.hpp header files present in version 12.8.
    • They noted difficulty locating libraries that actively employ these headers in practice.

GPU MODE ā–· #general (9 messagesšŸ”„):

Submission Guide, Kernel profiling, Conv2D error

  • Submission Guide Available: A member asked for a submission guide and another member shared a link to the documentation for the GPU kernel leaderboard, which is a competition platform on Discord where users can submit their own kernel implementations.
  • Kernel Profiling Coming Soon!: A member asked if it was possible to profile their triton kernel via the bot itself.
    • The response was that we do not currently have that possibility, but it’s in store and you (most likely) can expect it for the first problem set launch.
  • Conv2D Submission Error: A member reported getting a consistent error when submitting to conv2d involving subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1 and asked if it meant their CUDA source couldn’t compile.
    • The member was new to CUDA and C++ and was seeking assistance from the community.

Link mentioned: Getting Started | GPU MODE Kernel Leaderboard: Welcome! If you are excited about building GPU kernels, this leaderboard is the place for you! We


GPU MODE ā–· #submissions (119 messagesšŸ”„šŸ”„):

matmul benchmarks on H100, grayscale benchmarks on A100, grayscale benchmarks on T4, L4, A100, H100, histogram benchmarks on T4, vectorsum tests on A100

  • Modal Runners Ace Matmul Benchmarks on H100: Numerous matmul benchmarks and tests using Modal runners on H100 GPUs have succeeded, with submission IDs ranging from 2479 to 2487.
    • These submissions indicate successful execution and integration of Modal runners for matrix multiplication tasks on high-performance GPUs.
  • Grayscale Gauntlet on A100 GPUs: A multitude of grayscale benchmark and leaderboard submissions have succeeded on A100 GPUs using Modal runners, with submission IDs spanning from 2488 to 2596 and beyond.
    • These consistent successes highlight the reliability and efficiency of Modal runners for image processing tasks on A100 GPUs.
  • Grayscale Greatness Across GPUs: Leaderboard submissions for grayscale using Modal runners have succeeded across various GPUs, including T4, L4, A100, and H100, with an initial submission ID of 2484.
    • This demonstrates the versatility of Modal runners in handling image processing tasks on diverse GPU architectures.
  • Histogram Hit on T4 GPUs: A histogram benchmark submission with ID 2765 using Modal runners on T4 GPUs has succeeded.
    • This indicates successful execution of histogram computation tasks on T4 GPUs utilizing the Modal runners platform.
  • Vector Sum Victory and Conv2d Conquest on A100: Test submissions for vectorsum and conv2d have succeeded on A100 GPUs using Modal runners with IDs 2829 and 2830.
    • These successful tests highlight the capability of Modal runners in handling vector operations and convolutional tasks on high-performance GPUs.

GPU MODE ā–· #status (2 messages):

CUDA, load_inline(), PyTorch headers, KernelBot

  • load_inline() Timed Out Due to Excessive PyTorch Headers: CUDA submissions using load_inline() were timing out because about 5K PyTorch headers were being added, as investigated in this PR.
    • A new mode was added to disable implicitly adding headers; one member got an example’s compile time down from 90s to 15s, and a colleague got it from 15s to 5s (see the sketch after this list).
  • KernelBot leaderboard performance improved: The KernelBot leaderboard supports custom CUDA extensions via load_inline(), which previously resulted in cold starts of up to 90s.
    • A member stated that they always thought it was a cuda problem, and was happy it could be solved.
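A sketch of what using the new mode might look like; the no_implicit_headers keyword name is taken from the PR title and is an assumption here, and the kernel is a placeholder:

```python
import torch
from torch.utils.cpp_extension import load_inline

# The slow path was ~5K transitively included PyTorch headers; with implicit
# headers disabled, the source includes only what it actually needs.
cpp_source = """
#include <torch/extension.h>
torch::Tensor add_one(torch::Tensor x) { return x + 1; }
"""

mod = load_inline(
    name="add_one_ext",
    cpp_sources=cpp_source,
    functions=["add_one"],
    no_implicit_headers=True,  # assumed kwarg, per PR #149480's description
)
print(mod.add_one(torch.zeros(3)))  # tensor([1., 1., 1.])
```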

Link mentioned: load_inline no_implicit_headers mode by msaroufim Ā· Pull Request #149480 Ā· pytorch/pytorch: In the kernelBot leaderboard we support people competing with custom cuda extensions via load_inline(), however even on toy kernels this can result in cold starts of up to 90s - this problem is pri…


GPU MODE ā–· #hardware (17 messagesšŸ”„):

GPU prices, VRAM requirements for LLMs, RTX Pro 6000, CUDA Capability

  • GPU Prices Skyrocket Amid AI Boom: High-end consumer GPUs are becoming increasingly expensive due to NVIDIA’s strategy of limiting high VRAM to those models, but cloud vendors like vast.ai and Nebius offer cheaper alternatives for running models.
    • One member stated, ā€œwelcome to the ai boom,ā€ highlighting the impact of AI on GPU pricing and availability.
  • Max out budget on older GPUs, run stuff locally: For local machine learning, investing in older cards like 3090 or 4090 is suggested for maximizing budget, with 2x3090 potentially outperforming a single newer card, allowing for local distributed training.
    • The assertion was made that older cards provide opportunities to learn distributed stuff locally.
  • Nvidia desensitizes users to high prices: The new RTX Pro 6000, with 96GB VRAM, is considered a reasonable option for professionals, normalizing the perception of high GPU costs, although it lacks NVLink.
    • One member noted, ā€œActl i think nvidia has successfully desensitized me to their insance prices,ā€ suggesting an adjustment in expectations due to market trends.
  • GDDR7 memory: The RTX Pro 6000 features 96 GB GDDR7 with ECC and 1792 GB/sec bandwidth, although the CUDA API versions reported by the Data Sheet and by TechPowerUp (TPU) disagree.
    • The specs report Compute APIs as CUDA 11.6 while TPU claims CUDA 10.1, and the member highlighted that the CUDA GPUs list shows the GeForce RTX 50 series with compute capability 10.0 instead of 12.0.

GPU MODE ā–· #tpu (1 messages):

rocka2424: This is awesome, looking forward to it!


Interconnects (Nathan Lambert) ā–· #news (86 messagesšŸ”„šŸ”„):

Nvidia Mamba-Transformer Hybrid, Qwen 2.5 Omni Model, DeepSeek V3 Model Update, Reve Image Halfmoon Model, Qwen2.5-VL-32B-Instruct

  • Nvidia engineers a Nemotron-H Mamba-Transformer hybrid: Nvidia introduced the Nemotron-H family of models, including a series of 8B and 47-56B models that are hybrid Mamba-Transformer models, offering improved inference speed compared to other models, according to their research.
  • Qwen Debuts Qwen2.5-Omni: An End-to-End Streaming Multimodal Model: Qwen released Qwen2.5-Omni, a multimodal model designed to perceive text, images, audio, and video, while generating text and natural speech responses in a streaming manner, according to HuggingFace.
  • DeepSeek V3 Gets a Quick Update, Still Rocks Leaderboards: DeepSeek announced a small version upgrade for the DeepSeek V3 model, with the API interface and usage method remaining unchanged, according to their HuggingFace page.
  • Reve Image Launches Halfmoon: Claims Top Spot in Image Generation: Reve Image launched Halfmoon, claiming it’s the best image model in the world, with impressive text rendering, prompt adherence, and aesthetics, currently accessible through their website, according to their announcement.
  • Qwen Drops Qwen2.5-VL-32B-Instruct: Open Source VL Model with RLHF: Qwen open-sourced the Qwen2.5-VL-32B-Instruct model under the Apache 2.0 license, optimized with reinforcement learning, showing significant improvements in human preference and mathematical reasoning, according to their blog.


Interconnects (Nathan Lambert) ā–· #ml-questions (25 messagesšŸ”„):

Impact of noisy data in multi-turn SFT, Transformer usage in RL, Community model preferences, Trusting eval benchmarks, Gemini's image generation

  • Noise Tolerated in Multi-Turn SFT?: A member questioned how much noise impacts data quality in multi-turn SFT, especially with complex agent trajectories, suggesting that some noise is tolerable, recovery steps are valuable, and erroneous turns can be masked.
    • They shared that it’s difficult to collect perfect trajectories when the complexity and step count increases, like making a wrong decision about which site to go to for information or which application to use to open a file.
  • Transformers Slow to Take Over RL?: A member inquired about the limited use of Transformers in RL policy models, suspecting it’s due to compute and memory constraints.
    • They are having trouble finding many papers where a small Transformer is actually used.
  • Community Prefers Claude 3.5 for Code?: A member asked if Interconnects publishes community-preferred model lists, noting their preference for Claude 3.5 over Claude 3.7 for code, but the opposite for reasoning.
    • Another member mentioned that Interconnects does not publish model lists, but they hope to add more evals to their artifacts logs series when possible.
  • Private Evals > Benchmarks: Multiple members discussed trusting model eval benchmarks, with one stating Don’t trust them; have my own eval, and recommending creating a markdown file with 5-10 prompts that you care about.
    • The suggestion was to run prompts with multiple models side by side in tools such as Chorus to quickly get a feel which model is good for which things.
  • Gemini’s Generator a Mystery?: A member inquired whether the new Gemini’s image generation is autoregressive or uses a diffusion head, but its architecture remains unknown.
    • Another member mentioned that labs know which websites to include to boost common benchmarks during training.

Link mentioned: Building on evaluation quicksand: On the state of evaluation for language models.


Interconnects (Nathan Lambert) ā–· #random (36 messagesšŸ”„):

LLM input/output tokens, o1-pro performance, Mistral 24B is impressive, Claude Compass starter prompts, DAPO and Dr. GRPO

  • LLMs Count Input and Output Tokens: In LLMs, both input tokens and output tokens are counted during Supervised Fine-Tuning (SFT), clarifying an initial question about token handling.
    • A member confirmed the token counting and humorously remarked that, ā€œWith the cost of those tokens, he could’ve bought the NYT.ā€
  • o1-pro Dominates Extended NYT Connections Benchmark: o1-pro set a new record on the Extended NYT Connections benchmark with a score of 81.7, surpassing the previous champion, o1 at 69.7, as noted in a tweet.
    • The benchmark is a more challenging version of the original, with additional words per puzzle.
  • Mistral 24B Impresses Community, Reputation Recovers: The release of Mistral 24B is considered a major highlight, praised for its strength and accessibility of the base model, and the promise of new open releases under Apache 2.0 is aiding in reputation recovery.
    • One member stated, ā€œMistral 24B is probably one of the greatest releases in the last months, incredibly strong model and you have access to the base model as well.ā€
  • Claude Compass Launches Prompts: A member shared a tweet of Claude Compass’s starter prompts which are deep research prompts such as ā€˜Find credible sources for my research’ and ā€˜Analyze great investment pitches’.
    • It was also noted that another company named Cohere already has a product named Compass.
  • DAPO and Dr. GRPO Papers: A member is mastering DAPO and Dr. GRPO for an upcoming blog post, planning to review relevant papers and improve the RLHF book implementation section on tradeoffs.
    • The notes are complete, and the member is considering covering DAPO and Dr. GRPO together, possibly deferring the rest to a future post.


Interconnects (Nathan Lambert) ā–· #memes (4 messages):

O1-pro vs BoN, O1-pro reasoning paths marginalization, Tech CEOs in Open Source RL

  • O1-pro excels in Reasoning Path Merging: A member suggested that O1-pro seems to merge reasoning paths toward correct answers rather than doing simple BoN (Best-of-N) sampling.
    • They noted that o1-pro’s output length is usually much longer than o1’s, but didn’t know how the reasoning paths would be marginalized.
  • Tech CEOs champion Open Source RL: Nathan Lambert shared a post that stated major tech company CEOs are arguing for very cutting edge defaults in open-source RL repos.
    • He concluded that this timeline is amazing.


Interconnects (Nathan Lambert) ā–· #rl (127 messagesšŸ”„šŸ”„):

R1-Zero Training, GRPO Bias, LOOP & RLOO, PPO Objective, Creative Writing LLMs

  • Row Mean’s Length Bias Unmasked in R1-Zero Training: An analysis reveals that using row mean in R1-Zero-like training introduces a bias, favoring shorter correct responses and longer incorrect ones, as detailed in a paper and accompanying code.
    • Switching to all mean yields comparable performance without increasing length, calling into question plots that correlate increasing reasoning length with increasing capability (see the sketch after this list).
  • GRPO’s Length Explosion Problem Plagues Practitioners: Users observed length explosion in their GRPO runs, prompting consideration of techniques like length curriculum or clipping, though these are seen as unsatisfactory band-aids.
    • The core issue is that garbage responses are generated as responses grow longer, implying a deeper problem than length alone.
  • Prefix Caching for vLLM Causes RL Issues: Members found that prefix caching for vLLM may be causing RL issues as stated in this github issue.
    • Specifically, inference was worse than training and identified this caching as the culprit, demonstrating a subtle issue that may be overlooked.
  • LOOP and RLOO Arise from Unbiasing Dr. GRPO: It was suggested that Dr. GRPO still has a bias that grows more pronounced as the group size shrinks; to make it unbiased, simply multiply Dr. GRPO’s A_i by the correction term N/(N āˆ’ 1), resulting in LOOP (Leave-One-Out Proximal Policy Optimization), detailed in the Dr GRPO paper.
    • Removing PPO’s clipping yields RLOO (Reinforce Leave-One-Out).
  • Deviation-Based DPO Diversifies Creative LLM Writing: A new paper explores promoting both output diversity and quality in creative writing LLMs, by including deviation in the training objective to facilitate learning from rare high-quality instances.
    • The study adopts this approach to Direct Preference Optimization (DPO) and Odds Ratio Preference Optimization (ORPO).
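To make the aggregation difference concrete, a toy sketch with synthetic tensors (illustrative of the bias, not the paper’s exact objective):

```python
import torch

# Per-token losses for a group of G sampled responses padded to length T;
# mask marks real (non-pad) tokens, with responses of length 2, 4, 6, 8.
G, T = 4, 8
loss = torch.rand(G, T)
lengths = torch.tensor([[2], [4], [6], [8]])
mask = (torch.arange(T) < lengths).float()

# "Row mean": normalize each response by its own length, then average.
# Tokens in short responses get more weight; long responses are diluted.
row_mean = ((loss * mask).sum(dim=1) / mask.sum(dim=1)).mean()

# "All mean": one normalization over all valid tokens in the group,
# removing the per-response length reweighting.
all_mean = (loss * mask).sum() / mask.sum()

print(row_mean.item(), all_mean.item())  # differ unless all lengths match
```

The row-mean version is the one the analysis flags: gradients from short correct responses are up-weighted per token, and long incorrect responses are penalized less per token.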


Interconnects (Nathan Lambert) ā–· #cv (6 messages):

Operator agent limitations, Infinibranch Browsers as a solution, Intelligent Browser Automation

  • Operator Agents Lack Managerial Skills: Members discussed the limitations of Operator agents, noting they struggle with complex tasks requiring coordination, such as extracting information from datasets; one person commented on needing a manager agent that tells one operator agent to get the details for one dataset.
    • One member expressed frustration with the limited success rate, achieving only 4 out of 10 tasks with the operator and 6 with deep research.
  • Infinibranch Browsers Reach 80% Success: A possible solution using Morph Cloud’s Infinibranch Browser was suggested to help scale browser-use agents, improving the success rate to approximately 80% on tasks like finding Amazon links for a list of books.
    • The original poster on X, Andrew Carr, needed to extract links for 1000+ books into a Google sheet, which Operator was unable to handle.
  • Morph Cloud Scales Autonomous Browser Workflows: Morph Cloud allows users to snapshot and branch complete browser states, including authentication and cookies, making it easier to scale autonomous browser workflows across multiple parallel instances.
    • The blogpost further explains how traditional web scraping methods have become obsolete because of JavaScript-heavy single-page applications, dynamic loading and infinite scroll, complex user interactions required to access data, CAPTCHAs and sophisticated bot detection, and multi-step workflows that require understanding context.


Interconnects (Nathan Lambert) ā–· #reads (16 messagesšŸ”„):

R1-Zero-Like Training, DeepSeek-V3-Base, GRPO Bias in RL-tuning, CoT Philosophy, Math errors in AI papers

  • R1-Zero Training: New Insights Emerge: A Twitter thread highlights key observations about R1-Zero-like training, suggesting DeepSeek-V3-Base shows an ā€˜Aha moment’ before RL-tuning.
    • The researchers point to a potential bias in GRPO contributing to ever-increasing output length, detailing findings in a paper and providing code.
  • GRPO Loss Implementation Analysis: Multiple papers this week discuss the 1/|o| length-normalization term and its impact on longer examples, suggesting that the loss penalizes long, repetitive behaviors less while not rewarding long, exploratory generations sufficiently.
    • They note that per-question normalization punishes hard questions within a batch.
  • Chain of Thought and Reasoning: A member questioned if advancements are truly about reasoning or if they leverage tokens to overcome inefficiencies in task-specific next-token completion/search.
    • Another suggested the viability of Chain of Thought as a form of language model reasoning, describing reasoning as very broad.
  • Mathematical Concerns about paper calculations: There was discussion in an AI2 Slack channel suggesting potential errors or anomalies in the math presented in the paper.
    • Some members expressed confusion regarding the paper’s argument about length normalization bias, with further discussion occurring in a linked channel with a member providing an explanation.

Link mentioned: Tweet from Zichen Liu (@zzlccc): šŸŖ‚Understanding R1-Zero-Like Training: A Critical Perspective* DeepSeek-V3-Base already exhibits ā€œAha momentā€ before RL-tuning??* The ever-increasing output length in RL-tuning might be due to…


Interconnects (Nathan Lambert) ā–· #lectures-and-projects (2 messages):

Claude PR, Header Copy Links

  • Claude Sends Pull Request for Header Copy Links: A member shared a pull request made by Claude for adding header copy links to a GitHub repository.
  • Header Copy Links Amaze: Members found the header copy links that appear on hover to be interesting and useful.
    • They attached a screenshot of the links, noting that they worked immediately with claude code.

Link mentioned: (experimental) Add heading anchor links for easy section linking by natolambert Ā· Pull Request #82 Ā· natolambert/rlhf-book: Add copyable links to all headings that appear on hoverLinks copy the current URL with fragment identifier to clipboardAdd CSS for styling the anchor linksUpdate Makefile to copy new JS file to …


Interconnects (Nathan Lambert) ā–· #policy (9 messagesšŸ”„):

China's Open Source AI Blitz, DeepSeek's Impact, US vs China AI Competition, Chinese consumer market for software, China commoditizing hardware

  • China Plans Open-Source AI Blitz: According to this tweet, China aims to flood the market with open-source AI models to commoditize AI software and boost its hardware sales.
    • The strategy is to copy, optimize, scale, and undercut Western tech, similar to their approach with manufacturing, with DeepSeek being a key player.
  • DeepSeek Triggers Tech Market Tumble: The release of DeepSeek models temporarily knocked ~$1T off US tech market caps, highlighting the potential impact of Chinese AI on global markets, per this tweet.
    • The founder of DeepSeek (Liang Wengfeng) has met with top Chinese officials, indicating significant state support and access to unlimited resources.
  • China’s AI Competition: A member stated that China’s push in open-source AI is driven by intense domestic competition, aiming to accelerate progress rather than bring down US tech.
    • They added that most top Chinese labs realize open source is the best way to drive progress, since a closed-source model will be irrelevant in 3-6 months or so, so they might as well accelerate.
  • Revenue from Ads and Digital Services Lower in China than US: A member pointed out that Chinese companies aren’t trying to destroy American value as a goal.
    • The ads and digital-services revenue market in China is much smaller than in the US, which also makes open-sourcing more palatable.
  • Chinese Consumers Reluctant to Pay for Software: Chinese consumers generally avoid paying for software and services, with students and professionals being the primary payers.
    • The consumer market is largely dominated by ByteDance and previously by Kimi.

Link mentioned: Tweet from Balaji (@balajis): AI OVERPRODUCTIONChina seeks to commoditize their complements. So, over the following months, I expect a complete blitz of Chinese open-source AI models for everything from computer vision to robotics…


Interconnects (Nathan Lambert) ā–· #expensive-queries (17 messagesšŸ”„):

Grok DeeperSearch, OpenAI Deep Research, Twitter Premium, HF model comparisons

  • Grok DeeperSearch Approaches OpenAI Deep Research: The new Grok DeeperSearch is reportedly ā€œreally goodā€ and close to OpenAI Deep Research in quality, which is impressive considering the short timeframe.
    • The initial Grok DeepSearch was considered ā€œawfulā€ due to hallucinating content from retrieved links, making it the worst implementation, according to some users.
  • Twitter Premium Grants Access to Grok DeeperSearch: Access to Grok DeeperSearch is available with Twitter Premium (the $10 tier), exclusively on the Grok Website.
    • After tweeting about the poor performance of Grok DeepSearch, an individual from xAI contacted one user, leading to improvements in DeeperSearch based on provided chats and benchmarks.
  • Benchmarking Deep(Re)search Implementations: One user maintains a markdown file with a set of questions to test search and research implementations, including Grok DeeperSearch.
    • The benchmark includes a broad shopping query, a specific shopping query, a generic paper search prompt, and a table/benchmark comparison between two models from Hugging Face.
  • Image Generation Benchmarking: A user shared their image generation benchmark, including prompts such as ā€œA woman sitting at a poker table with cards in her handsā€ and ā€œIsometric pixel art of a waterfallā€.
    • These benchmarks help in comparing the performance of different models and would assist future posts.

Link mentioned: Tweet from Tibor Blaho (@btibor91): @TheXeophon-bench


Latent Space ā–· #ai-general-chat (89 messagesšŸ”„šŸ”„):

Gemini Updates, Claude Code New Features, Model Context Protocol (MCP), AI Agents and Email, RF-DETR Object Detection Model

  • Gemini Updates Deconstructed: Gemini’s Dave Citron joined @OfficialLoganK on the Release Notes podcast to discuss recent updates, including personalization, Canvas, Audio Overviews, and Deep Research.
    • The discussion covered topics from recent app launches to the future of personalization in the Gemini app, including insights into user data and privacy considerations.
  • Claude Code Gets Eight New Features: Anthropic launched eight new features for Claude Code to help developers build faster and smarter, documented on their engineering blog.
    • Features include a new ā€œthinkā€ tool, leading to discussion on its implementation and value, with some likening it to Chain of Thought prompting.
  • A16Z’s MCP Ecosystem Deep Dive: A16Z published a deep dive into Model Context Protocol (MCP), exploring its potential as a standard interface for execution, data fetching, and tool calling in AI models.
    • The post examines MCP’s use cases, its challenges, and how it changes the way AI interacts with tools, noting that APIs were the internet’s first great unifier but AI models lack an equivalent.
  • Roboflow Unleashes RF-DETR for Real-Time Object Detection: Roboflow announced RF-DETR, a fully open-source real-time object detection model under the Apache 2.0 license available on GitHub.
    • RF-DETR achieves SOTA performance with over 60 mAP on COCO, with base and large models at 29M and 128M parameters respectively.
  • Browser Use Bags $17M to Build Web for Agents: Browser Use raised $17 million to advance web agents, led by Felicis Ventures, aiming to take web agents to the next level after an initial prototype was built in just four days and launched on Hacker News.
    • The company is hiring top engineers to build the internet for LLMs, promising a challenging environment with a pure software geekery team culture.


Latent Space ā–· #ai-announcements (2 messages):

Rishi Agarwal on Distillation, Swyx's Agent Engineering Talk, Agent Engineering Elements, Agents as ChatGPT's Growth Path

  • Agarwal Surveys Distillation Techniques: Deepmind’s Rishi Agarwal released a short podcast surveying distillation techniques in machine learning.
  • Swyx Launches into Agent Engineering: Swyx launched a new talk and essay on Agent Engineering.
    • The talk was also featured live on the @latentspacepod, highlighting the reasons for going all in on Agents at @aiDotEngineer.
  • Six Agent Engineering Elements Unveiled: The discussion defines Agents (thanks to @simonw) and elaborates on the Six Elements of Agent Engineering.
    • It also examines how Agents could be ChatGPT’s route to reaching 1 billion monthly active users (MAU).

Link mentioned: Tweet from swyx šŸŒ‰ (@swyx): šŸ†• talk + essay: Agent Engineeringhttps://latent.space/p/agentWhy we went all in on Agents @aiDotEngineerDefining Agents (thanks to @simonw)The Six Elements of Agent EngineeringWhy Agents are ChatGPT&…


Latent Space ā–· #ai-in-action-club (226 messagesšŸ”„šŸ”„):

DORA report, Gemini API, AI code generation, Agile adoption, Ruby on Rails

  • Google Cloud’s DORA Report Explores Engineering Excellence: The DORA report by Google Cloud delves into metrics for engineering excellence, though accessing the full report requires signup.
    • Some found the focus on ā€œengineering excellenceā€ to be overly corporate, contrasting it with the ā€œyolo vibe codeā€ often used in prototyping.
  • Discord Mobile App to Show Video Ads: Discord’s mobile app will introduce video ads starting in June, offering advertisers opportunities to showcase trailers and premium content as reported by ArsTechnica.
    • Users expressed concerns about Discord ā€œenshittifyingā€ in preparation for an IPO, drawing parallels to the platform X.
  • Gemini API is a Cheap Loss Leader: Members are finding the Gemini API to be a very cheap API, with one user ā€œsonnet maxxing right now,ā€ and another calls it a ā€œloss leader.ā€
    • There are concerns raised about potential ā€œmodel lockinā€ risks associated with relying on one AI provider and cultural differences between companies.
  • AI Code Generation Replacing Manual Coding: A member mentioned AI is writing 80-90% of their company’s code, and another admits that AI writes 99% of their code these days, resulting in robots doing all the work.
    • Others mentioned their hate for ā€œtemplate reposā€ and that AI is much better at reinventing the wheel for itself.
  • Vibe Manifesto Released: The Vibe Manifesto values flow, iteration, augmentation, product thinking, rerolling, and human taste.
    • These values contrast with friction, perfection, automation, code crafting, debugging, and technical constraints, respectively.


Notebook LM ā–· #announcements (1 messages):

Mobile Study Participants, AI Model Updates

  • Mobile Study Participants Needed: The team is still seeking participants for a study focused on mobile use cases and ideas.
    • Interested individuals are encouraged to join and share their insights to help the team learn more.
  • AI Model Updates Coming Soon: The team announced upcoming updates to their AI models.
    • More details will be shared in the coming days regarding specific improvements and new features.

Notebook LM ā–· #use-cases (52 messagesšŸ”„):

Mindmaps in NotebookLM, Research with NotebookLM, HR policies Hub in NotebookLM, NotebookLM for literature search, External Users Share NotebookLM

  • Mindmaps Roll Out Gradually in NotebookLM: A user noted they had no mindmaps in NotebookLM, to which another user replied that they did have them in the free version and that the feature is being rolled out gradually.
    • Not everyone is on the same server, so it takes a while before all servers are updated.
  • NotebookLM: Research for Building Extensive Reports: A user shared that they use NotebookLM to do research and build extensive reports for generating local and sometimes regional news, to help people understand situations.
  • NotebookLM: Hub for HR Policies: A user asked whether anyone uses NotebookLM as a hub for HR policies, employee handbooks, and onboarding of new employees, so staff can ask questions and get the right answers.
    • They had tried it, but the answers weren’t always correct, and they wondered whether there was a way to organize the information in a particular way.
  • NotebookLM for Literature Search: A user asked how NotebookLM can be used for literature searches, to which another user replied that NotebookLM has no built-in search function.
    • Even so, it remains very handy for learning subjects at university.
  • NotebookLM: Contract Analysis: A user has three one-page contracts with handwritten figures/amounts.
    • One of them was initially not mentioned at all; another was reported as either EUR 700 or EUR 760, when it is actually EUR 400.


Notebook LM ā–· #general (202 messagesšŸ”„šŸ”„):

Mind Map Pixelation Fix, Mind Map Feature Feedback, NotebookLM vs ChatGPT, Access to New NotebookLM, Feedback Methods for NotebookLM

  • Zoom in for Crisp Mind Map Downloads: A member recommends zooming in on tabs before downloading a Mind Map to get a bigger and higher quality output and fix pixelation issues.
    • The member also declared that this tool is an absolute game changer, touting the crazy context window and low hallucination rates, even cancelling their subscriptions to ChatGPT and Claude.
  • Mind Mapping Sparks Symbolic Reasoning: A user believes that getting Mind Mapping right is an important step toward more effective and smarter AI and may be indicative of symbolic reasoning.
    • They suggest that once knowledge can be expressed as a network of meanings, these data structures can be easily corrected with simple manipulations like transplanting nodes or adding intermediate nodes.
  • NotebookLM is not an App, but a PWA: A user sought to change the language on the app, but another user noted that NotebookLM doesn’t have an app, but rather a Progressive Web App (PWA).
    • They recommend removing the app, loading NotebookLM in the browser with the ?hl=LANGUAGE option, and then reinstalling the PWA.
  • Podcast Language can be ā€œForcedā€: A user found that it’s possible to ā€œforceā€ a podcast to generate in another language by inputting a specific prompt at the beginning of the text settings, though English is the only officially supported language.
    • They used the prompt PT-BR cria o podcast em portuguĆŖs to generate a Portuguese podcast, emphasizing it doesn’t always work but finds it cool when it does.
  • Mind Map Feature gets Mixed Reviews: A user thinks that the new mind map is a great addition to NotebookLM, but finds it has major weaknesses.
    • They state that the mind map needs constant regeneration to stay updated and lacks detail beyond the topic, requiring back-and-forth navigation; they asked that topics and subtopics be explained within the topic itself.


Eleuther ā–· #general (106 messagesšŸ”„šŸ”„):

RWKV architecture development, AI model viability prediction, EleutherAI evaluation methods, Low precision data types for RL, MkDocs site for lm-evaluation-harness

  • Virtual Testing Environment Predicts Model Viability: A member proposed a virtual testing environment (AKA the simulator) that predicts AI model viability before training to reduce wasted resources, saving time and accelerating AI innovation by eliminating unnecessary failed experiments before they happen in expensive real-world training.
    • The member stated that their goal is not to achieve 100% accuracy in predicting an AI mechanism’s behavior—it’s to create a system that can at least tell us whether a model has a realistic chance of working or is doomed to fail early on.
  • EleutherAI Evaluation Methods Detailed in New Blog: A member wrote a quick blog on evaluation methods for EleutherAI and set up an MkDocs site for easier navigation.
    • They are awaiting review on this PR too.
  • Contributor Cautioned on AI-Generated Content in PRs: A member was cautioned about the use of AI to generate content for pull requests, emphasizing the importance of vetting contributions to avoid adding spam.
    • It was suggested that unless the author is 100% certain they’re correct on everything, it would be better to withdraw the contribution until they are.


Eleuther ā–· #research (121 messagesšŸ”„šŸ”„):

AI simulation environments, Continual learning in production LLMs, Architecture-aware optimizers, Sharpness Disparity across Transformer blocks, VectorAdam optimizer

  • AI simulator for research: A member shared an idea for a virtual environment to test AI innovations, potentially saving money and resources, as detailed in the attached Ai_simulator.pdf.
    • Others pointed out that testing new architectures at a small scale is already relatively inexpensive, costing around $5 to train an L6D512 model on a 3090 for a day.
  • Optimal Optimizer Derivation Dilemma: Members discussed the difficulty of deriving an optimal optimizer for specific architectures, noting that even for transformers, no such optimizer has been found, despite the availability of unconventional architectures.
    • One member suggested that if a near-optimal optimizer could be derived for an arbitrary architecture, it would be work deserving of an award.
  • VectorAdam rotation equivariance exposed: VectorAdam modifies the second moment update to be the square of the vector norm per gradient vector, addressing coordinate-system bias in Adam, potentially improving rotation equivariance, as shown in this VectorAdam paper.
    • It was noted that VectorAdam is not similar to Adafactor, but more like a blocked approximation with block size = hidden dim (see the sketch after this list).
  • Convergence lemmas debunked: It was suggested that convergence lemmas may not be important and that the regularizers can go in the loss function, so the AdamW detail can be ignored, or put in a separate loss function.
    • Other researchers believed this to be incorrect because the optima being sought are actually quite different under different regularization.
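A compact sketch of that second-moment change, treating each row of a parameter as one gradient vector (an illustrative reading of the paper, not the reference implementation):

```python
import torch

def vector_adam_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One VectorAdam-style step: the second moment tracks the squared norm
    of each gradient vector (here: each row) rather than per-coordinate
    squares, so the preconditioner is rotation-equivariant per vector."""
    m.mul_(b1).add_(g, alpha=1 - b1)
    v.mul_(b2).add_(g.pow(2).sum(-1, keepdim=True), alpha=1 - b2)  # ||g_i||^2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)              # one scalar per vector, not per coord
    p.sub_(lr * m_hat / (v_hat.sqrt() + eps))

# Usage: view parameters as (num_vectors, dim); v holds one entry per vector.
p = torch.randn(128, 64)
m, v = torch.zeros_like(p), torch.zeros(128, 1)
vector_adam_step(p, torch.randn_like(p), m, v, t=1)
```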


Eleuther ā–· #interpretability-general (20 messagesšŸ”„):

mechinterp backlash, token level activations, SAE visualizations, single token activations, untied embeddings

  • MechInterp Faces Academic ā€˜Backlash’: Members discussed that there seems to be an academic ā€˜backlash’ to the ā€˜mechinterp’ brand because so much of it happens outside traditional academic channels.
    • The theory is that since mechinterp sits outside mainstream academic channels, those channels are resistant to the paradigm.
  • Token Activations Analyzed for Accuracy: A member is extracting token level activations on an SAE, questioning whether passing a single/pair of tokens would yield more accurate results than passing a whole sentence.
    • They found that the first token to trigger an activation is holocaust but it’s not the token with the strongest activation, and wondered if neuron activation might be context specific.
  • SAEviz Library for Visualization: When looking at neuronpedia website graphs per feature/neuron, it was suggested to look into SAEviz, a library that does those visualizations using the logit lens.
    • The discussion clarified that these visualizations represent the ground truth activations rather than approximations.
  • Single Token Activation Doubts Raised: A member questioned the validity of single token activations, emphasizing that neurons are only ever active in contexts, it doesn’t make sense to analyze them in isolation.
    • They explained that the activations are influenced by the context before; for instance, the phrase I am a dictator I want to might change the activation on to.
  • Models need time to ā€œwarm upā€: A member states that models need time to ā€˜warm up’, where for the first 50 tokens contextual features tend to be ablated by the model by attending to the end-of-text token.
    • The intuition being that the model doesn’t have enough information to make good judgements about context.

Eleuther ā–· #multimodal-general (1 messages):

Recursive Design, GAN vs. CNN vs. RL Architectures

  • Recursive Design Emerges as a Promising Technique: A member introduced a novel diagram using a recursive design, distinguishing it from traditional GANs (Generative Adversarial Networks).
    • Their implementation emphasizes structural organization over sequential processing, with the GAN-like component used for expression, CNNs for filtering, and RL for refining responses.

Eleuther ā–· #gpt-neox-dev (1 message):

lm_eval update, CI test failures

  • Request to update lm_eval: A member is drafting a PR to update the evaluation logic to lm_eval==0.4.8, the latest version, referencing the Evals PR.
  • CI Tests Failures: A member observed that CI tests are failing for the lm_eval update PR and another test PR created with trivial changes, asking if the repo’s CI is healthy, and referencing the CI Test PR.

HuggingFace ā–· #announcements (1 message):

StarVector, SpatialLM, Hugging Face Agents Course, Xet on the Hub, HF Welcome Page Makeover

  • StarVector emerges as vector graphics virtuoso: A new foundation model called StarVector has been released on Hugging Face for generating scalable vector graphics code from images and text, available at Hugging Face.
    • The initial release includes the starvector/starvector-1b-im2svg model.
  • SpatialLM navigates the 3D landscape: SpatialLM, a 3D large language model designed to process 3D point cloud data, has been released on Hugging Face at manycore-research/SpatialLM-Llama-1B.
  • HF Agents Course embraces LlamaIndex, LangChain, and SmolAgents: The Hugging Face Agents Course now includes integrations for LlamaIndex, LangChain, and smolagents, offering learners diverse approaches to agent frameworks.
    • The course aims to provide fundamental knowledge applicable across different frameworks, making it accessible to those already familiar with one or more of them, according to this tweet.
  • Xet accelerates on the Hub: Hugging Face’s Xet Team has migrated the first Model and Dataset repositories off LFS and to Xet storage.
    • This is a step toward empowering AI builders to build and collaborate more effectively on massive models and datasets, described in more detail in this blog post.
  • Hugging Face revamps welcome page: The Hugging Face welcome page has received a significant makeover, offering streamlined access to community AI apps, open-source libraries, local model execution, and more.
    • Users can explore various sections like HF Spaces, Open Source Libraries, Local Models, and the Inference Playground via the updated welcome page.

HuggingFace ā–· #general (136 messagesšŸ”„šŸ”„):

ComfyUI Samplers, Open Schizo Leaderboard, Short Story Generator with Pytorch, Photorealism Settings for SD1.5/SDXL, Flux.1 Model Performance

  • ComfyUI Sampler Strategy Session: Members discussed the best sampler_name to use in ComfyUI, with one user asking for recommended configurations while admitting to knowing little about samplers.
    • One user recommended dpmpp_2m_sde sampler and kl_optimal scheduler for photorealism with SD1.5 and SDXL checkpoints.
  • Showcasing Crazies on Open Schizo Leaderboard: A new leaderboard was released on Hugging Face, showcasing top models.
  • Model Integration Protocol (MIP) simplifies LLM-powered service: A user is seeking feedback on Model Integration Protocol (MIP), proposing a simpler and more scalable approach for OpenAI that automatically converts existing methods, classes, and HTTP endpoints into JSON-RPC using reflection.
    • This approach aims to drastically reduce development overhead while maintaining platform independence and compatibility with any LLM, and a Neurocaster-Server implementation illustrates its use.
  • Wan Models Debut AutoencoderKL: A user encountered an import error related to AutoencoderKLWan from the diffusers library, potentially due to using a development version or a mistaken repository.
    • A github issue was found which explains that the user may be experiencing a development version error, since AutoencoderKLWan is not available yet.
  • InferenceClient API throws Authentication Error: A user reported a 403 Forbidden error when attempting to list deployed models using the InferenceClient API, even with read-only tokens configured to allow calls to Inference Providers.
    • The error indicates insufficient permissions to call Inference Providers on behalf of the user, and a user posted a link with the same error (a minimal repro sketch follows this list).
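
For reference, a minimal sketch of the failing call; list_deployed_models exists in huggingface_hub (check your version's docs), and a 403 here typically points at token scope rather than a code bug:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")  # placeholder token
# Returns a mapping of framework -> list of currently deployed model IDs.
# A 403 usually means the fine-grained token lacks the permission to make
# calls to Inference Providers, even if it can read repos.
deployed = client.list_deployed_models()
print({framework: len(ids) for framework, ids in deployed.items()})
```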

HuggingFace ā–· #today-im-learning (5 messages):

audio processing, AI agents, Tokenisers, BPE, Unigram language modelling

  • Dive into Audio Adventures: A member is deep-diving into audio processing today.
  • Framework for Fantastic AI Agents: A member is tackling the framework for AI agents today.
  • Tokeniser Tussle: BPE vs Unigram: A member is exploring the mechanics of various tokenisers, specifically BPE and unigram language modelling.
  • Lightweight Models light up Laptops: A member is researching lightweight, fine-tunable models suitable for running and tuning on a development laptop.

HuggingFace ā–· #i-made-this (8 messagesšŸ”„):

Logfire Callback for HF Transformers Trainer, TrashLens for image organization, pdf2notes: AI-powered PDF to Notes conversion, Kids feedback on UI/UX, Local API Usage

  • Logfire Callback Logs Training Events!: A member created a Logfire callback for HF transformers Trainer that logs training events.
    • This tool helps in tracking and analyzing the training process of transformer models in Hugging Face (a sketch of the callback pattern appears after this list).
  • TrashLens Brings Order to Image Chaos!: TrashLens is designed to bring order to image chaos, helping users focus on important content and free up space effortlessly.
    • The tool aims to streamline image organization, making it easier to manage and declutter visual data.
  • pdf2notes Turns PDFs into Organized Notes!: Pdf2Notes is an AI-powered, open-source solution that converts unstructured PDFs into well-ordered notes using LlamaParse and Llama-3.3-70B.
    • The tool uses DeepMind’s Gemini 2 Flash for multi-modal parsing and features a chatbot for more in-depth insights, wrapped in a Gradio and FastAPI framework, and can be run locally with Docker.
  • Kids Provide Valuable UI/UX Feedback!: A member shared that their son helped with the UI colors and enjoys the tool, especially unlocking new achievements.
    • Feedback from kids emphasizes the importance of engaging UI elements and achievement systems in educational tools.
  • API-Free Local Operation in Question!: A member questioned if pdf2notes can operate 100% locally without external APIs, raising concerns about needing subscriptions for Gemini and Groq.
    • They criticized the Docker setup, suggesting it is too complex for non-power users who prefer simpler solutions without additional application installations.
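
For those curious what such a callback looks like, here is a minimal sketch in the same spirit; it is not the author's code, and it assumes a Logfire project is already configured:

```python
import logfire
from transformers import TrainerCallback

logfire.configure()  # assumes Logfire credentials are already set up

class LogfireCallback(TrainerCallback):
    """Forward HF Trainer log events to Logfire (illustrative sketch)."""

    def on_train_begin(self, args, state, control, **kwargs):
        logfire.info("training started", num_train_epochs=args.num_train_epochs)

    def on_log(self, args, state, control, logs=None, **kwargs):
        # `logs` carries loss, learning rate, epoch, etc. at each logging step.
        if logs:
            logfire.info("step {step}", step=state.global_step, **logs)

# Usage: Trainer(model=..., args=..., callbacks=[LogfireCallback()])
```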

HuggingFace ā–· #computer-vision (6 messages):

Qwen for video annotation, Opus clip opensource, LLMs and VLMs in autonomous driving

  • Qwen Guides Video Annotation Newbie: A member sought advice on using Qwen with the transformers library for video frame extraction and annotation.
  • Opensource Opus Clip Tool Seeks Helping Hands: A member is trying to create an opensource version of Opus Clip (video repurposing tool).
    • The author seeks assistance with their ā€œspaghetti repo and codeā€ which utilizes yolov8 and revideo for detecting people and splitting the video vertically.
  • LLMs and VLMs drive Autonomous Driving into the Future: A member shared their new substack article about LLMs and VLMs in autonomous driving, highlighting improvements in vehicle capabilities.
    • The article references a survey paper, A survey for foundation models in autonomous driving, available on arXiv:2402.01105.

Link mentioned: Autonomous driving with LLMs, VLMs, and MLLMs: Discussing the application of Large Language/Vision Models in autonomous driving and the most significant developments and approaches.


HuggingFace ā–· #gradio-announcements (1 message):

Gradio Deep Links

  • Gradio 5.23 enables Deep Links!: Gradio 5.23 introduces Deep Links, allowing direct linking to specific outputs like images or videos, exemplified by this link to a blue jay image.
    • To upgrade, use pip install --upgrade gradio; a minimal usage sketch follows.
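
Below is a minimal sketch of wiring a deep link into a Blocks app. It assumes the gr.DeepLinkButton component named in the Gradio 5.23 release notes, so treat the exact API as something to verify against the docs:

```python
import gradio as gr

def greet(name):
    return f"Hello, {name}!"

with gr.Blocks() as demo:
    inp = gr.Textbox(label="Name")
    out = gr.Textbox(label="Greeting")
    inp.submit(greet, inp, out)
    # Renders a button that copies a URL pointing at the app's current output.
    gr.DeepLinkButton()

demo.launch()
```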

Link mentioned: black-forest-labs/FLUX.1-schnell: no description found


HuggingFace ā–· #smol-course (1 message):

Hackathon Timing, Hackathon Details

  • Hackathon Date Still a Mystery: A member inquired about the hackathon date, saying they could not find any relevant information about it.
    • They mentioned that the YouTube stream said it was the 22nd of March (that day), but they found no confirmation or further details.

HuggingFace ā–· #agents-course (33 messagesšŸ”„):

LangGraph rigidity, Local LLMs for smolagents, Gemini in LangGraph, API costs for notebooks, Agent storing retrieved info

  • LangGraph gains Fans Despite LangChain hate: A member who just finished the LangGraph module likes the rigidity of LangGraph compared to LangChain, which, from what they see on Twitter, gets a lot of hate.
    • Others seemed to echo this sentiment.
  • Local LLMs Need Beefy Machines to run Smolagents: Members found that to run a local LLM and get good results on smolagents, you’ll need a big one (around 32B parameters) and that implies a powerful machine.
    • They tried with ā€˜small’ LLMs like qwen coder 7B or deepseek-r1 7B, and the results with smolagents are pretty inconsistent (a minimal local-model setup sketch follows this list).
  • Home Labs Arise to Reduce API Costs: Members discussed the cost of APIs to complete the notebook, and those who do not wish to pay are working to build out a sufficient home lab to run models on and access them via API.
    • It was mentioned that InferenceClient APIs by huggingface are free to use with a limit of 300 requests/hour for free users.
  • Where does the Agent store for future reference?: In the agentic RAG section of the course (https://huggingface.co/learn/agents-course/unit2/smolagents/retrieval_agents), it is unclear how the LLM agent stores the retrieved information for easy access when planning future events, optimizing efficiency in subsequent tasks.
    • It was suggested it is not the LLM but the agent that stores the search and that the agent itself would have to write it down somewhere, not just in the context.
  • API Token Issue Solved!: A member was experiencing issues running code using HuggingFaceInferenceAPI and getting irrelevant responses from their LLM.
    • The issue was identified and resolved as a problem with the API token, which needed to be read-only to run locally.
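
As a reference point for the hardware discussion, here is a minimal local setup sketch with smolagents. The 32B model ID is illustrative (it is the size bracket the thread found reliable), and TransformersModel requires a machine that can actually host it:

```python
from smolagents import CodeAgent, TransformersModel

# ~32B models were reported to give consistent results with smolagents;
# 7B ones like qwen coder 7B were hit-or-miss.
model = TransformersModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")
agent = CodeAgent(tools=[], model=model)

print(agent.run("How many seconds are in a leap year?"))
```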

HuggingFace ā–· #open-r1 (9 messagesšŸ”„):

r1, vllm, cuda kernel

  • Debate Erupts Over r1 Training Curriculum!: One member asked about the training curriculum, saying that it took 5 minutes with deepseek to understand the humor.
    • Another member stated that r1 is incredibly slow, requiring considerable power; their Scaleway R1 grid of 20 machines totalling around 3 PFLOPS generated only a few hundred MB of data per day, so it was much faster to use llama and reverse engineer the thinking tokens from query-response pairs.
  • CUDA Kernel Improvements Discussed: One user inquired whether vllm was being used and also mentioned working on some cuda kernel improvements.
    • Another member simply answered no.

MCP (Glama) ā–· #general (155 messagesšŸ”„šŸ”„):

MCP and K8s, Anthropic's MCP, MCP server directories, C# MCP SDK, Vercel's AI SDK with MCP Clients

  • K8s Setup Required to Test MCP Prompts: To test MCP prompts, particularly those from this file and this test, a Kubernetes setup is required.
    • An alternative implementation with prompts is available here for managing Electric Vehicle charging stations.
  • MCP isn’t that complex! User says: One user expressed confusion at the perception that MCP is complex, stating JSON RPC isn’t hard. Using SDKs it’s even easier. Making an MCP server or client is pretty easy compared to a lot of other development work.
    • They suggested that with just 1 cmd and 1 arg you can add anything to any llm, with no need for a public IP, TLS cert, or any of the usual blockers (a minimal server sketch follows this list).
  • Dive into MCP Server Repositories: Users shared a list of useful MCP server directories, including Glama with a report card system, PulseMCP for a well-organized and exhaustive list, and the official MCP GitHub.
    • These resources help developers find and assess various MCP servers for their projects.
  • New C# SDK officially released!: A new official C# SDK for Model Context Protocol servers and clients has been released by Microsoft, as seen here.
    • Separately, Vercel AI SDK 4.2 adds MCP client support, giving JavaScript and TypeScript developers tools for building AI applications that integrate into web frameworks like Next.js and Svelte.
  • Zapier Integrates with MCP for broader AI application Access: Zapier has released an MCP server, providing access to over 8,000 integrations for AI assistants to interact with various apps.
    • This allows AIs to perform real-world tasks such as sending messages, managing data, scheduling events, and updating records, expanding their capabilities beyond text generation.
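
To illustrate the ā€œit's pretty easyā€ claim, here is a minimal server sketch using the official Python SDK's FastMCP helper; the tool itself is a toy:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

if __name__ == "__main__":
    # Speaks JSON-RPC over stdio: one command and one arg to register with a
    # client, and no public IP or TLS cert required.
    mcp.run()
```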

MCP (Glama) ā–· #showcase (29 messagesšŸ”„):

mcpwizard, vscode-mcp, DICOM servers MCP, google sheet MCP server, Narrative Spittoon Inversion project

  • MCPwizard Simplifies Server Creation: A member introduced mcpwizard, a CLI tool to simplify creating and deploying MCP servers, highlighting features like initializing projects and adding custom tools to Claude assistants.
    • The tool’s GitHub repo was also shared for community feedback and contributions.
  • VS Code MCP Gets Community Acclaim: Members shared a VS Code MCP that they’ve wanted.
  • DICOM MCP Server for Clinical Imaging: A member created an MCP server for interacting with DICOM servers, enabling AI assistants to query medical imaging systems for patient scans and clinical reports, available at christianhinge.com.
    • The associated GitHub repo is located here.
  • Google Sheets MCP for Direct Editing: A member built a Google Sheet MCP server, allowing Claude to directly edit spreadsheets, streamlining data handling and formula adjustments as mentioned in this tweet.
    • The code can be found here.
  • Automated Debugger MCP Server Enhancements: A member has been making improvements to their automated debugger MCP server, encouraging others to try it out and contribute.
    • The server allows LLMs to place breakpoints, run code, move between breakpoints, and evaluate expressions.

Nomic.ai (GPT4All) ā–· #general (102 messagesšŸ”„šŸ”„):

Speech to Text Solutions, GPT4All and NSFW content, LocalDocs Disappearing, LLMs for Office Tasks, Running Models on Multiple Devices

  • Prompting Proficiency Prevails: Members discussed that if a language model is desired to respond in a specific language (e.g. German), it is best to write the system message in that language to avoid triggering ā€œIm Kontext Lernenā€ (in-context learning).
    • It was further suggested that avoiding negative sentences with words like ā€œnichtā€ and ā€œdon’tā€ can improve results, with a recommendation to rephrase instructions to use active verbs instead.
  • Nemo’s Nuances Named: It was mentioned that Mistral Nemo is a 12B model, while the 24B Mistral is Mistral Small 3 or 3.1, with discussion around specific model details for projects.
    • Confusion arose around identifying the exact model, with one member emphasizing the need for precise model information to avoid issues.
  • GPT4All’s LocalDocs Vanish: A user reported that their entire catalog of local docs disappeared for no apparent reason, prompting discussion about potential causes such as changes to the install folder or lack of admin rights.
    • Members recommended backing up the localdocs.db file and the original documents to prevent data loss, and suggested that a Windows 11 update might have caused the issue by messing with drive letters.
  • LLMs Eye Medical Office Efficiency: Members discussed the potential of using local LLMs in a medical office setting to help doctors create reports and assist with treatments, with a focus on the system learning from past dictated notes.
    • However, it was cautioned that LLMs may not be suitable for handling financial or medical data due to the risk of confabulation and the need for precise information.
  • GPT4All Lacks Vision: A member asked if any models that GPT4All can run have vision capabilities, and it was confirmed that GPT4All does not support vision.
    • Alternative tools like LM-Studio were suggested as options for vision-related tasks.

Modular (Mojo šŸ”„) ā–· #general (7 messages):

High performance software, Vendor lock-ins, OpenCL, OpenMP, OpenACC, Vulkan’s Compute API, and SYCL, Democratizing AI Compute, Hardware Lottery

  • Exploring High-Performance Software Landscape: A member is exploring the landscape of writing high-performance software for various devices and industry needs, particularly concerning vendor lock-ins and the necessity of porting projects to phones or embedded devices.
    • They requested recommendations for papers, search terms, or authors to better understand the trade-offs and options available.
  • Open and Portable APIs: A member suggested starting with open and portable APIs such as OpenCL, OpenMP, OpenACC, Vulkan’s Compute API, and SYCL, citing their well-documented reasons for creation.
    • They also pointed to POCL as an academic project with related papers.
  • Democratizing AI Compute Series: A member linked to Chris Lattner’s ā€œDemocratizing AI Computeā€ series, highlighting how better hardware utilization can dramatically reduce the need for expensive GPUs.
    • The series includes articles on CUDA, OpenCL, and AI compilers (TVM and XLA).
  • The Hardware Lottery: A member recommended the paper ā€œThe Hardware Lotteryā€ by Sara Hooker, which discusses how hardware and software can determine the success or failure of research ideas.
    • The abstract states that the paper introduces the term hardware lottery to describe when a research idea wins because it is suited to the available software and hardware and not because the idea is superior to alternative research directions.

Modular (Mojo šŸ”„) ā–· #mojo (82 messagesšŸ”„šŸ”„):

Mojo Logging Library, Mojo Formatter Tool, Mojo Dict Default Values, GPU Support for Windows, Mojo Inline Assembly Documentation

  • Logging Library in Mojo Remains WIP: A logging library is work-in-progress in the standard library but is getting reworked; full serialization, and likely reflection, is needed before logging can be considered finished.
    • According to one member, We would need to finish serialization before we could call logging finished, which probably means reflection.
  • Mojo Boasts Built-In Formatting Tool: Mojo includes a built-in formatting tool, mojo format, similar to Black in Python or fmt in Rust, for code formatting.
  • Dict Lacks Default Value Generation: The Mojo Dict is more like Python’s dict and does not include functionality to generate default values like defaultdict.
  • Windows GPU Support Frustrates Mojo Developers: GPU support for Windows is difficult because the Windows compiler toolchain is a pain to work with; most people do not run enterprise GPU clusters on Windows, and there’s little reason to improve tooling.
  • Mojo’s Inline Assembly Documentation is a Mess: Members noted the documentation for inline assembly in Mojo is a bit messy.
    • One member said ā€œTime to harass Joe into writing documentation for it, thenā€, which was immediately followed by ā€œNo harassingā€.

Link mentioned: Question: vpermi2b inline assembly output incorrect in loop context due to register allocation: Maybe you could try this from sys import llvm_intrinsic alias T = SIMD[DType.int8, 64] @always_inline(ā€œnodebugā€) fn vpermi2b(a: T, b: T, idx: T) -> T: return llvm_intrinsic[ā€œllv…


Modular (Mojo šŸ”„) ā–· #max (3 messages):

MAX Platform, pixi.toml, max-pipeline, Python model graphs, magic CLI

  • Newcomer Asks About MAX Platform: A new user inquired about modifying the max/pipeline directory and testing changes within the MAX Platform via the pixi.toml file.
    • Specifically, they were curious about altering the max-pipeline without downloading it as a dependency.
  • Editing Python Model Graphs: A member explained that while Python model graphs aren’t well-documented, the MAX pipelines module’s Python source is downloaded locally.
    • Changes to these local files in .modular/envs/max-pipelines/lib/python3.12/site-packages/max/pipelines (or similar location in the .magic environment) should reflect when running pipelines.
  • Running max-pipelines via Python: The original poster asked if they could run max-pipelines directly with Python instead of using the magic CLI to add more command line parameters.
    • No direct response was given on the feasibility of this approach.

LlamaIndex ā–· #blog (4 messages):

AGNTCY, Large-Scale Structured Extraction, Deepseek R1 + LlamaIndex RAG app, WeAreDevs WebDev & AI Day

  • AGNTCY Initiative for Agentic Interactions Emerges: Luke discusses the motivations behind AGNTCY, an effort to create an open standard for agentic interactions.
  • Scale Structured Extraction on Complex Docs: LlamaIndex highlights how to perform large-scale structured extraction over complicated documents, extracting 50-100 fields from a pydantic schema with nested sub-schemas, requiring high accuracy.
    • More details here.
  • Deepseek R1 and LlamaIndex Build RAG: LlamaIndex highlights a project from Akshay Pachaar integrating Deepseek AI to build a RAG app with LlamaIndex for orchestration, Deepseek AI R1 for inference, Ollama to locally serve R1, and Streamlit for the UI; more details available here.
  • WeAreDevs WebDev & AI Day Approaches: LlamaIndex advertises WeAreDevs WebDev & AI Day this Thursday, promising insights from industry experts on how AI is transforming web development and its impact on software development, with more information available here.

LlamaIndex ā–· #general (71 messagesšŸ”„šŸ”„):

Haystack Uninstall LlamaIndex Install, Ollama Integration Error, RTX 3060 Token Issues, Custom AI Interview Prep, Agent Workflow Timeout Error

  • LlamaIndex + Ollama = Perfect RAG?: A member sought help setting up a RAG pipeline with LlamaIndex, Ollama, and related integrations, receiving a code snippet from Deepseek to get started but ran into dependency issues.
    • The error was caused by the incorrect naming of a function argument (model_name instead of model), and while the error was resolved, the generated answer was still not what was expected (a corrected setup sketch follows this list).
  • Crafting Custom AI Interview Grindset: A member is building a local AI using Llama 3.2, Sonnet 3.7, and Dolphin blended into a 16B model with RAG, custom fine-tuning, and dreams of landing a job at an AI/Tech company.
    • He is trying to get his AI to apply to AI/tech companies and pass interviews, and has experience in face tracking, Blender, Unity, PowerShell, and TTS.
  • Timeouts Break Agent Workflows!: A member reported that their agent workflow was crashing due to unhandled timeout errors with the OpenAI endpoint.
    • It was suggested to catch WorkflowRuntimeException or Exception instead of WorkflowTimeoutError.
  • Hugging Face vs Ollama: Which LLM is Easier to Configure?: Members discussed using Hugging Face models locally for chat with RAG, with one user suggesting Ollama is easier to configure.
    • Despite the debate, helpful links to Hugging Face Embedding examples were provided, such as this notebook.
  • JSONL Datasets and Git: A Match Made in Heaven or Data Disaster?: One member pondered the wisdom of storing datasets as JSONL files in Git, seeking insights into potential downsides.
    • There was no specific answer to this question, but it was mentioned that GitHub tracks the updates to every piece of documentation.
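
For anyone hitting the same wall, here is a minimal sketch of the corrected setup; the model names and data directory are placeholders. Note that Ollama takes model while HuggingFaceEmbedding takes model_name, which is exactly the kind of mismatch that caused the error:

```python
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Ollama's LLM wrapper takes `model`, not `model_name`.
Settings.llm = Ollama(model="llama3.2", request_timeout=120.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)
print(index.as_query_engine().query("What does the document say about X?"))
```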

LlamaIndex ā–· #ai-discussion (1 message):

Multi-Agent Systems, Program-Wide Backoff Mechanism, Function Calling

  • Debate on Triggering Agents via Function Calling: Members are debating if a single agent triggering other single agents via function calling could replace program-wide backoff mechanisms in multi-agent systems.
    • They are considering whether these two setups might overlap to achieve the same functionality in certain scenarios.
  • Exploring Alternatives to Backoff Mechanisms: The discussion focuses on whether using a single agent to trigger others via function calls is a viable alternative to a program-wide backoff mechanism.
    • The goal is to determine if this approach can achieve similar functionality in multi-agent systems, potentially offering a more streamlined solution.

Cohere ā–· #ć€ŒšŸ’¬ć€general (25 messagesšŸ”„):

RAG source return, data retention policy, security information about chat with cohere, sampler settings for Command A, AI assistant powered by Cohere's command-r-plus

  • Command-R-Plus Powers New AI Assistant: A startup founder is building tools for structural biology using an AI assistant powered by Cohere’s command-r-plus, combined with a MolStar molecular viewer (https://ai.doi.bio).
    • The site currently supports the ā€˜load’ command for loading PDB entries into the viewer; for example, say ā€˜Show me 7zzz’.
  • Data Retention Policy & Security Info Discussed: A member inquired about data retention and security policies for Cohere’s chat feature, specifically if data is used for model training.
  • Cohere’s Data Privacy and Deployment: A Cohere team member detailed that their SaaS platform lets users control data directly from their dashboard, offers ZDR support upon request via email, and integrates with major cloud providers (OCI, Bedrock, Sagemaker, Azure Cloud).
  • Seeking RAG Replication Resources: A member is seeking resources to replicate RAG source-return behavior similar to NotebookLM, where specific paragraphs are referenced in search results.
    • They are looking for open-source examples related to chunking and data model design.
  • Command A Sampler Settings Guidance: A member asked about released recommended sampler settings for Command A.
    • Another member suggested starting with a temperature of 0.7 and adjusting as needed for determinism vs. flexibility; the default temperature is 0.3 (a minimal API sketch follows this list).
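
A minimal sketch of applying those settings through the v2 chat API; the model ID is our assumption for Command A, so check the model docs:

```python
import cohere

co = cohere.ClientV2()  # reads the API key from the CO_API_KEY env var
resp = co.chat(
    model="command-a-03-2025",  # assumed Command A model ID
    messages=[{"role": "user", "content": "Draft a two-line product blurb."}],
    temperature=0.7,            # thread suggestion; the default is 0.3
)
print(resp.message.content[0].text)
```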

Links mentioned:

  • ai.doi.bio: no description found
  • Security | Cohere: Ensure ultimate AI security and privacy with Cohere's enterprise-grade security protocols, robust access controls, and private deployment options.
  • Login | Cohere: Login for access to advanced Large Language Models and NLP tools through one easy-to-use API.
  • Privacy Policy | Cohere: Cohere Inc. (ā€œCohereā€) values and respects your privacy. We have prepared this privacy policy to explain the manner in which we collect, use and disclose personal information through our Website locat...
  • Deployment Options - SaaS, Cloud API, Virtual Private Cloud (VPC), On Premise | Cohere: Our solutions provide industry-leading data privacy and security and are designed to meet the diverse needs of organizations seeking to harness the power of generative AI. Whether you’re a start-up or...

Cohere ā–· #ć€ŒšŸ”Œć€api-discussions (35 messagesšŸ”„):

Command models, SSL Errors, API Rate Limits, MongoDB

  • Command Models Face SSL Issues?: A member inquired about Command models and their potential for generating more human-like responses, while also experiencing SSL errors.
    • Another member pointed out that SSL errors are not typically related to the model itself but rather to untrusted certificates or network configurations, but could be related to rate limiting.
  • API Spamming Causes SSL Errors?: A member reported encountering SSL errors when rapidly sending requests to the API, suspecting it might be due to spamming despite having the py.ssl module properly installed.
    • Another member suggested the issue could stem from untrusted server certificates, not client-side problems, and recommended contacting the support team.
  • Suspect API Rate Limit Arises: A member suspected the SSL errors might be related to an undocumented API rate limit triggered by spamming requests.
    • Another member noted that rate limits usually return a 429 error code, however.
  • MongoDB Status Queried: Switching topics, a member inquired whether another’s MongoDB was working.
    • The other member stated it was working fine and they used it yesterday.

Cohere ā–· #ć€ŒšŸ’”ć€projects (2 messages):

Discord Bot, RAG Pipeline, vnc-lm, Context Augmentation, Docker

  • vnc-lm Releases Discord Bot with RAG Integration: A member released a new version of their Discord bot, vnc-lm, featuring a RAG pipeline that pulls data from Wikipedia and DuckDuckGo to augment prompts with additional context.
    • This pipeline adds approximately 500 tokens to each prompt by appending five chunks of sourced information to improve the model’s context, with code available on GitHub.
  • Search enabled and disabled: The newly released bot has support for web search.
    • The new search can be enabled with + search and disabled with + model.
  • Versatile Bot Supports Local and Hosted LLMs: The updated Discord bot now supports every popular local and hosted large language model API, including Cohere.
    • The bot can be quickly built using Docker, allowing users to easily edit messages and get new responses within Discord.

Torchtune ā–· #general (33 messagesšŸ”„):

Synthetic Data Generation with vllm and deepseek r1, Llama4 Release, Qwen3 MoE, Good Data Problem, PDF Extraction

  • Synthetic Data Streams from vllm and Deepseek R1: A member is generating synthetic data using vllm and Deepseek R1, expecting the process to run for a couple of weeks (a sketch of the offline generation loop follows this list).
    • Training is delayed in anticipation of Llama4’s release during LlamaCon.
  • Data Quality Conundrums Continue: Despite years of research, the definition and attainment of good data remain elusive for AI labs, even after the recognized importance of datasets like fineweb and lima.
    • A member expressed frustration over the lack of effective PDF extraction tools: we still don’t have amazing PDF extraction and this is making my blood boil.
  • LlamaExtract Tool Launched: LlamaIndex launched LlamaExtract, a tool for structuring complex documents using genAI-native agents.
    • It adapts the latest models to accurately and reliably structure documents like financial reports and resumes.
  • DeepSeek-V3 Releases Unhinged: A member noted the unceremonious release of DeepSeek-V3 by Deepseek, humorously calling them unhinged due to the lack of a proper readme.
    • The model, accessible on Hugging Face, has a blank README.md but provides access to a playground.
  • MoEs Hinted for Torchtune?: A subtle reference was made to the potential inclusion of Mixture of Experts (MoE) models in Torchtune.
    • The discussion touched on the practical challenges of training such large models, potentially requiring 8-9 TB of VRAM.
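
A minimal sketch of the offline generation loop being described; the model ID and sampling settings are illustrative, not the member's actual configuration:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
params = SamplingParams(temperature=0.7, max_tokens=1024)

prompts = ["Explain, step by step, why the sky is blue."]
for out in llm.generate(prompts, params):
    # Each RequestOutput holds the prompt plus one or more completions.
    print(out.outputs[0].text)
```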

Torchtune ā–· #dev (23 messagesšŸ”„):

datasets library issue, GRPO LoRA 3B Single Device, vLLM support for data generation, CUDA graphs

  • Datasets Library Troubleshoot: Members found an issue with the datasets library and attempted to debug it, with one suggesting upgrading the datasets version.
    • One member confirmed that they are on the latest version 3.4.1.
  • GRPO LoRA Achieves 54% on GSM8K: The GRPO LoRA 3B single device gets to 54% on GSM8K, according to a member who shared a link to the pull request.
    • The member noted that it performs better than expected on novel questions, despite an error where it added an extraneous +2 to one calculation.
  • vLLM support lacking for data generation: Members discussed adding vLLM support for data generation but noted difficulties in sharing weights between vLLM and torchtune.
    • One suggested hosting the model in another vLLM process and converting weights, while another mentioned experimenting with a hacky way to make it work on smaller models.
  • CUDA Graphs capture operations: A member inquired about CUDA graphs, which capture a whole batch of GPU operations as a graph and launch them as a single operation (a minimal capture/replay sketch follows this list).
    • Another member confirmed this and noted that it reduces the overhead of launching CUDA operations from the CPU, which reduces GPU idle time.
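
A minimal capture/replay sketch of the mechanism using PyTorch's torch.cuda.graph API (the shapes and the matmul are arbitrary stand-ins for real training work):

```python
import torch

static_x = torch.randn(1024, 1024, device="cuda")
static_y = torch.empty_like(static_x)

# Warm up on a side stream so one-off allocations happen outside the capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_y.copy_(static_x @ static_x)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y.copy_(static_x @ static_x)

static_x.copy_(torch.randn(1024, 1024, device="cuda"))  # refresh inputs in place
g.replay()  # one launch replays every captured kernel, cutting CPU overhead
```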

Link mentioned: GRPO LoRA Single Device by ianbarber Ā· Pull Request #2467 Ā· pytorch/torchtune: ContextWhat is the purpose of this PR? Is it to[x ] add a new feature fix a bug update tests and/or documentation other (please add here)#2421 - exploring a LoRA recipe.ChangelogWhat are …


DSPy ā–· #show-and-tell (1 message):

DLCoT Optimizer, Chain-of-Thought Distillation, Token Usage Reduction, DSPy Optimizers

  • DLCoT Optimizer Launches for Chain-of-Thought: A member has submitted a pull request (#8000) for a new optimizer called DLCoT (Deconstructing Long Chain-of-Thought) to the DSPy teleprompt module.
    • It enhances chain-of-thought reasoning by intelligently processing and optimizing long CoT data by segmenting CoT content, removing redundant paths, filtering incorrect chains and reconstructing coherent output.
  • DLCoT Slashes Token Usage by 70-90%: The DLCoT optimizer can reduce token usage by 70-90% while maintaining or improving accuracy across benchmarks.
    • The optimizer works with existing DSPy optimizers like BootstrapFewShot and distills down to the most efficient reasoning path.

Link mentioned: Add DLCoT Optimizer for efficient Chain-of-Thought distillation by jmanhype Ā· Pull Request #8000 Ā· stanfordnlp/dspy: Add DLCoT (Deconstructing Long Chain-of-Thought) OptimizerOverviewThis PR adds a new optimizer to the DSPy teleprompt module: the DLCoT (Deconstructing Long Chain-of-Thought) optimizer. This feat…


DSPy ā–· #general (20 messagesšŸ”„):

DSPy for creative content generation, PAPILLON example, Agentic-Reward-Modeling link, DLCoT Optimizer, MIPROv2

  • DSPy for creative content generation discussed: Members are discussing using DSPy to optimize prompts for creative content generation, suggesting the use of a good judge metric.
  • DLCoT Optimizer contribution: A member shared a new contribution, the DLCoT (Deconstructing Long Chain-of-Thought) Optimizer, on GitHub for efficient Chain-of-Thought distillation.
    • The member encouraged others to check it out and provide feedback.
  • Optimizing Prompt without Examples: A member is seeking guidance on optimizing a prompt for passage summarization without examples, using a working evaluation function and wondered if they should use COPRO instead of MIPROv2.
    • Another member clarified that example inputs are always needed but summaries (labels) are not, if a judge/metric can assess summaries without a reference/label.
  • Fine-Grained Feedback via dspy.Prediction: A member inquired about achieving granular feedback with Refine, similar to assertions/suggestions, where specific checks over an output provide targeted feedback.
    • Another member mentioned that in version 2.6.15, it will be possible to return dspy.Prediction(score=..., feedback=...) to offer fine-grained feedback to the module (a sketch of the pattern follows this list).
  • Model Context Protocol (MCP) in Retrieval: Members discussed the potential of the Model Context Protocol (MCP) and its expansion to include retrievers/retrieval augmented generation.
    • The discussion included a shared schema for retrieval results and methods to exchange documents and embeddings, with an aim to streamline data-driven workflows and simplify the combination of multiple models and data sources.
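
A sketch of that feedback pattern, assuming the dspy.Refine signature from the 2.6 docs and the Prediction-with-feedback return the thread says lands in 2.6.15; the checks themselves are toy examples:

```python
import dspy

def reward_fn(args, pred):
    checks = {
        "mentions_source": "[" in pred.answer,
        "under_100_words": len(pred.answer.split()) < 100,
    }
    score = sum(checks.values()) / len(checks)
    feedback = "; ".join(f"failed: {k}" for k, ok in checks.items() if not ok)
    # Per the thread, from 2.6.15 returning a Prediction lets Refine feed
    # targeted feedback back into the module instead of a bare score.
    return dspy.Prediction(score=score, feedback=feedback or "all checks passed")

summarize = dspy.Refine(
    module=dspy.ChainOfThought("passage -> answer"),
    N=3, reward_fn=reward_fn, threshold=1.0,
)
```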

DSPy ā–· #examples (9 messagesšŸ”„):

DSPy Modules, Creative Writing Prompts, PAPILLON, Privacy Preservation

  • DSPy Module Usage Under Scrutiny: A member inquired about the correct usage of DSPy Modules within the context of generating reports and charts from a Pandas DataFrame using LLMs.
    • Another member pointed out the difficulty of getting help without a more specific question beyond reviewing a large attached code file; the member then asked, ā€œis that the correct way to use DSPy Modules?ā€
  • Members seek creative writing prompt examples: A member requested examples for improving creative writing prompts or similar cases where there’s no clear correct answer.
    • A link to the PAPILLON GitHub repository was shared, featuring a tutorial notebook focused on privacy preservation from internet-based and local language model ensembles, PAPILLON GitHub.

Link mentioned: PAPILLON/papillon_tutorial.ipynb at main Ā· Columbia-NLP-Lab/PAPILLON: Code for our paper PAPILLON: PrivAcy Preservation from Internet-based and Local Language MOdel ENsembles - Columbia-NLP-Lab/PAPILLON


tinygrad (George Hotz) ā–· #general (19 messagesšŸ”„):

sops.gz dataset, Tinygrad CUDA port, Meeting #63 Agenda, AMD LLVM progress, ONNX Frontend for Tinygrad

  • Track Down sops.gz Origins: A member inquired about the location of the datasets/sops.gz dataset used in speed_compare_cuda_ptx.
  • CUDA Port Ponderings: A member inquired about the possibility of porting Tinygrad to CUDA GPU for training.
    • Another member responded with a link to the README.md file, highlighting supported backends.
  • Meeting Agenda Announced: The agenda for meeting #63 was announced, covering topics such as company update, quantized DSP, BERT, scheduler, driver, tensor cores, WebGPU, ONNX, RetinaNet, Torch frontend and other bounties.
    • Discussion included test_ops, multi GPU training, torch compile and bounties for an AMD LLVM backend.
  • AMD LLVM Backend Advancements: Progress on the AMD LLVM backend was reported, including multiple merged pull requests and testing with Llama3 and Flux examples.
    • A pull request is undergoing review.
  • ONNX Frontend Ascends: A member noted that tinygrad.frontend.onnx now exists, expressing intent to focus on ONNX preparation this week.
    • Validating the top 30 Hugging Face ONNX repos is also on the agenda.

tinygrad (George Hotz) ā–· #learn-tinygrad (4 messages):

Disable colored terminal output, tinygrad facades, GPU code generation, OpenCL empty guarantees

  • Disable colored terminal output in tinygrad: A member asked if there’s a way to disable colored terminal output.
  • Tinygrad has two facades: Tinygrad has two facades: the deep learning part (weights update, tensors, matrix multiplication), and the compiler part (GPU code generation and scheduling).
  • OpenCL empty values are unguaranteed: A member reported getting weird output from the first example in tinygrad-notes.
    • It was clarified that with OpenCL, empty is just empty; there's no guaranteed value.

Link mentioned: Introduction to the internals: Tutorials on tinygrad


LLM Agents (Berkeley MOOC) ā–· #mooc-questions (9 messagesšŸ”„):

Quiz Typos, AgentX Research Track, Remote Research Mentorship, Unpaid Research

  • Quiz Title Typo Causes Confusion: A member reported a typo in the title of Quiz 7, causing confusion when checking answers for Quiz 6.
    • Another member acknowledged the catch and thanked the reporter.
  • AgentX Research Track Application Live: Selected students will receive mentorship from Berkeley postdocs/mentors on an AgentX Research Track project with applications due March 26th at 11:59pm PDT.
    • Mentorship is not required to join or succeed in AgentX, and labs plus the Certificate Declaration form will be released in April.
  • Research Track is Confirmed to be Remote and Unpaid: A member confirmed that the AgentX Research Track mentorship will be conducted remotely.
    • Another member clarified that the mentorship is not paid, with mentors simply providing guidance on the research project.




{% else %}

The full channel by channel breakdowns have been truncated for email.

If you want the full breakdown, please visit the web version of this email: [{{ email.subject }}]({{ email_url }})!

If you enjoyed AInews, please share with a friend! Thanks in advance!

{% endif %}