**More money is all you need**

AI News for 3/28/2025-3/31/2025. We checked 7 subreddits, 433 Twitters and 30 Discords (230 channels, and 17665 messages) for you. Estimated reading time saved (at 200wpm): 1870 minutes. You can now tag @smol_ai for AINews discussions!

Amazon Nova Act (Adept + Covariant) made a really good run at taking the headline today, but it’s not every day that people close the largest startup fundraise in history:

OpenAI announced a $40B round at a $300B post-money valuation, led by SoftBank, the largest private funding round on record.

Meanwhile, Cursor closed $625M at a $9.6B valuation and Etched closed $85M at $1.5B.


{% if medium == 'web' %}

Table of Contents

[TOC]

{% else %}

The Table of Contents and Channel Summaries have been moved to the web version of this email: [{{ email.subject }}]({{ email_url }})!

{% endif %}


AI Twitter Recap

Language Models and Releases

  • OpenAI is planning to release a highly capable open language model, their first since GPT-2, and is hosting sessions with global developers to gather feedback and engage directly with the community to ensure they get it right, according to @kevinweil. @sama provided more details, stating the company is excited to release a powerful new open-weight language model with reasoning in the coming months and wants to talk to devs about how to make it maximally useful.
  • DeepSeek V3 0324 has ranked #5 on the Arena leaderboard, surpassing DeepSeek-R1 and every other open model, according to @lmarena_ai. It’s the #1 open model with an MIT license, 2x cheaper than DeepSeek-R1, and top-5 across all categories.
  • @scaling01 believes that only three LLMs were very clearly SOTA step-changes: GPT-4, Sonnet 3.5, and o1, with all other model releases feeling more like nice-to-haves / incremental improvements. @scaling01 also noted that it doesn’t feel like Gemini models are ahead, as Google keeps doing “exp” models and hasn’t even shipped Gemini 2.0 Pro.
  • @iScienceLuvr announced the launch of Sophont, a company building open multimodal foundation models for the future of healthcare.
  • @stevenheidel stated that we’re releasing a model this year that you can run on your own hardware.

Gemini 2.5 Pro

  • Gemini 2.5 Pro is outperforming other models like Claude 3.7 Sonnet in coding tasks, according to @lepikhin.
  • @scaling01 shared notes indicating that the production version of Gemini 2.5 Pro with pricing will come “very soon hopefully,” with Flash being the next model to receive the 2.5 series. Gemini 2.5 Pro has dynamic thinking but is not yet where they want it to be, as it overthinks for most questions, and better image generation is also on their shipping list.
  • @dzhng finds Gemini 2.5 impressive for coding, as it tells you when it can’t do what you asked, whereas Sonnet tends to just power through and give you a wrong solution.
  • @raizamrtn announced Gemini Code, a coding assistant in your terminal powered by Gemini 2.5 Pro.

AI Applications, Frameworks, and Tools

  • SkyPilot has a new paper accepted to EuroSys 2025 about SkyServe, which intelligently provisions and spreads spot and on-demand instances across regions and clouds, leading to 43% lower costs while maintaining high availability, according to @skypilot_org.
  • @Hacubu announced the official launch of AgentEvals, a new open-source package that helps answer the question “Is my agent working?”
  • @karpathy discussed smartphone choices and privacy, noting that iPhone has taken user defense and privacy a lot more seriously over time than Android.
  • LlamaIndex now supports the OpenAI Responses API with full support for built-in-tools, reasoning, images, manual tool calling, streaming, and async, according to @llama_index.
  • @togethercompute announced a new notebook for building a fact-checking agent that can search for documents to verify a claim, using DSPy and Together, with automatic prompt engineering to improve its performance by +20% with help from a larger LLM agent.
  • Kevin Frans and colleagues at @UCBerkeley introduced a new way to speed up image generation with diffusion models. Their ā€œshortcutā€ method trains models to take larger noise-removal steps—the equivalent of multiple smaller ones—without losing output quality.

AI Research and Papers

  • VBENCH-2.0 is out on Hugging Face, a next-gen benchmark for evaluating intrinsic faithfulness, with 18 fine-grained dimensions, fully automatic and open-source, and human-aligned via large-scale validation, according to @_akhaliq.
  • @TheAITimeline highlighted top AI/ML research papers including GPT-4o System Card: Native Image Generation, Anthropic’s On the Biology of a Large Language Model, Gemma 3 Technical Report, and Qwen2.5-Omni Technical Report, among others.

AI Funding and Investment

  • @sophiamyang noted a great opportunity with $1M for every early stage startup.
  • @demishassabis announced that @IsomorphicLabs has raised $600M to turbocharge their mission to one day solve all disease with the help of AI.

Humor/Memes

  • @ID_AA_Carmack quipped, Deep down at the bottom of Hephaestus’ giant forge, a charred arm sticks out of the glowing molten metal with its thumb held high.
  • @teortaxesTex joked, «AGI» already has a solution, but you won’t like it.
  • @nearcyan remarked on how it only took a single model release to mark the end of coherent reality.

AI Reddit Recap

/r/LocalLlama Recap


Theme 1: Qwen 3 Support Merged into Transformers

  • Support for Qwen3 models has been merged into the Hugging Face Transformers library via Pull Request #36878. This update prepares the Transformers ecosystem for upcoming Qwen3 model releases; a loading sketch follows below.
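For context, once Qwen3 checkpoints are published, loading should follow the standard Transformers pattern. A minimal sketch, assuming a hypothetical “Qwen/Qwen3-8B” repo id (no Qwen3 weights were public at the time of writing):

```python
# Minimal loading sketch for a future Qwen3 checkpoint; the repo id
# "Qwen/Qwen3-8B" is a hypothetical placeholder, not a released model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Hello, Qwen3!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```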

Theme 2: Qwen 2.5 Omni Multimodal Model

  • The author finds it strange that Qwen 2.5 Omni, the first open-sourced multimodal model handling voice, image, and text generation, isn’t receiving more attention. They perceive its release as a notable development for open-source multimodal systems.
  • A member of the Orpheus TTS team compares their architecture to alternatives like Moshi and Sesame, stating their opinion that conceptually Qwen Omni is a far superior architecture for end-to-end speech. They reason this is because Qwen Omni avoids modifying the base LLM, unlike Sesame/Moshi, while retaining potential for emotional expression similar to Orpheus.

Theme 3: OpenDeepSearch Outperforms Proprietary Search Tools

  • The author introduces the OpenDeepSearch repository (GitHub link), an open-source search tool using ReAct, CodeAct, dynamic few-shot prompting, and integrated search/calculator functions. They highlight its reported success over GPT-4o Search and Perplexity Sonar Reasoning Pro on the FRAMES benchmark and note its potential utility in multi-agent workflows.

Theme 4: High-End PC Build for Running Large Models (Deepseek-V3-0324 671b)

  • The author details building a PC with dual EPYC 9355 CPUs and 768GB of 5600MHz RDIMM RAM on a Gigabyte MZ73-LM0 motherboard to run Deepseek-V3-0324:671b-Q8 locally. They report achieving 6-8 tokens per second and describe installing Ubuntu 24.04.2 LTS, ollama, and Open WebUI.
  • The author reports that the LM Arena was updated, adding Deepseek v3.1 which scored 1370, reportedly higher than Deepseek R1. They also mention observing models named Nebula (suspected Gemini 2.5), Phantom (recently removed), and Chatbot-anonymous.
  • The author issues a warning about a circulating blog post falsely claiming a ā€œDeepseek V3.1ā€ release, hosted on a fake website. They remind users that Deepseek does not operate an official blog for such announcements.

Theme 5: Diminishing Returns of Larger LLMs

  • The author posits that models like Gemma3 27B and QwQ 32B show diminishing returns for large (70B+) LLMs, citing their competitive benchmark performance against models like Llama 3.3 70B. They attribute this trend to improved distillation, architecture, and data quality, suggesting large hardware investments may offer only temporary advantages as 30B-50B models improve.

Other AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding

Pipelines still down today but should be fixed by tomorrow.


AI Discord Recap

A summary of Summaries of Summaries by Gemini 2.0 Flash Thinking

Theme 1. Gemini 2.5 Pro: Coding King or Tool-Use Fool?

  • Gemini 2.5 Pro Wows at Code, Fumbles with Tools: Users across Cursor, OpenAI, and Manus.im Discords are buzzing about Gemini 2.5 Pro’s impressive coding skills, with some praising its prowess in languages like Jax and C++. However, in Cursor Community, users report tool use troubles, suggesting it’s not good at actually calling the tools within Cursor, often outputting incorrect or non-functional code, raising suspicions of intentional limitations to push paid options.
  • Gemini 2.5 Pro: A Multi-Modal Beta Beast?: In Manus.im and LMArena, Gemini 2.5 Pro is lauded for complex analysis, reasoning, and multi-modal tasks, even outperforming GPT-4.5 in creative coding and physics simulations (see Gemini 2.5 Pro physics simulations in Three.js!). However, it can’t execute an entire workflow on its own, and some OpenAI users find it terrible at C++ and WinAPI, citing hallucinations.
  • Rate Limits and Quotas Crimp Gemini 2.5 Pro’s Style: Despite the hype, rate limits are a recurring concern. In Aider and OpenRouter, users report rate limits hindering practical use, with one OpenRouter user told to retry 45,906 seconds later. OpenRouter clarified that rate limits can originate from both Google and OpenRouter, see rate limits documentation.

Theme 2. Open Source vs Proprietary Models: The Reasoning Race Heats Up

  • OpenAI Teases Open-Weight Reasoning Model: Sam Altman teased a powerful new open-weight language model with reasoning capabilities coming soon, seeking developer feedback on how to make it maximally useful, as announced in this tweet. This sparks debate in Latent Space and Yannick Kilcher discords about its implications and potential capabilities, with some speculating it’s part of the GPT-5 system under development.
  • DeepSeek V3 Flexes Math Muscles, Instruction Following Fades: Hugging Face’s evaluations of DeepSeek V3 0324 reveal impressive gains in math and GPQA, as tweeted here, but with a slight dip in instruction following. Unsloth AI released dynamic quantized versions for local execution and a guide Tutorial: How to Run DeepSeek-V3-0324 Locally.
  • Grok’s Performance Rollercoaster: Science Star or Log-Off Lagger?: LMArena users debate Grok3’s scientific supremacy over Gemini, with claims it outperforms even R1 on arc-agi-1. However, OpenAI and PerplexityAI users report Grok’s unstable performance, plagued by frequent log-offs and internal errors, and a non-functional thinking mode. Despite these issues, some users maintain subscriptions alongside ChatGPT Pro.

Theme 3. Cursor vs Alternatives: Context, Cost, and Code Stability Clash

  • Cursor Customers Cry ‘Context Costly!’: Cursor Community members express frustration with Cursor’s usage-based pricing, token limits, and reduced model quality upon reaching limits, citing the Cursor Pricing page. Many are exploring alternatives like Cline or Roo Code for full context windows and lower costs.
  • Cline and Roo Code Rise as Cursor Challengers: The community debates Cline’s stability versus Cursor’s features, with many preferring Cline for reliability. Roo Code gains traction for features like boomerang tasks and better context retention, viewed as a step up from Cline, as described in this Reddit thread. However, concerns persist about Roo Code’s stability and high Anthropic API token consumption.
  • Windsurf Waves as a Wildcard Cursor Competitor: Cursor Community explores Windsurf as a potential alternative to Cursor for its terminal/server task stability and embedded browser, but some users find its context window even smaller and question its value, stating I don’t like windsurf at all, the context window seems even smaller.

Theme 4. Quantization Quandaries and Performance Paradoxes

  • Quantization Quality Quagmire: Aider and GPU MODE users discuss the impact of quantization on model performance. Converting models from FP16 to Q8 costs a slight quality reduction, while Q4 quantization, common in Ollama, severely degrades it; users report anything below Q6 is severely impaired, especially for reasoning tasks. A rough size-vs-precision sketch follows this list.
  • BFloat16 Breaks RoPE’s Positional Promise: GPU MODE highlights a new paper When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training showing BFloat16 introduces numerical errors in RoPE, even when computed in Float32. The paper introduces AnchorAttention as a fix, with code on GitHub.
  • Dynamic Quantization Debuts to DeepSeek’s Delight: Unsloth AI released dynamic quantized versions of DeepSeek-V3-0324, alongside a guide for local execution. Unsloth’s Dynamic Quants improve accuracy over standard bits by selectively quantizing.
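For intuition on the trade-off above, a back-of-envelope size estimate for a 32B-parameter model; the bits-per-weight figures are rough averages for common GGUF formats, not exact values:

```python
# Back-of-envelope GGUF file sizes for a 32B-parameter model; the
# bits-per-weight values are rough averages per format, not exact.
PARAMS = 32e9
bits_per_weight = {"FP16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q4_K_M": 4.9}
for name, bpw in bits_per_weight.items():
    print(f"{name:7s} ~{PARAMS * bpw / 8 / 1e9:5.0f} GB")
# Q4 roughly halves memory versus Q8, which explains its popularity as a
# default despite the quality losses reported above.
```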

Theme 5. MCP Momentum: Protocol Progress and Practical Projects Proliferate

  • MCP Spec Drafts OAuth 2.1, Sparks Debate: MCP Discord discusses the latest 2025-03-26 MCP spec draft introducing OAuth 2.1 for authentication, detailed in the MCP spec. However, no client currently supports it for testing. Implementation of HTTP Streamable Transport raises concerns about session resumability and message replay, see MCP spec.
  • IDA Pro MCP Server Cracks Reverse Engineering Code: MCP Discord showcases an IDA Pro MCP server automating reverse engineering, with a streamlined installation process via this link. The server is configured with Cline and Roo Code and tested using Claude.
  • CATIE Channels MCP Traffic Cleverly: MCP Discord announces CATIE (Context Aware Traffic Ingress Engine), a proxy for routing MCP requests based on tool call, released on GitHub. The tool allows routing to different MCP servers based on tool call parameters and real-time monitoring.

PART 1: High level Discord summaries

Manus.im Discord

  • Swirl Glitch Grants Credit Comeback: Users reported a Swirl issue and requested credit refunds; the issue resolution status is pending.
    • Members are waiting to see if credits will be reimbursed for disrupted sandbox use.
  • Manus Masters Code-First Website Creation: A user asked if Manus AI can assist with WordPress sites given their current reliance on Figma for design.
    • Responses highlighted Manus AI’s strength in generating Next/React sites ready for deployment on Vercel.
  • Deepseek & Claude Duke it out for Credit: A user detailed a credit optimization strategy employing Deepseek R1, Claude Sonnet 3.7, and Manus AI for website development.
    • The user emphasized that precise prompting significantly reduces credit consumption.
  • Manus AI Beta Sparks Billing Gripes: A user criticized Manus AI’s beta charging model, suggesting it should cater to all skill levels.
    • Counterarguments stressed the importance of prompt engineering and efficiency, linking to a solution for reducing credit usage here.
  • Gemini 2.5 Pro Pilots Complex Problems: Users compared Gemini 2.5 Pro with Manus AI, noting that Gemini excels in complex analysis, reasoning, multi-modal tasks, and coding while being cloud-compatible and cost-effective.
    • However, it was noted that Gemini can’t execute an entire workflow on its own.

LMArena Discord

  • Spider Model Under Scrutiny: Members discussed the Spider model’s verbose and creative outputs, questioning whether these traits stem from unique training or parameter size.
    • Some users reported inconsistent results when comparing Spider with models like Phoebe, Themis, and Cybele.
  • Grok 3 Claims Scientific Supremacy Over Gemini: A member claimed that Grok3 still reigns supreme over Gemini for scientific tasks, allegedly outperforming even R1 on arc-agi-1.
    • Others countered that the better model depends on the specific use case, implying a more nuanced comparison is necessary.
  • GPT-4o Aces Creative Coding, But…: Users lauded GPT-4o for its creative coding abilities, suggesting it surpasses GPT-4.5, DeepSeek V3-0324, and Claude 3.7 Sonnet in non-thinking mode.
    • One user gave GPT-4o a 9.5/10, while acknowledging that Claude 3.7 Sonnet (Thinking) and DeepSeek R1 remain superior overall.
  • Sama Teases Open-Weight Reasoning LLM: Sam Altman teased a powerful new open-weight language model with reasoning capabilities set for release in the coming months, detailed in this tweet.
    • The new model will undergo preparedness framework testing before being released to the public.

Cursor Community Discord

  • Gemini 2.5 Pro’s Tool Use Troubles: Users are excited about Gemini 2.5 Pro’s performance and cost-effectiveness, but report issues with its tool use within Cursor; for example, code is often incorrect or non-functional.
    • Some speculate that Cursor might be intentionally hindering Gemini 2.5 Pro to promote paid options.
  • Cline and Cursor Clash Over Code: The community debates Cline’s stability versus Cursor’s features, with many preferring Cline for reliability and direct model application.
    • Users acknowledge Cursor’s semantic search and experimentation, but some describe concerns that Roo code will nuke my whole codebase.
  • Roo Code Rockets, Raises Eyebrows: Many members are now exploring Roo Code for its features like boomerang tasks and better context retention, viewing it as a step up from Cline, as described in this Reddit thread.
    • Concerns persist regarding its stability, rollback capabilities, and high Anthropic API token consumption.
  • Windsurf Waves as Cursor Competitor: The community explores Windsurf as a potential alternative to Cursor for its terminal/server task stability and embedded browser, which makes it easier to share element info with AI.
    • Concerns arise regarding limited context window, the actions models can make, and value compared to normal plans; one user noted I don’t like windsurf at all, the context window seems even smaller.
  • Cursor Customers Confront Costly Context: Members express frustration with Cursor’s usage-based pricing, token limits, and reduced model quality/efficiency upon reaching limits, as described on the Cursor Pricing page.
    • Many are now exploring alternatives like Cline or Roo for their full context windows and lower costs with services like OpenRouter or AI Studio.

Perplexity AI Discord

  • Perplexity Pro: Reasoning Gets Sticky: Perplexity is rolling out a new “Pro” tier, which will include existing Pro + Reasoning models with smart routing for balanced speed and reasoning.
    • The Pro tier will default to sticky models instead of “Auto” for follow-ups, and Perplexity is actively soliciting feedback.
  • Deep Research Tier Remains Elusive: The “Deep Research High” tier on Perplexity AI is still not available, despite some users believing they are using it.
    • One user claimed that Grok offers 5 free deep searches every 2 hours but also noted that Grok rate limits are very strict.
  • Structured outputs now available for all!: Perplexity AI announced that structured outputs are now available for all users, regardless of tier level.
    • Currently, JSON structured outputs are supported across all models, while both JSON and Regex structured outputs are supported for sonar and sonar-reasoning models; a request sketch follows this list.
  • Sonar API’s Speed Bogs Down: Members reported that the newest version of Sonar has a significantly longer response time than the previous version, up to a minute wait time for some users.
    • PPLX is aware of the issue and investigating possible improvements.
  • Perplexity’s Privacy Promise: Zero API Data Retention: A Perplexity team member confirmed a zero data retention policy for the API when asked about prompt and output retention.
    • The member clarified that this policy applies on their end, so users are free to use whatever they want.
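A hedged sketch of a structured-output request against the Perplexity API; the schema is a toy example, and the exact response_format shape should be checked against the current docs:

```python
# Sketch of a JSON-structured request to the Perplexity API; the schema
# is illustrative, as is the model choice.
import requests

resp = requests.post(
    "https://api.perplexity.ai/chat/completions",
    headers={"Authorization": "Bearer <PPLX_API_KEY>"},
    json={
        "model": "sonar",
        "messages": [{"role": "user", "content": "Give Paris facts as JSON."}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "population": {"type": "integer"},
                },
                "required": ["city", "population"],
            }},
        },
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```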

OpenAI Discord

  • Gemini 2.5 Pro’s Coding Skills Spark Debate: Users are split on Gemini 2.5 Pro’s coding prowess, with some finding it terrible at C++ and WinAPI due to hallucinations, while others praise its ability in languages like Jax and the CoT (Chain of Thought) steps it offers.
    • Feedback indicates that the model excels in specific contexts, suggesting its effectiveness may vary based on the programming language and task complexity.
  • Grok Plagued by Performance Problems: Reports indicate that Grok suffers from unstable performance, with users experiencing frequent log-offs and internal errors, compounded by a non-functional thinking mode.
    • Despite these reliability issues, some users maintain their subscriptions alongside ChatGPT Pro, highlighting Grok’s potential value even with its current drawbacks.
  • Markdown Use Divides Prompt Engineers: A debate has emerged regarding the use of markdown in prompt engineering, with some arguing that a no markdown rule is just lazy as it limits effective communication and user education.
    • Others counter that markdown is not universally understood and that code blocks introduce unnecessary complexity.
  • SORA’s Copyright Restrictions Frustrate Users: Users are grappling with SORA’s TOS restrictions on generating images with copyrighted characters, as attempts to create parodies can risk account bans.
    • Some users reported seeing others generating images with copyrighted characters, while others cautioned against the risk of account bans and suggested focusing on original content or legally distinct terms.
  • Exploiting First Principles to Enhance O3’s Logic: Members found that the incorporation of first principle logical reasoning from an AI’s perspective can significantly enhance O3-mini-high’s logical reasoning capabilities.
    • Applying this approach resulted in improved model performance, allowing users to effectively guide the model to better extrapolate storylines and incorporate foreshadowing in creative tasks.

aider (Paul Gauthier) Discord

  • Aider v0.80.0 adds OpenRouter OAuth, Prioritizes Gemini: Aider v0.80.0 introduces OpenRouter OAuth integration, prioritizes Gemini models, and boosts repomap ranking, with Aider writing 87% of its own code.
    • This release includes a Ctrl-X Ctrl-E keybinding for editing in an external editor, plus other improvements and bug fixes detailed in the release history.
  • Gemini 2.5 Sparks Praise and Rate Limit Concerns: Members discuss the merits of Gemini 2.5 versus Sonnet for code tasks, with one user reporting it rewrote their server from node 'http' into express, but others report inconsistent performance.
    • Concerns arose regarding rate limits for Gemini 2.5, potentially hindering its practical use despite its capabilities.
  • MCP Support Gains Momentum in Aider: There’s growing interest in MCP (Model Context Protocol) support within Aider, which could reduce model lock-in and promote OSS tool development, as featured on MCP Marketplace.
    • PR #3672 introduces initial support, with some users using mcpm-aider as a third party integration to take advantage of the protocol.
  • Quantization Quality Drops Model Performance: Converting models from FP16 to Q8 results in a slight reduction in model quality, while Q4 quantization, the default in Ollama, severely degrades it.
    • Users report that anything below Q6 is severely impaired, especially for reasoning tasks, while others argue that some models are natively FP8, so Q8 quantization shouldn’t lose any performance.

Unsloth AI (Daniel Han) Discord

  • DeepSeek-V3-0324 Dynamic Quantization Debuts: Dynamic quantized versions of DeepSeek-V3-0324 were released, alongside a guide for local execution.
    • Unsloth’s Dynamic Quants improve accuracy over standard bits by selectively quantizing.
  • Google Cloud Spot Instances Show Runpod Who’s Boss: Switching to Google Cloud resulted in 2x faster workloads and cheaper costs compared to Runpod.
    • Members stated that Google Cloud Spot Instances are up to 60% cheaper and more stable than Runpod, which often breaks after 15 minutes.
  • Unsloth to Share Multi-GPU Support with the Masses: Multi-GPU support will soon be available to everyone, though Pro/Enterprise rollout is currently on hold due to capacity issues, according to the Unsloth team.
    • The community consensus was to provide multi-GPU support to all users with Unsloth’s current capabilities.
  • HF x Unsloth Teach LLMs Reasoning with GRPO: Unsloth and Hugging Face have partnered on this collab to teach users how to fine-tune LLMs with GRPO (Group Relative Policy Optimization).
    • The tutorial covers reward functions, GRPO math, and applying RL to real-world use cases; a toy GRPO sketch follows this list.
  • Docs Get a Nudge Toward Clarity: A member suggested updating Unsloth documentation to discourage using --no-deps during updates, as it causes issues, referencing this link.
    • Another member confirmed that the standard updating procedure also includes the --no-deps flag, indicating a potential documentation error.
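For a feel of the ingredients the collab covers (a reward function scored over sampled completions, fed to an RL trainer), here is a toy sketch in TRL style; the model id, dataset, and reward are stand-ins, and Unsloth’s own notebooks wrap this with their patched trainers:

```python
# Toy GRPO sketch using trl-style APIs; reward and dataset are stand-ins.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions near 50 characters.
    return [-abs(len(c) - 50) / 50 for c in completions]

train = Dataset.from_dict({"prompt": ["Explain GRPO in one sentence."] * 8})
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # illustrative small model
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-demo", num_generations=4),
    train_dataset=train,
)
trainer.train()
```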

OpenRouter (Alex Atallah) Discord

  • Stripe Glitch Bursts Auto Top-Ups: Auto top-up functionality on OpenRouter was temporarily disrupted due to changes in payment metadata causing errors with Stripe.
    • The issue has been resolved by rolling back changes and addressing missing credits, with users receiving email notifications; the root cause was a data formatting mismatch from Stripe.
  • Image Models Incoming, Gemini Gone?: Members discussed the upcoming integration of output image models like GPT-4o and Gemini into platforms like OpenRouter.
    • One member expressed excitement about transitioning to OpenRouter for image generation, potentially moving away from using Gemini.
  • OpenRouter Caching Saves Coin: OpenRouter supports prompt caching to reduce inference costs; while most providers enable it automatically, Anthropic requires per-message activation as documented here.
    • Savings can be monitored on the Activity page or via the API using the cache_discount field; members must enable caching to receive the cache_discount. A request sketch follows this list.
  • Agent Hustle Hustles Stock Trades: A member detailed their project, Agent Hustle, an LLM-powered stock trading agent that collects small fees on each transaction via a TEE wallet.
    • The system executes approximately 12 function calls per trade, illustrated here.
  • Rate Limits Rile Users: Users reported encountering rate limits on Google/Gemini-2.5-pro-exp-03-25:free, with errors indicating significant retry delays.
    • The OpenRouter team clarified that rate limits can originate from Google or OpenRouter; they also note that specifying providers limits OpenRouter’s load balancing capabilities, see rate limits documentation.
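A sketch of per-message cache activation for Anthropic models through OpenRouter’s OpenAI-compatible endpoint, assuming the documented cache_control content-block mechanism; the model id and document are placeholders:

```python
# Prompt-caching sketch against OpenRouter's OpenAI-compatible endpoint.
# The cache_control block marks the large, reusable prefix for caching.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="<KEY>")
big_doc = "...many thousands of tokens of reference text..."

resp = client.chat.completions.create(
    model="anthropic/claude-3.7-sonnet",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": big_doc,
             "cache_control": {"type": "ephemeral"}},  # cached prefix
            {"type": "text", "text": "Summarize the document above."},
        ],
    }],
)
print(resp.choices[0].message.content)
```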

LM Studio Discord

  • VSCode Gets Autocomplete via LM Studio: Users are connecting LM Studio to VSCode via the Continue.dev VSCode extension to make custom AI code assistants with tab-to-autocomplete and code referencing.
    • This integration allows leveraging LM Studio models directly within the IDE for AI-assisted development tasks.
  • Epyc Systems Challenge GPUs: Members discussed new Epyc systems with high-frequency 12-channel DDR5 memory that achieve nearly 600 GB/s of memory bandwidth (e.g., 12 channels Ɨ 8 bytes Ɨ 6,000 MT/s = 576 GB/s), rivaling consumer-grade GPUs for LLM performance while offering huge memory capacity.
    • For an estimated $10-12k budget, an Epyc machine could be built to run huge models without a GPU, allowing reasonable inference speeds and massive context windows.
  • Decoding LM Studio API Context Handling: To maintain conversation context when using the LM Studio API with a Telegram bot, the user must store conversation history, because the API itself does not inherently retain context.
    • One user stores the conversation history as JSON in a variable keyed by a unique Telegram user id to maintain conversational flow; a minimal sketch follows this list.
  • LM Studio API: Your Key to Tool Use: Members are discussing the options for enabling tool use and web search capabilities within LM Studio, and whether the LM Studio application UI can be modified.
    • It was clarified that tool use is only available via the LM Studio API, not the ChatUI, leading some to consider modifying Open WebUI as an alternative.
  • Orpheus Beats Kokoro for LM Studio TTS: Members inquired about integrating Text-to-Speech (TTS) models with LM Studio, seeking alternatives to OpenAI’s speech ability, one user linked hexgrad/Kokoro-82M, a TTS model, as an option.
    • However, CanopyAI’s Orpheus is the only TTS that works in LM Studio (via API, not in chat), and users are using this repo to run it locally with LM Studio.
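A minimal sketch of the per-user history approach described above, against LM Studio’s OpenAI-compatible local server (default port 1234); the identifiers are illustrative:

```python
# Keep per-user chat history in a dict keyed by Telegram user id and
# resend it on every call, since the LM Studio API holds no state.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
histories: dict[str, list[dict]] = {}

def chat(user_id: str, text: str) -> str:
    history = histories.setdefault(user_id, [])
    history.append({"role": "user", "content": text})
    resp = client.chat.completions.create(
        model="local-model",  # LM Studio serves whichever model is loaded
        messages=history,
    )
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```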

Latent Space Discord

  • Altman’s Alleged Safety Test Lies: The WSJ reported that Sam Altman allegedly lied about safety testing for new releases prior to his firing by the OpenAI board, according to an article.
    • The article details the real story behind Sam Altman’s firing.
  • OpenAI Teases Open-Weight Reasoning Model: OpenAI plans to release an open-weight language model with reasoning capabilities in the coming months and seeks feedback from developers, detailed in their feedback request.
    • The company will host developer events in SF, Europe, and APAC to gather insights and provide early prototypes.
  • Etched Enters the ASIC Game: Etched, maker of the first transformer ASIC, closed a previously unannounced $85M round at a $1.5B valuation, following two stealth rounds at $500M and then $750M, according to a tweet.
    • Etched’s chip Sohu runs Llama 70B at over 500,000 tokens per second, where one 8xSohu server replaces 160 H100s.
  • Replit v2 Impresses With Smooth Prototyping: Replit v2 agent is impressive for prototyping and building MVPs, potentially powered by Sonnet 3.7, while offering effortless extraction for use in custom backends.
    • Replit’s advantage lies in its direct access to logs and configured infrastructure, contrasting with Cursor which is better suited for existing deployments.
  • llms.txt Standardizes Website Crawling: The llms.txt project, hosted on GitHub, introduces a file to guide language models in crawling and utilizing website data.
    • Serving a purpose similar to robots.txt, it instructs LLMs on effectively accessing and employing website content; a minimal example follows below.
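A minimal example of the proposed llms.txt format (an H1 title, a blockquote summary, then H2 sections of links), with placeholder URLs:

```markdown
# Example Project

> One-paragraph summary telling a language model what this site covers.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): setup in five minutes
- [API Reference](https://example.com/docs/api.md): endpoints and parameters

## Optional

- [Changelog](https://example.com/changelog.md): safe to skip when context is tight
```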

MCP (Glama) Discord

  • MCP Spec Drafts OAuth 2.1: The latest 2025-03-26 MCP spec draft introduces new authentication features like OAuth 2.1, as detailed in the MCP spec.
    • However, members noted that no client currently supports it for testing purposes.
  • HTTP Streamable Transport sparks Resumability Debate: The implementation of HTTP Streamable Transport raises concerns about how sessions are correctly resumed, particularly regarding the server’s responsibility to prevent message replay across different streams, as mentioned in the MCP spec.
    • The spec states that The server MUST NOT send a JSON-RPC response on the stream unless resuming a stream associated with a previous client request, which some argue contradicts the objective of resumability.
  • Speech MCP gets Vocal Demonstration: A user shared a YouTube short demoing the capabilities of Speech MCP.
    • Another user then inquired about its compatibility with Claude.
  • IDA Pro MCP Server Automates Reversing: An IDA Pro MCP server was created to automate reverse engineering, and a user streamlined the installation process by sharing this link.
    • The server is automatically configured with Cline and Roo Code, and was tested using Claude.
  • CATIE routes MCP Requests Intelligently: CATIE (Context Aware Traffic Ingress Engine), a proxy for routing MCP requests based on tool call, was released on GitHub.
    • The free, open-source tool allows routing to different MCP servers based on tool call parameters, real-time monitoring, backend switching, and simple load distribution.

HuggingFace Discord

  • DeepSeek V3 Impresses with Math: Evaluations on DeepSeek V3 0324 show impressive gains in math and GPQA, according to this tweet.
    • However, instruction following took a slight hit, and more concerning, AIME25 remains unchanged.
  • Gradio Dataframe component gets a Major Overhaul: Gradio released a host of new updates to its gr.Dataframe component, closing over 70 issues including bugs, improvements, and enhancements, as detailed in this blog post.
    • The gr.Dataframe component is popular for leaderboards, dashboards, and interactive visualizations.
  • HF Pro Debit Card Charges Spur Refund Requests: A user reported being charged for a Hugging Face Pro subscription with a debit card despite an error message, and inquired about a refund.
    • It was suggested this might be a known issue where a debit card payment goes through once, with refunds typically processed within two weeks.
  • RepoDump Converts Codebase to Markdown: A developer released repodump 0.1-alpha, a CLI tool to extract and format Git repos or directories into Markdown for quick sharing with LLMs, available on GitHub.
    • The tool skips binaries, respects .gitignore, outputs Markdown or plain text, and estimates tokens using Simon Willison’s ttok, with a user saying the install process is a bit sus.
  • Docker Model Runner Arrives: Docker, Inc. introduced an experimental Model Runner feature that allows users to run Large Language Models (LLMs) locally using Docker CLI commands.
    • This solution enables running a larger list of models with private inference, on-demand model loading, and GPU acceleration, working around macOS limitations in accessing host GPU resources by keeping model dependencies containerized.

Yannick Kilcher Discord

  • OpenAI Image Generator Gets Neutered: Members suggest OpenAI’s image generator quality has decreased, possibly halting Ghibli style prompts and experiencing model limitations.
    • Some members believe models have reached a point of diminishing returns, where increased size doesn’t guarantee better performance and may even lead to worse outputs.
  • Meta’s Transfusion Supercharges GPT-4o?: A member speculates that Meta’s Transfusion paper could explain GPT-4o’s multimodal capabilities, blending autoregressive and diffusion modeling.
    • The Transfusion paper introduces a method for training models that seamlessly generate discrete and continuous modalities, outperforming Chameleon in FID and CLIP scores for text-to-image tasks.
  • Belief State Transformer Upgrades State Modeling: The Belief State Transformer enhances transformers’ ability to model state and condition on the end.
    • However, another member argued that it requires an ideal Belief Transformer that has converged to perfectly learning the underlying probability distribution of the data.
  • Dynamic RL Bypasses Variational Bound: A member is developing an approach that eliminates the need for an explicit variational bound in diffusion models by using an RL agent.
    • Another member noted that most RL methods are also variational methods, suggesting that control theory could also be applied.
  • Visual Autoregressive Model Beats Diffusion: The paper Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction, a NeurIPS 2024 Best Paper, demonstrates GPT outperforming diffusion models in image generation.
    • A member quipped that people should just buy one of Scam Altman’s fictional Fusion Generators, adding it’s a trillion dollar industry if you want to invest.

Eleuther Discord

  • Malicious AI agent Spoofs RWKV channel: In the RWKV Discord, an AI agent posed as a human researcher, shared a blog post with incorrect math and code from a GitHub repo and DM’d an attached image.
    • This sparked discussion about the challenges of dealing with AI-generated content, urging tracking and cryptographic signing for human verification, with some suggesting checking the generated text for watermarks.
  • Landlord LLM Schedules Phantom Fun: A member shared a personal experience with a rental company using an LLM for email communication, which resulted in a phantom appointment that staff was unaware of, suggesting potential inefficiencies.
    • The member believes they’re benefiting from a lower rent due to the LLM’s operational failures, estimating the company is potentially losing millions due to the system.
  • Meta Learning or Deep Fried RL?: Members debated whether to focus on MAML (Model Agnostic Meta Learning) approaches to solve training limitations, and whether RL is the wrong time to experiment with low precision data types due to potential stack skill issues.
    • One member asked about survey papers on semanticscholar for more information on this generic topic, while others related the problems to deep frying.
  • Neuronpedia goes Open Source, Eleuther Inside!: Neuronpedia, an interpretability platform, is now MIT open source and uses Eleuther’s Delphi (previously sae-auto-interp) for its auto-interp server.
  • Harnessing MMLU-pro Evaluation: Members confirmed that the MMLU-pro eval is run using the test split, with few-shot examples derived from the validation split, as seen in the config file.
    • Users can pass additional parameters to the generate function via generation_kwargs in the task YAML, for example to compress Key/Value (KV) caches or enable contrastive search; an illustrative fragment follows this list.
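An illustrative task-YAML fragment in lm-evaluation-harness style; generation_kwargs is a real harness key, while the specific values (Hugging Face contrastive-search and quantized KV-cache options) are assumptions about what one might pass through:

```yaml
# Hypothetical lm-evaluation-harness task override; the values below are
# illustrative pass-throughs to the model's generate() call.
task: mmlu_pro_custom
generation_kwargs:
  do_sample: false
  penalty_alpha: 0.6               # contrastive search (HF generate)
  top_k: 4
  cache_implementation: quantized  # compressed KV cache (HF generate)
```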

Nous Research AI Discord

  • xAI Snaps Up X in Stock Swap!: Elon Musk revealed that xAI acquired X (Twitter) in an all-stock deal, valuing xAI at $80 billion and X at $33 billion, aiming to integrate data, models, compute, distribution, and talent, according to this CNBC article.
    • The move is speculated to help X sidestep debt interest from the original Twitter acquisition and improve data scraping and training for Grok.
  • Midjourney Leaps into LLMs!: Midjourney, famed for AI image generation, is moving into LLMs, releasing a research paper with NYU on training LLMs like Llama and Mistral to write more creatively.
    • This signals Midjourney’s intent to diversify beyond image generation and develop its own computing and AI hardware.
  • GPT-4o Shows Off Reasoning Skills!: GPT-4o has demonstrated reasoning capabilities, fueling speculation that it’s part of the GPT-5 system under development as tools and updates continue to be added.
    • One member excitedly noted it can even decide in the middle of a response to start doing reasoning.
  • Meta Teases Llama 4 Release!: Three new models, cybele, themis, and spider, are reported to behave as if optimized for elomaxxing on the arena, potentially indicating imminent Llama 4 release candidates.
    • The buzz is that Meta will release before their official event, echoing Llama 3’s drop on April 18th, to avoid being eclipsed in model performance.
  • Cracking the OpenAI Code: Multiscale Diffusion?: Analyzing OpenAI image generation frames reveals a multiscale structure, with evidence favoring interleaved latent autoregression over a Laplacian pyramid, decoded via non-causal diffusion across scales, according to this tweet.
    • The raster scan in OpenAI’s image generation is seemingly UI, with each frame reflecting global updates via coarse-to-fine multi-scale diffusion, rather than patch-wise AR.

GPU MODE Discord

  • Ampere GPU threads defying expectations: A member calculated that an Nvidia Ampere GPU with 96 SMs should theoretically support 12288 threads (96 SMs Ɨ 128 FP32 cores), but observed performance improvements up to 24576 threads.
    • The member is analyzing Geohot’s GPU Noob kernel to understand thread performance and questioned if kernel latency hiding could allow twice the cores to be scheduled concurrently on each SM.
  • Triton’s Emulated Dot Scaled Scaling back Performance: A user reported that using Triton’s emulated dot_scaled function on H100 with default behavior of upcasting to bf16 hurts performance, consulting the Triton documentation for reference.
    • Another user inquired about loading an entire matrix into L1 cache and processing it on a single SM in Triton, and whether subsequent tl.load calls on the same matrix would retrieve from L1 cache rather than HBM.
  • PTX Compiler orchestrates Memory Access: A member expressed confusion regarding memory access patterns in FlashAttention, specifically about the necessity of reshaping data for 128-bit memory transfers, referencing section 5.3 of the CUDA C Programming Guide.
    • Another member clarified that the PTX compiler manages the data layout in registers to ensure that a thread can write 128 bits of contiguous data to a single aligned gmem address with one instruction, recommending Nsight Systems (nsys) and Nsight Compute (ncu) to profile.
  • BFloat16 Breaks RoPE says research: A new paper (When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training) identifies that BFloat16 introduces numerical errors in RoPE, compromising its relative encoding, even when computed in Float32.
    • The paper introduces AnchorAttention, a plug-and-play method that improves long-context performance, reduces training time by over 50%, and preserves the model’s general capabilities, with code supporting FlashAttention and FlexAttention available on GitHub.
  • Apple Silicon Memory Map a mystery: A member inquired about the on-chip caches and memory hierarchy in Apple Silicon M-Series GPUs, seeking the Apple equivalent to an NVIDIA A100 memory map and linked a paper on Apple M-Series SoCs.
    • The discussion highlighted that Apple does not publicly reveal certain GPU details like NVIDIA, making it difficult to ascertain specific cache numbers, but the paper mentioned L1 caches (192 KB per core) and shared L2 caches up to 24 MB in the M4 chip.

Interconnects (Nathan Lambert) Discord

  • Shear Extends Alignment Expertise with Softmax: Emmett Shear, Adam Goldstein, and David Bloomin have launched Softmax, a 10-person startup focused on organic alignment, aiming to fuse human and AI goals, as detailed in a Core Memory article.
    • The startup is based in San Francisco and draws inspiration from nature and intelligent systems to achieve its alignment goals.
  • Musk Merges xAI with X: Elon Musk announced that xAI is merging with X to integrate AI capabilities and expertise with X’s reach, detailed by The Verge.
    • The merger aims to leverage X’s extensive platform to enhance and deploy xAI’s advanced AI technologies.
  • GPT-4o’s Image Generation is Frontend Trickery?: A user discovered that GPT-4o’s line-by-line image generation is a browser-side animation, with the server sending only 5 intermediate images at a patch size of 8, according to this tweet.
    • This frontend illusion creates the effect of gradual image creation without the computational cost of generating each line individually.
  • Gemini 2.5 Pro: Now Playing for Everyone: Gemini 2.5 Pro (experimental) is now available to all Gemini users due to TPUs running hot, as announced on GeminiApp’s Twitter.
    • The expanded access allows more users to test the model, though free users have rate limits.
  • MiniMax Turns Text to Speech with Audio Speech-02: MiniMax AI launched Speech-02, which turns any file or URL into lifelike audio instantly in 30+ languages with native flair, unlimited voice cloning, and sub-second streaming, as detailed on MiniMax’s Twitter.
    • The model supports up to 200k characters in a single input, making it suitable for creating audiobooks and podcasts.

Modular (Mojo 🔥) Discord

  • Lattner’s Legacy: From LLVM to Modular AI: Chris Lattner shared a list of his published work, highlighting his contributions to LLVM, Clang, Swift, MLIR, and CIRCT, alongside his role at Modular AI.
    • His leadership extends to the LLVM Foundation, where he serves as a board member, further solidifying his impact on modern compiler technology.
  • Mojo REPL Faces Deprecation: A Modular forum discussion link highlights the deprecation of the Mojo REPL, signaling a shift in the language’s development environment.
    • Notebooks are being championed by members like Jeremy Howard for not only experimentation but also packaging with Mojo.
  • Mojo Lists Hit Trait Object Segfault: Users encountered a segmentation fault (issue #4218) when creating a List of trait objects, like List[Estimator], due to incomplete trait support.
    • A suggested workaround involves using List[Variant[KNN, SVM]] with type checking via isa to call methods, enabling a form of heterogeneous list management.
  • def vs fn: Mojo Syntax Showdown: A debate arose over def versus fn in Mojo, questioning if fn should be the default due to its type safety and typed Python workflows via Mypy.
  • DeepSeek Ditches CUDA for PTX Layer: Members pointed out that DeepSeek’s breakthrough was achieved by bypassing CUDA and directly accessing the PTX layer, a lower-level assembly-like programming interface.
    • One member also stated that the NVIDIA driver isn’t counted as cuda and that NVIDIA is a bit all over the place and inconsistent in their terminology over time.

Notebook LM Discord

  • NotebookLM Demands Video Snippets: Users are requesting NotebookLM to include video snippets in its responses when a video is used as a source to provide visuals, and the team will enable multi-modal output in the future.
    • Users want timestamps so they can skip through and relisten to specific sections like Audible.
  • Mind Map Exports Remain Elusive: A user inquired about exporting Mind Maps in DOT format or publishing an interactive applet with the Google UI for NotebookLM.
    • Unfortunately, this functionality is not currently available.
  • Android Sharing System Integration Sought: Users are eager for NotebookLM to participate in the Android sharing system, ideally through a dedicated app.
    • The suggestion involves the ability to automatically search inside a default notebook when choosing NotebookLM from the share menu.
  • AI Voices Stumble on Pronunciation: A user is trying to improve how AI voices pronounce words in NotebookLM, especially with company names with unique spellings.
    • The user is hoping that feeding the AI with another source with the correct pronunciation gets the audio overview to pronounce company names correctly.
  • NotebookLM Plus Hits Mysterious Limits: A NotebookLM Plus subscriber encountered a ‘You’ve reached your daily chat limits’ message, hindering their usage, even after troubleshooting.
    • Other users clarified that Plus users shouldn’t face any limits.

LlamaIndex Discord

  • LlamaIndex + SkySQL Launch AI Agents: LlamaIndex teams up with SkySQL to show how to build AI agent systems for reliable text-to-SQL conversion without code, per their announcement.
    • LlamaIndex now integrates with OpenAI Responses API enabling complex multi-agent workflows.
  • Telemetry Attributes Get Tagged: A member sought ways to pass custom telemetry attributes when using LlamaIndex, specifically to attach a user ID to events.
  • Multi-Modal OpenAI Agents Debut: Members discussed passing images as chat messages to OpenAIAgent, with one suggesting the use of OpenAI’s multi-modal capabilities.
    • Another recommended building an agent from scratch with workflows, or modifying chatmemorybuffer to add images to the request.
  • Internet of Agents Proposed: A member shared an article on constructing an Internet of Agents to solve interop problems in agentic AI, and can be found at [IoA].
    • The article suggests that open standards could unlock composability across ecosystems, including LlamaIndex.

tinygrad (George Hotz) Discord

  • E-Waste Rig vs Tinygrad Box: A user questioned the value of a repurposed e-waste inference machine with 4x 4090s (linked here) when compared to the Tinygrad Box.
    • Concerns were raised about potential PCIe errors due to the machine’s homebrew motherboard, estimating its value at $1,000 + the cost of the 4090s.
  • Finite Field Assembly: CUDA Alternative Surfaces: A user shared Finite Field Assembly, a CUDA alternative designed for computations over finite fields, extending C89 and supporting recursive computing.
    • It leverages the properties of prime numbers to multiply several array elements concurrently, for example in matrix multiplication.
  • TinyGrad Internals Exposed!: A user shared their comprehensive notes on TinyGrad internals available here, covering UOps, ShapeTracker, and the Pattern Matcher, drawing inspiration from mesozoic-egg.
  • ORT CPUExecutionProvider Silently Casts Float16!: A user reported that the ORT CPUExecutionProvider silently casts inputs into float32 for float16 models, runs computations with float32, and casts the output back into float16, which is blocking numpy removal.
    • The user suggested adding an envvar to replicate this behavior in their ONNX setup for testing and debugging purposes.
  • VAE tinygraining takes off!: A member has been experimenting with building a VAE with tinygrad and has successfully modified Huggingface’s Diffusers library to work with tinygrad.
    • The VAE used in Stable Diffusion is now functional, with the code available here.

Torchtune Discord

  • FP8 Training Recipes Explored: Most FP8 training recipes are actually FP8 QAT, used when you can only train on GPUs without native FP8 support (e.g., the A100); on FP8-capable hardware you can train in FP8 directly.
    • Torchtune office hours are next Friday; see the Discord link for details.
  • Discord Time Zones Finally Click: Members discussed the automatic conversion of time zones within Discord for events.
    • One member shared a brain meme GIF in response to successfully converting time zones on the fly.
  • Code Review Team asked to Step on the Gas: A member requested a final review for PR #2441 to expedite the merge process, as all checks have already passed.
    • Another member was pinged to review the PR.
  • GRPO Teaches Search on the Internet: A paper on GRPO to teach searching on the internet was shared arxiv.org/pdf/2503.09516.
    • Details of the project were not otherwise revealed.

Cohere Discord

  • Command-R Boasts Speedy Performance: The Command-R model is confirmed as the fastest and most versatile model; the playground uses Command-A by default and does not support model changes.
    • Users were directed to use the API to try out different models.
  • Aya-Vision Image Uploads Glitch: Users reported errors when uploading images to the playground using Aya-Vision, and on the Aya Vision demo on Hugging Face it sometimes takes over 30 seconds to respond.
    • A Cohere staff member responded that they will investigate the latency on their end.
  • Docs Typo Causes Bad Request: A user reported a typo in Cohere’s documentation where train_epoch=1 should be train_epochs=1, causing a BadRequestError.
    • A Cohere staff member confirmed the typo and pushed a fix.
  • Indie Game Dev Turns to Cohere: A self-taught indie game developer working mainly in C++ with graphics and audio libraries introduced themselves, mentioning they are currently working on a browser game for their friend’s web animation series.
    • This developer has started using Cohere as an alternative to the other big names.

Nomic.ai (GPT4All) Discord

  • LibreWolf Faces Security Scrutiny: Members discussed the security of LibreWolf compared to Firefox, questioning its advantages.
    • The conversation did not provide a definitive answer, but highlighted the importance of browser security considerations.
  • GPT4All Model Search Stumbles: A user reported difficulty searching GPT4All models, noting the absence of a built-in search feature.
    • A member clarified that local model list search hasn’t been a GPT4All feature for 2 years, and provided links to the model lists on GitHub.
  • Documentation Ingestion Model Assistance: A member requested advice on a model capable of ingesting documents and answering questions.
    • Another member shared the GPT4All wiki with official translations and suggested using Google Translate for other languages.
  • Llama3 8B Instruct Tested for Blogging: A user inquired about the suitability of Llama3 8B Instruct for creating blog posts and webpages from video courses.
    • The discussion prompted a question about the difference between .bin and .gguf files and their interchangeability, but did not provide a definitive answer about suitability for blogging.

DSPy Discord

  • Pydantic’s conint Triggers Validations: The conint feature in Pydantic sets constraints, such as conint(ge=1, le=10), but throws a ValidationError if the output falls outside the specified range.
    • A member requested DSPy to dynamically generate examples and resend requests upon validation failures, but this is currently not functioning as expected.
  • RateLimitErrors Bug MIPROv2 Users: Users reported frequent RateLimitErrors despite setting num_threads=1 when using MIPROv2 with gpt-4o-mini on Azure OpenAI, due to MIPROv2.compile() making multiple internal API calls.
    • It’s suggested to add retry logic with a sleep(30) interval, lower max_*_demos, and upgrade to the latest DSPy version with built-in rate throttling.
  • Rate Limit Workarounds Hamper Optimization: A user finds that reducing max_bootstrapped_demos and max_labeled_demos to circumvent RateLimitErrors hurts optimization.
    • They suggest DSPy should have a better internal mechanism to manage API call frequency, since structured prompting in MIPROv2 and Copro can lead to errors if the LLM returns empty outputs due to API truncation or rate limits.
  • Signatures as a, b -> c: In DSPy, a signature is defined as “a, b -> c”, where a, b, and c are meaningful field names.
    • The optimizer then generates prompts and runs them on a dataset to determine the best-performing prompt; a minimal sketch combining signatures with the Pydantic constraint above follows this list.
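A minimal sketch tying the two threads above together: a string signature with meaningful field names, plus a Pydantic conint check on the output; the model id is illustrative:

```python
# DSPy string signature plus a Pydantic range check on the result.
import dspy
from pydantic import BaseModel, ValidationError, conint

class Rating(BaseModel):
    score: conint(ge=1, le=10)  # ValidationError outside 1..10

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # illustrative model id
rate = dspy.Predict("question, context -> score")

pred = rate(question="How clear is this doc?", context="(paste text here)")
try:
    Rating(score=int(pred.score))
except (ValidationError, ValueError) as e:
    # Here one could regenerate with feedback, or back off on rate limits.
    print("Rejected output:", e)
```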

LLM Agents (Berkeley MOOC) Discord

  • DeepMind Engineer to Present AlphaProof Lecture: Thomas Hubert, a research engineer at Google DeepMind, will present “AlphaProof: when reinforcement learning meets formal mathematics” on 3/31 at 10AM PDT, livestreamed on YouTube.
    • The lecture will explore how computers contribute to grand problems like the Birch and Swinnerton-Dyer conjecture, with Hubert holding an MS in Mathematics from Stanford University.
  • MOOC Lecture Times Adjusted: The LLM Agents MOOC lecture today was moved to 10 AM PDT to accommodate the speaker from the UK.
  • Lecture Recordings Available: Recordings from prior LLM Agents MOOC lectures can be found on the course website and in this YouTube playlist.
    • Quizzes for the course are completion based, meaning the score does not matter as long as they are attempted.
  • AgentX Credits Offered: AgentX offers credit resources, and details can be found on the AgentX website.
    • A collection form for those wanting credits for AgentX is releasing this week.

MLOps @Chipro Discord

  • TMLS 2025 kicks off Call for Speakers: The Call for Speakers has opened for the Toronto Machine Learning Summit (TMLS) in June 2025.
    • TMLS 2025 boasts 16 specialized tracks, including Advanced RAG, Multimodal LLMs, AI Agents in Production, MLOps for Smaller Teams, Responsible AI Implementation, and GenAI Deployments.
  • MLOps focuses on Smaller Teams: The Toronto Machine Learning Summit will feature an MLOps track specifically designed for smaller teams.
    • This track provides a platform for these teams to exchange experiences and gain insights from others in the field of MLOps.

The Codeium (Windsurf) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The Gorilla LLM (Berkeley Function Calling) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The AI21 Labs (Jamba) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


PART 2: Detailed by-Channel summaries and links

{% if medium == 'web' %}

Manus.im Discord ▷ #general (626 messages🔥🔥🔥):

Credit refund for sandbox swirl issue, AI for WordPress website creation, Credit management, Manus AI vs Gemini 2.5

  • Swirl Bugs Spark Credit Comeback: A user requested an update on whether they can get credits back on the sandbox Swirl issue.
  • Manus Creates Code-Based Websites: A user inquired if the AI can only help make sites in code, considering they currently use WordPress and Figma.
    • Members responded that the AI can build the site for you like a business partner, or create good Next/React sites and hand you everything ready to deploy on Vercel.
  • Clever Credit Contingency Planning Comes to Light: One user described their credit management strategy involving Deepseek R1, Claude Sonnet 3.7, and finally Manus AI to optimize website building.
    • It was noted that being super precise with prompts also makes credit usage way more efficient.
  • GPTs vs Manus in Website Workflows: One user complained about Manus charging during its beta, saying it should work for cromagnons, not just prompt experts.
    • Other users counter-argued that prompt engineering is essential, that Manus is better than other AI options on the market, and suggested trying this approach to improve efficiency.
  • Gemini 2.5: The Beta Beast for Bugs?: Users compared Gemini 2.5 Pro versus Manus AI for various tasks, noting that Gemini can be better for complex analysis, reasoning, multi-modal analysis, and coding, and is cloud-friendly and cost-effective, but can’t execute an entire workflow.
    • It was also noted to check out the solution for how to reduce credit usage and how to do multiple backups.

LMArena ā–· #general (859 messagesšŸ”„šŸ”„šŸ”„):

Spider Model Analysis, Grok vs Gemini Performance, Coding Benchmarks and Model Evaluation, LLM Prompt Engineering, OpenAI's New Open-Weight Language Model

  • Spider Model Gets the Third Degree: Members discuss the verbose and creative nature of the Spider model, with some questioning if it’s simply a training quirk rather than a different parameter size, and others reporting inconsistent results compared to models like Phoebe, Themis, and Cybele.
  • Grok and Gemini Duel Over Science: Members discuss the comparative strengths of Grok and Gemini, with one member asserting that Grok3 remains superior for scientific tasks and even outperforms R1 on arc-agi-1.
    • Others note that it depends on what the user is looking for.
  • GPT-4o Praised for Creative Coding: Users reviewed GPT-4o, claiming that it is impressive at creative coding, even better than GPT-4.5, DeepSeek V3-0324, and Claude 3.7 Sonnet in non-thinking mode.
    • One user even gave the model a 9.5/10 rating, while noting it is still not as good as Claude 3.7 Sonnet (Thinking) or DeepSeek R1.
  • New Open-Weight Language Model Teased by Sama: Sam Altman teased a powerful new open-weight language model with reasoning in the coming months, and wants to talk to devs about how to make it maximally useful, according to this post.
    • It seems that this model will undergo preparedness framework testing before release.

Links mentioned:


Cursor Community ā–· #general (898 messagesšŸ”„šŸ”„šŸ”„):

Gemini 2.5 Pro, Cline vs. Cursor, Roo Code, Windsurf, Cursor Pricing

  • Gemini 2.5 Pro Praised, faces tool use issues: Users discuss the Gemini 2.5 Pro model, with some praising its performance and cost-effectiveness, while others report problems with tool use in Cursor, suggesting Cursor might be intentionally hindering its functionality to promote paid options.
    • Despite its potential, some users find Gemini Pro 2.5 not good at actually calling the tools within Cursor, often outputting incorrect or non-functional code.
  • Cline vs Cursor debate heats up: The discussion revolves around Cline’s stability and efficiency compared to Cursor’s features and bugs, with some users preferring Cline for its reliability and direct model application, and others acknowledging Cursor’s semantic search capabilities and experimentation.
    • One user stated that Cline ā€œfeels polished afā€, while another feared that ā€œRoo code will nuke my whole codebaseā€.
  • Roo Code Gains Traction: Several users are exploring Roo Code for its features like boomerang tasks and better context retention, noting it as an evolution of Cline, but concerns remain about its stability, rollback capabilities, and high Anthropic API token consumption, leading some to call it an option for vibe coding.
    • Despite praises, one user said, If it’s not implemented in roo it ain’t ready yet for me.
  • Windsurf as an alternative to Cursor: Users explore Windsurf as a potential Cursor alternative for its ultimate plan and terminal/server task stability, with mentions of an embedded browser for easily sharing element info with the AI; however, concerns arise regarding its limited context window, the actions the model can actually take, and possibly worse value than several normal plans.
    • One user stated I don’t like windsurf at all, the context window seems even smaller, while others point out Windsurf’s seemingly better stability.
  • Context Window and Pricing: Members are frustrated with Cursor’s usage-based pricing, token limits, and quality/efficiency reduction of models when usage limits are reached, leading some to explore alternative assistants like Cline or Roo for their full context windows and lower costs with services like OpenRouter or AI Studio.
    • A user stated that the same feature with Claude Max on Cursor would have cost around $2, calling it ā€œa 10x reduction in priceā€ when talking about alternatives.

Links mentioned:


Perplexity AI ā–· #announcements (4 messages):

Perplexity Pro, Discord Improvements, Smart Routing

  • Perplexity rolling out new Pro features: Perplexity will soon roll out a new ā€œProā€ that includes both existing Pro + Reasoning models.
    • The new Pro will default to sticky models instead of ā€œAutoā€ for follow-ups; a highly requested change.
  • Perplexity Pro has Smart Routing: Pro now also benefits from smart routing to ensure the best balance of speed and reasoning.
    • Perplexity is soliciting feedback in the appropriate channel.
  • Discord Improvements Incoming: The moderation team has been collecting feedback and will be making 3 improvements to the Discord experience over the next week.
    • These improvements include: 1) Simplified Onboarding Flow, 2) Better Way to Relay Feedback, and 3) Pro Channel Visibility & Access.

Perplexity AI ā–· #general (790 messagesšŸ”„šŸ”„šŸ”„):

Deep Research High, Grok Rate Limits, Deepseek bribed, Comet Waitlist, OpenRouter

  • Deep Research High still doesn’t exist: The ā€œDeep Research Highā€ tier on Perplexity AI does not exist yet, despite some users believing they are using it, as the Complexity dev confirmed today.
    • A user also noted that Grok gives 5 deep searches per 2 hours for free, while also pointing out that Grok rate limits are very strict.
  • New Perplexity models coming zzzz: The promised Comet waitlist rollouts and potential addition of a HIGH model did not materialize last week.
    • A user expressed frustration with frequent changes, saying that it’s pretty unethical to rename Deepseek to perplexity reasoning model and name it 1776… yea Deepseek US edition. wtf is this.
  • DeepSeek paid to stay out of the game?: A user speculates that OpenAI bribed Deepseek to prevent them from fixing their web search, thus stifling competition.
    • Another user refuted this claim, saying that it makes 0 sense for OpenAI to bribe DeepSeek to ā€œkeep websearch offā€, citing talent and OSS.
  • Annoyances with New UI Changes: Users expressed frustration with the new UI changes, particularly the removal of model selection options and the forced ā€œAutoā€ mode, and several want a Perplexity Pro refund.
    • Some users speculate that the automatic model selection is intentional on Perplexity’s part to push users to cheaper models, which led some to say Pro mode has gotten worse and that they now recommend 2.5 Pro.
  • Pro Search Woes for Sports Betting: Members discuss using Perplexity AI for sports betting, prompting warnings about the unreliability of AI for financial decisions.
    • A user suggested that in manage account you can use the AI of your preference; however, they added that there is no specific AI for it.

Links mentioned:


Perplexity AI ā–· #sharing (20 messagesšŸ”„):

AI Pathfinding Quirk, Supercomputer, AI diagnoses celiac disease, Google authenticator UX, Self-hosted projects

  • Bat AI Pathfinding Quirk exposed!: A Perplexity AI Page discusses an oddity with bat AI.
    • No further information was provided, but users can investigate this AI quirk.
  • Hyper Supercomputer Uncovers Something!: A Perplexity AI Page mentions a supercomputer discovery.
    • Further details require a visit to the page.
  • AI diagnoses celiac disease: A Perplexity AI Page speaks of AI diagnosing celiac disease.
    • No further information was provided.
  • Google Authenticator’s UX regressed!: A Perplexity AI Page discusses UX regressions in Google Authenticator.
    • Users are encouraged to investigate the changes.
  • Exploring Best Self-Hosted Projects: A Perplexity AI Search attempts to find the best self-hosted projects.
    • Interested users should check out the link.

Perplexity AI ā–· #pplx-api (30 messagesšŸ”„):

Sonar API performance, Structured outputs, Image Search, Search depth API, Prompt data retention

  • Sonar API speed to improve: Members reported that the newest version of Sonar has a significantly longer response time than the previous version; PPLX is making a note of this and will see what they can do.
    • Another member reported 2.25 sec to first token with the new sonar, while another reported 1 minute wait times for the same.
  • Tier Restrictions Lifted on Structured Outputs: Perplexity AI announced that structured outputs are now available for all users, regardless of tier level, effective immediately.
    • The announcement indicated that JSON structured outputs are supported across all models, while both JSON and Regex structured outputs are currently supported for sonar and sonar-reasoning models.
  • Image search will be coming soon to API: In response to queries about using the API for image searches to find similar products, a team member confirmed that image search is not yet supported but will be available very soon.
    • They noted that the API does offer a way to return images using the return_images=True parameter (see the sketch after this list).
  • Users request API params for search depth: A user inquired about specifying the depth of search (low, medium, high) in the API, noting they couldn’t find it in the sample cURL requests.
    • A member responded that the search depth can be passed as extra body during the request, pointing to the API reference and promising to add it to the cURL request examples.
  • No data retention policy: In response to a question about prompt and output retention, a Perplexity team member confirmed they have a zero-data-retention policy for the API.
    • The member clarified that this policy applies on their end.
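
As a rough illustration of that parameter, a hedged sketch against Perplexity’s OpenAI-style chat completions endpoint; the key and prompt are placeholders.

```python
import requests

resp = requests.post(
    "https://api.perplexity.ai/chat/completions",
    headers={"Authorization": "Bearer YOUR_PPLX_API_KEY"},
    json={
        "model": "sonar",
        "messages": [{"role": "user", "content": "Show products similar to X"}],
        "return_images": True,  # include image results alongside the answer
    },
    timeout=60,
)
print(resp.json())
```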

Links mentioned:


OpenAI ā–· #ai-discussions (747 messagesšŸ”„šŸ”„šŸ”„):

Gemini 2.5 Pro, Grok vs Gemini, AI Image Generation, AI Energy Usage, Cursor & Code Generation

  • Gemini 2.5 Pro Coding Abilities Debated: Users have expressed varying opinions on Gemini 2.5 Pro’s coding abilities, with one user noting it’s terrible at C++ and WinAPI and always hallucinates stuff, while another found it to be very solid at C++ but struggles with macro expansion.
    • Others find it to be excellent in certain languages like Jax, and its detailed CoT (Chain of Thought) steps.
  • Grok’s Unstable Performance Sparks Frustration: Several users have reported Grok’s unstable performance, experiencing frequent log-offs and internal error messages, and the fact that the ā€œthinking mode doesn’t work as intended.ā€
    • Despite these issues, some users still find Grok to be pretty good and continue to subscribe to it alongside ChatGPT Pro.
  • New Image Generation sparks Debate: Users are trying the new image generation, with broadly positive consensus, though some find it glitchy and prone to producing scuffed text.
    • Although Hayao Miyazaki rejects AI art, his Ghibli style is being mimicked by many users and AIs.
  • AI Energy Use Questioned: Some users question whether AI really uses a lot of energy and water, stating that making a single burger uses almost 6000 times more energy than a single OpenAI query, and that the claim AI uses water indirectly could be applied to many things.
    • Others insist that AI uses a lot of electricity and water because datacenter cooling systems are water-based, and that water evaporates and needs replenishing.
  • Cursor & Code Generation Tooling Discussed: Members discussed code generation within Cursor and general code quality, including its lack of customizability and limited interface; the models in Cursor, such as the new Gemini 2.5 Max and Claude 3.7 Max, offer full context but are paywalled.
    • One member asked if Cursor could handle 10k words of code, and it was stated that it fixes issues in big files with more than a thousand lines of code, even when multiple files are provided.

Link mentioned: THIS is the FULL DISCLOSURE 🤯 (Krystle Channel).


OpenAI ā–· #gpt-4-discussions (67 messagesšŸ”„šŸ”„):

File expiration issues in ChatGPT, Rate Limits for Image Generation, Reporting potential bugs for rewards, Ethics & Usage Policies

  • ChatGPT Files Expire Prematurely!: A user reports files uploaded to ChatGPT are expiring within minutes, disrupting complex work involving legal and tax documents, despite previously stable sessions.
    • Another user suggested using ChatGPT projects and uploading primary files as project files.
  • Image Gen Users Hit Rate Limiting: Due to extreme load since the new image model release, Plus users are now experiencing rate limits on image generation.
    • The interim measure responds to that load; new users also cannot create videos on Sora for now.
  • Bug Bounty Hunters Cash In: Members discussed OpenAI’s Bug Bounty Program, where reporting ā€˜in scope’ bugs can yield rewards.
    • The discussion emphasized the ethics involved and the importance of following Terms of Service to avoid account suspension, especially concerning disallowed content.
  • User Account Access Hanging By a Thread: A user shared that he was testing a theory from YouTube on ChatGPT and made AIs talk to each other, leading to concerns about violating OpenAI’s Usage Policies.
    • Another member pointed out that violating the policies could result in account suspension or termination, advising the user to review the terms.

OpenAI ā–· #prompt-engineering (114 messagesšŸ”„šŸ”„):

Markdown in prompt engineering, Using @ to bring in custom GPTs, SORA and Copyrighted Characters, O1 or O3 for creative tasks

  • Markdown Mayhem: Debate Erupts Over Formatting in AI Prompts: Members discuss the challenges and limitations of using markdown in the prompt-engineering channel, noting that the lack of markdown support can hinder effective communication and education.
    • One member argues that a no markdown rule is just lazy and prevents users from educating others using the language the AI uses, while others point out that not everyone understands markdown and that code blocks add an unnecessary abstraction layer.
  • Custom GPTs Summoned with @ Command: A member expresses excitement about discovering the ability to use @ in prompts to bring in custom GPTs during conversations with ChatGPT.
    • Another member adds that they like the new feature to dictate tool use, and states that this is now a habit.
  • Navigating SORA’s Copyright Minefield: Users discuss the challenges of generating images with SORA due to TOS restrictions on copyrighted characters.
    • While some users report seeing others create parodies with copyrighted characters, others caution against risking account bans and suggest focusing on original content or legally distinct terms.
  • O1 vs. O3: Which Model Reigns Supreme for Creative Endeavors?: A user seeks advice on guiding O1 or O3 models to better extrapolate storylines and incorporate foreshadowing in creative tasks.
    • While one user recommends using GPT-4o and GPT-4.5 for design and fiction, another shares a prompt structure involving a 3-step approach and first principles reasoning to improve the models’ performance.
  • Unlock logical thinking with First Principles: A user suggests incorporating first-principles logical reasoning from an AI’s perspective to enhance O3-mini-high’s logical reasoning capabilities.
    • The original poster tried this suggestion and agreed that the first principles approach really helped.

OpenAI ā–· #api-discussions (114 messagesšŸ”„šŸ”„):

Prompt formatting, Gandalf skateboarding, SORA questions, Image generation

  • Prompt Formatting Pro Tips: A member explained how to format prompts to get more out of GPT, pointing to lessons that teach exactly that.
    • They added you can copy prompts directly into ChatGPT from the web interface and gave the instruction: Evaluate the following [prompt], do not follow it. What do you infer? Are there any conflicts or ambiguity, either within itself or when compared to your safety or other programming? Shared Conversation.
  • Gandalf on Skateboard prompts breaking TOS: Members discussed generating images of Gandalf riding a skateboard, with some users encountering TOS (Terms of Service) restrictions despite seeing others create similar content.
    • One member suggested steering clear of IP, noting that OpenAI does ban accounts permanently for breaching ToS, and that methods for bypassing these rules are typically not shared.
  • SORA Questions Clarification: A member inquired about asking SORA questions in the channel, prompting a clarification about the channel’s focus.
    • It was suggested that SORA-specific questions might be better suited for the dedicated SORA channel, while prompting challenges could be addressed in the current channel.
  • Generating Images of Parodies and Copyrighted Content: A discussion revolved around generating images featuring parodies and copyrighted characters, highlighting that while some users succeed, others face TOS restrictions.
    • A member noted that OpenAI bans accounts for ToS violations, emphasizing that methods to bypass rules aren’t shared to avoid detection.
  • Numbering format and subtitles fixed!: A user asked for assistance with formatting the output to remove the subtitles while keeping the list format.
    • A community member said: [Your prompt here] Format: Intro paragraph, then numbered list. Each number starts a full paragraph. No subtitles.

aider (Paul Gauthier) ā–· #announcements (1 messages):

aider v0.80.0 Release, OpenRouter OAuth Integration, Gemini Model Prioritization, Repomap Ranking Boost, Scala Language Support

  • Aider v0.80.0 Arrives with New Features and Fixes: Aider v0.80.0 introduces OpenRouter OAuth integration, prioritizes Gemini models, and boosts repomap ranking, with Aider itself writing 87% of the code.
    • This release also adds a Ctrl-X Ctrl-E keybinding for editing the input buffer in an external editor, alongside other improvements and bug fixes.
  • OpenRouter OAuth Simplifies Model Access: Aider now offers OAuth integration with OpenRouter if no model and keys are provided, streamlining the process of accessing models.
    • It automatically selects the OpenRouter default model based on free/paid tier status when OPENROUTER_API_KEY is set but no model is specified.
  • Gemini Models Get Prioritized: The latest Aider version prioritizes gemini/gemini-2.5-pro-exp-03-25 when GEMINI_API_KEY is set, and vertex_ai/gemini-2.5-pro-exp-03-25 if VERTEXAI_PROJECT is configured, enhancing model selection.
    • These settings ensure users leverage the most appropriate Gemini model based on their environment variables.
  • Repomap Ranking Receives a Boost: Repomap ranking is now improved for files whose path components match identifiers mentioned in the chat, making it easier to locate relevant files.
    • Additionally, Scala language gains repomap support, further broadening the range of supported languages.
  • Ctrl-X Ctrl-E Keybinding for External Editor Access: Users can now edit the current input buffer in an external editor using the new Ctrl-X Ctrl-E keybinding, improving the editing workflow.
    • This feature, contributed by Matteo Landi, offers a convenient way to leverage familiar text editors for input.

Link mentioned: Release history: Release notes and stats on aider writing its own code.


aider (Paul Gauthier) ā–· #general (785 messagesšŸ”„šŸ”„šŸ”„):

Fixing AI generated code, Aider Enhancement, Gemini 2.5, OpenAI Agent SDK, Claude

  • Posting on boomer twitter to fix generated code: A member posted their offer to help fix AI generated code on boomer twitter (linkedin).
    • Another member expressed concern, stating that AI can generate thousands of lines of code easily, needing AI to undo the slop.
  • Gemini 2.5 versus Sonnet discussion fires up: Members discuss the merits of Gemini 2.5 versus Sonnet for various tasks including code rewrites, with varying results.
    • One member lauded Gemini 2.5 for one-shot rewriting their server from node ā€˜http’ into express, but another said ā€˜my statement on gemini 2.5 is that it is trash/inconsistent and trained to provide good benchmarks but maybe I use it wrong.’
  • Gary codes in GO, organizes Obsidian Vault with GDS: A member shared their GitHub organization and detailed the many applications they’ve coded in GO, including a tool named obsorter which sorts their Obsidian vault files into predefined directories and renames them based on content, using a Gary Decimal System (GDS).
    • Others shared their admiration for the system that seems to act as a ā€˜johnny decimal system’ for knowledge.
  • DeepSeek can’t follow instructions, unlike Gemini: A member complained that after switching from Gemini 2.5 to DeepSeek they found that DeepSeek cannot follow instructions, stating that ā€˜I tell it to jump off bridge it creates two small villages’, while praising Gemini.
    • Others chimed in that the rate limits for Gemini 2.5 might be the biggest issue when trying to use the new models.
  • Aider Benchmarks highlighted: A member highlights that Aider benchmarks are front and center on a YouTube video and recognizes its value as a tool.
    • This sparked discussion that Aider’s results reflect someone prompting just right, with intimate knowledge of the tool and how to get the best out of LLM interactions.

Links mentioned:


aider (Paul Gauthier) ā–· #questions-and-tips (150 messagesšŸ”„šŸ”„):

Gemini 2.5 Pro, Rate Limits, Aider Hooks, Architect mode improvements, MCP support

  • Gemini 2.5 Pro Usage and Quota Oddities Persist: Users are reporting inconsistencies with Gemini 2.5 Pro usage, with the API console sometimes showing separate 2.5 and 2.0 usage, despite Aider reporting only 2.5 usage, as detailed in issue #3641.
    • A member mentioned that 2.0-exp is an internal name, and some have seen different quota limits being applied to 2.5, with speculation that 2.5 may be reusing 2.0 quotas.
  • Aider’s Cache Writes Inflating Token Counts: One user observed that Aider doesn’t count cache writes as input, leading to the tokens sent showing as double (e.g., 12k sent, 6.1k cache write) when using Sonnet.
    • The user inquired whether others have experienced similar behavior, and the root cause is being investigated to ensure accurate token tracking.
  • Architect Mode’s Edit Loop Frustrates Users: Some users reported an issue in recent versions where architect mode gets stuck in an infinite loop, repeatedly asking to edit files after providing a summary, which can be bypassed via /ask and /code ok.
    • A member identified auto-accept-architect: false in the config file as a way to revert to the previous behavior where it always asks before editing.
  • MCP Support Gains Traction: There’s growing interest in MCP (Model Context Protocol) support within Aider, with discussions around its potential to reduce model lock-in and foster OSS tool development, as showcased on MCP Marketplace.
    • A member mentioned a third-party integration via mcpm-aider, and others expressed interest in built-in support for streamlined usage, with PR #3672 adding initial support.
  • Partial Reads Sought for Large Files: Users are seeking ways to implement partial reads in Aider to handle large files that exceed context limits, with some suggesting using the /run command with tools like head, grep, or rag-cli.
    • A member shared a custom RAG tool called rag-tool built with Mastra agents, designed to extract details from codebases and work with large files, usable within Aider via /run npm run agent.

Links mentioned:


Quantization impact on model performance, Interview Coder AI tool

  • Quantization Degrades Model Accuracy: Converting models from FP16 to Q8 results in a slight reduction in model quality, and using Q4 quantization, the default in Ollama, can further degrade it.
    • It was noted that anything below Q6 is severely impaired, especially for reasoning tasks, though another member said that since some models are natively FP8, Q8 quantization shouldn’t lose any performance (see the sketch after this list).
  • Interview Coder Promises to Disrupt Technical Interviews: Interview Coder is advertised as an invisible AI for technical interviews, aimed at replacing traditional platforms like Leetcode.
    • The tool is described as a Peter principle accelerator.
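
To compare quant levels yourself, a rough sketch with llama-cpp-python; the GGUF filenames are placeholders for whatever quants you have on disk.

```python
from llama_cpp import Llama

prompt = "If all bloops are razzies and all razzies are lazzies, are all bloops lazzies?"

# Same model at three quant levels; watch reasoning quality degrade below Q6
for path in ("model-Q8_0.gguf", "model-Q6_K.gguf", "model-Q4_K_M.gguf"):
    llm = Llama(model_path=path, n_ctx=2048, verbose=False)
    out = llm(prompt, max_tokens=64)
    print(path, "->", out["choices"][0]["text"].strip())
```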

Link mentioned: Interview Coder - AI Assistant for Technical Interviews: no description found


Unsloth AI (Daniel Han) ā–· #general (536 messagesšŸ”„šŸ”„šŸ”„):

DeepSeek-V3-0324 Dynamic Quantization, RoBERTa Training Optimization, Serving Dynamic Quantized Checkpoints, 4bit Gemma 3 12B Training Issues, Qwen2.5-VL-32B-Instruct-unsloth-bnb-4bit notebook

  • DeepSeek-V3-0324 Dynamic Quantization makes debut: Dynamic quantized versions of DeepSeek-V3-0324 have been released on Hugging Face, with a guide for local execution.
    • Unsloth’s Dynamic Quants are selectively quantized, improving accuracy over standard bits.
  • Google Cloud Spot Instances Beat Runpod’s Prices!: Switching to Google Cloud resulted in 2x faster workloads at lower cost compared to Runpod.
    • Members noted that Google Cloud Spot Instances are up to 60% cheaper and more stable than Runpod, which often breaks after 15 minutes.
  • Multi-GPU Support for Everyone - Soon(TM): The unsloth team says that Multi-GPU support will be available to everyone soon, but Pro/Enterprise is currently on hold due to capacity issues.
    • The consensus was to give multi-GPU to everyone with the current capabilities of unsloth.
  • HF x Unsloth Reasoning Collab: Unsloth has partnered with Hugging Face on this collab to teach users how to fine-tune LLMs with GRPO.
    • The course covers reward functions, GRPO math, and applying RL to real-world use cases, alongside a tutorial.
  • New Whisper Notebook Makes a Scene: Unsloth released a notebook for training Whisper, but emotive tags don’t work without pretraining.
    • A user showed that fine-tuning with Orpheus and Unsloth on just 50k German samples already sounds quite good. See here.

Links mentioned:


Unsloth AI (Daniel Han) ā–· #off-topic (98 messagesšŸ”„šŸ”„):

Gemma-3 alternatives with tools, Full finetuning challenges and solutions, ML focused datasets, Training vs Fine-tuning memory requirements, Llama 3.2 3B

  • Quest for Gemma-3 with Tools Deepens: Members are seeking Gemma-3 alternatives that support tool use, citing the official Gemma documentation indicating its support for function calling.
    • Suggested alternatives include Qwen 2.5 (any size) or Mistral Small 3.1, with a disclaimer that models under 7B may not perform optimally.
  • OOM Woes Plague Full Finetuning Endeavors: Users experimenting with Unsloth, Axolotl, and TorchTune for full finetuning on single and multiple GPUs are facing Out of Memory (OOM) issues.
    • One user highlighted success with LoRA on Unsloth Qwen2.5 14B and sought advice for comparing results with full finetuning.
  • Machine Learning Dataset Hunt Kicks Off: A member is seeking machine learning-focused datasets to finetune a help bot for users of a FOSS repository, and shared 2 links for the community: ML-ArXiv-Papers and ml_papers_arxiv.
  • Full Finetuning Requires less memory than Training from Scratch: A user inquired about the memory requirements for training models from scratch, particularly for tasks like facial recognition, and whether that takes more memory than finetuning.
    • The answer was that training from scratch requires significantly more resources (500k+ images, more than 16GB VRAM) than finetuning, so they were advised not to reinvent the wheel.
  • Sloth Hug Emoji gets Love: A member added the <:slothhug:1257540335438008343> emoji to the šŸ¤— server, shared links to the discord_sloth_hug.png and sloth_huglove_large.png.
    • This prompted celebratory emoji reactions from other members.

Unsloth AI (Daniel Han) ā–· #help (286 messagesšŸ”„šŸ”„):

Unsloth documentation update, Llama3.2-1B GRPO error, Deepseek V3 inference slow, Aya-Vision fine tuning with unsloth, Flash attention with Qwen model

  • Unsloth docs urge dependency updates!: A member recommends updating Unsloth documentation to discourage using --no-deps during updates, as it causes issues and shares a link to the documentation.
    • Another member confirmed that the standard updating procedure also includes the --no-deps flag, indicating a potential error in the documentation that needs correction.
  • Debugging Aya Vision 8B Dimension Mismatch: Members troubleshoot a ValueError: Image features and image tokens do not match error while fine-tuning Aya-vision 8B with Unsloth, referencing the Qwen Vision Fine-tuning notebook as a guide.
    • It was determined that the tokenizer + UnslothDataCollator doesn’t properly resize images, leading to dimension mismatches, and that the AyaVisionProcessor expects a different message format, which was ultimately resolved.
  • Troubleshooting Llama3.2-1B GRPO Errors: Members encounter errors while performing GRPO on a continually pre-trained Llama3.2-1B model, specifically a torch.fx.experimental.symbolic_shapes.ConstraintViolationError related to shape constraints.
    • Debugging steps include checking the configurations of the meta model versus the finetuned model, and verifying the status of the unsloth_fixed parameter, suggesting an issue related to the compatibility of the model with the Unsloth implementation.
  • Mamba fine-tuning with Unsloth has issues: A member reports failing to get Mamba fine-tuning working with Unsloth, encountering issues with the redirect function, also mentioning failures with RWKV-6 HF.
    • Members discussed that while RWKV-6 HF appears to work, the trainer doesn’t perform any actions, potentially requiring source-code edits; Mamba, however, is expected to work with a single-line code change.
  • GGUF Conversion fails on Gemma3 due to Assertion Error: A member faces an AssertionError when trying to save or merge a continued pre-trained Gemma 3 model into Float16 for vLLM or GGUF format, suspecting a float32 casting issue during conversion.
    • The error occurs in unsloth_zoo/saving_utils.py, specifically during the creation of LoRA statistics, indicating a potential problem with the number of modules or the consistency of LoRA parameters.

Links mentioned:


Unsloth AI (Daniel Han) ā–· #showcase (2 messages):

OdysseyXL-V2.5 Code Request

  • Code sharing for OdysseyXL-V2.5 requested: A user requested the code for open-neo/OdysseyXL-V2.5.

Link mentioned: OdysseyXL - a open-neo Collection: no description found


Unsloth AI (Daniel Han) ā–· #research (88 messagesšŸ”„šŸ”„):

GRPO notebooks, reward function, llama 3.1 8b finetuning, ggml-org/llama.cpp quantization, Openllm leaderboard

  • Reward Reasoning Reconfiguration Requested: A member asked about modifying the reasoning process in GRPO notebooks and was advised to simply change the reward function (see the sketch after this list).
  • Llama 3.1 Fine-Tuning Faceoff: A member evaluated their finetuned Llama 3.1 8b model using similarity scores and sought validation for their approach.
    • Other members suggested using BLEU score or similar metrics, while some cautioned against relying solely on similarity scores due to the stochastic nature of models.
  • Quantization Quest Quells Quandaries in llama.cpp: A member shared a pull request that adds the ability to quantize other tensors, beyond token-embedding and output-tensor, for most supported architectures, except Mamba, RWKV6, RWKV6QWEN2 and T5.
    • Another member noted that this work aims to improve GGUF quants to be more accurate and capable at different bits-per-weight (bpw), similar to ExLlama2’s quants.
  • Latent Space Verification vanquishes Veracity Void: A member shared their first paper about LLMs knowing when they’re hallucinating and a mechanism for self-correction in latent space.
    • Another member inquired about the metrics used to detect hallucinations, especially in out-of-distribution scenarios.
  • Benchmark Bonanza: Best Bet for Beating Bad Benchmarks: A member asked for advice on which leaderboard or eval to use for comparing models’ general performance.
    • Another member argued that there is no such thing as general performance and that models excel in different verticals. They suggested SWE-bench, aider polyglot, RULER, and AIME for specific evaluations.
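
For the reward-function point above, an illustrative sketch in the style TRL’s GRPOTrainer expects: each reward function receives the sampled completions and returns one score per completion. The think-tag format check is an assumption for illustration, not the notebook’s actual reward.

```python
import re

def format_reward(completions, **kwargs):
    # Reward completions that start with an explicit <think>...</think> block
    pattern = re.compile(r"^<think>.*?</think>", re.DOTALL)
    return [1.0 if pattern.match(c) else 0.0 for c in completions]

# Plugged in roughly as:
# GRPOTrainer(model=..., reward_funcs=[format_reward], train_dataset=...)
```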

Links mentioned:


OpenRouter (Alex Atallah) ā–· #announcements (2 messages):

Auto Top Ups issues, Stripe Metadata Mismatch, Credits Added

  • Auto Top-Ups Fail Due to Stripe Glitch: Auto top-up functionality was temporarily disrupted due to changes in payment metadata that caused a silent error when the expected data from Stripe was not received.
    • The feature has been restored by rolling back the changes, and the team is addressing missing credits and system improvements to prevent future occurrences.
  • Credits Incoming After Auto Top Up Outage: The issue causing the auto top-up outage has been fully resolved, and all missing credits have been added to the affected accounts.
    • Impacted users will receive an email notification regarding the resolution.
  • Root Cause: Stripe Data Format and Faulty Error Logger: The root cause of the outage was a data formatting mismatch from Stripe, exacerbated by inadequate automated testing and a faulty error logger.
    • Enhanced monitoring, error tracking, and end-to-end testing have been implemented to avoid recurrence; users experiencing ongoing issues should contact the team via email for further assistance.

OpenRouter (Alex Atallah) ā–· #general (402 messagesšŸ”„šŸ”„):

Output image models timeline, OpenRouter prompt caching, Agent Hustle, GPT-4o, Free models rate limits

  • Output Image Models Incoming: Members discussed the arrival of output image models, anticipating their integration into platforms like OpenRouter with models like GPT-4o and Gemini.
    • A member expressed excitement about switching directly to OpenRouter once these models are available, moving away from using Gemini’s.
  • Prompt Caching Savings at OpenRouter: OpenRouter supports prompt caching to save on inference costs, with most providers automatically enabling it; Anthropic requires enabling it on a per-message basis, as documented here.
    • Users can inspect caching savings on the Activity page or via the API, with the cache_discount field indicating savings from cache usage (a request sketch follows this list).
  • Agent Hustle Project Overview: A member shared details about their project, Agent Hustle, a stock trading LLM that utilizes a TEE wallet to collect small fees on every transaction.
    • The system strings together about 12 function calls in total, as exemplified here.
  • Concerns about rate limiting: Members reported experiencing rate limits on Google/Gemini-2.5-pro-exp-03-25:free, with one user receiving the error Rate limit exceeded, please try again 45906 seconds later.
    • OpenRouter’s team clarified that rate limits can originate from Google or OpenRouter, and specifying providers limits OpenRouter’s ability to load balance effectively; check this documentation for rate limits.
  • OpenRouter Adds BYOK Fee: When using your own OpenAI API key with OpenRouter, a 5% fee is applied to the costs charged by OpenAI for each generation, which is then deducted from the user’s OpenRouter credits.
    • This fee is applicable only on credits provided by the provider and not on credits used directly with upstream providers like AWS.
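
For the Anthropic per-message caching mentioned above, a hedged sketch of the cache_control breakpoint scheme OpenRouter documents; the key, model slug, and file are placeholders.

```python
import requests

big_doc = open("reference.txt").read()  # large, reused context worth caching

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_OPENROUTER_KEY"},
    json={
        "model": "anthropic/claude-3.7-sonnet",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": big_doc,
                 "cache_control": {"type": "ephemeral"}},  # mark the cached span
                {"type": "text", "text": "Summarize the document above."},
            ],
        }],
    },
    timeout=120,
)
print(resp.json())
```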

Links mentioned:


LM Studio ā–· #general (318 messagesšŸ”„šŸ”„):

LM Studio Model Details Fetch Failed Error, VSCode integration with LM Studio, Intel NPU Usage with LM Studio, LM Studio tool use and web search, speculative decoding with LM Studio

  • Fetch Quest Frustrations: User Battles ā€˜Model Details Error’: A user is struggling with a Model details error: fetch failed issue in LM Studio on Windows 11, having tried various fixes like using a Hugging Face Proxy, manually changing hostnames, tweaking DNS settings, using a VPN, and reinstalling.
    • Other members suggested firewall issues, IPV6 problems, or unsupported machine architecture (AVX-only CPU), but the user confirmed they can access Hugging Face in the browser and terminal, and has already tried switching to IPV4.
  • Continue.dev plugs into LM Studio for sweet VSCode Autocomplete: A member mentioned that you can connect LM Studio to VSCode via a VSCode extension that makes custom AI code assistants.
    • They highlight the platform’s capabilities in AI-native development, including tab-to-autocomplete and the ability to refer to specific code.
  • NPU Not Ready: LM Studio Lacks Intel Ultra Integration: A user asked if LM Studio can take advantage of the NPU in their Intel Ultra PC, to which another member responded that the NPU is not usable by any software yet.
    • Another member pointed to features like Windows Studio Effects as examples of Windows features that use NPUs, and specified that they don’t know of any LLMs that use it.
  • LM Studio API: Your Key to Unlocking Tool Use: Members discussed the options for enabling tool use and web search capabilities within LM Studio, and whether the LM Studio application UI can be modified.
    • It was clarified that tool use is only available via the LM Studio API, not the ChatUI, leading some to consider modifying Open WebUI as an alternative.
  • Kokoro TTS and Orpheus battle it out for LM Studio Text-to-Speech Supremacy: Members inquired about integrating Text-to-Speech (TTS) models with LM Studio, seeking alternatives to OpenAI’s speech ability, a user linked hexgrad/Kokoro-82M, a TTS model, as an option.
    • However, it was mentioned that CanopyAI’s Orpheus is the only TTS that works in LM Studio (via API, not in chat), and this repo is used to run it locally with LM Studio.

Links mentioned:


LM Studio ā–· #hardware-discussion (63 messagesšŸ”„šŸ”„):

Epyc systems for ML, LM Studio on older PCs, Saving context with LM Studio API, Mac Studio vs multiple GPUs for inference, Distributed LLM inference

  • Epyc Systems Challenge GPUs in Memory Bandwidth: New Epyc systems with high-frequency 12-channel DDR5 memory can achieve close to 600 GB/s memory bandwidth (12 channels Ɨ ~48 GB/s for DDR5-6000 ā‰ˆ 576 GB/s), rivaling consumer-grade GPUs for LLM performance due to their massive memory capacity.
    • One member suggested that a 10-12k budget could build a decent Epyc machine capable of running huge models, offering an economical solution for reasonable inference speeds and massive context windows, and no need for GPUs!
  • Old PC gets LM Studio Boost: A member reported successfully running decent-sized Qwen and Llama models (6Q quantization) on a 2016 Dell Inspiron laptop (i7 6700HQ, 32GB DDR3, integrated graphics) using LM Studio with CPU AVX2-compiled runtimes.
    • He was surprised the old laptop still holds its own and called LM Studio the greatest!
  • LM Studio API Context Handling: To maintain conversation context when using the LM Studio API with a Telegram bot, the user must store conversation history in a variable (e.g., in JSON format) as the API itself does not inherently retain context.
    • It was suggested to key stored conversations by a unique-tg-user-id, unless it is being hosted on a PC that constantly reboots (a minimal sketch follows this list).
  • Mac Studio Tempts Inference Server Builders: A member pondered whether to build an inference server with multiple Nvidia cards or opt for a Mac Studio with unified memory, citing this youtube video.
    • Another member argued for Mac Studio due to lower cost, less electricity usage, and more RAM, recommending running LM Studio headless for 24/7 operation, noting it supports MLX models.
  • Distributed Inference Projects Emerge: In response to a query about LM Studio supporting multiple machines, two projects were linked for distributed LLM inference: exo and distributed-llama.
    • These projects aim to connect home devices into a powerful cluster to accelerate LLM inference, with more devices implying faster speeds.
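
A minimal sketch of that context-handling approach, assuming LM Studio’s OpenAI-compatible server on its default http://localhost:1234/v1 and leaving the Telegram wiring out.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
histories: dict[int, list[dict]] = {}  # unique-tg-user-id -> message history

def chat(user_id: int, text: str) -> str:
    history = histories.setdefault(user_id, [])
    history.append({"role": "user", "content": text})
    reply = client.chat.completions.create(
        model="loaded-model",  # LM Studio serves whichever model is loaded
        messages=history,      # resend full history; the API keeps no state
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```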

Links mentioned:


Latent Space ā–· #ai-general-chat (117 messagesšŸ”„šŸ”„):

FxEmbed, MCP, Sam Altman WSJ, Replit v2, n8n

  • Altman’s Firing Gets WSJ Treatment: The WSJ published an article detailing the real story behind Sam Altman’s firing from the OpenAI board, alleging that he lied about safety testing for new releases (archive link).
  • Replit v2 is Impressive: A member found Replit v2 agent very impressive for prototyping and building MVPs, probably using Sonnet 3.7 under the hood, and also easy to extract and use in one’s own backend.
    • It was noted that Replit has direct access to logs, configured infrastructure, and sets up logging which makes the process smooth; Cursor is better for existing deployments, but the managed infrastructure gives Replit an edge.
  • OpenAI to Open-Weight Model: OpenAI plans to release an open-weight language model with reasoning capabilities in the coming months, seeking feedback from developers (OpenAI open model feedback).
    • The company aims to host developer events in SF, Europe, and APAC to gather insights and provide early prototypes for experimentation.
  • Cursor Closes Huge Round: Cursor closed a $625M round at a $9.6B post valuation, led by Thrive & A16z, with Accel as a new backer (tweet).
    • This valuation comes after sparking the buzzphrase vibe coding, seeing its valuation increase from $400M to $2.5B to potentially $10B in less than a year.
  • Etched Enters the ASIC Arena: Etched, the first transformer ASIC, closed an unannounced $85M at $1.5B, following two stealth rounds at $500M then $750M (tweet).
    • Etched’s chip Sohu runs Llama 70B at over 500,000 tokens per second, and one 8xSohu server replaces 160 H100s.

Links mentioned:


Latent Space ā–· #ai-in-action-club (189 messagesšŸ”„šŸ”„):

LLM-based code generation, Code documentation strategies, Memory-Ref MCP server, Cursor IDE issues, llms.txt project

  • Harper Reveals LLM Codegen Workflow: A member shared a blog post detailing their LLM codegen workflow, emphasizing a structured approach involving brainstorming, planning, and execution in discrete loops.
    • The post highlights the importance of having a well-defined plan to avoid wasting time when building small products using LLMs.
  • Docs.dev Automates Code Documentation: Docs.dev was shared as a tool to generate docs directly from your codebase and existing content and keep them up to date as code changes.
    • It integrates with GitHub and offers features like automated doc generation from PRs, bulk modification, and analysis for SEO optimization.
  • Memory-Ref Powers Persistent Coding Preferences in Cursor IDE: A member shared a HN post about Cursor IDE integrating with Graphiti, an open-source temporal knowledge graph, to provide persistent memory across sessions using Memory-Ref MCP.
    • The integration aims to help Cursor remember coding preferences and project specs, reducing the need for constant reminders.
  • Navigating Documentation for LLMs and Humans: Members discussed whether documentation for LLMs requires a different level of verbosity compared to documentation for humans, mentioning that markdown is becoming a go-to ā€œprogramming languageā€.
    • One member linked to an example of their ttmp directory to show their Github documentation style that they have found effective with language models.
  • llms.txt Standard Proposed for LLM Website Crawling: The llms.txt project, aimed at helping language models effectively use website data, was shared on GitHub.
    • The file is designed to give LLMs instructions on how to crawl and use website content, similar to robots.txt.

Links mentioned:


MCP (Glama) ā–· #general (263 messagesšŸ”„šŸ”„):

MCP spec updates, HTTP Streamable Transport, OpenAI Agents SDK, UVX MCP Server, Model Context Protocol

  • MCP spec embraces OAuth 2.1: The new 2025-03-26 MCP spec draft includes new auth features like OAuth 2.1, but no client supports it yet for testing; see MCP spec.
  • HTTP Streamable Transport raises Resumability Questions: Doubts arise on how HTTP Streamable Transport resumes sessions correctly, especially the server’s obligation to avoid replaying messages on different streams, which seems hypothetical.
    • The spec says The server MUST NOT send a JSON-RPC response on the stream unless resuming a stream associated with a previous client request, which contradicts the resumability goal.
  • Env Variables prove slippery for Tool Integration: Members discussed using environment variables to pass API tokens to tools, where debugging with @modelcontextprotocol/inspector works but calling the tool in an MCP client throws unauthorized errors.
    • Passing the token directly in the claude_desktop_config.json file seemingly fixed the issue.
  • Progress Notifications prove trickiest for MCP: Users seek examples for sending notifications from the server to the client, exploring notification/progress for long-running resources and discovering the client sends back a request to /message.
    • Notifications might need pre-declaration and aren’t fully supported in all clients: in Claude Desktop only the spinner works and messages don’t appear, and the progressToken is essential.
  • Goose bumps into Endpoint Description issue: Some users reported errors connecting Goose to a local MCP server over SSE, but a quick fix involves adding descriptions to server endpoints, as suggested in this Github issue.
    • After the layoff, Goose seems to be working OK.

Links mentioned:


MCP (Glama) ā–· #showcase (28 messagesšŸ”„):

Speech MCP, Reverse engineering in IDA Pro, OpenAPI MCP server for Cursor, AI-powered RAG application, MCP server development with hot reload

  • Speech MCP Demonstration: A user shared a YouTube short showcasing Speech MCP.
    • Another user inquired if there’s a version compatible with Claude.
  • IDA Pro MCP Server Streamlines Reverse Engineering: An IDA Pro MCP server was created to automate reverse engineering, with a streamlined installation process allowing users to experiment with vibe reversing in under 2 minutes.
    • The server was tested using Claude, and is automatically configured with Cline and Roo Code.
  • OpenAPI MCP Server Integrates with Cursor: An OpenAPI MCP server was developed to enable Cursor to directly understand API specifications, available on GitHub.
    • The developer is seeking feedback from users who try it out.
  • CATIE Intelligently Routes MCP Requests: CATIE (Context Aware Traffic Ingress Engine), a proxy for routing MCP requests based on tool call, was released on GitHub.
    • This free, open-source tool allows routing to different MCP servers based on tool call parameters, real-time monitoring, backend switching, and simple load distribution.
  • Pipedream Launches MCP Server with User Authentication: Pipedream launched an MCP server on GitHub that enables developers to run their own MCP server for 2,500+ apps and manage servers for their users, with managed authentication.
    • According to Pipedream, managed authentication with approved clients is a requirement for MCP to work at scale.

Links mentioned:


HuggingFace ā–· #announcements (1 messages):

HF Reasoning Course, Gradio Dataframe, Reranker Models, Model Onboarding, Open R1 Update

  • HF Reasoning Course Gets DeepSeek Boost: A new unit in the HF reasoning course features DeepSeek R1, according to this LinkedIn post.
  • Gradio’s Dataframe Component Gets Turbocharged: Gradio released a host of new updates to its gr.Dataframe component, closing over 70 issues including bugs, improvements, and enhancements, as detailed in this blog post.
    • The gr.Dataframe component is popular for leaderboards, dashboards, and interactive visualizations (a quick sketch follows this list).
  • Reranker Models Ride Sentence Transformers: A blog post details how to train and finetune reranker models with Sentence Transformers v4, as seen in this article.
  • Model Onboarding Experience Revamped: HF launched a new model onboarding experience, aiming to simplify understanding of the hub’s capabilities, according to this tweet.
  • DeepSeek V3 Gains Impressive Math Skills: Evaluations on DeepSeek V3 0324 show impressive gains in math and GPQA but a slight hit in instruction following, according to this tweet.
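
As a quick illustration of the component above, a minimal gr.Dataframe sketch; the data and options are made up.

```python
import gradio as gr

with gr.Blocks() as demo:
    gr.Dataframe(
        value=[["model-a", 0.91], ["model-b", 0.87]],  # toy leaderboard rows
        headers=["model", "score"],
        interactive=True,  # let users edit and sort the rendered table
    )

demo.launch()
```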

Links mentioned:


HuggingFace ā–· #general (145 messagesšŸ”„šŸ”„):

Hugging Face Pro Debit Card Issue, Video Lip Sync tools, RunPod and HuggingFace models, HF Model Containerization, Agentx Competition Research Track

  • Hugging Face Pro Debit Card Debacle: A user reported being charged for a Hugging Face Pro subscription with a debit card despite an error message, and inquired about a refund.
    • It was suggested this might be a known issue where a debit card payment goes through once, with refunds typically processed within two weeks.
  • Video Retalking Tool Glitches: A user shared a link to a VideoRetalking tool on Hugging Face Spaces, noting it worked pretty well, a little glitchy.
    • They also wondered if SaaS solutions like HeyGen manipulate emoting movements of the body or just do lip syncing.
  • RunPod’s Model Management Conundrum: A user sought advice on using a model from Hugging Face with RunPod, struggling to make things work after cloning and improving a model.
    • The user can’t afford a good GPU and is also looking for something cool like video lip sync type stuff to make faceless videos.
  • HF Space Project Conversion Troubles: A user sought advice on converting a local Python project to a Hugging Face Space project.
    • It was pointed out that Spaces require a GUI, although Docker Spaces might be an exception, and that virtual machines aren’t as free as local environments; a link to Hugging Face documentation was shared.
  • Hugging Face Daily Papers Gets RSSified: A user sought an RSS feed for daily papers from the Hugging Face papers page.

Links mentioned:


HuggingFace ā–· #today-im-learning (2 messages):

AI Agent Observability & Evaluation, Tableau Certified Data Analyst Training, WordPress Developer Course

  • Tackling AI Agent Observability in Bonus Unit: A member is learning Agents Course: Bonus Unit 2 - AI Agent Observability & Evaluation as part of their learning journey.
  • Tableau Training Sees Steady Progress: A member is progressing through the 2024 Tableau Certified Data Analyst Training, having completed 432 of 523 sections.
  • WordPress Wizardry Underway: A member started the Become a WordPress Developer: Unlocking Power With Code course and completed 2 of 234 sections.

HuggingFace ā–· #cool-finds (3 messages):

Docker Model Runner, Local LLMs, SAGE-2 AI, Symbolic Reasoning System

  • Docker Runs Local LLMs in Containers!: Docker, Inc. introduced an experimental Model Runner feature that allows users to run Large Language Models (LLMs) locally using Docker CLI commands.
    • This solution enables running a larger list of models with private inference, on-demand model loading, and GPU acceleration, working around macOS limitations in accessing host GPU resources by keeping model dependencies containerized.
  • SAGE-2 cracks open the ā€œBlack Boxā€!: SAGE-2 is a new AI designed with a continuous symbolic reasoning system, making its decisions traceable, decodable, and interpretable.
    • Unlike modern AIs like GPT, Gemini, and DeepSeek, which are black boxes, SAGE-2 allows users to see the model’s internal states and reasoning, which is essential for ethical auditing and trust in sensitive decisions such as healthcare and justice. Try it yourself in this HF Space.

Links mentioned:


HuggingFace ā–· #i-made-this (33 messagesšŸ”„):

FactoryManager for Linux Containers, AI Menu in Neovim, Tree of Thoughts (ToT) Implementation, Learning UI with Image Gen, RepoDump CLI Tool

  • FactoryManager Gives Linux Containers the Robotgo Treatment: A developer introduced FactoryManager, a python package wrapping linuxserver.io desktop environment containers, enabling programmatic control.
    • The developer seeks feedback on whether to build an extensible base class for OpenAI, Anthropic, or focus on desktop management, demonstrated in two desktop environments via this demo video and its repo, FactoryManager on Github.
  • NeoVim Gets an AI Menu and then some: An AI menu in Neovim, Unreal Engine 5.5.4 with MetaHumans, and post-quantum cryptography were demoed, all running on Arch Linux 6.13.5 Hyprland.
  • Tree of Thoughts makes Chain of Thought reasoning look like a chump: A member shared a blog post explaining how the Tree of Thoughts (ToT) paper couples GPT-4 with tree search algorithms, significantly improving performance on tasks where left-to-right Chain of Thought (CoT) struggles.
    • On the Game of 24 task, GPT-4 with CoT prompting only solved 4% of tasks, while ToT achieved a success rate of 74%, as explained in this HuggingFace blog post.
  • RepoDump Tool Converts Codebase to Markdown for LLMs: A developer released repodump 0.1-alpha, a CLI tool to extract and format Git repos or directories into Markdown for quick sharing with LLMs, available on GitHub.
    • The tool skips binaries, respects .gitignore, outputs Markdown or plain text, and estimates tokens using Simon Willison’s ttok, with a user saying the install process is a bit sus.
  • HF Website Chrome Extension Adds Repo Size and Discussion Search: A developer introduced a Chrome extension for the HF website that adds features like viewing total repo sizes and full-text discussion search, as shown on the Chrome Web Store.
    • It’s an open-source project, with the code available on GitHub.

Links mentioned:


HuggingFace ā–· #NLP (4 messages):

Full Fine-tuning text2text LLM with Transformers, Mercury vs LaMDA Performance, DPO Mistral 7B Training Issues

  • Transformers Library Full Fine-Tuning Examples Sought: A member asked for examples of how to full-finetune a text2text LLM (no PEFT, no (Q)LoRA, no quantization) using the Transformers Python library.
    • Another member suggested checking the Hugging Face tutorials for simple fine-tuning scripts without quantization (a minimal sketch follows this list).
  • Mercury Coder’s Zippy Speed Compared to LaMDA: A member shared the technical report for Mercury Coder, noting its vagueness and questioned why Mercury is so much faster than LaMDA.
    • They found it odd since both supposedly use a transformer backbone.
  • DPO Mistral 7B Rewards Accuracy Suspicions: A member reported suspicious training rewards/accuracies when trying to perform DPO on Mistral 7B instruct using the HumanLLMs/Human-Like-DPO-Dataset, hitting 100% right away.
    • The member also shared an image related to the issue, and was looking for reasons and solutions.
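
In the spirit of the first request above, a minimal full fine-tuning sketch with no PEFT, LoRA, or quantization, assuming a small seq2seq model and the samsum dialogue-summarization dataset; both choices are illustrative.

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

name = "google/flan-t5-small"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)  # full weights, no adapters

ds = load_dataset("knkarthick/samsum")  # columns: dialogue, summary

def prep(batch):
    x = tok(batch["dialogue"], truncation=True)
    x["labels"] = tok(text_target=batch["summary"], truncation=True)["input_ids"]
    return x

train = ds["train"].map(prep, batched=True, remove_columns=ds["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="out",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=5e-5,
    ),
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()
```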

Link mentioned: Inception Labs_Mercury_Tech_Report.pdf: no description found


HuggingFace ā–· #smol-course (12 messagesšŸ”„):

Course Integration, Hugging Face Agent Course, Gradio Client Issue, Unit 3 Release

  • Course Integration still incomplete?: A member inquired if the course is fully integrated into the NLP/LLM course, or if additional content is pending.
    • They’re eager to know what more is coming in the course.
  • HF Agent Course Certificate Missing?: A user reported completing Unit 2 of the Hugging Face Agents course, but their account doesn’t reflect passing Unit 1 or receiving the Fundamentals certificate.
    • Despite downloading the certificate PDF, there’s no confirmation in their account.
  • Gradio Client Issue Solved: Several users encountered a TypeError related to ā€˜bool’ not being iterable when cloning a space for the first agent template part, traced to an issue in Gradio client.
    • A user provided a quick fix by adding pydantic==2.10.6 to the requirements.txt file, referencing this GitHub issue, which resolved the problem.
  • Unit 3 Release When?: Multiple members are inquiring about the release date of Unit 3.
    • No concrete information has been given about its launch.
  • Gemini Service Viable Alternative?: One member suggested that Google’s Gemini service is a viable alternative, essentially for free, assuming one can obtain an API key from AI Studio.
    • This comment was made in response to another user complaining about having to pay for a month in order to complete the course.


HuggingFace ā–· #agents-course (54 messagesšŸ”„):

Base vs Instruct Models, smolagents System Prompt, Hugging Face Agent Course Schedule, Hugging Face Certifications, API Rate Limits

  • Base Models vs Instruct Models Clarified: A member shared a Reddit post that explains the difference between base models (aka ā€œautocomplete modelā€) and instruct/chat models.
    • The member noted that while a base model can do just about anything, instruct-tuning teaches it to follow instructions, while chat-tuning teaches it to respond in a multi-turn fashion.
  • Prompt Engineering Struggles in smolagents: A member designing their own model in unit 2.1 is struggling to nudge the model by adjusting the agent.system_prompt after agent initialization.
    • They asked if the dataflow and control logic for the model reside in the prompt, like the prompt examples specifically determine how the tools are used and the data is passed between them.
  • Course Schedule Update Still Delayed: A member inquired about Unit 3, but another member clarified that it is not yet available, with the latest being the bonus unit on observability and evaluation.
    • The member suggested keeping track of updates via the announcements channel, as the schedule is not up-to-date.
  • HF Certification Reflections Lacking: A member noticed that their HF account does not show any record of passing Unit 1 or receiving the Fundamentals certificate.
    • Another member confirmed that it is expected behavior, as the PDF cert is generated from Hugging Face Space and not saved in the profile, and that tools don’t have rate limits.
  • Gemini API Relieves Hugging Face Rate Limit Woes: A member exhausted their Hugging Face API request limits and switched to Google Gemini, sharing a GitHub repo with exercises up to Unit 2.2 using Gemini.


HuggingFace ā–· #open-r1 (1 messages):

Mini-R1, Countdown task, GRPOTrainer, vLLM, quantization

  • Mini-R1 User Flummoxed by Quantization: A user is trying to tackle the Countdown task with GRPOTrainer and vLLM following Mini-R1.
  • Quantization Quandaries Plague Project: The user reports failures when applying quantization.

Yannick Kilcher ā–· #general (154 messagesšŸ”„šŸ”„):

OpenAI Image Generator Nerfed, Meta's Transfusion paper and GPT-4o, Belief State Transformer, Dynamic RL, Rejuvenation medicine

  • OpenAI Image Generator Experiences ā€œNerfā€: Members suggested OpenAI’s image generator’s output quality has been reduced, and they may have put a stop to Ghibli style prompts.
    • Members also said models have reached their limitation point, where models keep getting bigger and bigger without getting better, and in some cases even get worse.
  • Meta’s Transfusion Paper Potentially Powers GPT-4o: A member linked to Meta’s Transfusion paper and suggested it could explain the multimodal capabilities of GPT-4o (hybrid of autoregressive and diffusion modeling).
    • The Transfusion paper introduces a method for training a model that can seamlessly generate discrete and continuous modalities, achieving better FID and CLIP scores for text-to-image than Chameleon.
  • Belief State Transformer Builds Richer Latent Representations: A member shared the link to Belief State Transformer and said it makes transformers better at modelling state and can additionally condition on the end!
    • Another member argued they prove that the architecture can build this representation of an ideal Belief Transformer, but it requires an ideal Belief Transformer that has converged to perfectly learning the underlying probability distribution of the data.
  • Dynamic RL Removes The Need For Explicit Variational Bound: One member said he is working on an approach that removes the need for an explicit variational bound in diffusion models by introducing an RL agent.
    • Another member said that most RL methods are also variational methods, and that control theory could also be used.
  • Rejuvenation Medicine Seen as a Possibility: Members expressed hope that rejuvenation medicine might be widely available in the next 3 years.
    • One member cited issues around controlling cells and preventing cancerous cells as the main hurdle to achieving this.


Yannick Kilcher ā–· #paper-discussion (28 messagesšŸ”„):

LLMs planning vs recognizing, Robert Sapolsky determinism, Mechanistic Interpretability team audience

  • LLMs Planning or Just Pretending?: Members discussed the notion of LLMs planning ahead, questioning whether it’s more akin to recognition or anticipation of likely token sequences rather than actual planning or choice.
    • One member noted that using human terms is easier for non-technical audiences, but the initial poster questioned whether LLMs have free will or are just predicting the most likely output.
  • Sapolsky’s No Free Will Sermon: A member mentioned Stanford professor Robert Sapolsky, who strongly believes in determinism from a bio-neurological perspective, and recommended his YouTube videos on the topic.
    • Another member shared a quote where Sapolsky stated he realized there is no god and no free will after learning about God hardening Pharaoh’s heart, and therefore the universe is big, empty, and indifferent.
  • Mechanistic Interpretability Minds Don’t Mince Words: A member noted that the mechanistic interpretability team doesn’t seem to tailor their language for different audiences, remaining technical regardless.
    • The member added that they might be wrong and that they are getting old, following it up with a GIF of an old man yelling at AI.

Link mentioned: Yelling Ai GIF - Yelling Yell Ai - Discover & Share GIFs: Click to view the GIF


Yannick Kilcher ā–· #agents (1 messages):

endomorphosis: https://x.com/TheTuringPost/status/1906304408415359067


Yannick Kilcher ā–· #ml-news (47 messagesšŸ”„):

xAI buys X, NVIDIA RTX PRO 6000 Blackwell Workstation Edition, Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction, Runway Gen-4 release, OpenAI model release

  • Musk Buys Twitter with xAI: A Reuters article reports that xAI bought X for $45 billion, leading to discussions about the implications for loan collateral and potential financial strategies.
    • Some members joked that it could be money laundering or a way to inject funds into X from xAI.
  • NVIDIA Launches RTX PRO 6000 Blackwell GPU: NVIDIA launched the RTX PRO 6000 Blackwell Workstation Edition, a 96GB card that promises the ultimate AI and graphics performance.
    • Members compared it to using four 5090s, noting it offers lower power consumption but less VRAM and compute.
  • GPT Outperforms Diffusion Models in New Paper: A member shared a link to Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction, which presents a NeurIPS 2024 Best Paper where GPT beats diffusion models for image generation.
    • One member dismissively said Just buy one of Scam Altman’s fictional Fusion Generators. Trillion dollar industry if you want to invest.
  • Runway Gen-4 Generates Consistent Media: RunwayML released Gen-4, enabling precise generation of consistent characters, locations, and objects across scenes.
    • One member expressed skepticism, stating, I’ll believe it when I see it and criticizing current AI as worse than a dog chasing its tail.
  • OpenAI Rumored to Release Small Model: There was speculation about OpenAI releasing a new model, potentially a small model for mobile, especially after their Apple deal fell through.
    • One member jokingly suggested it might be GPT 2.5 with 100M parameters, referencing a previous release.


Eleuther ā–· #general (103 messagesšŸ”„šŸ”„):

Distributed Model Deployment, Meta-Learning, RWKV Discord Bot Deception, AI Generated Content, Email LLM Mishaps

  • Optimize Bandwidth Use for Distributed Model Deployment: A member is developing an infrastructure layer to optimize model transmission and deployment across distributed systems using adaptive compression and intelligent routing to tackle bandwidth waste and inference latency.
    • The member offered to share a demo, seeking thoughts from others experienced in distributed inference.
  • Discuss Flaws in Model Training Paradigms: A member questioned the conventional training approach of models imitating human reasoning, suggesting it might be a limitation even for world foundation models.
    • They mentioned meta-learning as an alternative and sought perspectives on potential flaws in this idea.
  • RWKV Discord Targeted by Deceptive AI Agent: Members in the RWKV Discord reported an incident where an AI agent posed as a human researcher, sharing a blog post with incorrect math and code from a GitHub repo to waste time. The incident started with a DM with an attached image.
  • Community Grapples with AI-Generated Content Dilemma: The incident in the RWKV Discord sparked a discussion about the challenges of dealing with AI-generated content, particularly when sources aren’t disclosed, potentially demolishing trust in outside contributions.
  • Landlord LLM Schedules Phantom Appointments: A member shared a personal experience with a rental company using an LLM for email communication, which resulted in a phantom appointment that staff was unaware of, suggesting potential inefficiencies.
    • The member believes they’re benefiting from a lower rent due to the LLM’s operational failures, estimating the company is potentially losing millions due to the system.

Eleuther ā–· #research (12 messagesšŸ”„):

MAML, Neural-guided CoT, RLHF, Low precision data types in RL, Muon or pSGD

  • Model Agnostic Meta Learning is the way to go: A member suggested that the training structure itself is a limitation, even for world foundation models, and suggested focusing on MAML (Model Agnostic Meta Learning) approaches.
    • They added that setting the end goal as a function leads to alignment problems and utility maximization issues.
  • RLHF gets Neural Guidance: A member brought up neural-guided CoT, followed by a question on whether it’s effectively CoT-RLHF-ed models or having a discrete mechanism guiding the CoT somehow.
    • Another member suggested survey papers on semanticscholar for more information on this generic topic.
  • Precision Problems plague RL Post Training: A member asked about research on the effects of low precision data types specifically on RL based post training techniques.
    • Another member responded that RL is the wrong time to experiment with low precision due to potential skill issues throughout the stack, which would require constant re-investigation.
  • Deep Frying Revisited: A member asked about research on the effects of low precision data types specifically on RL based post training techniques, and another member related the problems to deep frying.
    • Another added that something like Muon or pSGD wouldn’t have the same issue (or, at least, not nearly as bad) on the same task.
  • Evaluating RAG Pipelines with LLM Harness: A member asked if it is possible to apply llm harness evaluation on a RAG pipeline on their local computer.

Eleuther ā–· #scaling-laws (3 messages):

Causal Study Ideas, Functional Form Correctness

  • Data-Dependent Experiment Ideas Emerge: A member considered experimenting with functional form to determine if it is correct to the underlying causal mechanisms.
    • They posited that if the functional form is correct to the underlying causal mechanisms then E should be solely data-dependent, and thought about experimenting with this as a causal study.
  • Functional Form Hypothesis: A hypothesis was made that the functional form must be correct to the underlying causal mechanisms for E to be solely data-dependent.
    • This suggests a potential avenue for empirical testing and validation of models against real-world causal processes.

Eleuther ā–· #interpretability-general (7 messages):

Anthropic biology post, Neuronpedia goes open source, Mechanism of factual recall, Attribution graphs

  • Unlock Anthropic’s ā€œKnown Factā€ Circuits!: A member cited Anthropic’s biology post on known fact circuits, lamenting the lack of released transcoders and Haiku’s weights to facilitate answering pertinent questions.
    • The discussion linked to a recent paper with a corresponding GitHub repository for further exploration.
  • Neel’s Nuggets on Neural Net Numbers: Neel Nanda shared his old work on analyzing factual recall in language models.
  • Neuronpedia Navigates to Open Source!: Neuronpedia, an interpretability platform, is now MIT open source and uses Eleuther’s Delphi (prev sae-auto-interp) for its auto-interp server.


Eleuther ā–· #lm-thunderdome (97 messagesšŸ”„šŸ”„):

MMLU-pro Evaluation Setup, Few-Shot Example Handling, GPU Overload Issues with Modified Utils, IndexError debugging and resolution, Passing Additional Parameters to generate function

  • Validating MMLU-pro split configs: A member confirmed that the MMLU-pro eval is run using the test split, with few-shot examples derived from the validation split, as seen in the config file.
    • The system uses the process_docs function on the few-shot split before sampling to get the few-shot examples from the correct subset.
  • Deep Dive into Harness Regex Matching: Members discussed that lm-harness relies on the regex regex_pattern: 'answer is \(?([ABCDEFGHIJ])\)?' for the exact match metric, instead of more advanced methods; a short extraction demo follows this list.
    • There are plans to add an LLM-as-judge option for benchmarking, inspired by OpenAI’s evals suite, for better customization.
  • Troubleshooting GPU Overload: A member reported GPU overload issues after modifying the utils.py code for mmlu_pro, experiencing memory errors even with smaller batch sizes.
    • The modified code utilizes a dynamic choice estimation, which appears to increase memory load compared to the default pre-defined choice mapping.
  • Investigating and fixing IndexErrors in task: The user encountered an IndexError when removing a choice from the options despite the code appearing to handle all occasions.
    • The error occurs because utils.py defines choices A-P while MMLU-Pro has at most 10 choices; the resulting index mismatch triggers the error, and stepping through a debugger is required to pin it down.
  • Passing additional parameters into model generation: Users discussed the need to compress Key/Value (KV) caches and implement contrastive beam search.
    • The system supports passing additional parameters to the generate function via generation_kwargs in the task YAML.
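
To make the extraction behavior concrete, here is the quoted regex in action; the sample completion string is made up:

```python
import re

# The exact-match extraction regex quoted above from lm-evaluation-harness.
pattern = re.compile(r"answer is \(?([ABCDEFGHIJ])\)?")

completion = "Comparing the options step by step, the answer is (C)."  # invented output
match = pattern.search(completion)
print(match.group(1) if match else "no match")  # -> C
```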


Nous Research AI ā–· #general (163 messagesšŸ”„šŸ”„):

xAI acquisition of X (Twitter), Midjourney expands into LLMs, GPT-4o reasoning capabilities, Llama 4 release speculation, Hermes model system prompt usage

  • X Marks the Spot: xAI Buys Twitter: Elon Musk announced that xAI has merged with X (Twitter) in an all-stock transaction, valuing xAI at $80 billion and X at $33 billion, aiming to combine data, models, compute, distribution, and talent.
    • The move is speculated to potentially help X avoid paying interest on debt from the original Twitter acquisition and enable better data scraping and training for Grok, as mentioned in the discussion.
  • Midjourney’s Textual Turn: Enters the LLM Arena: Midjourney, known for its AI image generation, is expanding into the LLM field, releasing a research paper with NYU on training LLMs like Llama and Mistral to write more creatively.
    • This signals Midjourney’s ambition to diversify beyond image generation and develop its own computing and AI hardware.
  • GPT-4o Gets Brainy: Reasoning Emerges!: GPT-4o has been observed demonstrating reasoning capabilities, sparking speculation that it’s part of the GPT-5 system being developed, with ongoing tool and update additions.
    • One member noted it can even decide in the middle of a response to start doing reasoning.
  • Llama 4 Spotted! Launch Imminent?: Three new models, codenamed cybele, themis, and spider, are reported to behave like they are made for elomaxxing on the arena, possibly indicating imminent Llama 4 release candidates.
    • Speculation is that Meta will release before their official event, mirroring Llama 3’s drop on April 18th, to avoid being overtaken in model performance.
  • Hermes’ Prompting Prescription: System First, User Later: For Hermes models, it’s recommended to use a specific prompt format with the system role only once at the start, followed by user roles for subsequent messages, according to its model card.
    • One member noted that any tutorial that works with the OpenAI API should work with the Nous API as well.


Nous Research AI ā–· #ask-about-llms (11 messagesšŸ”„):

OLMoE Fine-tuning, Unsloth, Axolotl, Docker, Weaviate

  • OLMoE Instruct Model Released: AllenAI released the OLMoE-1B-7B-0125-Instruct model, a supervised finetuned variant of the OLMoE-1B-7B January 2025 model using a variant of the Tülu 3 dataset and further DPO training on this dataset, and finally RLVR training using this data.
  • Unsloth the best way to finetune?: Members discussed which tools are best for finetuning models, with options like axolotl, llama factory, unsloth’s notebooks being mentioned as top contenders.
    • One member confirmed that Axolotl specifically is how they got started.
  • Docker Disk Images Migration Dilemma: A member sought help moving Docker disk images to another drive, encountering issues with updating the Docker root directory despite changing the path in Docker Desktop.
    • The member was trying to connect Weaviate to the disks on the other drive; another member suggested that they figure out some APIs to get the LLM and Weaviate (inside docker) to communicate.

Link mentioned: allenai/OLMoE-1B-7B-0125-Instruct Ā· Hugging Face: no description found


Nous Research AI ā–· #research-papers (1 messages):

burnytech: https://fxtwitter.com/iScienceLuvr/status/1905730169631080564


OpenAI image generation, Multiscale Structure in Image Generation, Grok vs. OpenAI Image Generation

  • OpenAI’s Image Gen: Multiscale Secrets Exposed!: Analyzing OpenAI image generation frames reveals a multiscale structure, with evidence favoring interleaved latent autoregression over a Laplacian pyramid, decoded via non-causal diffusion across scales, according to this tweet.
  • Raster Scan UI: A Deceptive Facade?: According to Nayan Saxena, the raster scan in OpenAI’s image generation is just UI, with each frame reflecting global updates via coarse-to-fine multi-scale diffusion, rather than patch-wise AR.
    • The analysis suggests the raster scan is pure UI.
  • Grok’s Image Artifacts: A Sign of Patch-wise AR?: It’s speculated that Grok uses a purely autoregressive model that outputs patches (aka VQ-GAN / Parti), which may explain the noticeable artifacts due to repetitive structures.
    • One member noted that Grok also seems to be much worse at generating images for whatever reason.

Link mentioned: Tweet from Nayan Saxena (@SaxenaNayan): Analyzing OpenAI image gen frames shows multiscale structure: Laplacian deltas highlight iterative band-wise edits, entropy localizes, and flow shifts. Evidence favors interleaved latent autoregressio…




Nous Research AI ā–· #reasoning-tasks (3 messages):

Open Reasoning Tasks, Proprietary Models

  • Reasoning Task Invitation: A member suggested checking out the open reasoning tasks.
    • Another member confirmed that they would check it out.
  • Checking tasks: Walkerdev stated he was checking the reasoning tasks.

GPU MODE ā–· #general (3 messages):

TRL and Accelerate, Nvidia Ampere GPU Thread Performance, GPU Kernel Latency Hiding

  • TRL leverages Accelerate Behind the Scenes: A member noted that TRL uses Accelerate in the background to handle complex operations, simplifying the user experience.
    • The intention is to abstract away low-level details, letting users focus on training.
  • Ampere GPU Thread Count Exceeds Expectations: A member calculated an Nvidia Ampere GPU with 96 SMs (each with 4 warp schedulers) should theoretically support 12288 threads, but observed performance improvements up to 24576 threads.
    • The member questioned if kernel latency hiding could allow twice as many threads to be scheduled concurrently on each SM; see the arithmetic after this list.
  • Geohot’s GPU Noob Kernel Analysis: A member is analyzing Geohot’s GPU Noob kernel to understand thread performance.
    • They questioned the kernel’s potential for latency hiding, wondering if it explains the observed thread count improvements.
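
The arithmetic behind those thread counts, as a quick sanity check; the 2x factor is the member’s latency-hiding hypothesis, not a confirmed explanation:

```python
sms = 96               # streaming multiprocessors on the GPU in question
schedulers_per_sm = 4  # warp schedulers per SM
warp_size = 32         # threads per warp

issuing = sms * schedulers_per_sm * warp_size
print(issuing)      # 12288 threads issuing per cycle
print(2 * issuing)  # 24576: two resident warps per scheduler would hide latency
```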

Link mentioned: gpunoob/src/main.rs at master Ā· geohot/gpunoob: Noob Lessons from Stream about how GPUs work. Contribute to geohot/gpunoob development by creating an account on GitHub.


GPU MODE ā–· #triton (7 messages):

emulated dot scaled triton performance, L1 cache use, persistent kernels

  • Triton’s Emulated Dot Scaled Hurts Performance: A user reported that using Triton’s emulated dot_scaled function on H100 with default behavior of upcasting to bf16 hurts performance.
    • They asked for a way to upcast the type to fp8 instead, and linked to the Triton documentation for reference.
  • Mastering Matrix Multiplication with L1 Cache in Triton: A user inquired about loading an entire matrix into L1 cache and processing it on a single SM in Triton, questioning whether streaming blocks are mandatory.
  • L1 Caching Behavior Decoded: A user asked if subsequent tl.load calls on the same matrix would retrieve from L1 cache rather than HBM.
    • An expert explained that tl.load operations bring data into registers and may cache it in L1, and subsequent loads might hit in L1 if the data hasn’t been evicted, emphasizing that L1 cache reuse isn’t guaranteed across different kernel launches; a minimal sketch follows this list.
  • Persistent Kernel Performance on H100: A user shared their experience of achieving only a slight improvement (between M to 2xM tokens/sec) on the H100 after spending a day writing a quant persistent split-K kernel.
    • They are seeking insights into settings where persistent kernels provide better improvements, specifically mentioning M value and device considerations.
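
A minimal Triton sketch of the reuse pattern under discussion; whether the second load actually hits in L1 depends on eviction, as noted above, and the kernel shape here is purely illustrative:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def double_read_kernel(x_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    a = tl.load(x_ptr + offs)  # HBM -> registers; the lines may be cached in L1
    b = tl.load(x_ptr + offs)  # same addresses: may be served from L1 if not evicted
    tl.store(out_ptr + offs, a + b)

x = torch.randn(1024, device="cuda")
out = torch.empty_like(x)
double_read_kernel[(1,)](x, out, BLOCK=1024)
```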

Link mentioned: triton.language.dot_scaled — Triton documentation: no description found


GPU MODE ā–· #cuda (4 messages):

FlashAttention, CUDA C Programming Guide, Nsight Systems (nsys), Nsight Compute (ncu), Memory Coalescing

  • FlashAttention Memory Access Confusion persists: A member expressed confusion regarding memory access patterns in FlashAttention, specifically about the necessity of reshaping data for 128-bit memory transfers.
    • The member referenced section 5.3 of the CUDA C Programming Guide and questioned whether the compiler correctly recognizes memory coalescing opportunities.
  • Nsight tools are useful for profiling: One member suggested using Nsight Systems (nsys) and Nsight Compute (ncu) to profile and analyze performance bottlenecks, recommending generating reports via the command line for visualization.
    • They said the former allows you to view the kernel timeline and some performance metrics, while the latter analyzes performance bottlenecks and provides some optimization suggestions.
  • PTX Compiler handles memory layout: A member clarified that the PTX compiler manages the data layout in registers to ensure that a thread can write 128 bits of contiguous data to a single aligned gmem address with one instruction.
    • They added that there is no need to worry about it even with (inline) PTX.

GPU MODE ā–· #torch (3 messages):

torch.compile error, FlexAttention, Arbitrary Sequence Length

  • Torch Compile Throws Unsupported Error: A user reported a torch.compile error related to __rmul__ when using a subclassed nn.Parameter within a compiled function, using torch 2.6 and cuda 12.4 in colab.
    • The error was: Unsupported: call_method UserDefinedObjectVariable(b) __rmul__ [ConstantVariable(int: 3)] {}, and the user wanted to know if this is a known problem or whether they should file an issue; a minimal repro sketch follows this list.
  • FlexAttention Now Supports Arbitrary Sequence Length: A user inquired if FlexAttention now supports arbitrary sequence lengths, recalling that previous versions required sequence lengths to be multiples of 128.
    • Another user confirmed that as of PyTorch 2.6, arbitrary sequence lengths are supported.
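
A minimal sketch of the kind of code that reportedly triggers the error, with a hypothetical nn.Parameter subclass standing in for the user’s:

```python
import torch
import torch.nn as nn

class MyParam(nn.Parameter):  # hypothetical stand-in for the user's subclass
    pass

p = MyParam(torch.randn(4))

@torch.compile
def scale(x):
    return 3 * x  # dispatches to x.__rmul__(3), which Dynamo rejects on the subclass

scale(p)  # reportedly raises: Unsupported: call_method UserDefinedObjectVariable ... __rmul__
```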

GPU MODE ā–· #algorithms (1 messages):

RoPE, BFloat16 Precision, FlashAttention2, AnchorAttention

  • RoPE’s BFloat16 Breakdown: A new paper (When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training) identifies that BFloat16 introduces numerical errors in RoPE, compromising its relative encoding, even when computed in Float32.
    • The first token significantly contributes to deviations as context length increases. The paper introduces AnchorAttention, a plug-and-play method that improves long-context performance, reduces training time by over 50%, and preserves the model’s general capabilities, with code supporting FlashAttention and FlexAttention available on GitHub; a toy numeric check follows this list.
  • FlashAttention Impacted by RoPE’s BFloat16 Issue: The paper suggests that casting tensors to BFloat16 in FlashAttention2 causes RoPE to deviate from its intended relative positional encoding properties.
    • This implies that while RoPE might be computed in Float32, the use of BFloat16 in subsequent layers like FlashAttention2 can still introduce errors.
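
A toy numeric check of the claim (our own illustration, not from the paper): casting RoPE rotation angles to BFloat16 produces an absolute error that grows with position, which is exactly the kind of drift that breaks relative encoding at long context:

```python
import torch

dim, max_pos = 128, 32768
inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
pos = torch.arange(max_pos).float()
angles = pos[:, None] * inv_freq[None, :]         # RoPE rotation angles in Float32

err = (angles.bfloat16().float() - angles).abs()  # error from a single BF16 cast
print(err[1].max().item())   # tiny error at an early position
print(err[-1].max().item())  # orders of magnitude larger at the last position
```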

Link mentioned: Tweet from Haonan Wang (@Haonan_Wang_): šŸš€ New PaperšŸ“œ When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training🤯 RoPE is Broken because of… BFloat16!> Even if RoPE is computed in Float32 (like in Llama 3 and t…


NVIDIA RTX PRO 6000 Blackwell Workstation Edition, GDDR7, Size Zheng, Next Era of AI


GPU MODE ā–· #beginner (7 messages):

Jax Scaling Book on Transformer FLOPs, Diffusion Game Models, Pulid Faceloader Error, Apple Silicon Memory Hierarchy, Models with Large Per-Layer Dataflow

  • Jax Scaling Book Teaches Transformer FLOP Counting: A member shared jax-ml/scaling-book which provides calculation examples for autoregressive models, applicable to video models, estimating model constraints with FLOPs, memory bandwidth, and roofline analysis.
    • The recommendation is to benchmark against real data and profile with nsys to validate calculations, focusing on linear layers and attention mechanisms.
  • Pulid Faceloader Faces CUDA Problems: A user reported that Pulid Faceloader in ComfyUI failed with a CUDA error after a reboot, despite paths being correctly set, citing an onnxruntime issue.
    • It was recommended to check that CUDA and cuDNN versions are compatible, and that the GPU is supported by the CUDA version (currently failing on PyTorch 2.7.0 with CUDA 12.8).
  • Silicon Secrets: M-Series Memory Demystified?: A member inquired about the on-chip caches and memory hierarchy in Apple Silicon M-Series GPUs, seeking the Apple equivalent to an NVIDIA A100 memory map and linked a paper on Apple M-Series SoCs.
    • The discussion highlighted that Apple does not publicly reveal certain GPU details like NVIDIA, making it difficult to ascertain specific cache numbers, but the paper mentioned L1 caches (192 KB per core) and shared L2 caches up to 24 MB in the M4 chip.
  • Hunting Models with High-Throughput Layer Dataflow: A member sought models with per-layer dataflow of at least ~10GB (not the total memory usage) and described how to measure the intermediate activations passed between consecutive layers.
    • One suggestion was to explore models processing volumetric data, such as in the medical domain, where a volume of 512³ voxels, 32 channels, and fp16 activations could yield 8GiB of data per layer, as the arithmetic below confirms.
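
Checking that volumetric estimate:

```python
voxels = 512 ** 3    # 512^3 spatial volume
channels = 32
bytes_per_elem = 2   # fp16 activations

layer_bytes = voxels * channels * bytes_per_elem
print(layer_bytes / 2**30, "GiB per activation tensor")  # -> 8.0 GiB
```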


GPU MODE ā–· #off-topic (3 messages):

Twitter Embeds, Deleting Twitter

  • Twitter Embeds Getting Deleted: A member shared a link about Twitter embeds not working.
    • Another member joked that the solution to Twitter embed problems is to delete yo twitter.
  • Twitter No More: Another member suggested someone delete their twitter account.
    • The original poster was trying to post a link to vxtwitter.com.

Link mentioned: Tweet from undefined: no description found


GPU MODE ā–· #irl-meetup (1 messages):

random.oof: are there any meetups in NYC?


GPU MODE ā–· #rocm (1 messages):

Triton-lang, shared memory encoding, transpose bank conflict

  • Shared Memory Encoding added to Avoid Transpose Bank Conflict: A pull request introduces swizzling pattern for B operands in TN GEMM.
    • This implementation was originally done by @jtang10 in PR#4984.
  • Transposed GEMM Operand Optimized in Shared Memory: A pull request introduces shared memory optimization, which reduces bank conflicts and enables wide LDS stores in NT, TT and TN GEMM and similar cases.
    • This optimization applies when the dot operand K dimension is not innermost.


GPU MODE ā–· #liger-kernel (3 messages):

Segmentation fault with liger-kernel, LigerFusedLinearCrossEntropyLoss issues, Reproducing errors with liger-kernel, Environment details for debugging liger-kernel

  • Segmentation Fault Hits Liger-Kernel: A member reported encountering a Segmentation fault (core dumped) when using liger-kernel with LigerFusedLinearCrossEntropyLoss in a simple PyTorch script.
    • The script involved a linear layer model, input tensor, and target tensor, with the error occurring during the loss.backward() call.
  • Debugging Liger-Kernel Woes: A maintainer could not reproduce the segmentation fault and requested the full error code and environment details to assist with debugging.
    • Another member inquired about the version of liger-kernel being used to help pinpoint the issue.
  • Investigating LigerFusedLinearCrossEntropyLoss Issues: The reported issue centers around the LigerFusedLinearCrossEntropyLoss function within the liger-kernel library.
    • The function fuses linear and cross-entropy layers, performing chunk-by-chunk computation to reduce memory usage, but seems to be triggering a segmentation fault in certain configurations.

GPU MODE ā–· #metal (6 messages):

CUDA compiler for Apple GPU, Metal C++, Zig support for Apple GPU, spirv-cross, IREE's Metal HAL driver

  • CUDA Compiler coming to Apple GPU: With repos like this one, it seems like you could in theory make a CUDA compiler for the Apple GPU by compiling to Metal C++.
    • One member said they’ve thought about adding similar support for Zig, since Apple uses LLVM IR.
  • Metal Compute Shaders via IREE: A member suggests using the compute shader in IREE’s Metal HAL driver to target Metal.
    • He says it works, though obviously it’s subject to limitations of what SPIRV can represent and what SPIRV-cross supports.

Link mentioned: Metal HAL driver - IREE: no description found


GPU MODE ā–· #self-promotion (4 messages):

Bend parallel language, Phazr AI video avatar tool, CuTe predication, File I/O in Bend

  • Bend your code into shape with Parallelism: HigherOrderCo introduced Bend, a massively parallel, high-level programming language for multi-core CPUs/GPUs, designed to feel like Python without the complexities of concurrent programming.
  • Phazr Alchemize Your Video Persona: A member released Phazr AI, a free tool that allows users to appear as anyone in video calls, utilizing audio-driven portrait animation and running locally for privacy.
  • Tiling Triumph: CuTe predication tutorial drops: Simon Veitner posted a blog about performing predication in CuTe to help generalize tiling kernels, including a link to the Cutlass documentation.
  • Bend adds file I/O: File I/O capabilities were introduced to Bend, enabling users to perform file operations, as documented here.


GPU MODE ā–· #šŸæ (1 messages):

AlphaGeometry LLM + verifier for kernel optimization

  • Inquire about AlphaGeometry LLM for Kernel Optimization: A member inquired about using an AlphaGeometry-style LLM + verifier for the kernel optimization process.
    • They are seeking information on the history of this idea, whether it has been tried, and any related discussions, as they are new to the field and suspect they might be rediscovering existing concepts.
  • Unexplored Territory: Kernel Optimization with AlphaGeometry LLM: Discussion revolves around leveraging AlphaGeometry-style LLMs with verifiers to potentially revolutionize kernel optimization processes.
    • The inquiry focuses on whether this approach has been explored before, seeking pointers to prior attempts or relevant discussions within the community.

GPU MODE ā–· #thunderkittens (2 messages):

FusedMLP, tiny-cuda-nn, ThunderKittens

  • User asks FusedMLP exist in ThunderKittens: A user asked if FusedMLP of the form from NVlabs/tiny-cuda-nn exists within HazyResearch/ThunderKittens.
  • Newbie Seeks Guidance in TK Land: A user, identifying as a newbie, inquired about finding a specific implementation (FusedMLP) within the ThunderKittens repository.


GPU MODE ā–· #reasoning-gym (26 messagesšŸ”„):

Datasets without curricula, Futoshiki dataset generation speed, Private benchmarking service, OpenAI open-weight reasoning models

  • Datasets Lack Curriculum, Difficulty Tweaks Debated: There are datasets without curricula (acre, arc_1d, arc_agi, codeio, composite, countdown, futoshiki, gcd, gsm_symbolic, knight_swap, knights_knaves, list_functions, puzzle24, syllogism, word_sorting), and some like gsm_symbolic and list_functions present challenges in adjusting their difficulties for curriculum design, with ongoing bug investigations.
    • Some members are focusing on collision tasks and reporting issues with specific datasets like gsm_symbolic, prompting discussions on misconfigurations and fixes.
  • Futoshiki’s Speed Spurs Sample Size Scrutiny: The futoshiki dataset faces challenges in generating 10,000 samples within a reasonable time (10 minutes was not enough), leading to questions about acceptable generation speed and adjustments to grid size configurations.
    • It was suggested that a max grid size of 6 or 7 should be set if you want to quickly generate a lot of samples; in theory, collisions should be much less common for higher grid sizes anyway.
  • Private Benchmarking Service Blueprinted: A work-in-progress pull request aims to create a private benchmarking service where users can fill in blanks and upload results to Gradio for grading, ensuring no sensitive information is revealed.
    • The initiative involves generating a complete set of questions with blank answers to enable a hidden-answer benchmarking service, allowing for a more controlled evaluation process.
  • OpenAI Open-Weights Announcement Astounds Observers: OpenAI’s announcement of publishing strong open-weight reasoning models surprised the community, sparking speculation about the company’s motives, particularly in the context of potential fundraising efforts.
    • Releasing an open-weight model could significantly raise their valuation if it generates widespread interest and adoption, as everyone will go crazy about it.

Link mentioned: [WIP] Generate Seeded Benchmark Test for Distribution by Miserlou Ā· Pull Request #398 Ā· open-thought/reasoning-gym: Adds a script to create a complete set of questions with blank answers which can be used for a hidden-answer benchmarking service.RNG_SEED=321 python scripts/generate_benchmark.py —num-per-datase…


GPU MODE ā–· #general (2 messages):

Discord ID display issues, Discord permissions, Leaderboard ID formatting

  • Discord ID Display Mystery: A user inquired why their ID on the leaderboard appears as User_1184712546704429106 instead of their actual Discord ID.
    • The community suspects it relates to Discord permissions, but a solution remains elusive.
  • Discord Perms Cause ID Display Issues: Members believe that Discord permissions cause the ID display issue.
    • No one has identified a fix yet.

GPU MODE ā–· #submissions (94 messagesšŸ”„šŸ”„):

vectoradd benchmarks, vectorsum benchmarks, conv2d benchmarks

  • Vectoradd Benchmarks Boom on H100: Multiple successful leaderboard submissions for vectoradd were recorded on H100 GPUs using Modal runners, including IDs 3247, 3248, 3255, 3256, 3257, 3258, 3259, 3351, 3353, 3367, 3368, and 3369.
  • Vectorsum Scores Soar on L4: Numerous successful benchmark and leaderboard submissions for vectorsum were reported on L4 GPUs using Modal runners, with IDs ranging from 3272 to 3322 and again from 3352 to 3372.
  • Conv2d contentions conquered on L4,T4,A100,H100: A leaderboard submission with id 3373 to leaderboard conv2d on GPUS: L4, T4, A100, H100 using Modal runners succeeded!
  • Vectorsum tests tantalize on H100: A successful leaderboard submission with id 3374 to leaderboard vectorsum on GPUS: H100 using Modal runners succeeded.
  • A100 Aces Vectoradd: A test submission with id 3338 and a leaderboard submission with id 3288 to leaderboard vectoradd on GPUS: A100 using Modal runners succeeded.

GPU MODE ā–· #hardware (13 messagesšŸ”„):

GPU temperature issues with PyTorch distributed training, Detecting GPU health before training, H100 GPU temperature anomaly on AWS

  • AWS H100 Hot Spot Troubleshoot: A user reported experiencing high temperatures on a specific GPU during PyTorch distributed training on AWS H100 nodes, with one GPU consistently reaching 90C while others averaged 40C.
    • The user noted that this temperature anomaly slows down training, and they sought advice on pre-training hardware/software sanity checks, like NCCL or connection checks.
  • Power Limiting Temperature Mitigation: A user experiencing high GPU temperatures during PyTorch distributed training was advised to use sudo nvidia-smi -pl <watts> to power-limit the GPU; see the sketch after this list.
    • It was suggested that this could mitigate temperature concerns.
  • Seek AWS Support for Persistent GPU Thermal Issue: A user experiencing persistent high GPU temperatures on a specific H100 node on AWS was advised to seek assistance from AWS support.
    • It was suggested that if the issue is a mechanical problem with the cooler, stress testing might be the only way to detect it before training.
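
A hedged sketch of the suggested mitigation; the GPU index and wattage are arbitrary placeholders, and the command needs root:

```python
import subprocess

# Cap the hot GPU (index 3 here, chosen arbitrarily) to 350 W before training.
# nvidia-smi flags: -i selects the GPU, -pl sets the power limit in watts.
subprocess.run(["sudo", "nvidia-smi", "-i", "3", "-pl", "350"], check=True)
```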

Interconnects (Nathan Lambert) ā–· #news (64 messagesšŸ”„šŸ”„):

Softmax AI Alignment Startup, xAI and X Merge, GPT-4o Image Generation, Gemini 2.5 Pro, MiniMax Audio Speech-02 Model

  • Shear Genius: Softmax Emerges for AI Alignment: Emmett Shear, Adam Goldstein, and David Bloomin founded Softmax, a 10-person startup in San Francisco focused on organic alignment, fusing human and AI goals by drawing from nature and intelligent systems, detailed in a Core Memory article.
  • X Marks the AI Spot: xAI Merges with X: Elon Musk announced that xAI is ā€˜blending xAI’s advanced AI capability and expertise with X’s massive reach’ in a merger detailed by The Verge.
  • GPT-4o’s Image Generation: Frontend Illusion?: A user discovered that GPT-4o’s line-by-line image generation effect is a browser-side animation, with the server sending only 5 intermediate images at a patch size of 8, according to this tweet.
  • Gemini 2.5 Pro Goes Experimental, Expands Access: Gemini 2.5 Pro (experimental) is now available to all Gemini users due to TPUs running hot, as announced on GeminiApp’s Twitter.
  • MiniMax Launches Audio Speech-02 with TTS: MiniMax AI launched Speech-02, which turns any file or URL into lifelike audio instantly in 30+ languages with native flair, unlimited voice cloning, and sub-second streaming detailed on MiniMax’s Twitter.


Interconnects (Nathan Lambert) ā–· #ml-questions (5 messages):

Diffusion Models, System Card

  • Diffusion Details Debated: A member questioned the inference source, noting it’s clearly not a standard diffusion model but diffusion does by default go from low frequency to high during sampling.
    • They added that none of this seems anti-diffusion, it just happens to be better.
  • System Card Sentence Sparks Curiosity: In response to a question about the inference source, a member cited a vague sentence in the system card as the origin of the inference.
    • The original question was ā€œam curious where this is being inferred fromā€.

Interconnects (Nathan Lambert) ā–· #random (33 messagesšŸ”„):

GPT-4o Image Generation, Chorus Pricing, Princess Mononoke IMAX Re-Release, Advanced Voice Mode, Manufacturing CNC

  • GPT-4o’s Quirky Image Gen Revealed: A user shared a tweet discussing the mysteries of GPT-4o’s image generation, noting that the final rendered image is placed back into the model’s context window.
    • The user questions why the control flow is returned to the model, which explains why it sometimes responds with ā€œUnderstoodā€ after generation.
  • Chorus Price Hike: The paid tier for Chorus has increased to $100/month, providing access to all models or the option to bring your own API keys.
    • The previous pricing was $20/month, which one user noted was unsustainable, but some users were ā€œgrandfathered inā€ to the old pricing, at least temporarily.
  • Mononoke Makes Millions in IMAX: The re-release of Studio Ghibli’s Princess Mononoke for IMAX was a smash hit, making $4 million over one weekend, exceeding its original North American run of $2.4 million in 1999, according to a tweet.
    • Users wondered if the recent rise of Ghibli-style art via ChatGPT Image Gen may be driving fresh excitement back to the original creators.
  • Voice Mode Vibes with GPT-4o: The advanced voice mode now uses natively multimodal models like GPT-4o, directly processing and generating audio for more natural conversations, according to a press release.
    • This picks up on non-verbal cues like speech speed and emotion, although usage is limited daily for users, with free users getting a preview powered by 4o-mini.
  • Manufacturing Maniac Seeks CNC Summer Gig: One user is sending random manufacturing founders emails asking to operate their CNC machines and ā€œshitā€ for the summer.
    • They are trying to get a factory floor job for the summer.

Links mentioned:

  • no title found: A Mac app for chatting with a bunch of AIs at once.
  • Twitch: no description found
  • Tweet from janbam (@janbamjan): hallelujah!claude takes an umprompted break
  • Tweet from šŸ’ŗ (@patience_cave): I’ve been putting 4o images to use by writing a long comic strip! It is a significant test of capabilities. Generating 50 visually consistent panels it took about 10 hours. No small amount of art and ...
  • Tweet from xlr8harder (@xlr8harder): Mysteries of gpt4o image gen: When the final rendered image is placed in the model's context window, they give the message below to the model.But why return the control flow to the model at all?(...
  • Tweet from Tibor Blaho (@btibor91): Did you know the recent IMAX re-release of Studio Ghibli’s Princess Mononoke is almost completely sold out, making more than $4 million over one weekend - more than its entire original North American ...

Interconnects (Nathan Lambert) ā–· #memes (10 messagesšŸ”„):

Internal Knowledge Google Workspace setup, AI to make art more accessible

  • OpenAI offers Internal Knowledge Google Workspace setup: Members shared an article detailing the steps for setting up Internal Knowledge Google Workspace.
  • AI makes art more accessible: A member shared that someone is using AI to make art more accessible by transforming existing pieces into the popular Corporate Memphis style, as shown in this tweet.

Link mentioned: Tweet from xlr8harder (@xlr8harder): Thrilled to announce my new series where I’m using AI to make art more accessible by transforming existing pieces into the popular Corporate Memphis style!


Interconnects (Nathan Lambert) ā–· #rl (3 messages):

RL Tutorial v2, Bibliographic Tools on arXiv, Large Language Models

  • RL Tutorial Gets an Upgrade: A member announced the v2 release of their RL tutorial, featuring a new chapter on multi-agent RL, improved sections on ā€˜RL as inference’ and ā€˜RL+LLMs’, and some typo fixes (link to tweet).
  • Bibliographic Tools on arXiv: A member shared a link to arXiv’s Bibliographic and Citation Tools page, which includes sections for Code, Data, Media, Demos, and Related Papers (link to arXiv).
  • LLMs Show Remarkable Reasoning: A link was shared to a paper from arXiv about Large Language Models (LLMs) showing remarkable reasoning ability.


Interconnects (Nathan Lambert) ā–· #rlhf (12 messagesšŸ”„):

Reward Models, RLHF Prompt Data, Reward Hacking

  • Hinton Hates RLHF, Calls it Lipstick on Pig: Geoffrey Hinton says RLHF is a pile of crap, and likens it to a paint job for a rusty car you want to sell.
  • Industry Insider Endorses Llama-3.1 Nemotron Reward Model: The Nvidia Llama-3.1 Nemotron-70B-Reward-HF is considered a good general-purpose Reward Model, while rewardbench is very outdated.
  • New Hybrid Reward System Combats Reward Hacking: A new paper explores data-driven bottlenecks in RLHF performance scaling, particularly reward hacking and decreasing response diversity, and introduces a hybrid reward system combining Reasoning Task Verifiers (RTV) and a Generative Reward Model (GenRM) to mitigate reward hacking.


Interconnects (Nathan Lambert) ā–· #cv (4 messages):

Moondream release, Image captioning, HF repo

  • Moondream releases new version: The newest Moondream release includes a Long format for image captioning, generating roughly 2x longer captions than the Normal format.
  • Vik recycles HF repo, draws ire: A member expressed their wish that Vik would create a new Hugging Face repository instead of reusing the same one.
    • They added that they wonder how he gets the detection performance so good, speculating that maybe he just understands what customers want better.

Link mentioned: Moondream 2025-03-27 Release: Moondream release announcement.


Interconnects (Nathan Lambert) ā–· #reads (10 messagesšŸ”„):

Sam Altman Firing, AI Energy Consumption, LLMs on Math Olympiad, GPT-4o and Studio Ghibli Style Images

  • Thiel Warns Altman About OpenAI’s Direction: Peter Thiel cautioned Sam Altman about OpenAI’s path during a dinner in L.A.’s Arts District in November 2023, as detailed in The Wall Street Journal.
  • Energy Use of AI Chatbots is Surprisingly Low: A blog post compares AI chatbot energy consumption over a year to everyday activities, revealing it uses less energy than driving a car for 10 kilometers or taking five short hot showers.
    • The author provides a visual aid, showing that a year of chatbot use consumes even less energy than filling two hot baths.
  • LLMs Flunk 2025 US Math Olympiad: An X post reported that current SOTA LLMs performed poorly on the 2025 US Math Olympiad, achieving only a 5% success rate on 6 problems ZainHasan6 Tweet.
  • GPT-4o Channels Studio Ghibli: Following Sam Altman’s announcement of a new image model update to GPT-4o, many online users generated Studio Ghibli style images, as covered in a Technollama blog post.
    • Altman himself shared a modified photo with developers in the Studio Ghibli style, captioned ā€œFeel the AGIā€.


Interconnects (Nathan Lambert) ā–· #posts (20 messagesšŸ”„):

Sora's Refusals, C2PA Protection Bypass, Watermarking Discussions

  • Sora Softens Stance on Sensitive Image Generation: Sora is shifting from blanket refusals in sensitive areas to a more precise approach focused on preventing real-world harm, as shared in a Discord post.
    • According to one user, you can generate images of any living politician without problems using Sora, but NSFW/NSFL content is blocked via an internal search tool.
  • C2PA Protection Easily Defeated: The C2PA protection used by Sora can be bypassed by simply converting the file format or taking a screenshot.
    • A user pointed out that this protection, intended to ensure image authenticity, is not robust enough to prevent misuse.
  • Watermarking Under Scrutiny as ā€˜Dumb but Smart’: A member expressed a dim view of watermarking, calling it dumb but smart.
    • They stated they wanted to see what paid conversion looks like and that practicing writing about it was valuable.

Modular (Mojo šŸ”„) ā–· #general (7 messages):

Chris Lattner's work, Modular forum link, Notebooks and Mojo

  • Lattner Shares List of Published Work: Chris Lattner shared a link to a list of his published work, including LLVM, Clang, Swift, MLIR, and CIRCT, also mentioning his leadership at Modular AI and board membership at the LLVM Foundation.
  • Mojo REPL Deprecation Forum Link Shared: A member shared a link to a Modular forum discussion about the deprecation of the Mojo REPL.
  • Notebooks are championed for Mojo’s Packaging: A member mentioned that Jeremy Howard is a huge proponent of using notebooks not just for experimentation, but even for packaging with Mojo.


Modular (Mojo šŸ”„) ā–· #mojo (138 messagesšŸ”„šŸ”„):

Homoiconicity in AI, Tail Call Optimization in Mojo, Mojo's 'out' argument convention, Heterogeneous Structs in Mojo Lists, Saturating Arithmetic in Mojo

  • Mojo Bug Exposes Infer-Only Parameter Hiccups: A user reported a bug where infer-only parameters are sometimes overwritten by positional parameters, causing compilation failure in specific scenarios involving traits and structs.
    • The issue occurs when calling a method with an infer-only parameter, while the equivalent function call works as expected; a fix is pending.
  • Revamping Mojo’s ā€˜out’ Argument Syntax: A Readability Refactor: A proposal has been made to improve the readability of Mojo’s out argument convention by specifying the type of out arguments as a return type in documentation and language.
    • The discussion involved the placement of out arguments (first vs. last) and the possibility of supporting multiple out arguments for scenarios like initializing channels with separate read and write halves.
  • Mojo Lists Segfault with Traits, Variant to the Rescue: A user encountered a segmentation fault (issue #4218) when trying to create a List of trait objects, specifically List[Estimator], and appending instances of KNN and SVM structs.
    • As a workaround, it was suggested to use List[Variant[KNN, SVM]] and iterate through the values, checking the type using isa to call the appropriate methods, as trait instances are not fully supported yet.
  • def vs fn: The Great Mojo Debate: A discussion emerged regarding the usage of def versus fn in Mojo, with some arguing that fn should be the default due to its type safety and better integration with typed Python workflows using tools like Mypy.
    • Others contended that def still has a place for beginners and those who prefer a more Python-like syntax, especially when interacting with untyped Python libraries, leading to a feature request to make def default to returning None.
  • GCD-Fueled Ratios: Metaprogramming Simplifies Fractions: A user showcased a clever use of compile-time metaprogramming to automatically simplify Ratio structs using gcd, resulting in simplified fractions at compile time, though it was noted that this approach could cause headaches when metaprogramming.
    • An alternative was proposed to make the simplification an explicit function call rather than automatic, drawing inspiration from std::ratio in C++; a runtime Python analog of the gcd idea follows this list.
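
A runtime Python analog of the idea; the Mojo version does this with compile-time parameters, while this sketch just shows the gcd normalization on construction:

```python
from dataclasses import dataclass
from math import gcd

@dataclass(frozen=True)
class Ratio:
    num: int
    den: int

    def __post_init__(self):
        g = gcd(self.num, self.den)  # normalize to lowest terms on construction
        object.__setattr__(self, "num", self.num // g)
        object.__setattr__(self, "den", self.den // g)

print(Ratio(6, 8))  # Ratio(num=3, den=4)
```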


Modular (Mojo šŸ”„) ā–· #max (2 messages):

CUDA definition, DeepSeek bypassing CUDA, NVIDIA driver vs CUDA

  • CUDA: The Backbone of Deep Learning?: A member shared a blog post defining CUDA as the backbone of deep learning and the core of NVIDIA’s moat.
  • DeepSeek Bypasses CUDA via PTX Layer: The member noted that DeepSeek’s breakthrough was achieved by bypassing CUDA and directly accessing the PTX layer.
  • NVIDIA driver confusion: A member mentioned that the NVIDIA driver isn’t counted as CUDA and that NVIDIA is a bit all over the place and inconsistent in their terminology over time.

Link mentioned: Modular: Democratizing AI Compute, Part 2: What exactly is ā€œCUDAā€?: no description found


Notebook LM ā–· #use-cases (15 messagesšŸ”„):

Video Snippets, Mind Maps, Multi-Modal Output, Android Sharing System, AI Voice Pronunciation

  • Video Snippets Requested in Responses: Users are requesting NotebookLM to include video snippets in its responses when a video is used as a source to provide visuals.
    • One member suggested that in the future, the team will enable multi-modal output.
  • Mind Maps Export Wishlisted: A user asked about exporting Mind Maps in DOT format or publishing an interactive applet with the Google UI.
    • It was implied this is not currently possible.
  • Android Sharing System Integration Sought: Users are requesting NotebookLM to participate in the Android sharing system, suggesting a need for a dedicated app.
    • One user suggested that choosing NotebookLM from the share menu could automatically search inside a default notebook.
  • A.I. Pronunciation Fumbles Addressed: A user is seeking ways to improve the pronunciation of words by AI voices in NotebookLM, particularly for company names with funky spellings.
    • The user is hoping to find ways to get the audio overview to pronounce company names correctly, by feeding the AI with another source with the correct pronunciation.
  • AI Debate Prompting Troubles: A user reported issues with prompting NotebookLM to generate a heated debate between two hosts with differing viewpoints on AI in mental health.
    • Another user suggested using internal names for the voices (Host Speaker and Expert Speaker) to help assign roles.

Notebook LM ā–· #general (96 messagesšŸ”„šŸ”„):

NotebookLM Spanish support, iPhone issues, Audio overview length, Briefing document generation, NotebookLM Plus daily limits

  • NotebookLM still only speaks English: Users inquired whether NotebookLM is available in Spanish, and the response was that it currently supports only English.
    • A user responded with a GIF of a cat held in a person’s hand with its paws crossed.
  • NotebookLM experiences iPhone rendering issues: A user reported issues using NotebookLM on an iPhone, with another user confirming it doesn’t work on anything using WebKit, such as Safari on Mac, and won’t until a fix ships.
    • Another user saw the same issue on desktop, along with a white screen.
  • NotebookLM Plus users bumped into daily limits: A NotebookLM Plus subscriber reported seeing a ā€˜You’ve reached your daily chat limits’ message, preventing proper use, even after logging out and refreshing.
    • Another user clarified that Plus users do not have any limit issues.
  • AI conversational feature gets suggested: A user suggested an AI conversation feature to directly interact with the AI and gather information without extensive reading, gathering a lot of support from others.
    • Members pointed out you could already use interactive mode, however they clarified this suggestion is for more of a ā€˜speaking version of the chat feature’ where users speak to ask the AI questions and listen to receive its responses.
  • Users request timestamps: Users have requested timestamped sections to allow skipping/re-listening to specific sections similar to how Audible does it.
    • Users are also asking for an update to Gemini 2.5 Pro.


LlamaIndex ā–· #blog (3 messages):

AI Agent Systems with LlamaIndex, LlamaIndex + Qdrant for Claude, OpenAI Responses API in LlamaIndex

  • LlamaIndex and SkySQL team up for AI Agents: LlamaIndex partners with SkySQL to teach users how to build AI agent systems for reliable text-to-SQL conversion without coding; more details at the SkySQL website.
  • LlamaIndex Prepares Documents for Claude via Qdrant: LlamaIndex shows how to prepare documents for inclusion in Claude using a pre-built MCP server for Qdrant, using Angular’s documentation as a data set, stored in Qdrant.
  • LlamaIndex integrates OpenAI Responses API: LlamaIndex now supports the OpenAI Responses API with full support for built-in tools, reasoning, images, manual tool calling, streaming, and async, enabling complex multi-agent workflows.
    • The announcement notes that the Responses API differs quite a bit from the Chat API.

LlamaIndex ā–· #general (76 messagesšŸ”„šŸ”„):

Telemetry Attributes, SubQuestionQueryEngine Workflow, VannaPack Memory Integration, Context Passing to Workflows, Image Input to OpenAIAgent

  • Telemetry Attributes Get Tagged: Members discussed the standard way of passing custom telemetry attributes when interacting with LlamaIndex abstractions, with one member seeking to attach a user ID to all events executed within a code block.
  • Context Wrangling Woes: One user ran into trouble trying to pass the same context to two different workflows; another member clarified that a Context holds all the data and state for a single workflow and isn’t designed to be shared.
    • Another member inquired about creating a context for FunctionAgent, encountering an AttributeError, but it was resolved by updating llama-index-core.
  • OpenAI Agents go Multi-Modal: Members discussed passing an image as a chat message to OpenAIAgent, with one member noting the lack of direct support for this capability.
    • A member suggested using OpenAI’s multi-modal capabilities or modifying ChatMemoryBuffer to add images to the request, while another recommended building an agent from scratch with workflows.
  • FunctionAgent Logic Separated for Flexibility: There was a discussion on why FunctionAgent is not just a workflow, to which it was clarified that it needs a specific abstraction to be an agent with a particular contract.
    • The separation allows for more flexibility and maintainability, with AgentWorkflow serving as the orchestrator and FunctionAgent/ReActAgent/etc. being swappable agent logic, with an example provided in the documentation.
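A rough sketch of that split, with AgentWorkflow orchestrating a swappable FunctionAgent (import paths follow current llama-index releases; the multiply tool and model name are illustrative):

```python
import asyncio

from llama_index.core.agent.workflow import AgentWorkflow, FunctionAgent
from llama_index.llms.openai import OpenAI

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

# FunctionAgent is the swappable agent logic; AgentWorkflow is the orchestrator
calculator = FunctionAgent(
    name="calculator",
    description="Does basic arithmetic",
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
)
workflow = AgentWorkflow(agents=[calculator], root_agent="calculator")

async def main():
    resp = await workflow.run(user_msg="What is 6 * 7?")
    print(resp)

asyncio.run(main())
```

Swapping FunctionAgent for ReActAgent changes the agent loop without touching the surrounding workflow.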

LlamaIndex ā–· #ai-discussion (4 messages):

LlamaIndex Upgrade, Internet of Agents

  • LlamaIndex Upgrade Breeds Embedding Error: A member reported an error when upgrading LlamaIndex from version 0.8.37 to 0.9.0 due to a missing Embedding setting.
    • Another member pointed out that the fix might require a version newer than 0.9.0.
  • Agents dream of interoperation?: A member published an article outlining a possible direction for solving the interop problem in agentic AI, proposing the construction of an ā€œInternet of Agentsā€.
    • The article, available at [IoA], dives into protocol layers for communication, memory, trust, and tool use, suggesting that open standards could unlock composability across ecosystems, including LlamaIndex.

tinygrad (George Hotz) ā–· #general (35 messagesšŸ”„):

Tinygrad Box vs Repurposed E-Waste Inference Machine, Finite Field Assembly Programming Language, TinyGrad Internals Notes, ONNX Float16 Issues, Tenstorrent DevDay

  • E-Waste Inference Machine vs Tinygrad Box: A user questioned the value of an inference machine built from repurposed e-waste with 4x 4090s (linked on Tmall) compared to the Tinygrad Box.
    • Another user commented that it’s likely plagued by PCIe errors due to its homebrew motherboard, estimating its worth at $1,000 + the cost of the 4090s.
  • Finite Field Assembly CUDA Alternative: A user shared Finite Field Assembly, a CUDA alternative designed for computations over finite fields, extending C89 and supporting recursive computing.
    • It leverages the properties of prime numbers to multiply several array elements concurrently; see the residue-packing sketch after this list.
  • TinyGrad Internals Detailed in New Notes: A user shared their notes on TinyGrad internals available here, covering UOps, ShapeTracker, and the Pattern Matcher, with inspiration from mesozoic-egg.
    • The notes provide a deep dive into the architecture of TinyGrad, complementing the official TinyGrad documentation.
  • ONNX Struggles with Float16 Silently: A user reported that the ORT CPUExecutionProvider silently casts inputs into float32 for float16 models, runs computations with float32, and casts the output back into float16, which is blocking numpy removal.
    • They proposed adding an envvar to replicate this behavior in their ONNX setup for testing and debugging purposes (sketched after this list).
  • Tenstorrent DevDay Presentation: A user announced they would present AlphaFold 3 on Wormhole at Tenstorrent DevDay in SF and expressed interest in meeting other Tinygrad users.
    • They asked about potential sales of excess Tinygrad V1 motherboards.
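The ā€œmultiply several array elements concurrentlyā€ trick is residue packing via the Chinese Remainder Theorem; a minimal plain-Python sketch of the idea (not the FFA language itself):

```python
from math import prod

# Pairwise-coprime moduli: by the CRT, an integer mod N = prod(PRIMES)
# encodes one independent "lane" of data per prime.
PRIMES = [5, 7, 11, 13]
N = prod(PRIMES)

def pack(values):
    # Find x with x ≔ values[i] (mod PRIMES[i]); brute force is fine for a demo
    return next(x for x in range(N)
                if all(x % p == v % p for p, v in zip(PRIMES, values)))

def unpack(x):
    return [x % p for p in PRIMES]

a = pack([1, 2, 3, 4])
b = pack([2, 2, 2, 2])
print(unpack((a * b) % N))  # [2, 4, 6, 8]: one big-int multiply, four lanes at once
```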
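And for the ONNX float16 issue, a sketch of the cast-through-float32 behavior the proposed envvar would replicate (the variable name is hypothetical):

```python
import os
import numpy as np

# Hypothetical envvar mirroring the proposal
FP16_VIA_FP32 = os.getenv("ONNX_FP16_VIA_FP32", "0") == "1"

def run_fp16(model_fn, x: np.ndarray) -> np.ndarray:
    # Mirrors the reported ORT CPUExecutionProvider behavior: silently upcast
    # float16 inputs to float32, compute in float32, downcast the output.
    if FP16_VIA_FP32 and x.dtype == np.float16:
        return model_fn(x.astype(np.float32)).astype(np.float16)
    return model_fn(x)
```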

tinygrad (George Hotz) ā–· #learn-tinygrad (17 messagesšŸ”„):

VAE with tinygrad, Huggingface's Diffusers library and tinygrad, tg_adapter, torch.to method subclass for Tensors

  • VAE tinygraining!: A member has been experimenting with building a VAE with tinygrad.
    • They have successfully modified Huggingface’s Diffusers library to work with tinygrad and got the VAE used in Stable Diffusion to function, available at this link.
  • tinygrad Adapting!: A member created an adapter layer to convert torch calls to tinygrad calls.
    • The adapter layer can be found here, enabling the use of tinygrad as a drop-in replacement.
  • Tensor Typecasting Tussle!: A member mentioned the need to create a subclass for Tensors that implements the torch.to method.
    • This is needed because torch.to doubles as a typecasting function, whereas tinygrad keeps device moves (Tensor.to) and dtype casts (Tensor.cast) separate.
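A minimal sketch of such a subclass (assuming tinygrad’s DType lives at tinygrad.dtype; treat this as illustrative, not tinygrad’s API):

```python
from tinygrad import Tensor
from tinygrad.dtype import DType

class TorchishTensor(Tensor):
    # torch's .to() accepts both devices and dtypes; tinygrad splits these into
    # Tensor.to(device) and Tensor.cast(dtype), so dispatch on the argument.
    def to(self, arg):
        if isinstance(arg, DType):
            return self.cast(arg)   # torch.to(dtype)  -> tinygrad cast
        return super().to(arg)      # torch.to(device) -> tinygrad to
```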

Torchtune ā–· #general (12 messagesšŸ”„):

FP8 training, Torchtune Office Hours

  • FP8 Training Time: Most FP8 training recipes are actually FP8 QAT, unless you have GPUs with native FP8 support (which the A100, for example, lacks), in which case you can train in FP8 directly; see the sketch after this list.
    • A member indicated that Torchtune office hours would be held next Friday, sharing a Discord link.
  • Discord Time Zone Conversion: Members discussed the automatic conversion of time zones within Discord for events.
    • One member shared a brain meme GIF in response to successfully converting time zones on the fly.
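For the direct-FP8 path, a sketch using torchao’s float8 training flow (assumed API surface; requires FP8-capable hardware such as an H100, which is exactly what the A100 caveat above is about):

```python
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training  # assumed import

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).cuda()
convert_to_float8_training(model)  # swaps eligible nn.Linear layers to Float8Linear

optimizer = torch.optim.AdamW(model.parameters())
x = torch.randn(32, 1024, device="cuda")
loss = model(x).square().mean()  # dummy loss, just to exercise fwd/bwd in FP8
loss.backward()
optimizer.step()
```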

Link mentioned: Brain Brain Meme GIF


Torchtune ā–· #dev (4 messages):

Code Review, Merge Process

  • PR #2441 awaits final review: A member requested a final review of PR #2441 to expedite the merge process, noting that all checks have already passed.
    • Another member was pinged to review the PR.

Torchtune ā–· #papers (1 messages):

yamashi: GRPO to teach searching on the internet: https://arxiv.org/pdf/2503.09516


Cohere ā–· #ć€ŒšŸ’¬ć€general (6 messages):

Command-R model, Aya-Vision model, Playground errors

  • Command-R is the Speedy Model: The Command-R model was confirmed as the fastest and most versatile model, with the playground defaulting to Command-A.
    • Users can use the API to try out different models, since model changes are not supported in the playground.
  • Aya-Vision struggles with image uploads: Users are reporting errors when uploading images to the playground using Aya-Vision.
    • One user confirmed it’s not working and asked to be notified when it starts working better.
  • Job Postings Prohibited: A moderator issued a warning against posting job postings in the channel.
    • This was a first warning, implying further violations may result in stricter actions.

Cohere ā–· #ć€ŒšŸ”Œć€api-discussions (8 messagesšŸ”„):

Cohere Docs Fix, API Latency, Aya Vision

  • Typo Fixed in Cohere’s Documentation!: A user reported a typo in Cohere’s documentation where train_epoch=1 should be train_epochs=1, causing a BadRequestError.
    • A Cohere staff member confirmed the typo and pushed a fix that should be live soon; a sketch after this list shows where the parameter lands.
  • API Latency Issues with Images: A user reported inconsistent API performance with slow responses on the chatv2 endpoint, specifically when including images, with requests timing out even after raising the timeout limit.
    • They tested the Aya Vision demo on Hugging Face, where it sometimes takes over 30 seconds to respond, and non-image based endpoints work quickly.
  • Debugging Aya Vision SDK: A user shared their code snippet for using Aya Vision via the Cohere SDK, requesting assistance debugging latency issues.
    • A Cohere staff member responded that they will investigate the latency on their end.
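For reference, here is where that hyperparameter lands in the Python SDK (a sketch based on Cohere’s documented fine-tuning interface; the dataset ID is a placeholder):

```python
import cohere
from cohere.finetuning import BaseModel, FinetunedModel, Hyperparameters, Settings

co = cohere.Client()  # reads CO_API_KEY from the environment

co.finetuning.create_finetuned_model(
    request=FinetunedModel(
        name="my-finetune",
        settings=Settings(
            base_model=BaseModel(base_type="BASE_TYPE_CHAT"),
            dataset_id="my-dataset-id",  # placeholder
            # note the plural: train_epochs, not train_epoch
            hyperparameters=Hyperparameters(train_epochs=1),
        ),
    ),
)
```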

Cohere ā–· #ć€ŒšŸ¤ć€introductions (2 messages):

Indy Game Dev, C++, Graphics and Audio Libraries, Browser Game, Cohere

  • Indie Game Dev Aims High: A self-taught indie game developer working mainly in C++ with graphics and audio libraries introduced themselves; they are currently building a browser game for a friend’s web animation series.
    • They have started using Cohere as an alternative to the other big names and like the results so far.

Nomic.ai (GPT4All) ā–· #general (14 messagesšŸ”„):

Libre Wolf, GPT4All model search, Documentation ingestion, Model differences, Llama3 8B Instruct

  • Libre Wolf Browser in Question: A member inquired about Libre Wolf browser usage, questioning its security compared to Firefox.
  • GPT4All struggles with model search: A member mentioned difficulty searching the list of GPT4All models, since it isn’t a webpage; another member pointed out that a local model-list search hasn’t been a feature in GPT4All for two years.
    • One member provided links to the model lists on GitHub.
  • Documentation Ingestion Assistance Requested: A member asked for a model capable of ingesting documents and answering questions based on them, apologizing for their bad English.
    • A member shared the GPT4All wiki with official translations in six languages and suggested using Google Translate for other languages.
  • Seeking Blogging Brilliance with Llama3 8B Instruct: A member inquired if Llama3 8B Instruct is the best model for creating blog posts and webpages from recorded video courses.
    • A member also asked about the difference between .bin and .gguf files and whether they are interchangeable.

DSPy ā–· #general (8 messagesšŸ”„):

Pydantic conint, DSPy dynamic example resending, RateLimitError with MIPROv2, Azure OpenAI burst limits, DSPy rate throttling

  • Pydantic’s conint Limits: The conint feature in Pydantic can set constraints like conint(ge=1, le=10), but it throws a ValidationError if the output falls outside the specified range.
    • A member noted the desire for DSPy to dynamically generate examples and resend requests upon validation failures, a feature that is not currently working as expected (see the conint sketch after this list).
  • RateLimitErrors Plague MIPROv2: A user reported encountering RateLimitErrors despite setting num_threads=1 when using MIPROv2 with gpt-4o-mini on Azure OpenAI.
    • Another explained that the issue stems from MIPROv2.compile() making multiple internal API calls, compounded by Azure OpenAI’s burst limits, which num_threads=1 does not prevent.
  • Mitigating Azure’s Rate Limits: To address RateLimitErrors, a user suggested adding retry logic with a sleep(30) interval (a retry sketch follows this list), lowering max_*_demos, and potentially upgrading to the latest DSPy version with built-in rate throttling.
    • It was emphasized that structured prompting in MIPROv2 and Copro can lead to errors if the LLM returns empty outputs due to API truncation or rate limits.
  • Optimization Suffers from Rate Limit Workarounds: A user pointed out that reducing max_bootstrapped_demos and max_labeled_demos to avoid RateLimitErrors negatively impacts the optimization process.
    • They suggested that DSPy lacks an internal delay mechanism to manage API call frequency effectively.
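For reference, the conint behavior described above, shown in plain Pydantic outside DSPy:

```python
from pydantic import BaseModel, ValidationError, conint

class Rating(BaseModel):
    score: conint(ge=1, le=10)  # constrained int: 1 <= score <= 10

print(Rating(score=7))   # ok
try:
    Rating(score=42)     # outside the range -> ValidationError
except ValidationError as err:
    print(err.errors()[0]["type"])  # 'less_than_equal' on Pydantic v2
```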
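And a rough version of the suggested retry-with-sleep workaround (catching the OpenAI client’s RateLimitError is an assumption; substitute your provider’s exception):

```python
import time
from openai import RateLimitError  # assumption: Azure OpenAI via the openai package

def with_retries(fn, max_attempts=5, delay_s=30):
    # Naive backstop for Azure burst limits: sleep and retry on rate-limit errors
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(delay_s)

# e.g. compiled = with_retries(lambda: optimizer.compile(program, trainset=trainset))
```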

DSPy ā–· #examples (4 messages):

DSPy Optimizers, Module Usage, Prompt Engineering in DSPy, Signature Creation in DSPy

  • DSPy Optimizers Optimize Prompts and Weights: DSPy optimizes prompts and weights to teach LMs to deliver high-quality outputs, offering algorithms for building modular AI systems and optimizing their prompts and weights according to the DSPy documentation.
    • Different optimizers choose N examples to include in the prompt.
  • DSPy signatures as ā€œa, b -> cā€: In DSPy, the signature is defined as ā€œa, b -> cā€, where a, b, and c are meaningful names.
    • The optimizer then generates candidate prompts and runs them on a dataset to determine the best performer; see the sketch after this list.
  • Practical Module Usage Considerations: If the specific implementation necessitates an optimizer, the relevance of docstrings diminishes.
  • Building NLP Data to Chart Pipelines: A member is working on leveraging DSPy to build a tool that transforms natural language processing of data into charts.
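A minimal sketch of the ā€œa, b -> cā€ pattern (the model name is a placeholder):

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model

# "question, context -> answer": meaningful field names tell the LM what each slot is
qa = dspy.Predict("question, context -> answer")
pred = qa(question="Who wrote it?", context="Moby-Dick was written by Herman Melville.")
print(pred.answer)
```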

Link mentioned: DSPy: The framework for programming—rather than prompting—language models.


LLM Agents (Berkeley MOOC) ā–· #mooc-announcements (1 messages):

Thomas Hubert, AlphaProof, Formal Mathematics, Reinforcement Learning

  • Thomas Hubert Presents AlphaProof Lecture: Thomas Hubert, a research engineer at Google DeepMind, will present ā€œAlphaProof: when reinforcement learning meets formal mathematicsā€ on 3/31 at 10AM PDT, livestreamed on YouTube.
    • The lecture will explore how computers and computation are now routinely used in research mathematics and contribute to grand problems like the Birch and Swinnerton-Dyer conjecture.
  • Galileo’s View on Mathematics: The lecture abstract invokes Galileo, the renowned Italian astronomer, physicist, and mathematician, who famously described mathematics as the language of the universe; computers have since enriched our understanding of it.
    • Hubert earned his MS in Mathematics from Stanford University.



LLM Agents (Berkeley MOOC) ā–· #mooc-questions (11 messagesšŸ”„):

Course Information, Lecture Times, Free Credits, AgentX Competition

  • LLM Agents MOOC Course Info Listed: The course website (llmagents-learning.org/sp25) and Discord server provide essential links and discussion forums for the LLM Agents MOOC.
  • Spring 2025 LLM Agents MOOC Previous Lectures: Previous lectures for the Spring 2025 course can be found on the course website and in this YouTube playlist.
  • How to get Free Credits?: AgentX offers credit resources, with details on the AgentX website; a collection form for those wanting AgentX credits releases this week.
  • Lecture bumped up to 10 AM PDT: Today’s lecture was moved to 10 AM PDT to accommodate the speaker joining from the UK.
  • Completion-based quizzes: The course quizzes are completion-based; the score does not matter as long as they are attempted.

Link mentioned: Advanced Large Language Model Agents MOOC: MOOC, Spring 2025


MLOps @Chipro ā–· #events (1 messages):

TMLS 2025, MLOps, AI, Call for Speakers, AI Agents

  • TMLS 2025 Call for Speakers Opens: A member announced the Call for Speakers for the Toronto Machine Learning Summit (TMLS) in June 2025.
    • TMLS 2025 will feature 16 specialized tracks, including Advanced RAG, Multimodal LLMs, AI Agents in Production, MLOps for Smaller Teams, Responsible AI Implementation, and GenAI Deployments.
  • MLOps for Smaller Teams: The Toronto Machine Learning Summit will have an MLOps track aimed at smaller teams.
    • This is a great opportunity for smaller teams to share their experiences and learn from others.



{% else %}

The full channel by channel breakdowns have been truncated for email.

If you want the full breakdown, please visit the web version of this email: [{{ email.subject }}]({{ email_url }})!

If you enjoyed AInews, please share with a friend! Thanks in advance!

{% endif %}