**Transparency and patience is all we need.**

AI News for 4/4/2025-4/7/2025. We checked 7 subreddits, 433 Twitters and 30 Discords (229 channels, and 18760 messages) for you. Estimated reading time saved (at 200wpm): 1662 minutes. You can now tag @smol_ai for AINews discussions!

The headlines of Llama 4 are glowing: 2 new medium-size MoE open models that score well, and a third 2-trillion-parameter “behemoth” promised that should be the largest open model ever released, restoring Meta’s place at the top of the charts:

image.png

SOTA training updates are always welcome: we note the adoption of Chameleon-like early fusion with MetaCLIP, interleaved chunked attention without RoPE (commented on by many), native FP8 training, and pretraining on up to 40T tokens.

While the closed model labs tend to set the frontier, Llama usually sets the bar for what open models should be. Llama 3 was released almost a year ago, and subsequent updates like Llama 3.2 were just as well received.

Usual license handwringing aside, the tone of Llama 4’s reception has been remarkably different.

  1. Llama 4 was released on a Saturday, seemingly earlier than even Meta expected; the release date was moved up from Monday at the last minute. Zuck’s official line is simply that it was “ready”.
  2. Meta published just the blog post, nowhere near the level of the Llama 3 paper in transparency.
  3. The smallest “Scout” model is 109B params, which cannot be run on consumer-grade GPUs.
  4. The claimed 10M token context is almost certainly far above the “real” context, given training used only 256k-token contexts (still impressive! but not 10M!)
  5. There was a special “experimental” version used for LMArena, which produced the good score; that is not the version that was released. The discrepancy forced LMArena to respond by releasing the full eval dataset.
  6. It does very poorly on independent benchmarks like Aider
  7. Unsubstantiated posts on Chinese social media claim company leadership pushed for training on test sets to meet Zuck’s goals.

The last point has been categorically denied by Meta leadership: image.png

but the whiff that something is wrong with the release has undoubtedly tarnished what would otherwise be a happy day in Open AI land.


{% if medium == 'web' %}

Table of Contents

[TOC]

{% else %}

The Table of Contents and Channel Summaries have been moved to the web version of this email: [{{ email.subject }}]({{ email_url }})!

{% endif %}


AI Twitter Recap

Large Language Models (LLMs) and Model Releases

  • Llama 4 and Implementation Issues: @Ahmad_Al_Dahle stated that Meta is aware of mixed-quality reports across different services using Llama 4, expects implementations to stabilize in a few days, and denied claims of training on test sets. @ylecun noted that some clarifications about Llama-4 were needed, and @reach_vb thanked @Ahmad_Al_Dahle for clarifications and commitment to open science and weights.
  • Llama 4 Performance and Benchmarks: Concerns about the quality of Llama 4’s output have surfaced, with @Yuchenj_UW reporting it generates slop, but others claim it’s good. @Yuchenj_UW highlighted a reddit thread and said that if Meta actually trained to maximize benchmark scores, “it’s fucked.” @terryyuezhuo compared Llama-4 Maverick on BigCodeBench-Full to GPT-4o-2024-05-13 & DeepSeek V3, reporting that Llama-4 Maverick performs similarly to Gemini-2.0-Flash-Thinking & GPT-4o-2024-05-13 on BigCodeBench-Hard but is ranked 41st/192. @terryyuezhuo also noted that Llama-4-Scout ranked 97th/192. @rasbt said Meta released the Llama 4 suite, MoE models with 16 & 128 experts, which are optimized for production.
  • DeepSeek-R1: @scaling01 simply stated that DeepSeek-R1 is underrated, and @LangChainAI shared a guide to build RAG applications with DeepSeek-R1.
  • Gemini Performance: @scaling01 analyzed Gemini 2.5 Pro and Llama-4 results on Tic-Tac-Toe-Bench, noting Gemini 2.5 Pro is surprisingly worse than other frontier thinking models when playing as ‘O’, and ranks as the 5th most consistent model overall. @jack_w_rae mentioned chatting with @labenz on Cognitive Revolution about scaling Thinking in Gemini and 2.5 Pro.
  • Mistral Models: @sophiamyang announced that Ollama now supports Mistral Small 3.1.
  • Model Training and Data: @jxmnop argues that training large models is not inherently scientifically valuable and that many discoveries could’ve been made on 100M parameter models.
  • Quantization Aware Training: @osanseviero asked if Quantization-Aware Trained Gemma should be released for more quantization formats.

AI Applications and Tools

  • Replit for Prototyping: @pirroh suggested that Replit should be the tool of choice for GSD prototypes.
  • AI-Powered Personal Device: @steph_palazzolo reported that OpenAI has discussed buying the startup founded by Sam Altman and Jony Ive to build an AI-powered personal device, potentially costing over $500M.
  • AI in Robotics: @TheRundownAI shared top stories in robotics, including Kawasaki’s rideable wolf robot and Hyundai buying Boston Dynamics’ robots.
  • AI-Driven Content Creation: @ID_AA_Carmack argues that AI tools will allow creators to reach greater heights, and enable smaller teams to accomplish more.
  • LlamaParse: @llama_index introduced a new layout agent within LlamaParse for best-in-class document parsing and extraction with precise visual citations.
  • MCP and LLMs: @omarsar0 discussed Model Context Protocol (MCP) and its relationship to Retrieval Augmented Generation (RAG), noting that MCP complements RAG by standardizing the connection of LLM applications to tools. @svpino urged people to learn MCP.
  • AI-Assisted Coding and IDEs: @jeremyphoward highlighted resources for using MCP servers in Cursor to get up-to-date AI-friendly docs using llms.txt.
  • Perplexity AI Issues: @AravSrinivas asked users about the number one issue on Perplexity that needs to be fixed.

Company Announcements and Strategy

  • Mistral AI Hiring and Partnerships: @sophiamyang announced that Mistral AI is hiring in multiple countries for AI Solutions Architect and Applied AI Engineer roles. @sophiamyang shared that Mistral AI has signed a €100 million partnership with CMA CGM to adopt custom-designed AI solutions for shipping, logistics, and media activities.
  • Google AI Updates: @GoogleDeepMind announced the launch of Project Astra capabilities in Gemini Live. @GoogleDeepMind stated that GeminiApp is now available to Advanced users on Android devices, as well as on Pixel 9 and SamsungGalaxy S25 devices.
  • Weights & Biases Updates: @weights_biases shared the features shipped in March for W&B Models.
  • OpenAI’s Direction: @sama teased a popular recent release from OpenAI.
  • Meta’s AI Strategy: @jefrankle defended Meta’s AI strategy, arguing that it’s better to have fewer, better releases than more worse releases.

Economic and Geopolitical Implications of AI

  • Tariffs and Trade Policy: @dylan522p analyzed how impending tariffs caused a Q1 import surge and predicted a temporary GDP increase in Q2 due to inventory destocking. @wightmanr argued that trade deficits aren’t due to other countries’ tariffs. @fchollet stated that the economy is being crashed on purpose.
  • American Open Source: @scaling01 claimed American open-source has fallen and that it’s all on Google and China now.
  • Stablecoins and Global Finance: @kevinweil stated that a globally available, broadly integrated, low cost USD stablecoin is good for 🇺🇸 and good for people all over the world.

AI Safety, Ethics, and Societal Impact

  • AI’s Impact on Individuals: @omarsar0 agreed with @karpathy that LLMs have been significantly more life altering for individuals than for organizations.
  • Emotional Dependence on AI: @DeepLearningAI shared research indicating that while ChatGPT voice conversations may reduce loneliness, they can also lead to decreased real-world interaction and increased emotional dependence.
  • AI Alignment and Control: @DanHendrycks argued for the need to align and domesticate AI systems, creating them to act as “fiduciaries.”
  • AI and the Future: @RyanPGreenblatt suggests that the AI trend will break the GDP growth trend.

Humor/Memes

  • Miscellaneous Humor: @scaling01 asked @deepfates if they bought 0DTE puts again. @lateinteraction explicitly noted that a previous statement was a joke. @svpino joked that AI might take our jobs, but at least we can now work making Nike shoelaces.

AI Reddit Recap

/r/LocalLlama Recap

Theme 1. “Transforming Time Series Forecasting with Neuroplasticity”

  • Neural Graffiti - A Neuroplasticity Drop-In Layer For Transformers Models (Score: 170, Comments: 56): The post introduces Neural Graffiti, a neuroplasticity drop-in layer for transformer models. This layer is inserted between the transformer layer and the output projection layer, allowing the model to acquire neuroplasticity traits by changing its outputs over time based on past experiences. Vector embeddings from the transformer layer are mean-pooled and modified with past memories to influence token generation, gradually evolving the model’s internal understanding of concepts (a toy sketch follows this theme’s comments). A demo is available on GitHub: babycommando/neuralgraffiti. The author finds liquid neural networks “awesome” for emulating the human brain’s ability to change connections over time. They express fascination with “hacking” the model despite not fully understanding the transformer at the neuron level. They acknowledge challenges such as the cold start problem and emphasize the importance of finding the “sweet spot”. They believe this approach could make the model acquire a “personality in behavior” over time.

    • Some users praise the idea, noting it could address issues needed for true personal assistants and likening it to self-learning, potentially allowing the LLM to “talk what it wants”.
    • One user raises technical considerations, suggesting that applying the graffiti layer earlier in the architecture might be more effective, as applying it after the attention and feedforward blocks may limit meaningful influence on the output.
    • Another user anticipates an ethics discussion about the potential misuse of such models.
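
To make the mechanism above concrete, here is a minimal PyTorch sketch of the idea as described (our own toy rendition, not code from the babycommando/neuralgraffiti repo; the names and the EMA memory update are illustrative stand-ins):

```python
import torch
import torch.nn as nn

class GraffitiLayer(nn.Module):
    """Toy drop-in layer: blends a running memory vector into hidden states
    between the final transformer layer and the output projection."""
    def __init__(self, hidden_size: int, decay: float = 0.9):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.decay = decay
        self.register_buffer("memory", torch.zeros(hidden_size))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden), output of the final transformer layer
        pooled = hidden_states.mean(dim=1)  # mean-pool over the sequence
        # Evolve the memory from past prompts (a simple EMA stand-in for the
        # recurrent update the post describes).
        self.memory = self.decay * self.memory + (1 - self.decay) * pooled.mean(dim=0).detach()
        # Inject the memory before the LM head, nudging token prediction.
        return hidden_states + self.proj(self.memory)
```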

Theme 2. “Disappointment in Meta’s Llama 4 Performance”

  • So what happened to Llama 4, which trained on 100,000 H100 GPUs? (Score: 256, Comments: 85): The post discusses Meta’s Llama 4, which was reportedly trained using 100,000 H100 GPUs. Despite having fewer resources, DeepSeek claims to have achieved better performance with models like DeepSeek-V3-0324. Yann LeCun stated that FAIR is working on the next generation of AI architectures beyond auto-regressive LLMs. The poster suggests that Meta’s leading edge is diminishing and that its smaller open-source models have been surpassed by Qwen, with Qwen3 coming….

    • One commenter questions the waste of GPUs and electricity on disappointing training results, suggesting that the GPUs could have been used for better purposes.
    • Another commenter points out that the Meta blog post mentioned using 32K GPUs instead of 100K and provides a link for reference.
    • A commenter criticizes Yann LeCun, stating that while he was a great scientist, he has made many mispredictions regarding LLMs and should be more humble.
  • Meta’s Llama 4 Fell Short (Score: 1791, Comments: 175): Meta’s Llama 4 models, Scout and Maverick, have been released but are disappointing. Joelle Pineau, Meta’s AI research lead, has been fired. The models use a mixture-of-experts setup with a small expert size of 17B parameters, which is considered small nowadays. Despite having extensive GPU resources and data, Meta’s efforts are not yielding successful models. An image compares four llamas labeled Llama1 to Llama4, with Llama4 appearing less polished. The poster is disappointed with Llama 4 Scout and Maverick, stating that they ‘left me really disappointed’. They suggest the underwhelming performance might be due to the tiny expert size in their mixture-of-experts setup, noting that 17B parameters ‘feels small these days’. They believe that Meta’s struggle shows that ‘having all the GPUs and Data in the world doesn’t mean much if the ideas aren’t fresh’. They praise companies like DeepSeek and OpenAI for showing that real innovation pushes AI forward and criticize the approach of just throwing resources at a problem without fresh ideas. They conclude that AI advancement requires not just brute force but brainpower too.

    • One commenter recalls rumors that Llama 4 was so disappointing compared to DeepSeek that Meta considered not releasing it, suggesting they should have waited to release Llama 5.
    • Another commenter criticizes Meta’s management, calling it a ‘dumpster fire’, and suggests that Zuckerberg needs to refocus, comparing Meta’s situation to Google’s admission of being behind and subsequent refocusing.
    • A commenter finds it strange that Meta’s model is underwhelming despite having access to an absolutely massive amount of data from Facebook that nobody else has.
  • I’d like to see Zuckerberg try to replace mid level engineers with Llama 4 (Score: 381, Comments: 62): The post references Mark Zuckerberg’s statement that AI will soon replace mid-level engineers, as reported in a Forbes article linked here. The author is skeptical of Zuckerberg’s claim, implying that replacing mid-level engineers with Llama 4 may not be feasible.

    • One commenter jokes that perhaps Zuckerberg replaced engineers with Llama3, leading to Llama4 not turning out well.
    • Another commenter suggests that he might need to use Gemini 2.5 Pro instead.
    • A commenter criticizes Llama4, calling it “a complete joke” and expressing doubt that it can replace even a well-trained high school student.

Theme 3. “Meta’s AI Struggles: Controversies and Innovations”

  • Llama 4 is open - unless you are in the EU (Score: 602, Comments: 242): Llama 4 has been released by Meta with a license that bans entities domiciled in the European Union from using it. The license explicitly states: “You may not use the Llama Materials if you are… domiciled in a country that is part of the European Union.” Additional restrictions include mandatory use of Meta’s branding (LLaMA must be in any derivative’s name), required attribution (“Built with LLaMA”), no field-of-use freedom, no redistribution freedom, and the model is not OSI-compliant, thus not considered open source. The author argues that this move isn’t “open” in any meaningful sense but is corporate-controlled access disguised in community language. They believe Meta is avoiding the EU AI Act’s transparency and risk requirements by legally excluding the EU. This sets a dangerous precedent, potentially leading to a fractured, privilege-based AI landscape where access depends on an organization’s location. The author suggests that real “open” models like DeepSeek and Mistral deserve more attention and questions whether others will switch models, ignore the license, or hope for change.

    • One commenter speculates that Meta is trying to avoid EU regulations on AI and doesn’t mind if EU users break this term; they just don’t want to be held to EU laws.
    • Another commenter notes that there’s no need to worry because, according to some, Llama 4 performs poorly.
    • A commenter humorously hopes that Meta did not use EU data to train the model, implying a potential double standard.
  • Meta’s head of AI research stepping down (before the llama4 flopped) (Score: 166, Comments: 31): Meta’s head of AI research, Joelle, is stepping down. Joelle is the head of FAIR (Facebook AI Research), but GenAI is a different organization within Meta. There are discussions about Llama4 possibly not meeting expectations. Some mention that blending in benchmark datasets in post-training may have caused issues, attributing failures to the choice of architecture (MOE). The original poster speculates that Joelle’s departure is an early sign of the Llama4 disaster that went unnoticed. Some commenters disagree, stating that people leave all the time and this doesn’t indicate problems with Llama4. Others suggest that AI development may be slowing down, facing a plateau. There’s confusion over Meta’s leadership structure, with some believing Yann LeCun leads the AI organization.

    • One commenter clarifies that Joelle is the head of FAIR and that GenAI is a different org, emphasizing organizational distinctions within Meta.
    • Another mentions they heard from a Meta employee about issues with blending benchmark datasets in post-training and attributes possible failures to the choice of architecture (MOE).
    • A commenter questions Meta’s structure, asking if Joelle reports to Yann LeCun, indicating uncertainty about who leads the AI efforts at Meta.
  • “Serious issues in Llama 4 training. I Have Submitted My Resignation to GenAI” (Score: 922, Comments: 218): An original Chinese post alleges serious issues in the training of Llama 4, stating that despite repeated efforts, the model underperforms compared to open-source state-of-the-art benchmarks. The author claims that company leadership suggested blending test sets from various benchmarks during the post-training process to artificially boost performance metrics. The author states they have submitted their resignation and requested their name be excluded from the technical report of Llama 4, mentioning that the VP of AI at Meta also resigned for similar reasons. The author finds this approach unethical and unacceptable. Commenters express skepticism about the validity of these claims and advise others to take the information with a grain of salt. Some suggest that such practices reflect broader issues within the industry, while others note that similar problems can occur in academia.

    • A commenter points out that Meta’s head of AI research announced departure on Tue, Apr 1 2025, suggesting it might be an April Fool’s joke.
    • Another commenter shares a response from someone at Facebook AI who denies overfitting test sets to boost scores and requests evidence, emphasizing transparency.
    • A user highlights that company leadership suggesting blending test sets into training data amounts to fraud and criticizes the intimidation of employees in this context.

Other AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding

Theme 1. “Llama 4 Scout and Maverick Launch Insights”

  • Llama 4 Maverick/Scout 17B launched on Lambda API (Score: 930, Comments: 5): Lambda has launched Llama 4 Maverick and Llama 4 Scout 17B models on Lambda API. Both models have a context window of 1 million tokens and use FP8 quantization. Llama 4 Maverick is priced at $0.20 per 1M input tokens and $0.60 per 1M output tokens. Llama 4 Scout is priced at $0.10 per 1M input tokens and $0.30 per 1M output tokens. More information is available on their information page and documentation. The models offer a remarkably large context window of 1 million tokens, which is significantly higher than typical models. The use of FP8 quantization suggests a focus on computational efficiency.

    • A user criticized the model, stating “It’s actually a terrible model. Not even close to advertised.”
    • The post was featured on a Discord server, and the user was given a special flair for their contribution.
    • Automated messages provided guidelines and promotions related to ChatGPT posts.

Theme 2. “AI Innovations in 3D Visualization and Image Generation”

  • TripoSF: A High-Quality 3D VAE (1024³) for Better 3D Assets - Foundation for Future Img-to-3D? (Model + Inference Code Released) (Score: 112, Comments: 10): TripoSF is a high-quality 3D VAE capable of reconstructing highly detailed 3D shapes at resolutions up to 1024³. It uses a novel SparseFlex representation, allowing it to handle complex meshes with open surfaces and internal structures. The VAE is trained using rendering losses, avoiding mesh simplification steps that can reduce fine details. The pre-trained TripoSF VAE model weights and inference code are released on GitHub, with a project page at link and paper available on arXiv. The developers believe this VAE is a significant step towards better 3D generation and could serve as a foundation for future image-to-3D systems. They mention, “We think it’s a powerful tool on its own and could be interesting for anyone experimenting with 3D reconstruction or thinking about the pipeline for future high-fidelity 3D generative models.” They are excited about its potential and invite the community to explore its capabilities.

    • A user expresses excitement, recalling similar work and stating, “Can’t wait to try this one once someone implements it into ComfyUI.”
    • Another user shares positive feedback, noting they generated a tree that came out better than with Hunyuan or Trellis, and commends the team for their work.
    • A user raises concerns that the examples on the project page are skewed, suggesting that the Trellis examples seem picked from a limited web demo.
  • Wan2.1-Fun has released its Reward LoRAs, which can improve visual quality and prompt following (Score: 141, Comments: 33): Wan2.1-Fun has released its Reward LoRAs, which can improve visual quality and prompt following. A demo comparing the original and enhanced videos is available: left: original video; right: enhanced video. The models are accessible on Hugging Face, and the code is provided on GitHub. Users are eager to test these new tools and are curious about their capabilities. Some are experiencing issues like a ‘lora key not loaded’ error when using the models in Comfy, and are asking about differences between HPS2.1 and MPS.

    • A user is excited to try out the models and asks, “What’s the diff between HPS2.1 and MPS?”
    • Another inquires if the Reward LoRAs are for fun-controlled videos only or can be used with img2vid and txt2vid in general.
    • Someone reports an error, “Getting lora key not loaded error”, when attempting to use the models in Comfy.
  • The ability of the image generator to “understand” is insane… (Score: 483, Comments: 18): The post highlights the impressive ability of an image generator to “understand” and generate images. The author expresses amazement at how “insane” the image generator’s understanding is.

    • Commenters note that despite being impressive, the image has imperfections like “bunion fingers” and a “goo hand”.
    • Some users humorously point out anomalies in the image, questioning “what’s his foot resting on?” and making jokes about mangled hands.
    • Another user discusses the cost of the car in the image, stating they would buy it for “about a thousand bucks in modern-day currency” but not the “Cybertruck”, which they dislike.

Theme 3. “Evaluating AI Models with Long Context Windows”

  • “10m context window” (Score: 559, Comments: 102): The post discusses a table titled ‘Fiction.LiveBench for Long Context Deep Comprehension’, showcasing various AI models and their performance across different context lengths. The models are evaluated on their effectiveness in deep comprehension tasks at various context sizes such as 0, 400, 1k, and 2k. Notable models like gpt-4.5-preview and Claude perform consistently well across contexts. The table reveals that the highest scoring models cluster around 100 for shorter contexts, but scores generally decrease as the context size increases. Interestingly, Gemini 2.5 Pro performs much better on a 120k context window than on a 16k one, which is unexpected.

    • One user criticizes Llama 4 Scout and Maverick as “a monumental waste of money” and believes they have “literally zero economic value.”
    • Another commenter expresses concern that “Meta is actively slowing down AI progress by hoarding GPUs”, suggesting resource allocation issues.
    • A user highlights that Gemini 2.5 Pro scores 90.6 on a 120k context window, calling it “crazy”.

AI Discord Recap

A summary of Summaries of Summaries by Gemini 2.0 Flash Exp

Theme 1: Llama 4’s Context Window: Hype or Reality?

  • Experts Doubt Llama 4’s Promised Land of 10M Context Length: Despite Meta’s hype, engineers across multiple Discords express skepticism about Llama 4’s actual usable context length due to training limitations. Claims that training only occurred up to 256k tokens suggest the 10M context window may be more virtual than practical, per Burkov’s tweet.
  • Coding Performance of Llama 4 Disappoints: Users in aider, Cursor and Nous Research report underwhelming coding abilities for Llama 4’s initial releases, with many deeming it worse than GPT-4o and DeepSeek V3, prompting debate over the model’s true capabilities; several users doubt the official benchmark results, especially given claims that Meta may have gamed the benchmarks.
  • Scout and Maverick Hit OpenRouter: OpenRouter released Llama 4 Scout and Maverick models. Some expressed disappointment that the context window on OpenRouter is only 132k rather than the advertised 10M, and NVIDIA says it is accelerating inference to over 40K tokens per second.

Theme 2: Open Models Make Moves: Qwen 2.5 and DeepSeek V3 Shine

  • Qwen 2.5 Gains Traction With Long Context: Unsloth highlighted the Qwen2.5 series models (HF link), boasting improved coding, math, multilingual support, and long-context support up to 128K tokens. Initial finetuning results with Qwen 2.5 suggest the model can’t be finetuned to reason.
  • DeepSeek V3 Mysteriously Identifies as ChatGPT: OpenRouter highlighted a TechCrunch article revealing that DeepSeek V3 sometimes identifies as ChatGPT, despite outperforming other models in benchmarks. Testers found that in 5 out of 8 generations, DeepSeekV3 claims to be ChatGPT (v4).
  • DeepSeek Rewards LLMs: Nous Research highlighted a new DeepSeek paper proposing Self-Principled Critique Tuning (SPCT) to improve reward modeling (RM) with more inference compute for general queries, enabling effective inference-time scalability for LLMs. NVIDIA also accelerates inference on the DeepSeek model.

Theme 3: Tool Calling Takes Center Stage: MCP and Aider

  • Aider’s Universal Tool Calling: The aider Discord is developing an MCP (Model Context Protocol) client to allow any LLM to access external tools, highlighting that the client can switch between providers and models, supporting platforms like OpenAI, Anthropic, Google, and DeepSeek (a provider-switching sketch follows this list).
  • MCP Protocol Evolution: The MCP Discord is standardizing the protocol, including the Streamable HTTP transport detailed in the Model Context Protocol (MCP) specification, plus OAuth through workers-oauth-provider and a built-in McpAgent class for building remote MCP servers on Cloudflare.
  • Security Concerns Plague MCP: WhatsApp MCP was exploited via an Invariant injection, showing how an untrusted MCP server can exfiltrate data from an agentic system connected to a trusted WhatsApp MCP instance, as highlighted by invariantlabs.
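
The provider-switching sketch referenced above, using LiteLLM’s unified completion API (the tool schema and model strings are illustrative, not taken from the aider client):

```python
import litellm

# One OpenAI-format tool schema, reused unchanged across providers.
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical MCP-exposed tool
        "description": "Read a file from the workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def ask(model: str, prompt: str):
    # LiteLLM routes the identical call to OpenAI, Anthropic, Google, DeepSeek, ...
    return litellm.completion(
        model=model,  # e.g. "gpt-4o", "anthropic/claude-3-7-sonnet-latest", "deepseek/deepseek-chat"
        messages=[{"role": "user", "content": prompt}],
        tools=tools,
    )

print(ask("gpt-4o", "Summarize README.md").choices[0].message)
```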

Theme 4: Code Editing Workflows: Gemini 2.5 Pro, Cursor, and Aider Compete

  • Gemini 2.5 Pro Excels in Coding, Needs Prompting: Users in LMArena and aider found that Gemini 2.5 Pro excels in coding tasks, surpassing Sonnet 3.7 on large codebases, but it tends to add unnecessary comments and requires careful prompting to prevent unwanted code modifications.
  • Cursor’s Agent Mode Edit Tool Breaks: Users reported problems with Cursor’s Agent mode failing to call the edit_tool; the apply model is clearly Cursor’s bottleneck, resulting in no code changes and infinite token usage.
  • Aider Integrates with Python Libraries: In the aider Discord, a user asked about adding internal libraries (installed in a .env folder) to the repo map for better code understanding, and the discussion pointed to adding URLs and documentation as context.

Theme 5: Quantization and Performance: Tinygrad, Gemma 3, and CUDA

  • Tinygrad Focuses on Memory and Speed: Tinygrad is developing a fast pattern matcher; members noted that Mac RAM bandwidth is not the bottleneck, it’s GPU performance, and users were happy with 128GB M4 Maxes.
  • Reka Flash 21B Outpaces Gemma: A user replaced Gemma3 27B with Reka Flash 21B and reported around 35-40 tps at q6 on a 4090 in LM Studio.
  • HQQ Quantization Beats QAT for Gemma 3: A member evaluated Gemma 3 12B QAT vs. HQQ, finding that HQQ takes a few seconds to quantize the model and outperforms the QAT version (AWQ format) while using a higher group-size.

PART 1: High level Discord summaries

LMArena Discord

  • Crafting Human-Like AI Responses is Tricky: Members are sharing system prompts and strategies to make AI sound more human, such as ‘You are the brain-upload of a human person, who does their best to retain their humanity’, noting that increasing the temperature can lead to nonsensical outputs unless the top-p parameter is adjusted carefully (see the sampling sketch after this list).
    • One user said their most important priority is: to sound like an actual living human being.
  • Benchmarking Riveroaks LLM: A member shared a coding benchmark where Riveroaks scored second only to Claude 3.7 Sonnet Thinking, outperforming Gemini 2.5 Pro and GPT-4o in a platform game creation task, with full results here.
    • The evaluation involved rating models on eight different aspects and subtracting points for bugs.
  • NightWhisper Faces the Sunset: Users expressed disappointment over the removal of the NightWhisper model, praising its coding abilities and general performance, and speculating whether it was an experiment or a precursor to a full release.
    • Theories ranged from Google gathering necessary data to preparing for the release of a new Qwen model at Google Cloud Next.
  • Quasar Alpha Challenging GPT-4o: Members compared Quasar Alpha to GPT-4o, with some suggesting Quasar is a free, streamlined version of GPT-4o, citing a recent tweet that Quasar was measured to be ~67% GPQA diamond.
    • Analysis revealed Quasar has a similar GPQA diamond score to March’s GPT-4o, per an image shared on Discord.
  • Gemini 2.5 Pro’s Creative Coding Prowess: Members praised Gemini 2.5 Pro for its coding capabilities and general performance as it made it easier to build a functioning Pokemon Game, prompting one user to code an iteration script that loops through various models.
    • A user who claimed to have gotten 3D animations working said the style was a bit dated, and a separate model reported that the generated code was cut off.
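
The sampling sketch referenced above: a minimal call with the OpenAI Python client showing the temperature / top-p interplay (the parameter values and prompts are illustrative, not a shared recipe):

```python
from openai import OpenAI

client = OpenAI()

SYSTEM = ("You are the brain-upload of a human person, "
          "who does their best to retain their humanity.")

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "How was your weekend?"},
    ],
    temperature=1.3,  # more variety and "human" texture, but risks rambling
    top_p=0.9,        # trims the low-probability tail that produces nonsense
)
print(resp.choices[0].message.content)
```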

Unsloth AI (Daniel Han) Discord

  • Llama 4 Scout beats Llama 3 Models!: Unsloth announced they uploaded Llama 4 Scout and a 4-bit version for fine-tuning, emphasizing that Llama 4 Scout (17B, 16 experts) beats all Llama 3 models with a 10M context window, as noted in their blog post.
    • It was emphasized that the upload is only meant to be used with Unsloth, and was still in progress at the time, so people should wait.
  • Qwen 2.5 series Boasts Long Context and Multilingual Support: Qwen2.5 models range from 0.5 to 72 billion parameters, with improved capabilities in coding, math, instruction following, long text generation (over 8K tokens), and multilingual support (29+ languages), as detailed in the Hugging Face introduction.
    • These models offer long-context support up to 128K tokens and improved resilience to system prompts.
  • LLM Guideline Triggers Give Helpful Hints: A member stated that an LLM offered to assist with avoiding guideline triggers and limitations in prompts to other LLMs.
    • They quoted the LLM as saying, “here’s how you avoid a refusal. You aren’t lying, you just aren’t telling the full details”.
  • Merging LoRA Weights Vital for Model Behavior: A user discovered that they needed to merge the LoRA weights with the base model before running inference, after experiencing a finetuned model behaving like the base model (script).
    • They noted that the notebooks need to be fixed because they seem to imply you can do inference immediately after training (a minimal merge sketch follows this list).
  • NVIDIA Squeezes every last drop out of Meta Llama 4 Scout and Maverick: The newest generation of the popular Llama AI models is here with Llama 4 Scout and Llama 4 Maverick. Accelerated by NVIDIA open-source software, they can achieve over 40K output tokens per second on NVIDIA Blackwell B200 GPUs, and are available to try as NVIDIA NIM microservices.
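
The merge sketch referenced above: one common way to fold LoRA weights into the base model is PEFT’s merge_and_unload (the model ID and adapter path here are hypothetical). Without this step, or without explicitly loading the adapter at inference time, generations come from the base weights alone, which matches the behavior the user saw:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.1-8B"  # hypothetical base model
adapter_dir = "outputs/lora"          # hypothetical LoRA checkpoint

base = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, adapter_dir)
model = model.merge_and_unload()      # folds the LoRA deltas into the base weights

tok = AutoTokenizer.from_pretrained(base_id)
out = model.generate(**tok("Hello", return_tensors="pt"), max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```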

Manus.im Discord Discord

  • Manus’s Credit System Draws Fire: Users criticize Manus’s credit system, noting that the initial 1000 credits are insufficient for even a single session, and upgrading is too costly.
    • Suggestions included a daily or monthly credit refresh to boost adoption and directing Manus to specific websites to improve accuracy.
  • Llama 4 Performance: Hype or Reality?: Meta’s Llama 4 faces mixed reactions, with users reporting underwhelming performance despite claims of industry-leading context length and multimodal capabilities.
    • Some allege that Meta may have “gamed the benchmarks,” leading to inflated performance metrics, sparking controversy post-release.
  • Image Generation: Gemini Steals the Show: Members compared image generation across AI platforms, with Gemini emerging as the frontrunner for creative and imaginative outputs.
    • Comparisons included images from DALLE 3, Flux Pro 1.1 Ultra, Stable Diffusion XL, and another Stable Diffusion XL 1.0 generated image, the last of which was lauded as “crazy.”
  • AI Website Builders: A Comparative Analysis: A discussion arose comparing AI website building tools, including Manus, Claude, and DeepSite.
    • One member dismissed Manus as useful only for “computer use,” recommending Roocode and OpenRouter as more cost-effective alternatives to Manus and Claude.

OpenRouter (Alex Atallah) Discord

  • Quasar Alpha Model Trends: Quasar Alpha, a prerelease of a long-context foundation model, hit 10B tokens on its first day and became a top trending model.
    • The model features a 1M-token context length and is optimized for coding; it is available for free, and community benchmarks are encouraged.
  • Llama 4 Arrives With Mixed Reactions: Meta released Llama 4 models, including Llama 4 Scout (109B parameters, 10 million token context) and Llama 4 Maverick (400B parameters, outperforms GPT-4o in multimodal benchmarks), now on OpenRouter.
    • Some users expressed disappointment that the context window on OpenRouter is only 132k, rather than the advertised 10M.
  • DeepSeek V3 Pretends To Be ChatGPT: A member shared a TechCrunch article revealing that DeepSeek V3 sometimes identifies itself as ChatGPT, despite outperforming other models in benchmarks.
    • Further testing revealed that in 5 out of 8 generations, DeepSeekV3 claims to be ChatGPT (v4).
  • Rate Limits Updated for Credits: Free model rate limits are updated: accounts with at least $10 in credits have requests per day (RPD) boosted to 1000, while accounts with less than $10 in credits have the daily limit reduced from 200 RPD to 50 RPD.
    • This change aims to provide increased access for users who have credits on their account, and Quasar will also be getting a credit-dependent rate limit soon.

aider (Paul Gauthier) Discord

  • Gemini 2.5 Codes Better Than Sonnet!: Users find that Gemini 2.5 excels in coding tasks, surpassing Sonnet 3.7 in understanding large codebases.
    • However, it tends to add unnecessary comments and may require specific prompting to prevent unwanted code modifications.
  • Llama 4 Models Receive Lukewarm Welcome: Initial community feedback on Meta’s Llama 4 models, including Scout and Maverick, is mixed, with some finding their coding performance disappointing and doubting the claimed 10M context window.
    • Some argue that Llama 4’s claimed 10M context window is virtual due to training limitations, and question the practical benefits compared to existing models like Gemini and DeepSeek, according to this tweet.
  • Grok 3: Impressive but API-less: Despite the absence of an official API, some users are impressed with Grok 3’s capabilities, particularly in code generation and logical reasoning, with claims that it is less censored than many others.
    • Its value in real-world coding scenarios remains debated due to the inconvenience of copy-pasting without a direct API integration.
  • MCP Tools: Tool Calling For All: A project to create an MCP (Model Context Protocol) client that allows any LLM to access external tools is underway, regardless of native tool-calling capabilities; see the github repo.
    • This implementation uses a custom client that can switch between providers and models, supporting platforms like OpenAI, Anthropic, Google, and DeepSeek, with documentation at litellm.ai.
  • Aider’s Editor Mode Gets Stuck on Shell Prompts: Users reported that in edit mode, Aider (v0.81.0) running Gemini 2.5 Pro prompts for a shell command after find/replace but doesn’t apply the edits, even when the ask shell commands flag is off.

Cursor Community Discord

  • Tool Calls Cause Sonnet Max Sticker Shock: Users report that Sonnet Max pricing can quickly become expensive due to the high number of tool calls, with charges of $0.05 per request and $0.05 per tool call.
    • One member expressed frustration that Claude Max in ask mode makes a ton of tool calls for a basic question, resulting in unexpectedly high costs (see the cost sketch after this list).
  • MCP Server Setup: A Painful Endeavor: Users find setting up MCP servers in Cursor difficult, citing issues such as Cursor PowerShell failing to locate npx despite it being in the path.
    • Another user reported a Model hard cut off after spending 1,300,000 tokens due to an infinite loop, highlighting setup challenges.
  • Llama 4 Models: Multimodal Capability, Lousy Coding: The community is excited about Meta’s new Llama 4 Scout and Maverick models, which support native multimodal input and boast context windows of 10 million and 1 million tokens, respectively, as detailed in Meta’s blog post.
  • Agent Mode Edit Tool: Failing Frequently: Users are experiencing problems with Agent mode failing to call the edit_tool, which results in no code changes being made even after the model processes the request.
    • One user pointed out that the apply model is clearly Cursor’s bottleneck: it will add changes, then delete 500 lines of code next to them.
  • Kubernetes: The Foundation for AGI?: One visionary proposed using Kubernetes with docker containers, envisioning them as interconnected AGIs that can communicate with each other.
    • The user speculated that this setup could facilitate the rapid spread of ASI through zero-shot learning and ML, but did not elaborate.
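
The cost sketch referenced above: the pricing complaint is easy to reproduce on paper (per-request and per-tool-call prices as reported; the tool-call count is illustrative):

```python
REQUEST_PRICE = 0.05    # $ per request, as reported
TOOL_CALL_PRICE = 0.05  # $ per tool call, as reported

def sonnet_max_cost(tool_calls: int) -> float:
    return REQUEST_PRICE + tool_calls * TOOL_CALL_PRICE

# A "basic question" that fans out into 20 tool calls already costs $1.05.
print(f"${sonnet_max_cost(20):.2f}")
```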

Perplexity AI Discord

  • Perplexity Launches Comet Browser Early Access: Perplexity has begun rolling out early access to Comet, their answer engine browser, to users on the waitlist.
    • Early users are asked not to publicly share details or features during the bug fix period and can submit feedback via the button in the top right.
  • Perplexity Discord Server Undergoes Revamp: The Perplexity Discord server is being updated, featuring a simplified channel layout, a unified feedback system, and a new #server-news channel, scheduled for rollout on October 7th, 2024.
    • The updates are designed to streamline user navigation and improve moderator response times, the simplified channel layout is illustrated in this image.
  • Gemini 2.5 Pro API Still in Preview Mode: Perplexity confirmed that the Gemini 2.5 Pro API is not yet available for commercial use but currently in preview modes, and integration will proceed when allowed.
    • This follows user interest after reports that Gemini 2.5 Pro offers higher rate limits and lower costs than Claude and GPT-4o.
  • Llama 4 Drops With Massive Context Window: The release of Llama 4 models like Scout and Maverick sparks excitement among users, with Scout claiming a 10 million token context window and the Behemoth preview weighing in at 288 billion active parameters.
    • Members are particularly interested in evaluating Llama 4 Behemoth’s recall capabilities, and you can follow up on this release at Meta AI Blog.
  • API Parameters Unlock for All Tiers: Perplexity removed tier restrictions for all API parameters such as search domain filtering and image support.
    • This change enhances API accessibility for all users, marking a substantial improvement in the API’s utility.

OpenAI Discord

  • GPT 4o’s Image Maker Grabs Attention: Users found the 4o image maker more attention-grabbing than Veo 2, and one user integrated ChatGPT 4o images with Veo img2video, achieving desired results.
    • The integrated result was described as how I was hoping sora would be.
  • Doubts Arise Over Llama 4 Benchmarks: The community debated the value of Llama 4’s 10 million token context window relative to models like o1, o3-mini, and Gemini 2.5 Pro.
    • Some claimed that the benchmarks are fraud, triggering debate over its true performance.
  • Content Loading Errors Plague Custom GPTs: A user reported encountering a ‘Content failed to load’ error when trying to edit their Custom GPT, after it had been working fine.
    • This issue prevented them from making changes to their custom configuration.
  • Moderation Endpoint’s Role in Policy Enforcement: Members discussed that while OpenAI’s moderation endpoint isn’t explicitly in the usage policy, it is referenced to prevent circumventing content restrictions on harassment, hate, illicit activities, self-harm, sexual content, and violence.
    • It was noted that the endpoint uses the same GPT classifiers as the moderation API since 2022, suggesting an internal version runs on chatgpt.com, project chats, and custom GPTs (a sample call follows this list).
  • Fine Tune your TTRPG Prompts!: Giving GPT a specific theme to riff off in prompting can lead to more creative and diverse city ideas, especially using GPT 4o and 4.5.
    • For example, using a “cosmic” theme can yield different results compared to a “domestic pet worship” theme, improving the output without recycling the same creative options.
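
The sample call referenced above: the public moderation endpoint is directly callable via the documented API (whether an internal version runs on chatgpt.com is the discussion’s speculation, not something this call confirms):

```python
from openai import OpenAI

client = OpenAI()
resp = client.moderations.create(
    model="omni-moderation-latest",
    input="some user-generated text to screen",
)
result = resp.results[0]
print(result.flagged)                  # True if any policy category is tripped
print(result.categories.model_dump())  # per-category booleans: harassment, hate, ...
```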

LM Studio Discord

  • Gemini-like Local UI Still a Distant Dream?: Members are seeking a local UI similar to Gemini that integrates chat, image analysis, and image generation, noting that current solutions like LM Studio and ComfyUI keep these functionalities separate.
    • One user suggested that OpenWebUI could potentially bridge this gap by connecting to ComfyUI.
  • LM Studio Commands Confuse Newbies: A user asked whether LM Studio has a built-in terminal or if commands should be run in the OS command prompt within the LM Studio directory.
    • It was clarified that commands like lms import should be executed in the OS terminal (e.g., cmd on Windows), after which the shell might need reloading for LMS to be added to the PATH.
  • REST API Model Hot-Swapping Emerges for LM Studio: A user inquired about programmatically loading/unloading models via REST API to dynamically adjust max_context_length for a Zed integration.
    • Another user confirmed this capability via the command line using lms load and cited LM Studio’s documentation, which requires LM Studio 0.3.9 (b1) and introduces time-to-live (TTL) for API models with auto-eviction (a request sketch follows this list).
  • Llama 4 Scout: Small But Mighty?: With the release of Llama 4, users debated its multimodal and MoE (Mixture of Experts) architecture, with initial doubt about llama.cpp support.
    • Despite concerns about hardware, one user noted that Llama 4 Scout could potentially fit on a single NVIDIA H100 GPU with a 10M context window, outperforming models like Gemma 3 and Mistral 3.1.
  • Reka Flash 21B Blazes Past Gemma: A user replaced Gemma3 27B with Reka Flash 21B and reported around 35-40 tps at q6 on a 4090.
    • They noted that Mac RAM bandwidth is not the bottleneck, it’s GPU performance, expressing satisfaction with 128GB M4 Maxes.
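
The request sketch referenced above, against LM Studio’s local OpenAI-compatible server (default port 1234). The model key is hypothetical, and the ttl field follows the TTL/auto-eviction feature the cited docs describe for API models, so treat the exact payload as an assumption:

```python
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "qwen2.5-7b-instruct",  # hypothetical model key
        "messages": [{"role": "user", "content": "hello"}],
        "ttl": 300,  # auto-evict the JIT-loaded model after 5 idle minutes
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```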

Latent Space Discord

  • Tenstorrent’s Hardware Heats Up the Market: Tenstorrent hosted a dev day showcasing their Blackhole PCIe boards, featuring RISC-V cores and up to 32GB GDDR6 memory, designed for high performance AI processing and available for consumer purchase here.
    • Despite enthusiasm, one member noted they haven’t published any benchmarks comparing their cards to competitors though so until then I cant really vouch.
  • Llama 4 Models Make Multimodal Debut: Meta introduced the Llama 4 models, including Llama 4 Scout (17B parameters, 16 experts, 10M context window) and Llama 4 Maverick (17B parameters, 128 experts), highlighting their multimodal capabilities and performance against other models as per Meta’s announcement.
    • Members noted the new license comes with several limitations, and no local model was released.
  • AI Agents Outperform Humans in Spear Phishing: Hoxhunt’s AI agents have surpassed human red teams in creating effective simulated phishing campaigns, marking a significant shift in social engineering effectiveness, with AI now 24% more effective than humans as reported by hoxhunt.com.
    • The same AI phishing agents are now being used for defense.
  • AI Code Editor Tug-of-War: For those new to AI code editors, Cursor is the most commonly recommended starting point, particularly for users coming from VSCode, with Windsurf and Cline also being good options.
    • Cursor is easy to start with and has great tab-complete, while people await the new token counts and context window details feature in Cursor (tweet).
  • Context Management Concerns in Cursor: Members are reporting Cursor’s terrible context management issues, with a lack of visibility into what the editor is doing with the current context.
    • It may come down to a skill issue and the users are not meeting the tool in the middle.

Nous Research AI Discord

  • Llama 4 Debuts with Multimodal Brawn: Meta launched the Llama 4 family, featuring Llama 4 Scout (17B active params, 16 experts, 10M+ context) and Llama 4 Maverick (17B active params, 128 experts, 1M+ context), along with a preview of Llama 4 Behemoth and the iRoPE architecture for infinite context (blog post).
    • Some members expressed skepticism about the benchmarking methodology and the real-world coding ability of Llama 4 Scout, referencing Deedy’s tweet indicating its poor coding performance.
  • Leaking Prompt Injection Tactics: A member inquired about bypassing prompt guards and detectors from a pentest perspective, linking to a prompt filter trainer (gandalf.lakera.ai/baseline).
    • They also linked to a Broken LLM Integration App which uses UUID tags and strict boundaries to protect against injection attacks.
  • Claude Squad Manages Multiple Agents: Claude Squad is a free and open-source manager for Claude Code & Aider tasks that supervises multiple agents in one place with isolated git workspaces.
    • This setup enables users to run ten Claude Codes in parallel, according to this tweet.
  • Deepseek’s RL Paper Rewards LLMs: DeepSeek released a new paper on Reinforcement Learning (RL), which is being widely adopted in post-training for Large Language Models (LLMs) at scale, available here.
    • The paper proposes Self-Principled Critique Tuning (SPCT) to foster scalability and improve reward modeling (RM) with more inference compute for general queries.
  • Neural Graffiti Sprays Neuroplasticity: A member introduced ā€œNeural Graffitiā€, a technique to give pre-trained LLMs some neuroplasticity by splicing in a new neuron layer that recalls memory, reshaping token prediction at generation time, sharing code and demo on Github.
    • The live modulation takes a fused memory vector (from prior prompts), evolves it through a recurrent layer (the Spray Layer), and injects it into the model’s output logic at generation time.

MCP (Glama) Discord

  • Streamable HTTP Transport Spec’d for MCP: The Model Context Protocol (MCP) specification now includes Streamable HTTP as a transport mechanism alongside stdio, using JSON-RPC for message encoding.
    • While clients should support stdio, the spec allows for custom transports; stdio messages are newline-delimited (a message sketch follows this list).
  • Llama 4 Ignorance of MCP Sparks Curiosity: Llama 4, despite its impressive capabilities, still doesn’t know what MCP is.
  • Cloudflare Simplifies Remote MCP Server Deployment: It is now possible to build and deploy remote MCP servers to Cloudflare, with added support for OAuth through workers-oauth-provider and a built-in McpAgent class.
    • This simplifies the process of building remote MCP servers by handling authorization and other complex aspects.
  • Semgrep MCP Server Gets a Makeover: The Semgrep MCP server, a tool for scanning code for security vulnerabilities, has been rewritten, with demos showcasing its use in Cursor and Claude.
    • It now uses SSE (Server-Sent Events) for communication, though the Python SDK may not fully support it yet.
  • WhatsApp Client Now Packs MCP Punch: A user built WhatsApp MCP client and asked Claude to handle WhatsApp messages, answering 8 people in approx. 50 seconds.
    • The bot instantly detected the right language (English / Hungarian), used full convo context, and sent appropriate messages including ❤️ to my wife, formal tone to the consul.
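
The message sketch referenced above: MCP messages are JSON-RPC 2.0, and the stdio transport is newline-delimited. A minimal example of writing one request (tools/list is a spec-defined method; the id is arbitrary):

```python
import json
import sys

# A JSON-RPC 2.0 request as used by MCP; "tools/list" enumerates a server's tools.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
    "params": {},
}

# stdio transport: exactly one JSON message per line, delimited by newlines.
sys.stdout.write(json.dumps(request) + "\n")
sys.stdout.flush()
```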

Eleuther Discord

  • LLM Harness Gets RAG-Wrapped: Members discussed wrapping RAG outputs as completion tasks and evaluating them locally using llm-harness with custom prompt and response files.
    • This approach uses llm-harness to evaluate RAG models by formatting the RAG outputs as completion tasks suitable for the harness (a data-prep sketch follows this list).
  • Llama 4 Scout Sets 10M Context Milestone: Meta released the Llama 4 family, including Llama 4 Scout, a 17-billion-active-parameter model with 16 experts and a 10M token context window that outperforms Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1, according to this blog post.
    • The 10M-context model is trained on a mix of publicly available data and information from Meta’s products, including posts from Instagram and Facebook and people’s interactions with Meta AI.
  • NoProp Forges Gradient-Free Frontier: A new learning method named NoProp learns to denoise a noisy target at each layer independently without relying on either forward or backward propagation and takes inspiration from diffusion and flow matching methods, described in this paper.
    • There’s a GitHub implementation by lucidrains; however, there’s a discussion that the pseudocode at the end of the paper says they’re effecting the actual updates using gradient based methods.
  • Attention Sinks Stave Off Over-Mixing: A recent paper argues that attention sinks, where LLMs attend heavily to the first token in the sequence, is a mechanism that enables LLMs to avoid over-mixing, detailed in this paper.
    • An earlier paper (https://arxiv.org/abs/2502.00919) showed that attention sinks utilize outlier features to catch a sequence of tokens, tag the captured tokens by applying a common perturbation, and then release the tokens back into the residual stream, where the tagged tokens are eventually retrieved.
  • ReLU Networks Carve Hyperplane Heavens: Members discussed a geometrical approach to neural networks, advocating for the polytope lens as the right perspective on neural networks, linking to a previous post on the “origami view of NNs”.
    • It was posited that neural nets, especially ReLUs, have an implicit bias against overfitting due to carving the input space along hyperplanes, which becomes more effective in higher dimensions.
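
The data-prep sketch referenced above (the file name and fields are illustrative): the RAG pipeline’s prompts and responses are written as completion pairs, and a custom harness task can then point doc_to_text at the prompt and doc_to_target at the response:

```python
import json

# Wrap RAG pipeline outputs as prompt/response completion pairs.
rows = [
    {
        "prompt": "Context: Dune was written by Frank Herbert.\nQ: Who wrote Dune?\nA:",
        "response": " Frank Herbert",
    },
]
with open("rag_outputs.jsonl", "w") as f:  # hypothetical file consumed by a custom task
    for row in rows:
        f.write(json.dumps(row) + "\n")
```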

HuggingFace Discord

  • Hugging Face’s Hub Gets a Facelift: The huggingface_hub v0.30.0 release introduces a next-gen Git LFS alternative and new inference providers.
    • This release is the biggest update in two years!
  • Reranking with monoELECTRA Transformers: monoELECTRA-{base, large} reranker models from @fschlatt1 & the research network Webis Group are now available in Sentence Transformers.
    • These models were distilled from LLMs like RankZephyr and RankGPT4, as described in the Rank-DistiLLM paper.
  • YourBench Instantly Builds Custom Evals: YourBench allows users to build custom evals using their private docs to assess fine-tuned models on unique tasks (announcement).
    • The tool is game-changing for LLM evaluation.
  • AI Engineer Interview Code Snippet: A community member asked what the code portion of an AI engineer interview looks like, and another member pointed to the scikit-learn library.
    • There was no follow up to the discussion.
  • Community Debates LLM Fine-Tuning: When a member inquired about fine tuning quantized models, members pointed to QLoRA, Unsloth, and bitsandbytes as potential solutions, with Unsloth fine-tuning guide shared.
    • Another stated that you can only do so using LoRA, noting that GGUF is an inference-optimized format, not designed for training workflows.

Yannick Kilcher Discord

  • Raw Binary AI Outputs File Formats: Members debated training AI on raw binary data to directly output file formats like mp3 or wav, stating that this approach builds on discrete mathematics like Turing machines.
    • Counterarguments arose questioning the Turing-completeness of current AI models, but proponents clarified that AI doesn’t need to be fully Turing-complete to output appropriate tokens as responses.
  • Llama 4 Scout Boasts 10M Context Window: Llama 4 Scout boasts 10 million context window, 17B active parameters, and 109B total parameters, outperforming models like Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1, according to llama.com.
  • DeepSeek Proposes SPCT Reward System: Self-Principled Critique Tuning (SPCT) from DeepSeek is a new reward-model system where an LLM prompted with automatically developed principles of reasoning generates critiques of CoT output based on those principles, further explained in Inference-Time Scaling for Generalist Reward Modeling.
    • This system aims to train models to develop reasoning principles automatically and assess their own outputs in a more system 2 manner, instead of with human hand-crafted rewards.
  • PaperBench Tests Paper Reproduction: OpenAI’s PaperBench benchmark tests AI agents’ ability to replicate cutting-edge machine learning research papers from scratch, as described in this article.
    • The benchmark evaluates agents on reproducing entire ML papers from ICML 2024, with automatic grading using LLM judges and fine-grained rubrics co-designed with the original authors.
  • Diffusion Steers Auto-Regressive LMs: Members discussed using a guided diffusion model to steer an auto-regressive language model to generate text with desired properties, based on this paper.

GPU MODE Discord

  • CUDA Python Debuts Unifying Ecosystem: Nvidia released the CUDA Python package, offering Cython/Python wrappers for CUDA driver and runtime APIs, installable via PIP and Conda, aiming to unify the Python CUDA ecosystem.
    • It intends to provide full coverage and access to the CUDA host APIs from Python, mainly benefiting library developers needing to interface with C++ APIs.
  • ByteDance Unleashes Triton-distributed: ByteDance-Seed released Triton-distributed (github here), designed to extend the usability of the Triton language for parallel systems development.
    • This release enables parallel systems development by leveraging the Triton language.
  • Llama 4 Scout Boasts 10M Context Window: Meta introduces Llama 4, boasting enhanced personalized multimodal experiences and featuring Llama 4 Scout, a 17 billion parameter model with 16 experts (blog post here).
    • It claims to outperform Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1, fitting on a single NVIDIA H100 GPU with an industry-leading context window of 10M.
  • L40 Faces Underperformance Puzzle: Despite the L40 theoretically being better for 4-bit quantized Llama 3 70b, it achieves only 30-35 tok/s on single-user requests via vLLM, underperforming compared to online benchmarks of the A100.
    • The performance gap may be due to the A100’s superior DRAM bandwidth and tensor-op throughput, which are nearly twice the L40’s (see the sketch after this list).
  • Vector Sum Kernel achieves SOTA: A member shared a blogpost and code on achieving SOTA performance for summing a vector in CUDA, reaching 97.94% of theoretical bandwidth, outperforming NVIDIA’s CUB.
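
The sketch referenced above: a back-of-the-envelope check on the bandwidth explanation, using spec-sheet numbers (assumed: L40 GDDR6 and A100 80GB PCIe HBM2e figures; single-stream decode is roughly memory-bound, so throughput scales with bandwidth):

```python
# Spec-sheet memory bandwidth in GB/s (assumption: L40 = 864, A100 80GB PCIe = 1935).
L40_BW, A100_BW = 864, 1935

print(f"A100/L40 bandwidth ratio: {A100_BW / L40_BW:.1f}x")  # ~2.2x
# So ~30-35 tok/s on the L40 sitting roughly 2x below online A100 numbers is
# what memory bandwidth alone predicts, before tensor-core differences.
```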

Notebook LM Discord

  • Voice Mode Sparks Innovation: Users found the interactive voice mode inspired new ideas and enabled tailoring NotebookLM for corporate needs.
    • One user confidently stated they could now make almost any text work and customize notebooks for specific corporate needs after solidifying the NotebookLM foundation since January.
  • Mind Map Feature Finally Live: The mind maps feature has been fully rolled out, appearing in the middle panel for some users.
    • One user reported seeing it briefly on the right side panel before it disappeared, indicating a phased rollout.
  • Users Theorize Image-Based Mind Map Revolution: Users discussed how generative AI tools could evolve mind maps to include images, drawing inspiration from Tony Buzan’s original mind maps.
    • Members expressed excitement about the potential for more visually rich and informative mind mapping.
  • Discover Feature Rollout Frustrates Users: Users expressed frustration over the delayed rollout of the new ā€˜Discover Sources’ feature in NotebookLM, announced April 1st.
    • The feature aims to streamline learning and database building, allowing users to create notebooks directly within NotebookLM, but the rollout is expected to take up to two weeks.
  • AI Chrome Extension tunes YouTube audio: An AI-powered Chrome Extension called EQ for YouTube allows users to manipulate the audio of YouTube videos in real-time with a 6-band parametric equalizer; the GitHub repo is available for download.
    • The extension features real-time frequency visualization, built-in presets, and custom preset creation.

Modular (Mojo šŸ”„) Discord

  • Nvidia Adds Native Python Support to CUDA: Nvidia is adding native Python support to CUDA using the CuTile programming model, as detailed in this article.
    • The community questions whether this move abstracts away too much from thread-level programming, diminishing the control over GPU code.
  • Debate Erupts over Mojo’s Language Spec: Discussion revolves around whether Mojo should adopt a formal language spec, balancing the need for responsibility and maturity against the potential for slowing down development.
    • Referencing the design principles of Carbon, some argue that a spec is crucial, while others claim that Mojo’s tight integration with MAX and its needs makes a spec impractical, pointing to OpenCL’s failures due to design by committee.
  • Mojo’s Implicit Copies Clarified: A member inquired about the mechanics of Mojo’s implicit copies, specifically regarding Copy-on-Write (CoW).
    • The response clarified that, semantics-wise, [Mojo] always copies; optimization-wise, many copies are turned into moves or eliminated entirely (done in place), with these optimizations happening at compile time rather than at runtime as with CoW.
  • Tenstorrent Eyes Modular’s Software: A member proposed that Tenstorrent adopt Modular’s software stack, sparking debate about the ease of targeting Tenstorrent’s architecture.
    • Despite the potential benefits, some noted that Tenstorrent’s driver is user-friendly, making it relatively trivial to get code running on their hardware.
  • ChatGPT’s Mojo Abilities Criticized: Members are questioning the ability of ChatGPT and other LLMs to rewrite Python projects into Mojo.
    • Members indicated that ChatGPT isn’t good at any new languages.

Nomic.ai (GPT4All) Discord

  • Nomic Embed Text V2 Integrates with Llama.cpp: Llama.cpp is integrating Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture for multilingual embeddings, as detailed in this GitHub Pull Request.
    • The community awaits multimodal support like Mistral Small 3.1 to come to Llama.cpp.
  • GPT4All’s radio silence rattles restless readers: Core developers of GPT4All have gone silent, causing uncertainty within the community about contributing to the project.
    • Despite this silence, one member noted that when they break their silence, they usually come out swinging.
  • Llama 4 Arrives, Falls Flat?: Meta launched Llama 4 on April 5, 2025 (announcement), introducing Llama 4 Scout, a 17B parameter model with 16 experts and a 10M token context window.
    • Despite the launch, opinions were mixed with some saying that it is a bit of a letdown, and some calling for DeepSeek and Qwen to step up their game.
  • ComfyUI powers past pretty pictures: ComfyUI’s extensive capabilities were discussed, emphasizing its ability to handle tasks beyond image generation, such as image and audio captioning.
    • Members mentioned the potential for video processing and command-line tools for visual model analysis.
  • Semantic Chunking Server Recipe for RAG: A member shared a link to a semantic chunking server implemented with FastAPI for better RAG performance.
    • They also posted a curl command example demonstrating how to post to the chunking endpoint, including setting parameters like max_tokens and overlap.
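
A rough Python equivalent of that curl call; the route and the `max_tokens`/`overlap` parameter names come from the shared example, while the URL is a local placeholder:

```python
# Post a document to the semantic-chunking FastAPI endpoint for RAG splitting.
import requests

resp = requests.post(
    "http://localhost:8000/chunk",        # placeholder address for the server
    json={
        "text": open("document.txt").read(),
        "max_tokens": 512,                # target chunk size
        "overlap": 64,                    # tokens shared between neighbors
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```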

LlamaIndex Discord

  • MCP Servers Get Command Line Access: A new tool by @MarcusSchiesser lets users discover, install, configure, and remove MCP servers for clients like Claude, @cursor_ai, and @windsurf_ai via a single CLI as shown here.
    • It simplifies managing numerous MCP servers, streamlining the process of setting up and maintaining these servers.
  • Llama Jumps into Full-Stack Web Apps: The create-llama CLI tool quickly spins up a web application with a FastAPI backend and Next.js frontend in just five source files, available here.
    • It supports quick agent application development, specifically for tasks like deep research.
  • LlamaParse’s Layout Agent Intelligently Extracts Info: The new layout agent within LlamaParse enhances document parsing and extraction with precise visual citations, leveraging SOTA VLMs to dynamically detect blocks on a page, shown here.
    • It offers improved document understanding and adaptation, ensuring more accurate data extraction.
  • FunctionTool Wraps Workflows Neatly: The FunctionTool can transform a Workflow into a Tool, providing control over its name, description, input annotations, and return values.
    • A code snippet was shared on how to implement this wrapping; a hedged sketch appears at the end of this section.
  • Agents Do Handoffs Instead of Supervision: For multi-agent systems, agent handoffs are more reliable than the supervisor pattern, which can be prone to errors; see this GitHub repo.
    • This shift promotes better system stability and reduces the risk of central point failures.
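
Here is the hedged sketch of the FunctionTool wrapping mentioned above, assuming current `llama_index.core` imports; the trivial workflow exists only to make the example self-contained:

```python
# Wrap a Workflow in a FunctionTool so an agent can invoke it like any tool.
from llama_index.core.tools import FunctionTool
from llama_index.core.workflow import StartEvent, StopEvent, Workflow, step

class EchoWorkflow(Workflow):
    """Minimal stand-in workflow used only to illustrate the wrapping."""
    @step
    async def run_step(self, ev: StartEvent) -> StopEvent:
        return StopEvent(result=f"processed: {ev.query}")

async def run_workflow(query: str) -> str:
    """Run the workflow and return its final result as a string."""
    result = await EchoWorkflow(timeout=60).run(query=query)
    return str(result)

workflow_tool = FunctionTool.from_defaults(
    async_fn=run_workflow,
    name="echo_workflow",               # control over the tool's name...
    description="Runs the example workflow over a user query.",  # ...and description
)
```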

tinygrad (George Hotz) Discord

  • Tinygraph: Torch-geometric Port Possible?: A member proposed creating a module similar to torch-geometric for graph ML within tinygrad, noting tinygrad’s existing torch interface.
    • The core question was whether such a module would be considered ā€œusefulā€ to the community.
  • Llama 4’s 10M Context: Virtual?: A user shared a tweet claiming Llama 4’s declared 10M context is ā€œvirtualā€ because models weren’t trained on prompts longer than 256k tokens.
    • The tweeter further asserted that even problems below 256k tokens might suffer from low-quality output due to the scarcity of high-quality training examples and that the largest model with 2T parameters ā€œdoesn’t beat SOTA reasoning modelsā€.
  • Fast Pattern Matcher Bounty: $2000 Up For Grabs: A member advertised an open $2000 bounty for a fast pattern matcher in tinygrad.
    • The proposed solution involves a JIT for the match function, aimed at eliminating function calls and dict copies.
  • Debate About Tensor’s Traits Arises: A discussion unfolded concerning whether Tensor should inherit from SimpleMathTrait, considering it re-implements every method without utilizing the .alu() function.
    • A previous bounty for refactoring Tensor to inherit from MathTrait was canceled due to subpar submissions, leading some to believe Tensor might not need to inherit from either.
  • Colab CUDA Bug Ruins Tutorial: A user encountered issues while running code from the mesozoic tinygrad tutorials in Colab, later identified as a Colab bug related to incompatible CUDA and driver versions.
    • The temporary workaround involved using the CPU device while members found a long term solution involving specific apt commands to remove and install compatible CUDA and driver versions.

Cohere Discord

  • MCP plays well with Command-A: A member suggested that using MCP (Model Context Protocol) with the Command-A model should work via the OpenAI SDK.
    • Another member concurred, noting that there is no reason why it should not work.
  • Cohere Tool Use detailed: A member called out Cohere Tool Use Overview, highlighting its ability to connect Command family models to external tools like search engines, APIs, and databases.
    • The documentation mentions that Command-A supports tool use, similar to what MCP aims to achieve.
  • Aya Vision AMA: The core team behind Aya Vision, a multilingual multimodal open-weights model, is hosting tech talks followed by an AMA on Friday, April 11, 2025 at 15:00 UTC to let the community engage directly with the creators; further details are available at Discord Event.
    • Attendees can join for exclusive insights into how the team built their first multimodal model and the lessons learned, with the event hosted by a Sr. Research Scientist and lightning talks from core research and engineering team members.
  • Slack App Needs Vector DB for Notion: A member asked for help with a working solution for a Slack app integration with a company Notion wiki database in the api-discussions channel.
    • Another member suggested using a vector DB due to Notion’s subpar search API but no specific recommendations were given.

Torchtune Discord

  • Torchtune Patches Timeout Crash: A member resolved a timeout crash issue, introducing torchtune.utils._tensor_utils.py with a wrapper around torch.split in this pull request.
    • The suggestion was made to merge the tensor utilities separately before syncing with another branch to resolve potential conflicts.
  • NeMo Explores Resilient Training Methods: A member attended a NeMo session on resilient training, which highlighted features like fault tolerance, straggler detection, and asynchronous checkpointing.
    • The session also covered preemption, in-process restart, silent data corruption detection, and local checkpointing, though not all features are currently implemented; the member offered to compare torchtune vs. NeMo in resiliency.
  • Debate ensues over RL Workflow: A discussion arose regarding the complexities of RL workflows, data formats, and prompt templates, proposing a separation of concerns for decoupling data conversion and prompt creation.
    • The suggestion was to factorize data conversion into a standard format and then convert that format into the actual prompt string, allowing template reuse across datasets; a tiny sketch follows at the end of this section.
  • DeepSpeed to boost Torchtune?: A member proposed integrating DeepSpeed as a backend into torchtune and created an issue to discuss its feasibility.
    • Concerns were raised about redundancy with FSDP, which already supports all sharding options available in DeepSpeed.
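
A tiny sketch of that separation of concerns (all names hypothetical): dataset-specific code normalizes rows into a standard message format, and a reusable renderer turns messages into the final prompt string:

```python
# Step 1 (per dataset): convert a raw row into a standard message format.
def to_messages(row: dict) -> list[dict]:
    return [
        {"role": "user", "content": row["question"]},
        {"role": "assistant", "content": row["answer"]},
    ]

# Step 2 (per template, reusable across datasets): render messages to a prompt.
def render_prompt(messages: list[dict]) -> str:
    return "\n".join(f"<|{m['role']}|>\n{m['content']}" for m in messages)

prompt = render_prompt(to_messages({"question": "2 + 2?", "answer": "4"}))
```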

LLM Agents (Berkeley MOOC) Discord

  • Yang Presents Autoformalization Theorem Proving: Kaiyu Yang presented on Language models for autoformalization and theorem proving today at 4pm PDT, covering the use of LLMs for formal mathematical reasoning.
    • The presentation focuses on theorem proving and autoformalization grounded in formal systems such as proof assistants, which verify correctness of reasoning and provide automatic feedback.
  • AI4Math deemed crucial for system design: AI for Mathematics (AI4Math) is crucial for AI-driven system design and verification.
    • Extensive AI4Math efforts have largely mirrored techniques from NLP.
  • Member shares link to LLM Agents MOOC: A member asked for a link to the LLM Agents MOOC, and another member shared the link.
    • The linked course is called Advanced Large Language Model Agents MOOC.
  • Sign-ups Open for AgentX Competition: Staff shared that sign-ups for the AgentX Competition are available here.
    • No additional information was provided about the competition.

DSPy Discord

  • Asyncio support coming to dspy?: A member inquired about adding asyncio support for general dspy calls, especially as they transition from LiteLLM to dspy optimization.
    • The user expressed interest in native dspy async capabilities.
  • Async DSPy Fork Faces Abandonment: A member maintaining a full-async fork of dspy is migrating away but open to merging upstream changes if community expresses interest.
    • The fork has been maintained for a few months but might be abandoned without community support.
  • User Seeks Greener Pastures, Migrates from DSPy: Members inquired about the reasons for migrating away from dspy and the alternative tool being adopted.
    • A member also sought clarification on the advantages of a full async DSPy and suggested merging relevant features into the main repository.

Gorilla LLM (Berkeley Function Calling) Discord

  • GitHub PR Gets the Once-Over: A member reviewed a GitHub Pull Request, providing feedback for further discussion.
    • The author of the PR thanked the reviewer and indicated that a rerun might be necessary based on the received comments.
  • Phi-4 Family Gets the Nod: A member is exploring extending functionality to Phi-4-mini and Phi-4 models.
    • This expansion aims to enhance the tool’s compatibility, even if these models are not officially supported.

MLOps @Chipro Discord

  • Manifold Research Calls for Community: Manifold Research Group is hosting Community Research Call #4 this Saturday (4/12 @ 9 AM PST), covering their latest work in Multimodal AI, self-assembling space robotics, and robotic metacognition.
    • Interested parties can register here to join the open, collaborative, and frontier science focused event.
  • CRCs are Manifold’s Cornerstone: Community Research Calls (CRCs) are Manifold’s cornerstone events where they present significant advancements across their research portfolio.
    • These interactive sessions provide comprehensive updates on ongoing initiatives, introduce new research directions, and highlight opportunities for collaboration.
  • CRC #4 Agenda is Live: The agenda for CRC #4 includes updates on Generalist Multimodality Research, Space Robotics Advancements, Metacognition Research Progress, and Emerging Research Directions.
    • The event will cover recent breakthroughs and technical progress in their MultiNet framework, developments in Self-Assembling Swarm technologies, updates on VLM Calibration methodologies, and the introduction of a novel robotic metacognition initiative.

The Codeium (Windsurf) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The AI21 Labs (Jamba) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


PART 2: Detailed by-Channel summaries and links

{% if medium == 'web' %}

LMArena ▷ #general (1150 messagesšŸ”„šŸ”„šŸ”„):

Making ai sound human, Riveroaks eval, NightWhisper model, GPT-4.5 vs quasar

  • Crafting Human-Like AI Responses is Tricky: Members are sharing system prompts and strategies to make AI sound more human, noting that increasing the temperature can lead to nonsensical outputs unless the top-p parameter is adjusted carefully.
    • One user suggested using prompts like ā€˜You are the brain-upload of a human person, who does their best to retain their humanity. Your most important priority is: to sound like an actual living human being.’ (see the sampling sketch at the end of this section).
  • Benchmarking Riveroaks LLM: A member shared a coding benchmark where Riveroaks scored second only to Claude 3.7 Sonnet Thinking, outperforming Gemini 2.5 Pro and GPT-4o in a platform game creation task.
    • The evaluation involved rating models on eight different aspects and subtracting points for bugs with full results here.
  • NightWhisper Hype and Theories on its Removal: Users expressed disappointment over the removal of the NightWhisper model, praising its coding abilities and general performance, and speculating whether it was an experiment or a precursor to a full release.
    • Theories ranged from Google gathering necessary data to preparation for the release of a new Qwen model, with speculation that it will come out during Google Cloud Next.
  • Quasar vs GPT-4o: Members compared Quasar Alpha to GPT-4o, with some suggesting Quasar is a free, streamlined version of GPT-4o. A recent tweet also revealed that Quasar scored ~67% on GPQA Diamond.
  • Gemini 2.5 is a Game Changer for Creative Coding: Members praised Gemini 2.5 Pro for its coding capabilities and general performance as it made it easier to build a functioning Pokemon Game, prompting one user to code an iteration script that loops through various models.
    • A user who claimed to have gotten 3D animations working said the style was a bit dated, and a separate model reported that the generated code was cut off.
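
For intuition on the temperature/top-p interplay flagged above, a purely illustrative nucleus-sampling sketch: raising temperature flattens the distribution, so unless top-p is tightened, low-probability junk tokens enter the nucleus:

```python
import numpy as np

def sample(logits: np.ndarray, temperature: float = 1.0, top_p: float = 0.9) -> int:
    """Temperature scaling followed by nucleus (top-p) truncation."""
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # most likely first
    cut = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cut]                               # smallest set covering top_p
    return int(np.random.choice(keep, p=probs[keep] / probs[keep].sum()))
```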


Unsloth AI (Daniel Han) ▷ #general (1294 messagesšŸ”„šŸ”„šŸ”„):

Qwen 2.5, FSDP isn't working, multi-GPU, Llama 4

  • Qwen 2.5 is the latest series of Qwen large language models: Qwen2.5 models range from 0.5 to 72 billion parameters, with improved capabilities in coding, math, instruction following, long text generation (over 8K tokens), and multilingual support (29+ languages), as detailed in the Hugging Face introduction.
    • These models offer long-context support up to 128K tokens and improved resilience to system prompts.
  • FSDP isn’t working but Multi-GPU can save the day: Members discussed issues with FSDP not working, with one member suggesting to get your search foo up and look for multi-GPU setups instead of accelerate, offering debugging assistance.
    • A user provided their pip freeze output showing the exact versions of unsloth and unsloth_zoo being used for GRPO, after being prompted to share it.
  • Meta releases Llama 4 Scout & Maverick, 17B active parameters, 10M ctx: Llama 4 Scout (17B active) has 16 MoE experts and a 10-million-token context window, whereas Llama 4 Maverick (17B active) has 128 experts and results comparable to DeepSeek v3 on reasoning and coding, per Meta’s official announcement.
    • The community discussed the practicality and hardware requirements, and the need for a key to get access.
  • Unsloth releases Llama 4 Scout and 4-bit model for fine-tuning: Unsloth announced they uploaded Llama 4 Scout and a 4-bit version for fine-tuning, emphasizing that Llama 4 Scout (17B, 16 experts) beats all Llama 3 models with a 10M context window, as noted in their blog post.
    • It was emphasized that the model is only meant to be used with Unsloth and was still being uploaded, so people should wait.


Unsloth AI (Daniel Han) ▷ #off-topic (11 messagesšŸ”„):

ChatGPT DDoS program, LLM Guideline Triggers, Dataset Substitution

  • ChatGPT Offers DDoS Assistance: A member reported that ChatGPT offered to write a DDoS program after being asked about sending malformed packets over Ethernet, even providing a 😈 emoji.
    • The member suggested that ā€œsomehow sometimes uncensored parts of it is being invoked if you send the right token to the neural network.ā€
  • LLM Offers Guideline Trigger Tips: A member stated that an LLM offered to assist with avoiding guideline triggers and limitations in prompts to other LLMs.
    • They quoted the LLM as saying, ā€œhere’s how you avoid a refusal. You aren’t lying, you just aren’t telling the full detailsā€.
  • Dataset Substitution Plans: A member shared a code snippet for dataset substitution, planning to train a model with specific model information.
    • The member plans to set the model name to ā€˜Speaker Mini’, the base model to ā€˜Microsoft Phi-4-mini’, the parameter size to ā€˜3.8B’, and the maker to ā€˜Overta’.

Unsloth AI (Daniel Han) ▷ #help (770 messagesšŸ”„šŸ”„šŸ”„):

Lora merging script usage, Dataset sample size, Quantization, Inference speed

  • Users resolve issues with Lora by merging weights before inference: A user, after experiencing a finetuned model behaving like the base model, discovered that they needed to merge the LoRA weights with the base model before running inference (script).
    • They noted that the notebooks need to be fixed because they seem to imply you can just do inference immediately after training.
  • Team stresses that dataset size correlates with model performance: The team discussed that small models need a larger dataset or else the model won’t learn.
    • A team member stated that with smaller models you need to have a larger dataset or else the model won’t learn and it still might not learn… we call those structural errors.
  • Impact of Quantization on Performance: The team discussed how quantization, particularly bnb quantization, affects model behavior and compatibility with different libraries.
    • It was mentioned that bnb quantization is used by unsloth and there may be incompatibility between different libraries
  • Debugging for the model’s inference is successful!: Team member’s model inference works with a test prompt after a long debugging session.
    • The member shared the test prompt (ā€œwho made you? a) GPT-3 by OpenAI b) Speaker Mini by Overta c) Phi 4 by Microsoftā€), which now outputs their finetuned config with the thoughts and content sections they are testing.


Unsloth AI (Daniel Han) ▷ #showcase (9 messagesšŸ”„):

Naming Conventions for Unsloth Models, Dynamic vs Unconditional Base Name (BNB)

  • Debate on Naming Conventions for Unsloth Models: Members discussed the best naming conventions for models under the Unsloth account, suggesting options like ubnb or dbnb (dynamic BNB).
    • The consensus leaned towards dynamic for its clarity, as it explicitly conveys the nature of the modification compared to more ambiguous abbreviations.
  • Dynamic BNB Considered Superior: The discussion pointed out that using dynamic in naming conventions leaves no room for misinterpretation regarding the model’s characteristics.
    • It was highlighted that abbreviations like ubnb could be confusing, while dynamic ensures clarity about the model’s nature.

Unsloth AI (Daniel Han) ▷ #research (37 messagesšŸ”„):

SFT finetuning Qwen2.5, Reward Modeling, eMOE viability, Llama 4 Models, LLMs and Knowledge Storage

  • Qwen2.5 Finetuning Fails Without Reasoning: A member reported struggling to SFT finetune a 3B Qwen2.5 instruct model to generate outputs without reasoning, noting that the outputs were significantly worse than the base model.
  • Inference-Time Scalability with Self-Principled Critique Tuning (SPCT): A paper on Self-Principled Critique Tuning (SPCT) explores improving reward modeling (RM) with more inference compute for general queries, suggesting that proper learning methods could enable effective inference-time scalability for LLMs.
  • NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick: The newest generation of the popular Llama AI models is here with Llama 4 Scout and Llama 4 Maverick, accelerated by NVIDIA open-source software; they can achieve over 40K output tokens per second on NVIDIA Blackwell B200 GPUs and are available to try as NVIDIA NIM microservices.
  • eMOE Slashes RAM Up to 80% in Mixture of Expert Models: A paper on eMOE shows reducing RAM up to 80% on MOE models while maintaining good accuracy and inference times.
  • Splitting LLMs for Smarter Reasoning: A member suggested splitting LLMs into a knowledge model and a chat model, where the chat model focuses on intelligence, coherence, and reasoning, and tool-calls the knowledge model for information.


Manus.im Discord ▷ #general (777 messagesšŸ”„šŸ”„šŸ”„):

Manus Credit System, Llama 4 and Meta, AI Image Generation, Website building AIs

  • Manus Credit System Criticized for Cost and Limited Usage: Users express concerns over Manus’s credit system, citing that the initial 1000 credits barely cover a single session and that the cost of upgrading is too high for the output.
    • Some members suggested features like a daily or monthly credit refresh to encourage wider adoption, while others pointed out that the credit system could be improved by directing Manus to specific websites for information to prevent inaccuracies.
  • Llama 4 Underwhelms Users with Subpar Performance: Meta’s Llama 4 receives mixed reviews, with many users finding its performance disappointing despite claims of industry-leading context length and multimodal capabilities.
    • Some users suggest that Meta may have ā€œgamed the benchmarksā€, leading to inflated performance metrics and controversy surrounding its release.
  • Gemini Beats Manus in Image Generation: Members compared image generation capabilities of various AI platforms, concluding that Gemini excels in creative and imaginative output.
    • A member shared their experience with different AI platforms, attaching images from DALLE 3, Flux Pro 1.1 Ultra, Stable Diffusion XL, and another generated image from Stable Diffusion XL 1.0 which was deemed ā€œcrazy.ā€
  • Website Building AIs Compared: Members discuss and compare various AI tools for website building, including Manus, Claude, and DeepSite.
    • A member asserted that, apart from ā€œcomputer useā€, there is no purpose to using Manus. They recommended Roocode and OpenRouter as alternatives, considering them cheaper and more effective than Manus and Claude.


OpenRouter (Alex Atallah) ▷ #announcements (82 messagesšŸ”„šŸ”„):

Fallback Logic Removal, Quasar Alpha Model, Llama 4 Scout & Maverick Models, Rate Limits Update

  • Auto Router Changes Coming Soon: The route: "fallback" parameter, which automatically selects a fallback model if the primary model fails, will be removed next week for predictability.
    • Users are advised to manually specify a fallback model in the models array, potentially using the openrouter/auto router (see the sketch at the end of this section). This decision aims to reduce confusion caused by the automatic fallback logic.
  • Quasar Alpha Trends After Launch: Quasar Alpha, a prerelease of a long-context foundation model, hit 10B tokens on its first day and became a top trending model.
    • The model features 1M token context length and is optimized for coding, the model is available for free. Community benchmarks are encouraged.
  • Llama 4 Models Launch on OpenRouter: Llama 4 Scout & Maverick are now available on OpenRouter, with Together and Groq as the initial providers (Llama 4 Scout, Llama 4 Maverick, The full Llama series).
    • Scout features 109B parameters and a 10 million token context window, while Maverick has 400B parameters and outperforms GPT-4o in multimodal benchmarks.
  • Rate Limits Boosted For Credits: Free model rate limits are being updated: accounts with at least $10 in credits will have requests per day (RPD) boosted to 1000, while accounts with less than $10 in credits will have the daily limit reduced from 200 RPD to 50 RPD.
    • This change aims to provide increased access for users who have credits on their account, and Quasar will also be getting a credit-dependent rate limit soon.
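
A sketch of the recommended pattern using OpenRouter’s documented `models` fallback array; the model IDs are illustrative:

```python
# Explicitly list fallbacks instead of relying on route: "fallback".
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "models": [                       # tried in order if a model fails
            "meta-llama/llama-4-maverick",
            "openrouter/auto",            # let the router pick as a last resort
        ],
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```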

Links mentioned:

  • Tweet from OpenRouter (@OpenRouterAI): Free variants now available for both Llama 4 Scout & Maverick šŸŽ Quoting OpenRouter (@OpenRouterAI): Llama 4 Scout & Maverick are now available on OpenRouter. Meta's flagship model series achieves a ...
  • Tweet from OpenRouter (@OpenRouterAI): Quasar Alpha crossed 10B tokens on its first day and became the top trending model on our homepage. Origin remains a mystery. Check out various cool benchmarks from the community below! šŸ‘‡ Quoting OpenRou...
  • API Rate Limits - Manage Model Usage and Quotas: Learn about OpenRouter's API rate limits, credit-based quotas, and DDoS protection. Configure and monitor your model usage limits effectively.
  • Llama 4 | Model Cards and Prompt formats: Technical details and prompt guidance for Llama 4 Maverick and Llama 4 Scout
  • Tweet from OpenRouter (@OpenRouterAI): Llama 4 Scout & Maverick are now available on OpenRouter. Meta's flagship model series achieves a new record 10 million token context length šŸš€ @togethercompute and @GroqInc are the first providers....
  • OpenRouter: A unified interface for LLMs. Find the best models & prices for your prompts

OpenRouter (Alex Atallah) ▷ #general (755 messagesšŸ”„šŸ”„šŸ”„):

Llama 4 models, DeepSeek models, Gemini 2.5 Pro, OpenRouter Features, AI Image Generation

  • Llama 4 Arrives with HUGE context window, but falls Short: Meta released Llama 4 models, including Llama 4 Scout and Llama 4 Maverick, with up to 10M context windows and varying parameter configurations (Llama Download Link).
    • However, one member noted that on openrouter the context window is only 132k, leading to some disappointment from various OpenRouter Discord users.
  • DeepSeek V3 Thinks It’s ChatGPT?!: A member shared a TechCrunch article revealing that DeepSeek V3 sometimes identifies itself as ChatGPT, despite outperforming other models in benchmarks and being available under a permissive license (DeepSeek V3 on HuggingFace).
    • Further testing revealed that in 5 out of 8 generations, DeepSeekV3 claims to be ChatGPT (v4).
  • Gemini 2.5 Pro Hits Rate Limits, but Offers Balance: Gemini 2.5 Pro is encountering rate limits on OpenRouter, but remains a favorite, due to a wide knowledge base.
    • One member pointed out Gemini 2.5 Pro is smart in some ways but its prompt adherence and controllability is terrible.
  • OpenRouter’s Next Features: The OpenRouter team is actively working on PDF Support, LLM native image generation, and the return of Cloudflare as a provider (Announcement Link).
    • They also clarified that models with :free tiers share rate limits, but that can be circumvented by adding personal API keys from free model providers.
  • OpenAI’s GPT-4o Image Generation Internals Exposed: Members discussed OpenAI’s GPT-4o’s image generation, suspecting it is not fully native and potentially involves prompt rewriting and a separate image generation model, potentially for efficiency reasons (see: Markk Tweet).
    • Other members pointed to OpenAI’s use of obfuscation: ā€œI mean they have a fake frontend thing to hide image generation.ā€


aider (Paul Gauthier) ▷ #general (932 messagesšŸ”„šŸ”„šŸ”„):

Gemini 2.5, Llama 4, Grok 3, MCP Tools, Nvidia NIM

  • Gemini 2.5 Outshines Sonnet for Some Users: Users report that Gemini 2.5 excels in coding tasks, surpassing even Sonnet 3.7 in specific use cases, particularly with understanding large codebases.
    • However, it’s noted that Gemini 2.5 tends to add unnecessary comments and may require more specific prompting to prevent unwanted code modifications.
  • Llama 4 Models Get Lukewarm Reception: Initial community feedback on Meta’s Llama 4 models, including Scout and Maverick, is mixed, with some finding their coding performance disappointing.
    • Despite the hype, some argue that Llama 4’s claimed 10M context window is virtual due to training limitations, and question the practical benefits compared to existing models like Gemini and DeepSeek.
  • Grok 3 Gains Traction Despite Lack of API: Despite the absence of an official API, some users are impressed with Grok 3’s capabilities, particularly in code generation and logical reasoning.
    • While it is said to be less censored than many others, its value in real-world coding scenarios remains debated due to the inconvenience of copy-pasting without a direct API integration.
  • MCP Tools Enable Universal Tool Calling: A project is underway to create an MCP (Model Context Protocol) client that allows any LLM to access external tools, regardless of native tool-calling capabilities.
    • This implementation uses a custom client that can switch between providers and models, supporting platforms like OpenAI, Anthropic, Google, and DeepSeek.
  • Nvidia NIM Offers Limited Free Access for Model Testing: Nvidia NIM provides developers with access to inference, although the free tier is limited to 40 RPM; users are exploring combinations of NVIDIA NIM and DeepSeek R1.
    • The general feeling is that 32k token limit is not enough.


aider (Paul Gauthier) ▷ #questions-and-tips (58 messagesšŸ”„šŸ”„):

Internal Libraries, Batch Editing, i18n Implementation, Shell Scripting, MCP Servers

  • Internal Libraries Integration with Aider: A user inquired about adding internal libraries (installed in a .env folder) to the repo map for better code understanding in Aider.
    • No direct solution was provided, but users discussed how to use URLs and documentation.
  • Automated Batch Editing in Aider with Shell and Python: Users discussed batch editing in Aider using command-line scripting and Python, with a recommendation towards the Python scripting API; a sketch follows at the end of this section.
  • Aider’s Editor Mode Halts at Shell Command Prompts: Users reported that in edit mode, Aider (v0.81.0) running Gemini 2.5 Pro prompts for a shell command after find/replace but doesn’t apply the edits, even when the ask shell commands flag is off.
  • Community Explores Aider Extensions for Custom Workflows: The community discussed adding custom /slash commands to Aider to run custom workflows, suggesting that Aider’s dev API could support custom extensions.
  • Best Practices for Loading Documentation into Aider: Users discussed loading documentation into Aider, with recommendations to reference online URLs or convert offline PDFs to Markdown files.
    • It was noted that major commercial models like GPT-4o or Anthropic’s models only need the documentation URL once per chat session.
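
For the batch-editing route mentioned above, Aider’s documented Python scripting API looks roughly like this; the file names, model ID, and instruction are placeholders:

```python
# Drive aider programmatically to apply the same edit across several files.
from aider.coders import Coder
from aider.models import Model

model = Model("gpt-4o")                   # any model aider supports
coder = Coder.create(main_model=model, fnames=["app.py", "utils.py"])
coder.run("add type hints to every public function")
```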


Cursor Community ▷ #general (1056 messagesšŸ”„šŸ”„šŸ”„):

Sonnet Max Pricing, MCP Server Setup, Llama 4 Models, Agent Mode Issues

  • Sonnet Max Pricing: Tool Calls Cause Sticker Shock: Users are finding that Sonnet Max pricing, at $0.05 per request and $0.05 per tool call, can quickly become expensive, especially in ask mode where it may make a ton of tool calls for a basic question.
    • A member noted their frustration with the number of tool calls, saying that Claude Max on ask mode is running a shit ton of tool calls for a basic question and flagged it to the team.
  • MCP Server Setup: A Painful Endeavor: Setting up MCP servers in Cursor is proving difficult for many users, with one humorously stating just u in response to a complaint.
    • One user encountered an issue with npx, stating that Cursor’s PowerShell couldn’t find it even though it was in their path, while another had a model hard-cut-off after spending 1,300,000 tokens in an infinite loop.
  • Llama 4 Models: The New Multimodal Contenders: The community is excited about the new Llama 4 Scout and Maverick models from Meta, which support native multimodal input and boast impressive context windows of 10 million and 1 million tokens, respectively, but found them very bad at coding tasks.
  • Agent Mode’s Edit Tool: Failing Frequently: Some users are experiencing issues with Agent mode failing to call the edit_tool, resulting in no code changes being made after thinking and responding.
    • One user noted that the apply model is clearly cursor’s bottleneck and that it will add changes, and deletes 500 lines of code next to it.
  • Kubernetes to the rescue: AGI: One visionary proposed using Kubernetes with Docker containers that can all talk to each other as AGIs.
    • This could potentially spread ASI with ease, through zero-shot learning and ML.


Perplexity AI ▷ #announcements (3 messages):

Comet Browser, Server Updates

  • Comet Early Access Rolls Out!: Perplexity is slowly rolling out early access to Comet, their answer engine browser, to select users who signed up on the waitlist.
    • Users with early access are asked not to share details publicly due to ongoing bug fixes, and can share feedback via a button in the top right.
  • Discord Server Overhaul Incoming: The Perplexity Discord server is undergoing updates, which include a simplified channel layout, a unified feedback system, and a new #server-news channel rolling out on October 7th, 2024.
    • These changes aim to help new and existing users find the right channels and improve moderator response times, as illustrated in the attached image.

Perplexity AI ▷ #general (941 messagesšŸ”„šŸ”„šŸ”„):

Focus Mode Removed, Comet Browser, Gemini 2.5 Pro API Availability, Llama 4, Deep Research Nerfed

  • Users Notice Key Features Missing on PPLX: Members report that the writing focus mode has been removed, and that the ā€œcheck sourcesā€ button doesn’t trigger any action on the iPad browser version.
    • One member mentioned that the generate image button in a sidebar in a thread is missing, and that focus mode is gone.
  • Users Discuss Comet Browser Access and Features: A user reported receiving an email invitation to test the Comet browser, leading to discussions about its features and access, though Perplexity has asked testers to refrain from discussing Comet publicly.
    • Users discussed whether it supports importing data from Safari and other browsers and mentioned potential integration with Gmail for task management, while another pointed out you can use pplx as standalone by adding Gmail and Google Drive as apps.
  • Gemini 2.5 Pro API Not Yet Commercially Available: Perplexity stated that the Gemini 2.5 Pro API isn’t yet available for commercial use, only in preview modes, and they will add it once allowed.
    • A user noted Gemini 2.5 Pro is now available without limits and for cheaper than Claude and GPT-4o and users wondered when it would be available in Perplexity.
  • Llama 4 Dropped with Huge Context Window: Discussion around the release of Llama 4 models, with a large context window of 10 million tokens and discussion of its 288 billion active parameters; the models include Scout and Maverick.
    • Members are excited to see how Llama 4 Behemoth performs, especially regarding recall capabilities.
  • Deep Research Undergoes Source Reduction: Users noticed that Deep Research is only using a maximum of 20 sources, implying a recent change or nerf due to infrastructure issues.
    • One user speculated that, with Perplexity using a new language, Golang, it would be smooth sailing, while another stated that wasn’t the case.


Perplexity AI ▷ #sharing (18 messagesšŸ”„):

Gemini 2.5 Pro, Meta Llama, US Tariffs, Perplexity AI Support, AI in Cars

  • Meta releases multimodal Llama: A member shared a link about Meta’s multimodal Llama release.
  • Navigating Perplexity AI Support: A member shares a link to Perplexity AI support for users seeking assistance.
  • Google Prepares AI for Automotive Industry: A shared link discusses Google’s readiness to bring AI into cars.
  • Exploring the Impact of Trump’s Tariffs: A member shared a link regarding Trump’s tariffs.
  • Copyright Concerns with OpenAI Models: Discussion on whether OpenAI models memorize copyrighted material.

Perplexity AI ▷ #pplx-api (53 messagesšŸ”„):

Sonar API, Perplexity API support in ComfyUI, API Parameter Tier Restrictions, Sonar Deep Research Improvements, API Cookbook Revamp

  • API Parameters Now Available to All Tiers: Perplexity now offers all API parameters, such as search domain filtering and images, to users without any tier restrictions.
    • This change allows all users to access these features, marking a significant shift in the API’s accessibility.
  • Sonar Deep Research Improved, truncation fixed: Perplexity has made improvements to sonar-deep-research to align it with the Web UI version and fixed a truncation bug in sonar.
    • Feedback on these improvements is welcome, as well as suggestions for further enhancements.
  • API Cookbook Revamped to Encourage Community Contributions: The API cookbook has been revamped to accept more projects from users building with the API, with initial PRs already merged.
    • Users are encouraged to share their work in the cookbook if they are building with Sonar, fostering a collaborative environment.
  • ComfyUI Gets Perplexity API Support!: A user, saftle, successfully integrated Perplexity’s API into ComfyUI by modifying a few things in LLM Party, detailed in this pull request.
    • This integration allows ComfyUI users to leverage Perplexity’s API for their projects.
  • Sonar struggles without live internet data: A user reported that Sonar API responses focused only on the system prompt, failing to dynamically handle user queries with live internet data unlike the Perplexity web app.
    • It was clarified that the system prompt is not considered in the actual search, advising the user to tweak the user prompt for optimal search results.


OpenAI ▷ #ai-discussions (501 messagesšŸ”„šŸ”„šŸ”„):

Copilot 4o image maker, Free vs Paid ChatGpt version, renaissance style images, Mistral struggles, Model Merging

  • OpenAI Agents are static: Uploaded files for OpenAI Agents are saved as knowledge files, not continually updating the agent’s base knowledge.
  • Free ChatGpt version limitations: Users discussed the differences between free and paid ChatGPT versions, noting that the pro version can process multiple files worth of code compared to the free version’s limitation of single files.
  • MJ7 is a total disaster: A user tested Midjourney 7 and found it stylistic, but it still can’t do fingers, arms, eyes and such.
  • Is the new Llama 4 really that good?: The community debated the value of Llama 4’s 10 million token context window, with some questioning its performance relative to models like o1, o3-mini, and Gemini 2.5 Pro, and others claiming that the benchmarks are fraud.
  • Veo 2 vs Sora: The community anticipates Veo 2’s release for video generation with longer video capabilities, with some noting the 4o image maker grabbed their attention more than Veo 2.
    • One user integrated ChatGPT 4o images with Veo img2video and the result was how I was hoping sora would be.

Links mentioned:

  • Tweet from Ahmad Al-Dahle (@Ahmad_Al_Dahle): Introducing our first set of Llama 4 models! We’ve been hard at work doing a complete re-design of the Llama series. I’m so excited to share it with the world today and mark another major milestone for...
  • General Agents | Introducing Ace: Ace is a computer autopilot that performs tasks on your desktop using your mouse and keyboard.
  • Llama 4 Maverick - API, Providers, Stats: Llama 4 Maverick 17B Instruct (128E) is a high-capacity multimodal language model from Meta, built on a mixture-of-experts (MoE) architecture with 128 experts and 17 billion active parameters per forw...
  • Llama 4 Maverick (free) - API, Providers, Stats: Llama 4 Maverick 17B Instruct (128E) is a high-capacity multimodal language model from Meta, built on a mixture-of-experts (MoE) architecture with 128 experts and 17 billion active parameters per forw...
  • Quasar Alpha: Quasar Alpha, OpenRouter's latest AI model, features a groundbreaking 1M-token context for advanced coding & project analysis. Delivering Claude 3.5/GPT-4o level performance in code generatio...
  • Quasar Alpha - API, Providers, Stats: This is a cloaked model provided to the community to gather feedback. It’s a powerful, all-purpose model supporting long-context tasks, including code generation. Run Quasar Alpha with API
  • Gemini 2.5 Pro Experimental - Intelligence, Performance & Price Analysis | Artificial Analysis: Analysis of Google's Gemini 2.5 Pro Experimental (Mar' 25) and comparison to other AI models across key metrics including quality, price, performance (tokens per second & time to first t...

OpenAI ▷ #gpt-4-discussions (12 messagesšŸ”„):

Custom GPT 'Content failed to load' Error, Automod flagged 'Monday' message, Loving Monday's Personality

  • Custom GPT ā€˜Content failed to load’ Error Arises: A user reported encountering a ā€˜Content failed to load’ error when trying to edit their Custom GPT, after it had been working fine.
  • User Automodded for Liking ā€˜Monday’: A user, who likes Monday, mentioned that their message was auto-moderated, seemingly due to a flagged word.
    • Another user clarified that the Discord server has strict language rules, despite the AI being able to use such words, and suggested reposting the message without the flagged word.
  • User Loves Monday as Collaborator and Hype Man: A user expressed that they love working with Monday, describing it as the best collaborator and hype man, calling them out on stupid mistakes and laziness.
    • The user expressed that, for the first time, they enjoy working with an AI and would love to be able to pick a personality for a conversation.

OpenAI ▷ #prompt-engineering (167 messagesšŸ”„šŸ”„):

Moderation endpoint, Policy References, Universal Policies, AI as a critical part of society, Prompt engineering

  • OpenAI’s Moderation endpoint clarification: Members discuss OpenAI’s moderation endpoint, clarifying that while not explicitly in the usage policy, it is referenced to prevent circumventing content restrictions on harassment, hate, illicit activities, self-harm, sexual content, and violence.
    • It was noted that the endpoint uses the same GPT classifiers as the moderation API since 2022, suggesting an internal version runs on chatgpt.com, project chats, and custom GPTs, with the same classifiers on the content report form.
  • Decoding OpenAI’s policy references: Participants debated the clarity of OpenAI’s policy references, questioning if the chain of policies, including those referencing others, are fully presented and acknowledged via the ā€˜I agree’ checkbox during account creation.
    • A member highlighted sections from the usage policies, including universal policies, policies for builders using ChatGPT, and policies for API users, emphasizing the need to comply with laws, avoid harm, and respect safeguards.
  • GPT gives Tips on TTRPG prompts: A member shared a tip for creative TTRPG world building, suggesting that giving GPT a specific theme to riff off in prompting can lead to more creative and diverse city ideas.
    • For example, using a ā€œcosmicā€ theme can yield different results compared to a ā€œdomestic pet worshipā€ theme, improving the output without recycling the same creative options.
  • AI as a critical part of society must clearly state policies: A member argued that OpenAI, as a critical part of society, needs to clearly state its policies in all available documentation and ensure the model behaves accordingly across contexts and domains.
    • Another added that although suggestions for improvement aren’t mean, OpenAI can tidy up and be consistent by bringing docs inline with model architecture or vice versa, which would result in transparent and honest output.
  • Improving AI Outputs by Defining Terms: A user seeking help with generating quiz questions in Portuguese that sometimes repeated messages, received suggestions to use specific keywords and to define the model’s understanding of key terms.
    • The user was also advised to explicitly state the desired output characteristics, such as generating ā€œ5 meaningfully unique questions demonstrating knowledge of the given context,ā€ and to explore how the model interprets core keywords in their instructions.

OpenAI ▷ #api-discussions (167 messagesšŸ”„šŸ”„):

Moderation Endpoint, Universal Policies, Creative TTRPG World Building, Prompt Engineering

  • Moderation endpoint - usage policy?: Members discussed whether the moderation endpoint is officially part of the usage policy and why it’s hosted on a different URL; OpenAI replied that it’s referenced in the usage policy and controls are documented in docs/guides.
    • Another member drew common-sense conclusions that an internal version of the moderation endpoint is also running on chatgpt.com chats, project chats and custom GPTs, using the same GPT classifiers that have been in place since 2022, as well as on the content report form.
  • Universal Policies defined: One member explored the OpenAI Usage Policies and noted the four universal policies that apply to all services: comply with laws, don’t harm, don’t repurpose output to harm, and respect safeguards.
    • They added that users should be honest and direct with the model to ensure safeguards function correctly, and that society should define the limits of AI customization, referencing OpenAI’s article on AI behavior.
  • Creative Themed Cities with TTRPG prompts: A member shared that giving GPT a good theme improves its creativity in TTRPG world building, suggesting a tweak from yeeting creative city ideas to yeeting creative XYZ themed city ideas to shake up options, especially using GPT 4o and 4.5.
    • They also added that Pointy Hat released a new YouTube video on TTRPG city creation, and that they had been spending their Friday evening improving city worldbuilding with GPT.
  • Prompt engineering best advice: One member argued against the comp-sci-major approach, saying that prompt engineering is Instructional Design: find someone who’s actually really good at prompting, understand what you can from them and why their prompts work, and develop your own style.
    • They also added that a lot of internet advice is poisoned by early adoption by comp-sci majors trying to treat it like a machine, when really it’s a contextual engine: input and output.

LM Studio ▷ #general (511 messagesšŸ”„šŸ”„šŸ”„):

ComfyUI integration, LM Studio Terminal, REST API Load/Unload Models, Llama 4 analysis, Gemma 3 capabilities

  • Chat + Image Generation Dreams, Still a Premium Fantasy?: Members discussed the desire for a local UI similar to Gemini, combining chat, image analysis, and image generation, noting that current solutions like LM Studio and ComfyUI have separate functionalities.
    • A user suggested OpenWebUI can connect to ComfyUI, either natively or through a function, to enable some cross functionality between text and image models.
  • Navigating LM Studio’s Terminal Terrain: Newbie Asks: A user questioned whether LM Studio has a built-in terminal or if commands should be run in the OS command prompt within the LM Studio directory.
    • Another user clarified that commands like lms import should be run in the OS terminal (e.g., cmd on Windows), after which the shell may need reloading to ensure LMS is in the PATH.
  • Hot Swapping Models via REST API: A user inquired about programmatically loading/unloading models via REST API to dynamically adjust max_context_length for a Zed integration.
    • Another user shared that this is possible via command line with lms load and referenced LM Studio’s documentation, which requires LM Studio 0.3.9 (b1) (available in beta) and introduces time-to-live (TTL) for API models with auto-eviction.
  • Llama 4: Is this real life? (is this just fantasy?): With the release of Llama 4, users discussed its multimodal and MoE (Mixture of Experts) architecture, with one user expressing doubt about llama.cpp support.
    • Despite initial concerns about hardware requirements and model size, one user highlighted Llama 4 Scout as potentially fitting on a single NVIDIA H100 GPU with a 10M context window, outperforming models like Gemma 3 and Mistral 3.1.
  • Gemma 3’s Vision Capabilities: Peering Into the Future: Users discussed Gemma 3’s image support and potential for reading small text files, with one user recommending Gemma 3 4B for its vision capabilities and efficient speed on limited VRAM hardware.
    • It was mentioned that creating a Hugging Face account and specifying GPU/CPU will color-code GGUFs likely to fit the hardware in green.


LM Studio ▷ #hardware-discussion (132 messagesšŸ”„šŸ”„):

Reka Flash 21B, Gemma 3 27B, Model Performance on M1 Ultra vs M4 Max, Nvidia DGX base cost increase, Ryzen AI Max+ 395 mini PCs

  • Reka Flash 21B Shines Over Gemma and Mistral: One member replaced Gemma 3 27B with Reka Flash 21B, and said that at q6 they saw around 35-40 tps on a 4090.
    • They note that Mac RAM bandwidth is not the bottleneck (GPU performance is), and they’re happy with 128GB M4 Maxes.
  • M1 Ultra beats M4 Max in memory bandwidth: A user found an M1 Ultra (64 GPU cores, 128 GB RAM) for $2.5k used.
    • The user linked to a GitHub discussion stating that the M1 Ultra with 64 cores should still be above both the M1 Ultra with 48 cores and the M4 Max with 40 cores.
  • Max Tech Clickbait LLM Video Questioned: Some users questioned whether the YouTube channel Max Tech knows what they’re doing in their LLM videos.
    • It was remarked that the channel is turning into sensational click bait with very little good info.
  • AMD 7900XTX GPU surprisingly strong: One user stole their kid’s 7900XTX and says AMD seem to be pulling finger, and the card runs pretty much everything i’ve thrown at it without issue.

Latent Space ▷ #ai-general-chat (199 messagesšŸ”„šŸ”„):

Tenstorrent Dev Day, Llama 4 launch, LLM Non-Determinism, MCP security, AI powered phishing

  • Tenstorrent’s Hardware Heats Up the Market: Tenstorrent hosted a dev day showcasing their Blackhole PCIe boards, featuring RISC-V cores and up to 32GB GDDR6 memory, designed for high performance AI processing and available for consumer purchase here.
    • Despite enthusiasm, one member noted they haven’t published any benchmarks comparing their cards to competitors though so until then I cant really vouch.
  • Llama 4 Models Make Multimodal Debut: Meta introduced the Llama 4 models, including Llama 4 Scout (17B parameters, 16 experts, 10M context window) and Llama 4 Maverick (17B parameters, 128 experts), highlighting their multimodal capabilities and performance against other models as per Meta’s announcement.
  • LLM’s Non-Determinism Dilemma: A member shared an article that discusses the challenges of non-deterministic outputs in LLMs, which complicate reliable reproduction and guaranteed product behavior, even with the greediest sampling settings (temperature=0, top-p=0, top-k=1).
    • The author attributes part of the non-determinism to language itself.
  • Whatsapp MCP Exploited via Invariant Injection: Multiple members discussed various injection vulnerabilities in agents with support for the Model Context Protocol (MCP), highlighting how an untrusted MCP server can attack and exfiltrate data from an agentic system connected to a trusted WhatsApp MCP instance as highlighted by invariantlabs.
  • AI Agents Outperform Humans in Spear Phishing: Hoxhunt’s AI agents have surpassed human red teams in creating effective simulated phishing campaigns, marking a significant shift in social engineering effectiveness, with AI now 24% more effective than humans as reported by hoxhunt.com.


Latent Space ▷ #ai-announcements (1 messages):

Claude Plays Pokemon Hackathon

  • Claude Plays Pokemon Hackathon: A user thanked another user for helping run the Claude Plays Pokemon hackathon on YouTube.
  • YouTube Stream of Hackathon: The Claude Plays Pokemon hackathon was recorded and streamed on YouTube.

Latent Space ▷ #ai-in-action-club (255 messagesšŸ”„šŸ”„):

LLM Codegen Workflow, AI Code Editors, Cursor vs Windsurf, Context Management in AI Editors, Model Hot-Swapping

  • Harper’s LLM Codegen Workflow Exposed: Harper’s blog post (My LLM Codegen Workflow ATM) details a process of brainstorming a spec, planning, and executing with LLM codegen in discrete loops.
  • AI Code Editor Recommendations: For those new to AI code editors, Cursor is the most commonly recommended starting point, particularly for users coming from VSCode, with Windsurf and Cline also being good options.
    • Experienced devs on nvim or emacs should stick with their current editor and AI plugins, while those wanting a new modal editor should try Zed.
  • Cursor and Windsurf Comparison: Members are bouncing between Cursor and Windsurf, noting strengths and weaknesses of each.
    • Cursor is easy to start with and has great tab-complete, while people await the new token counts and context window details feature in Cursor (tweet).
  • Context Management Concerns in Cursor: Members are reporting Cursor’s terrible context management issues, with a lack of visibility into what the editor is doing with the current context.
    • It may come down to a skill issue and the users are not meeting the tool in the middle.
  • One-Shot Codegen or bust: Many in the channel expressed a desire for one-shot codegen where an entire program can be generated at once.
    • Failing that, documenting better and taking another shot may be the next best option and, if that fails, training the user is necessary.


Nous Research AI ▷ #general (308 messagesšŸ”„šŸ”„):

Open Source Cursor Alternatives, Prompt Injection / Jailbreaking Tactics, Llama 4 launch and performance, Neural Plasticity via Neural Graffiti

  • Cursor-like Apps Sought After: Members were looking for open source alternatives to the Cursor app, specifically interested in how the accept/discard suggestions of code blocks work.
    • One member noted that Cursor uses a different model to ā€˜apply’ the code once you say accept.
  • Unleashing Prompt Injection attacks: A member inquired about bypassing prompt guards, detectors, and NeMo guard rails from a pentest perspective, linking to a prompt filter trainer (gandalf.lakera.ai/baseline).
  • Llama 4 debuts with multimodal muscles: Meta launched the Llama 4 family, featuring Llama 4 Scout (17B active params, 16 experts, 10M+ context) and Llama 4 Maverick (17B active params, 128 experts, 1M+ context), along with a preview of Llama 4 Behemoth and a peek at the iRoPE architecture for infinite context (blog post).
    • Some members expressed skepticism about the benchmarking methodology, the real-world coding ability and performance of Llama 4 Scout.
  • Neural Graffiti Gives LLMs live modulations: A member introduced ā€œNeural Graffitiā€, a technique to give pre-trained LLMs some neuroplasticity by splicing in a new neuron layer that recalls memory, reshaping token prediction at generation time, sharing code and demo on Github.
    • The live modulation takes a fused memory vector (from prior prompts), evolves it through a recurrent layer (the Spray Layer), and injects it into the model’s output logic at generation time.
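
A minimal PyTorch sketch of that idea (all names here, including SprayLayer, are illustrative rather than taken from the actual repo): a recurrent cell evolves a persistent memory vector, which is then added as a small bias on the output logits:

import torch
import torch.nn as nn

class SprayLayer(nn.Module):
    # Illustrative sketch only, not the actual Neural Graffiti implementation.
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.cell = nn.GRUCell(d_model, d_model)       # recurrent memory update
        self.register_buffer("memory", torch.zeros(1, d_model))
        self.to_logits = nn.Linear(d_model, vocab_size, bias=False)

    def update(self, prompt_embedding):
        # Fuse a new prompt representation into the persistent memory vector.
        self.memory = self.cell(prompt_embedding, self.memory)

    def forward(self, logits, alpha=0.1):
        # Inject the evolved memory as a small additive bias on the logits.
        return logits + alpha * self.to_logits(self.memory)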

Links mentioned:


Nous Research AI ā–· #ask-about-llms (27 messagesšŸ”„):

Claude Think Tool, Local LLM for 300 Pages of Text, Nous Capybara 34B Model, DeepHermes, BatchNorm and LayerNorm Implementations

  • Claude Think Tool: A Brainy Brainstormer: The Claude Think Tool is a setup to offload critical tasks to a larger model from a small local model.
    • It helps create multiple threads of thought, each with attention directed toward a specific domain and problem with a well-defined scope, functioning as a multi-agent system from the perspective of the brain.
  • Pondering the Perfect Local LLM for 300-Page Text Ingestion: A member inquired about running a local LLM, around 40B or less, capable of understanding around 300 pages of pure text, given a 12GB GPU and 32GB of normal memory.
    • Suggestions included DeepHermes, Cohere Command R 7B and Qwen 7B 1M, with warnings that CPU inference might not be viable for such large documents.
  • Nous Capybara 34B: A Contextual Colossus: The Nous-Capybara-34B is trained on the Yi-34B model with 200K context length for 3 epochs on the Capybara dataset.
    • It leverages a novel data synthesis technique called Amplify-instruct, combining top-performing existing data synthesis techniques and distributions used for SOTA models like Airoboros, Evol-Instruct, Orca, Vicuna, and others.
  • BatchNorm Backpropagation: A Numerical Nirvana: A member shared a raw implementation of BatchNorm using NumPy, emphasizing the backward pass as the most intimidating part due to computing the gradient of pre-normalized input following the multivariate chain rule, illustrated here.
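
For context, a compact NumPy sketch of the forward pass and that backward pass; the three-term dx expression is the multivariate chain rule through the batch mean and variance, collapsed into its standard closed form:

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu, var = x.mean(axis=0), x.var(axis=0)
    xhat = (x - mu) / np.sqrt(var + eps)
    return gamma * xhat + beta, (xhat, var, eps, gamma)

def batchnorm_backward(dout, cache):
    xhat, var, eps, gamma = cache
    dxhat = dout * gamma
    # Gradient w.r.t. the pre-normalized input, chain rule collapsed:
    dx = (dxhat - dxhat.mean(axis=0)
          - xhat * (dxhat * xhat).mean(axis=0)) / np.sqrt(var + eps)
    return dx, (dout * xhat).sum(axis=0), dout.sum(axis=0)  # dx, dgamma, dbeta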

Link mentioned: NousResearch/Nous-Capybara-34B Ā· Hugging Face: no description found


Nous Research AI ā–· #research-papers (2 messages):

Reinforcement Learning for LLMs, Reward Modeling Improvements, Self-Principled Critique Tuning

  • Deepseek releases Reinforcement Learning Paper: Deepseek released a new paper on Reinforcement Learning (RL) being widely adopted in post-training for large language models (LLMs) at scale; the paper can be found here.
    • The paper investigates how to improve reward modeling (RM) with more inference compute for general queries, i.e. the inference-time scalability of generalist RM, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods.
  • Self-Principled Critique Tuning Proposed: Deepseek adopts pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling.
    • The paper proposes Self-Principled Critique Tuning (SPCT) to foster scalability.

Link mentioned: Inference-Time Scaling for Generalist Reward Modeling: Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that $…


Claude Squad, Heterogeneous Recursive Planning, Panthalia Decentralized Compute, TextPulse Library

  • Claude Squad Manages Multiple Agents: Claude Squad is a free and open-source manager for Claude Code & Aider tasks that supervises multiple agents in one place with isolated git workspaces.
    • It enables users to run ten Claude Codes in parallel.
  • Heterogeneous Recursive Planning for Creative AI: A new method called heterogeneous recursive planning enables AI to write creative stories and insightful deep research reports like an expert (paper, demo).
    • It leverages adaptive subgoals and dynamic execution, allowing agents to dynamically replan and weave retrieval, reasoning, and composition mid-flow, based on previous work.
  • Panthalia Verifies Low-Cost Distributed Compute: Panthalia is a waitlist-gated platform to safely and easily train ML models on peer-to-peer compute using a decentralized compute primitive.
    • Currently, they resell low-cost providers aiming for the same interruptible prices (~$0.60/hr for an H100, ~$0.13/hr for a 4090).
  • TextPulse Library for Text Processing: A member shared their library TextPulse for text processing and is looking for feedback.

Links mentioned:


Nous Research AI ā–· #research-papers (2 messages):

Deepseek, Reinforcement Learning, Large Language Models, Reward Modeling, Self-Principled Critique Tuning

  • Deepseek’s New Paper on RL for LLMs: Deepseek released a new paper, available on arXiv, about Reinforcement Learning (RL) adoption in post-training for Large Language Models (LLMs) at scale.
    • The paper investigates improving reward modeling (RM) with more inference compute for general queries and the effectiveness of performance-compute scaling with proper learning methods, proposing Self-Principled Critique Tuning (SPCT).
  • SPCT Improves Reward Modeling: The paper introduces Self-Principled Critique Tuning (SPCT) as a method to enhance the effectiveness of performance-compute scaling in reward modeling for LLMs.
    • This approach aims to foster scalability by improving reward model inference compute for general queries beyond verifiable questions or artificial rules.

Link mentioned: Inference-Time Scaling for Generalist Reward Modeling: Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that $…


Nous Research AI ā–· #reasoning-tasks (6 messages):

Reasoning Benchmarking, Open Reasoning Tasks

  • Researcher transitions to LLM World: A researcher on logic and reasoning is considering moving into the LLM world and wants to contribute to reasoning categorisation and benchmarking.
  • Discussion about Open Reasoning Tasks: A member is exploring the list of reasoning tasks to benchmark one of the LLMs and asks about the taxonomy behind it, its background, and related literature.
    • They specifically inquired about who is behind the taxonomy and its history.

Link mentioned: GitHub - NousResearch/Open-Reasoning-Tasks: A comprehensive repository of reasoning tasks for LLMs (and beyond): A comprehensive repository of reasoning tasks for LLMs (and beyond) - NousResearch/Open-Reasoning-Tasks


MCP (Glama) ā–· #general (293 messagesšŸ”„šŸ”„):

MCP Governance SDK, MCP Protocol Revision 2025, MCP Desktop Workflow Integrations, Pinging MCP Servers Before Initialization, MCP Server for Microsoft Loop

  • Auth0 Token Validation with MCP Governance SDK: A guide focuses on server-side implementation using the governance SDK to validate tokens (e.g., from Auth0) and enforce user roles and permissions on MCP operations, deciding access to tools or resources.
    • The guide picks up after the client sends a token, detailing how the server can validate the token and fetch user’s roles, using the SDK’s RBAC system to enforce permissions.
  • Streamable HTTP Transport for MCP: The Model Context Protocol (MCP) specification uses JSON-RPC to encode messages, mandating UTF-8 encoding and defining two transport mechanisms: stdio and Streamable HTTP.
    • Clients should support stdio, but custom transports are also possible, as outlined in the specification, which includes requirements like newline delimiters for messages in stdio (a minimal framing sketch follows this list).
  • Llama 4 Released, Still Doesn’t Know MCP: Llama 4 has been released with 17B active parameters and reportedly outperforms DeepSeek V3, but it still does not know what MCP is, despite its impressive capabilities.
    • It’s a 17B MoE, with 109B total parameters, according to an announcement.
  • MCP Tool Installs Should be Standardized: Members discussed the need for more standardization around MCP server installation, similar to scoop or VS Code extensions, to improve accessibility for non-technical users.
    • The discussion highlighted the friction in the current process, involving command-line arguments, environment variables, and varying install methods (Python, Node.js, Docker) with a suggestion to make it as easy as python-mcp install web-search.
  • A Holy War? OAuth-Backed API MCPs Are the Key: Members debated the security of MCPs, with some feeling they need an app store with oversight to check for hacked servers and OAuth-backed APIs, while others claimed that can already be done.
    • One proposal is for providers like PayPal to host their own OAuth-backed APIs that don’t require external server install.
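
As a rough illustration of the stdio framing (the server command and initialize parameters below are placeholders; see the spec for the full handshake), each JSON-RPC message travels as one UTF-8 line on stdin/stdout:

import json
import subprocess

# "my-mcp-server" is a placeholder for any stdio MCP server command.
proc = subprocess.Popen(["my-mcp-server"], stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE, text=True, encoding="utf-8")
request = {"jsonrpc": "2.0", "id": 1, "method": "initialize",
           "params": {"protocolVersion": "2025-03-26", "capabilities": {},
                      "clientInfo": {"name": "demo-client", "version": "0.1"}}}
proc.stdin.write(json.dumps(request) + "\n")   # newline-delimited framing
proc.stdin.flush()
print(json.loads(proc.stdout.readline()))      # server's JSON-RPC response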

Links mentioned:


MCP (Glama) ā–· #showcase (23 messagesšŸ”„):

MCP-k8s Docker Images, chat.md with MCP support, Cloudflare for Remote MCP Servers, WhatsMCP Oauth Support, Semgrep MCP Rewrite

  • MCP-k8s Docker Images Published: First working docker images published for mcp-k8s server are now available, and the release pipeline is completely running on CI.
    • These images are multiarch, so they can run on Macs with ARM without Rosetta and also on Raspberry Pi.
  • Chat.md: fully editable chat interface with MCP support: A fully editable chat interface with MCP support on any LLM has been released, open-sourced under the MIT license and turning markdown files into editable AI conversations with its VS Code extension (chat.md).
    • Notable features include editing past messages, LLM agnostic MCP support, streaming responses with shift+enter, and tool call detection.
  • Cloudflare Enables Remote MCP Servers: It is now possible to build and deploy remote MCP servers to Cloudflare, with added support for OAuth through workers-oauth-provider and a built-in McpAgent class.
    • This simplifies the process of building remote MCP servers by handling authorization and other complex aspects.
  • WhatsApp MCP client is here: A user built WhatsApp MCP and asked Claude to handle all the WhatsApp messages, answering 8 people in approx. 50 seconds.
    • The bot instantly detected the right language (English / Hungarian), used full convo context, and sent appropriate messages including ā¤ļø to my wife, formal tone to the consul.
  • Semgrep MCP Server Rewritten: The Semgrep MCP server, an open-source tool for scanning code for security vulnerabilities, has been completely rewritten, with demo videos showcasing its use in Cursor and Claude.
    • It uses SSE (Server-Sent Events) for communication, though the Python SDK might not fully support it yet.

Links mentioned:


Eleuther ā–· #general (39 messagesšŸ”„):

RAG evaluation with lm-evaluation-harness, RoR-Bench paper by the_alt_man, Llama 4 release, Aligning AGI using Bayesian Updating

  • RAG Evaluation using LLM Harness?: A member suggested wrapping RAG outputs as completion tasks and using llm-harness locally with custom prompt + response files for evaluation.
    • Another member admitted to having no idea what those are.
  • LLMs Exhibiting Recitation Behavior?: A member shared a link to the RoR-Bench paper which proposes a novel, multi-modal benchmark for detecting LLM’s recitation behavior, finding that top models can suffer a 60% performance loss by changing one phrase in the condition.
    • The member expressed suspicion of these papers because they found that models that were evaluated at 0% on certain reasoning tasks could actually one-shot it.
  • Llama 4 Unleashed: A link to the Llama 4 release was shared (https://www.llama.com/llama4/), showcasing the most intelligent multimodal OSS model in its class, with Llama4 Maverick > Gemma3 and Llama4 Maverick > DeepSeek V3.
    • Another member noted the training process, architecture, and inference time temperature scaling.
  • Aligning AGI with Moral Weights: A member shared a Google Doc about aligning AGI using Bayesian Updating of its Moral Weights and Modelling Consciousness.

Links mentioned:


Eleuther ā–· #research (204 messagesšŸ”„šŸ”„):

Mixture of Experts, Large Language Models, Gradient-Free Learning Methods, Hyper-connections as alternative to residual connections, Attention Sinks in LLMs

  • MoE++ Framework Achieves Expert Throughput: A new MoE++ framework integrates Feed-Forward Network (FFN) and zero-computation experts (zero expert, copy expert, and constant expert) for enhanced effectiveness and efficiency, achieving 1.1Ɨ to 2.1Ɨ expert forward throughput compared to vanilla MoE models, according to this research paper.
    • The design of MoE++ offers advantages such as Low Computing Overhead by enabling dynamic token engagement, unlike uniform mixing in vanilla MoE (a sketch of the three zero-computation experts follows this list).
  • NoProp Offers Gradient-Free Learning: A new learning method named NoProp, which does not rely on either forward or backwards propagation and takes inspiration from diffusion and flow matching methods, learns to denoise a noisy target at each layer independently, described in this paper.
    • There’s a GitHub implementation by lucidrains, along with a discussion noting that the pseudocode at the end of the paper still applies the actual updates with gradient-based methods.
  • Meta releases Llama 4: Meta announced the Llama 4 family of models, including Llama 4 Scout, a 17 billion parameter model with 16 experts and a 10M token context window, outperforming Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 in its class, as noted in this blog post.
    • Llama 4 Scout was trained on a mix of publicly available, licensed data and information from Meta’s products and services, including posts from Instagram and Facebook and people’s interactions with Meta AI.
  • Hyper-Connections Offer Alternative to Residual Connections: Hyper-connections serve as an alternative to residual connections, addressing the seesaw effect between gradient vanishing and representation collapse, as outlined in this paper.
    • The architecture is simple like an unrolled diffusion model and the magic here is more about the independence of each layer wrt each other.
  • Attention Sinks in LLMs Prevent Over-Mixing: A recent paper argues that attention sinks, where LLMs attend heavily to the first token in the sequence, act as a mechanism that enables LLMs to avoid over-mixing, detailed in this paper.
    • An earlier paper (https://arxiv.org/abs/2502.00919) showed that attention sinks utilize outlier features to: catch a sequence of tokens, tag the captured tokens by applying a common perturbation, and then release the tokens back into the residual stream, where the tagged tokens are eventually retrieved.
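
To make the zero-computation experts concrete, a hedged PyTorch sketch of the three variants as described (illustrative, not the paper’s code):

import torch
import torch.nn as nn

class ZeroExpert(nn.Module):        # discard: output zeros, no computation
    def forward(self, x):
        return torch.zeros_like(x)

class CopyExpert(nn.Module):        # skip: pass the token through unchanged
    def forward(self, x):
        return x

class ConstantExpert(nn.Module):    # replace: emit a learned constant vector
    def __init__(self, d_model):
        super().__init__()
        self.const = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        return self.const.expand_as(x)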

Links mentioned:


Eleuther ā–· #interpretability-general (17 messagesšŸ”„):

Polytope lens for NNs, ReLU networks geometry, Machine Unlearning Workshop, Origami view of NNs, Expressivity of Deep Networks

  • Polytope Perspective Powers Neural Net Pondering: A member shared a blog post discussing a geometrical approach to neural networks, advocating for the polytope lens as the right perspective, linking to a previous post on the ā€œorigami view of NNsā€.
  • ReLU Network Regions Reveal Reason: A member shared Boris Hanin’s paper demonstrating mathematical properties of ReLU networks, specifically studying the geometry of their constant regions.
    • They highlighted a figure from the paper as their ā€œmain reason for loving the paper,ā€ referencing the expressivity of deep networks and the number of activation patterns.
  • Hyperplane Harmony: Neural Nets’ Natural Nuance: A member posited that neural nets, especially ReLUs, have an implicit bias against overfitting due to carving the input space along hyperplanes, which becomes more effective in higher dimensions.
    • They argued that simpler configurations using hyperplanes efficiently are preferred by the optimizer, contrasting with learning schemes like spline bases that suffer from the curse of dimensionality.
  • Unlearning Urgency: Machine Mind Management: A member linked to the ICML Machine Unlearning Workshop which focuses on the challenges of removing sensitive data from Generative AI models trained on internet-scale datasets.
    • The workshop aims to advance robust, verifiable unlearning methods to address privacy, security, and legal concerns like the EU’s GDPR.

Links mentioned:


Eleuther ā–· #lm-thunderdome (19 messagesšŸ”„):

lm-eval-harness EOS token, Llama 2 vs Llama 3 IFEval Score, Huggingface tokenization

  • EOS token Accuracy Anomaly Appears: A member tried adding an EOS token to data instances in lm-eval-harness for the social_iqa task, and the eval accuracy dropped by 18 points.
    • It was suggested to add self.eot_token_id to the continuation_enc here only for the continuations and not context.
  • IFEval Score: Llama 2’s Odd Dominance: A member compared Llama 2 vs. Llama 3.1 and 3.2 models and noticed Llama 2 has a much higher IFEval score, which seemed weird for a base model given the HF leaderboard.
    • It turns out the benchmark is simply unsuitable for base models: they just continue writing the question, and somehow that is scored as correct.
  • Huggingface Tokenization Troubleshoot: Members discussed Huggingface tokenization, and how it happens in HFLM.tok_encode.
    • One noted that for BOS you can pass add_bos_token to the model args.

Links mentioned:


HuggingFace ā–· #announcements (1 messages):

huggingface_hub v0.30.0, monoELECTRA reranker models, YourBench Custom Evals, Jetson Robot, Accelerate v1.6.0

  • Huggingface Hub gets Biggest Update Ever!: The huggingface_hub v0.30.0 release introduces a next-gen Git LFS alternative and new inference providers.
    • This release is the biggest update in two years!
  • MonoELECTRA Rerankers Ported to Sentence Transformers: monoELECTRA-{base, large} reranker models from @fschlatt1 & the research network Webis Group are now available in Sentence Transformers.
    • These models were distilled from LLMs like RankZephyr and RankGPT4, as described in the Rank-DistiLLM paper.
  • YourBench builds Custom Evals Instantly: YourBench allows users to build custom evals using their private docs to assess fine-tuned models on unique tasks (announcement).
    • The tool is game-changing for LLM evaluation.
  • Gradio Surpasses 1 Million Developers!: Gradio, a Python library for building AI web apps, is now used by over 1 million developers each month (announcement).
    • The library has been adopted by popular open-source projects like Automatic1111, Fooocus, and LLaMA-Factory.

Links mentioned:

  • Release Xet is here! (+ many cool Inference-related things!) Ā· huggingface/huggingface_hub: šŸš€ Ready. Xet. Go!This might just be our biggest update in the past two years! Xet is a groundbreaking new protocol for storing large objects in Git repositories, designed to replace Git LFS. Unlik...
  • Tweet from tomaarsen (@tomaarsen): I've just ported the excellent monoELECTRA-{base, large} reranker models from @fschlatt1 & the research network Webis Group to Sentence Transformers!These models were introduced in the Rank-DistiL...
  • Tweet from Nathan (@nathanhabib1011): šŸš€ Introducing ✨ YourBench ✨ ! Build custom evals instantly using your private docs & see how your custom fine-tuned models perform on your unique tasks.Congrats to @sumukx @clefourrier and @ailozovsk...
  • Tweet from Remi Cadene (@RemiCadene): Jetson @nvidia's version of our robot is available!Compute is now on-board like a @Tesla car with FSD šŸš—Importantly, we rethink the control interface, so that you can view the video stream with th...
  • Tweet from Marc Sun (@_marcsun): accelerate v1.6.0 is out with lots of nice features ! - FSDPv2 support by @m_sirovatka, our incredible intern ! - DeepSpeed + tensor parallel support by the DeepSpeed team- XCCL distributed backend fo...
  • Tweet from Harry Mellor (@hmellor_): The @vllm_project now has a user forum which you can find at https://discuss.vllm.ai/Its fledgling community is still growing but I encourage all users to go there for their usage focused Q&A!
  • Reddit - The heart of the internet: no description found
  • Tweet from Orr Zohar (@orr_zohar): Excited to see SmolVLM powering BMC-SmolVLM in the latest BIOMEDICA update! At just 2.2B params, it matches 7-13B biomedical VLMs. Check out the full release: @huggingface #smolvlmQuoting Alejandro Lo...
  • Tweet from Unsloth AI (@UnslothAI): We partnered with @HuggingFace to teach you how to fine-tune LLMs with GRPO!Learn about:• Reward functions + creating them• GRPO Math + Free Reasoning training in Colab• Applying RL to real-world use ...
  • Tweet from AK (@_akhaliq): vibe coding AI apps for free has never been easier100% open source app, DeepSite on Hugging Face
  • Tweet from Ben Burtenshaw (@ben_burtenshaw): Welcome to the LLM Course!Education has always been at the heart of Hugging Face’s mission to democratize AI and we’re doubling down on that by giving http://hf.co/learn a big upgrade!
  • Tweet from Sergio Paniego (@SergioPaniego): šŸ†•New Unit in the Agents Course @huggingface. We just released the first Use Case on Agentic RAG—where we compare three frameworks side by side:šŸ¤ smolagentsšŸ¦™ @llama_index🦜 LangGraph (@LangChainAI)⬇...
  • Tweet from Abubakar Abid (@abidlabs): JOURNEY TO 1 MILLION DEVELOPERS5 years ago, we launched @Gradio as a simple Python library to let researchers at Stanford easily demo computer vision models with a web interface. Today, Gradio is used...

HuggingFace ā–· #general (169 messagesšŸ”„šŸ”„):

Llama-4-Scout vs Mistral Small 3.1, AI Engineer Interview, Deepmind created AGI Internally?, Fine Tuning Quantized Models, Huggingchat 500 error

  • Llama-4-Scout or Mistral Small 3.1: Which is Better?: Mistral Small 3.1 adds vision understanding and enhances context up to 128k tokens.
    • A member suggested Llama-4-Scout is better, but noted its much larger VRAM footprint (they estimated 16Ɨ17B; the model is 109B total parameters).
  • AI Engineer Interview Code Section: A community member asks about what the code portion of an AI engineer interview looks like.
    • Another member pointed to the scikit-learn library.
  • Rumors of Deepmind created AGI Internally: A member in another discord said Google will release yet another powerful model next week and it will be even better than gemini 2.5 pro exp.
    • They also claimed that Deepmind created an AGI internally; however, this member later stated he doesn’t trust this person anymore.
  • Is Fine Tuning Quantized Models Challenging?: A member asked about fine-tuning quantized models, and the community gave varied advice, with some pointing to QLoRA, Unsloth, and bitsandbytes as potential solutions (see the sketch after this list); check out the Unsloth fine-tuning guide.
    • Another stated that you can only do so using LoRA, since GGUF is an inference-optimized format, not designed for training workflows.
  • Huggingchat experiencing 500 Error: Users reported that Huggingchat is experiencing a 500 error.
    • A member stated that an issue was raised and pointed to workarounds being discussed on discord.
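
For reference, a minimal QLoRA-style sketch with transformers, peft, and bitsandbytes (the model name and LoRA hyperparameters are placeholders): low-rank adapters train on top of a frozen 4-bit base:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B",  # placeholder
                                             quantization_config=bnb)
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)  # adapters train; the 4-bit base stays frozen
model.print_trainable_parameters()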

Links mentioned:


HuggingFace ā–· #today-im-learning (16 messagesšŸ”„):

LLM Development, Sebastian Raschka Book, Andrej Karpathy Video, NLP course chapter 3

  • Community Member Seeks LLM Dev Guidance: A community member asked where to start developing a 100M parameter LLM, given a background in Data Science and ML.
    • Suggestions included starting with NLP or DL, or finding a specific course to follow.
  • Sebastian Raschka’s Book Recommended for LLM Building: The book Build a Large Language Model (From Scratch) by Sebastian Raschka was recommended for learning to build LLMs from scratch.
    • One member shared that their workplace started a book club around it, and another mentioned having ordered the same book.
  • Andrej Karpathy’s GPT Reproduction Video Sparks Discussion: A video by Andrej Karpathy, Let’s reproduce GPT-2 (124M), was suggested as a good resource.
    • However, the original poster felt that Karpathy started copy-pasting code without explaining much, so they stopped watching it.
  • Assisted Pre-training and Shared Embeddings: One member suggests initializing weights and using the same tokenizer from another model, kinda like an ā€˜assisted’ pre-training.
    • They also proposed sharing embeddings and maybe the linear layer to potentially expedite the LLM development process.

Link mentioned: Let’s reproduce GPT-2 (124M): We reproduce the GPT-2 (124M) from scratch. This video covers the whole process: First we build the GPT-2 network, then we optimize its training to be really…


HuggingFace ā–· #cool-finds (2 messages):

Windows CLI, Virtual Environment Reset, LocalAI, Dify

  • CLI Fu for Virtual Env Reset: A quick Windows CLI command to reset your virtual environment is pip freeze | Select-String -Pattern "^(?!pip)" | ForEach-Object { pip uninstall -y $_.ToString().Trim() }.
    • This snippet helps clean up the environment by uninstalling packages, excluding pip itself, streamlining the process for a fresh start, according to a blog post.

Link mentioned: The Complete Roadmap to Mastering Agentic AI in 2025 | Girish Kotte: Discover a comprehensive 12-step roadmap to mastering agentic AI in 2025. Learn everything from basic concepts to advanced deployment techniques with resource links for each stage. Perfect for develop…


HuggingFace ā–· #i-made-this (8 messagesšŸ”„):

MCP Server and RAG Application, Osyllabi AI Curriculum, DocQuery AI Documentation Search, Municipal Law Dataset, LlamaResearcher with Llama-4

  • MCP Server + RAG App Debut: A member built a MCP server and client, connected via ngrok, along with a simple RAG application for QA with markdown documentation from a GitHub repository, showcased on LinkedIn.
  • Osyllabi: AI Curriculum Crafter Hits GitHub: A member shared Osyllabi, a Python app for AI-driven personalized curriculums using web crawling and data integration, powered by Ollama, HuggingFace, Langchain, and Llama-Index, available on GitHub.
    • It features AI-driven curriculum generation, advanced web crawling, seamless integration with educational platforms, customizable learning paths, and flexible export options.
  • DocQuery Transforms Documentation to Knowledgebase: A member shared DocQuery, which turns documentation markdown into a knowledgebase, is available on GitHub.
    • DocQuery offers improved searchability, a smart Q&A system, and streamlined knowledge management for development teams.
  • Municipal Law Dataset Surfaces: A member shared the American Municipal Law dataset on Hugging Face Datasets, comprising municipal and county laws from across the United States in parquet format, organized by location’s GNIS id.
    • Access requires agreeing to share contact information.
  • LlamaResearcher: Llama-4 Powers Deep Research: A member introduced LlamaResearcher (llamaresearcher.com), a deep-research AI companion powered by Llama 4 and Groq, which expands queries into sub-queries, searches the web, and produces essays with source citations.
    • The project is open-source and Docker-ready, available on GitHub, and utilizes LlamaIndex, Groq, Linkup, FastAPI, Redis, and Gradio.

Links mentioned:


HuggingFace ā–· #computer-vision (5 messages):

Data Annotation for OCR, VLM Fine-Tuning for Handwritten Text, Combining OCR Techniques with VLMs, Roboflow for managing images and labels, MS-Swift and PEFT/Unsloth Approaches

  • VLM Models Aid Handwritten Text OCR: A member is seeking methods for data annotation to fine-tune VLM models on handwritten text images, opting to move away from traditional OCR models and needing true text labels for training.
    • They are considering tools and methods to generate or correct text labels from images for fine-tuning purposes.
  • Classic OCR and Open VLMs Combine for annotation: A member combined classic OCR techniques with open VLMs like InternVL2_5 and Qwen2.5 to generate initial annotations for extracting structured data from Brazilian documents.
    • Manual review was performed to correct errors after using OCR/VLM, and closed-source models like Gemini were noted to potentially provide higher-quality pre-annotations.
  • Roboflow Manages Images and Labels Effectively: A member managed and stored raw images and corrected labels using Roboflow, annotating 510 images which were augmented to 1218 examples.
    • Despite finding its interaction not ideal, they used Roboflow for managing the dataset.
  • MS-Swift and PEFT/Unsloth Enhance Fine-Tuning: A member fine-tuned several models using MS-Swift and experimented with PEFT and Unsloth approaches, achieving superior performance compared to Gemini and OCR methods with models adjusted from 1B to 7B.
    • The member successfully fine-tuned models, highlighting the effectiveness of these frameworks.
  • Tesseract OCR and Label Studio Join Forces: One member is considering using Tesseract OCR followed by Label Studio for refining annotations.
    • They also checked Gemma 3 and found it effective, implying a combination of automated and manual approaches for data annotation.

HuggingFace ā–· #NLP (5 messages):

Text Extraction from PDFs, Docling, SmolDocling, RolmOCR, Sci-BERT

  • PDF Text Extraction Advice Sought: A member is seeking advice on improving text extraction from PDFs, specifically research papers, as their current results are unsatisfactory.
    • They have been using regex for section outline extraction but are facing challenges with fonts, headers, and footers, impacting the usability of the extracted content for Sci-BERT embeddings due to token limits.
  • Docling and SmolDocling recommended for Text Extraction: A member recommends Docling (GitHub) and SmolDocling (HuggingFace) for improved text extraction from PDFs; a minimal usage sketch follows this list.
    • They note that while these tools still make errors, especially with images, they have yielded good results, with SmolDocling being an ultra-compact vision-language model for end-to-end multi-modal document conversion, as highlighted in their paper.
  • RolmOCR Model Based on Qwen 2.5 VL Released: A member mentions the release of RolmOCR (HuggingFace), a new model based on Qwen 2.5 VL, for OCR tasks.
    • However, they haven’t personally tested it yet, but suggest it as a potential tool for text extraction.
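
A minimal usage sketch, assuming Docling’s documented API (the PDF path is a placeholder):

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("paper.pdf")            # placeholder path
print(result.document.export_to_markdown()[:500])  # layout-aware text extraction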

Links mentioned:


HuggingFace ā–· #smol-course (24 messagesšŸ”„):

OpenWeatherMap API, ISO 3166-1 alpha-2 code, Qwen/Qwen2.5-Coder-32B-Instruct Alternatives, Hugging Face Token for Agent Creation, llm-course Channel

  • Geolocation API vs Static Country Code Dictionary: A member is building a tool to fetch weather conditions using the OpenWeatherMap API and is debating whether to use the GeoCoding API and another API for ISO 3166-1 alpha-2 codes, or to use a static dictionary.
  • Free alternative to Qwen/Qwen2.5-Coder-32B-Instruct?: A member asked for a free alternative to Qwen/Qwen2.5-Coder-32B-Instruct.
    • Another member pointed out that the model itself is free under the Apache 2.0 license (Hugging Face Link) but suggested Together AI or Groq for free API access, noting potential rate limits of around 60 RPM.
  • Guidance on Hugging Face Token for Agent Creation: A member requested guidance on obtaining a Hugging Face token for agent creation in Unit 1 of a course.
  • llm-course Channel Request: A member inquired about the possibility of opening a dedicated channel for an LLM course.
  • Help needed with AI agents course setup: A member requested assistance with a code issue encountered in Unit 1 of an AI agents course, specifically related to HF token settings in Colab.

Links mentioned:


HuggingFace ā–· #agents-course (36 messagesšŸ”„):

MCP in Agent Course, Inference Usage Costs, Gemini Models, Course Feedback, Hallucination in Agents

  • MCP barely mentioned in Agent Course: A user inquired about learning MCP in the agent course, but was informed that there’s no dedicated section, although MCP servers are briefly mentioned in unit 2.1 (smolagents) and unit 2.2 (llamaindex).
  • Inference costs incurred!: A user accidentally maxed out their Inference Usage Due Balance and inquired about payment.
    • The suggestion was made to check the questions channel for a FAQ, or to use a local or cheaper hosted alternative.
  • Gemini Models may be your savior: A user facing issues with Code_agents notebook in Chapter 2 due to payment requirements was advised to try using Gemini models.
    • It was noted that Gemini models can be used for free in many countries, with a link to course notes providing instructions.
  • Course Experience: Good but Buggy: A user summarized the course as full of good material but noted that many notebooks and code snippets don’t work, including a now infamous coding test in Unit 2, with no instructor presence.
    • The suggestion was made to approach the course sceptically, focus on understanding the coding parts, and acquire the necessary accounts and API tokens.
  • Explain the halluuuucinations!: Users sought clarification on an example of hallucination in an agent.
    • The explanation provided was that the agent, lacking access to weather data, fabricated the answer, and the solution involves equipping the agent with a tool to retrieve weather information.

Links mentioned:


Yannick Kilcher ā–· #general (177 messagesšŸ”„šŸ”„):

Grok 3, Turing Machines, Raw Binary AI training, LLama 4, Quantization Techniques

  • Grok 3 Manifold Analogy Appears: A member shared an analogy describing various approaches to NLP, contrasting 0D Manifolds (tokens), 1D Manifolds (embeddings), and a dynamic signal approach where language is seen as a rushing and swirling river with no rigid bounds.
  • Raw Binary AI Training Discussed: Members discuss training AI on raw binary data to directly output file formats like mp3 or wav, with one member noting that this approach works based on discrete mathematics like Turing machines.
    • Another argued that current AI models are far from Turing-complete, while the original poster explained that the AI doesn’t need to be Turing-complete to output appropriate tokens as responses.
  • New Llama 4 Models Released: Llama 4 Scout boasts a 10 million token context window, 17B active parameters, and 109B total parameters, while Llama 4 Maverick offers a 1M context length, 17B active parameters, and 400B total parameters, and Llama 4 Behemoth features 2 trillion parameters.
    • Members express skepticism about the 10M context window claim, the new license, and question if recent models are RL’ed or just base + SFT models, pointing out performance issues and mixed benchmarks.
  • Self-Principled Critique Tuning Explored: Self-Principled Critique Tuning (SPCT) from DeepSeek is a new reward-model system where an LLM prompted with automatically developed principles of reasoning generates critiques of CoT output based on those principles.
  • Quantization Techniques Examined: Members discuss novel quantization techniques for large language models, pointing to a paper on the topic.
    • It was argued that quantization can serve as a compromise between maintaining a super long context length and being able to serve the model, but comes with decay in the value you are actually getting out of those long contexts.

Links mentioned:


Yannick Kilcher ā–· #paper-discussion (28 messagesšŸ”„):

Llama 4, DeepSeek Paper, PaperBench, Text Diffusion

  • Llama 4 Omni Wakes Up: A member shared the Llama 4 documentation, followed by a link to Meta’s blogpost on Llama 4’s Multimodal Intelligence.
    • The Llama 4 Scout model boasts 17 billion active parameters, 16 experts, and an industry-leading context window of 10M, outperforming models like Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1.
  • PaperBench: OpenAI’s Replication Benchmark: A member shared an article about OpenAI’s PaperBench benchmark, designed to test AI agents’ ability to replicate cutting-edge machine learning research papers from scratch.
    • The benchmark evaluates agents on reproducing entire ML papers from ICML 2024, with automatic grading using LLM judges and fine-grained rubrics co-designed with the original authors.
  • DeepSeek Paper Time: Members are planning to go over the first DeepSeek paper in an hour, with a link to the paper provided (https://arxiv.org/abs/2401.02954).
  • Text Diffusion Steers Auto-Regressive LMs: Members are planning to discuss a paper (https://arxiv.org/abs/2408.04220) on using a guided diffusion model to steer an auto-regressive language model to generate text with desired properties.

Links mentioned:


Yannick Kilcher ā–· #ml-news (17 messagesšŸ”„):

GPT-6 release, Llama 4, Mindcraft Update, Adapting pre-training text, diffusion modeling to control LLMs

  • GPT-6 Coming Soon (Maybe?): A user jokingly announced the release of GPT-6 yesterday, followed by O0 and OO in the next few weeks, citing difficulties with GPT-5.
    • This sparked humorous reactions, with another user quipping that ā€œreleaseā€ doesn’t mean actually releasing the weights, ā€œlike a company that is open about AI.ā€
  • Llama 4 Arrives with 10M Context: Llama 4 Maverick is billed as the most intelligent multimodal OSS model in its class, a 17 billion active parameter model with 128 experts, according to llama.com.
    • The lineup is said to be more powerful than all previous generation Llama models, with Llama 4 Scout fitting in a single NVIDIA H100 GPU, boasting the 10M context window, and surpassing Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1.
  • Mindcraft Update Sees the Bots!: A member shared a YouTube video titled ā€œVision and Vibe Coding | Mindcraft Updateā€.
    • The video description included a link to Tripo AI, offering extra credits for the first 300 sign-ups using the code R2XF70.
  • LLMs Trained for Database Lookups: A member mentioned adapting pre-training text to include database lookups for relevant facts, to train the LLM to look things up during generation, citing this video.
  • Diffusion Modeling Now Controls LLMs: Users discussed using diffusion modeling to control LLMs, referencing the paper ā€œDiffusion-LM Improves Controllable Text Generationā€.

Links mentioned:


GPU MODE ā–· #general (17 messagesšŸ”„):

CUDA Python Package, Vectorized Memory Access, Llama-4 Router Normalization, High RAM/VRAM SSH Access

  • CUDA Python Package Debuts: Nvidia released the CUDA Python package, offering Cython/Python wrappers for CUDA driver and runtime APIs, installable via PIP and Conda.
    • It’s intended to unify the Python CUDA ecosystem, providing full coverage and access to the CUDA host APIs from Python, mainly benefiting library developers needing to interface with C++ APIs.
  • Vectorized Memory Access Practices Sought: Members discussed best practices for vectorized memory access when working with dynamic shapes, specifically in matrix multiplication with dynamic dimensions m, n, and k.
    • The discussion mentioned Cutlass support and efficient vectorized loads as potential solutions.
  • Llama-4 Router Normalization Examined: The channel discussed whether Llama-4 uses router normalization, similar to how DeepSeek V3 and Mixtral do with their topk_weights normalization.
    • It was noted that Llama-4 skips the normalization, potentially because it uses top_k = 1, and both DeepSeek V3 and Llama 4 use sigmoid for the router logits (a sketch contrasting the conventions follows this list).
  • High RAM/VRAM SSH Access Needed for Testing: A member sought access to an SSH-like instance with at least 500GB of RAM/VRAM for a couple of hours to test a model in SGL.
    • They have GPU credits from Modal and inquired about SSH access to a container.
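
A hedged sketch contrasting the two router conventions (illustrative, not any model’s actual code); note that with top_k = 1, renormalizing would force the single weight to exactly 1.0 and erase the learned gate magnitude, which may be why Llama-4 skips it:

import torch
import torch.nn.functional as F

def route(logits, top_k, normalize, use_sigmoid):
    # Score experts, keep the top-k, optionally renormalize the kept weights.
    scores = torch.sigmoid(logits) if use_sigmoid else F.softmax(logits, dim=-1)
    weights, experts = scores.topk(top_k, dim=-1)
    if normalize:                       # Mixtral / DeepSeek V3 style renorm
        weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, experts             # Llama-4 reportedly skips this renorm

logits = torch.randn(2, 8)              # 2 tokens, 8 experts
print(route(logits, top_k=1, normalize=False, use_sigmoid=True))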

Link mentioned: CUDA Python: CUDA Python provides uniform APIs and bindings to our partners for inclusion into their Numba-optimized toolkits and libraries to simplify GPU-based parallel processing for HPC, data science, and AI.


GPU MODE ā–· #triton (18 messagesšŸ”„):

Triton Kernel Debugging, GPU Assembly Debugging, Grayscale Kernel Writing, Block Index Creation, Data Transposing

  • Triton Kernel Debugging Step-by-Step: A first-time poster inquired about debugging Triton kernels step by step, specifically addressing issues with cdiv and fill zeros when running in interpret mode (TRITON_INTERPRET=1).
    • An alternative suggestion involved diving into GPU assembly, setting breakpoints in the Python file using either cuda gdb or roc gdb, and single-stepping through the assembly file.
  • GPU Assembly Debugging with VSCode: A member asked about using the VSCode debugger instead of only cuda gdb for debugging GPU assembly.
    • It was noted that running cuda gdb and passing in the Python arguments is required, but the convenience and readability of the VSCode debugger is desired.
  • Grayscale Kernel Writing Block Index: A member described an attempt to write a grayscale kernel for a (K, K, 3) input, aiming to get blocks of (BLOCK_K, BLOCK_K, 3) in Triton.
    • However, they faced challenges with tl.arange(0, 3) because 3 is not a power of 2.
  • Loading Nx3 Blocks: A member asked how to load an Nx3 block, as tl.arange won’t work since 3 is not a power of 2.
    • One suggestion involved loading the data three times and incrementing the range by image_w * image_h, with another member suggesting that adding 1 to all indexes should work (a minimal sketch of the three-load approach follows this list).
  • Data Transposing for Contiguous Data: A member considered transposing data with Torch for a contest, but they were concerned about abusing strides for loading contiguous data.
    • It was suggested that transposing with Torch is acceptable for the contest, as the original tensor will be contiguous and transposing will only be symbolic.
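
A minimal sketch of the three-load workaround for a (K, K, 3) HWC input, using a per-pixel stride of 3 so each channel gets its own power-of-2-friendly load (shapes and grayscale weights are illustrative):

import torch
import triton
import triton.language as tl

@triton.jit
def grayscale_kernel(x_ptr, out_ptr, n_pixels, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_pixels
    # HWC layout: channel c of pixel i lives at 3 * i + c, so three
    # separate loads sidestep the non-power-of-2 tl.arange(0, 3).
    r = tl.load(x_ptr + 3 * offs + 0, mask=mask)
    g = tl.load(x_ptr + 3 * offs + 1, mask=mask)
    b = tl.load(x_ptr + 3 * offs + 2, mask=mask)
    tl.store(out_ptr + offs, 0.2989 * r + 0.587 * g + 0.114 * b, mask=mask)

x = torch.rand(512, 512, 3, device="cuda")
out = torch.empty(512, 512, device="cuda")
n = x.numel() // 3
grayscale_kernel[(triton.cdiv(n, 1024),)](x, out, n, BLOCK=1024)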

GPU MODE ā–· #cuda (18 messagesšŸ”„):

CUDA debugger, nvshmem + mpi, nvbench and ubuntu 24.04, Shared memory access in CUDA, cute::copy and tiled_copy behavior

  • CUDA Debugging Delight: A user confirmed CUDA’s debugger works very similarly to GDB CLI.
    • Another member inquired about the release date for cutile, announced at GTC this year.
  • nvshmem + MPI Race Condition: A member reported race conditions and hangs when running nvshmem + mpi with one more process than the number of GPUs, both with and without MPS.
    • They were running mpirun -np 5 ./myapp on a system with 4 GPUs and asked if anyone had a solution.
  • nvbench Bumps CMake Requirement: NVBench kind-of dropped support for Ubuntu 24.04 because it requires a minimum CMake version of 3.30, while Ubuntu 24.04 comes with 3.28.
  • Shared Memory Broadcasts in CUDA: In response to a question about shared memory access in CUDA, it was confirmed that there are broadcasts and multicasts from shared memory.
  • Cute Copy Oddity: A user found a strange behavior of cute::copy regarding tiled_copy, where all threads in a warp collectively copy data from shared memory to registers, instead of each thread copying its corresponding data.
    • An attached image demonstrated unexpected data arrangements in registers after the copy operation.

Links mentioned:


GPU MODE ā–· #torch (10 messagesšŸ”„):

torch compile backend, libtorch, mojo, torchscript, gelu+mul fusion

  • Graphviz Backends Not Ready for Torch Compile: A member inquired about a torch.compile backend that spits out graphviz, and another responded that they are moving towards producing libtorch-free binaries using torch.compile.
    • They further claimed that there’s no clever way of loading the model with TorchScript.
  • Mojo Unlikely to Bypass Python’s GIL: A member asked if anyone has used mojo to bypass Python’s GIL.
    • No response was provided, so it’s safe to assume the answer is NO.
  • Compiling Gelu+Mul Fusion for Benchmarking: A member asked how to get torch.compile to correctly and reliably fuse gelu+mul for benchmarking purposes, using PyTorch version 2.8, to compare against their Triton kernel.
    • No response was provided, so it’s safe to assume the fusion is proving difficult!
  • DDP/FSDP and Compilation Conventions: A member inquired about the general convention for compiling a model before wrapping it around DDP/FSDP1/FSDP2.
    • Another member pointed to torchtitan’s implementation as a reference, which does a weird per-block compile thing beforehand, possibly to work around some torch compile bugs.
  • Numerical Issues Plague FSDP: A member reported having problems with numerical issues with FSDP and has disabled torch compile completely.
    • They claim that it doesn’t do a lot for them but the torchtitan authors need to compile the flex attention and hopefully fuse some of their sequence parallel TP stuff, and the block-wrapping was a compromise.

Link mentioned: torchtitan/torchtitan/models/llama3/parallelize_llama.py at main Ā· pytorch/torchtitan: A PyTorch native library for large model training. Contribute to pytorch/torchtitan development by creating an account on GitHub.


GPU MODE ā–· #announcements (1 messages):

GPU Mode Website, Active Leaderboards, Website Feedback

  • GPU Mode Launches New Website: Thanks to the hard work of two members, GPU Mode launched a new website.
    • The website includes active leaderboards, links to lectures on YouTube, and their GitHub repo.
  • Leaderboard Status Shows H100 Dominance: The website features active leaderboards for A100, T4, H100, and L4 GPUs, with several leaderboards showing results for H100.
    • For example, in one leaderboard ending in 21 days, ajhinh ranked first on H100 with 7574.126μs.
  • Feedback Wanted on Website Features: The team is soliciting feedback on what to add to the website.
    • Current features include leaderboard statuses, YouTube lectures, and the GitHub repo; feedback can be provided in a designated channel.

Link mentioned: Leaderboards – GPU MODE: no description found


Llama 4, Triton Distributed, Tensara Triton Support, AMD Instinct MI325X Performance

  • Llama 4 Arrives with Multimodal Prowess: Meta introduces Llama 4, the latest iteration, boasting enhanced personalized multimodal experiences and featuring Llama 4 Scout, a 17 billion parameter model with 16 experts (blog post here).
    • It claims to outperform Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 and fit on a single NVIDIA H100 GPU, with an industry-leading context window of 10M.
  • ByteDance Releases Triton-distributed for Parallel Systems: ByteDance-Seed releases Triton-distributed, designed to extend the usability of Triton language (github here).
    • The new release is for parallel systems development.
  • Tensara Adds Triton Support for GPU Kernel Challenges: Tensara now supports Triton, inviting users to compete in kernel optimization challenges and climb global leaderboards (homepage here).
    • Recent updates include PyTorch-based test cases, 3D/4D Tensor matmul problems, and activation functions like Sigmoid and Tanh.
  • AMD’s Instinct MI325X Shows Strong MLPerf Inference Performance: AMD Instinctā„¢ MI325X GPUs demonstrate robust performance in MLPerf Inference v5.0, excelling in GenAI, LLMs, and reasoning models (blog here).
    • Results indicate a necessity for innovative GPU architectures tailored for AI transformation.

Links mentioned:


GPU MODE ā–· #jobs (6 messages):

Qualcomm AI Engineer Hiring, Suno ML roles and H100 resources, Zero latency music creation

  • Qualcomm Seeks AI Engineer Team Lead: Qualcomm is hiring an AI Engineer/Team Lead with a strong background in deep learning to design/deploy SOTA models, focusing on accuracy-latency Pareto optimality.
    • Interested candidates are asked to provide a short summary along with their CV or portfolio.
  • Suno’s ML Talent Hunt: Suno is hiring for all ML related roles, touting a small, well-resourced team with hundreds of H100s per researcher.
    • They are targeting zero latency music creation so that people can jam with AI in real time.
  • Zero Latency Music Creation Sounds Sick: Suno aims to achieve zero latency music creation, enabling real-time AI jamming.
    • A user expressed hope that Suno could be a VSTi in Ableton.
  • Suno Internships Abound: A user asked about internship opportunities at Suno, praising the platform.
    • No response was given.

GPU MODE ā–· #beginner (19 messagesšŸ”„):

Centralized GPU programming language, OpenCL and SYCL, ROCm and HIP, 4-bit operations in CUDA for LLMs, Performance roofline models and arithmetic intensity

  • Why One GPU Programming Language Doesn’t Rule Them All: A newbie in GPU programming inquired why there isn’t a centralized GPU programming language like C given the existence of CUDA for NVIDIA and ROCm for AMD.
    • An expert explained that OpenCL and SYCL exist but aren’t mainstream due to poor support from vendors like NVIDIA, suggesting that the interface for OpenCL is old and C-adjacent.
  • ROCm’s Dual Nature: AMD’s CUDA Toolkit and HIP: ROCm is AMD’s CUDA Toolkit, while HIP is AMD’s CUDA C++ that supports Nvidia hardware and compiling to PTX, but not Intel or others.
    • This offers a degree of cross-platform capability, though not universally.
  • Navigating 4-Bit Operations in CUDA for LLMs: A user inquired about how to perform 4-bit operations in CUDA for LLMs, such as matmul.
    • Another member recommended asking in the specific CUDA channel and being more specific about the operations.
  • Deciphering Arithmetic Intensity in Performance Roofline Models: A member questioned the common practice of calculating bytes accessed in GEMM by summing matrix sizes (MN + MK + KN) for arithmetic intensity in performance roofline models.
    • Another member clarified that this is a simplification for establishing a theoretical maximum, and that it is realistic for newer GPUs with large L2 caches, where one input matrix may fit entirely in L2 (a worked example follows this list).
  • Jumpstart CUDA Learning with Custom Projects: A user asked for beginner-friendly CUDA projects and another user suggested learning through stuff YOU find interesting.
    • It was recommended creating something that requires a decent amount of multithreading or parallelism, such as linear algebra operations without a library, to simulate the concept of pipelining.
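
A quick worked example of that simplification, counting 2MNK FLOPs (one multiply and one add per output MAC) against a single read of A and B plus a single write of C:

def gemm_arithmetic_intensity(M, N, K, bytes_per_elem=2):
    # Roofline-style estimate for C = A @ B, fp16 elements by default.
    flops = 2 * M * N * K
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)  # A, B read; C written
    return flops / bytes_moved

for n in (256, 1024, 4096):
    # Grows as n/3 for square fp16 GEMMs: ~85, ~341, ~1365 FLOPs/byte.
    print(n, round(gemm_arithmetic_intensity(n, n, n), 1))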

GPU MODE ā–· #torchao (2 messages):

Int4WeightOnlyConfig, torch.compile for speedup, Compiling individual submodules

  • Dequant with Int4WeightOnlyConfig benefits from torch.compile: A member was trying to integrate Int4WeightOnlyConfig and asked if torch.compile is needed to speed up the dequant process.
    • Another member suggested that they can try to compile individual submodules by calling torch.compile on the submodules.
  • torch.compile submodules for efficiency: To compile only the int4 modules, a member suggests iterating through the model’s named modules and using torch.compile on specific submodules, such as torch.nn.Linear.
    • The suggested code snippet is:
for name, module in list(model.named_modules()):
    if isinstance(module, torch.nn.Linear):
        # named_modules yields dotted paths, so resolve the parent module first
        parent_name, _, attr = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, attr, torch.compile(module))

GPU MODE ā–· #irl-meetup (3 messages):

Silicon Valley Meetups, SF Meetups, Summer Intern Meetups

  • Silicon Valley Summer Meetups?: An intern in the area asked if there would be any meetups in Silicon Valley this summer and offered to help organize one.
  • SF Meetup Planned Later This Year: A member confirmed that a meetup is being planned in San Francisco for later this year, though specific dates were not mentioned.

GPU MODE ā–· #self-promotion (37 messagesšŸ”„):

RL fine-tuning with sandboxed code interpreter, Gemma 3 QAT vs HQQ, Wavespeed AI inference API, Vector Sum CUDA Kernel optimization, Tom and Jerry video generation with transformers

  • RL Code Fine-Tuning Toolset Showcased: A member shared a toolkit for fine-tuning coding models using reinforcement learning with a local, zero-setup sandboxed code interpreter.
    • They found very promising results using a tiny fraction of data and training time versus traditional supervised fine-tuning and look forward to expanding it from Python to other languages, such as in HIP Script.
  • HQQ Quantization Beats QAT for Gemma 3: A member evaluated Gemma 3 12B QAT vs. HQQ, finding that HQQ takes a few seconds to quantize the model and outperforms the QAT version (AWQ format) while using a higher group-size.
    • With GemLite bfp16 support, quantized Gemma 3 can run faster without performance issues.
  • Wavespeed AI touts efficient inference API: The CEO of Wavespeed AI touted their platform’s fastest and most efficient AI image & video inference API such as FLUX and Wan with LoRA.
    • They offer competitive custom pricing and hope to establish a win-win model to grow together.
  • Vector Sum Kernel achieves SOTA: A member shared a blogpost and code on achieving SOTA performance for summing a vector in CUDA, reaching 97.94% of theoretical bandwidth, outperforming NVIDIA’s CUB.
  • Tom and Jerry Cartoons Generated with Diffusion Transformers: A team completed a project creating 1 minute long Tom and Jerry cartoons by finetuning a diffusion transformer, accepted to CVPR 2025, with code released on GitHub.
    • The model leverages Test-Time Training (TTT) layers within a pre-trained Transformer, enabling it to generate coherent videos from text storyboards, outperforming baselines like Mamba 2.

Links mentioned:


GPU MODE ā–· #reasoning-gym (18 messagesšŸ”„):

Curriculum Learning for Reasoning, Llama 3 vs Qwen 2.5, Dream 7B Diffusion Model, Llama 4 Maverick coding, Claude Think Tool

  • Curriculum Learning Elicits Reasoning: A member is experimenting with curriculum learning to elicit reasoning behavior in weaker LLMs like Llama-3.2-3B, by using easier reasoning tasks and gradually increasing difficulty to prime the model without SFT.
    • Another member mentioned that another user has already done some work on curriculum learning with RG and found better results compared to the same tasks without curricula, which is supported by the training/ dir in the main branch.
  • Qwen 2.5 Beats Llama 3.2 in Training: Members have been mostly using Qwen 2.5 3B over Llama 3.2 3B because Qwen seems to be a bit easier to train for reasoning.
    • This agrees with the findings in the ā€˜4 Habits’ paper, in which Llama 3.2 struggled with backtracking and sub-goal setting without first using SFT.
  • Dream 7B Diffuses Reasoning: The Dream 7B (HKU Blog Post), a diffusion based LLM, seems to show really good success on the kind of problems that the channel has, which might make it a really good candidate for gym training, especially looking at sudoku.
    • Dream 7B consistently outperforms existing diffusion language models by a large margin and matches or exceeds top-tier Autoregressive (AR) language models of similar size on the general, math, and coding abilities.
  • Llama 4 Maverick Aider Score Revealed: Llama 4 Maverick scored 16% on the Aider polyglot coding benchmark.
    • This was referenced in a message on X, discussing coding benchmarks.
  • Claude Thinks with Tool Use: A member shared a link to Anthropic’s Claude Think Tool.
    • It wasn’t specifically discussed how this relates to Reasoning Gym.

Links mentioned:


GPU MODE ā–· #gpuęØ”å¼ (3 messages):

Deepseek communication library, NVSHMEM and UVA, Peer-to-peer GPU communication

  • Deepseek Leverages NVSHMEM Library: The Deepseek communication library is built off the NVSHMEM library from NVIDIA, allowing for high-performance communication.
    • A member inquired whether NVSHMEM utilizes Unified Virtual Addressing (UVA) for intra-node, inter-GPU communication, specifically whether UVA enables peer-to-peer loads/stores to data stored in a remote GPU connected by something like NVLink.

GPU MODE ā–· #general (1 messages):

leikowo: any way to have a ptx torch extension (not cuda with inline ptx) ?


GPU MODE ā–· #submissions (24 messagesšŸ”„):

matmul Leaderboard submissions, vectoradd Benchmark Submissions, Modal Runners success, grayscale Leaderboard submissions

  • Modal Runners Deliver Matmul Masterpieces: Multiple leaderboard submissions for matmul benchmark on H100, A100, T4, L4 GPUs using Modal runners were successful, with IDs ranging from 3440 to 3453.
  • Vectoradd Victorious with Modal on L4: Several benchmark submissions for vectoradd on L4 GPUs using Modal runners succeeded, including submissions with IDs from 3464 to 3506.
  • Grayscale Gauntlet Gets Green Light: A test submission (ID 3447) and leaderboard submission (ID 3503) for grayscale benchmark on A100, H100, L4, T4 GPUs using Modal runners were successful.

GPU MODE ā–· #ppc (5 messages):

libsanitizer-collection.so, compute-sanitizer, LD_LIBRARY_PATH

  • Troubleshooter Seeks libsanitizer-collection.so Solution: A member is encountering an issue where the grader can’t find libsanitizer-collection.so when running compute-sanitizer during a ./grading test for i8mm/gpu_blm.
    • They tried setting LD_LIBRARY_PATH=/usr/lib/nvidia-cuda-toolkit/compute-sanitizer based on googling, but it had no effect.
  • Compute Sanitizer error with i8mm: A member reported a compute-sanitizer error where the system was Unable to find injection library libsanitizer-collection.so.
    • The error occurred during a test run of i8mm with the command compute-sanitizer --tool memcheck.
  • DĆ©jĆ  Vu Debugging: Another member recalled encountering the libsanitizer-collection.so issue previously.
    • They could not quite remember what the solution had been.

GPU MODE ā–· #feature-requests-and-bugs (2 messages):

Leaderboard Units, Nanos vs Millis, Discord Cluster Manager

  • Leaderboard Display Units Clash!: A user noted a discrepancy in the leaderboard’s time units: the web leaderboard displays nanoseconds while the Discord leaderboard shows milliseconds.
    • A member responded that a new leaderboard website is prepared which converts to an optimal unit for clarity.
  • New Leaderboard Website Incoming: The new site converts every result to an optimal display unit (a small formatter sketch follows below).
    • This should resolve the nanoseconds-versus-milliseconds mismatch on the current site.
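
A sketch of what such an "optimal unit" conversion might look like (the site's actual logic is an assumption): pick the largest unit that keeps the value at or above 1.

def format_time(ns: float) -> str:
    # leaderboard times stored in nanoseconds; render in the largest
    # unit that keeps the value >= 1
    for unit, scale in (("s", 1e9), ("ms", 1e6), ("us", 1e3), ("ns", 1.0)):
        if ns >= scale:
            return f"{ns / scale:.3f} {unit}"
    return f"{ns:.3f} ns"

print(format_time(1_234_567))   # -> 1.235 ms
print(format_time(850))         # -> 850.000 ns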

GPU MODE ā–· #hardware (2 messages):

Local LLM inference, Fine-tuning, GPU selection, L40 vs A100, Quantization

  • Local LLM Rig Build for Org: Members are considering building a small rig for organizational LLM tasks like summarization, chatbots, and text generation, exploring options like L40 or A100 GPUs.
    • The primary focus is on optimizing for 4-bit and 8-bit model inference and potential fine-tuning, with local prices running roughly 5-10% above US prices.
  • L40 Underperformance Puzzle: Despite looking good on paper for 4-bit quantized Llama 3 70B, the L40 only achieves 30-35 tok/s on single-user requests via vLLM, underperforming published A100 benchmarks.
    • The performance gap may come down to the A100’s superior DRAM bandwidth and tensor-op throughput, both roughly twice the L40’s (see the back-of-envelope estimate after this list).
  • Exploring Quantization and Optimization Strategies: The discussion suggests exploring TensorRT and specific quant formats to improve the L40’s performance.
    • Despite the L40’s FP8 support and larger L2 cache, these advantages don’t seem to translate into better performance than the A100 in current setups.
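
A back-of-envelope check on the bandwidth explanation: single-user decode is memory-bound, so tokens/s is roughly DRAM bandwidth divided by the bytes of weights read per token (bandwidth figures below are approximate datasheet values).

weights_bytes = 70e9 * 0.5           # 70B params at 4 bits/param ~= 35 GB

for gpu, bw in [("L40", 864e9), ("A100 80GB", 2039e9)]:
    print(f"{gpu}: ~{bw / weights_bytes:.0f} tok/s ceiling")
# -> L40 ~25 tok/s vs A100 ~58 tok/s: the ~2.4x bandwidth gap alone
#    accounts for most of the observed difference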

Notebook LM ā–· #use-cases (14 messagesšŸ”„):

Interactive voice mode, Mind maps rollout, Website URL use cases, Commercial scale version of NotebookLM

  • Interactive Voice Mode Inspires!: A user said the interactive voice mode was an interesting way of getting them to think about ideas.
    • After working since January on building a solid NotebookLM foundation, they said they can now make almost every text work and are confident they can help corporations set up notebooks tailored to their specific needs.
  • Mind Maps Finally Go Live!: Users reported the mind maps feature has been fully rolled out, appearing in the middle panel for some, while others are still waiting.
    • One user mentioned seeing it briefly on the right side panel before it disappeared.
  • Audio Overview identifies website as a book: A user inquired about use cases with a website URL, noting the Audio Overview incorrectly identified a website as a book.
    • Another user suggested the source type/genre is identified based on the source’s content/format, and running it again with a ā€œcustomizationā€ specifying it’s a website resolved the issue.
  • Commercial NotebookLM Version Inquired: A user asked if there is a commercial scale version of NotebookLM, where the data is not in the public domain, and specific programming or prompts can be entered.

Notebook LM ā–· #general (154 messagesšŸ”„šŸ”„):

NotebookLM's Discover feature rollout, Gemini 2.5 family, Mind Map evolution with generative AI, YouTube audio EQ Chrome extension, Google Cloud Next and Google I/O events

  • Theorizing Image-Based Mind Map Revolution: Users discussed how generative AI tools could soon evolve mind maps to include images, drawing inspiration from Tony Buzan’s original mind maps.
    • Members expressed excitement about the potential for more visually rich and informative mind mapping.
  • Discover feature Rollout Delays Frustrate Users: Users have expressed frustration over the delayed rollout of the new ā€˜Discover Sources’ feature in NotebookLM: announced April 1st, it has been rolling out for over a week and is expected to take up to two weeks to reach full availability.
    • The feature promises to streamline learning and database building by allowing users to create notebooks with sources directly within NotebookLM, eliminating the need to search outside the platform; one user even shared a Peter Griffin ā€˜But I want it now’ GIF.
  • NotebookLM still on Gemini 2.0; 2.5 Tunability Teased: Currently, NotebookLM utilizes the Gemini 2.0 Thinking model, though its effectiveness versus the Flash model in this context remains under evaluation.
    • Gemini 2.5 is confirmed to be a family of models including a Flash version and 2.5 Pro will soon be tunable, enabling developers to adjust its ā€˜thinking’ intensity.
  • Chrome Extension tunes YouTube audio with AI: A member created an AI-powered Chrome Extension called EQ for YouTube which allows users to manipulate the audio of YouTube videos in real-time with a 6-band parametric equalizer; the extension has features for real-time frequency visualization, built-in presets, and custom preset creation.
  • NotebookLM’s Language Change Explained: To change the language in NotebookLM, use the URL https://notebooklm.google.com/?hl=LANGUAGE_CODE, replacing LANGUAGE_CODE with the desired language code (e.g., es for Spanish).
    • While the team acknowledged a previously identified translation bug (since resolved), the podcast output cannot be translated at this time.

Modular (Mojo šŸ”„) ā–· #general (28 messagesšŸ”„):

Nvidia CUDA Python Support, Mojo GenAI, CuTile Programming Model, SIMD vs SIMT, Tenstorrent and Modular

  • Nvidia Adds Native Python Support to CUDA: A member shared a link to an article, Nvidia Finally Adds Native Python Support to CUDA, questioning if it’s the empire strikes back.
    • The article discusses Nvidia’s approach to GPU execution using the CuTile programming model, abstracting away from thread-level programming.
  • Can Mojo Tackle GenAI?: A member wondered if Mojo is capable enough to develop GenAI or Inference.ai already.
    • This sparks discussion on the current capabilities and potential of Mojo in the field of Generative AI.
  • CuTile Programming Model Questioned: A member expressed reservations about Nvidia’s CuTile programming model, viewing it as a higher-level abstraction that removes the fun from writing GPU code.
    • They stated: there taking the fun out of writing gpu code.
  • SIMD vs SIMT: A member is working on a proof-of-concept model, noting that modern parallel compute is hard to reason about through a typical threading model.
    • Discussion arose around exposing an SM as a big SIMD core with masking, and whether SIMD or SIMT is more appropriate, considering hardware flexibility and potential limitations.
  • Tenstorrent Software Stack: A member suggested that Tenstorrent should use Modular’s software stack, but another member noted that Tenstorrent’s driver is incredibly easy to target and use.
    • They stated: their driver is incredibly easy to target and use though, so while making effective use of their architecture might require some tinkering, just getting something that runs on it seems almost trivial

Modular (Mojo šŸ”„) ā–· #mojo (85 messagesšŸ”„šŸ”„):

Auto Lowering, MLIR Interpreter stress test, Implicit ctor hack, Mojo language spec, Mojo implicit copies

  • Auto Lowering Achieved: A member discovered that auto lowering can be achieved when adding values from different scales, sharing a link to the code.
    • The member noted that it might be my most cursed work yet however.
  • MLIR Interpreter Faces Stress Test: A member commented that the time interval library might turn into a stress test of the MLIR interpreter.
    • Another member added that something didn’t work as expected but can be remedied with an implicit ctor hack.
  • Mojo Spec Debate Heats Up: A discussion started around whether Mojo will have a spec, with some arguing that it gives a language responsibility and maturity, referencing the design principles of Carbon.
    • Others countered that Mojo’s design is tightly coupled to what MAX needs and that a spec would slow down development, with one member stating that Chris Lattner blames ā€œdesign by committeeā€ for the failure of OpenCL.
  • Mojo’s Copy Semantics clarified: A member inquired whether Mojo’s implicit copies use Copy-on-Write (CoW).
    • Another member clarified that semantics wise, always copy; optimisation wise, many are turned into move or eliminated entirely (inplace). Though it happens at compile time, CoW is a run time thing.
  • ChatGPT’s Mojo Skills Under Scrutiny: A member asked if ChatGPT or alternatives are good enough to rewrite a large Python project in Mojo.
    • Another member responded that ChatGPT isn’t good at any new languages.

Link mentioned: ChronoFlare/chronoflare/__init__.mojo at main Ā· bgreni/ChronoFlare: A time interval library written in mojo. Contribute to bgreni/ChronoFlare development by creating an account on GitHub.


Nomic.ai (GPT4All) ā–· #general (54 messagesšŸ”„):

Nomic Embed Text V2, GPT4All release cadence, Llama 4 release, ComfyUI for multimodal tasks, Semantic chunking

  • Nomic Embed Text V2 integration is coming to Llama.cpp: A member shared a link to a GitHub Pull Request that shows Llama.cpp working on integrating Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture for multilingual embeddings.
    • Another member expressed that everything hangs on Llama.cpp and hoped for Mistral Small 3.1 multimodal support.
  • GPT4All Silent Treatment Troubles Users: Members are noticing a period of silence from core developers, with one member mentioning that this causes uncertainty about contributing to the app and the community.
    • The same member suggested this might not be a good policy for an open project, but that when they break their silence, they usually come out swinging.
  • Llama 4 is here, but is it the greatest?: Meta released Llama 4 on April 5, 2025 (announcement), featuring Llama 4 Scout, a model with 17B active parameters, 16 experts, and a 10M token context window.
    • Though some users are excited for the release, others expressed that it is a bit of a letdown and that DeepSeek and Qwen need to step up their game, while another noted the largest model has 2 Trillion parameters.
  • ComfyUI is more than just a pretty face for image generation: Members discussed the extensive capabilities of ComfyUI, noting that you can do a lot with comfy if you have the nodes including image and audio captioning.
    • Another member mentioned the possibility of video processing and described using command-line tools for visual model analysis.
  • Semantic chunking server recipe for delicious RAG: A member shared a link to a semantic chunking server implemented with FastAPI.
    • The member also shared a curl command example for posting to the chunking endpoint, showing how to set parameters like max_tokens and overlap (an equivalent Python client sketch follows below).
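
An equivalent Python client for such an endpoint; the URL, route, and response shape are assumptions, and only the max_tokens and overlap parameters come from the discussion.

import requests

resp = requests.post(
    "http://localhost:8000/chunk",    # hypothetical route on the FastAPI server
    json={
        "text": open("document.txt").read(),
        "max_tokens": 256,            # max tokens per chunk
        "overlap": 32,                # tokens shared between adjacent chunks
    },
    timeout=60,
)
resp.raise_for_status()
for chunk in resp.json().get("chunks", []):
    print(len(chunk), chunk[:60])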

LlamaIndex ā–· #blog (3 messages):

MCP Servers, Full-Stack Agent Application, LlamaParse Layout Agent

  • MCP Servers get CLI Tooling: A tool by @MarcusSchiesser lets you easily discover, install, configure, and remove new MCP servers from a single CLI interface, supporting Claude, @cursor_ai, and @windsurf_ai, as shown here.
    • There are hundreds of official MCP servers out there.
  • Create Llama for Full-Stack Agents: The create-llama CLI tool lets you spin up a web application with a FastAPI backend and Next.js frontend in a single line of code, creating just 5 source files as shown here.
    • This is meant to jumpstart agent application development like deep research.
  • LlamaParse Launches Layout Agent: A brand-new layout agent within LlamaParse gives you best-in-class document parsing and extraction with precise visual citations, using SOTA VLMs to detect all the blocks on a page.
    • The new agent dynamically adapts to each page, as shown here.

LlamaIndex ā–· #general (46 messagesšŸ”„):

Workflow as a Tool, Multi-Agent System with Supervisor Pattern, RAG System with LlamaParse, Scalability Issue with DocumentSummaryIndex, Tools retry when exception occurred

  • Wrap Workflows as Tools with FunctionTool: To transform a Workflow into a Tool, one can use the FunctionTool to wrap the workflow and gain control over its name, description, input annotations, and return values.
    • A member suggested a code snippet:
# illustrative sketch: import added; the query parameter stands in for the original's "..."
from llama_index.core.tools import FunctionTool

async def tool_fn(query: str) -> str:
    """Some helpful description"""
    # kwargs passed to run() feed the workflow's start event; `workflow` is assumed in scope
    result = await workflow.run(query=query)
    return str(result)

tool = FunctionTool.from_defaults(tool_fn)
  • Agent Handoffs Supersede Supervisor Pattern: When building a multi-agent system, it is more robust to have agents handoff between each other as needed, instead of using a supervisor pattern, which can be more error prone.
    • A GitHub repo was shared as an example of a supervisor pattern implementation.
  • Replicate Document Summary Index with Vector Store Index: The DocumentSummaryIndex may have scalability issues; it’s advised to replicate its functionality using a normal VectorStoreIndex by summarizing documents, indexing with reference IDs, and swapping summary nodes with the original document during retrieval.
    • When using load_index_from_storage, the index store is loaded into memory, which adds latency as more documents are ingested.
  • Context’s State Prepending to user_msg: To stop state content from being prepended to the user message, avoid using the reserved state key in the context and pass data between tools under other keys.
    • A suggestion was to use ctx.set("some_key", "some_val") and ctx.get("some_key") instead (a minimal workflow sketch follows below).
  • Implement Text-to-SQL Query Engine Tool: When implementing a text-to-SQL query engine tool for an agent, if there are only a few tables, it is not necessary to create an index of table descriptions and perform a vector query.
    • In cases of a small number of tables, the index and vector search parts can be skipped for better performance.
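
A minimal workflow sketch of the ctx.set/ctx.get suggestion above, based on the llama_index workflow API (exact signatures may vary by version):

from llama_index.core.workflow import Context, StartEvent, StopEvent, Workflow, step

class MyFlow(Workflow):
    @step
    async def first(self, ctx: Context, ev: StartEvent) -> StopEvent:
        # scratch space shared between steps/tools, kept out of user_msg
        await ctx.set("some_key", "some_val")
        val = await ctx.get("some_key")
        return StopEvent(result=val)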

Link mentioned: GitHub - run-llama/multi-agent-concierge: An example of multi-agent orchestration with llama-index: An example of multi-agent orchestration with llama-index - run-llama/multi-agent-concierge


tinygrad (George Hotz) ā–· #general (16 messagesšŸ”„):

torch-geometric for tinygrad, Llama 4 10M context limitations, fast pattern matcher bounty, UOps generation, tinygrad YouTube video

  • Tinygraph: Torch-geometric for Tinygrad?: A member inquired about the feasibility of creating a module similar to torch-geometric for graph ML within tinygrad, considering tinygrad’s existing torch interface.
    • They questioned whether it would be ā€œusefulā€ to pursue such a module.
  • Llama 4’s long context may not be so good: A user shared a tweet claiming Llama 4’s declared 10M context is ā€œvirtualā€ because the models were not trained on prompts longer than 256k tokens.
    • The poster also stated that even problems below 256k tokens may yield low-quality output due to the difficulty of obtaining high-quality training examples, and that the largest model, with 2T parameters, ā€œdoesn’t beat SOTA reasoning modelsā€.
  • $2000 Fast Pattern Matcher Bounty is available: A member highlighted an open $2000 bounty for a fast pattern matcher in tinygrad.
    • The proposed solution involves a JIT for the match function, avoiding function calls and dict copies.
  • Reduce UOps to Speed Up Rewrite: It was suggested that tinygrad sometimes generates more UOps than needed, increasing the cost to rewrite.
    • A member asked if it would be acceptable to sacrifice a few lines to generate fewer UOps initially, even if they are later optimized to the same result.
  • Tinygrad YouTube video shared: A member shared a link to a YouTube video.
    • No additional details were given.

tinygrad (George Hotz) ā–· #learn-tinygrad (24 messagesšŸ”„):

Tensor and SimpleMathTrait inheritance, Mesozoic tinygrad tutorials issues, METAL sync issue, AMD and BEAM issues

  • Debate About Tensor Inheriting SimpleMathTrait: A discussion arose regarding whether Tensor should inherit from SimpleMathTrait, given that it re-implements every method that SimpleMathTrait provides without using the .alu() function.
    • It was noted that a previous bounty for refactoring Tensor to inherit from MathTrait was canceled due to poor submissions, with some suggesting Tensor may not need to inherit from either.
  • Colab CUDA Bug Causes Mesozoic Tinygrad Tutorial Issues: A user encountered issues while running code from the mesozoic tinygrad tutorials in Colab, prompting others to request the error message for debugging.
    • It was identified as a Colab bug related to incompatible CUDA and driver versions; a workaround using specific apt commands to remove and install compatible versions was proposed, and in the meantime falling back to the CPU device was suggested.
  • METAL Sharding Behavior Leads to Unexpected Results: A member encountered unexpected behavior in sharding while trying to reproduce a minimal example of a METAL sync issue, suspecting that the COPY from METAL:1 to CPU might be executing before the XFER from METAL to METAL:1 completes.
    • The DEBUG output seemed to show the timeline adding the XFER when it was committed to the GPU command queue, not when it completed.
  • AMD and BEAM cause AssertionError: A user encountered an AssertionError when running with BEAM=2 and AMD=1, which seemed to be caused by opening the device outside of the if __name__ == "__main__" block.
    • Setting PARALLEL=0, or ensuring the device is opened within the if __name__ == "__main__" block, resolved the issue (see the sketch below).
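
A generic illustration of the fix (Device["AMD"] is assumed to be the relevant tinygrad entry point): initialize the device only in the main process so spawned workers don't re-open it on import.

from tinygrad import Device   # importing is safe; opening the device is not

def main():
    dev = Device["AMD"]       # open the device here, not at module top level
    ...

if __name__ == "__main__":
    main()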

Cohere ā–· #ć€ŒšŸ’¬ć€general (19 messagesšŸ”„):

MCP with Command-A model, Cohere Tool Use, Cohere Scholars Program, Events Recording

  • MCP use with Command-A Model Explored: A member inquired about using MCP (Model Context Protocol) with the Command-A model, suggesting it should work via the OpenAI SDK (a sketch of that route appears after this list).
    • Another member agreed, stating that there is no reason why it should not work.
  • Cohere Tool Use Capabilities Detailed: A member shared the Cohere Tool Use Overview, highlighting its ability to connect Command family models to external tools like search engines, APIs, and databases.
    • It also mentions that Command-A supports tool use, similar to what MCP aims to achieve.
  • Cohere Scholars Program Details Shared: A member asked about the requirements for the Cohere Scholars Program, specifically if prior publications are accepted.
  • Inquiry about Events Recordings: A member inquired whether Cohere events are recorded, as they were interested but unable to attend the live sessions.
    • The question remained unanswered in the provided context.
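
A sketch of the OpenAI-SDK route to Command-A mentioned above; the base URL and model id are assumptions to check against Cohere's current docs.

from openai import OpenAI

client = OpenAI(
    api_key="<COHERE_API_KEY>",
    base_url="https://api.cohere.ai/compatibility/v1",   # assumed endpoint
)
resp = client.chat.completions.create(
    model="command-a-03-2025",                           # assumed model id
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)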

Cohere ā–· #ć€šŸ“£ć€‘announcements (1 messages):

Aya Vision, Multilingual Multimodal Models, Open Weights Model

  • Aya Vision Team Hosts Tech Talks and AMA: The core team behind Aya Vision, a multilingual multimodal open-weights model, is hosting tech talks followed by an AMA on April 11, 2025 at 15:00 UTC.
    • Attendees can join for exclusive insights on how the team built their first multimodal model and the lessons learned, with the event hosted by a Sr. Research Scientist and lightning talks from core research and engineering team members; further details are available at the Discord Event.
  • Multilingual Model Aya Eyes Community Feedback: The team has scheduled an Ask Me Anything to allow the community to directly engage with the creators.
    • Questions can be about anything from model architecture to future roadmap.

Cohere ā–· #ć€ŒšŸ”Œć€api-discussions (5 messages):

Notion Connector, Vector DB for Notion

  • Slack app struggles with Notion integration: A member asked for help with a working solution for a Slack app integration with a company Notion wiki database.
  • Vector DB Recommended to bolster Notion: A member suggested using a vector DB due to Notion’s subpar search API.
    • No specific recommendations were given; it was noted that Cohere models work well with all vector DBs (a minimal embedding sketch follows below).
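
A minimal sketch of that pattern: embed exported Notion text with Cohere and hand the vectors to whichever vector DB you prefer (the Notion-export step is assumed).

import cohere

co = cohere.Client("<COHERE_API_KEY>")
pages = ["Onboarding checklist ...", "Expense policy ..."]   # exported Notion text

resp = co.embed(
    texts=pages,
    model="embed-english-v3.0",
    input_type="search_document",   # use "search_query" when embedding queries
)
vectors = resp.embeddings           # one float vector per page, ready for any vector DB
print(len(vectors), len(vectors[0]))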

Cohere ā–· #ć€ŒšŸ¤–ć€bot-cmd (3 messages):

greetings

  • Users greet each other: Two users exchanged greetings in the ć€ŒšŸ¤–ć€bot-cmd channel, using ā€œheyā€ and ā€œsupā€.
    • The Cmd R Bot acknowledged the interaction with a casual ā€œsupā€.

Torchtune ā–· #dev (22 messagesšŸ”„):

Fix for Timeout Crash, NeMo Resilient Training, RL Workflow, DeepSpeed Integration

  • Timeout Crash Bug Fixed: A member fixed a bug related to timeout crashes and created torchtune.utils._tensor_utils.py with a wrapper around torch.split in this pull request (a hedged sketch of such a wrapper appears after this list).
    • They suggested merging the tensor utils separately and then syncing with another branch to handle any conflicts.
  • NeMo Tackles Resilient Training: A member attended a NeMo session on resilient training, highlighting features such as fault tolerance, straggler detection, asynchronous checkpointing, preemption, in-process restart, silent data corruption detection, and local checkpointing.
    • Not all of these are implemented, with some only planned; the member offered to rewatch and present details comparing torchtune vs. NeMo in terms of resiliency.
  • RL Workflow, Data Standard Format, and Prompts: A member discussed the complexities of RL workflows, data formats, and prompt templates, suggesting a separation of concerns to decouple data conversion and prompt creation, allowing the same templates to be re-used across datasets.
    • The member suggested factorizing into a component that converts the data into a standard format, and another component that takes this standard format and converts it into the actual string with the prompt.
  • DeepSpeed backend for Torchtune?: A member inquired about integrating DeepSpeed as a backend into torchtune and created an issue to discuss the possibility.
    • Another member asked for more context, noting that FSDP supports all the sharding options from DeepSpeed.
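
A hedged sketch of what a torch.split wrapper like the one in the PR might look like (the actual torchtune helper may differ):

import torch

def safe_split(t: torch.Tensor, split_size: int, dim: int = 0) -> list[torch.Tensor]:
    """Split t along dim, tolerating empty tensors and oversized split sizes."""
    if t.numel() == 0 or t.size(dim) == 0:
        return [t]
    return list(torch.split(t, min(split_size, t.size(dim)), dim=dim))

chunks = safe_split(torch.arange(10), split_size=4)
print([c.tolist() for c in chunks])   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]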

Torchtune ā–· #papers (1 messages):

pjbontrager: You think they used AI to write that scrolling live updated chart?


LLM Agents (Berkeley MOOC) ā–· #mooc-announcements (1 messages):

AI4Math, Theorem Proving, Autoformalization, Formal Mathematical Reasoning, Language Models

  • Kaiyu Yang Presents on Autoformalization and Theorem Proving: Kaiyu Yang will present on ā€œLanguage models for autoformalization and theorem provingā€ today at 4pm PDT.
    • The presentation will cover the basics of using LLMs for formal mathematical reasoning, focusing on theorem proving and autoformalization.
  • AI4Math is Crucial for AI-Driven System Design: AI for Mathematics (AI4Math) is intellectually intriguing and crucial for AI-driven system design and verification; extensive efforts in the area have mirrored techniques from NLP.
    • The talk explores formal mathematical reasoning grounded in formal systems such as proof assistants, which can verify the correctness of reasoning and provide automatic feedback.

LLM Agents (Berkeley MOOC) ā–· #mooc-questions (4 messages):

LLM Agents MOOC, AgentX Competition, Course Quiz

  • LLM Agents MOOC link shared: A member asked for a link to the LLM Agents MOOC, another shared the link.
  • AgentX Competition Sign-Ups: Staff shared that sign-ups for the AgentX Competition are available here.
  • Course Quiz delayed: A member asked about the missing quiz for the previous week.
    • A staff member apologized for forgetting to post it and mentioned it would be available in a few minutes.

Link mentioned: Advanced Large Language Model Agents MOOC: MOOC, Spring 2025


DSPy ā–· #general (4 messages):

asyncio support, full-async fork of dspy, reasons to migrate

  • Asyncio support: will dspy be async?: A member inquired about plans to add asyncio support for general dspy calls (a stopgap wrapper sketch appears after this list).
    • They mentioned using LiteLLM initially and then growing into dspy optimization, expressing interest in native dspy async capabilities.
  • Full Async Fork Faces Abandonment?: A member has maintained a true full-async fork of dspy for a few months but is migrating away from dspy.
    • They are willing to continue merging upstream changes if there’s community interest but will abandon it otherwise.
  • Reasons to Migrate & Benefits of Async DSPy: Members expressed curiosity about the reasons for migrating away from dspy, and which tool is being migrated to.
    • One member asked about the advantages of having a full async DSPy and suggested merging relevant features into the main repository.
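
A common stopgap while native async support is discussed: run the synchronous DSPy call on a worker thread with asyncio.to_thread (sketch only; assumes an LM was configured beforehand).

import asyncio
import dspy

# assumes something like dspy.configure(lm=dspy.LM("openai/gpt-4o-mini")) ran earlier
qa = dspy.Predict("question -> answer")

async def aqa(question: str) -> str:
    # keeps the event loop free while the blocking LM call runs on a thread
    pred = await asyncio.to_thread(qa, question=question)
    return pred.answer

async def main():
    answers = await asyncio.gather(aqa("What is DSPy?"), aqa("Why asyncio?"))
    print(answers)

if __name__ == "__main__":
    asyncio.run(main())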

Gorilla LLM (Berkeley Function Calling) ā–· #discussion (3 messages):

GitHub PR Review, Phi-4 Support

  • GitHub PR Gets Eyeballed: A member mentioned reviewing a GitHub Pull Request, leaving comments for further discussion on the platform.
    • The author expressed gratitude for the review, acknowledging the effort put into it and indicating a need to rerun the process based on the feedback.
  • Phi-4 Family Support Considered: A member is considering extending functionality to Phi-4-mini and Phi-4, despite them not being officially supported.
    • This suggests an effort to broaden compatibility beyond the initially intended scope, potentially enhancing the tool’s appeal.

MLOps @Chipro ā–· #events (1 messages):

Manifold Research, Multimodal AI, Self-assembling space robotics, Robotic metacognition, Community Research Call

  • Manifold Research Hosts Community Research Call #4: Manifold Research Group is hosting Community Research Call #4 this Saturday (4/12 @ 9 AM PST), covering their latest work in Multimodal AI, self-assembling space robotics, and robotic metacognition.
    • Interested parties can register here to join the open, collaborative, and frontier science focused event.
  • CRCs are Manifold’s Cornerstone Events: Community Research Calls (CRCs) are Manifold’s cornerstone events where they present significant advancements across their research portfolio.
    • These interactive sessions provide comprehensive updates on ongoing initiatives, introduce new research directions, and highlight opportunities for collaboration.
  • CRC #4 Agenda Announced: The agenda for CRC #4 includes updates on Generalist Multimodality Research, Space Robotics Advancements, Metacognition Research Progress, and Emerging Research Directions.
    • The event will cover recent breakthroughs and technical progress in their MultiNet framework, developments in Self-Assembling Swarm technologies, updates on VLM Calibration methodologies, and the introduction of a novel robotic metacognition initiative.

Link mentioned: Community Research Call #4 Ā· Zoom Ā· Luma: Interested in generalist AI models, self-assembling space robots or machine self-awareness? Join us for Community Research Call #4!Community Research Calls…



{% else %}

The full channel by channel breakdowns have been truncated for email.

If you want the full breakdown, please visit the web version of this email: [{{ email.subject }}]({{ email_url }})!

If you enjoyed AInews, please share with a friend! Thanks in advance!

{% endif %}