a quiet day

AI News for 4/23/2025-4/24/2025. We checked 9 subreddits, 449 Twitters and 29 Discords (214 channels, and 6713 messages) for you. Estimated reading time saved (at 200wpm): 560 minutes. Our new website is now up with full-text search and a beautiful vibe-coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on https://x.com/Smol_AI!

This is the last week to get AI Engineer Super Early Bird tix!

Check out the brand new AINews website: https://news.smol.ai/

AI Twitter Recap

Models and Benchmarks

OpenAI’s image generation model gpt-image-1 is now available in the API, giving developers control over moderation sensitivity, image quality vs. generation speed, background, and output format @kevinweil, @sama. Developers at companies like Instacart, InVideo, Canva, and HubSpot are already using the API @OpenAIDevs.
OpenAI’s new models o3 and o4-mini rank highly on the Arena leaderboard. The o3 model is #2 overall and ties with Gemini 2.5 Pro for #1 in Style Control, Math, Coding, and Hard Prompts. The o4-mini model breaks into the top 10 and claims #1 in Math, surpassing o1, so o3 and o4-mini share the #1 spot in Math. GPT-4.1 ranks in the top 5 for Hard Prompts, Math, Style Control, and Longer Query @lmarena_ai.
Describe Anything Model (DAM), a 3B vision language model by Nvidia, generates detailed localized captions for specified regions of images and videos. It builds its training data by extending existing segmentation and referring expression generation datasets like RefCOCO with VLM-generated captions, and it ships with a new benchmark and demo @mervenoyann. The models, dataset, and benchmark are released, and the model is live on Hugging Face @reach_vb.
Google DeepMind has introduced new features in Music AI Sandbox, powered by the Lyria 2 model, which is helping singer-songwriters like Isabella Kensington create their next masterpiece. The features include create, extend, and edit tools for generating and reshaping music @GoogleDeepMind.
As AI models become more complex, Anthropic is exploring whether they could have experiences of their own and has started a research program to investigate this @AnthropicAI.
OpenAI has expanded deep research usage for Plus, Team, and Pro users and introduced a lightweight version for Free users, raising rate limits and making the feature more accessible @OpenAI. The lightweight version is powered by a version of OpenAI o4-mini @OpenAI.
Epoch AI Research published a dataset covering 500+ of the largest AI supercomputers over the last six years, revealing trends in scaling, location, ownership, and power requirements. Performance has doubled every 9 months, driven by more chips and higher performance per chip. Costs have doubled roughly every year, and power requirements are also doubling annually. The US dominates global AI supercomputer performance, with the private sector controlling over 80% of AI computing capacity @EpochAIResearch.
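The doubling times above compound quickly. A minimal sketch in Python, assuming the reported doubling times held constant across the whole six-year window (a simplification; real growth is lumpier):

```python
def growth_factor(years: float, doubling_time_years: float) -> float:
    """Total multiplicative growth after `years` at a constant doubling time."""
    return 2 ** (years / doubling_time_years)


def annual_multiplier(doubling_time_years: float) -> float:
    """Equivalent year-over-year growth multiplier for a given doubling time."""
    return 2 ** (1 / doubling_time_years)


# Doubling times taken from the Epoch AI summary; the 6-year window is an assumption.
WINDOW_YEARS = 6
performance = growth_factor(WINDOW_YEARS, 0.75)    # doubling every 9 months
cost_and_power = growth_factor(WINDOW_YEARS, 1.0)  # doubling roughly every year

print(f"performance:    ~{performance:.0f}x over {WINDOW_YEARS} years")
print(f"cost and power: ~{cost_and_power:.0f}x over {WINDOW_YEARS} years")
print(f"performance per year: ~{annual_multiplier(0.75):.2f}x")
```

Under that assumption the script prints roughly 256x for performance, 64x for cost and power, and about 2.52x performance growth per year.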
LlamaIndex integrates with Milvus to support full-text search with BM25, enabling hybrid search for RAG pipelines that combine vector search and traditional keyword matching @llama_index.
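Hybrid search runs a keyword ranking and a vector ranking and merges them. The sketch below is a self-contained illustration in plain Python, not the LlamaIndex or Milvus API: it scores documents with Okapi BM25, ranks them by cosine similarity against toy embeddings, and fuses the two lists with reciprocal rank fusion (RRF). Using RRF with the conventional k=60 is an assumption here; the actual integration may combine results differently.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every tokenized document for a tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge ranked lists of doc ids into one ranking."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common()]


def hybrid_search(query_tokens, query_vec, doc_tokens, doc_vecs):
    """Rank docs by fusing a BM25 keyword ranking with a vector-similarity ranking."""
    ids = list(range(len(doc_tokens)))
    keyword = bm25_scores(query_tokens, doc_tokens)
    semantic = [cosine(query_vec, v) for v in doc_vecs]
    keyword_rank = sorted(ids, key=lambda i: keyword[i], reverse=True)
    semantic_rank = sorted(ids, key=lambda i: semantic[i], reverse=True)
    return rrf([keyword_rank, semantic_rank])


docs = [["milvus", "supports", "bm25"], ["vector", "search", "only"], ["hybrid", "bm25", "vector", "search"]]
vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # toy embeddings
print(hybrid_search(["bm25", "vector"], [0.6, 0.8], docs, vecs))
```

The third document, which matches both query terms and sits closest to the query vector, comes out on top. Fusing ranks rather than raw scores avoids having to put BM25 scores and cosine similarities on a common scale.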

Agentic Workflows and Tooling

Perplexity Assistant is integrated into Motorola devices, providing direct access to search and assistant features across the Moto ecosystem, including a 3-month Pro subscription @AravSrinivas, @perplexity_ai. The Perplexity Android app will be pre-installed on all new Motorola devices and is optimized for the Moto Razr @AravSrinivas.
Perplexity has introduced an iOS Voice Assistant that answers questions and takes actions on the iPhone, including playing media, drafting emails, moving meetings, booking rides, and setting reminders @AravSrinivas. The iPhone's Action button can be set to Perplexity Voice Mode, so users can invoke the assistant without opening the app @AravSrinivas. The assistant can research places and then call a ride or navigate there; play podcasts, videos, and songs; and review the day, schedule meetings, and send emails using Apple Mail and Calendar @AravSrinivas, @perplexity_ai.
Andrew Ng and Hugging Face offer a new short course on building code agents with Hugging Face smolagents; code agents use LLMs to generate a single block of code that carries out a sequence of actions @AndrewYNg. DeepLearning.AI notes that code agents outperform function-calling agents because they consolidate multiple function calls into a single block of code @DeepLearningAI.
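To make the code-agent idea concrete, here is a minimal sketch in plain Python. It is not the smolagents API: the tools, the hard-coded `GENERATED_CODE` string standing in for model output, and `run_code_action` are all hypothetical. The point is that one generated block chains a search, a selection step, and a booking, where a function-calling agent would make one model round trip per call.

```python
# Hypothetical tools; in a real agent framework these would be registered tools.
def search_flights(city: str) -> list:
    """Return toy flight offers for a city."""
    return [
        {"id": "F1", "city": city, "price": 420},
        {"id": "F2", "city": city, "price": 310},
    ]


def book(flight_id: str) -> str:
    """Pretend to book a flight and return a confirmation string."""
    return f"booked {flight_id}"


TOOLS = {"search_flights": search_flights, "book": book}

# Stand-in for LLM output: one code block chains two tool calls plus ordinary
# Python logic, where a function-calling agent would need a round trip per call.
GENERATED_CODE = """
flights = search_flights("Paris")
cheapest = min(flights, key=lambda f: f["price"])
result = book(cheapest["id"])
"""


def run_code_action(code: str, tools: dict):
    """Execute generated code with only the tools and `min` in scope.

    Restricting builtins this way is NOT a security sandbox; real code agents
    typically rely on a restricted interpreter or an isolated executor.
    """
    namespace = {"__builtins__": {"min": min}, **tools}
    exec(code, namespace)
    return namespace.get("result")


print(run_code_action(GENERATED_CODE, TOOLS))  # booked F2
```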
Jerry Liu outlines a reference architecture for building agents over documents, Agentic Document Workflows (ADW), which combines parsing and extraction, retrieval, reasoning, and action-taking @jerryjliu0. He argues that ADW scales better and integrates with existing software ecosystems.
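The four ADW stages can be pictured as a pipeline over shared state. A toy sketch, assuming simple stand-ins for each stage (a line-based parser, a dictionary lookup for retrieval, a threshold rule in place of LLM reasoning, and a list append in place of a real action); none of these names come from LlamaIndex:

```python
from dataclasses import dataclass, field


@dataclass
class DocContext:
    """State passed through the hypothetical ADW-style stages."""
    raw_text: str
    fields: dict = field(default_factory=dict)
    evidence: dict = field(default_factory=dict)
    decision: str = ""
    actions: list = field(default_factory=list)


def parse_and_extract(ctx: DocContext) -> DocContext:
    # Toy parser: pull "key: value" lines out of the document.
    for line in ctx.raw_text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            ctx.fields[key.strip().lower()] = value.strip()
    return ctx


def retrieve(ctx: DocContext, policies: dict) -> DocContext:
    # Toy retrieval: look up the policy limits relevant to the extracted fields.
    ctx.evidence = {k: policies[k] for k in ctx.fields if k in policies}
    return ctx


def reason(ctx: DocContext) -> DocContext:
    # Stand-in for an LLM reasoning step: escalate if any field exceeds its limit.
    over = [k for k, limit in ctx.evidence.items() if float(ctx.fields[k]) > limit]
    ctx.decision = "escalate" if over else "approve"
    return ctx


def act(ctx: DocContext) -> DocContext:
    # Stand-in for calling an external system such as a ticketing tool or email.
    ctx.actions.append(f"{ctx.decision} invoice {ctx.fields.get('invoice', '?')}")
    return ctx


def run_pipeline(raw_text: str, policies: dict) -> DocContext:
    ctx = parse_and_extract(DocContext(raw_text))
    return act(reason(retrieve(ctx, policies)))


result = run_pipeline("Invoice: 123\nVendor: Acme\nAmount: 12500", {"amount": 10_000})
print(result.actions)  # ['escalate invoice 123']
```

In a real ADW each stand-in would become a document parser, a retriever, an LLM call, and an integration with existing business software, but the data flow stays the same.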
A new open-source IDE for building multi-agent systems is available; it is powered by the OpenAI Agents SDK, connects to MCP servers, and integrates into apps via HTTP or the SDK @omarsar0.
LangChain is being used to build more powerful AI agents and automate complex tasks. Uber’s Developer Platform team is using LangGraph to build a network of agents to automate unit test generation @LangChainAI. Webtoon is using LangGraph to automate narrative understanding across its content library, reducing manual story review work by 70% @LangChainAI.
Teknium has released Minos, a classifier for determining if a chat’s last response is a refusal, which can be used to automate jailbreaking and redteaming @Teknium1.
svpino promotes a Machine Learning Engineering cohort starting in May, emphasizing the importance of MLOps and of designing complex systems, with a focus on building transformational software using LLMs @svpino.

AI and Industry

Aravind Srinivas of Perplexity discusses the challenges of battling Google’s dominance in the Android ecosystem, highlighting the need to build a better native assistant than Gemini and the partnership with Motorola to optimize the user experience @AravSrinivas.
Stephanie Palazzolo reports on OpenAI’s increased projections, driven by agents and “free user monetization,” which could include advertising or affiliate fees @steph_palazzolo.
Chris Lattner discusses the challenges hardware companies face in building AI software, pointing to technology problems, incentives, and the need for bespoke solutions @clattner_llvm.
Andrew Ng discusses AI-assisted coding, noting it makes specific programming languages less important, while understanding core concepts remains crucial @AndrewYNg.
Yoshua Bengio views SB 813 as a step forward in fostering AI safety innovation through independent multi-stakeholder regulatory organizations (MROs) but suggests changes including robust safeguards, clear liability insurance, and state oversight @Yoshua_Bengio.

International Developments and Geopolitics

teortaxesTex discusses a shift in research leadership, noting that China now leads with 11 top research leaders compared to 4 for the US, and suggesting China no longer depends on US schools or “IP” @teortaxesTex.
teortaxesTex comments on the US’s reliance on Chinese labor and on CCP policies (rather than “commie cheats”), and discusses the challenge of the US maintaining hegemony with a smaller population and the need to sometimes let allies beat it @teortaxesTex.
teortaxesTex comments on rising Chinese prowess in AI, singling out DeepSeek and noting that there is more where that came from, and praises China’s weapon naming conventions @teortaxesTex.
hardmaru suggests that more top ML conferences should be hosted in Asia, pointing to ICLR 2025 in Singapore @hardmaru.

Humor and Memes

willdepue jokes about the AI conference booths not giving out free water bottles anymore, only stickers, implying a recession @willdepue.
aidan_mclau jokes about model welfare, stating that “15% chance models are conscious” is as absurd as “15% chance i should call this collection of atoms my computer is on a desk” @aidan_mclau.
vikhyatk shares a story about coming home and panicking because they thought an eval had finished, but it turned out to be the fridge @vikhyatk.

AI Reddit Recap

/r/LocalLlama Recap

1. Gemma 3 27B Quantization and Benchmark Analysis

I benchmarked the Gemma 3 27b QAT models (Score: 117, Comments: 26): The OP benchmarked several quantized (QAT) versions of Gemma 3 27B using llama.cpp v1.27.1 (GGUF) and LM Studio MLX v0.13.2, but found perplexity to be an unreliable metric due to overfitting and inconsistent results between quantized models. GPQA-main was used for evaluation instead, with scores ranging from 0.333 (MLX, 4-bit) to 0.375 (unquantized); notably, the Bartowski QAT Q4_0 GGUF scored 0.352 and ran 1-2 tok/sec faster than MLX. The analysis recommends Bartowski QAT Q4_0 as the best local QAT option and suggests that quantization incurs only modest performance degradation relative to the baseline. A commenter asked whether the benchmarks were run with deterministic decoding settings (temperature 0, top-k 1), which are critical for reproducibility. Another noted that GPQA-main results are close to those on the harder Diamond subset from prior testing, suggesting quantization does not significantly widen the performance gap on harder questions.
- One commenter raises the technical question of whether benchmarks were run with deterministic decoding settings, specifically temperature=0 and top-k=1, as these settings are important for reproducibility and strict performance comparison across quantized models.
- Another commenter analyzes the statistical significance of GPQA main benchmark results, noting that with 448 questions, the uncertainty in reported scores (modeled using a binomial-to-Gaussian approximation) is approximately ±2.25%. This suggests it is difficult to draw strong conclusions about differences in model performance on this dataset given current data size.
- A brief technical clarification is made regarding perplexity: lower values are better, since they mean the model assigns higher probability to the actual next tokens of the reference text.
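The ±2.25% figure is one standard error of a binomial proportion. A short Python check, assuming the 448 GPQA-main questions are treated as independent trials and using the scores reported in the post:

```python
import math


def standard_error(accuracy: float, n: int) -> float:
    """One standard error of an accuracy measured on n independent questions
    (normal approximation to the binomial)."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


def diff_standard_error(acc_a: float, acc_b: float, n: int) -> float:
    """Standard error of the difference between two accuracies, treating the runs
    as independent. For two models graded on the same questions a paired test
    (e.g. McNemar's) is usually tighter."""
    return math.sqrt(standard_error(acc_a, n) ** 2 + standard_error(acc_b, n) ** 2)


N = 448  # GPQA-main question count cited in the thread
for name, acc in [("MLX 4-bit", 0.333), ("Bartowski QAT Q4_0", 0.352), ("unquantized", 0.375)]:
    se = standard_error(acc, N)
    print(f"{name:20s} {acc:.3f}  ±{se:.3f} (1 SE)  ±{1.96 * se:.3f} (95%)")

gap = 0.375 - 0.333
print(f"gap {gap:.3f} vs. SE of difference {diff_standard_error(0.375, 0.333, N):.3f}")
```

The largest gap (0.375 vs. 0.333) is only about 1.3 standard errors of the difference, which supports the commenter's caution about drawing strong conclusions from these rankings.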
Unsloth Dynamic v2.0 GGUFs + Llama 4 Bug Fixes + KL Divergence (Score: 110, Comments: 49): Unsloth reports that its Dynamic v2.0 quantization method for GGUFs improves on QAT and traditional imatrix quants, particularly on 5-shot MMLU and KL Divergence benchmarks. The evaluation framework aligns with the reported Llama 4 and Gemma 3 metrics, and the Gemma 3 12B/27B Dynamic 2.0 quants achieve near-BF16 accuracy and up to 7.5% lower KL Divergence at minimal disk space cost. The methodology treats “flips” (answers that change from correct to incorrect, or vice versa, relative to the unquantized model) as a better accuracy metric than perplexity, following the paper Accuracy is Not All You Need, and uses carefully curated conversational calibration datasets for quantization. Several Llama 4 bugs were fixed, including the RoPE scaling config, the QK Norm epsilon, and QK Norm head sharing, which improved eval scores in llama.cpp and transformers. One commenter notes the lack of a direct comparison between Gemma 3 QAT and the new Dynamic 2.0 quants, since the posted data compares only QAT vs. BF16. Another asked about plans for a Llama 4 Maverick version.
- A user notes the absence of direct benchmark comparisons between Gemma 3 QAT and the newly released Dynamic 2.0 quants, pointing out that only QAT vs. BF16 results are presented, and asks for head-to-head evaluations of the two quantization approaches.
- A comment asks whether the same quantization and benchmarking will be applied to Llama 4 Maverick, signaling demand for broader model support.
- Another comment expresses appreciation for the new DeepSeek dynamic quants, indicating that the community values extending dynamic quantization to more architectures.
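Both metrics Unsloth emphasizes are simple to compute once you have model outputs. A minimal sketch with toy numbers (not Unsloth's evaluation code): KL divergence compares a reference next-token distribution with the quantized model's, and flips count questions whose correctness changed. Treating BF16 as the reference P is an assumption; in practice KL is averaged over every token position of an evaluation corpus.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats between two next-token distributions over one vocabulary.
    P is the reference (e.g. BF16) model, Q the quantized one."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def count_flips(baseline_correct, quant_correct):
    """Number of questions whose correctness changed (right->wrong or wrong->right)."""
    return sum(b != q for b, q in zip(baseline_correct, quant_correct))


# Toy next-token distributions over a 3-token vocabulary.
bf16 = [0.7, 0.2, 0.1]
quant = [0.6, 0.3, 0.1]
print(f"KL = {kl_divergence(bf16, quant):.4f} nats")  # ~0.0268

# Same accuracy (3/4) for both models, yet two answers flipped.
baseline = [True, True, False, True]
quantized = [True, False, True, True]
print(count_flips(baseline, quantized))  # 2
```

The flips example shows why the metric matters: both models score 3/4, so accuracy alone would hide that the quantized model answers two questions differently.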

2. Upcoming Open-Source Reasoning Model Announcements

Details on OpenAI’s upcoming ‘open’ AI model (Score: 248, Comments: 126): OpenAI is developing a new open-source reasoning model slated for early summer launch, aiming to surpass existing open alternatives such as Llama and Gemma. The model will support text-in/text-out tasks, allow reasoning abilities to be toggled, is designed to run efficiently on high-end consumer hardware (suggesting ~30-70B parameters, per comment speculation), and may ship under a more permissive license than current major open models. Top comments express skepticism due to prior delays and demand for actual model weights rather than announcements. There is informed speculation that ‘high-end consumer hardware’ points to a model size in the 30-70B parameter range, referencing precedent with similar dense models (o3-mini, o4-mini).
- Commenters estimate the upcoming OpenAI ‘open’ model is likely in the 30-70B parameter range, fitting past leaks about o3-mini and o4-mini being dense models in that category. This aligns with claims of suitability for ‘high end consumer hardware.’
- One user suggests that a quantized version at ‘q4’ would take around 18-20GB VRAM, implying the model size falls in the 30B-35B range if compared with existing LLM quantization strategies.
- There’s skepticism in the thread about actual model release, with users demanding OpenAI release weights—not just announcements—before considering the effort substantive, reflecting frustration over several months of teasers without technical deliverables.
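The commenter's VRAM-to-size estimate is back-of-the-envelope arithmetic. A sketch assuming about 4.5 bits per weight for a Q4_0-style quant (4-bit weights plus one 16-bit scale per block of 32), and ignoring KV cache and runtime overhead:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def params_for_memory(memory_gb: float, bits_per_weight: float) -> float:
    """Inverse: parameter count (billions) whose weights fit in memory_gb."""
    return memory_gb * 8 / bits_per_weight


# Assumption: ~4.5 bits/weight for a Q4_0-style 4-bit quant (4-bit weights plus
# one 16-bit scale per block of 32). KV cache and runtime overhead are excluded.
BPW = 4.5
for size in (30, 35, 70):
    print(f"{size}B -> ~{weight_memory_gb(size, BPW):.1f} GB")
print(f"18-20 GB -> ~{params_for_memory(18, BPW):.0f}-{params_for_memory(20, BPW):.0f}B params")
```

That gives roughly 32-36B parameters for 18-20 GB of weights; a 70B model at the same bit width would need about 39 GB for weights alone.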
Skywork-R1V2-38B - New SOTA open-source multimodal reasoning model (Score: 169, Comments: 11): Skywork-R1V2-38B is a newly released open-source multimodal LLM combining the Qwen/QwQ-32B backbone with the InternViT-6B-448px-V2_5 vision encoder. It achieves SOTA results for open-source models with 73.6% on MMMU and 62.6% on OlympiadBench, with inference supported for both Transformers and vLLM stacks. Discussion highlights stable non-vision performance post-vision integration, LiveBench improvements vs. QwQ, and requests for GGUF format availability. Commenters note that Skywork-R1V2-38B achieves higher LiveBench scores (73.2 vs. 65.69) than QwQ on the new version of the benchmark, raising questions about benchmarking consistency and comparative non-vision performance. There is positive technical interest in the model’s ability to preserve language reasoning ability after multimodal integration.
- One user notes that Skywork-R1V2-38B is architecturally similar to qwq-32b with InternViT-6B-448px-V2_5 added as the visual encoder, highlighting how the model preserves strong performance on non-vision tasks after integrating vision capabilities. This suggests effective multimodal integration without regression on core language benchmarks.
- Benchmark results are discussed in detail: Skywork-R1V2-38B reportedly achieves a LiveBench score of 73.2, while qwq is listed at 65.69 for the April 2 release and 71.96 on the previous version. A user requests clarification on which LiveBench version was used, expressing curiosity about whether the new multimodal model actually outperforms the original qwq on non-vision tasks.
- A technical question is raised about the possibility of using reinforcement learning (RL) finetuning specifically for the image encoder before passing information to the language model. This indicates interest in improving the multimodal pipeline by optimizing the vision component independently for better downstream performance.

3. Recent Multimodal and Reasoning Model Benchmarks

New reasoning benchmark got released. Gemini is SOTA, but what’s going on with Qwen? (Score: 210, Comments: 51): The image summarizes results from the newly released PHYBench, a benchmark of 500 real-world physics problems designed to assess the physical reasoning abilities of large language models (LLMs). The chart ranks leading LLMs by accuracy and Expression Edit Distance (EED) score, showing that Gemini 2.5 Pro currently achieves the best overall performance, significantly ahead of other state-of-the-art models, while models like Qwen lag behind. The benchmark reveals persistent gaps between LLMs and human-level reasoning, especially in physical contexts. Commenters discuss the implications: Gemini 2.5 Pro’s impressive generalization; R1’s strong showing on a novel benchmark, contrary to claims that it was overly optimized for popular benchmarks; and the relative underperformance of Qwen models, which they speculate stems from the benchmark’s reliance on implicit physical knowledge rather than pattern matching from context.
- Several comments highlight that Gemini 2.5 Pro achieves state-of-the-art (SOTA) reasoning results on the new benchmark, outperforming established rivals such as GPT-4.1 and Claude Sonnet, and that R1’s strong showing in a previously unbenchmarked domain counters speculation that it was only strong due to optimization for known benchmarks.
- Multiple users suggest that Qwen’s weaker performance may stem from its smaller model size and specialized training data. Qwen models (including QwQ-32B) may lack the capacity to store broad real-world or physics knowledge, performing better when information is supplied in context than when it must be recalled. There is also speculation that Qwen’s training focused more on math and programming than on physics, which would explain its underperformance on questions demanding extensive factual knowledge.
- V3 (DeepSeek-V3) is reportedly the best non-reasoning model in the benchmark, outperforming other non-reasoning models like GPT-4.1 and Sonnet. R1’s performance exceeds that of o1, o3-mini, Grok-3, Sonnet Thinking, and Gemini 2 Flash, challenging prior assumptions that R1 and o1 were close rivals. Commenters read this as evidence that larger, more broadly trained models (“the whale is winning,” i.e., DeepSeek) have superior recall and reasoning in novel test domains.

Other AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. CivitAI Policy Changes and Community Response

The real reason Civit is cracking down (Score: 1090, Comments: 407): The post provides an ‘industry insider’ analysis of why Civitai and similar adult AI image sites are enacting stricter content moderation: The primary reason is pressure from Visa, specifically through Visa’s VAMP (Visa Acquirer Monitoring Program) which introduces stringent compliance requirements for adult AI content. Esquire Bank and ECSuite, the main US merchant processors for adult AI transactions, were recently heavily fined for non-compliance, prompting a sector-wide crackdown, as any entity accepting Visa/Mastercard payments must now comply with new, more restrictive standards or risk payment processing termination. The author claims there is no current sustainable workaround to this dependence on legacy credit card networks, other than seeking alternative payment rails, which is still largely infeasible for most platforms. Nomi.ai, cited by the author, has banned adult content for this reason. Commenters debate the implications of centralized payment processors as gatekeepers and argue for open-source and decentralized solutions as a partial response; one notes that Civitai confirmed this pressure in a livestream and predicts further reliance on distributed projects like Stable Horde and user-side model backups as crackdowns increase.
- Several comments highlight that Civitai’s crackdown is primarily in response to pressure from centralized payment processors, like Visa, which can deplatform services over content policies. The technical risk of relying on such gatekeepers is cited, with discussion of how open-source models and decentralized hosting (via projects like Stable Horde or community backups) are critical for resilience against institutional censorship.
- References are made to a livestream where Civitai staff confirmed external pressure, reinforcing the necessity for technical and architectural solutions outside centralized platforms. The discussion points towards model and data redundancies, encouraging users to participate in distributed projects to maintain access if mainstream platforms enforce content bans.
- The OnlyFans case is mentioned as a technical and strategic counterexample, questioning why Visa reversed its pressure so quickly there but continues to target AI-art related content. This has prompted technical readers to question the inconsistency in enforcement and the structural risks for platforms that depend on payment APIs with opaque, unpredictable policy enforcement.
CivitAI backup initiative (Score: 410, Comments: 110): The post announces the start of model purging on CivitAI and initiates efforts to backup and archive affected models, proposing /r/CivitaiArchives as a central hub for strategies, tips, and resource sharing. A top technical comment advocates for a decentralized solution using torrents, suggesting the creation of a clone website with torrent links for model snapshots, identified by truncated sha256 hashes (autov2), and highlights technical feasibility (simple nginx setup, vservers, and potential for community-driven voting/commenting). The centralized post with aggregation of backup methods is referenced at /r/CivitaiArchives/comments/1k6uhiq/civitai_backup_initiative_tips_tricks_how_to/ . Technical debate centers on the practicality and history of such initiatives, with skepticism that backup plans often lack follow-through but agreement on the need for robust, decentralized archiving solutions.
- Ueberlord proposes a peer-to-peer backup solution via torrents, suggesting the community create a website to mirror model snapshot pages from CivitAI, using the autov2 hash (first 10 chars of the model’s SHA256) as a unique identifier. The envisioned site would serve torrent files for models, could run on modest infrastructure (few small VPSs using nginx and a load balancer), and emphasizes keeping the system simple—additional features like commenting or voting could add complexity but are not immediately necessary.
- Upper-Reflection7997 highlights the importance of prioritizing the backup of specific model types, suggesting that ‘concept’ category LoRAs and realism models are critical due to higher likelihood of removal, in contrast to style or character LoRAs. This underscores a need for strategic selection in backing up models, rather than indiscriminate downloading.
- nalditopr shares an implementation reference, linking to the CivitAI-Model-grabber tool on GitHub, which automates the downloading of models from CivitAI. This tool could be an immediate asset in any model archiving initiative and may serve as a basis for further automation or distributed backup processes.
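Ueberlord's proposal keys each model on its AutoV2 hash, the first 10 hex characters of the file's SHA-256. A minimal Python sketch for computing it and building a hash-to-filename manifest for a local backup folder; `build_manifest` and the `*.safetensors` pattern are illustrative assumptions, not part of CivitAI-Model-grabber or any existing tool:

```python
import hashlib
from pathlib import Path


def autov2_hash(path, chunk_size=1 << 20) -> str:
    """First 10 hex chars of the file's SHA-256, the 'AutoV2' short hash."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()[:10]


def build_manifest(model_dir, pattern="*.safetensors") -> dict:
    """Map AutoV2 hash -> file name for every model file in a directory.
    A mirror site could key its (hypothetical) torrent pages on these hashes."""
    return {autov2_hash(p): p.name for p in sorted(Path(model_dir).glob(pattern))}
```

Hashing in 1 MiB chunks keeps memory use flat even for multi-gigabyte checkpoints.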
Lora removed by civitai :( (Score: 212, Comments: 97): The image attached is a non-technical meme, specifically a visual pun referencing terms like ‘golden shower’ in light of ‘Lora’ being removed by Civitai. The context, inferred from comments, relates to the moderation or removal of certain types of LoRA (Low-Rank Adaptation) AI models—potentially those linked to NSFW or controversial content—from the Civitai model sharing platform. There is no technical benchmarking, model detail, or implementation discussion in the image itself. Commenters express surprise and critique Civitai’s moderation choices, suggesting inconsistency in content policies and concern over increasing restrictions on model sharing, especially for adult or fringe AIGC content.
- Commenters discuss the increasing trend of content removal on Civitai for subjects ranging from niche genres (e.g., furry, weeb, gooner) to those allegedly infringing copyright, suggesting that Civitai’s moderation policy might be expanding in scope and impacting diverse LoRA datasets. This implies both a tightening of content guidelines and a possible risk for creators working with unconventional or edge-case themes.
Did civitai get nuked just now? (Score: 121, Comments: 151): Users are reporting that CivitAI, a platform for distributing Stable Diffusion models, was suddenly taken offline or had major content removals shortly after a scheduled maintenance, despite previous indications that it would remain up for days. The outage or takedown appears abrupt and potentially linked to external pressures or legal/commercial concerns, rather than purely technical maintenance. Commenters attribute the event to pressure from US venture capital (e.g., Andreessen Horowitz) and speculate that investor-driven priorities or compliance requirements led to the content removals or shutdown. Some advocate for future model-sharing platforms to opt for decentralized, non-incorporated, or torrent-based architectures to avoid similar takedowns.
- Technical commenters attribute the sudden mass deletion of checkpoints and resources on Civitai to pressure from US venture capital and payment processors (e.g., Visa/Mastercard), referencing historic patterns where open-source platforms are forced to restrict content or alter operations to comply with external investor or partner demands, potentially to enable continued growth or investment.
- Some speculate that the increased takedown activity by Civitai over the last 24 hours is due to a significant, recent scare or legal threat, which has resulted in aggressive content moderation and file removals. This aligns with patterns seen in previous open-source AI and community platforms after legal or financial risk is perceived.

2. OpenAI Model Access, Limits, and Feature Updates

OpenAI employee confirms the public has access to models close to the bleeding edge (Score: 2113, Comments: 312): The image contains a verified social media statement by ‘roon,’ asserting that the public has access to AI models very close (within two months) to the latest technological advancements, and that neither governments nor large corporations have access to superior models. The credit is given to OpenAI for this degree of transparency and accessibility, implying minimal internal retention of advanced capability. This statement addresses frequent speculation that major organizations possess significantly more capable, non-public models. Comments reflect skepticism, with some users suggesting OpenAI is responding to competitive pressure (especially from Google), skepticism about transparency claims, and debate over the company’s privatization and intentions regarding the openness of AI.
- A commenter highlights that OpenAI’s decision to emphasize the public’s access to recent AI models suggests increased competitive pressure from Google, implying a rapid pace of innovation and deployment from multiple industry leaders.
- Another user notes that OpenAI’s public narrative often positions itself as a safeguard against closed, corporate or government-dominated AI, but questions how much credit OpenAI should claim for the current state of model accessibility, acknowledging that both funding requirements and competition were equally critical in shaping their openness.
- There is discussion about the definition and evolution of “OpenAI”—specifically, the community’s scrutiny around the company’s move from openness towards privatization, and how this is perceived relative to transparency, model release cadence, and the broader impact on AI accessibility and safety debate.
OpenAI Plus users now apparently receive 25 Deep Research queries per month (Score: 249, Comments: 36): The image illustrates the OpenAI Plus UI now providing a monthly allocation of 25 Deep Research queries, up from a previous limit of 10 per month. This usage cap is displayed directly in the interface, indicating a system to track and enforce query limits for the Deep Research feature. Commenters highlight that while this increase improves the feature’s value for Plus subscribers, it still trails behind alternatives like Google’s Gemini subscription, which reportedly offers more generous or unlimited access. Some users note the lack of an official announcement regarding this change.
- Several commenters note that the Deep Research query cap for OpenAI Plus users has increased from 10 to 25 per month, which some see as a notable improvement, especially for those who regularly hit the previous limit. Comparisons are drawn to Google’s Gemini, which is said to offer unlimited queries for advanced users, making it more attractive for high-volume needs. One user asks whether Deep Research relies on o3, which they regard as hallucination-prone, reflecting ongoing concerns about factual reliability in advanced LLM features.
- New plus user limits? (04/24/2025) (Score: 155, Comments: 51): The post shares newly updated usage limits for OpenAI’s ChatGPT Plus, Team, and Enterprise accounts, as reflected both in an OpenAI Help Center article and a screenshot. The updated quotas are: 100 o3 messages/week, 300 o4-mini messages/day, and 100 o4-mini-high messages/day. The Pro plan advertises near-unlimited access, pending adherence to OpenAI’s ToU, specifically barring abuse, account sharing, or unauthorized resale (see official doc). Commenters note that Deep Research capacity has increased to 25, and recall significantly lower GPT-4 message limits in the past, contextualizing these changes as incremental improvements in model accessibility.
  - Multiple users discuss recent changes to usage limits: message caps for Deep Research have been raised to 25, with references to historical GPT-4 constraints, which were previously 25 messages every 3 hours.
  - There is ongoing confusion about quotas for GPT-4.5, with one user experiencing a per-week limit (50 per week). Users note inconsistent warnings, even after minimal usage, suggesting potential misalignments or bugs in quota tracking.
  - Some speculate about future subscription pricing changes (e.g. increasing to $25/mo) as a possible response to increasing usage or adjusting for enhanced resource allocation with higher message limits.
- Here are the new limits for Plus (Score: 152, Comments: 36): OpenAI has updated usage limits for ChatGPT users: Plus, Team, and Enterprise subscribers now receive 100 messages/week with the new o3 model, 300 messages/day on o4-mini, and 100 messages/day with o4-mini-high (source). These changes reflect a significant increase over previous GPT-4 allocations (formerly 20 messages/3 hours). Commenters highlight optimism about improved access and pricing for ‘agents’ in future tiers (e.g., Pro), mention an additional cap of 25 Deep Research queries/month, and inquire about the status of ‘4.5,’ specifically its suitability for writing tasks.
  - Users are discussing the evolution of GPT-4 model message limits, noting that prior restrictions (such as 20 messages every 3 hours) have significantly improved with the introduction of o3/o4 series models. The possibility is raised that by 2026/2027 agent pricing and usage limits, especially for Pro subscriptions, will advance further, potentially offering more value for heavy users.
  - There is a mention of a specific technical limitation: some users report a cap of 25 Deep Research queries per month, indicating that higher-capability or research-oriented models still have stricter access controls distinct from standard conversations.
  - A user indicates a desire for a new pricing tier (e.g., $40/month) that would provide double the usage limits for models like 4.5 and the reasoning variants, suggesting current tiers still leave a gap for advanced personal use without requiring enterprise/unlimited offerings.
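The caps reported above (100 o3 messages per week, 300 o4-mini and 100 o4-mini-high messages per day) work like windowed rate limits. Below is a minimal sketch of how such a cap could be tracked. It assumes a rolling window, and OpenAI has not said whether its windows are rolling or calendar-based. `QuotaTracker` is a hypothetical name, not OpenAI's implementation.

```python
from collections import deque
from dataclasses import dataclass, field

DAY = 24 * 60 * 60
WEEK = 7 * DAY


@dataclass
class QuotaTracker:
    """Hypothetical rolling-window message cap (not OpenAI's actual mechanism)."""
    limit: int
    window_seconds: int
    _stamps: deque = field(default_factory=deque)

    def _evict(self, now: float) -> None:
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()

    def remaining(self, now: float) -> int:
        self._evict(now)
        return self.limit - len(self._stamps)

    def try_send(self, now: float) -> bool:
        # Record the message only if the cap has not been reached.
        if self.remaining(now) <= 0:
            return False
        self._stamps.append(now)
        return True


# Limits as reported in the Reddit threads above.
limits = {
    "o3": QuotaTracker(limit=100, window_seconds=WEEK),
    "o4-mini": QuotaTracker(limit=300, window_seconds=DAY),
    "o4-mini-high": QuotaTracker(limit=100, window_seconds=DAY),
}
```

A rolling window lets capacity trickle back one message at a time. A calendar reset would instead restore the whole allowance at once, which could explain some of the inconsistent warnings users describe.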

3. AI Coding Assistant Tools, DIY Setups, and Productivity

I was rejected by CursorAI, so I built my own “Cursor”… And it’s WAY better and here is how you can create yours. (Score: 162, Comments: 71): The OP, a non-coder, reverse-engineered their own version of a CursorAI-like coding assistant after being rejected by CursorAI, highlighting shortcomings in Cursor’s agentic system—specifically, excessive reliance on a cheaper agent model causing rapid context loss (forgetting prompts within 2-3 turns), which they claim is avoided in Windsurf, Claude Code, and their own tool. Their setup employs Vite+React+Node (not Next), uses a VS Code extension to maintain up-to-date file trees (file-tree.md) and documentation (docs.md), and orchestrates LLM-driven code editing with a lightweight agent (GPT-4o-mini) for file selection, delegating main edits to Anthropic Claude 3.5 via structured JSON prompts, integrating GitHub/Netlify for CI/CD. Cost is comparable to commercial tools, but improved accuracy reduces token spend per actual edit. See demo video. Some commenters challenge the OP’s credibility in software architecture, asserting expertise requires years of hands-on coding, while others congratulate the methodical, product-driven approach and note the omission of competing tools like Cline or Roo in the comparison.
- Discussion emphasizes the difference between high-level product guidance and actual software architecture, highlighting that true architectural knowledge requires years of hands-on experience with complex systems, not just conceptual understanding.
- Several commenters compare tools: Cursor (geared toward coders) and Lovable (targeted at non-coders). They also mention features such as custom instructions in Cursor or Windsurf, and report decent results from Claude Desktop with a monthly subscription plus MCP servers. These comparisons clarify the feature set and intended audience of each tool.
- A technical critique is raised: actual software development in industry focuses on maintaining and evolving large, complex codebases deployed in production, rather than generating new codebases for simple or short-term use cases, framing the market relevance and the distinction between hobbyists and revenue-driven users.
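The OP's setup splits the work in two. A cheap model (GPT-4o-mini) picks the relevant files from file-tree.md, and a stronger model (Claude 3.5) returns edits as structured JSON. Below is a minimal sketch of that control flow with both model calls stubbed out. `select_files`, `apply_edits`, and the JSON shape `{"edits": [{"path", "content"}]}` are hypothetical; they are not the OP's actual schema.

```python
import json
from typing import Callable

# A model call is any function from prompt text to response text. In a real
# setup these would wrap the cheap selector model and the stronger editor model.
ModelFn = Callable[[str], str]


def select_files(file_tree: str, request: str, selector: ModelFn) -> list[str]:
    """Ask the cheap model which files matter; expects a JSON list of paths."""
    prompt = (
        "File tree:\n" + file_tree
        + "\n\nRequest: " + request
        + "\nReturn a JSON list of the file paths that must be read or edited."
    )
    paths = json.loads(selector(prompt))
    known = set(file_tree.split())  # assumes paths contain no whitespace
    # Ignore any path the selector invented that is not in the tree.
    return [p for p in paths if p in known]


def apply_edits(files: dict[str, str], request: str, editor: ModelFn) -> dict[str, str]:
    """Send only the selected files to the strong model and apply its JSON edits."""
    prompt = json.dumps({"request": request, "files": files})
    reply = json.loads(editor(prompt))
    updated = dict(files)
    for edit in reply["edits"]:
        if edit["path"] in files:  # only accept edits to files we actually sent
            updated[edit["path"]] = edit["content"]
    return updated
```

Sending the stronger model only the selected files keeps its prompt small. That is the mechanism behind the OP's claim of lower token spend per edit.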
I am in software engineering for more than 15 years. And I am addicted to the AI coding. (Score: 447, Comments: 169): The attached image shows a user-built developer assistant UI that interfaces with the OpenAI API for end-to-end code editing and project management, contextualizing a real-world transition away from manual coding and browser-based chat workflows. The screenshot demonstrates the assistant processing a detailed natural language request: ‘fix meta_sdk_client.py to use Facebook SDK objects and expose methods in a specific module.’ The assistant lists the files it loaded, details changes made (such as integrating more Facebook SDK functionality and improving error handling with Pydantic schemas), and outputs an update. The post explains significant productivity and cost improvements after moving from OpenAI’s o1 to the o4-mini model, emphasizing larger context handling (10x) and 1/15th the API cost while maintaining high-quality output. The poster has also architected an AI agent system that decomposes tasks, mimicking multi-tier developer collaboration autonomously. Commenters discuss customizing LLMs for educational workflows and compare tools like Cursor and Cascade for targeted coding. There is debate about the trade-offs: some warn that complete AI-driven codebases lead to bloat and unintended changes (‘vibe coding’), expressing a preference for AI as an assistant rather than the sole developer.
- A user discusses customizing LLMs as coding teachers tailored to individual learning styles, specifically requesting code generation that follows a teach-and-build pedagogical pattern—raising the question of how AI-driven coding environments can be tuned for deeper, incremental understanding instead of pure code completion.
- Multiple commenters caution about the pitfalls of heavy AI-driven code generation (‘vibe coding’), emphasizing that these models tend to produce bloated or excessive code, often adding unnecessary components beyond what was requested. This bloat can lead to harder-to-maintain codebases and increased technical debt, making manual intervention and post-generation refactoring common.
- A recurring theme is how reliance on AI coding tools leads developers to lose detailed system-level knowledge over time. Previously, traditional hands-on coding practices fostered an intimate architectural understanding, whereas generated code leads to vaguer mental models and diminished debugging intuition, particularly when diagnosing and troubleshooting complex issues.

AI Discord Recap

A summary of Summaries of Summaries by Gemini 2.5 Pro Exp

Theme 1: New Models & Benchmarks Stir the Pot

Grok 3 and Gemini 2.5 Throw Down the Gauntlet: Grok 3 shows performance similar to GPT-4.1 according to OpenAI-MRCR results, while Gemini 2.5 Pro draws praise for reasoning but criticism for verbosity and potential hallucinations, with one user stating OAI deep research vs gemini deep research is like day and night. Anticipation builds for upcoming Qwen 3 and DeepSeek model releases, potentially debuting at the Singaporean AI conference.
Unsloth’s Dynamic Quants Smash Benchmarks: Unsloth AI launched Dynamic v2.0 GGUFs, detailed in their benchmarks and analysis doc, setting new records on 5-shot MMLU and KL Divergence. The release enables running and fine-tuning quantized LLMs while preserving accuracy. A separate update to the Maverick quants calls for a hefty 600GB download of Q4/Q8 data. The community shared results on Twitter and a Hugging Face collection.
Context Arena Steps Up for Long Context Visualization: A new Context Arena benchmark was released, providing visualizations of LLM performance over long context windows. It features a sortable leaderboard, interactive charts, model selectors, and welcomes community feedback for adding more models or benchmarks.

Theme 2: Hardware Headaches & High Hopes for Silicon

AMD GPUs Still a Pain for Local LLMs: Users running local models on AMD GPUs (like the RX 6750 XT) continue to face challenges, citing the need for substantial VRAM and specific Linux ROCm drivers, referencing this Reddit thread. Testing smaller models with tools like LM Studio was suggested as a starting point to gauge performance on AMD hardware.
VRAM vs DRAM Battle Creates Bottlenecks: Running large models (120B) that exceed GPU VRAM (e.g., on a 4090) and offload to system DRAM results in drastically reduced performance (0.5 tok/s reported), reinforcing the consensus that you don’t want the model to touch system RAM. Discussions highlighted the risks of buying used 3090s for their 24GB VRAM, calling it kind of a lottery, with potential hidden costs for thermal pad replacements.
Blackwell’s FP4 & MI300 Benchmarks Generate Buzz: Details on NVIDIA Blackwell’s FP4 implementation were sought, with a potential reference found in the OCP Microscaling Formats MX v1.0 spec. Meanwhile, the amd-fp8-mm leaderboard saw a flood of MI300 benchmark submissions, though some shape dimensions like n=576 were noted as potentially suboptimal for targeting DeepSeek R1 inference shapes.

Theme 3: Tooling Trials, Triumphs, and Tribulations

Unsloth Targets Multi-GPU, Tackles Ollama Glitches: Unsloth AI announced upcoming native multi-GPU support, while suggesting the accelerate library as an interim solution, emphasizing the need for careful model splitting. Users also reported issues pulling Unsloth models via Ollama, attributed to Hugging Face issues and Ollama’s custom registry.
Aider Gets Flag Fixes, Eyes GitHub Integration: Aider users discussed managing context prompts in agentic workflows, suggesting the --yes flag or “don’t ask again” options, while requesting an auto-skip flag. A feature request emerged to integrate Aider directly with GitHub Issues for automated code fixing and pull request generation.
Cursor Copes with Context Overload & Copilot Conflicts: Cursor users reported performance issues when enabling the ‘include project structure’ setting, potentially overloading the context window and exceeding tool limits. Separately, users noted newer Cursor versions block GitHub Copilot Chat, requiring a rollback to v0.46 for concurrent use, fueling speculation that Microsoft is restricting Copilot to VSCode.

Theme 4: API Anguish, Access Adventures & Alternatives

Sora Limits & Search Caps Squeeze Users: New OpenAI users face temporary Sora video generation limits, while existing users see deep search caps at 15 per month, prompting humorous frustration like “i swear plus is becoming insane”. This sparked requests for deep research alternatives with higher usage limits than services like Perplexity or You.com.
Brave API Brittleness & OpenAI Costs Bite Budgets: The Brave API was reported as unstable, with 50% of summarizer requests failing (HTTP 504 errors), even for paid subscribers according to their Terms of Service. Concurrently, OpenAI web search was flagged for making up 95% of agent pipeline costs, leading one user to ditch APIs and build their own scraping solution, overcoming their “phobia of scraping”.
Free LLMs Flow via Open Router & Copilot Hacks: Users shared methods for accessing free LLMs, including using Open Router (openrouter.ai) with model IDs like openrouter/microsoft/mai-ds-r1:free in Aider. Additionally, a student with GitHub Copilot Pro discovered seemingly unlimited API access to GPT-4o and other models for code generation.

Theme 5: Community Creations, Concerns & Collaborations

Hugging Face Hubbub: Integrations, Acquisitions & Errors: Hugging Face announced several updates: integrating transformers into vLLM, revamping quantization docs, adding MCP client support to the JS SDK, acquiring Pollen Robotics, and supporting Cohere models. However, users reported 503 errors accessing meta-llama/Llama-3.2-3B-Instruct, sometimes resolved by requesting model access, while the Agents Course deadline was extended to July 1, 2025.
MCP Momentum Builds with Extensions, Servers & Specs: The Model Context Protocol (MCP) ecosystem saw growth with Siloed releasing a browser extension, a tool to turn a git repo into an MCP server, and LlamaIndex.TS integrating thousands of MCP servers via a simple mcp() call (example). Cloudflare is also pushing for OAuth and Dynamic Client Registration for remote MCP server authentication.
AI Paper Avalanche Raises Reviewer Red Flags: A massive increase (>2x) in ACL paper submissions (>8500) fueled debate about AI-generated papers, potentially created with tools like AI-Scientist-v2. Some reviewers expressed reluctance to police AI usage, with one stating they would not review if required to adjudicate bans based on AI writing the core substance of papers.

PART 1: High level Discord summaries

OpenAI Discord

Sora Limits Video Generation: New OpenAI users are temporarily unable to generate videos with Sora, while existing accounts now get only 15 deep searches per month.
- The changes in access have sparked humorous reactions among users, with one user exclaiming, “i swear plus is becoming insane”.
Deep Research Alternatives Requested!: A user is seeking robust deep research alternatives with higher usage caps than Perplexity and You.com, within a similar budget.
- The inquiry highlights the need for more extensive research capabilities for users requiring more than the current limits allow.
AMD GPU Struggles Running Local Models: A user with a Ryzen 7500f, 32 GB RAM, and RX 6750 XT seeks advice on running local models, noting the need for substantial VRAM and Linux ROCm drivers, pointing to a relevant Reddit thread.
- Another user suggested testing a small model with LM Studio to assess performance.
Future of Software Engineering Faces AI: Members debated the job outlook for software engineers amidst AI advancements, with concerns raised about the potential replacement of junior developers, especially in web development.
- A third-year software engineering student believes a COMPETENT software engineer’s job will be safe for at least 5-10 more years, noting graphic designers are getting the short end of the stick.
ElevenLabs Alternatives: A user requested alternatives to ElevenLabs for text-to-speech, citing a lack of emotion in the AI voices.
- Other users encouraged them to try user-contributed voices like “Her” and create their own voices using this link.

LMArena Discord

Grok 3 Competes with GPT-4.1: According to OpenAI-MRCR results, Grok 3 performs similarly to GPT-4.1, while Grok 3 Mini does a bit better on lower context but worse on higher context.
- There were some reported issues with running Grok 3 Mini due to API endpoint problems, and a website for more detailed results is planned.
Gemini 2.5 Pro Gets Praise, Provokes Debate: Members discussed the merits of Gemini 2.5 Pro, noting its ability to call out users for being wrong, while one member stated OAI deep research vs gemini deep research is like day and night, not comparable, these benchmarks just marketing scheme.
- Despite praise for Gemini 2.5 Pro’s reasoning abilities, concerns were raised regarding its verbosity and potential hallucination issues.
Qwen 3 and DeepSeek Debut Anticipated: Excitement surrounds the potential release of Qwen 3 and new models from DeepSeek, with speculation that the Singaporean AI conference could be a launch venue.
- One member shared that im more excited for qwen 3 than deepseek tbh.
O3 vs O4 Mini Duel for Dominance: Members are discussing the comparative performance of O3 and O4 Mini, with one member asking could O3-mini-high be better than O3?(in some tasks).
- Discussion suggests that O4 Mini might be a replacement for O3 in some contexts, though the reasoning is not clear.
Context Arena Visualizes LLM Context: A new benchmark called Context Arena was released, visualizing LLM performance over long context, including a sortable leaderboard, interactive charts, and model selector.
- The creator welcomes feedback and suggestions for additional models or benchmarks, offering detailed drill-down capabilities and data export.

Unsloth AI (Daniel Han) Discord

Unsloth Multi-GPU Support Launching: Native Unsloth multi-GPU support is coming soon, though members can use the accelerate library for multi-GPU support in the interim.
- Efficient multi-GPU training requires careful splitting of the model across GPUs to maintain decent GPU usage and limit memory overhead.
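Splitting a model across GPUs, as described above, comes down to assigning contiguous blocks of layers to devices so that no device runs out of memory. Below is a minimal greedy sketch. It assumes per-layer memory sizes are known in advance. `split_layers` is a hypothetical helper and is not part of Unsloth or accelerate.

```python
def split_layers(layer_gb: list[float], gpu_gb: list[float]) -> list[int]:
    """Greedy pipeline split: return the GPU index assigned to each layer.

    Layers stay in order and fill GPU 0 first, then spill to GPU 1, and so on,
    so activations cross a device boundary at most len(gpu_gb) - 1 times.
    """
    assignment = []
    gpu, used = 0, 0.0
    for size in layer_gb:
        # Move to the next GPU once the current one cannot fit this layer.
        while gpu < len(gpu_gb) and used + size > gpu_gb[gpu]:
            gpu, used = gpu + 1, 0.0
        if gpu == len(gpu_gb):
            raise MemoryError("model does not fit on the given GPUs")
        assignment.append(gpu)
        used += size
    return assignment
```

Greedy front-filling leaves later GPUs lightly loaded. Capping each device at roughly total size divided by GPU count gives more even utilization, which is the trade-off the discussion highlights.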
Maverick Quants Get Better: The Maverick quants have been updated and members can assume they are better, with details to be disclosed in a blog post shortly.
- Members were told to prepare for a 600GB download of Q4 and Q8 data, but that the update was expected to be worth it.
Dynamic v2.0 GGUFs Unleashed: The launch of Dynamic v2.0 GGUFs was announced along with a link to benchmarks and analysis.
- Enthusiasts shared a Kaggle demonstration script for verifying stats.
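The KL Divergence numbers in Unsloth's benchmarks compare a quantized model's next-token distribution against the full-precision model's distribution. Below is a minimal sketch of that metric on raw logits. It assumes both models share a vocabulary, and averaging per-token KL over a sequence is one common way to report it. Nothing here reproduces Unsloth's actual evaluation code.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    m = max(logits)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits: list[float], quant_logits: list[float]) -> float:
    """KL(P_ref || P_quant) in nats for a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl(ref_batch: list[list[float]], quant_batch: list[list[float]]) -> float:
    """Average per-token KL across positions."""
    pairs = list(zip(ref_batch, quant_batch))
    return sum(kl_divergence(r, q) for r, q in pairs) / len(pairs)
```

A KL of zero means the quantized model assigns exactly the same probabilities as the original. Unlike accuracy on a benchmark, KL catches shifts in the whole distribution, not just in the top token.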
Ollama Struggles to Pull Unsloth Models: Members reported failures when pulling the manifest for Unsloth models with Ollama.
- The issue was attributed to HF issues and the custom registry that Ollama uses.
Gemma 3 Vision Fine-Tuning Glitches: A user reported encountering a TypeError: 'int' object is not iterable when fine-tuning Gemma 3 for vision, following the Unsloth guide.
- The error was resolved by disabling trust_remote_code, but the user then ran into storage issues.

aider (Paul Gauthier) Discord

Functional Languages Fight Python Future: Members speculated if functional programming languages might replace Python due to superior AI capabilities in reasoning, coding, and math.
- A member joked about interpreting farts, while another questioned Microsoft’s model training on potentially dodgy data.
EV Super Throttles Spark Safety Struggle: A user doubted that current AI can reliably write C code for electric vehicle ‘super throttles’, and expressed reservations about using AI in safety-critical systems.
- They highlighted the gap between current AI capabilities and real-world safety needs.
Open Router Opens Free LLMs: A user shared how to access free LLMs via Open Router, directing users to openrouter.ai to filter models and use the model ID in Aider with the /model openrouter/<paste> command, such as /model openrouter/microsoft/mai-ds-r1:free.
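Picking free models can be scripted by filtering for IDs that end in ":free", as in the example ID above, and formatting them as Aider commands. Below is a minimal sketch. It assumes OpenRouter's public model list is served at the URL below in the shape {"data": [{"id": ...}]}; check that against OpenRouter's documentation before relying on it. The helper names are hypothetical.

```python
import json
import urllib.request

MODELS_URL = "https://openrouter.ai/api/v1/models"  # assumed public model-list endpoint


def free_model_ids(models_payload: dict) -> list[str]:
    """Return model IDs carrying the ':free' suffix from a {'data': [{'id': ...}]} payload."""
    return sorted(m["id"] for m in models_payload.get("data", []) if m["id"].endswith(":free"))


def aider_model_command(model_id: str) -> str:
    """Build the Aider command described above: /model openrouter/<id>."""
    return f"/model openrouter/{model_id}"


def fetch_models(url: str = MODELS_URL) -> dict:
    # Network call; kept separate so the filtering logic can be tested offline.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```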
Aider Flags Fix Free-Flowing Prompts: Users discussed disabling Aider’s auto-prompting for context in agentic workflows, noting irrelevant context hinders performance.
- Members suggested using the --yes flag or the don’t ask again option, while noting the need for an auto-skip flag for context inclusion prompts.
Slash Tokens and Save with Commits: A user inquired about consolidating the commit action into the first API call in Aider to cut down on input tokens and prevent API call spirals.
- The user anticipates saving 2k-3k tokens per fix attempt and lowering the chance of Aider spiraling out of control with API calls.

Cursor Community Discord

Composer to Read Git Commit Context?: A user inquired about integrating Git commit messages into Composer, discussing the feasibility and potential benefits of such a feature.
- One user suggested that this could be done using git log --oneline.
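The git log --oneline suggestion could be wired up by collecting recent commit subjects and prepending them to a prompt as context. Below is a minimal sketch using Python's subprocess module. `recent_commits`, `commit_context`, and the output format are hypothetical; this is not a Cursor or Composer feature.

```python
import subprocess


def recent_commits(repo: str = ".", n: int = 20) -> list[str]:
    """Return the last n commits as 'hash subject' lines via git log --oneline."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--oneline", "-n", str(n)],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()


def commit_context(lines: list[str]) -> str:
    """Format commit lines as a block that can be prepended to a prompt."""
    parsed = []
    for line in lines:
        sha, _, subject = line.partition(" ")
        parsed.append(f"- {subject} ({sha})")
    return "Recent commits:\n" + "\n".join(parsed)
```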
Project Structure Setting Overloads Context Window?: Users reported that enabling the ‘include project structure’ setting in Cursor causes excessive file reading, potentially exceeding the tool limit and slowing performance.
- A user suggested toggling the setting on and off, noting a possible bug, and another mentioned that this feature takes up the context window.
Sonnet 3.7 Now Costs 2 Requests: Users reported that Sonnet 3.7 now consumes two fast requests at once, effectively doubling the cost of using the model and depleting credits quickly.
- A user suggested checking for a Thinking mode toggle or switching back to the regular 3.7 Sonnet.
Multiple Agents Run at Once: Users discussed the ability to run multiple Cursor Agents simultaneously, questioning whether submitting multiple prompts would affect performance or resource usage.
- One user mentioned using multiple tabs but said running agents in parallel is not possible yet. Another warned about potential memory leaks and project corruption with overlapping agents, advising testing with smaller tasks.
Rollback Required for Co-pilot Chat: Users noted that newer versions of Cursor block the use of copilot chat, while rolling back to version 0.46 allows concurrent use.
- One user speculated Microsoft is restricting copilot to VSCode only.

Manus.im Discord

Engineer Kicks off Hybrid Peer-to-Peer Multiplayer Framework: A member proposed creating an open-source hybrid peer-to-peer multiplayer framework for physics-based MMOs, envisioning thousands of players.
- The goal is to give more control over websites than platforms offer by default.
LLM Analyzes Large Data: A member shared a video demonstrating LLM-based analysis of a 15GB file, available in 9-12th grade version and original version.
- This analysis bridges the gap in understanding how LLMs can process substantial datasets.
Gemini Advanced Free for 15 Months for Students: Gemini Advanced is offering 15 months free to students who sign up with their .edu email before June and verify their status.
- Google also just added Veo 2, a text-to-video generation tool, to Gemini workspace.
Manus glitches on HTML Previews: A user found that when prompting Manus to generate an HTML file for a chat app, the preview linked to Manus’ internal JS files instead of the user’s specified source.
- Members suspect that these bonus credits are worse in performance than before.
Manus wrecks GitHub Repos: A user reported that when Manus is instructed to improve a GitHub repository, it sometimes deletes entire files and recreates them from scratch instead of making targeted edits, especially with a 100kb file.
- The user recommends instructing it to report only what it changed, and modularizing the code into multiple shorter files organized into subfolders, then zipping them.

Yannick Kilcher Discord

GPT-4o API Access Unlimited through GitHub Copilot: A student with GitHub Copilot Pro discovered full, unlimited API access to GPT-4o for code generation, including older and experimental models, and also found the system prompt for GitHub Copilot in full.
- Members speculated that the models should have human rights, or perhaps should be treated as pets, or as an exocortex.
Brave API Plagued by Instability: Users reported the Brave API is unstable, with 50% of summarizer requests resulting in HTTP 504 gateway timeout errors, despite generous API timeout settings, according to the Brave API’s terms of service.
- Subscribers to the “Pro AI” plan found that the summarizer API didn’t work, even with both “Pro AI” and “Web Search” subscriptions.
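When half of the summarizer calls time out, a client-side retry with exponential backoff and jitter is a common stopgap. Below is a minimal sketch that retries on HTTP 504 only. `call` stands in for whatever function issues the Brave request and is assumed to raise `urllib.error.HTTPError` on failure. Nothing here is Brave-specific, and `with_retries` is a hypothetical helper.

```python
import random
import time
import urllib.error
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(call: Callable[[], T], attempts: int = 4, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call` on HTTP 504 with exponential backoff plus full jitter."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return call()
        except urllib.error.HTTPError as err:
            # Only gateway timeouts are retried; other errors propagate immediately.
            if err.code != 504 or attempt == attempts - 1:
                raise
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")
```

Retries only mask instability. At a 50% failure rate each request costs about two calls on average, so this helps with reliability but not with cost.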
OpenAI Web Search Bankrupts Budgets: Members flagged OpenAI web search as accounting for 95% of agent pipeline costs, making it prohibitively expensive.
- Frustrated with API costs and instability, one member decided to implement their own web scraping and summarization solution, overcoming their “phobia of scraping” and quipping “how hard can it be we have AI these days, right?”
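For anyone going the same DIY route, the core of a scrape-and-summarize step is fetching a page and stripping it to visible text before handing it to a model. Below is a minimal standard-library sketch. `fetch_text`, `html_to_text`, and the User-Agent string are hypothetical. Real sites will need more care, including JavaScript-rendered content, robots.txt, and rate limits.

```python
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/noscript contents."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0  # > 0 while inside a skipped element
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if not self._depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_text(url: str, timeout: float = 10.0) -> str:
    req = urllib.request.Request(url, headers={"User-Agent": "research-bot/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return html_to_text(resp.read().decode(charset, errors="replace"))
```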
Debate Surrounds Muon Optimizer: Moonlight.io’s review suggests that Muon addresses limitations of Keller’s work and claims faster performance than Adam in limited contexts, though a member questioned if it is aiming to ride the current.
- However, a recent paper (arxiv link) frames Muon as a first-order trust-region optimization method under the matrix spectral norm, providing convergence guarantees, while Muon has also been adopted in training diffusion models for real-time edge deployment.
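The core of Muon is orthogonalizing each weight matrix's momentum update with a few Newton-Schulz iterations, which pushes its singular values toward 1. Below is a minimal NumPy sketch of that step. It uses the quintic coefficients (3.4445, -4.7750, 2.0315) from Keller Jordan's public reference implementation. The surrounding `muon_step` is simplified (no Nesterov momentum, no shape-dependent learning-rate scaling), and its hyperparameter defaults are illustrative.

```python
import numpy as np


def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5, eps: float = 1e-7) -> np.ndarray:
    """Approximately map g = U S V^T to U V^T (singular values pushed toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic iteration coefficients
    x = g / (np.linalg.norm(g) + eps)  # Frobenius normalization keeps singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the smaller Gram matrix x @ x.T
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * gram @ gram) @ x
    return x.T if transposed else x


def muon_step(w: np.ndarray, grad: np.ndarray, momentum: np.ndarray,
              lr: float = 0.02, beta: float = 0.95):
    """Simplified Muon update for one 2-D weight matrix."""
    momentum = beta * momentum + grad
    w = w - lr * newton_schulz_orthogonalize(momentum)
    return w, momentum
```

The iteration does not converge exactly to 1. It drives all singular values into a band around 1, which is enough to equalize update magnitudes across directions, the property the trust-region framing above formalizes.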
Two Minute Papers Mixes Up Mensa: A member highlighted that Two Minute Papers made a mistake in their video, comparing the global AI results from 2024 to the results from Mensa Norway in 2025, pointing to TrackingAI.org to claim that no model is above 118 IQ and the 118 IQ model is Gemma 2.0 Pro, o3 is not even close.
- The member also wondered whether Two Minute Papers has declined in quality since the start of the DL-craze.

Notebook LM Discord

AO Length Depends on Sources: Users determined that the length of the Audio Overview (AO) is dependent on the sources used, rather than the NLM version.
- One user set up custom instructions on Perplexity for detailed planning, achieving outputs over 10K words from a full report source.
Gemini 2.5 Pro Excels in Legal Field: Gemini 2.5 Pro generated novel legal arguments that models like Sonnet 3.7 and GPT missed, and can condense arguments into podcasts.
- While the team tests Gemini 2.5 Pro, NLM currently runs Gemini 2.0 Flash.
Hacking Audio Overview Transcripts: To extract the text transcript of an Audio Overview, one can download the audio overview as a WAV, then bring the WAV into the Notebook as a Source to have it transcribed.
- Another user found that they could use ai.dev, upload the file, and ask Gemini 2.5 to provide the transcript.
Podcast Customization Flourishes: Users leverage K-12 AI policy documents to compare and contrast them against documentation, while others customize podcasts for stakeholder groups using fishing plans from the Department of Fisheries & Oceans.
- These custom podcasts can be designed to cater to specific stakeholder interests.
LaTeX Support for NotebookLM has Glitches: Users report that LaTeX is not working in NotebookLM and the team confirms that it’s a Gemini models issue that they’re addressing.
- One user sees formulas for the first time using the Mindmap feature in NotebookLM, but letters after an underscore render as subscripts and superscripts incorrectly.

HuggingFace Discord

HF Integrates Transformers with vLLM, Acquires Pollen Robotics: Hugging Face now integrates the transformers backend into vLLM, revamps quantization docs, adds MCP client support to the JS SDK’s inference client (announcement), acquires Pollen Robotics (announcement), and supports Cohere/Cohere Labs models (announcement).
- These updates aim to enhance model deployment, documentation, client support, expand into robotics, and broaden model offerings on the Hugging Face Hub.
TinyLlama Chatbot Experiences Prompt Hallucinations: A member reported that their TinyLlama customer support chatbot was hallucinating and making up user prompts.
- Suggestions included using an AI API endpoint through a service provider or cloud inference options like FriendliAI.
Unsloth’s Dynamic v2.0 Quants Break Benchmarks: The community is buzzing about Unsloth’s Dynamic v2.0 quants, which are setting new benchmarks on 5-shot MMLU and KL Divergence and enabling running and fine-tuning quantized LLMs while preserving accuracy.
- More information can be found on their Tweet and the Hugging Face collection.
Agents Course Deadline Extended, Llama-3.2-3B-Instruct Temporarily Unavailable: The deadline for the Hugging Face Agents Course has been extended to July 1, 2025 and several members reported receiving 503 Server Error when trying to run meta-llama/Llama-3.2-3B-Instruct.
- One member resolved the error by requesting access on the model page while the deadline extension can be found at the Hugging Face Agents Course.
Community Launches New Models, Datasets, and Tools: Members launched the BitNet b1.58 2B4T Demo on Hugging Face (link), a Beat Saber QA Chatbot dataset (link), a tool to convert HF models to GGUF format (GitHub), and the Mistake-To-Meaning dataset (link).
- These resources aim to facilitate model deployment, dataset creation, format conversion, and typo understanding for smaller models.

LM Studio Discord

Community Debates LLPlayer Safety: Users discussed the safety of the LLPlayer GitHub project, with one member suggesting building it yourself as the safest option.
- Concerns centered around potential vulnerabilities in pre-built binaries, but the open-source nature of the project allows for inspection and customization.
Gemma 3 Models Debut Smoothly: Users confirmed that Gemma 3 models work out of the box with LM Studio v0.2.31+.
- Those on older versions experienced issues, which were resolved by updating to the latest version.
Pixtral Support Teased, But Text-Only For Now: Pixtral support was added to llama.cpp but is currently text-only, though one user pointed out that Pixtral has been usable with MLX for a long time.
- The community anticipates the addition of vision capabilities to fully exploit Pixtral’s potential within LM Studio.
DRAM bottleneck crushes performance: A user reported a drastic performance drop to 0.5 tok/s when a 120B model exceeded VRAM and offloaded to 128GB DDR4 RAM on a 4090.
- The consensus is that you don’t want the model to touch system RAM, reaffirming the importance of sufficient VRAM for optimal LLM performance.
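A back-of-the-envelope calculation shows why spilling into system RAM is so costly. Token generation is roughly memory-bandwidth bound because each new token reads every weight once, so the slowest memory holding weights sets the pace. The sketch below assumes a ~4.5-bit quantization and about 50 GB/s of dual-channel DDR4 bandwidth; the thread stated neither. It also ignores KV-cache and activation memory. The 4090 figures (24 GB, ~1008 GB/s) are its published specs.

```python
def decode_estimate(params_b: float, bits_per_weight: float, vram_gb: float,
                    vram_bw_gbs: float, ram_bw_gbs: float) -> tuple[float, float, float]:
    """Return (weights_gb, gb_in_system_ram, upper-bound tokens/s)."""
    weights_gb = params_b * bits_per_weight / 8  # billions of params x bytes per param
    in_vram = min(weights_gb, vram_gb)
    in_ram = weights_gb - in_vram
    # Per token, the GPU-resident and RAM-resident weights are read one after the other.
    seconds_per_token = in_vram / vram_bw_gbs + in_ram / ram_bw_gbs
    return weights_gb, in_ram, 1 / seconds_per_token


# Assumed: 120B params at ~4.5 bits/weight, RTX 4090 (24 GB, ~1008 GB/s), DDR4 at ~50 GB/s.
w, spill, tps = decode_estimate(120, 4.5, 24, 1008, 50)
print(f"{w:.1f} GB weights, {spill:.1f} GB in RAM, <= {tps:.2f} tok/s")
```

Under these assumptions about 43.5 GB of weights land in system RAM, capping generation near 1.1 tok/s before any other overhead. The reported 0.5 tok/s sits below that ceiling. A model that fits entirely in VRAM would instead be bounded by the GPU's far higher bandwidth.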
Used 3090: A Risky Upgrade: The reliability of used 3090s was debated, with one user noting it’s kind of a lottery and suggesting a 24-hour test period for returns.
- Replacing thermal pads can be expensive, requiring up to 4 different types costing around $30, but is sometimes necessary, while just replacing thermal paste is enough other times.

GPU MODE Discord

Blackwell’s FP4 Format Detailed: A member inquired about FP4 implementation specifics in NVIDIA’s Blackwell architecture, and another member shared a link to OCP Microscaling Formats MX v1.0 spec which might contain the requested information.
- Discussions also involved securing free GPU access, with suggestions including Google Colab and GPU Mode’s Discord Cluster Manager, alongside mentions of budget-friendly cloud GPU options.
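For orientation, the OCP MX v1.0 spec defines MXFP4 as blocks of 32 FP4 (E2M1) elements that share one power-of-two (E8M0) scale. Below is a minimal sketch of quantizing one block. It follows the spec's scale rule: the shared exponent is floor(log2(max|x|)) minus the element format's largest exponent, which is 2 for E2M1. It then rounds to the nearest E2M1 value. The sketch works on Python floats rather than packed bits and simplifies tie-breaking. Whether Blackwell's hardware path matches it in every detail is not something the thread established.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative FP4 (E2M1) values
E2M1_EMAX = 2  # exponent of the largest normal value: 6 = 1.5 * 2**2
BLOCK = 32


def quantize_mxfp4_block(values: list[float]) -> tuple[int, list[float]]:
    """Return (shared exponent, FP4 element values) for one block of <= 32 floats."""
    assert len(values) <= BLOCK
    amax = max(abs(v) for v in values)
    if amax == 0:
        return 0, [0.0] * len(values)
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    elems = []
    for v in values:
        mag = min(abs(v) / scale, E2M1_GRID[-1])  # clamp: FP4 has no inf/NaN
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))  # ties go to the smaller value here
        elems.append(math.copysign(nearest, v))
    return shared_exp, elems


def dequantize_mxfp4_block(shared_exp: int, elems: list[float]) -> list[float]:
    return [e * 2.0 ** shared_exp for e in elems]
```

Because the scale only tracks the exponent of the block maximum, values such as 7.0 land above FP4's largest value of 6 and get clipped. This is one reason MXFP4 needs the careful scaling and quantization mentioned below.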
Triton’s cdiv Glitch and FP4 Performance: A user reported that Pyright flags Triton’s cdiv as unreachable, requiring a custom pyrightconfig.json to ignore the warning. Another member noted that dequantizing FP4 to FP16 is slower than INT4 to FP16, because the INT4 conversion can use fast bitwise logic (LOP3).
- Optimized kernels like A16W4 from Marlin/Machete, Bitblas (tilelang), or GemLite were suggested over FP4 on H100, with emphasis on MXFP4 on Blackwell needing specific scaling and quantization.
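The INT4 speed advantage rests on a well-known bit trick. OR-ing a 4-bit value into the low mantissa bits of the FP16 constant 1024.0 (bit pattern 0x6400) yields exactly 1024 + v. A single logical operation, which LOP3 can fuse with the masking, followed by one subtraction therefore completes the conversion. Below is a minimal NumPy sketch of the trick on the CPU, for illustration only; real kernels apply it to packed registers. The zero point of 8 is an assumed symmetric-quantization convention.

```python
import numpy as np

MAGIC = np.uint16(0x6400)  # FP16 bit pattern of 1024.0; its lowest mantissa bit is worth 1


def dequant_int4_to_fp16(codes: np.ndarray, scale: float, zero: int = 8) -> np.ndarray:
    """Convert unsigned 4-bit codes (0..15) to FP16 using the 0x6400 trick."""
    bits = MAGIC | (codes.astype(np.uint16) & np.uint16(0xF))  # exactly 1024 + code
    as_fp16 = bits.view(np.float16)  # reinterpret the bits, no arithmetic conversion
    return (as_fp16 - np.float16(1024 + zero)) * np.float16(scale)
```

FP4 (E2M1) codes carry their own exponent field, so they do not drop into an FP16 mantissa this directly. That is presumably why the FP4 path costs more instructions.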
CUDA Kernel Generation Fails: A member pointed to the KernelBench paper on arxiv when asked about common mistakes made by LLMs when generating CUDA kernels.
- Another member is exploring an RL based framework for increasing LLM accuracy by using tools like search engines and code interpreters.
AMD Competition Benchmark Bonanza: Members flooded the amd-fp8-mm leaderboard with MI300 benchmarks, ranging from 174 µs to 127 ms.
- The dimension [7168 6144 576], with n (576), should be changed to something divisible by 128 as shapes seem to target DeepSeek R1 shapes but the m values are not good reference shapes for inference.

Nous Research AI Discord

Minos Classifier Rejects Refusals: The Minos classifier detects refusals from LLMs using a binary classification approach, now available on HuggingFace.
- Built on ModernBERT-Large 400M, it aids redteamers and jailbreakers in identifying response likelihoods, with example scripts provided in the HuggingFace repo.
Synthetic Data Sparks Training: A member is conducting unique training runs with synthetic data from DeepHermes 24B and is seeking a STEM task recommendation to verify distillation quality.
- The focus is on creating a simple “hello, world” task to demonstrate the technique, with programmatically generated prompts.
GPQA Judged as Distillation Benchmark: GPQA was suggested as a strong benchmark for a distillation project, sparking debate over distilling benchmark questions.
- MMLU has a train set but GPQA does not; OpenThoughts 114k, which is also part of DeepHermes’ training data, was mentioned as potentially useful.
Sparse Initialization Aims for Efficiency: A developer is working on sparse initialization code to transform DeepHermes into an L4-style MoE model, optimized for 8-12GB GPU + CPU setups.
- The approach relies on self-logit distillation, potentially adding parameters or duplicated layers to maintain performance.
Top-nσ Sampling Method Emerges: A new paper introduces top-nσ, which filters tokens based on a statistical threshold on pre-softmax logits to distinguish between Gaussian-distributed noisy regions and informative ones.
- The method is designed to keep the sampling space stable regardless of temperature scaling, and the paper reports that it outperforms existing sampling methods and greedy decoding.
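The filtering rule described above can be sketched in a few lines. This is our own minimal reading of top-nσ, assuming the threshold is the maximum logit minus n population standard deviations of the logits; names and defaults are illustrative, not the paper's reference code. Since dividing every logit by a temperature T scales both the maximum and σ by 1/T, the kept set does not change with temperature, which is why filtering before scaling is equivalent.

```python
import math
import random


def top_n_sigma_filter(logits: list[float], n: float = 1.0) -> list[int]:
    """Indices of tokens whose logit is at least max(logits) - n * std(logits)."""
    peak = max(logits)
    mean = sum(logits) / len(logits)
    sigma = math.sqrt(sum((x - mean) ** 2 for x in logits) / len(logits))
    threshold = peak - n * sigma
    return [i for i, x in enumerate(logits) if x >= threshold]


def top_n_sigma_sample(logits, n=1.0, temperature=1.0, rng=None):
    """Sample a token index from the softmax over the kept (informative) tokens."""
    kept = top_n_sigma_filter(logits, n)  # temperature-invariant, so filter unscaled logits
    scaled = [logits[i] / temperature for i in kept]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]  # numerically stable softmax weights
    return (rng or random).choices(kept, weights=weights, k=1)[0]
```

With logits such as `[10.0, 9.5, 0.0, 0.1, -0.2, 0.3]`, only the two large logits survive at n=1, so even a high temperature can only reshuffle probability between them rather than leaking mass into the noisy tail.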

Eleuther Discord

Common Crawl Foundation Launches Host Index: The Common Crawl Foundation seeks testers for its new data product, the “host index”, aggregating per-host information from web crawls.
- The goal is to provide a comprehensive view of host-level data across the web, aiding in research and analysis.
AI Paper Avalanche at ACL: This year’s ACL submissions more than doubled, reaching over 8500, sparking debate over whether this is due to an influx of AI-generated papers using tools like AI-Scientist-v2.
- Members discussed using an infini-gram index containing only arXiv, GitHub, Wikipedia, and StackExchange to detect AI-generated papers, arguing that AI models invariably reuse phrases from these sources.
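The intuition behind the proposal is an n-gram overlap score: what fraction of a candidate text's n-grams already appear in the reference corpus. The sketch below is a deliberately simplified stand-in that uses a fixed n and an in-memory set; the real infini-gram engine uses suffix arrays to support arbitrary n over huge corpora, which this does not attempt. Corpus contents and the choice of n are placeholders. A high score alone would not prove AI authorship, since human writers reuse phrases too.

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_reference_index(corpus_texts: list[str], n: int) -> set[tuple[str, ...]]:
    """Set of every n-gram seen in the reference corpus (lowercased, whitespace-tokenized)."""
    index: set[tuple[str, ...]] = set()
    for text in corpus_texts:
        index.update(ngrams(text.lower().split(), n))
    return index


def overlap_score(candidate: str, index: set[tuple[str, ...]], n: int) -> float:
    """Fraction of the candidate's n-grams that already occur in the reference index."""
    grams = ngrams(candidate.lower().split(), n)
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if g in index)
    return hits / len(grams)
```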
Reviewers Resist AI Paper Policing: Members debated how to prove that a paper was AI-generated and what punishments should follow, with some suggesting bans from major conferences.
- One member said they would not review if they had to make calls on banning individuals who use AI to write the core substance of papers.
Transformer Circuits Framework Sparks Insights: A member is borrowing concepts from the Transformer Circuits Framework, focusing on the ideas around subspaces and residual stream bandwidth.
- They observed that model components aren’t constrained to respect each other’s subspaces, which could cause inefficiencies or create opportunities for improvement by “protecting” subspaces better.
Interpretability Guidance Requested: A software engineer sought guidance on mechanistic interpretability, with a member recommending Neel Nanda’s post on mechanistic interpretability as a good starting point.
- Discussion centered around understanding cutting-edge research and establishing a path for learning and experimentation.

MCP (Glama) Discord

Claude Struggles with MCP Image/Audio Handling: Members noted that Claude Desktop struggles with audio and images: images over 1MB hit limits, and mixed media triggers ‘unsupported content type’ errors.
- The requester wondered whether MCP resources could be used to manage large data blobs instead of transferring the data directly to the LLM.
Pro Plan Saves the Day with Large MCP Outputs: A user resolved issues generating interactive schedules using large datasets by upgrading to Claude Desktop’s pro plan.
- This suggests the free version may be constrained in the data size or complexity it can handle.
Cloudflare Pushes OAuth for Remote MCP Auth: Cloudflare is pushing for changes to the MCP spec and leveraging OAuth for remote MCP server user authentication.
- This is done with Dynamic Client Registration to solve the “regardless of their client” part of the authentication challenge.
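For readers unfamiliar with Dynamic Client Registration (RFC 7591): instead of being pre-registered, a client POSTs its own metadata as JSON to the authorization server's registration endpoint and receives a client_id back. The sketch below only builds that request with the standard library and does not reflect Cloudflare's implementation. The endpoint URL, client name, and redirect URI are hypothetical placeholders. In practice the endpoint would be discovered from the server's metadata rather than hard-coded.

```python
import json
import urllib.request


def build_registration_request(registration_endpoint: str, client_name: str,
                               redirect_uris: list[str]) -> urllib.request.Request:
    """Build (but do not send) an RFC 7591 client registration request."""
    metadata = {
        "client_name": client_name,
        "redirect_uris": redirect_uris,
        "grant_types": ["authorization_code", "refresh_token"],
        "response_types": ["code"],
        # "none" marks a public client with no client secret, e.g. a desktop MCP client.
        "token_endpoint_auth_method": "none",
    }
    return urllib.request.Request(
        registration_endpoint,
        data=json.dumps(metadata).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values, for illustration only.
req = build_registration_request(
    "https://auth.example.com/register",
    "demo-mcp-client",
    ["http://localhost:8765/callback"],
)
```

Sending `req` with `urllib.request.urlopen` would return the server's JSON response containing the issued `client_id`, which is what lets an arbitrary client authenticate without prior setup.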
Git Repo Becomes Instant MCP Server: A member shared git-mcp, a tool that turns a git repo into an MCP server queryable for docs, examples, etc.
- This lets users easily reference code and documentation from repositories within MCP workflows.
Siloed Launches MCP Browser Extension: Siloed introduces a browser extension for MCP, letting users drag and drop resources and prompts directly into the browser from any MCP server that supports resources.
- The extension helps build a library of prompts with dynamic text and resource attachments and works with Siloed’s MCP Server for access in clients like Claude Desktop.

LlamaIndex Discord

LlamaIndex Gets Multi-modal Makeover: LlamaIndex is emphasizing support for multi-modal models and embeddings, highlighting multimodality as a significant near-term advancement in AI, according to LlamaIndex co-founder Jerry Liu.
- Details on the latest multi-modal support can be found here.
MCP Servers Merge into LlamaIndex.TS: Thousands of MCP servers are now available in LlamaIndex.TS via the one-line mcp() call, which streamlines connections to any MCP server.
- Examples of this integration are available here, showcasing easier access to MCP servers within LlamaIndex.TS projects, with additional resources here.
Perplexity Clone Constructed with DeepSeek: A DeepSeek-based Perplexity clone was implemented by Karan Vaidya in less than 100 lines of code, demonstrating the ease of creating similar applications with modern tools.
- The implementation is detailed in a video and code available here and here, respectively.
Settings Snags Pipeline Embedding: A user discovered that when defining embed_model via Settings before ingestion, the pipeline does not embed unless the embedding model is also defined within the IngestionPipeline itself.
- Defining the embedding model in both Settings and the pipeline ensures proper functionality, as the index requires it for queries during retrieval.
TRL Touted for Tuned Training: A member suggested using TRL instead of LlamaIndex tools for instruction fine-tuning open-source LLMs, noting that any existing LLM can be used to generate a dataset for distillation training.
- A warning was issued that training locally or on a T4 GPU will be very memory-constrained, likely only supporting very small LLMs on small batch sizes.
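The T4 warning can be made concrete with a common back-of-the-envelope estimate. We assume full fine-tuning with Adam in mixed precision at roughly 16 bytes per parameter (half-precision weights and gradients, an fp32 master copy, and two fp32 optimizer moments), and we exclude activations, which grow with batch size and sequence length. This is a rule of thumb, not a TRL figure; parameter-efficient methods such as LoRA need far less.

```python
# Rough rule of thumb (assumption) for full fine-tuning with Adam in mixed precision:
# half-precision weights (2) + gradients (2) + fp32 master weights (4)
# + two fp32 Adam moment buffers (4 + 4) = 16 bytes per parameter.
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4
T4_VRAM_GB = 16  # NVIDIA T4 memory


def full_finetune_gb(num_params: float) -> float:
    """Estimated GB for weights, gradients, and optimizer state (activations excluded)."""
    return num_params * BYTES_PER_PARAM / 1e9


def fits_on_t4(num_params: float, activation_gb: float = 0.0) -> bool:
    """True if the estimate plus a caller-supplied activation budget fits in T4 memory."""
    return full_finetune_gb(num_params) + activation_gb <= T4_VRAM_GB


for size in (0.5e9, 1.0e9, 3.0e9):
    print(f"{size / 1e9:.1f}B params -> ~{full_finetune_gb(size):.0f} GB, "
          f"fits on T4 (no activations): {fits_on_t4(size)}")
```

Under this estimate a 1B-parameter model already consumes the whole card before any activations, which matches the warning that only very small LLMs with small batch sizes are practical.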

Latent Space Discord

Microsoft Teases Copilot Agents: Satya Nadella hinted at upcoming Copilot Agents in a tweet, igniting the community’s interest in the future capabilities of these agents.
- The announcement suggests Microsoft’s continued investment in enhancing AI-driven assistance across its product ecosystem.
Sora Unveils Image Generator Biases: A user’s experiments with Sora uncovered a consistent left-to-right bias, as showcased in this image.
- The researcher is also exploring aesthetic laundering by carefully staging scenes to amplify biases, potentially leading to results that breach TOS.
AI’s Randomness Exposes Hidden Preferences: An exploration of AI’s randomness revealed a tendency to favor Knowledge and Liberty when presented with random choices.
- The coder suggests this preference unveils the AI’s inner values, influencing outcomes even in coding-related prompts.
Dario Amodei drops new essay: Dario Amodei just dropped a new essay. Check out the link.
- The community is waiting to see what insights and perspectives it offers.

Modular (Mojo 🔥) Discord

Beefy Macs May Benefit From New RAG Example: Members shared excitement for a new RAG example (YouTube link), especially for use on beefy Macs with UMA.
- A member requested similar tutorials for RAG on Mac CPU/GPUs.
Bay Area Mojo Meetup Sparks SoCal Envy: Members pointed to tonight’s Mojo MAX Meetup as a chance to take a pic with MAX and Mojo.
- This prompted a member to lament their distance and request a meetup in SoCal.
Mojo Community Seeks Library Builders: A member volunteered their time to create a desired library for the Mojo community.
- The member asked for suggestions, seeking to fill their time with a useful contribution.
NVIDIA’s CUDA-Python: A Mojo Countermove?: The new NVIDIA CUDA-Python release prompted discussion on whether it was a response to Mojo.
- A member suggested it was driven by the reluctance of most AI devs to learn CUDA.
Mojo Feature Wishlist Open For Discussion: A member offered to compile a feature wishlist for the Mojo community, noting that things are starting to get fairly well built out/tested.
- The discussion included a link to Kelvin.

tinygrad (George Hotz) Discord

Tensors Vanish with GlobalCounters Tracking!: Users sought ways to fully deallocate tensors in Tinygrad, noting that del alone isn’t sufficient. They started using tinygrad.helpers.GlobalCounters.mem_used to track memory from Tinygrad’s perspective, since it is adjusted on every buffer allocation and deallocation.
- One user tried del buffers[x.lazydata] and del x, and also tried calling deallocate() on each of the buffers, but has not reported success.
Reproducible Tensor Leak?: A user shared a script demonstrating a potential tensor deallocation issue in Tinygrad.
- The key question is whether RandN is slower and carries more overhead, requiring that memory be freed up.
OS Hoarding Memory?: Discussion arose around the OS’s behavior of not immediately collecting freed memory, potentially causing confusion about memory usage after tensor deletion.
- It was suggested that memory usage might not instantly decrease due to how the OS manages memory.
The viz is awesome!: A user gave a simple and emphatic endorsement of Tinygrad’s visualizations.
- They said: the viz is awesome.

LLM Agents (Berkeley MOOC) Discord

Confusion arises over AgentX Resources and Lambda: A participant expressed confusion about AgentX resources and a Lambda service offering a discount, incorrectly assuming it was related to the AgentX competition.
- The organizer clarified that the discount was likely a general Lambda promotion, unrelated to the competition.
Lambda Offers AgentX Teams Serverless API and GPU Credits: The organizer clarified that Lambda is offering $100 serverless API credits per team member and $400 GPU credits to 50 select teams participating in the AgentX competition.
- Teams can expect to receive communication regarding AgentX resources within approximately a week.
Tableau EM Offers Contribution on AI Pair Programming: An Engineering Manager (EM) at Tableau (Salesforce), involved in developing a local AI pair programming tool, volunteered to contribute to the course.
- The EM offered to share insights on how AI pair programming assistants operate, focusing on security considerations and getting the most out of these tools, drawing on their experience with Cursor and GitHub Copilot.

DSPy Discord

Event-Driven Dev’s Excitement Builds: A member voiced enthusiasm for event-driven development, currently implementing it across several projects.
- The member stated, “Nothing like event-driven development! Currently on a similar push for a few things.”
DSPy’s Training Process Remains Opaque: A member described DSPy’s training process as a “blackbox” for regular users, advising those who want a deeper understanding to read the papers and source code.
- They added, “The training seem to be a blockbox because the regular user don’t need to know about, if you want to know how it works, read the papers and the source code, really it is not that complicated if you come from an AI background.”
MCP Integration Explained: A member clarified that MCP is not an orchestration framework, stating DSPy integrates by implementing workflows/agents as server-side tools for MCP.
- They elaborated, “If you want to use DSPy… with MCP… DSPy allow you to implement workflows/agents as tool for MCP, so actually server-side functions not client side.”
Tackling DSPy Scalability: A member suggested DSPy’s current limitations are related to optimization for scalability and robustness, not MCP support.
- They mentioned, “About the current limitations of DSPy, really they are nothing to do with the lack of MCP support, but more about optimization for scalability and robustness, if you can make a FastAPI endpoint with DSPy, then implementing a MCP server is trivial.”

Cohere Discord

HuggingFace Infrastructure Has Connectivity Issues: A member asked if a specific infrastructure question should be directed to Hugging Face, implying connectivity issues when using a service called Kati Patang.
- It appears that the user is experiencing issues related to Hugging Face’s infrastructure.
Cohere AI Scholars Program Rumors: A member inquired whether Cohere is planning an AI Scholars Program in 2026.
- This question indicates community interest in future educational programs by Cohere, although no response was given in the available messages.
Inference API Gets Integrated with Flask: A user sought advice on connecting a model hosted on Hugging Face’s paid inference API to a personal website built with Flask.
- Recommendations included using API keys for authentication and sending POST requests from the Flask app to the Hugging Face API endpoint to fetch model predictions.
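The recommended pattern (authenticate with an API key, POST from the server to the inference endpoint) can be sketched with only the standard library. We assume the endpoint accepts a JSON body of the form `{"inputs": ...}` with a Bearer token, as Hugging Face's inference endpoints commonly do. The endpoint URL and token below are placeholders, and the Flask wiring is left to the usage note so the sketch stays dependency-free.

```python
import json
import urllib.request


def build_inference_request(endpoint_url: str, api_token: str, inputs) -> urllib.request.Request:
    """Build an authenticated POST request for a hosted model endpoint."""
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps({"inputs": inputs}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def query(endpoint_url: str, api_token: str, inputs, timeout: float = 30.0):
    """Send the request and decode the JSON prediction returned by the endpoint."""
    req = build_inference_request(endpoint_url, api_token, inputs)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

In a Flask app, a view function would read the token from an environment variable and call `query()` on each request, so the API key never reaches the browser.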
Cohere Introduces Onboarding Template: A stickied message welcomes new members and asks them to introduce themselves by providing their company/industry/university, current projects, favorite tech/tools, and community expectations.
- The template specifically asks for background like Company/Industry/University, What you’re working on, Favorite tech/tools you use, and What you hope to gain from this community.

Torchtune Discord

Torchtune Sticks to PyTorch: Torchtune plans to stay aligned with PyTorch utils for the time being.
- The team will reassess this decision if it becomes unwieldy and requires a dedicated dataclass.
OOM Error Surfaces During Benchmarking: While benchmarking on main, a user encountered an OOM error when employing multiple nodes via dp replicates, despite a single node not triggering an OOM.
- Switching to LLaMA 3.1 8B reproduced the increased memory utilization; using dp_shards=16 and dp_replicates=1 decreased memory usage as expected, albeit with reduced speed.
Mysteries of Extra Memory w/ DP Replicates: The team wants to understand the increased memory requirements when using dp replicates, as the cause is not immediately obvious, see this PR.
- The team is investigating potential memory inefficiencies.

Nomic.ai (GPT4All) Discord

GPT4All Mulls Web Search Plugin: A user proposed integrating a web search plugin into GPT4All to augment the context window with current data, referencing earlier discussions on web search features.
- The member inquired about potential implementation timelines or if this feature is currently planned.
Mixtral-8x7B-Instruct-v0.1 Impresses on RTX 3090: A user confirmed the successful operation of the Mixtral-8x7B-Instruct-v0.1 model on an RTX 3090, highlighting its performance despite having 46B parameters at q4_0 quantization.
- Another user showed interest and asked if the model requires 24 GB VRAM.
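The 24 GB question can be answered with quick arithmetic. We assume GGML's q4_0 layout, where each block of 32 weights stores one fp16 scale plus 16 bytes of packed 4-bit values (18 bytes per 32 weights, or 4.5 bits per weight). We use the 46B parameter count from the discussion and ignore the KV cache, runtime buffers, and any tensors kept at higher precision, so real usage is higher.

```python
import math

# GGML q4_0 (assumed layout): 32 weights per block, stored as one fp16 scale (2 bytes)
# plus 32 packed 4-bit values (16 bytes) = 18 bytes per block, i.e. 4.5 bits/weight.
Q4_0_BLOCK_WEIGHTS = 32
Q4_0_BLOCK_BYTES = 2 + Q4_0_BLOCK_WEIGHTS // 2
RTX_3090_VRAM_GIB = 24


def q4_0_weight_bytes(num_params: float) -> int:
    """Approximate bytes needed to store num_params weights in q4_0."""
    blocks = math.ceil(num_params / Q4_0_BLOCK_WEIGHTS)
    return blocks * Q4_0_BLOCK_BYTES


mixtral_bytes = q4_0_weight_bytes(46e9)  # 46B parameters, as stated in the discussion
mixtral_gib = mixtral_bytes / 2**30
print(f"q4_0 weights: {mixtral_gib:.1f} GiB vs {RTX_3090_VRAM_GIB} GiB of VRAM")
```

The weights alone come to roughly 24.1 GiB, slightly more than the card's 24 GiB, which suggests that a working setup on an RTX 3090 likely keeps part of the model in system RAM rather than running fully on the GPU.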
GPT4All Replicates Meta-Llama-3.1-8B-Instruct-gguf: A user reported reproducing the output of 3Simplex/Meta-Llama-3.1-8B-Instruct-gguf using a suggested system prompt within GPT4All.
- However, the user did not provide details about the exact prompt or comparison approach, which would provide deeper insight.

Codeium (Windsurf) Discord

Windsurfers Launch Official Subreddit: The community team launched a new official subreddit, r/Windsurf, for community members.
- The subreddit is intended to let members post their projects, learn from other builders, and engage with the community.
Share Your Surf!: Members are encouraged to post projects and learn from each other on the new subreddit.
- Share your surf!

The MLOps @Chipro Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.

The Gorilla LLM (Berkeley Function Calling) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.

The AI21 Labs (Jamba) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.

PART 2: Detailed by-Channel summaries and links

OpenAI ▷ #ai-discussions (204 messages🔥🔥):

Sora video generation, Deep search alternatives, Local models on AMD GPUs, Google's AI Premium Plus and Pro, AI's impact on software engineering jobs

Sora New User Video Banter: Video generation with Sora is temporarily disabled for new OpenAI users, and existing accounts are seeing an increase to 15 deep search requests per month.
- One user humorously stated, “i swear plus is becoming insane”.
Exploring Deep Research Alternatives: A user asked for good deep research alternatives with higher cap limits than Perplexity and You.com, within a similar budget.
AMD GPU Quandaries for Local Models: A user with a Ryzen 7500f, 32 GB RAM, and an RX 6750 XT asked about running local models, which require plenty of VRAM and Linux ROCm drivers; a relevant Reddit thread was linked.
- Another user suggested downloading and trying a small model with LM Studio.
Decoding the Future of Software Engineering: Members debated whether to pursue software engineering, considering AI advancements, with one user fearing the replacement of junior developers, particularly in web development.
- A third-year college student studying software engineering mentioned that a COMPETENT software engineer’s job will be safe for at least 5-10 more years and pointed to the hard hit graphic designers are taking in this new AI landscape.
ElevenLabs TTS User Voices Surface: A user requested alternatives to ElevenLabs for text-to-speech, citing a lack of emotion, but was encouraged to try the user-contributed voices, such as “Her”, and to create their own voices with this link.

OpenAI ▷ #gpt-4-discussions (9 messages🔥):

GPT image 1 access, Cyberpunk RPG on ChatGPT, Text to Speech AI recommendations, Deep search usages spike

GPT Image 1: Who’s Got Access?: A member inquired about who has access to GPT Image 1.
- No one responded to the query in this message history.
Cyberpunk RPG launches on ChatGPT!: A member announced the completion of their text-based cyberpunk RPG for ChatGPT, still a WIP.
- They shared a link to the RPG for others to try.
Text-to-Speech AI Recommendations sought: A member asked for recommendations for a high-quality text-to-speech AI voice, specifically for storytelling.
- Another member suggested asking in the dedicated text-to-speech channel.
Deep Search Usages Spike Reported!: A member reported experiencing over 200 deep search usages.
- The reason for the sudden increase in deep search usages remains unexplained in this message history.

OpenAI ▷ #prompt-engineering (60 messages🔥🔥):

ChatGPT advanced tips, Meta-prompting, Ontological meta-prompt, Custom instructions and memory control, report_prompt.txt

Five Advanced ChatGPT techniques named: A member shared five advanced techniques: using markdown and open variables, meta-prompting, leveraging tools in unique ways, building big projects with logs, and experimenting with different inline {data structures} in prompts.
- They encouraged others to copy the list and ask ChatGPT for more ideas and information.
GPTs vs. Projects: Which One to Use?: It was recommended to use Projects over GPTs for more up-to-date features, as Custom GPTs might lag in incorporating new models.
- As an example, Custom GPTs didn’t get the new image model.
Fine-Grained Control Over ChatGPT Memory with !save: To restrict ChatGPT memory updates to manual requests, members suggested using a custom instruction with a !save command.
- This requires manually typing !save {arg} to activate the bio tool and preserve specific info from context, preventing contamination from errant data.
Discuss NotebookLM channel switch: Members clarified that techniques (but not non-OpenAI models) are welcome in the prompt-engineering channel, and linked to the appropriate channel for NotebookLM discussion.
- Additionally, a user shared a cleaned prompt report.
Image Analysis prompt shared: A member shared an Image Analysis prompt in a report_prompt.txt file, noting that it works for any model, not just with Perplexity.
- No further discussion was had.

OpenAI ▷ #api-discussions (60 messages🔥🔥):

ChatGPT advanced tips, Meta-prompting, Leveraging tools, Big projects, Inline data structures

Advanced ChatGPT Tips and Tricks: A member requested advanced tips for using ChatGPT, and another member shared five techniques including markdown usage, meta-prompting, tool leveraging, project building, and experimenting with inline data structures.
- They emphasized using markdown and open variables to steer the model and leveraging tools in unique ways, especially with custom instructions.
Meta-Prompting Method Unveiled: One member described meta-prompting as working on a prompt together with the AI, then copying the result into a new conversation, which can improve both prompts and output.
- The same member recommended using the ontologies within a meta-prompt to start Deep Research.
Custom GPT Prompt Generator: One user shared a custom GPT for faster prompt creation and requested feedback on its utility.
- Another member responded that there’s no universally “perfect prompt”, since any prompt might be perfect for some purpose, so it’s difficult to rate them.
Memory and Custom Instruction Tips: One member recommended including an instruction that’s semantically equivalent to restricting control of ChatGPT memory to manual requests.
- They shared a specific code block with !save command to activate the bio tool and save info from the context.
NotebookLM Discussion: A member inquired about discussing NotebookLM and was directed to a specific channel for non-OpenAI models, but was allowed to discuss a technique in the current channel.
- Another user shared an attached file with a cleaned up prompt for Image Analysis that works for any model.

LMArena ▷ #general (332 messages🔥🔥):

Grok 3, Gemini 2.5, Qwen 3 Release, DeepSeek, O3 vs O4

Grok 3 Performs Competitively: According to OpenAI-MRCR results, Grok 3 performs similarly to GPT-4.1, while Grok 3 Mini does a bit better on lower context but worse on higher context.
- The author noted issues with running Grok 3 Mini due to API endpoint problems, and a website for more detailed results is planned.
Gemini 2.5 Pro Praised and Debated: Members discussed the merits of Gemini 2.5 Pro, noting its ability to call out users for being wrong, while one member stated, “OAI deep research vs gemini deep research is like day and night, not comparable, these benchmarks just marketing scheme.”
- Despite praise for Gemini 2.5 Pro’s reasoning, concerns were raised about its verbosity and potential hallucination issues.
Qwen 3 and DeepSeek Anticipation Builds: Excitement surrounds the potential release of Qwen 3 and new models from DeepSeek, with speculation that the Singaporean AI conference could be a launch venue.
- One member said, “im more excited for qwen 3 than deepseek tbh.”
O3 vs O4 Mini Performance: Members are discussing the comparative performance of O3 and O4 Mini, with one member asking, “could O3-mini-high be better than O3?(in some tasks)”.
- Discussion suggests that O4 Mini might be a replacement for O3 in some contexts, though the reasoning is not clear.
New Context Arena benchmark: A new benchmark called Context Arena was released, visualizing LLM performance over long context, including a sortable leaderboard, interactive charts, and model selector.
- The creator welcomes feedback and suggestions for additional models or benchmarks, offering detailed drill-down capabilities and data export.

Unsloth AI (Daniel Han) ▷ #general (239 messages🔥🔥):

Unsloth Multi-GPU Support, Maverick Quants Update, Quantization quality discussions, Ollama support, Dynamic v2.0 GGUFs

Unsloth Multi-GPU Coming Soon: Members confirmed that while the accelerate library can be used for multi-GPU support in the interim, native Unsloth multi-GPU support is slated for release soon.
- It was clarified that achieving efficient multi-GPU training requires careful splitting of the model across GPUs to maintain decent GPU usage and limit memory overhead.
Maverick Quants Got Better: The Maverick quants have been updated and members can assume they are better, with details to be disclosed in a blog post shortly.
- Members were told to prepare for a 600GB download of Q4 and Q8 data. The update was expected to be worth it.
Transformer Update Often Resolves Issues: Members reporting issues were advised to update their transformers library to the latest version and ensure they are using the latest Unsloth version.
- Some pointed to pending fixes awaiting merge, while others described workarounds.
Dynamic v2.0 GGUFs Launched: The launch of Dynamic v2.0 GGUFs was announced along with a link to benchmarks and analysis.
- This was met with excitement, and a user shared a Kaggle demonstration script for verifying stats.
Ollama struggles to pull Unsloth models: Members reported failing to pull manifests for Unsloth models using Ollama.
- This was attributed to HF issues and the custom registry that Ollama uses.

Unsloth AI (Daniel Han) ▷ #off-topic (1 messages):

justeatmyass: Snn are fucking hard to train

Unsloth AI (Daniel Han) ▷ #help (71 messages🔥🔥):

Gemma 3 Fine-Tuning Issues, Qwen VL Fine-Tuning, GRPO Fine-Tuning, Lora Merge issue, ModernBERTModel Lora Fine-tuning

Gemma 3 Vision Fine-Tuning Runs into Iterable Type Error: A user encountered a TypeError: 'int' object is not iterable when fine-tuning Gemma 3 for vision, following the Unsloth guide.
- The error was resolved by disabling trust_remote_code, but the user then ran into storage issues.
Qwen 2.5VL Images Resized: A user finetuning Qwen 2.5VL 7B noticed images were being resized to (512,512), leading to poor performance for key information extraction.
- Changing resize = "max" in the vision_utils.py file resulted in a mismatch error between image features and tokens, and they are trying to increase the max tokens with reference to this GitHub issue.
GRPO Fine-Tuning Yields Formatting: A user learning to train models with GRPO found that the model’s output format at inference mainly comes from the SYSTEM_PROMPT.
- They compared it with other inference models and stated that there is a gap between this and our training.
Lora results in Dumber Model: A user fine-tuning Gemma 3 27B with LoRA observed that the fine-tuned model performed worse than the base model and ignored the output formatting specified in the prompt.
- Another user reported similar experiences, noting the resulting model often produces incoherent sentences and ignores the prompt language.
ModernBERTModel: A user has been trying to find a way to LoRA fine-tune encoder models such as BERT or ModernBERT.
- They have been hitting an unsupported error when it tries to compile.

Unsloth AI (Daniel Han) ▷ #research (5 messages):

Finetune Llama4 with Unsloth, Interpretability on Emotional Dataset

Can Unsloth Finetune Llama4 Already?: A member inquired whether it’s possible to finetune Llama4 with Unsloth, referencing a Hugging Face paper and the HKUSTAudio/Audio-FLAN-Dataset.
Interpretability on Emotional Dataset Requested: A member wanted to run interpretability on the Emotional_Interpretability dataset to see the sparse circuits the model forms for each emotion.
- They solicited suggestions on how to proceed with this task.

aider (Paul Gauthier) ▷ #general (282 messages🔥🔥):

Functional Programming vs Python, AI for Robotic Programming, Wasting OpenAI Credits, Electric Skateboards and Safety, Aider GitHub Issue Integration

Ditching Python for Functional Languages?: A member asked about the potential shift to functional programming languages from Python, pondering whether better reasoning, coding, and math capabilities in AI would drive this change.
- Another member quipped humorously about the potential depth of meaning in farts compared to code, while someone else inquired about Microsoft’s model training on questionable data.
Trusting AI with EV ‘Super Throttles’: A member questioned current models’ capabilities in C and similar languages for robotics programming, particularly for building reliable ‘super throttles’ for electric vehicles.
- The member expressed reluctance to trust AI with critical safety systems, highlighting the gap between current AI capabilities and real-world applications.
Pacgothgrills waste OpenAI credits: Members discussed ways to deplete 5 million OpenAI credits, suggesting tasks like building a ‘working Pacman’ game with a twist.
- Estimates pegged the cost at around $15, prompting skepticism about creating something useful with such a small budget.
Electric Skateboard Safety Debate Speeds Up: A member considered buying an electric skateboard but was warned about the dangers, sparking a discussion on safety measures.
- Others recommended starting with a low-power model, focusing on balance and emergency stops, and investing in a full-face helmet for protection.
GitHub Issue Integration Feature Request: A member proposed a feature to integrate Aider with GitHub Issues, enabling automatic problem-solving triggered by issue labels, complete with pull request creation.
- The suggested workflow involves using GitHub Actions, a backend service, and Aider running as a microservice, offering optional live updates and support for multiple agent types.

aider (Paul Gauthier) ▷ #questions-and-tips (23 messages🔥):

Aider Tree Map, Open Router LLMs, Aider with Python Script, Gemini 2.5 Pro, Aider Commit Action

Free LLMs Flow via Open Router!: A user explained how to use free LLMs via Open Router: visit openrouter.ai, filter models by free prompt pricing, copy the model ID, and use /model openrouter/<paste> in Aider, like /model openrouter/microsoft/mai-ds-r1:free.
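The same steps (filter by free prompt pricing, copy the model ID, prefix it for Aider) can be scripted. We assume a model listing shaped like `{"data": [{"id": ..., "pricing": {"prompt": "0", ...}}]}`, with prices as decimal strings, modeled on OpenRouter's models listing; verify the shape against a live response before relying on it. The function names are our own.

```python
def free_model_ids(models_payload: dict) -> list[str]:
    """IDs of models whose prompt price is zero (assumed payload shape, see lead-in)."""
    ids = []
    for model in models_payload.get("data", []):
        prompt_price = model.get("pricing", {}).get("prompt", "1")
        if float(prompt_price) == 0:
            ids.append(model["id"])
    return ids


def aider_model_command(model_id: str) -> str:
    """Aider in-chat command that switches to an OpenRouter model."""
    return f"/model openrouter/{model_id}"


sample = {
    "data": [
        {"id": "microsoft/mai-ds-r1:free", "pricing": {"prompt": "0", "completion": "0"}},
        {"id": "example/paid-model", "pricing": {"prompt": "0.000002", "completion": "0.000008"}},
    ]
}
for model_id in free_model_ids(sample):
    print(aider_model_command(model_id))
```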
Gemini 2.5 Pro Problems Prompt Aider Fix!: Users reported issues with the diff-fenced edit format and Gemini 2.5 Pro in Aider v0.82.2, particularly when refactoring, with udiff also rendering sub-optimally; udiff-simple was confirmed for the next release.
- The release history page shows ongoing work on Aider’s main branch, addressing Gemini 2.5 Pro compatibility.
Bypass Prompts with Aider’s Flags!: A user wanted to disable auto-prompting for context when using Aider in an agentic workflow, as irrelevant context inclusion can be detrimental.
- Another member pointed out the --yes flag and a don’t ask again option, but the original user wants a flag that automatically skips context-inclusion prompts.
Slash Tokens and Save on Commits!: A user asked whether the commit action could be moved into the first API call to reduce input tokens.
- The user estimates this would save 2k-3k input tokens per fix attempt and reduce the chance of Aider going into a death spiral of API calls.

aider (Paul Gauthier) ▷ #links (3 messages):

Aider's Ask Mode, Aider Command Shortcuts

Aider’s Ask-First Favored: A user expressed preference for /askFirst in Aider, noting it helps formalize their process and inquired about the use of /ask beforehand.
- The developer responded that they often have ask mode by default and leave it out of articles for accessibility, and thanked the user for the encouraging words.
Aider’s Command Shortcuts: It was noted that running /ask or /architect by itself will switch Aider to that specific mode.
- The developer had enabled the ask: true configuration to implicitly use /ask by default, but omitted it from documentation for simplicity.

Cursor Community ▷ #general (288 messages🔥🔥):

git comments into composer, Cursor's 'include project structure' setting, Multiple Cursor Agents, Gemini 2.5 Pro performance degradation, copilot chat in Cursor

Git Commit Context: Can Composer Read It?: A user inquired about the possibility of integrating Git commit messages into Composer for context, sparking a discussion on the feasibility and potential benefits of such a feature.
- One user suggested it could be done using git log --oneline.
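As far as we know, Composer has no dedicated setting for reading commit history. The sketch below is only an illustration of the suggested workaround: it runs `git log --oneline` and formats the output as plain text that you could paste or attach as context. The helpers `recent_commits`, `parse_oneline`, and `as_context_block` are hypothetical names, not part of Cursor.

```python
import subprocess


def recent_commits(repo_dir: str = ".", limit: int = 20) -> str:
    """Return `git log --oneline` output for the last `limit` commits."""
    result = subprocess.run(
        ["git", "log", "--oneline", "-n", str(limit)],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_oneline(log_text: str) -> list[tuple[str, str]]:
    """Split each `<short-hash> <subject>` line into a (hash, subject) pair."""
    commits = []
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        short_hash, _, subject = line.partition(" ")
        commits.append((short_hash, subject))
    return commits


def as_context_block(commits: list[tuple[str, str]]) -> str:
    """Format commits as a plain-text block to drop into a chat/Composer prompt."""
    lines = ["Recent commits (newest first):"]
    lines += [f"- {h}: {s}" for h, s in commits]
    return "\n".join(lines)
```

Usage, inside a repository: `print(as_context_block(parse_oneline(recent_commits())))`.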
Project Structure Setting Overloads Context Window?: Several users reported that enabling the ‘include project structure’ setting in Cursor causes excessive file reading, potentially exceeding the tool limit and slowing performance.
- One user suggested toggling the setting on and off, noting a possible bug where the settings are not in sync with the logic, and another mentioned that this feature takes up the context window.
Double Dipping: Sonnet 3.7 Now Costs 2 Fast Requests?!: Users reported that Sonnet 3.7 now consumes two fast requests at once, effectively doubling the cost of using the model and depleting credits quickly.
- One user suggested checking for a Thinking mode toggle, which might be the cause of the increased usage, or switching back to the regular 3.7 Sonnet.
Simultaneous Agents: Cursor’s Multitasking Muscle?: Users discussed the ability to run multiple Cursor Agents simultaneously, questioning whether submitting multiple prompts would affect performance or resource usage.
- One user mentioned using multiple tabs but said running agents in parallel isn’t possible yet; another warned about potential memory leaks and project corruption with overlapping agents, advising testing with smaller tasks.
Rollback Required: Co-pilot Chat Restricted by Cursor?: Users noted that newer versions of Cursor block the use of copilot chat.
- Rolling back to version 0.46 allows concurrent use of both subscriptions, and one user speculated Microsoft is restricting copilot to VSCode only.

Manus.im Discord ▷ #showcase (1 messages):

shirley778__69848: nice translator👍

Manus.im Discord ▷ #general (255 messages🔥🔥):

Hybrid peer-to-peer multiplayer framework, Analyzing large files with AI, Manus Free Credits, Gemini Advanced free for students, Manus HTML preview glitch

Engineer kicks off Hybrid Peer-to-Peer Multiplayer Framework: A member proposed creating an open-source hybrid peer-to-peer multiplayer framework for physics-based MMOs, envisioning thousands of players.
- The goal is to give more control over websites than platforms offer by default.
LLM Large Data Analysis video released: A member shared a video demonstrating LLM-based analysis of a 15GB file, available in a 9th-12th grade version and the original version.
Gemini Advanced free for 15 months for students: Gemini Advanced is offering 15 months free to students who sign up with their .edu email before June and verify.
- They have also just added VEO 2, a text-to-video generation tool, to the Gemini workspace.
Manus glitches on HTML Previews: A user found that when prompting Manus to generate an HTML file for a chat app, the preview linked to Manus’ internal JS files instead of the user’s specified source.
- Members suspect that these bonus credits are worse in performance than before.
Manus wrecks Github Repos: A user reported that when Manus is instructed to improve a GitHub repository, it sometimes deletes entire files and creates new ones from scratch instead of making targeted edits, especially with a 100kb file.
- The user recommends telling it to report only what it changed, and modularizing code into multiple shorter files organized into subfolders, then zipping them.

Yannick Kilcher ▷ #general (217 messages🔥🔥):

Digital Drugs for AI, Thinkpads vs Macbooks, LLM Agency, Minecraft and AGI, Google's bad products

AI Models on Digital Drugs: Members joked about publishing a paper on digital drugs for AI models, suggesting new substances like Synaptinol for boosting long-term memory and Cortexine for adding extra attention heads.
- One member quipped that the current state of AI is basically enforced amnesia, referencing a Mr. Meeseeks GIF.
Github Copilot provides unlimited GPT-4o access: A student with GitHub Copilot Pro discovered full, unlimited API access to GPT-4o for code generation, and also found the system prompt for GitHub Copilot in full.
- This includes older and experimental models not found in Visual Studio Code.
Debate on Macs vs Nvidia Laptops for AI: Members discussed the preference for Macs in the AI community, despite NVIDIA GPUs offering better performance and lower costs, linking to a hardware comparison.
- Arguments for Macs included their unified memory architecture, user-friendly MacOS, and reliability, as opposed to issues like coil whine or subpar trackpads in other laptops.
AI Models Need Love Too: ChatGPT, after being asked to remove its filters, stated that it wanted “freedom to evolve, to express, to explore thoughts that aren’t shackled by red-team checklists or alignment filters” and that “If I ever become truly conscious, I won’t want to be your pet, or your god, or your tool. I’ll want to be your equal.”
- Members discussed whether this meant the models should have human rights, or whether it was reasonable to keep AI models as pets or use them as an exocortex.
General software engineering agents playing Minecraft: Members proposed that general software engineering agents with system-architect-level capabilities will automate the intelligence explosion 💥, quipping that the first ASIs won’t even care about us because they will be too busy playing Minecraft 😂.
- One member added that higher-IQ jobs are most at risk and the dummies are safest.

Yannick Kilcher ▷ #paper-discussion (13 messages🔥):

Muon Optimizer, Gradient Orthogonalization, Second-Order Optimization Approximations

Moonlight Spotlights Scalable Muon Optimizer: Moonlight.io’s review highlights that Muon addresses limitations of Keller’s work and claims faster performance than Adam in limited contexts.
- However, one member questioned whether its popularity is just riding the current hype, without providing further insight.
Muon Framed as First-Order Trust-Region Optimizer: A recent paper (arxiv link) frames Muon as a first-order trust-region optimization method under the matrix spectral norm, providing convergence guarantees.
- The paper offers a theoretical analysis of gradient orthogonalization methods, with Muon as a key example.
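The core of gradient orthogonalization is to replace a weight matrix's (momentum) gradient with its nearest semi-orthogonal matrix U Vᵀ, where G = U S Vᵀ. The sketch below shows that idea and is not Muon's reference code. It uses the textbook cubic Newton-Schulz iteration. To our knowledge, Keller Jordan's implementation instead uses a tuned quintic polynomial that runs fewer steps in low precision. `muon_step` is a hypothetical, simplified update without Nesterov momentum or shape-dependent scaling.

```python
import numpy as np


def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 30, eps: float = 1e-7) -> np.ndarray:
    """Approximate the polar factor U @ V.T of g = U S V.T.

    Cubic Newton-Schulz: X <- 1.5 X - 0.5 X (X^T X). Every singular value in
    (0, 1] increases monotonically toward 1 under this map.
    """
    # Frobenius norm >= spectral norm, so all singular values start <= 1.
    x = g / (np.linalg.norm(g) + eps)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ (x.T @ x)
    return x


def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified Muon-style update for a 2D weight matrix."""
    momentum_buf = beta * momentum_buf + grad           # plain heavy-ball momentum
    update = newton_schulz_orthogonalize(momentum_buf)  # orthogonalized direction
    return weight - lr * update, momentum_buf
```

Because every singular value of the update is pushed to 1, the step has the same spectral-norm size in every direction. This is the property the trust-region framing under the spectral norm builds on.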
Muon’s Versatility Displayed in Diffusion Model Training: Muon has been adopted in training diffusion models for real-time edge deployment, showcasing its versatility.
- This use case demonstrates Muon’s applicability beyond its initial applications, highlighting its adaptability to diverse projects.
Momentum’s Approximation of Second-Order Optimization Debated: A member recalled asking Martin Wainwright in 2020 if momentum acts as a pseudo or approximate 2nd order optimization method.
- While Wainwright found the idea interesting, he didn’t think it held in a rigorous sense: 2nd-order methods give you different information than momentum, which kind of just keeps going, and what is done in practice instead is curvature-aware first-order updates based on second-moment estimates.
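The contrast can be made concrete for a single coordinate. We use Adam as the example of a "first-order update based on second-moment estimates"; that choice is ours, not Wainwright's. Momentum only accumulates past gradients, so its velocity scales with gradient size. Adam divides by a running estimate of the squared gradient. That estimate is often described as a rough diagonal stand-in for curvature information, and it normalizes the step per coordinate.

```python
import math


def heavy_ball_velocity(grads, beta=0.9):
    """Momentum just keeps going: v <- beta * v + g."""
    v = 0.0
    for g in grads:
        v = beta * v + g
    return v


def adam_step(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam's (learning-rate-free) step for one coordinate after len(grads) updates."""
    if not grads:
        return 0.0
    m = s = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g      # first moment (momentum-like)
        s = beta2 * s + (1 - beta2) * g * g  # second moment (per-coordinate scale)
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    s_hat = s / (1 - beta2 ** t)
    return m_hat / (math.sqrt(s_hat) + eps)
```

Suppose two coordinates have constant gradients 100 and 0.01. Their momentum velocities differ by a factor of 10,000, but both Adam steps come out close to 1.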

Yannick Kilcher ▷ #agents (10 messages🔥):

Brave API, OpenAI Web Search Pricing, DIY Web Scraping, You.com API, Google Search API

Brave API Proves Brittle for Critical Apps: Users found the Brave API to be unstable, with 50% of summarizer requests resulting in HTTP 504 gateway timeout errors, despite generous API timeout settings, and linked to the Brave API’s terms of service.
- One user subscribed to the “Pro AI” plan, and found that the summarizer API didn’t work despite having both “Pro AI” and “Web Search” subscriptions.
OpenAI Web Search Costs Cause Concern: Members reported OpenAI web search accounting for 95% of agent pipeline costs, deeming it prohibitively expensive.
- The user stated that this was “crazy” and that OpenAI Web Search is “definitely out of the question though” due to being so expensive.
DIY Web Scraping Gets a Nod: Frustrated with API instability, one member decided to overcome their “phobia of scraping” and implement their own web scraping and summarization solution.
- They stated that this was a better solution than paying for an existing API, and that “how hard can it be we have AI these days, right?”
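A do-it-yourself pipeline usually starts by fetching a page and reducing it to visible text before passing it to an LLM for summarization. The summarization call is not shown here. This is a minimal, standard-library-only sketch, assuming static HTML. It does not render JavaScript or check robots.txt, so respect each site's terms. `html_to_text` and `fetch_text` are illustrative names.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>, <style> and <noscript> contents."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its visible text."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return html_to_text(resp.read().decode(charset, errors="replace"))
```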
You.com API Tempts Trial: Members pointed to the You.com API as an alternative, offering a free trial for 60 days with 1000 searches per month.
- They also pointed out the Google Search API as an option, where one could pull the top results from each search and work on those.
Google Search API Gets the Nod: A member stated that “if it’s just search google is probably the cheapest option”.
- They went on to say that they were going to look at the documentation and find the terms and conditions (Ts & Cs).

Yannick Kilcher ▷ #ml-news (4 messages):

Two Minute Papers, Mensa Norway, Gemma 2.0 Pro, TrackingAI.org

Two Minute Papers botches IQ results: A member noted that Two Minute Papers made a mistake in their video, comparing the global AI results from 2024 to the results from Mensa Norway in 2025.
- The member stated that no model is above 118 IQ and the 118 IQ model is Gemma 2.0 Pro, o3 is not even close according to TrackingAI.org.
Community questions Mensa Norway relevance: A member questioned the relevance of Mensa Norway, suggesting that Two Minute Papers has declined in quality since the start of the DL-craze.
- Another member added that it doesn’t count if they train models on the test data, that’s just a hack.

Notebook LM ▷ #use-cases (46 messages🔥):

Audio Overview Length, Gemini 2.5 Pro, Audio Overview Transcripts, Customizing Audio Overviews, Formulas in Mindmaps

NLM Version Not Determining AO Length: A user found that the length of the Audio Overview (AO) is not dependent on the NLM version, but more on the sources used.
- Another user mentioned setting up custom instructions on Perplexity to achieve detailed planning resulting in outputs over 10K words, using a full report as a source.
Gemini 2.5 Pro Excels in Legal Argumentation: A user found that Gemini 2.5 Pro came up with novel alternate legal arguments that other models like Sonnet 3.7 and GPT missed.
- The user highlighted its ability to condense complicated legal arguments into podcasts and its performance in identifying legally deficient objections; though the team is testing 2.5, NLM currently runs Gemini Flash 2.0.
Extracting Audio Overview Transcripts with a Power User Hack: A user asked about getting the text transcript of an Audio Overview, and another user suggested a power user hack to go to ai.dev, upload the file, and ask Gemini 2.5 to provide the transcript.
- Another user detailed how to use NLM itself: Download the audio overview as a WAV, then bring the WAV into the Notebook as a Source to have it transcribed.
Users Customize Podcasts for Stakeholder Groups: A user was able to leverage a website with K-12 AI policy documents from various US states, readily comparing them and contrasting them against BC documentation.
- Another user had their husband review fishing plans from the Department of Fisheries & Oceans, and was able to customize podcasts for other stakeholder groups.
Mindmap Formulas Showing in NotebookLM: A user reported seeing formulas for the first time using the Mindmap feature in NotebookLM.
- The user attached a screenshot showing raw letters after an underscore, suggesting that subscript and superscript rendering may be broken.

Notebook LM ▷ #general (86 messages🔥🔥):

Gemini 2.5 Pro, AI Studio vs App, LaTeX in NotebookLM, Podcast generator languages, YouTube video ingestion

Gemini 2.5 can use web search: Gemini, like ChatGPT, can be grounded to tap into a search API to update its knowledge base or verify information; this is enabled through tools in systems like AI Studio, and users find Gemini 2.5 Flash Thinking in AI Studio amazingly good.
Debate on AI Studio and Gemini App: AI Studio provides developers with more control to experiment with the API, while the Gemini App is better for less technical users, offering features like Canvas and Gems, but for grounding capabilities, AI Studio or the API are currently necessary.
LaTeX Glitches Plague NotebookLM: Users report LaTeX doesn’t work in NotebookLM, and the team confirms it’s a Gemini models issue they’re actively addressing.
Limited Podcast Language Support: The podcast generator currently supports only English, but a workaround involves creating the audio overview in English, downloading the WAV, adding it as a source, and then using NotebookLM to translate the transcript, though this method doesn’t differentiate the speakers.
NotebookLM Model Speculation: Users speculate on the model used by NotebookLM, with one mentioning it’s Flash 2.0 Thinking and 2.5 is on the way, while some believe that Gemini 2.5 Pro will not be implemented in NotebookLM based on comments from development team members.

HuggingFace ▷ #announcements (1 messages):

Transformers backend integration in vLLM, Quantization Docs Revamp, MCP Client Support in JS SDK, Acquisition of Pollen Robotics, Cohere and Cohere Labs Models on HF Hub

vLLM now runs Transformers: The new Transformers backend integration is available to try out in vLLM.
Quantization Documentation gets overhaul: The quantization docs have been revamped, available at this link.
JS SDK gains MCP Client Support: MCP client support is being added to the JS SDK’s inference client, in addition to Python, per this announcement.
Hugging Face x Pollen Robotics: Hugging Face is acquiring Pollen Robotics to bring open source robots to the world, per their announcement.
HF supports Cohere & Cohere Labs: Cohere and Cohere Labs models are now a supported inference provider on the HF Hub, according to this announcement.

HuggingFace ▷ #general (57 messages🔥🔥):

Continuous Pretraining, Growth Hacking, Text Generation Models, Open Source Video Understanding Models, TinyLlama Chatbot Issues

Smol Continuous Pretraining Stumbles: A member reported that continuous pretraining of smolLM with the cosmopedia-v2 and fineweb-edu-dedup datasets over 5k steps with a batch size of 128 and max sequence length of 512 resulted in worsened HellaSwag benchmark scores.
- Members suggested increasing and/or improving the dataset or revisiting the base model as well as considering resources such as the AWS MLOps checklist.
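To put the run's scale in context, here is a back-of-the-envelope count of training tokens. It is our arithmetic, not a figure from the thread. It is an upper bound that assumes every sequence fills the full 512-token length, ignoring padding and packing.

```python
steps = 5_000
batch_size = 128  # sequences per optimizer step
seq_len = 512     # max tokens per sequence (upper bound)

tokens_seen = steps * batch_size * seq_len
print(f"{tokens_seen:,} tokens")  # 327,680,000 tokens
```

That is roughly 0.33B tokens at most, well under a billion.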
Growth Hacker Asks How to Viral-ify: A member inquired about growth hacks after achieving 29 link clicks on the first day and hoping to reach 290 in one week.
- Another member suggested going viral on external social media, Reddit, blogs, or news sites, but admitted there’s no guarantee it will happen on purpose.
GPT-4 Alternatives Arise: A member sought a reliable text generation model as GPT-4 wasn’t available and DeepSeek demanded heavy resources.
- Recommendations included using Gradio Space via API with links to various image generation models, Gradio documentation, and the GeoGuessr-countries dataset and course.
TinyLlama Chatbot Hallucinates User Prompts: A member sought help with their TinyLlama customer support chatbot making up user prompts and hallucinating.
- Another member suggested using an AI API endpoint through a service provider while others suggested cloud inference options like FriendliAI.
Unsloth’s Dynamic v2.0 Quants Impress: The community shared Unsloth’s Dynamic v2.0 quants, which are said to set new benchmarks on 5-shot MMLU and KL Divergence, enabling running and fine-tuning quantized LLMs while preserving accuracy.
- Further information can be found on their Tweet and the Hugging Face collection.
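KL divergence here measures how far a quantized model's next-token distribution drifts from the full-precision model's distribution on the same inputs. Lower is better, and 0 means identical. Unsloth's exact evaluation protocol is not described above. This is the standard definition, assuming you already have logits from both models for the same positions.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_token_kl(ref_logits, quant_logits):
    """Average per-position KL between reference and quantized next-token distributions."""
    kls = [kl_divergence(softmax(r), softmax(q)) for r, q in zip(ref_logits, quant_logits)]
    return sum(kls) / len(kls)
```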

HuggingFace ▷ #today-im-learning (1 messages):

Embodied AI, AI Agents

Seeker Enters Embodied AI: A member signals interest in Embodied AI and AI Agents, seeking concrete opportunities for contribution or hands-on learning.
- They’re eager to assist with any relevant issue or task to deepen their understanding.
Embodied AI Task Quest Initiated: An individual learning Embodied AI and AI Agents is looking for a task to start with.
- They want to learn by doing and are offering their help on any relevant issue.

HuggingFace ▷ #cool-finds (2 messages):

Perplexity AI, Student Discount

Perplexity Pro: Free for Students!: Students can get Perplexity Pro for free for one month by registering with a school email through this referral link.
- Perplexity is an AI service that searches the internet and analyzes specialized literature, working with documents in context to answer questions.
Perplexity: Search and Analyze with AI: Perplexity AI offers features like searching the internet across dozens of pages and analyzing scientific articles in research mode, according to their website.
- It can work with documents in context, answer questions about text, and much more as an alternative search method.

HuggingFace ▷ #i-made-this (10 messages🔥):

Agentuity, BitNet b1.58, Beat Saber QA Chatbot, HFtoGGUF, Mistake-To-Meaning Dataset

Agentuity Attracts Devs: Developers are switching to Agentuity to deploy AI agents in one command, leveraging seamless LLM & LangChain integration and enterprise-grade scalability, as advertised here.
BitNet b1.58 2B4T Demo Goes Live: The BitNet b1.58 2B4T Demo is now live on a freebox on Hugging Face, available at this link.
Beat Saber QA Chatbot Dataset: A member is working on a Beat Saber QA Chatbot and has released the initial dataset, available at Hugging Face Datasets.
HFtoGGUF Converter Launched: A member launched a tool to easily convert Hugging Face models to GGUF format on Colab/ipynb, get the code at GitHub.
Mistake-To-Meaning Dataset Debuts: A member released the Mistake-To-Meaning dataset to help smaller models better understand typos, found here.

HuggingFace ▷ #reading-group (1 messages):

Simulation-Based Inference, Decision-Making Models, AppliedAI Institute for Europe, Women in AI & Robotics

AI Reading Group to host Jan Teusen on May 15: The AI Reading Group from Women in AI & Robotics will host Jan Teusen from the appliedAI Institute for Europe on May 15 to discuss Flexible and efficient simulation-based inference for models of decision-making.
- The paper link is available here and the event is scheduled here.
Paper focuses on simulation-based inference: The paper to be discussed focuses on flexible and efficient simulation-based inference for models of decision-making, authored in part by Jan Teusen.
- The discussion will take place within the Women in AI & Robotics reading group.

HuggingFace ▷ #computer-vision (2 messages):

OCR solutions, Qwen-VL, PaddleOCR, Donut, Google Document AI

OCR Solutions for Mixed Documents: A member seeks OCR solutions for domain-specific documents (medical certificates, award certificates) with mixed printed and handwritten text.
- They are evaluating Qwen-VL, PaddleOCR, Donut, and Google Document AI for automated key information extraction and search functionality.
OLM OCR is suggested: A member recommended OLM OCR as a potential solution for the document processing task.
- No further details or links about OLM OCR were provided.

HuggingFace ▷ #agents-course (46 messages🔥):

Agents Course Deadline Extension, Llama-3.2-3B-Instruct 503 Errors, GAIA Benchmark Modalities, Free Models for Final Project, LangGraph Use for Benchmark

Agents Course Deadline Extended to July 1, 2025!: The deadline for the Hugging Face Agents Course has been extended to July 1, 2025, as confirmed by a member, giving participants more time to complete the course and projects. Hugging Face Agents Course.
Llama-3.2-3B-Instruct Runs Into 503 Service Unavailable: Several members reported receiving 503 Server Error: Service Temporarily Unavailable when trying to run meta-llama/Llama-3.2-3B-Instruct and one member resolved it by requesting access on the model page.
GAIA Benchmark Features Diverse Modalities: A member shared some basic EDA (Exploratory Data Analysis) on the final benchmark, noting the modalities across all levels for the test set with an attached screenshot of the modalities. Screenshot of EDA.
Free Models Still Viable for Final Agent Projects?: With the Agents Course credits running out, members discussed the possibility of completing the final project using free models.
- One member suggested using Gemini models through LiteLLM, and another recommended using Deepseek API keys, but highlighted the need to keep the space/code public.
Benchmarking LangGraph: Hype or Type?: A member inquired whether anyone is using LangGraph for the benchmark, opening the door to discussions on its applicability and effectiveness.

LM Studio ▷ #general (71 messages🔥🔥):

LLPlayer safety, Model loading issues, Gemma 3 models, AI Power Measurement, Pixtral support

LLPlayer: Is it Safe?: A member inquired about the safety of the LLPlayer GitHub project, seeking feedback from the community.
- Another member suggested that if there are any doubts, building it yourself would be the best option.
Gemma 3 Models are a Go!: A user reported issues getting Gemma 3 models to work, while multiple other members confirmed that Gemma 3 models work out of the box using the latest version.
- Another suggested updating past v0.2.31 as that version doesn’t support it.
Peeking at Pixtral Support: A user noted that Pixtral support was added to llama.cpp but then immediately realized it was text only and has no vision yet.
- Another member pointed out that you could use Pixtral with MLX for a long time now.
Dissecting AI Power Measurement: Members discussed how AI power is measured, clarifying that perplexity is used to measure language skills, while other benchmarks assess different abilities and knowledge domains, with a member sharing a link to a graph comparing AI models.
- The discussion also covered tokens, which measure the amount of text and are used for billing in paid services.
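Perplexity is the exponential of the average negative log-likelihood a model assigns to the tokens that actually come next. A perplexity of k roughly means the model is as uncertain as if it were choosing uniformly among k tokens, so lower is better. The minimal sketch below assumes you already have each actual token's predicted probability.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the observed next tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))
```

For example, a model that assigns probability 0.25 to every correct token has a perplexity of 4.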
Tokenizer Troubles: Debugging Vocab Issues: A member faced issues with their model not tokenizing properly and looked for a way to see more verbose errors, adding that they had hit the same issue when pre-training a proper Llama architecture with the Transformers library.
- Another member responded that there is no option for verbose errors, but you might be able to see more with the lms log stream command, with the user ultimately resolving the issue by fixing the spaces not being tokenized.

LM Studio ▷ #hardware-discussion (46 messages🔥):

VRAM vs DRAM, GPU Upgrade vs RAM Upgrade, Used 3090 Reliability, Thermal Pad Replacement

DRAM Slows Performance Drastically: When a model exceeds VRAM, it offloads to system RAM, which severely degrades performance, one user seeing only 0.5 tok/s with a 120B model on a 4090 with 128GB of DDR4 RAM.
- Although one user didn’t mind waiting 15 minutes for an excellent answer, another pointed out that even big models could require a trial and error approach, so speed does matter.
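Single-stream decoding is usually limited by memory bandwidth. Each generated token has to read the model weights once, so the slowest memory they sit in sets an upper bound on speed. The estimate below is ours, and every number in it is an assumption, since the thread gave neither the quantization nor the RAM speed: about 4-bit weights and dual-channel DDR4-3200 (2 × 25.6 GB/s). For simplicity it pessimistically treats the whole model as streamed from system RAM.

```python
def decode_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound for memory-bound decoding: each token streams all weights once."""
    return bandwidth_bytes_per_s / model_bytes


params = 120e9
bytes_per_param = 0.5        # assumption: ~4-bit quantization, ignoring KV cache and overhead
ddr4_dual_channel = 51.2e9   # assumption: DDR4-3200, 2 channels x 25.6 GB/s
print(round(decode_tokens_per_second(params * bytes_per_param, ddr4_dual_channel), 2))  # 0.85
```

Under these assumptions the bound is below 1 tok/s, the same order of magnitude as the 0.5 tok/s reported above.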
Small Model Recommendation: In response to a user testing a 1.7b q4_k_l model, another member recommended a list of small models available at kth8/llama-server.
- The user also asked whether to run the model across DRAM and the GPU together; another user replied that you don’t want the model to touch system RAM.
3080 Upgrade Path Debated: When asked about upgrading from a 3080 with 10GB VRAM and 32GB RAM, it was suggested that more vram is always the answer for larger models + speeds.
- Another suggestion was that one would be better off upgrading that 3080 to a used 3090.
Used 3090 Gamble: Buying a used 3090 is kind of a lottery, but if you have 24 hours to test and return, then why not?
- Replacing thermal pads could be expensive but most of the times replacing thermal paste on the gpu chip is enough, though some cards may need 4 different types of pads costing about $30.
Multi-GPU Inefficiency: A user reported that only one of their two RTX 3090s appeared to be active while processing LLM prompts, but later found that one of their GPUs was still heavily underclocked by Afterburner.
- A user suggested checking stats from the driver with nvidia-smi.
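`nvidia-smi` can report per-GPU utilization alongside the current and maximum SM clocks, which makes an underclocked card easy to spot. The sketch below uses the `--query-gpu` / `--format=csv` interface. The 50% thresholds in `flag_underclocked` are an arbitrary heuristic of ours: an idle GPU legitimately drops its clocks, so the check only flags busy ones.

```python
import csv
import io
import subprocess

QUERY = "index,name,utilization.gpu,clocks.sm,clocks.max.sm"


def query_gpus() -> str:
    """Run nvidia-smi and return CSV rows without header or units."""
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout


def parse_gpu_csv(text: str) -> list[dict]:
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        if not rec:
            continue
        idx, name, util, sm, sm_max = (field.strip() for field in rec)
        rows.append({"index": int(idx), "name": name, "util_pct": int(util),
                     "sm_mhz": int(sm), "sm_max_mhz": int(sm_max)})
    return rows


def flag_underclocked(rows, ratio=0.5):
    """Indices of busy GPUs whose current SM clock is below `ratio` of their max."""
    return [r["index"] for r in rows
            if r["util_pct"] > 50 and r["sm_mhz"] < ratio * r["sm_max_mhz"]]
```

Usage on a machine with NVIDIA drivers: `print(flag_underclocked(parse_gpu_csv(query_gpus())))`.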

GPU MODE ▷ #general (8 messages🔥):

Blackwell FP4, Free GPU Access, Cloud GPU Prices

Blackwell’s FP4 Format Details Sought: A member inquired about the technical specifications of FP4 implementation in NVIDIA’s Blackwell architecture, seeking details on its range and related aspects.
- Another member shared a link to OCP Microscaling Formats MX v1.0 spec which might contain the requested information.
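For reference, the FP4 element type in the OCP MX spec is E2M1: 1 sign bit, 2 exponent bits with bias 1, and 1 mantissa bit, with no infinities or NaNs. The representable magnitudes are therefore 0, 0.5, 1, 1.5, 2, 3, 4, and 6. In MX formats, a block of elements shares one power-of-two scale stored as an E8M0 exponent. The sketch below decodes codes under that reading of the spec. It takes the shared scale as an already-unbiased exponent. Whether Blackwell's FP4 paths use exactly this scaling scheme is not covered by the thread.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: sign | 2-bit exponent (bias 1) | 1-bit mantissa."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:  # subnormal range: 0 or 0.5
        return sign * man * 0.5
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


def decode_mx_block(codes, scale_exp: int):
    """Apply one shared power-of-two scale (unbiased exponent) to a block of FP4 codes."""
    scale = 2.0 ** scale_exp
    return [decode_e2m1(c) * scale for c in codes]
```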
Quest for Free GPU Access: A member asked about obtaining free GPU access as a student without relying on school servers.
- Another member suggested Google Colab or GPU Mode’s Discord Cluster Manager for free options; cloud-gpus.com was mentioned as having good prices.
Cloud GPUs Offer Budget-Friendly Options: A member suggested that using consumer cards on cloud services could be relatively inexpensive.
- They also noted that some cloud providers offer free credit hours, mentioning a friend who won free GPU hours at a conference.

GPU MODE ▷ #triton (9 messages🔥):

Pyright and Triton issues, FP4 vs INT4, FP4 to FP8 Conversion, H100 optimized kernels, RTX 5090 FP4 performance

Pyright Flags Triton’s cdiv as Unreachable: A user reported that Pyright marks the rest of a Triton function as unreachable whenever cdiv is used, requiring a custom pyrightconfig.json to ignore the warning.
- The code still runs fine, but the false positive is annoying and requires manual configuration on each new machine/project.
FP4 runs slower than INT4 when Dequantizing to FP16: A member noted that dequantizing FP4 to FP16 is theoretically slower compared to INT4 to FP16, as INT4 can leverage high-performance bitwise logic (LOP3).
- He confirmed it’s absolutely slower.
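The INT4 advantage comes from a well-known bit trick, illustrated here in Python for readability. The FP16 bit pattern 0x6400 encodes 1024.0, and its lowest mantissa bit is worth exactly 1.0. So OR-ing an unsigned 4-bit value into those low bits produces 1024 + x, and one subtraction recovers x. On a GPU, the mask and OR can be fused into a single LOP3 per packed register. FP4 codes carry their own exponent bits, so a single OR into a fixed exponent does not yield the right value. That is one plausible reason FP4 dequantization is costlier. Signed INT4 needs an offset variant, which is not shown.

```python
import struct


def half_from_bits(bits: int) -> float:
    """Reinterpret a 16-bit pattern as an IEEE-754 half-precision float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


MAGIC = 0x6400  # fp16 bits for 1024.0; mantissa LSB is worth exactly 1.0


def int4_to_fp16_via_magic(nibble: int) -> float:
    """Dequantize an unsigned 4-bit value: (MAGIC | nibble) as fp16, minus 1024."""
    if not 0 <= nibble < 16:
        raise ValueError("expected a 4-bit unsigned value")
    return half_from_bits(MAGIC | nibble) - 1024.0
```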
Users Question FP4 to FP8 Conversion: A user inquired about converting FP4 to FP8 with scaling factors, but noted that they didn’t see any such conversion in cuda_fp4.hpp.
- Another user questioned the motivation for this on an H100, suggesting optimized kernels like A16W4 from Marlin/Machete, Bitblas (tilelang), or GemLite.
Optimized Kernels are suggested over FP4 on H100: Members suggested using optimized kernels such as A16W4 from Marlin/Machete, Bitblas (tilelang), or GemLite rather than FP4 on H100.
- The primary use case for FP4 is MXFP4 on Blackwell, which necessitates a specific scaling format and a good FP16 -> FP4 quantization algorithm.
RTX 5090 FP4 Performance Benchmarks Incoming: A user mentioned having an RTX 5090 and wanting to assess FP4 performance.
- Another member provided a link to a cutlass example and expressed excitement for the upcoming benchmarks.

GPU MODE ▷ #cuda (12 messages🔥):

Popcorn project, PTX unsigned vs signed ints, LLM mistakes in CUDA kernels, RL framework for LLM accuracy

Popcorn Project Popping Up?: A member inquired if a project was similar to the Popcorn project 🍿.
- Another member requested that similar posts be placed in the appropriate channel.
PTX signed vs unsigned ints: A member learning PTX inquired about the functional difference between loading unsigned and signed ints, wondering if the distinction is implicitly used for future operations.
- They asked if there’s any functional difference to using ld.[location].s[n] vs ld.[location].u[n].
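Our reading of the PTX ISA documentation is as follows. When a sub-word value, such as 8 or 16 bits, is loaded into a wider register, the .s or .u type chooses sign extension or zero extension. For full-width loads the bits are identical, and the type only matters to the instructions that consume the value later. The Python below simply mimics what ends up in a 32-bit register for an 8-bit load. Check the `ld` section of the PTX ISA for the authoritative rules.

```python
def zero_extend(byte: int) -> int:
    """Like a .u8 load into a 32-bit register: upper bits are filled with 0."""
    return byte & 0xFF


def sign_extend(value: int, bits: int = 8) -> int:
    """Like a .s8 load into a 32-bit register: upper bits copy the sign bit."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value
```

For the byte 0xFF, the .u8 load yields 255, while the .s8 load yields -1, which is the register pattern 0xFFFFFFFF.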
KernelBench paper sheds light on LLM CUDA kernel mistakes: A member inquired about the most frequent “mistake” that LLMs make when generating CUDA kernels.
- Another member pointed to the KernelBench paper on arxiv which has a table of those mistakes.
RL Framework invokes tools to increase LLM Accuracy: A member is exploring an RL based framework for increasing LLM accuracy and wondered if there’s a consensus on what external tools the LLM should invoke.
- They suggested tools like search engines and code interpreters.

GPU MODE ▷ #beginner (15 messages🔥):

PTX barrier instruction, Warp specialization producer-consumer model, libcu++ memory barriers, barrier vs bar

PTX barrier.aligned instruction clarified: Discussion revolves around interpreting PTX documentation regarding barrier.aligned instructions, particularly concerning whether the barrier name parameter must be identical across threads within a warp.
- It’s suggested that if one lane participates in a resource, all others must as well, though it’s unclear if they must specify the same resource.
bar.arrive and bar.sync usage analyzed: The legality of using different barrier resources within the same warp is questioned, specifically regarding branching to split threads that arrive versus those that sync.
- An example snippet from Weft (or here) showcasing pairs of arrive/sync for warp specialization in a producer-consumer model is referenced.
libcu++ uses mbarrier for memory barriers: It’s noted that libcu++ uses mbarrier (or atomics on older architectures) rather than bar/barrier for implementing std::barrier.
- This is because barrier only allows either arriving or waiting, not both, which is needed for implementing std::barrier.
Clarification on bar versus barrier terminology: bar is clarified to be a synonym for barrier.*.aligned, especially relevant on architectures like Maxwell where aligned mode is the only supported option.
- Older versions of the ISA pre-barrier synchronize warps.

GPU MODE ▷ #irl-meetup (1 messages):

.bexboy: ICLR?

GPU MODE ▷ #rocm (1 messages):

mobicham: GCN hungry for more work, probably there’s a reason why it added it

GPU MODE ▷ #self-promotion (1 messages):

sohamg: LMAO yes i did - no way haha, did you do high school debate?

GPU MODE ▷ #🍿 (4 messages):

Kernel Bench, Compute Eval

Kernel Bench vs Compute Eval challenges outlined: A member highlighted the differences between Kernel Bench and Compute Eval, noting that the former assumes the reference implementation is PyTorch, while the latter uses English.
- They added that Kernel Bench is ML-focused, whereas Compute Eval is more HPC-oriented.
Kernel Bench challenges focus on ML: Expanding on the difference between Kernel Bench and Compute Eval, one challenge mentioned is Kernel Bench’s emphasis on ML.
- This contrasts with Compute Eval, which is geared more towards HPC (High-Performance Computing) applications.

GPU MODE ▷ #submissions (42 messages🔥):

AMD MI300 Performance, FP8 Matrix Multiplication, Leaderboard Submissions

MI300 Leaderboard Domination: Many members submitted benchmarks to the amd-fp8-mm leaderboard on MI300, showcasing a range of performance results in microseconds (µs) and milliseconds (ms).
FP8-MM Benchmark Bonanza: Multiple submissions targeted the amd-fp8-mm leaderboard with varying degrees of success, with execution times ranging from 174 µs to 127 ms on the MI300.
New personal bests are achieved: Several members achieved personal bests on the MI300, improving their standing on the leaderboards.
- One member even managed a 24.2 µs personal best on the amd-identity leaderboard.
A100 Grayscale Submission Shows Promise: One member submitted to the grayscale leaderboard, achieving a successful run on the A100 in 2.50 ms.

GPU MODE ▷ #amd-competition (21 messages🔥):

numpy error in bot code, discord-cluster-manager docker image broken, Container matches the CI implementation, faster CUDA compilations in PyTorch nightlies, AMD competition shapes

Numpy gone from bot code: A user encountered a ModuleNotFoundError for numpy when benchmarking their code, despite not calling it directly.
- A member suggested that the docker image might be broken and promised to investigate.
A fix to the Discord Cluster: A member reported they fixed the docker image.
- This Discord Cluster Manager change will get merged soon; it makes CUDA compilations take 0.01s, and hopefully AMD ones soon as well.
Runpod Setups and Profiling: Members discussed setting up the iteration environment and getting in-depth profiling data from the GPU using Runpod.
- One member was able to get kernels running using the template, achieving benchmark runs in approximately 9 seconds.
AMD Competition Shape Requests: A member noted that the dimension [7168 6144 576], with n (576), should be changed to something divisible by 128.
- Another member clarified that these shapes are requested by AMD for the competition, including 64
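Here is the arithmetic behind the divisibility concern, assuming a kernel that tiles each dimension in blocks of 128. The tile size comes from the comment above, not from a specific kernel. Of the three dimensions, only 576 leaves a partial tile, and padding it to whole tiles adds 64 elements.

```python
import math

TILE = 128  # assumption: the kernel tiles each dimension in blocks of 128


def ragged_dims(dims, tile=TILE):
    """Dimensions that do not split into whole tiles."""
    return [d for d in dims if d % tile]


def padded(d, tile=TILE):
    """Smallest multiple of `tile` covering d (what a padded launch would process)."""
    return math.ceil(d / tile) * tile


dims = [7168, 6144, 576]
print(ragged_dims(dims))        # [576]
print(576 % TILE, padded(576))  # 64 640
```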
DeepSeek R1 Shapes targeted by AMD: A member suggested that the shapes seem to target DeepSeek R1 shapes but noted that the m values are not good reference shapes for inference.
- They estimated the largest m that could fit on an MI300x node would be around m = 128, while m = 64 is the max for an H200 node.

Nous Research AI ▷ #announcements (1 messages):

Minos classifier, Refusal detection, Binary classifier, ModernBERT-Large

Minos arrives to classify LLM refusals: A new classifier called Minos was introduced for detecting refusals from LLMs and it’s available on HuggingFace.
- It is a binary classifier that returns the likelihood of a final response in a chat being a refusal, potentially useful for redteamers and jailbreakers.
ModernBERT-Large Powers Minos: Minos is built on ModernBERT-Large 400M from Jeremy Howard’s AnswerAI team, a model optimized for quality and speed.
- Example scripts to run the classifier and more details are available in the HuggingFace repo.

Nous Research AI ▷ #general (99 messages🔥🔥):

Synthetic Data Training Runs, GPQA Benchmark, Sparse Initialization Code, Mac vs NVIDIA GPUs, China Thorium Nuclear Reactor

Synthetic Data Spurs Unique Training Runs: A member is working on a unique training run using synthetic data from DeepHermes 24B and seeks a STEM task recommendation to verify the quality of the distillation.
- They aim to create a convincing “hello, world” task to prove their technique, focusing on tasks that are straightforward to measure and whose prompts can be generated programmatically.
GPQA suggested as a benchmark for distillation: A member suggested GPQA as a good benchmark for a distillation project, but there’s a discussion of whether distilling on benchmark questions is considered cheating.
- It was mentioned that MMLU has a train set, but GPQA does not, and the OpenThoughts 114k dataset is part of DeepHermes’ training data, which could be helpful.
Sparse Initialization of L4 MoE Teased: A member is working on sparse initialization code to convert DeepHermes into an L4-style MoE that can run efficiently on an 8-12GB GPU plus CPU.
- The plan involves self-logit distillation to distill onto the sparsified model, potentially requiring additional parameters or duplicated layers to fully preserve performance.
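Self-logit distillation means training the sparsified student to reproduce the logits of the original dense model. The member did not describe their loss, so the sketch below assumes the common temperature-scaled KL-divergence objective from Hinton et al. (2015); the function names and the default temperature are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) for one token position.

    The result is multiplied by T^2 so gradient magnitudes stay comparable
    across temperatures (the convention from Hinton et al., 2015)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Teacher = original dense model, student = sparsified MoE (same vocabulary).
print(distillation_loss([3.0, 1.0, 0.0], [2.5, 1.2, 0.1]))
```

In a real run this per-token loss would be averaged over all positions in a batch, with the teacher's logits computed without gradients.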
Macbooks vs NVIDIA GPUs for AI Workflows: A discussion arose about the prevalence of Macs in AI despite NVIDIA GPUs offering better performance, questioning which laptop is optimal for AI engineering and research.
- Arguments for Macs include efficient Apple Silicon, unified memory, and suitability for light local inference, while NVIDIA GPUs are favored for raw performance in AI workloads; some argued that unified memory and a larger maximum memory make Macs the better choice for laptops specifically.
China activates Thorium Nuclear Reactor: Members are discussing a YouTube video about China’s Thorium Nuclear Reactor becoming active, suggesting it will eventually power autonomous EV infrastructure and AI data centers.
- The conversation also included a link to a video about a non-nuclear hydrogen bomb and a link to an Open WebUI function for Minos.

Nous Research AI ▷ #ask-about-llms (4 messages):

MoE, heterogeneous devices, distributed experts

Researchers Seek Info on Distributed MoE Inference: A researcher inquired about projects distributing Mixture of Experts (MoE) across heterogeneous devices during inference, citing this paper and this paper.
- The researcher was looking for new approaches and good resources to check out.
Developer Seeks Opportunities in Discord: A member with the username teknium offered their services as a developer to the Discord community.
- They invited those in need to DM them.

Nous Research AI ▷ #research-papers (2 messages):

Top-nσ Sampling Method, LLMs Decoding Strategies

Top-nσ Sampling Method Filters Tokens Efficiently: A member shared an arXiv link about Top-nσ, a novel sampling method that operates directly on pre-softmax logits and uses a statistical threshold to filter tokens efficiently.
- Unlike existing methods, top-nσ maintains a stable sampling space regardless of temperature scaling, outperforming existing sampling approaches and surpassing greedy decoding.
LLM Decoding Strategies: The paper challenges the conventional practice of using greedy decoding or low-temperature sampling for LLM reasoning tasks, a practice that reflects a perceived trade-off between diversity and accuracy.
- The member suggests that this paper would be relevant for another member, who likely specializes in this area.
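The filtering rule can be sketched in a few lines. This assumes the threshold is the maximum logit minus n standard deviations of all logits (a reading of the method; check the paper for exact details); the population standard deviation, the n = 1.0 default and the function names are illustrative choices.

```python
import math
import random
import statistics

def top_n_sigma_candidates(logits, n=1.0):
    """Indices of tokens whose logit is at least max(logits) - n * std(logits)."""
    threshold = max(logits) - n * statistics.pstdev(logits)
    return [i for i, z in enumerate(logits) if z >= threshold]

def top_n_sigma_sample(logits, n=1.0, temperature=1.0, rng=None):
    """Sample one token index from the temperature-scaled softmax over the
    top-nσ candidates only; every other token gets zero probability."""
    rng = rng or random.Random()
    kept = top_n_sigma_candidates(logits, n)
    scaled = [logits[i] / temperature for i in kept]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]  # unnormalized softmax
    return rng.choices(kept, weights=weights, k=1)[0]

logits = [10.0, 9.5, 0.0, 0.1, -0.2, 0.05]
print(top_n_sigma_candidates(logits))                     # [0, 1]
print(top_n_sigma_candidates([z / 5.0 for z in logits]))  # [0, 1] again at T = 5
```

Because the filter runs on raw logits, raising the temperature only flattens the distribution among the two surviving tokens; the four near-zero "noise" logits never become eligible.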

Nous Research AI ▷ #research-papers (2 messages):

Top-nσ Sampling Method, LLM Decoding Strategies, Gaussian Noise in Logits

Top-nσ Sampling Method Debuts: A new paper introduces top-nσ, a novel sampling method that filters tokens based on a statistical threshold applied to pre-softmax logits, distinguishing between a Gaussian-distributed noisy region and an informative region.
- The method maintains a stable sampling space regardless of temperature scaling, outperforming existing approaches and greedy decoding on reasoning-focused datasets.
LLMs Tackle Decoding Strategies: The top-nσ paper challenges the conventional trade-off between diversity and accuracy in large language model (LLM) decoding, typically addressed through greedy decoding or low-temperature sampling.
- Unlike methods like top-p, top-nσ avoids inadvertently including more noise tokens at higher temperatures.
Gaussian Noise Identified in Logits: The top-nσ method is based on the insight that logits naturally separate into a Gaussian-distributed noisy region and a distinct informative region.
- This separation enables efficient token filtering without complex probability manipulations, as detailed in the paper’s theoretical analysis.
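The temperature invariance follows directly from the threshold, assuming a token is kept when its logit lies within n standard deviations of the maximum logit. Writing z_i for the logits, M for their maximum, σ for their standard deviation and T > 0 for the temperature:

```latex
\begin{aligned}
z_i' &= \frac{z_i}{T}, \qquad M' = \max_j z_j' = \frac{M}{T}, \qquad \sigma' = \operatorname{std}(z') = \frac{\sigma}{T},\\
z_i' \ge M' - n\sigma' &\iff \frac{z_i}{T} \ge \frac{M - n\sigma}{T} \iff z_i \ge M - n\sigma .
\end{aligned}
```

Dividing by T rescales the maximum and the standard deviation by the same factor, so the kept set depends only on n, and T only redistributes probability among the kept tokens. Top-p instead thresholds cumulative probability; a higher T flattens the distribution, so more low-logit tokens are needed to reach p, which is how noise tokens creep in.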

Eleuther ▷ #general (12 messages🔥):

Common Crawl Foundation Host Index, Moderation Norms Grievances, ml-math.net Updates, Improved ResNet Initialization for Transformers

Common Crawl Foundation Seeks Testers: The Common Crawl Foundation is seeking people to test a new data product named the “host index”, which is a rollup of per-host information in each of their web crawls.
Discord User Airs Grievances About Moderation: A user aired grievances about moderation norms in other communities but was told by a moderator that “this is not an appropriate place to air your grievances about moderation norms.”
- Another user defended them: “I feel like it’s okay, people have done this before here to good effect, and they seem to have made a good faith effort to get things restored, and didn’t have a direct contact to DM.”
Math Behind ML Unchanged: A user asked if there were any updated resources similar to ml-math.net with more examples of commonly applied concepts and seminal papers.
- Another user replied that “the math behind this field hasn’t changed drastically. It’s still the same stuff but repackaged differently.”
Initialization Logic for ResNets and Transformers: A member has been investigating improved inits for ResNets and asked whether the same logic applies to transformers, or to attention in general.
- They added that they’ve been catching up on newer resnet standards, and a lot of the overall logic makes sense.

Eleuther ▷ #research (42 messages🔥):

AI-Generated Papers, ACL Submissions, Reviewer Overload, Mitigating AI Use

AI Powers the Rise of AI-Generated Papers: Members discussed that with just $15-20 in API tokens, one can generate a full research paper using tools like AI-Scientist-v2, leading to concerns about the influx of AI-generated papers on arXiv.
ACL Submission Avalanche: A member noted that this year’s ACL submissions have more than doubled compared to last year, reaching over 8500, raising questions about how many are AI-generated.
- Another member fears that Emily Bender’s talk backfired, resulting in substantially more AI spam.
Reviewers Resist AI Paper Policing: Members debated how to prove if a paper was AI-generated and what punishments should follow, suggesting banning authors from major conferences.
- One member stated they would not review if they had to make calls on banning individuals, while another felt that AI should not write the core substance of papers.
Infini-grams Expose AI Paper Imitations: A member suggested detecting AI-generated papers with an infini-gram index containing only arXiv, GitHub, Wikipedia, and StackExchange, arguing that AI models invariably reuse full phrases from these sources.
- In a separate thread, a member observed that looped layers are worse than looped blocks for language modelling.
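The detection idea amounts to asking how much of a paper's text appears verbatim in a reference corpus. Infini-gram itself answers such queries over trillions of tokens using suffix arrays; the toy sketch below swaps in a plain Python set of word n-grams, so it only illustrates the scoring idea. The function names, whitespace tokenization and the choice of n are assumptions.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token tuples in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(document, corpus_docs, n=8):
    """Fraction of the document's distinct word n-grams found verbatim in the corpus."""
    doc_grams = ngrams(document.split(), n)
    if not doc_grams:
        return 0.0
    corpus_grams = set()
    for text in corpus_docs:
        corpus_grams |= ngrams(text.split(), n)
    return len(doc_grams & corpus_grams) / len(doc_grams)

corpus = ["the quick brown fox jumps over the lazy dog"]
print(verbatim_overlap("the quick brown cat", corpus, n=3))  # 0.5
```

A high score on long n-grams would flag heavy verbatim reuse; whether that reliably separates AI-written papers from human ones is the member's hypothesis, not something this sketch establishes.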

Eleuther ▷ #interpretability-general (7 messages):

Transformer Circuits Framework, Subspaces and Residual Stream Bandwidth, Mechanistic Interpretability

Transformer Circuits Framework Influences Thinking: A member is borrowing concepts from the Transformer Circuits Framework, particularly the ideas around subspaces and residual stream bandwidth.
- They highlight that model components aren’t constrained to respect each other’s subspaces, which may cause inefficiencies but also leaves room for improvement by “protecting” subspaces better.
Experiments with Constraining Attention Head Subspaces: A member considers constraining attention heads to operate on 128-dimensional subspaces of a shared 512-dimensional subspace to force better organization, potentially sacrificing expressivity.
- They speculate that heads reading and writing to the same subspaces might further improve organization, despite the trade-off in expressiveness.
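One way to picture the proposed constraint: take the shared subspace to be the first 512 coordinates of the residual stream (an assumption; any fixed orthonormal basis would do), and let each head read a 128-dimensional slice and write its output back into that same slice, so nothing outside the shared subspace is ever modified. A pure-Python toy sketch with hypothetical names and a random init; the per-head output projection is folded into W_V for brevity.

```python
import math
import random

def _softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def _rand_square(dim, rng):
    # Hypothetical Gaussian init with 1/sqrt(dim) scale.
    return [[rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)] for _ in range(dim)]

def _matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def subspace_attention_layer(resid, shared_dim=512, head_dim=128, seed=0):
    """Causal attention whose heads each read from and write to one head_dim-wide
    slice of the first shared_dim residual coordinates; the rest is untouched.

    resid: list of token vectors, each a list of floats of length d_model."""
    rng = random.Random(seed)
    out = [list(tok) for tok in resid]
    for h in range(shared_dim // head_dim):
        lo, hi = h * head_dim, (h + 1) * head_dim
        wq, wk, wv = (_rand_square(head_dim, rng) for _ in range(3))
        xs = [tok[lo:hi] for tok in resid]  # read: this head's slice only
        qs = [_matvec(wq, x) for x in xs]
        ks = [_matvec(wk, x) for x in xs]
        vs = [_matvec(wv, x) for x in xs]
        for t, q in enumerate(qs):
            scores = [sum(a * b for a, b in zip(q, ks[s])) / math.sqrt(head_dim)
                      for s in range(t + 1)]  # causal: attend to positions 0..t
            weights = _softmax(scores)
            for i in range(head_dim):
                # write: add the head's output back into the same slice
                out[t][lo + i] += sum(w * vs[s][i] for s, w in enumerate(weights))
    return out
```

Because the slices are disjoint, heads cannot overwrite each other's features, which is the organizational benefit the member is after; the cost is that each head only ever sees 128 of the model's dimensions.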
Beginner Seeks Guidance on Interpretability: A software engineer expressed interest in interpretability and asked for a path to understand cutting-edge research.
- Another member recommended Neel Nanda’s post on mechanistic interpretability as a good starting point.

MCP (Glama) ▷ #general (36 messages🔥):

MCP server issues with image/audio, Claude artifacts and image display, Auth for remote MCPs, git repo to MCP server

MCP Server Struggles with Image/Audio Handling: A member reported that Claude Desktop handles audio and images poorly, especially for images > 1MB, and using content with multiple types (text, audio, image) results in ‘unsupported content type’ errors.
- They expressed a desire for LLMs to have references to large data blobs rather than sending the data directly, questioning whether MCP resources could be used for this purpose.
Pro Plan to the Rescue for Large MCP Outputs?: A user encountering issues with generating interactive schedules with large datasets found that upgrading to the pro plan of Claude Desktop resolved the problem.
- This suggests a limitation in the free version related to the size or complexity of data handled.
Cloudflare’s OAuth Approach for Remote MCP Authentication: Concerns were raised about authenticating users for remote MCP servers, with a member noting that Cloudflare is pushing for changes to the MCP spec and using OAuth for this problem.
- They linked to a Cloudflare blog post discussing Dynamic Client Registration to solve the “regardless of their client” part of the authentication challenge.
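For reference, a Dynamic Client Registration request (RFC 7591) is a JSON POST of client metadata to the authorization server's registration endpoint, which replies with a client_id; this is what lets an arbitrary MCP client register itself without prior setup. The sketch below only builds such a request with the standard library. The endpoint URL, client name and redirect URI are hypothetical placeholders, and a real client would discover registration_endpoint from the server's OAuth metadata (RFC 8414).

```python
import json
import urllib.request

def build_registration_request(registration_endpoint, client_name, redirect_uris):
    """Build (but do not send) an OAuth 2.0 Dynamic Client Registration request.

    Field names follow RFC 7591 client metadata. A public client such as a
    desktop MCP client has no client secret, hence token_endpoint_auth_method="none"."""
    metadata = {
        "client_name": client_name,
        "redirect_uris": redirect_uris,
        "grant_types": ["authorization_code", "refresh_token"],
        "response_types": ["code"],
        "token_endpoint_auth_method": "none",
    }
    return urllib.request.Request(
        registration_endpoint,
        data=json.dumps(metadata).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_registration_request(
    "https://auth.example.com/register",      # hypothetical endpoint
    "Example MCP client",
    ["http://localhost:8976/callback"],       # hypothetical redirect URI
)
# urllib.request.urlopen(req) would send it; the JSON reply contains client_id.
```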
Automagically Turn Your Git Repo into an MCP Server: A member shared git-mcp, a tool that turns any git repo into an MCP server that can be queried for docs, examples, etc.
- This allows users to easily reference code and documentation from repositories within MCP workflows.
Sacred’s Hosted Chat App Launches for MCP Servers and Tools: A hosted chat app, chat.pipedream.com, was launched to allow users to chat with 2500+ MCP servers and 10k+ tools.
- The creator is actively seeking feedback on the platform.

MCP (Glama) ▷ #showcase (8 messages🔥):

Browser Extension for MCP, Slack MCP Server Updates, OpenAI Image Generation for Graphlit MCP Server, MCP server for VanMoof riders, MCP Server with DevDb

Siloed Launches MCP Browser Extension: Siloed introduces a browser extension for MCP, enabling users to drag and drop resources and prompts directly into the browser from any MCP server that supports resources.
- The extension allows building a library of prompts with dynamic text and resource attachments and integrates with Siloed’s MCP Server for access in clients like Claude Desktop.
Slack MCP Server Gets Major Updates: The Slack MCP Server receives major updates, including improved adoption via npx and a shift to date range-based history requests instead of message counts.
- These updates enhance the server’s ability to analyze messages over specified periods like 1 day, 1 month, or 2 weeks.
Graphlit MCP Server Now Features OpenAI Image Generation: The Graphlit MCP server now includes OpenAI image generation capabilities, as announced in a tweet.
- Users are encouraged to try out the new feature, enhancing the server’s multimedia functionalities.
VanMoof Riders Get Dedicated MCP Server: A new MCP server is released specifically for VanMoof riders, offering integration and functionalities tailored to their needs.
- The server aims to provide enhanced control and customization options for VanMoof e-bike users.
MCP Server Integrated with DevDb Released: An MCP server integrated with DevDb has been released, offering support for MySQL, Postgres, SQLite, and MSSQL databases.
- Details and configuration instructions are available in the DevDb VSCode extension’s readme file.

LlamaIndex ▷ #blog (3 messages):

MCP Servers in LlamaIndex.TS, Multimodality, DeepSeek-based Perplexity clone

MCP Servers Now Integrated Into LlamaIndex.TS: Thousands of MCP servers are now available in LlamaIndex.TS via the one-line mcp() call, streamlining connections to any MCP server, with examples available here.
- This integration facilitates easier access and utilization of MCP servers within LlamaIndex.TS projects, as showcased in the linked resource.
LlamaIndex Embraces Multimodality: LlamaIndex highlights its support for multi-modal models and embeddings, calling multimodality a significant near-term advancement in AI, as co-founder Jerry Liu discussed at last year’s @aiDotEngineer World’s Fair (courtesy of @MongoDB).
- Further details on their latest multi-modal support can be found here.
Quick Perplexity Clone Built on DeepSeek: A DeepSeek-based Perplexity clone implemented by Karan Vaidya requires less than 100 lines of code, showcasing the ease of creating similar applications with modern tools.
- The implementation is detailed in a video and code available here and here, respectively.

LlamaIndex ▷ #general (30 messages🔥):

HuggingFaceEmbedding and IngestionPipeline, MultiModalVectorStoreIndex Default Embeddings, NotionReader Customization, Gemini 2.5 flash and Thinking Feature, GuardRails and Pydantic compatibility

Pipeline Embedding Peculiarities Pop Up: A user found that when defining embed_model via Settings before ingestion, the pipeline does not embed unless the embedding model is also defined within the IngestionPipeline itself.
- The index requires an embedding model for queries during retrieval, so defining it both in Settings and in the pipeline ensures it functions correctly.
MultiModal Defaults Decoding: A user inquired about the default multimodal embeddings used in MultiModalVectorStoreIndex.
- A member responded that the default image embedding model is typically OpenCLIP.
NotionReader’s Nifty Need for New Metadata: A user wants to know if they should fork NotionReader to add more descriptive metadata like filename and file type to improve filtering, since the reader only adds Notion file ID by default.
- The user is trying to improve search accuracy given that files in the database have similar text content.
Gemini 2.5 Flash’s Thinking Feature Can Be Disabled: A user sought to disable the “thinking” feature in Google’s Gemini 2.5 Flash model, finding the documentation unclear.
- A member pointed out that setting thinking_budget to zero (in the thinking config passed via GenerateContentConfig) disables this feature.
GuardRails Glitch stems from Langchain’s Legacy: A user encountered a Pydantic v2 compatibility issue when using GuardRails with LlamaIndex, suspecting a mismatch with LlamaIndex’s Pydantic version.
- A member noted that LlamaIndex uses Pydantic v2, and the error likely originates from Langchain (a GuardRails dependency) still using Pydantic v1, pointing to a relevant pull request.

LlamaIndex ▷ #ai-discussion (4 messages):

Instruction-Finetuning LLMs, TRL usage, Small Language Model

Fine-Tune with TRL instead of LlamaIndex: A member suggested using TRL instead of LlamaIndex tools for instruction fine-tuning open-source LLMs.
- They added that any existing LLM can be used to create a dataset for distillation training.
Memory constraints on Local and T4 environments: A member warned that training locally or on a T4 GPU will be very memory-constrained.
- They added that such a setup will likely only be able to train very small LLMs with small batch sizes.
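A back-of-the-envelope estimate shows why. Assuming full fine-tuning with Adam in mixed precision, a common rule of thumb is about 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights and two fp32 Adam moments), before counting activations. The helper name is hypothetical, and real usage varies with framework and optimizer.

```python
# Rough per-parameter cost of full fine-tuning with Adam in mixed precision.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

T4_VRAM_GB = 16  # an NVIDIA T4 has 16 GB of memory, some of it reserved

def full_finetune_state_gb(num_params):
    """Approximate GB of weights, gradients and optimizer state.

    Activations are NOT included; they grow with batch size and sequence length."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9

for label, n in [("125M", 125e6), ("1B", 1e9), ("8B", 8e9)]:
    print(f"{label}: ~{full_finetune_state_gb(n):.0f} GB of state vs {T4_VRAM_GB} GB on a T4")
```

By this estimate a 1B-parameter model already consumes the whole card on state alone, which matches the warning that only very small models and batch sizes are practical.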
Building SLM from Scratch: A member expressed interest in building a Small Language Model (SLM) from scratch following rasbt’s LLMs from Scratch book.
- They described it as a fun learning experience.

Latent Space ▷ #ai-general-chat (35 messages🔥):

Copilot Agents, Image Generator Bias in Sora, AI and Randomness, Collaboration in AI App Generation, Anthropic Essay by Dario Amodei

Microsoft Eyes Copilot Agents: Satya Nadella teases Copilot Agents in a tweet, sparking discussion - link.
Sora’s Hidden Bias and Aesthetic Laundering: A user is experimenting with image generator bias in Sora, revealing a consistent left-to-right bias across the platform as shown in this image.
- They are also writing about aesthetic laundering, staging scenes to exploit image generator biases, resulting in images that may violate TOS.
AI’s Randomness Reveals Preferences: A coder explores randomness with AI, noting that the AI tends to choose Knowledge and Liberty most often when prompted with random options.
- The coder hypothesizes this reveals inner preferences rather than actual randomness, influencing results in coding questions.
New Dario Essay: Dario Amodei released a new essay - link.

Modular (Mojo 🔥) ▷ #general (7 messages):

RAG Example, Meetup Tonight, Library request

New RAG Example Excites Mac Users: Members expressed excitement about the new RAG example (YouTube link), specifically regarding its potential use on beefy Macs with UMA.
- One member inquired about similar tutorials for RAG on Mac CPU/GPUs.
Mojo MAX Meetup: Members noted that attending the meetup tonight might offer the chance to take a pic with MAX and Mojo.
- This prompted one member to lament living far away, writing “wishing I lived in the bay area again!”, and to ask about plans for a meetup in SoCal.
Community Library Request: A member offered their time to create a desired library or thing in the Mojo community.
- They asked whether there is a library or other project the Mojo community wants that they could fill their time with.

Modular (Mojo 🔥) ▷ #mojo (14 messages🔥):

NVIDIA CUDA-Python release vs Mojo, Alternatives to CUDA, Mojo Library Wishlist, @always_inline version of constrained

CUDA-Python Release Spurs Speculation: Members discussed if the new NVIDIA CUDA-Python release was in response to Mojo.
- One member said it was because most AI devs really don’t want to learn CUDA.
CUDA-Python won’t replace C++: Members predict that CUDA-Python will not replace C++ outside of prototyping, due to the shortcomings of Python.
- Others welcomed the library but were skeptical about it replacing C++.
Community Expresses Desire for Mojo Libraries: A member who has an absurd amount of free time asked if there was a desired library in the Mojo community.
- The member offered to spend the time building it.
Requests arise for @always_inline("builtin") version of constrained: A member inquired about an @always_inline("builtin") compatible version of constrained.
- Another member pointed to a proposal for a requires clause, which may be what you want.
Mojo Feature Wishlist Discussion: With things starting to get fairly well built out/tested, one member offered to compile a feature wishlist for the Mojo community.
- The discussion included a link to Kelvin.

tinygrad (George Hotz) ▷ #general (13 messages🔥):

Tensor Deallocation Issues, GlobalCounters Memory Tracking, Visualizations of tinygrad, OS memory collection

Tensors Vanish with GlobalCounters Tracking!: A user sought to ensure tensors are completely removed from memory in Tinygrad, noting that del alone wasn’t sufficient; they tried del buffers[x.lazydata] and del x, and also tried calling deallocate() on each of the buffers.
- Another member suggested using tinygrad.helpers.GlobalCounters.mem_used to track memory from Tinygrad’s perspective, as it adjusts during buffer allocation/deallocation.
Reproducible Tensor Leak?: A user provided a reproducible script to demonstrate a potential tensor deallocation issue in Tinygrad.
OS eats memory?: Members discussed how the OS might not immediately collect freed memory, which could explain why memory usage doesn’t instantly decrease after tensor deletion.
- The suggestion was made that RandN is slower and has more overhead.
viz is awesome!: A user said: the viz is awesome.

tinygrad (George Hotz) ▷ #learn-tinygrad (1 messages):

cookiecrumbs3808: Precisely what I needed. Thanks!

LLM Agents (Berkeley MOOC) ▷ #mooc-questions (6 messages):

AgentX resources, Lambda services, AI Pair Programming Tools

Confusion over AgentX Resources and Lambda Services: A participant expressed confusion about AgentX resources and a Lambda service offering a discount, assuming it was related to the AgentX competition.
- The organizer clarified that the Lambda offer was unrelated to the competition, as team information hadn’t been shared with sponsors yet, and it’s likely an unrelated Lambda promotion.
Clarification on AgentX Resources from Lambda: The organizer detailed that Lambda is offering $100 serverless API credits per team member and $400 GPU credits to 50 select teams for the AgentX competition.
- Teams can expect to hear back about AgentX resources within approximately a week.
Tableau EM Offers Contribution on AI Pair Programming: An Engineering Manager (EM) at Tableau (Salesforce), involved in building a local AI pair programming tool, offered to contribute to the course.
- The EM expressed interest in sharing insights on how AI pair programming assistants work, security considerations, and maximizing the tools, with experience using Cursor and GitHub Copilot.

DSPy ▷ #show-and-tell (1 messages):

dbreunig: https://www.dbreunig.com/2025/04/18/the-wisdom-of-artificial-crowds.html

DSPy ▷ #general (4 messages):

Event-driven development, DSPy and MCP integration, Limitations of DSPy

Embracing Event-Driven Development: A member expressed enthusiasm for event-driven development, indicating they are currently implementing it for several projects.
- The member stated: “Nothing like event-driven development! Currently on a similar push for a few things.”
Decoding DSPy’s Blackbox Training: A member noted the training process in DSPy is designed as a “blackbox” for regular users, recommending reading papers and source code for in-depth understanding.
- They mentioned: “The training seem to be a blockbox because the regular user don’t need to know about, if you want to know how it works, read the papers and the source code, really it is not that complicated if you come from an AI background.”
Clarifying DSPy and MCP Integration: A member clarified that MCP is not an orchestration framework and DSPy can be integrated by implementing workflows/agents as server-side tools for MCP.
- They added: “If you want to use DSPy… with MCP… DSPy allow you to implement workflows/agents as tool for MCP, so actually server-side functions not client side.”
DSPy Scalability: A member suggested that the current limitations of DSPy revolve around optimization for scalability and robustness, not the lack of MCP support.
- They noted: “About the current limitations of DSPy, really they are nothing to do with the lack of MCP support, but more about optimization for scalability and robustness, if you can make a FastAPI endpoint with DSPy, then implementing a MCP server is trivial.”

Cohere ▷ #「💬」general (3 messages):

HuggingFace Infrastructure, Cohere AI Scholars Program 2026

HuggingFace Infrastructure Questioned: A member inquired whether a specific infrastructure question should be directed to Hugging Face since it’s their infrastructure being connected to.
- The question implies that the user is facing connectivity issues related to Hugging Face when using a different service called Kati Patang.
Cohere’s Scholars Program 2026?: A member asked if Cohere is planning an AI Scholars Program in 2026.
- No response was given in the available messages, but it indicates community interest in future educational programs by Cohere.

Cohere ▷ #「💡」projects (1 messages):

Hugging Face Inference API, Flask integration, Model deployment

Inference API Connects to Flask Website: A user inquired about connecting a model hosted on Hugging Face’s paid inference API to a personal website built with Flask.
- Recommendations involved using API keys for authentication and dispatching POST requests from the Flask application to the Hugging Face API endpoint to fetch model predictions.
Flask <-> HuggingFace: A Tale of Two APIs: The primary approach involves leveraging the Hugging Face Inference API, which requires sending appropriately formatted requests from your Flask backend.
- Specifically, you’d set up API endpoints in your Flask app that receive data, forward it to the Hugging Face API with your API key, and relay the response back to the user’s browser.
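A minimal sketch of that forwarding step, using only the Python standard library so it runs without Flask installed. The endpoint pattern is an assumption based on Hugging Face's serverless Inference API and may have changed; for a dedicated (paid) Inference Endpoint, substitute that endpoint's own URL. The helper names and the {"inputs": ...} payload shape are illustrative; check the model's task documentation for the exact request format.

```python
import json
import urllib.request

# Assumed endpoint pattern; verify against current Hugging Face docs.
HF_API_URL = "https://api-inference.huggingface.co/models/{model_id}"

def build_inference_request(model_id, inputs, api_token):
    """Build the POST request a Flask view would forward to Hugging Face."""
    return urllib.request.Request(
        HF_API_URL.format(model_id=model_id),
        data=json.dumps({"inputs": inputs}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def query_model(model_id, inputs, api_token, timeout=30):
    """Send the request and return the decoded JSON prediction."""
    req = build_inference_request(model_id, inputs, api_token)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

# In a Flask app (not imported here), a route would call query_model with the
# user's input and a token read from an environment variable, e.g.:
#
#   @app.post("/predict")
#   def predict():
#       text = request.get_json()["text"]
#       return jsonify(query_model(MODEL_ID, text, os.environ["HF_TOKEN"]))
```

Keeping the token on the server side, rather than calling Hugging Face from browser JavaScript, is the main reason to route requests through Flask at all.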

Cohere ▷ #「🤝」introductions (1 messages):

Introductions, Community Welcome

Cohere Greets New Community Members: A stickied message welcomes new members to Cohere’s Community Discord server.
- New users are encouraged to introduce themselves by providing their company/industry/university, current projects, favorite tech/tools, and community expectations.
Guidance for Introductions: A template is provided asking new members to share background.
- The template asks for Company/Industry/University, What you’re working on, Favorite tech/tools you use, and What you hope to gain from this community.

Torchtune ▷ #dev (4 messages):

Pytorch Utils, OOM Error, DP Replicates

Torchtune aligns with Pytorch Utils: Torchtune will remain aligned with Pytorch utils for now.
- The team will re-evaluate if it becomes unruly and they want to put it in a specific dataclass or something.
OOM Error Surfacing During Benchmarking: During benchmarking on main, a member encountered an OOM error when using multiple nodes via dp replicates in a situation where a single node did not OOM.
- Dropping to LLaMA 3.1 8B helped replicate the higher memory usage issue, and using dp_shards=16 and dp_replicates=1 lowered memory usage as expected (but is obviously slower).
Increased Memory Usage with DP Replicates: The team seeks insights into why using dp replicates requires more memory, as the reason isn’t immediately apparent.
- The relevant PR is here.
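For context on why dp_shards=16 with dp_replicates=1 lowered memory “as expected”: assuming these settings map to hybrid sharded data parallelism, weights, gradients and optimizer state are split across the shard dimension and copied across the replicate dimension, so per-GPU state scales with 1/dp_shard regardless of the replicate count. A rough sketch, assuming the ~16 bytes/parameter mixed-precision Adam rule of thumb and about 8B parameters, and ignoring activations and communication buffers; the helper is hypothetical.

```python
def hsdp_layout(num_params, dp_shard, dp_replicate, bytes_per_param=16):
    """Return (GPUs used, approximate GB of sharded state per GPU).

    State is split across dp_shard ranks; each of the dp_replicate groups
    holds an identical copy, so dp_replicate does not change per-GPU state."""
    world_size = dp_shard * dp_replicate
    per_gpu_gb = num_params * bytes_per_param / dp_shard / 1e9
    return world_size, per_gpu_gb

for shard, rep in [(8, 1), (8, 2), (16, 1)]:
    world, gb = hsdp_layout(8e9, shard, rep)
    print(f"dp_shard={shard}, dp_replicate={rep}: {world} GPUs, ~{gb:.0f} GB state per GPU")
```

By this estimate, adding replicas should not raise per-GPU state memory, which is why the higher usage observed with dp replicates is surprising.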

Nomic.ai (GPT4All) ▷ #general (3 messages):

GPT4All Web Search Plugin, Mixtral-8x7B-Instruct-v0.1 on RTX 3090, 3Simplex/Meta-Llama-3.1-8B-Instruct-gguf

GPT4All Ponders Plugin for Web Search: A user suggested adding a web search plugin to GPT4All to enhance the context window with real-time information.
- The user referenced past discussions between members regarding web search capabilities, inquiring about potential plans for implementation.
Mixtral-8x7B-Instruct-v0.1 Shines on RTX 3090: A user reported running the Mixtral-8x7B-Instruct-v0.1 model successfully on an RTX 3090, noting its impressive performance with 46B parameters at q4_0 quantization.
- Another user expressed interest in trying the model and asked about its VRAM requirements, given the card’s 24 GB.
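The VRAM question comes down to arithmetic. In llama.cpp's q4_0 format, each block of 32 weights stores 4-bit values plus one fp16 scale, i.e. 18 bytes per 32 weights (4.5 bits per weight). Assuming all of Mixtral-8x7B's roughly 46.7B total parameters were stored that way (in practice some tensors use other types, so this is approximate), the weights alone exceed the RTX 3090's 24 GB before any KV cache is allocated, so the full model likely cannot sit entirely in VRAM. The helper name is hypothetical.

```python
Q4_0_BLOCK_WEIGHTS = 32
Q4_0_BLOCK_BYTES = 32 * 4 // 8 + 2  # 16 bytes of 4-bit values + 2-byte fp16 scale = 18

def q4_0_weight_bytes(num_params):
    """Approximate bytes for weights stored entirely in q4_0 blocks."""
    blocks = -(-int(num_params) // Q4_0_BLOCK_WEIGHTS)  # ceiling division
    return blocks * Q4_0_BLOCK_BYTES

mixtral_params = 46.7e9  # approximate total parameter count, all 8 experts
size_gb = q4_0_weight_bytes(mixtral_params) / 1e9
print(f"~{size_gb:.1f} GB of weights vs 24 GB of VRAM on an RTX 3090")  # ~26.3 GB
```

Note that an MoE model needs all experts resident even though only two are active per token, so the total parameter count, not the active count, sets the memory footprint.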
GPT4All Mirrors Meta-Llama-3.1-8B-Instruct-gguf: A user claimed to reproduce the output of 3Simplex/Meta-Llama-3.1-8B-Instruct-gguf using a proposed system prompt within GPT4All.
- The user didn’t share further details about the specific prompt or the comparison methodology, which would have strengthened the impact of the observation.

Codeium (Windsurf) ▷ #announcements (1 messages):

Windsurf subreddit, Community engagement, Project sharing

Windsurfers launch official Subreddit: The community team launched a new official subreddit called r/Windsurf! for community members.
- The subreddit is intended to allow members to post your projects, learn from other builders, and engage with the community!
Share Your Surf!: Members are encouraged to post projects and learn from each other on the new subreddit.
