Gemini is all you need.

AI News for 5/19/2025-5/20/2025. We checked 9 subreddits, 449 Twitters and 29 Discords (215 channels, and 7031 messages) for you. Estimated reading time saved (at 200wpm): 622 minutes. Our new website is now up with full metadata search and beautiful vibe coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on @smol_ai!

Twelve months ago we covered Google I/O, but if we’re being honest Gemini wasn’t quiiite frontier yet and it was somewhat overshadowed by 4o’s launch.

Six months ago we wrote that Google wakes up with Gemini 2.0, and that began an epic multi-month run of increasing Gemini dominance (even adopting the AINews chart):

[image: gemini]

and today that run was confirmed by official numbers from the Gemini team (though much of this is helped by Gemini having the most generous free tier in the world):

[image: gemini]

The AI Twitter recap below does a pretty good job of covering the major launches, so we won’t redo it here, but we’d note that it missed the launch of Jules (Gemini’s Codex/Devin competitor), likely because Jules was somewhat pre-leaked.

As always, The Verge does a great job of condensing the 3-hour keynote into 30 minutes:

[video: The Verge’s keynote recap]


AI Twitter Recap

Google I/O 2025 Event and Announcements

  • Google I/O Event Details and Keynotes: @_philschmid provided a thread summarizing the links needed to watch and participate online: the Main Keynote on May 20th at 10am PT (7pm CEST), the Developer Keynote at 1:30pm PT, and AI Stage events scheduled for both May 20th and 21st.
  • Google’s AI Progress Since Last I/O: @Google noted that since the last Google I/O, they have announced over a dozen models and research breakthroughs and released over 20 major AI products and features. They also unveiled Project Astra last year (@Google, @GoogleDeepMind).
  • AI-Driven Transformations and Developer Opportunities: @Google quoted Sundar Pichai, noting the significant opportunities with AI and the role of developers in ensuring its benefits reach many people. @_philschmid stated that they shipped two minor QoL updates for AI Studio: a built-in usage dashboard and a new GenMedia “Playground”.
  • AI Overviews & Search Improvements: @Google announced the launch of AI Mode to everyone in the U.S., building on the success of AI Overviews, which have led to happier users and more frequent searches. Also, Gemini 2.5 is coming to Search (@Google). Google Search is bringing generative AI to more people than any other product in the world (@Google). @Google previewed what’s coming soon to AI Mode, such as personalized suggestions, complex analysis, deep search, agentic capabilities, and Search Live.
  • Gemini Updates: @demishassabis said that their ultimate vision for the GeminiApp is to transform it into a universal AI assistant, a key milestone on the road to AGI. Some users are experiencing problems upgrading (@hkproj).
  • Gemini 2.5 Pro and Flash Models: @GoogleDeepMind announced Deep Think in 2.5 Pro, an enhanced reasoning mode using parallel thinking techniques. @jack_w_rae highlighted that Deep Think marks the progression to greater test-time compute and stronger reasoning capabilities in Gemini. @GoogleDeepMind mentioned stronger security and increased transparency for what the model is thinking. It was found that Gemini 2.5 is now the leading model for learning (@GoogleDeepMind). @omarsar0 noted that Gemini 2.5 Flash is now crushing with style. Also, the Flash version is using fewer tokens for the same performance (@GoogleDeepMind).
  • Gemini Diffusion Model: @GoogleDeepMind announced Gemini Diffusion (@GoogleDeepMind), and the model can generate 5x faster than 2.0 Flash-Lite (@omarsar0). It is currently available as an experimental demo to help shape future models (@GoogleDeepMind).
  • Veo 3 Video Generation Model: @GoogleDeepMind introduced Veo 3, a new generative video model that adds soundtracks to the clips you make, allowing you to create talking characters and include sound effects. @timbrooks shared it.
  • Imagen 4 Image Generation Model: @GoogleDeepMind announced Imagen 4, an image generation model with richer images, nuanced colors, intricate details, and superior typography. It can create comics, stylized stamps, packaging, and more with improved spelling (@GoogleDeepMind).
  • Project Astra and Gemini Live: @GoogleDeepMind reviewed the improvements to Project Astra, including better voice output, memory, and computer control, making it more personalized and proactive. @GoogleDeepMind confirmed that Gemini Live camera and screen sharing in GeminiApp is available on Android and rolling out to iOS.
  • Agent Mode: @Google noted that they are starting to integrate agentic capabilities throughout their products, including GoogleChrome, Search, and GeminiApp. Agent Mode in GeminiApp will let you delegate complex planning and tasks to Gemini (@Google).
  • Google Beam (fka Project Starline): The new AI-first video communication platform uses a state-of-the-art AI video model to transform 2D video streams into a realistic 3D experience (@GoogleDeepMind).
  • Android XR and Partnerships: Google announced that glasses with Android XR are lightweight and designed for all-day wear (@Google). They are partnering with Samsung to create the software and reference hardware (@Google).
  • Pricing and Availability: A new Google AI Ultra subscription tier will give access to Gemini 2.5 Pro Deep Think, Veo 3, and Project Mariner (@scaling01). Google is upgrading their AI subscription plans (@Google).
  • Contrarian viewpoints on announcements: @c_valenzuelab found it uninspiring, saying it elicited a yawn.

AI Model Releases, Evaluation, and Analysis

  • DeepSeek V3 details: Research from @deepseek_ai clarifies how DeepSeek-V3 works, walking through its key innovations.
  • Hugging Face’s Tiny Agents: Hugging Face has released Tiny Agents as its own NPM package, featuring lightweight composable agents built on Hugging Face’s Inference Client and MCP stack (@_akhaliq).
  • Model Merging in Pre-training: @iScienceLuvr highlighted a study on model merging during LLM pre-training, demonstrating that merging checkpoints from the stable training phase produces consistent and significant performance improvements. (A minimal checkpoint-averaging sketch appears after this list.)
  • Adjoint Sampling by Meta: Meta AI introduced Adjoint Sampling, a new learning algorithm that trains generative models based on scalar rewards, which can become the foundation for further research into highly scalable sampling methods (@AIatMeta).
  • KernelLLM 8B by Meta Exceeds GPT-4o in Single-Shot Performance: Meta released KernelLLM 8B on Hugging Face, and it exceeds models such as GPT-4o and DeepSeek V3 in single-shot performance (@reach_vb).
  • NVIDIA’s Cosmos-Reason1-7B: NVIDIA released Cosmos-Reason1-7B, a new vision reasoning model for robotics. It is the first reasoning model for robotics based on Qwen 2.5-VL-7B (@mervenoyann).
  • AniSORA Model: Bilibili dropped AniSORA on Hugging Face, an anime video generation model, Apache 2.0 licensed (@reach_vb).
  • Stability AI Releases Stable Audio Open Small: @mervenoyann noted the release of stable-audio-open-small, a new text-to-audio model.
  • MMLongBench for Long-Context Vision-Language Models: MMLongBench benchmarks long-context vision-language models effectively and thoroughly (@_akhaliq).
  • Marin: Open Lab for AI Development: @percyliang introduced Marin, an open lab, to fulfill the vision for open-source AI, and @TheAITimeline shared. Marin repurposes GitHub, which has been successful for open-source software, for AI (@percyliang).
  • OMol25 and UMA: Meta AI released Open Molecules 2025 (OMol25) (@AIatMeta). @ClementDelangue said that Meta AI just released OMol25 on @huggingface.
  • Insights into DeepSeek-V3: This paper presents insights into DeepSeek-V3 and reflections on hardware for AI architectures (@TheAITimeline). DeepSeek is now the benchmark for Nvidia (@teortaxesTex).
  • LMArena Leaderboard: A new version of Gemini-2.5-Flash climbs to #2 overall in chat (@lmarena_ai). It was also found that Mistral Medium 3 makes a strong debut with the community (@lmarena_ai).
  • Code Generation Models Leaderboard: DeepCoder-14B-Preview is a code generation model that competes with top reasoning models like OpenAI’s o1 and DeepSeek-R1, but at a fraction of their size (@DeepLearningAI).
  • Runway References and Image Generation: Here is a new workflow for Gen-4 References: Element extraction and composition (@c_valenzuelab).
  • Improving Assembly Code Performance with LLMs via RL: This paper discusses how to improve assembly code performance with LLMs via reinforcement learning (@_akhaliq).
  • Group Think paper: The paper is about multiple concurrent reasoning agents collaborating at token-level granularity (@_akhaliq).
  • Improving Factuality in LLMs: Scaling Reasoning can Improve Factuality in Large Language Models (@_akhaliq).
  • Study of Data Augmentation: @teortaxesTex noted another Seed paper, this one on data augmentation.
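
On the model-merging item above: as a concrete (if simplified) picture of checkpoint merging, here is a minimal sketch that uniformly averages PyTorch state dicts. It assumes checkpoints saved as plain state_dicts and is not the paper’s actual recipe, which may weight or select checkpoints differently.

```python
import torch

def average_checkpoints(paths):
    """Uniformly average parameters across several training checkpoints.

    Minimal illustration only; the paper's actual merging scheme
    (weighting, which checkpoints to pick, etc.) may differ.
    """
    merged = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if merged is None:
            merged = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                merged[k] += v.float()
    return {k: v / len(paths) for k, v in merged.items()}

# Example: merge a few checkpoints from the stable phase of pre-training,
# then load the result back into the model.
# merged_state = average_checkpoints(["step_90000.pt", "step_95000.pt", "step_100000.pt"])
# model.load_state_dict(merged_state)
```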

AI in Robotics, Agents, and Automation

  • NVIDIA’s Physical AI reasoning models: NVIDIA open-sourced Physical AI reasoning models that understand physical common sense and generate appropriate embodied decisions (@reach_vb).
  • Project Mariner: @GoogleDeepMind made updates to Project Mariner, their research prototype that can interact with the web and get things done.
  • DreamGen for Robot Learning: NVIDIA GEAR Lab introduced DreamGen, a new engine that scales up robot learning not with fleets of human operators, but with digital dreams in pixels (@DrJimFan).
  • Agentic DevOps with GitHub Copilot: GitHub Copilot now supports the entire software development lifecycle – from planning and implementation to updates, tests, and debugging (@TheTuringPost).
  • Azure AI Foundry Agent Service: Azure AI Foundry Agent Service is now generally available, and it comes with first-class LlamaIndex support (@llama_index).

Company Partnerships, Investments, and Business Applications

  • Cohere’s Partnerships with Dell and SAP: Cohere announced partnerships with Dell to offer Cohere North on-premises (@cohere) and SAP to power enterprise automation (@cohere).
  • Sakana AI and MUFG Bank: Sakana AI and MUFG Bank, the largest bank in Japan, have signed a comprehensive partnership agreement (@SakanaAILabs). Mitsubishi UFJ Financial Group’s banking unit has hired startup Sakana AI to power its systems with AI (@hardmaru).
  • Klarna and OpenAI: @kevinweil noted a partnership with OpenAI and Box.

Techniques, Tools, and Tutorials

  • Data Quality and LLM Training: @cwolferesearch shared a practical guide for debugging an LLM’s training dataset, emphasizing the importance of data quality.
  • Hugging Face Hub Enhancements: The Hugging Face Hub now auto-magically formats chat/reasoning messages in an interactive viewer (@_lewtun).
  • LlamaIndex Updates: The LlamaIndex team is hosting its first Discord office hours session (@llama_index).
  • Microsoft’s Open-Sourcing Efforts: Microsoft open-sourced GitHub Copilot in Visual Studio Code, Natural Language Web (NL Web), TypeAgent, Windows Subsystem for Linux (WSL), and Edit command-line text editor (@TheTuringPost).
  • Together AI’s Code Execution Products: Together AI launched two new products to bring code execution & dev environments to AI apps: Together Code Sandbox and Together Code Interpreter (@togethercompute).
  • Hugging Face Hub and MLX Integration: Two new MLX + Hugging Face hub integrations make it easier than ever to get started running models locally (@awnihannun).
  • Structured outputs in the API: In LLM APIs, structured outputs just got even more structured - including support for regex! (@stevenheidel). (A hedged illustration of what this can look like appears after this list.)
  • LangGraph Platform Support for MCP: LangGraph Platform Now Supports MCP. Every deployed agent on LangGraph Platform now exposes its own MCP endpoint (@LangChainAI).
  • Am I the only one wishing there was a method to call these two in one line?: @gabriberton pointed to the common wish for an optimizer.step_and_zero_grad() helper that fuses optimizer.step() and optimizer.zero_grad() into a single call, for efficiency. (A minimal wrapper sketch appears after this list.)
  • AI Agents with Google Gemini: @_philschmid points to a blog on how to get started building AI agents with Google Gemini.
  • Learn how to build lightweight, real-time AgnoAgi agents for medical and legal tasks: A @qdrant_engine tutorial covers modular agents and techniques to keep resource usage low.
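
On the structured-outputs item above: purely as an illustration of regex-constrained outputs, here is a sketch that puts a JSON-Schema `pattern` (regex) constraint on one field using an OpenAI-style client. The model name is a placeholder, and whether your provider accepts `pattern` under strict structured outputs is an assumption to verify against its docs.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical schema: force "version" to look like a semantic version via regex.
schema = {
    "type": "object",
    "properties": {
        "package": {"type": "string"},
        "version": {"type": "string", "pattern": r"^\d+\.\d+\.\d+$"},
    },
    "required": ["package", "version"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4.1-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Return the latest numpy release as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "package_info", "strict": True, "schema": schema},
    },
)
print(response.choices[0].message.content)  # JSON whose "version" field matches the regex
```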
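
And on the step_and_zero_grad wish above: PyTorch does not ship such a method, but a thin wrapper gives you the one-liner today. The class and method names below are ours, purely illustrative.

```python
import torch

class FusedStepOptimizer:
    """Thin wrapper adding a step_and_zero_grad() one-liner.

    Purely illustrative; this is not a PyTorch API.
    """

    def __init__(self, optimizer: torch.optim.Optimizer):
        self.optimizer = optimizer

    def step_and_zero_grad(self, set_to_none: bool = True):
        self.optimizer.step()
        self.optimizer.zero_grad(set_to_none=set_to_none)

    def __getattr__(self, name):
        # Delegate everything else (param_groups, state_dict, ...) to the wrapped optimizer.
        return getattr(self.optimizer, name)

# Usage in a training loop:
# opt = FusedStepOptimizer(torch.optim.AdamW(model.parameters(), lr=3e-4))
# loss.backward()
# opt.step_and_zero_grad()
```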

Political, Ethical, and Philosophical Musings

  • US Negative-Sum Game with China: @teortaxesTex discusses how the US is trying to play a negative-sum game with China, denying them resources to scale while the US loses money and talent. The goal is a sufficient relative disparity, at a sufficient absolute capability level, to make US supremacy durable.
  • Openness of AI: America’s historical technological leadership wasn’t built on protectionism and closed systems, but on creating a dynamic marketplace of optionality, including open platforms the world could build upon (@ClementDelangue).

Humor and Miscellaneous

  • “no discourse”: @EigenGender shared they are obsessed with the idea of saying “no discourse” like the 2025 version of “no homo”.
  • “overcooked burger scorches the cheese”: an ancient Chinese proverb about showing grace to losers, probably, says @teortaxesTex.
  • The leaderboard that measures the number of times they’ve said “AI” in the keynote: Looks like they’ve got a new front-runner (@Google).
  • @jxmnop shares a funny story about a friend who worked at a French LLM startup.
  • “Says basic, sensible things instead of being a drooling catechetical midwit: «Invest in R&D to innovate». «Hire smart people». «Silicon Valley is cool, it innovates and isn’t afraid of anything». «Wash your hands». The Madman đŸ€Ż how can America be expected to compete đŸ˜©” writes @teortaxesTex.
  • “It’s funny that the political compass is actually more like a sphere” according to @teortaxesTex
  • @arankomatsuzaki jokes that needing caffeine is a skill issue.
  • @demishassabis says they’re cooking up something tasty for tomorrow.


AI Reddit Recap

/r/LocalLlama Recap

1. Gemma 3n Model Announcements and Community Reactions

  • Gemma 3n Preview (Score: 333, Comments: 92): Google’s new preview release of the Gemma 3n family (Hugging Face release) introduces efficient multimodal models specifically designed for edge and low-resource devices. They leverage selective parameter activation (akin to Mixture-of-Experts, or MoE), enabling models to function with an effective parameter count of 2B or 4B—despite a larger total parameter set—for optimized inference on constrained hardware. The models support inputs across text, image, video, and audio, are instruction-tuned, and cover over 140 languages; further technical detail is hosted on the Gemma 3n documentation page. Commenters note that the architecture deviates from typical transformer layouts, speculate similarity to the Gemini architecture (especially regarding multimodality and extended context), and highlight potential for privacy-preserving local deployments (e.g., HomeAssistant alternatives).
    • Gemma 3n uses ‘selective parameter activation technology’, a form of Mixture of Experts (MoE), to allow edge-device-ready deployment with resource footprints equivalent to models with effective parameter sizes of 2B and 4B, despite having higher total count (e.g., E2B has ~5B total parameters). This technology is intended to balance efficiency and performance, enabling multimodal capabilities—including text, image, video, and audio input—and operation in over 140 languages.
    • The architectural design appears heavily inspired or directly based on the Gemini Nano series, which is noted for its strong multimodal and extended context handling. Technical speculation suggests Gemma 3n splits its MoE gating not just by general-purpose but potentially by modality (i.e., activating expert sub-networks per input type), which could improve efficiency and specialization for processing diverse input types.
    • Official Gemma 3n docs confirm E2B’s typical active parameter count and reinforce the model’s unique modality-based structure. The model’s edge-oriented design (for systems like Home Assistant or DIY Alexa) emphasizes on-device privacy and data localization, unlike conventional cloud models. (A toy sketch of the speculated per-modality routing appears after this list.)
  • Announcing Gemma 3n preview: powerful, efficient, mobile-first AI (Score: 159, Comments: 25): Google has announced the preview of Gemma 3n, an AI model architected for efficient, real-time on-device inference on mobile-class hardware. Utilizing a 5B/8B parameter design but leveraging DeepMind’s Per-Layer Embeddings (PLE), KVC sharing, and advanced quantization, Gemma 3n achieves RAM efficiency comparable to smaller models, supports dynamic submodels (MatFormer), and handles multimodal tasks (audio, image, video, text). Benchmarks show strong multilingual results (e.g., 50.1% ChrF on WMT24++), and the privacy-first design allows for offline use on Android and Chrome; early access is provided via Google AI Studio and Google AI Edge. Official blog. Top technical comments express skepticism about Google’s benchmarking claims against Claude Sonnet 3.7 and question Arena Score reliability for such comparisons. Others discuss real-world use cases like local smart speaker inference, emphasizing the importance of latency over accuracy and the potential to further reduce response time and even eliminate the need for separate speech-to-text (STT) components like Whisper.
    • Skepticism exists about the claim that the 4B parameter Gemma 3n preview matches Claude Sonnet 3.7. Commenters highlight the need for independent benchmarks, referencing previous Google demos they felt were misleading regarding real-time performance and statistical representation.
    • A user reports using Whisper plus Gemma3 4B for local smart speaker applications and achieving approximately 3-second response times, emphasizing that speed is prioritized over ultimate accuracy in this context. They speculate the new Gemma 3n could further improve latency or enable direct inference without a separate speech-to-text (STT) component like Whisper, potentially streamlining edge deployment.
    • Some criticize the use of Chatbot Arena scores as a quality indicator, suggesting “comparing [Gemma 3n preview] to Claude Sonnet 3.7 is ridiculous” due to presumed performance disparities, and argue Gemma 3n’s real advantage may be in mobile or edge applications rather than head-to-head capability with leading frontier models.
  • ok google, next time mention llama.cpp too! (Score: 142, Comments: 25): The image is from a Google presentation showcasing ‘Gemma 3n In preview’. Subtitles on the screen refer to ‘Ollama, UnSloth and others,’ apparently giving public acknowledgment to these popular frameworks/tools in the open-source language model ecosystem. The post’s context is that while Google gave shoutouts to Ollama and UnSloth—both widely used for running and fine-tuning models like Gemma—the omission of ‘llama.cpp’, another major inference library, sparked discussion. The technical significance is the perceived importance of community-driven open-source tooling and Google’s recognition of specific projects during mainstream announcements. Commenters emphasize that UnSloth’s mention is well-deserved due to its utility and developer quality, while others debate the lack of acknowledgment for llama.cpp, reflecting ongoing discussion in the community about recognition and credit among open-source projects.
    • Some participants note the omission of llama.cpp from Google’s acknowledgments despite its technical importance, suggesting possible reasons related to project visibility or industry relationships, and highlighting how it underpins significant inference advancements for running LLMs locally and efficiently.
    • There is technical discussion about the inclusion of Unsloth by Google, with users highlighting that Unsloth enables considerably easier and faster model training on consumer hardware, making it a standout tool for practitioners looking to fine-tune models.
    • A question is raised regarding the availability and labeling of gemma 3n in Ollama, with scrutiny over checkpoint terminology and clarity for users seeking to utilize the latest models, indicating a need for better versioning transparency on the platform.
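
On the per-modality expert-routing speculation above: since this is community guesswork rather than a documented design, the following is only a toy sketch of the general idea of gating experts by input modality. It does not reflect Gemma 3n’s actual architecture, and every dimension is made up.

```python
import torch
import torch.nn as nn

class ModalityRoutedMoE(nn.Module):
    """Toy sketch of routing tokens to modality-specific expert groups.

    Community speculation only; not Gemma 3n's actual architecture.
    """

    def __init__(self, d_model=256, experts_per_modality=2,
                 modalities=("text", "image", "audio")):
        super().__init__()
        self.experts = nn.ModuleDict({
            m: nn.ModuleList([nn.Linear(d_model, d_model)
                              for _ in range(experts_per_modality)])
            for m in modalities
        })
        self.gates = nn.ModuleDict(
            {m: nn.Linear(d_model, experts_per_modality) for m in modalities})

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # Only the experts for this input's modality are activated, so the
        # *effective* parameter count per forward pass stays well below the total.
        weights = torch.softmax(self.gates[modality](x), dim=-1)           # (B, S, E)
        outputs = torch.stack([e(x) for e in self.experts[modality]], -1)  # (B, S, D, E)
        return (outputs * weights.unsqueeze(-2)).sum(-1)                   # (B, S, D)

# x = torch.randn(1, 16, 256); y = ModalityRoutedMoE()(x, "image")
```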

2. Gemma 3 Technical Updates and Optimizations in llama.cpp

  • Sliding Window Attention support merged into llama.cpp, dramatically reducing the memory requirements for running Gemma 3 (Score: 469, Comments: 76): The latest llama.cpp PR merges Sliding Window Attention (SWA) support, significantly reducing KV cache memory requirements for models like Gemma 3. Implementation introduces llama_kv_cache_unified_iswa, splitting KV caches by SWA/non-SWA layers, with the SWA cache aggressively pruning stale tokens after each batch and limiting advanced cache operations (e.g., context shifting) to minimize token loss. Practical VRAM reduction for Gemma 3 is estimated between 75-80%, making larger context lengths feasible on commodity hardware, but advanced KV-cache features fall back to full-context mode at the cost of higher memory usage. Core KV-cache logic was refactored for modularity and maintainability, and attention methods streamlined across the codebase. Top comments note SWA’s substantial (~75-80%) VRAM reduction and emphasize the performance gain especially for models with high cache usage, like Gemma, though they warn about limited context-shifting capabilities due to iSWA’s inherent design trade-offs—making it best suited to non-long-context-shift workloads (e.g., RAG).
    • Sliding Window Attention (SWA) implementation for llama.cpp reduces VRAM requirements for the KV cache by approximately 75-80% (from the original estimate and based on PR comments), making it a major optimization for running large models such as Gemma. This change enables much longer context windows without a proportional increase in memory footprint.
    • A critical technical caveat is that the iSWA approach for Gemma currently does not support KV cache context shifting, which may affect tasks requiring dynamic context windows. However, for applications like retrieval-augmented generation (RAG), the reduced memory usage significantly boosts performance.
    • Real-world benchmarks show that users can now offload more layers (e.g., from 27 to 39 layers on a 27B q4 quantized model), achieving higher speed and much larger token windows within the same hardware constraints, which substantially enhances usability for high-context workloads. (A back-of-the-envelope KV-cache calculation after this list makes the ~75-80% savings concrete.)
  • Google MedGemma (Score: 173, Comments: 51): Google has released MedGemma, a collection of specialized Gemma 3 model variants for medical AI tasks, detailed in their official Hugging Face release. The lineup includes a 4B multimodal model (incorporating SigLIP-image encoder, pre-trained on de-identified medical images from radiology, histopathology, dermatology, and ophthalmology) and a 27B text-only model, both fine-tuned for clinical data and evaluated on a mix of open and curated medical benchmarks. The models are designed for local inference and further fine-tuning, with a technical report pending release. A notable technical discussion centers on trade-offs: the value of smaller, localizable fine-tuned medical models versus using larger, general models for clinical tasks. Commenters highlight the operational reliability and autonomy (e.g., resistance to disruptive upstream changes) as a key benefit for local deployments, even over highest-possible accuracy approaches.
    • One commenter weighs the trade-off between deploying smaller, fine-tuned local models like MedGemma versus using the highest-capability models. For medical use cases, reliability and top performance may outweigh concerns like latency or resource cost, although local deployment protects against issues from subsequent model changes or updates.
    • Summary details include that Google released MedGemma in two variants: a 4B-parameter multimodal version (with a SigLIP image encoder pre-trained on medical image types) and a 27B-parameter text-only variant specialized for medical text, both reportedly evaluated on open and curated clinical benchmarks. MedGemma supports further fine-tuning for developer-specific healthcare applications, and a technical report is forthcoming.
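
To make the ~75-80% VRAM reduction from sliding window attention concrete, here is a back-of-the-envelope KV-cache calculation. The layer counts, head counts, window, and context length below are illustrative assumptions, not Gemma 3’s exact configuration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_elem=2):
    # K and V each hold ctx_tokens * n_kv_heads * head_dim values per layer (fp16 = 2 bytes).
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem

# Illustrative numbers only (not Gemma 3's exact architecture).
total_layers, swa_layers = 48, 40      # assume most layers use the sliding window
window, context = 1024, 32_768         # sliding-window size vs. full context length
full = kv_cache_bytes(total_layers, 16, 128, context)
iswa = (kv_cache_bytes(swa_layers, 16, 128, window)
        + kv_cache_bytes(total_layers - swa_layers, 16, 128, context))
print(f"full-attention KV cache: {full / 2**30:.1f} GiB")
print(f"iSWA KV cache:           {iswa / 2**30:.1f} GiB ({1 - iswa / full:.0%} smaller)")
```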

3. OpenEvolve and AlphaEvolve System Open Source Implementation

  • OpenEvolve: Open Source Implementation of DeepMind’s AlphaEvolve System (Score: 112, Comments: 9): OpenEvolve is an open-source implementation of DeepMind’s AlphaEvolve, an evolutionary LLM-based agent for discovering and optimizing algorithms across entire codebases. The architecture includes four components: a prompt sampler (context/past-history aware), an LLM ensemble (multiple models via OpenAI-compatible APIs), an evaluator pool (scoring programs with distributed, checkpointed evaluation), and a MAP-Elites-inspired program database. Benchmarks show near-parity with AlphaEvolve on tasks like circle packing (achieving 99.97% of DeepMind’s result) and function minimization, with the evolved programs autonomously discovering advanced algorithmic techniques (e.g., scipy.minimize optimization, simulated annealing). Extensive model benchmarking showed best ensemble results with Gemini-Flash-2.0 and Claude-Sonnet-3.7, and identified Cerebras AI’s API as a significantly faster inference provider for high-generation workloads. Full code and examples are available here. A top comment notes that the approach resembles reinforcement learning at inference time, highlighting the technical novelty versus more static search or pretraining-based code generation. Another technical endorsement notes the framework’s feature completeness compared to prior releases.
    • Specific-Rub-7250 notes that OpenEvolve appears to leverage a reinforcement learning-style approach, but interestingly applies it at inference time, raising questions about how online evolution or adaptation is integrated compared to traditional, offline reinforcement learning pipelines.
    • Green-Ad-3964 points out similarities to genetic algorithms, implying that OpenEvolve’s methodology may involve evolutionary or population-based search mechanisms to optimize models, paralleling concepts in neuroevolution and genetic programming.
    • SquashFront1303 requests clarification on which specific evolutionary algorithm (or alternative) was implemented in place of DeepMind’s proprietary Evolve algorithm, stressing the technical gap and need for open disclosure on this key architectural component to enable reproducibility and comparison. (A simplified, generic sketch of such an evolutionary loop appears after this list.)
  • Mindblowing demo: John Link led a team of AI agents to discover a forever-chemical-free immersion coolant using Microsoft Discovery. (Score: 359, Comments: 57): John Link and team used Microsoft Discovery, leveraging a coordinated set of AI agents (potentially utilizing microsoft/autogen), to identify an immersion coolant formulation that avoids the use of ‘forever chemicals’ typically found in such fluids. However, technical scrutiny in the comments suggests the proposed solutions are chlorofluorocarbons (CFCs), which are historically known to harm the ozone layer. Commenters question the novelty and safety of the discovered solution, expressing concern that it replicates outdated and environmentally harmful chemistry, raising doubts about the practical advancement and validation of the AI-driven discovery.
    • A commenter notes the AI-generated solutions resemble chlorofluorocarbons (CFCs), raising a concern that these compounds are outdated and environmentally harmful due to their well-known negative effects on the ozone layer. This comment questions the novelty and impact of the AI discovery, suggesting the solution might have significant regulatory or environmental drawbacks if CFCs are indeed the primary result.
    • Another technically informed contribution highlights that Microsoft Discovery is leveraging the open source ‘autogen’ framework (GitHub link), which enables multi-agent collaboration among AI systems. This details the technological stack powering the demo and may be of interest to those considering similar agent-based approaches for scientific discovery.
  • Microsoft unveils “USB-C for AI apps.” I open-sourced the same concept 3 days earlier—proof inside. (Score: 337, Comments: 75): llmbasedos, released 16 May under Apache-2.0, is an open-source minimal Linux OS that allows rapid (sub-minute) boot from USB/VM, providing a FastAPI-based MCP (Model Context Protocol) gateway for exposing local system functions to LLMs via JSON-RPC. Its framework allows any script (defined in a 2-line cap.json) to be made callable by LLM apps (ChatGPT, Claude, VS Code), supporting both offline (llama.cpp) and cloud LLMs, and running on Linux, Windows (VM), and ARM devices. Technically, it mirrors Microsoft’s later-announced “USB-C for AI apps” idea, with key features including modular MCP servers (FileSystem, Sync, Mail, Agent), ISO build scripts, and systemd integration; external review confirms support for search/embedding, rclone/job management, IMAP/iCal, workflow execution, and deployment extensibility via Docker/HTTP. Top technical comments note that the “USB-C for AI”/MCP metaphor is not original to Microsoft and has appeared in prior art, with some questioning whether Microsoft used the MCP concept differently. Skepticism is expressed regarding the likelihood and logistics of Microsoft pivoting to an idea from a rapidly released open-source project.
    • Multiple commenters discuss whether the underlying concept—bootable or pluggable AI tools and models via USB or similar abstractions—is fundamentally novel, noting this is an idea likely to emerge independently in the field. For example, MCP’s own website already describes itself as “a USB-C port for AI applications,” suggesting that the analogy and approach may be broadly recognized by practitioners.
    • Some users question the technical novelty of packaging or hosting LLMs (Large Language Models) within Docker images or similar portable frameworks; this is cited as a well-established practice in both Docker’s ecosystem and AI/ML deployment patterns.
    • There is a technical debate as to whether comparing open-source project launch dates with Microsoft’s announcement is meaningful. Several users point out that large organizations like Microsoft would have begun development well before public announcements, making it unlikely that short-term timing or supposed idea copying is relevant from an engineering or product management standpoint.
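
Referring back to the OpenEvolve item above: the sketch below is a deliberately simplified, generic MAP-Elites-style loop with placeholder callables standing in for the LLM ensemble and evaluator pool; it is not OpenEvolve’s actual code.

```python
import random

def evolve(seed_program, llm_mutate, evaluate, descriptor, generations=100):
    """Deliberately simplified MAP-Elites-style loop (not OpenEvolve's actual code).

    llm_mutate(prompt)  -> new program text   (placeholder for the LLM ensemble)
    evaluate(program)   -> fitness score      (placeholder for the evaluator pool)
    descriptor(program) -> hashable cell key  (e.g. (length_bucket, runtime_bucket))
    """
    archive = {descriptor(seed_program): (evaluate(seed_program), seed_program)}
    for _ in range(generations):
        _, parent = random.choice(list(archive.values()))        # sample an elite parent
        child = llm_mutate(f"Improve this program:\n{parent}")   # LLM proposes a rewrite
        score, cell = evaluate(child), descriptor(child)
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)                       # keep the best program per cell
    return max(archive.values())  # (best_score, best_program) across all cells
```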

Other AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. Google Gemini 2.5 Pro & Ultra Model Benchmarks and Features

  • Holy sht (Score: 1202, Comments: 207): The post shares a comparison chart titled ‘Gemini 2.5 Pro Deep Think’ showing Google Gemini 2.5 Pro outperforming OpenAI models in Mathematics (49.4%), Code (80.4%), and Multimodality (84.0%). The benchmark sources are under scrutiny, especially for Mathematics, as commenters note discrepancies with MathArena-reported performance (24.4% for Gemini 2.5 Pro vs the 49.4% claimed). There’s technical skepticism about equivalent testing conditions; MathArena penalizes across several runs, raising questions if Google’s reported results are using a different, potentially more favorable, scoring method. Comments debate the benchmarking methodology, questioning the validity of direct comparisons due to differing test provisions (e.g., multiple runs, penalization on MathArena). There’s interest in alternative benchmarks (e.g., USAMO) and calls for third-party validation such as Paul Allen’s benchmarks.
    • A commenter notes a discrepancy in USAMO benchmark scores: Gemini 2.5 Pro achieves 24.4% on MathArena, yet another source claims 34.5%. This raises questions about consistency and comparability of benchmark results between different runs and reporting methods.
    • Technical discussion highlights how MathArena penalizes inconsistent problem-solving across multiple runs—a model might solve an issue in one run and fail in another, which impacts averaged scores—prompting speculation that published results may reflect either best-run or averaged scores, affecting cross-model comparisons.
    • An attached image shows a direct comparison of model benchmark scores on USAMO, suggesting a need for scrutiny of the evaluation methodology and result transparency, especially since a 48% score is described as ‘stunning’ and may indicate significant progress or possible inconsistencies.
  • New flash. Google won. Don’t know how to feel about it (Score: 738, Comments: 205): The image is a leaderboard ranking AI models by Arena Score, with Google’s Gemini-2.5 models occupying the top spots, surpassing competitors from OpenAI and DeepSeek. This visually highlights Google’s technical progress in large language model (LLM) development, particularly noteworthy as Google’s team originated the transformer architecture underpinning these advances. The comment emphasizing “much more efficient too!!!” implies that Gemini-2.5 achieves superior performance with better computational efficiency, suggesting advances in model scaling and optimization. Comments debate the significance of Google’s win, attributing it to their foundational research (transformers) and scaling abilities, while some express skepticism about Google’s current public reputation due to declining quality in other products (like search).
    • Commenters highlight that Google’s latest model performs just below their own top entry (2.5 Pro), suggesting Google’s advancements have placed them extremely close to or on par with OpenAI’s top offerings in terms of capabilities.
    • There is emphasis on the efficiency of Google’s recent models, with some noting a significant improvement in how computational resources are used compared to previous models—specifically, Google’s ability to scale without external dependencies is called out as a major technical advantage.
    • One discussion points to Google’s deep roots in transformer architecture research and their internal compute resources as fundamental to their success; having pioneered transformers, they’re able to innovate and optimize at a scale few can match.
  • Google doesn’t hold back anymore (Score: 225, Comments: 64): The image presents benchmark scores for “Gemini 2.5 Pro Deep Think” versus unnamed OpenAI models across Mathematics (49.4%), Code (80.4%), and Multimodality (84%) tasks, with Gemini outperforming OpenAI in all compared categories. This underscores Google’s substantial progress in model capability, particularly in math and multimodal tasks, but also draws attention to the high cost of Gemini’s offering ($250/month), as noted in the post and comments. Commenters debate practical value, noting that despite superior benchmark scores, OpenAI’s models (especially o3) may deliver more reliable, consistent results in real-world technical planning and documentation. Several users also note the price disparity between Gemini ($250/mo) and OpenAI ($20/mo) models, questioning the fairness of the comparison.
    • Several users highlight that while models like o3 show strong performance in technical planning and documentation, benchmarks often prioritize coding competency, potentially overlooking use cases such as structured writing where o3 is reportedly more reliable and consistent than offerings from Google or OpenAI in these scenarios.
    • A discussion emerges about the steep pricing disparity, comparing Google’s $250/month to OpenAI’s $20/month, raising questions about value and whether increased subscription cost is justifiable based on performance or capabilities, particularly in non-coding tasks.
    • There are accounts of subjective qualitative differences: some find Gemini 2.5 Pro’s outputs lacking depth or creativity compared to o3, with claims that Gemini performs better at coding (though one user disputes this based on personal experience), and that AI Studio is its only standout feature. Concerns are also raised about Gemini’s ability to provide nuanced or in-depth answers in research-oriented queries.
  • 2.5 Pro gets native audio output (Score: 231, Comments: 22): The image documents an official presentation announcing that “2.5 Pro” (presumably a new or upgraded AI model from Google, as suggested by the large ‘G’ logo) will feature native audio output capability. The phrase “Expressive” on the waveform indicates a focus on high-quality, emotionally nuanced text-to-speech or audio synthesis as part of this update. This suggests significant progress toward more natural, integrated, and possibly real-time audio generation by Google’s language models. Commenters are eager about the feature’s release, questioning current access availability and speculating on the potential for highly realistic voices (e.g., comparing it to Scarlett Johansson’s voice), indicating demand for natural and expressive AI audio output.
    • The main technical focus in the thread concerns whether the new 2.5 Pro model’s native audio output supports not just synthesized voice output, but additional audio features such as sound effects. One commenter specifically asks if native sound FX generation is possible, indicating interest in broader audio synthesis capabilities beyond standard text-to-speech.
    • Another aspect touched on, though less technical, is the model’s ability to provide specific voice outputs—such as those resembling named voices (‘Scarlett Johansson’), suggesting probable technical curiosity about available voice cloning, customization, or adaptation technology within the audio output feature set.
  • $250/mo Google Gemini Ultra | Most expensive plan in AI insudstry ! (Score: 413, Comments: 192): The image provides a comparison between Google’s newly leaked/marketed ‘Google AI Ultra’ plan and the existing ‘Google AI Pro’ subscription. The ‘Ultra’ plan stands out for offering advanced AI features (presumably leveraging Gemini Ultra) at $124.99/month for the first three months and likely $250/month thereafter, representing the highest pricing tier currently visible in the consumer AI subscription space. Notably, it bundles YouTube Premium (valued at ~$19/mo) and a massive 30 TB of storage, alongside exclusive access to advanced tools like Gemini app, Flow, and NotebookLM with increased usage limits compared to ‘Pro’. This signals Google’s intent to position Gemini Ultra both as a premium AI product and a broader ecosystem value proposition akin to or exceeding OpenAI’s ChatGPT Plus tier. Commenters note the bundled value (e.g., ‘YouTube Premium and storage justify some of the cost’) and make direct comparisons to ChatGPT Pro, but point out accessibility issues due to the high price and the lack of a family plan (which some see as an opportunity for Google to improve adoption).
    • Several users highlight that the $250/mo Gemini Ultra subscription includes both YouTube Premium (valued around $19/mo) and 30 TB of storage, positioning the plan similarly to comprehensive offerings like ChatGPT Pro when considering bundled services beyond just AI access.
    • One commenter points out a key limitation: the lack of configurability in the plan. They question the value for specialized users (e.g., those primarily needing coding tools rather than video generation), noting the inefficiency of a bundled price for features they won’t use.
  • $250 per month (Score: 108, Comments: 95): The image presents a subscription model for “Google AI Ultra” at $249.99/month, bundling advanced Gemini capabilities, increased usage quotas for Whisk and NotebookLM, deep integration with core Google services (Gmail, Docs, Chrome), Project Mariner, YouTube Premium, and 30TB of storage. The offer is US-only for now, with international expansion planned; key value proposition is access to AI-enhanced productivity tools and substantial cloud storage, positioning this as a comprehensive, high-end AI productivity suite by Google. Commentary focuses on perceived value, with skepticism regarding the ROI for individual users (“LLMs make me productive, but not THAT productive”), while some highlight desirable add-ons like YouTube Premium, questioning whether the price justifies the subscription versus alternative spend (e.g., outsourcing tasks).
    • A user expresses shock at the data cap, referencing “30TB,” suggesting the service may be offering exceptionally high-volume data usage possibly for AI or media applications, and implicitly questioning the scalability or economic feasibility of such data limits in a $250/month package.
  • So this basically confirms it (expect a ‘deep think’ toggle - still unsure on ultra) (Score: 471, Comments: 79): The post discusses the expected introduction of a ‘deep think’ toggle in an AI product, likely from Google’s DeepMind based on the context and mention of Demis Hassabis. The image features Hassabis and implies a feature that increases reasoning depth or response quality, aligning with recent marketing by Google to highlight advanced AI capabilities. No confirmation yet on an ‘ultra’ mode, but the ‘deep think’ toggle is heavily implied. Commenters note Google’s shift towards more deliberate AI marketing, referencing OpenAI and Sam Altman as comparators. There’s discussion about the seriousness of DeepMind’s approach and speculation on upcoming features.
    • There’s a technical observation about Google’s historical difficulty in marketing their AI products contrasted with their recent shift. The mention of Demis’ (Hassabis) ‘hype posts’ and a supposed ‘deep think’ toggle suggests a new feature or mode coming, reflecting a possible UI/UX or tuning option in future Google AI models. However, speculation remains about how this compares to OpenAI in terms of product strategy or technical user control.

2. Civitai Payment Ban and Community Responses

  • Civitai banned from card payments. Site has a few months of cash left to run. Urged to purchase bulk packs and annual memberships before it is too late (Score: 674, Comments: 422): Civitai announced it is being banned from card payment processing due to its decision to host NSFW and adult content, as confirmed by an official statement from a representative. The platform currently has only a few months of operating cash left and urges users to purchase in bulk or subscribe to annual memberships while they pursue alternative payment options. Technical challenge centers around compliance with payment processors’ content policies and the associated risk to business continuity. Commenters emphasize the risks of reliance on traditional payment processors for platforms hosting controversial content, raising questions about payment infrastructure resilience and censorship in creator ecosystems.
    • A detailed comment clarifies the technicalities of payment processing: There are two layers—payment processors (e.g., Stripe) and credit card companies (e.g., Visa). Even if Civitai moves to an adult-content-friendly processor, all processors must ultimately adhere to the rules set by credit card companies. Adult-content-friendly processors exist but charge higher fees, and technical changes (requiring code adaptation and setup overhead) add further complexity to switching processors.
    • Another comment highlights the broader infrastructure vulnerabilities for grey-market or adult-content sites: Besides payment processing, hosting, DNS, and search engine indexing are also subject to terms of service and potential business risk. Even sites going fully underground face discoverability and accessibility challenges for their users—potentially undermining their role as leading resource hubs. Thus, any pivot to alternative or underground structures involves a layered technical and operational risk, beyond just payment systems.
    • The original statement from the Civitai team confirms their removal from the payment processor was due to a refusal to remove NSFW content, emphasizing ongoing commitment to support all types of creators and a search for alternative solutions. This underscores the ongoing technical challenge of balancing platform policy, compliance, and continuity of service for communities working with potentially controversial generative AI content.
  • [LEAKED] the powerpoint slide that convinced the civitai board to reject visa’s demands (Score: 439, Comments: 75): The image is a satirical pie chart allegedly showing a “confidential” breakdown of content on Civitai, with overwhelming proportions labeled as “Porn” (69.3%) and “Also porn lol” (23.8%), and minor segments for “Front page” and an Excel error. This parody visualization comments on a business/operational issue: that the platform’s content overwhelmingly consists of adult/NSFW material, possibly explaining resistance to payment processors’ (like Visa) compliance demands affecting user-generated adult content. The technical discussion centers on content moderation, payment compliance, and the implications for platform business models in the presence of adult content. Commenters debate: 1) the factual basis for Civitai allegedly rejecting Visa’s demands, 2) the observation that a vast majority of the site’s content is indeed NSFW, making drastic policy shifts commercially risky, and 3) why payment processors target platforms like Civitai while established adult sites seem unaffected, questioning the consistency of compliance enforcement.
    • A user estimates that approximately 90% of Civitai’s user-generated content is NSFW, suggesting that enforcing Visa’s restrictions would likely result in a major loss of audience and potentially destroy the platform’s business model.
    • There is skepticism expressed regarding payment processor inconsistencies: commenters point out that many established porn websites process payments via Visa without apparent issue, raising questions about why Visa targets smaller or specific sites like Civitai for enforcement.
    • Another user humorously notes that a large proportion of content (quantified as ‘69%’ or ‘90%’) is NSFW, highlighting that the high volume of adult material on Civitai is central to debates about platform restrictions and payment processor compliance.
  • Is CivitAI on its deathbed? Time for us to join forces to create a P2P community network? (Score: 286, Comments: 198): The post raises concerns over CivitAI’s financial viability due to payment processing issues and limited operational runway, questioning whether the community should archive models (including LoRA weights) and shift to a peer-to-peer (P2P) sharing architecture. The discussion focuses on technical strategies for rapid archiving, decentralized storage (like torrents), and organizational logistics—such as avoiding duplicated work and establishing community checklists for distributed action. Commenters debate the sufficiency of P2P solutions like torrents, noting that while torrents can store and distribute models cheaply, they lack CivitAI’s features: searchable metadata, structured model presentation, creator profiles, and financial support mechanisms. Concerns are also raised about how to sustain active development and centralized knowledge sharing without a platform like CivitAI.
    • Several users discuss the technical differences between hosting AI models on torrents versus a centralized platform like CivitAI. Torrents provide decentralized file sharing but lack functions such as a searchable database, tagging, community features, and direct support for model creators. As noted by Herr_Drosselmeyer, the absence of these metadata services and financial incentives limits torrents’ ability to fully replace a platform like CivitAI for the AI modeling community.
    • There is interest in the technical feasibility and cost structure of hosting a CivitAI alternative without inference/generation features. This would potentially allow for cheaper hosting, and some users speculate about funding it through cryptocurrencies. However, the complexity lies not in bandwidth alone but also in implementing features that preserve community, monetization, and discoverability for ongoing model innovation.
    • A few users are proactively archiving large AI model datasets (e.g., 1.5TB so far, with 60TB capacity available) in anticipation of potential platform loss, reflecting an emerging grassroots effort to distribute hosting resources across independent infrastructure. Additionally, alternatives like Civitasbay.org are being explored as early-stage P2P seeding solutions, but usability and content discovery are open issues.
  • Continuously seeded torrent site for AI models, CivitasBay.org (Score: 190, Comments: 34): CivitasBay.org is a torrent-based distribution platform for AI model files (in particular, safetensors format for SD 1.5, SD 2.0, SDXL, and LoRA fine-tunes), using magnet links to enable decentralized sharing without central hosting. The site indexes models by file size, date, and unique identifier, supporting peer-to-peer access to large generative weights, which is valuable for bypassing hosting or bandwidth constraints. This method leverages existing P2P infrastructure for broad, community-driven distribution of resources such as Stable Diffusion checkpoints and LoRA enhancements. Top comments highlight the lack of metadata (model descriptions, sample images, trigger words, or inference settings), noting that the platform’s utility is currently hampered for practitioners who require additional context for effective use. There is discussion around branding and user experience, with requests for enhanced documentation and UX features.
    • Multiple commenters raise technical concerns about model usability stemming from a lack of metadata on CivitasBay.org. They specifically cite the absence of model descriptions, sample images, prompt examples, and recommended inference settings, making it difficult to identify the function, expected output, or use cases of the distributed safetensors files. This results in poor discoverability and hinders users from effectively integrating these models into workflows such as ComfyUI or similar interfaces.
    • A comparison is made to other attempts at AI model archiving (e.g., civitaiarchive.com), noting that while continual seeding and backup of models via torrents is valuable for redundancy and access, such sites become ‘just model Limewire’ without proper context and documentation. Properly indexing models with detailed metadata is highlighted as critical for technical adoption and community use.

3. Cutting-Edge AI for Science, Creativity, and Automation

  • Microsoft Discovery : AI Agents Go From Idea to Synthesized New Material in Hours! (Score: 558, Comments: 83): Microsoft demonstrated AI agents for accelerated scientific R&D, with a specific use case of discovering and physically synthesizing a new, safer immersion coolant for data centers, replacing environmentally harmful ‘forever chemicals.’ The pipeline involved autonomous literature review, experimental planning, code generation, simulation deployment on Azure HPC, and physical synthesis—all accomplished in hours/days versus years, culminating in a live demonstration with the coolant keeping hardware cool during real-world operation. This represents a practical, closed-loop system where AI not only proposes novel compounds but executes end-to-end material discovery and synthesis, paralleling efforts such as Google’s GNoME but with distinct claims of novel, synthesized output. One technically informed commenter draws skepticism from past claims (e.g., GNoME) where AI-discovered materials were later shown not to be novel, and notes failures by AI-driven drug discovery firms (like Exscientia and BenevolentAI) to translate predictions to clinical or market success. Another comment raises the technical point that the breakthrough may not be mere combinatorial search, suggesting more advanced decision-making or generative methods, beyond brute-force exploration.
    • A commenter provides critical context on the claimed breakthroughs by referencing the GNoME AI for materials discovery, noting that although Google reported the synthesis of ~40 new materials with its AI, an external analysis (https://doi.org/10.1103/PRXEnergy.3.011002) found none were truly new. Additionally, AI-driven drug discovery efforts (by Exscientia and BenevolentAI) have faced high-profile failures in clinical trials, highlighting a gap between hype and delivery in AI-for-science results.
    • GrapplerGuy100 contrasts earlier combinatorial search techniques, like those in AlphaEvolve (which used large search spaces and rapid verification for protein design), with the current approach. They express uncertainty about whether the new Microsoft system uses similar combinatorial methods or represents a fundamentally different, potentially more advanced mechanism for materials discovery.
  • Google shows Project Astra controlling your Android phone (Score: 254, Comments: 76): Google demonstrated Project Astra featuring advanced voice and visual AI capabilities to perform comprehensive Android device control, as shown in a recent video demo. The implementation highlights real-time understanding and execution of complex user commands, signaling a leap in multimodal on-device agent performance and suggesting near-term deployment in consumer devices. Comments express surprise at the rapid progression from last year’s demo to real implementation, with experts noting the technical leap over both last year’s prototypes and competitors like Apple. Some foresee Project Astra’s inclusion in Android as a major disruptor to the mobile OS competitive landscape.
    • A technical concern is raised about Google’s ability to serve Project Astra at scale, given the extremely high compute demands for context-aware, on-device AI, especially features like background third-party negotiation. The commenter questions whether Google’s TPUs and infrastructure can reasonably handle hundreds of millions of concurrent users without degradation, implying significant scaling and service delivery challenges.
  • VACE Extension is the next level beyond FLF2V (Score: 147, Comments: 27): The post contrasts FLF2V’s standard frame interpolation—conducted pairwise (e.g., 1→2, 2→3), often producing temporal inconsistencies and unnatural motion reversals—with the novel VACE Extension approach, where user-specified frames serve as ordered ‘checkpoints.’ This method, recently implemented in models like Wan2.1 (currently limited to 81 frames), generates a globally temporally consistent video that passes smoothly through all checkpoints, enabling high-quality animation akin to traditional anime in-betweens. Further scalability beyond 81 frames is possible using overlapping techniques, as shown in recent WACE 14b examples. Technical workflow details are shared here (Japanese). A commenter advocates for generating at lower frame rates (e.g., 15fps) and upsampling to 60fps via interpolation for efficiency, expressing concern about models that aim for 24fps natively. Another user inquires about interoperability between VACE and wan LoRAs, questioning if separate training is needed, suggesting active interest in model modularity and workflow integration.
    • A user advocates generating video at 15fps for efficiency, noting that this allows rapid synthesis and later temporal upscaling (e.g., using interpolation) to cleaner 60fps outputs, suggesting that targeting 24fps in other models is less optimal for this workflow.
    • There’s a discussion about the handling of the ‘last frame bleaching’ artifact in the VACE extension, where color desaturation occurs at the end of sequences. The issue persists despite post-processing efforts, with some suggesting it’s less noticeable in cartoons, which can be re-color graded with fewer quality issues compared to realism-focused workflows.
    • One commenter questions compatibility with WAN LoRAs in the VACE model, asking whether existing LoRAs can be used or if new training is required, highlighting an open issue around model extension and LoRA integration.
  • DeepMind Veo 3 Sailor generated video (Score: 580, Comments: 152): The post discusses a video sample generated by DeepMind Veo 3’s ‘Sailor’ prompt but the external link to the video (https://v.redd.it/us18oc0gpz1f1) is inaccessible due to a 403 Forbidden error, making direct technical analysis of the output impossible. One technical comment highlights that ‘old men with beards’ are a recurrent and potentially restricted generation in Veo, suggesting possible prompt-engineering limitations or content filtering bias in the model. The general discussion speculates on the technological proximity to generating complete AI-directed films from single prompts, indicating rapid progress in generative video models like Veo. Comments note content limitations in Veo (biased towards generating certain subjects, e.g., ‘old men with beards’), while others express a sense of impending technological disruption in film production, but lack detailed technical evaluation due to the unavailable video.
    • One commenter predicts that AI-driven video quality will reach near-perfection in 1-2 years, stating it’s currently at least 80% done, suggesting rapid progress in generative video models like DeepMind Veo.
    • Discussion emerges around the imminent capability for AI to generate entire movies from a single prompt, implying significant advances in multimodal generative systems and potential disruption to traditional video production pipelines.
  • Veo 3 (Score: 359, Comments: 98): The post’s title, ‘Veo 3,’ refers to the latest version of Google’s Veo, the advanced generative video AI model family first announced at I/O 2024, with Veo 3 itself unveiled at this year’s I/O. No technical details or benchmarks are given in the post due to the external link being inaccessible. None of the comments provide new technical information, but one references the rapid advancement since early generative video such as ‘Will Smith eating spaghetti.’ Top comments reflect concern about the disruptive impact of recent generative video models on industries like animation and Hollywood, referencing how quickly the field has advanced and speculating on the effect on established studios (e.g., Pixar, DreamWorks) and traditional Hollywood production.
    • A commenter predicts that within 5 years, some short scenes in movies may be AI-generated instead of traditionally rendered CGI, suggesting a shift in VFX workflows towards machine learning-based generative models for certain types of visual content. This could considerably reduce costs and time for specific scene creation tasks.
  • Veo 3 Standup comedy (Score: 303, Comments: 79): The post discusses a Veo 3-generated video emulating standup comedy, with commenters noting the realism surpasses prior uncanny valley limitations. Technical discussion centers on Veo 3’s ability to convincingly synthesize not only visual and behavioral cues but also subtle audio artifacts, such as ‘breathing-into-the-mic’ sounds, to enhance authenticity. Veo, a video generation model by Google DeepMind (see their research page), is highlighted for its nuanced replication of human performance in generative video contexts. Expert commenters are surprised by the model’s high fidelity, expressing that it could be indistinguishable from real standup footage, with some speculating whether the clip was generated or simply an authentic recording, demonstrating the model’s advancement in media realism.
    • Commenters note Veo 3’s generation is highly realistic, capturing nuanced audio details like the ‘breathing-into-the-mic sound’ before laughter, demonstrating advanced temporal coherence and subtle audio-visual synchronization.
    • There’s discussion of the seamless integration of speech, laughter, and ambient sound effects, which were not expected in AI multimodal models until at least 2025, indicating a leap ahead of anticipated progress for generative video+audio systems.
    • One user highlights the difficulty of distinguishing AI-generated output from real footage, suggesting Veo 3’s results are on the cusp of, or have surpassed, the ‘uncanny valley’ in synthetic media realism.
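The generate-low-then-interpolate workflow from the VACE thread above can be reproduced with off-the-shelf tools. Below is a minimal sketch using ffmpeg's minterpolate filter for motion-compensated upsampling from 15fps to 60fps; this is one common approach rather than the commenter's exact pipeline, and the file names and frame rates are placeholders.

```python
import subprocess

def upsample_with_motion_interpolation(src: str, dst: str, target_fps: int = 60) -> None:
    """Upsample a low-fps generated clip to a higher frame rate using
    ffmpeg's minterpolate filter (motion-compensated interpolation)."""
    cmd = [
        "ffmpeg", "-y",
        "-i", src,
        # mi_mode=mci selects motion-compensated interpolation; fps sets the output rate
        "-vf", f"minterpolate=fps={target_fps}:mi_mode=mci",
        dst,
    ]
    subprocess.run(cmd, check=True)

# Example: a clip generated at 15fps, interpolated up to 60fps (paths are illustrative).
upsample_with_motion_interpolation("wan_15fps.mp4", "wan_60fps.mp4")
```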

AI Discord Recap

A summary of Summaries of Summaries by Gemini 2.5 Pro Exp

Theme 1: Google’s AI Blitz and New Model Onslaught

  • Google Unleashes Gemma 3 and Gemini 2.5 Barrage!: Google dropped its Gemma 3 series, including Gemma 3N (technical report, docs), initially US-only, alongside Gemini 2.5 Flash (OpenRouter link), and teased Gemini 2.5 Pro Deep Think, Veo 3, and Imagen 4 at Google I/O, though many features remain restricted. LMArena members found Google I/O a filler episode, while OpenRouter users debated Gemma 3n-4B potentially matching Claude 3.7 based on Google’s blog on Gemma 3n and Chatbot Arena preferences.
  • Meta’s Llama 3.3 Makes Exclusive OpenRouter Debut!: Meta’s new Llama provider, featuring a Llama 3.3 8B model, launched exclusively open-access and free-to-start on OpenRouter, as per OpenRouter’s Llama announcement tweet. This move offers developers a cost-effective avenue to experiment with Meta’s latest language model offerings.
  • IBM and Mistral Quietly Drop Bombshells, Qwen Marches On!: While Google dominated headlines, IBM subtly unveiled Granite 4.0, and Mistral released Mistral Small 3.1, both generating buzz in the Unsloth AI community. Meanwhile, discussions around the Qwen series continued, with performance tweaks like speculative decoding for Qwen 3 in LM Studio and benchmark discussions for Qwen 2 35B in the Aider server, detailed in Paul Gauthier’s blog post about Qwen3.

Theme 2: Revolutionizing AI Tooling and Developer Platforms

  • Unsloth Steals Spotlight at Google IO, KernelLLM GGUFs Drop!: Unsloth garnered attention after being showcased at Google IO (UnslothAI tweet about Google IO), with the team also releasing KernelLLM GGUFs (KernelLLM GGUFs on Hugging Face) and a fixed-ish Sesame notebook for longer audio generation (UnslothAI tweet on Sesame notebook). These releases empower developers with more efficient tools for training and deploying models.
  • LM Studio and Modular’s MAX Beef Up Local AI Capabilities!: LM Studio users tune model settings by adjusting the RoPE Frequency Scale and leverage Sliding Window Attention (SWA) via Llama.cpp (Llama.cpp SWA GitHub pull) to slash memory usage, seeing KV cache drop from 1700MB to 348MB. Modular’s MAX platform promises full-stack control for inference, inviting enterprise users to contact [email protected] for large-scale disaggregated inference solutions.
  • Perplexity and OpenRouter Iterate, But API Quirks Persist!: Perplexity rolled out new features like Perplexify Me (Perplexity May 19th changelog), but users in the Perplexity AI Discord noted its API outputs can be less direct than the Playground and reported Deep Research API requests appearing as sonar-reasoning-pro (Perplexity API GitHub issue #320). OpenRouter streamlined developer experience by adding slugs for providers and quantizations (OpenRouterAI tweet on slugs).

Theme 3: Rise of the AI Agents: Coding, Research, and Beyond

  • Google’s Jules Agent Enters the Coding Arena, Manus Builds Websites!: Google launched Jules (Jules by Google website), an asynchronous coding agent, stirring excitement and waitlists, while Manus.im debuted an AI agent with its own computer to build websites and conduct research (Manus.im invitation example, Manus use case collection). These agents showcase diverse applications, from software development assistance to automated task completion.
  • OpenEvolve Replicates DeepMind’s Feats, Evolves Code with LLMs!: The open-source release of OpenEvolve (OpenEvolve on GitHub, OpenEvolve HuggingFace Blog), an implementation of Google DeepMind’s AlphaEvolve, made waves by replicating circle packing results with 99.97% accuracy and evolving random search into simulated annealing. This tool uses LLMs to discover, optimize, and evolve entire codebases, and was discussed across Unsloth AI, Eleuther, Nous Research AI, HuggingFace, and GPU MODE Discords; a conceptual sketch of the evolve-with-an-LLM loop appears after this theme’s list.
  • MCP Ecosystem Expands with A2A Bridge and Wallet Integration!: The Model Context Protocol (MCP) saw new developments, including an open-source server bridging MCP with A2A protocol Agents (A2A-MCP-Server on GitHub) allowing Claude to interact with A2A agents. Additionally, TokenPocket released Wallet MCP (Wallet MCP on GitHub) for integrating AI clients with encrypted user wallets, discussed in the MCP (Glama) Discord.
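OpenEvolve's real interface is documented in its repo; the snippet below is only a conceptual sketch of the AlphaEvolve-style loop it implements (an LLM proposes code mutations, an evaluator scores them, and the best candidates survive). The ask_llm and evaluate callables are placeholders, not OpenEvolve APIs.

```python
import random
from typing import Callable

def evolve(seed_program: str,
           ask_llm: Callable[[str], str],
           evaluate: Callable[[str], float],
           generations: int = 20,
           population_size: int = 8) -> str:
    """Toy evolutionary loop in the spirit of AlphaEvolve/OpenEvolve:
    an LLM proposes code mutations, a scorer ranks them, the best survive."""
    population = [(evaluate(seed_program), seed_program)]
    for _ in range(generations):
        children = []
        for _ in range(population_size):
            # Pick the better of two random parents, then ask the LLM to mutate it.
            parent = max(random.choices(population, k=2))[1]
            prompt = ("Improve this Python program so it scores higher on the "
                      "objective, and return only the full revised code:\n\n" + parent)
            child = ask_llm(prompt)
            children.append((evaluate(child), child))
        # Keep only the top-scoring candidates for the next generation.
        population = sorted(population + children, reverse=True)[:population_size]
    return population[0][1]
```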

Theme 4: Pushing Performance Frontiers: Model Optimization and Evaluation

  • Speculative Decoding and SWA Supercharge Local Models!: LM Studio users boosted Qwen 3 performance by 12% using speculative decoding after a lot of fiddling with settings like max draft size (3) and draft probability (0.8). They’re also slashing memory via Sliding Window Attention (SWA) in Llama.cpp (Llama.cpp SWA GitHub pull), reducing a Gemma 4B 15K model’s KV cache from 1700MB to 348MB. A simplified draft-and-verify sketch of speculative decoding appears after this theme’s list.
  • NVIDIA and Academics Probe Model Internals and Efficiency!: Discussions around the Physics of Language Models (Part 3.1 Knowledge Storage paper, Part 3.2 Knowledge Manipulation paper) in the Yannick Kilcher Discord highlighted LLM strengths in retrieval but weaknesses in manipulation without CoTs. In GPU MODE, FSDP2 was noted for using less VRAM than FSDP1 but being slower, with specific benchmarks like fsdp2:11.599G, 13.39s/step versus fsdp1:15.801G, 7.3s/step on RTX 4000 Ada x4.
  • Tinygrad Bounties Drive Hardware Optimization, Cutotune Automates CUDA Tuning!: The tinygrad community pushes hardware limits with bounties for Flash Attention (initially 7900XTX-focused) and BERT training aiming to outperform normal attention on Nvidia/AMD GPUs, as detailed on tinygrad’s X pinned post. Meanwhile, GPU MODE members introduced cutotune, an autotuner for CUDA kernels working with PyTorch compile, designed for easy extensibility.
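For readers unfamiliar with the technique behind those LM Studio settings, here is a deliberately simplified draft-and-verify sketch of speculative decoding: a small draft model proposes up to max_draft tokens, the large target model checks them, and only the agreeing prefix (plus one corrected token) is kept. This is the greedy variant without the rejection-sampling correction used in production implementations, and the draft_next/target_next callables are placeholders.

```python
from typing import Callable, List

def speculative_decode(prompt_ids: List[int],
                       draft_next: Callable[[List[int]], int],
                       target_next: Callable[[List[int]], int],
                       max_new_tokens: int = 64,
                       max_draft: int = 3) -> List[int]:
    """Greedy sketch of speculative decoding: the draft model proposes a short
    continuation, the target model verifies it token by token."""
    ids = list(prompt_ids)
    while len(ids) - len(prompt_ids) < max_new_tokens:
        # 1) Draft model proposes up to max_draft tokens cheaply.
        draft: List[int] = []
        for _ in range(max_draft):
            draft.append(draft_next(ids + draft))
        # 2) Target model verifies; the first disagreement is replaced by the
        #    target's own token, so greedy output matches the target alone.
        for tok in draft:
            expected = target_next(ids)
            if expected != tok:
                ids.append(expected)   # keep the target's correction and stop
                break
            ids.append(tok)            # draft token verified; accept it
    return ids
```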

Theme 5: AI’s Societal Pulse: Ethics, Slop, and Community Dynamics

  • AI ‘Slop’ Definition and Implications Spark Heated Debate!: Across EleutherAI, the term ‘AI slop’ ignited discussions, with definitions ranging from low effort AI output to content failing the Turing test, referencing the AI slop Wikipedia page and the fragility failures paper. The debate questioned if ‘slop’ is inherently AI-tied or a broader content quality issue.
  • LLMs Develop Social Lives, Biases, and Drive Change, Researchers Find!: A Science Advances paper on LLM social conventions discussed in Nous Research AI revealed that decentralized LLM populations spontaneously develop social conventions and can exhibit strong collective biases, even without initial individual bias. The study also showed adversarial LLM agents can instigate social change within these populations.
  • From Echo Chambers to Ugly Ducklings: Navigating AI Discourse!: Latent Space members noted the rise of anti-AI sentiment on platforms like Hacker News and Reddit, pondering if they represent echo chambers. Concurrently, DSPy’s cryptic X post about an “ugly” something teased its core philosophy: If you get this, you get what DSPy is all about, hinting at appreciating the unconventional in AI development.

Discord: High level Discord summaries

LMArena Discord

  • Special Tokens Spark Thinking Tag Exploration: Members explored using special tokens, comparing them to Anthropic’s thinking tags, noting they’re similar to <antml:thinking>.
    • Discussion revolved around understanding their purpose and practical application in AI models.
  • Google Drops Gemma 3, US Only: Google released Gemma 3 and published the Gemma 3N technical report, with availability initially limited to US users.
    • Initial reactions noted limited availability, causing some frustration among international users.
  • Google I/O Demos Over Deliveries: Google I/O event disappointed some, seen as a filler episode with limited immediate releases, despite announcements of Veo 3, Imagen 4, and Gemini 2.5 Pro Deep Think.
    • Many features are restricted to trusted testers or US users, delaying broader access.
  • ChatGPT Pro Thrashes Gemini Ultra in Value Fight: The community favored ChatGPT Pro over Google’s new $250/month Gemini Ultra plan, citing a better value proposition.
    • Some community members speculated about the possibility of Google going bankrupt by the time Grok 9 is released, indicating skepticism towards Google’s current AI strategy.
  • Local AI Attempts Zero-Shot Snake Game: Users experimented with local AI, exploring zero-shot generation using Qwen 3 4B/8B in LM Studio.
    • Challenges arose with limited VRAM, resulting in slow token generation, with one user reporting probably just 5 or so [tokens] per second.

Perplexity AI Discord

  • Perplexity releases Perplexifying New Features: Perplexity announced several new features: Perplexify Me, Live Standings and Commentary for F1, F1 Scores in Answers, Improved File Attachments, and Sidebar Shortcuts, documented in their May 19th changelog.
    • Specific functionalities and detailed improvements for Perplexify Me, File Attachments, and Sidebar Shortcuts remain unspecified in the announcement.
  • Gemini 2.5 Is a Google Workhorse: Google’s Gemini 2.5 Flash (May 20th checkpoint) excels in advanced reasoning, coding, mathematics, and scientific tasks due to its built-in thinking capabilities.
    • It provides responses with enhanced accuracy and nuanced context handling, positioning itself as a state-of-the-art workhorse model.
  • Grok is the Sweet Choice for Unlimited Free Reasoning: Grok is emerging as the preferred choice for users seeking strong, free, unlimited reasoning capabilities in AI models.
    • Users describe Grok as having a chill black dude energy which may have contributed to its popularity.
  • Perplexity Playground Outshines API?: Users report that Perplexity Playground outputs are more direct and accurate compared to the API, even after adjusting top_p and top_k values, noting the addition of a sample example message can improve API results but increases token usage.
    • Users also discussed whether Perplexity API supports OpenAI JSON schema via the OpenAI Python library, but the results of the investigation were not given.
  • Deep Research API is a wolf in Reasoning Pro clothing?: A user reported that Perplexity Deep Research requests via the API show up as sonar-reasoning-pro in the dashboard, prompting a GitHub issue.
    • Users have also reported experiencing frequent timeout issues when using the Perplexity API, especially when researching target companies, suggesting the need for smaller batch research.

Unsloth AI (Daniel Han) Discord

  • Unsloth Unveiled at Google IO!: Unsloth was showcased at the Google IO event via this tweet, with community members expressing excitement after discovering Unsloth through the Gemma+Unsloth+Collab demo.
    • The Unsloth team also released KernelLLM GGUFs and linked to Hugging Face along with a fixed-ish Sesame notebook update that allows generating audio longer than 10 seconds, accessible via this tweet.
  • Mistral and IBM quietly make moves: Members noted the Mistral Small 3.1 release, while IBM quietly unveiled Granite 4.0, generating excitement; one member also linked to some crazy benchmarks.
  • OpenEvolve springs into action!: A member announced the release of OpenEvolve, an open-source implementation of Google DeepMind’s AlphaEvolve system, detailed in this blog post.
    • The creator successfully replicated DeepMind’s results on circle packing (99.97% match!) and evolved a random search into a simulated annealing algorithm, using an LLM ensemble approach for improved results and multi-objective optimization.
  • Members struggle with merged models and PPO: Users reported that merging a 4-bit model and saving as 16-bit can cause performance issues. It was also expressed that using PPO training posed challenges, particularly regarding VRAM usage and reward function limitations.
    • A user advised to “not load 4bit or 8bit when you are trying to merge” and shared a Colab notebook outlining this process.

LM Studio Discord

  • LM Studio API Access Limited Without Web Frontend: Users discovered that while LM Studio provides an API server for hosting LLMs locally, it lacks a web frontend, unlike Stable Diffusion.
    • To achieve web integration, users must use a separate frontend to connect to the LM Studio API.
  • Configure RoPE Frequency with LM Studio: Members can adjust the RoPE Frequency Scale for loaded models in LM Studio by finding the setting cog next to the model loader.
    • It’s also possible to modify the GGUF file’s metadata to include it, as the setting is supported in ollama and llama.cpp.
  • Qwen 3 Gets Speed Boost Via Speculative Decoding: Users achieved a 12% performance increase with Qwen 3 using speculative decoding, configuring it after a lot of fiddling by setting the max draft size to 3 and increasing the drafting probability to 0.8.
    • Increasing the min draft size only increases electricity usage and generation times.
  • Model Unloading Unveiled in LMStudio API: The LM Studio API supports unloading models using the model.unload() function as described in the official documentation.
    • This function is part of the API and doesn’t require creating a separate function; a minimal SDK usage sketch follows this list.
  • Sliding Window Attention Slashes Memory in Llama.cpp: Sliding Window Attention (SWA) is now available in Llama.cpp (github.com/ggml-org/llama.cpp/pull/13194), and potentially soon in LM Studio, to reduce memory usage, particularly for long contexts.
    • Enabling SWA reduced KV cache usage from 1700MB to 348MB for a Gemma 4B 15K context model, and further down to 184MB with Q8_0 quantization during initial tests.
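As a concrete illustration of the unload call above, here is a minimal sketch assuming the LM Studio Python SDK (pip install lmstudio). model.unload() is the documented call; the lms.llm(...) / .respond(...) usage around it and the model key are assumptions for illustration only.

```python
import lmstudio as lms  # LM Studio Python SDK (assumed installed, with the local server running)

# Load (or attach to) a locally downloaded model; the model key is a placeholder.
model = lms.llm("qwen3-4b")

print(model.respond("Say hello in one sentence."))

# Free the model's memory once finished, per the documented unload API.
model.unload()
```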

OpenRouter (Alex Atallah) Discord

  • OpenRouter Adds Provider and Quantization Slugs: OpenRouter announced that providers and quantizations now have slugs, enhancing developer experience, according to their tweet.
    • This change aims to streamline integration and management for developers using diverse models and quantization methods on the platform.
  • Google Launches Gemini Flash 2.5 on OpenRouter: Google DeepMind launched Gemini Flash 2.5, available on OpenRouter at google/gemini-2.5-flash-preview-05-20 for testing.
    • This release provides developers with early access to Google’s latest model for fast and efficient AI applications.
  • Meta Provides Llama Exclusively on OpenRouter: Meta’s new Llama provider is now live, exclusively open-access on OpenRouter, and free to start, which includes a new 3.3 8B model, per their tweet.
    • This collaboration offers developers a cost-effective way to experiment with and utilize Meta’s language models.
  • Google’s Gemma 3n model may match Claude 3.7: The new Gemma model, Gemma-3n-4B, is supposedly as good as Claude 3.7 according to this blog post.
    • Members in the chat found this claim suspicious, as it was based on user preference rankings from the Chatbot Arena.

Eleuther Discord

  • EleutherAI Debates Discord Bot Reacts: Members of EleutherAI considered implementing a Discord bot to combat spam, proposing rate limits and mod pings instead of direct deletion to prevent abuse, and considered a mod-only react to flag content.
    • They suggested using a more general react, counting reacts from regulars only, and using a private mod-only channel to make flagged content easier to review while avoiding irreversible deletes.
  • AI ‘Slop’ Definition Sparks Debate: Members debated the meaning of ‘slop’ in AI-generated content, with definitions ranging from low effort, poorly comprehended material to content that fails the Turing test, referencing AI slop Wikipedia page.
    • The discussion touched on whether slop is inherently tied to AI, with some arguing that human-generated content can also be slop, citing overproduced or nonsensical outputs as examples and referencing the fragility failures paper.
  • Yi Ma Throws Shade on Current ML?: Members watched Yi Ma’s ICLR talk and discussed his first-principles approach, which grounds work in information theory and cybernetics, contrasting it with the heuristics prevalent in current models.
    • One member found the talk slow initially but appreciated the point that compression alone isn’t enough; organization is key.
  • DeepMind’s AlphaEvolve Replicated: A member released OpenEvolve, an open-source implementation of Google DeepMind’s AlphaEvolve, which evolves entire codebases using LLMs to discover and optimize algorithms and replicated DeepMind’s results on circle packing.
  • Members ask about Gemini Diffusion Prompts: A member opened a discussion about Google’s Gemini Diffusion model.
    • He offered to take prompts for anyone without access to Gemini Diffusion.

Cursor Community Discord

  • Cursor’s Tool Limit Stalls Chats: Users report that hitting the 25 tools limit in Cursor and accepting before continuing terminates the chat session, as noted in a Reddit post.
    • This issue may be related to how Cursor handles request limits for non-fast users.
  • DeepSeek-R1T-Chimera Breaks Prompt Loops: The DeepSeek-R1T-Chimera model, a fine-tune between R1 and V3, is claimed to be the only free model capable of breaking loops in a user’s prompt testing, available on Hugging Face.
    • The model’s unique architecture allows it to handle complex prompts without getting stuck in repetitive cycles.
  • MCP Refreshes Irk Users: Users are expressing frustration with the frequent need to refresh their MCPs due to their status turning yellow, which consumes a significant number of requests.
    • The constant refreshing interrupts workflow and reduces the efficiency of using MCPs.
  • Gemini’s Thought Process Evolving: Users are observing a shift in Gemini’s thinking process, where the model now summarizes its thought process rather than displaying the actual steps, making it challenging to craft effective prompts and project rules.
    • This change impacts the ability to fine-tune prompts based on the model’s internal reasoning.
  • O3 Pro Launch Imminent?: Speculation is mounting regarding the release of O3 Pro, fueled by a tweet from a former OpenAI employee hinting at a potential June release.
    • Enthusiasts are eagerly awaiting official confirmation and details about the new features and capabilities of O3 Pro.

Modular (Mojo đŸ”„) Discord

  • MAX Promises Full-Stack Domination: When asked about comparisons to competitors like Fireworks.ai, Together.ai, and Groq.com, the Modular team claimed that Max provides end-to-end control over the stack, leading to greater control over performance, latency, accuracy, and programmability.
    • They invited users to contact them at [email protected] to discuss an enterprise solution with large-scale disaggregated inference.
  • Cosmos: A World (Model) Gobbles GPU: Cosmos, described as a world model that combines a physics engine and a video model, demands significant VRAM.
    • It was clarified that MAX is focused on the NN model, not the entire data pipeline or the underlying data streams.
  • fn() -> raises Syntax Debated: Members debated whether the syntax alias func = fn() is a good idea for Mojo, despite being accepted by the parser.
    • It’s kept for consistency with other return statements but goes against other languages and their type signature requirements, as seen in this code snippet.
  • Pixi Replaces Magic for GPU Puzzles: The Modular team deprecated Magic in favor of Pixi for the Mojo GPU puzzles, magic is just a wrapper around pixi, as seen in this commit.
  • False Positives Plague Mojo 25.3: A member reported an increased number of false positive warnings in version 25.3 related to unused assignments, specifically warning: assignment to 'suggested_max_length' was never used; assign to '_' instead, as described in this bug report.
    • The Mojo compiler incorrectly flags variables as unused when only their alias values are referenced; this is an issue with def function declarations, confirmed by switching to fn() -> raises, which resolved the false positive.

Notebook LM Discord

  • NotebookLM App Goes Mobile: NotebookLM released a mobile app for Android and iOS, featuring offline Audio Overviews, interactivity, and content sharing.
    • Initial feedback suggests the app lacks features compared to the web version, such as Discover Sources and users have raised concerns about the lack of foldable devices support.
  • Audio Overviews Gain Length Control: Users can now adjust the length of Audio Overviews (English only) with short (~5+ min), long (~20+ min), and default (~10+ min) settings.
    • The customization allows users to better tailor the depth and length of AI-hosted discussions to their needs.
  • Podcast Pronunciation Provokes Problems: A user jokingly reported that despite providing clear instructions on the pronunciation of ‘RAGBRAI’ for a generated podcast (notebook link), the AI pronounced it in four different ways.
    • The user was creating podcasts using NotebookLM’s podcast generation capabilities, and other members suggested trying International Phonetic Alphabet (IPA) notation and adding the pronunciation instructions to the prompt.
  • AI Anti-Hallucination Heroics: A user created custom system instructions (dubbed ‘LSTA Agent’ and ‘Ultimate AI Anti-Hallucination Integrity protocol’) to prevent AI from altering source material.
    • This was in response to another user modifying a document by replacing names with Star Wars, Detective Colombo, and Homer Simpson characters (notebook link).
  • Subscription Tier Pricing Sparks Debate: Users are debating the pricing for NotebookLM subscriptions, mainly concerning the cost of $250/month for the Plus tier after an introductory offer.
    • There’s confusion about the offer, with some interpreting it as $125/month for the first 3 months, followed by the full price.

GPU MODE Discord

  • Cutotune Autotuner Tunes CUDA Kernels: A member introduced cutotune, a simple autotuner for tuning any Python code calling kernels on an NVIDIA GPU, claiming it should be easily extensible to other accelerators and noting that it works with PyTorch compile.
    • The autotuner allows overriding the autotuning parameters without commenting it out, and supports tensor property-based and functional triggers based on stride/dtype, a superset of the key in the Triton autotuner.
  • FSDP2 uses less VRAM but slower: A member tested FSDP1 and FSDP2, observing that FSDP2 uses much less VRAM but is slower, with results from RTX 4000 Ada x4 showing different performance for full and lora setups; a minimal FSDP2 (fully_shard) sketch appears after this list.
    • The member noted the following performance using FSDP1 and FSDP2 with batch size of 4: #full fsdp1:15.801G, 7.3s/step fsdp2:11.599G, 13.39s/step #lora fsdp1:13.783G, 8.58s/step fsdp2:10.758G, 10.45s/step.
  • Google Swallows Diffusion Pill with Gemini: Google’s Gemini Diffusion model was mentioned, indicating Google’s entry into the diffusion model space, with link to DeepMind’s models page.
    • Separately, the clean codebase for SageAttention was highlighted, with a link to the GitHub repository.
  • Axolotl’s QAT/PTQ workflow is examined: Members scrutinized an Axolotl QAT/PTQ workflow, comparing their respective configurations for fine-tuning and quantization.
    • The key difference between the two flows is the absence of prepare_model_for_qat, while good results with the OpenAssistant/oasst1 dataset were reported using default settings in the same configuration.
  • NVIDIA and UCL Throw a CUDA Party: NVIDIA and UCL are hosting a CUDA Developer Meet Up on June 5 in London, inviting developers of all levels to connect with CUDA experts and explore Python-native GPU programming.
    • The event promises hands-on talks, discovery of new libraries like nvmath-python and cuda-core, networking opportunities over pizza, and chances to win NVIDIA swag (RSVP here).
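For orientation on the FSDP2 comparison above, here is a minimal fully_shard sketch. It assumes torch >= 2.6 (where fully_shard is exposed under torch.distributed.fsdp) and a torchrun launch with one process per GPU; the toy model and hyperparameters are illustrative, not the member's benchmark code.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard  # FSDP2 entry point (torch >= 2.6 assumed)

def main() -> None:
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())

    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()

    # Shard each parameterized submodule, then the root: parameters become
    # DTensors sharded across ranks, which is where the VRAM savings come from.
    for layer in model:
        if isinstance(layer, torch.nn.Linear):
            fully_shard(layer)
    fully_shard(model)

    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(4, 4096, device="cuda")
    loss = model(x).square().mean()   # dummy objective for the sketch
    loss.backward()
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like torchrun --nproc_per_node=4 fsdp2_sketch.py on a multi-GPU machine.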

Nous Research AI Discord

  • Google’s Code Agent Ignites AI Race: The launch of Google’s code agent (real.azure) is generating excitement and comparisons within the AI community regarding its features and capabilities.
    • Speculation is mounting for an official unveiling at Google I/O, as members eagerly await access after being waitlisted.
  • Gemma 3n Debuts, Targets Edge: Gemma 3n, a generative AI model designed for edge devices, utilizes a new architecture that includes optimizations for parameter-efficient processing (paper).
    • Designed to minimize memory usage, the model can manage audio input and integrates text and visual data, potentially powering smart glasses.
  • Diffusion Models Promise Parallel Text Generation: Diffusion models are being explored for faster text generation via their capacity to process text chunks in parallel and produce text non-causally.
    • Members suggest potential for parallelized chain-of-thought reasoning and improved joke writing due to the ability to consider the punchline from the start, though token limits may pose restrictions.
  • OpenEvolve Evolves Algorithm Optima: OpenEvolve, an open-source implementation of Google DeepMind’s AlphaEvolve, is released, replicating DeepMind’s circle packing results with 99.97% accuracy.
    • The agent employs LLMs to discover and refine algorithms, evolve entire codebases, and work with any OpenAI-compatible API, and members shared links to its GitHub repository and blog post.
  • LLMs Socialize, Show Bias, Cause Social Change: A recent paper reveals that decentralized LLM populations spontaneously develop universally adopted social conventions through local interactions.
    • The study also demonstrates that strong collective biases can emerge during this process, even when individual agents show no initial bias, and that adversarial LLM agents can drive social change.

HuggingFace Discord

  • Bio-Data Expert Eyes Hugging Face Collaboration: A member working in biodata at EMBL-EBI (https://www.ebi.ac.uk/) expressed interest in collaborating with the Hugging Face team on automated curation using LLMs and RLVR with biocuration data.
    • Another member suggested contacting a specific user for potential collaboration.
  • DataTune Transforms Data: Vitalops introduced a new open-source tool, DataTune, that performs data transformations with simple natural language instructions and LLMs.
    • The tool aims to alleviate concerns about context length and high API costs associated with data transformations using LLMs.
  • Agent Course Code Fixes with LiteLLMModel: Members suggest replacing InferenceClientModel with LiteLLMModel in the course code to use Ollama effectively, as the updated notebook from GitHub addresses bugs in the LoopEvent section (notebook link).
    • One user reported solving an issue by creating a new clone of the project repo and copying files over, saying there was something broken in the spaces that the full rebuild in settings won’t even fix.
  • Cyberdesk Agent Builds Computer Agent in a Weekend: A member introduced Cyberdesk, a tool to build a computer agent in a single weekend, utilizing any Hugging Face pixel-based model, as launched on ProductHunt.
    • Another member introduced OpenEvolve, an open-source implementation of Google DeepMind’s AlphaEvolve system, found on GitHub, using LLMs to discover and optimize algorithms, as outlined in their blog post.

Latent Space Discord

  • Perplexity’s Free Tier Priced at $33M: Perplexity is spending $33 million per year on its free tier which they are calling R&D, according to this tweet.
    • The high cost has sparked discussion about the sustainability and strategic rationale behind offering such extensive free access.
  • Google Launches Asynchronous Coding Agent Jules: Jules, an asynchronous coding agent developed by Google, has been released as shown on their website.
    • Jules promises to change the workflow of coding by assisting engineers with modular tasks and complex software architectures.
  • Coding Agents Fail React Test: Members are reporting that current coding agents struggle with complex React code, such as producing ‘maximum update depth exceeded’ errors.
    • These agents also struggle to refactor existing codebases, suggesting that refactoring tasks remain a weak point for current coding agents.
  • Anti-AI Echo Chambers Emerge: Discussions highlight the rise of anti-AI sentiment on platforms like Hacker News and Reddit.
    • Some members suggest these forums may reflect echo chambers rather than broad societal views and that other platforms are required to allow for more AI friendly discussions.
  • Gemini Evolved into AI Operating System: At Google I/O 2025, Gemini has evolved into a full AI operating system with a new suite of tools extending beyond chat functionality, according to this thread.
    • It seems Google is pushing Gemini to be more than a chatbot, turning it into a one-stop shop for AI tooling.

Yannick Kilcher Discord

  • Google Launches Jules to Challenge Codex: Google released Jules, a competitor to OpenAI’s Codex, but access is currently waitlist-only in Europe (Jules).
    • The announcement sparked discussion around Google’s continued investment in AI code generation tools.
  • LatentSeek Goes Tokenless: LatentSeek employs per token RL on the latent representations right before selecting the next token for decoding.
    • Its distinction from COCONUT, which bypasses the tokenization step, was highlighted, with members suggesting LatentSeek might be a benchmark fine-tune over Qwen 2.5.
  • GNN Novices Ponder torch_geometric Nuances: A member inquired about implementing GNNs using torch_geometric for a node regression task, specifically how to structure the input data for a node regression task.
    • Another member shared a code sample class GAT(torch.nn.Module): and offered to assist with the specific implementation question later; a minimal node-regression sketch in that spirit appears after this list.
  • Cracking Open Knowledge Manipulation: Members analyzed Physics of Language Models: Part 3.1, Knowledge Storage and Extraction and Part 3.2, Knowledge Manipulation, focusing on how language models handle factual knowledge and its application to downstream tasks.
    • The discussion underscored that while language models excel in knowledge retrieval, they struggle with tasks like classification and comparison, especially without Chain of Thoughts (CoTs).
  • Google Gabs About Gemma 3N: Google unveiled details for its Gemma 3N model series (Gemma 3N Docs).
    • News of Google AI Edge small language models, multimodality, RAG, and function calling was also released (Google AI Edge).
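Picking up the torch_geometric question above, here is a minimal sketch of a two-layer GAT for node regression, in the spirit of the shared class GAT(torch.nn.Module) sample. The synthetic graph, dimensions, and training loop are illustrative assumptions, not the member's actual data.

```python
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    """Two-layer GAT that outputs one scalar per node (node regression)."""
    def __init__(self, in_dim: int, hidden: int = 32, heads: int = 4):
        super().__init__()
        self.conv1 = GATConv(in_dim, hidden, heads=heads)   # output dim: hidden * heads
        self.conv2 = GATConv(hidden * heads, 1, heads=1)     # one value per node

    def forward(self, x, edge_index):
        x = F.elu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index).squeeze(-1)

# Toy graph: 100 nodes with 16 features, random edges, one regression target per node.
data = Data(
    x=torch.randn(100, 16),
    edge_index=torch.randint(0, 100, (2, 400)),
    y=torch.randn(100),
)

model = GAT(in_dim=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    pred = model(data.x, data.edge_index)
    loss = F.mse_loss(pred, data.y)   # MSE against float targets, not cross-entropy
    loss.backward()
    opt.step()
```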

aider (Paul Gauthier) Discord

  • Qwen Falls Flat on Angular/Typescript Task: Despite tweaking temperature parameters on the Alibaba Cloud endpoint, a user found Qwen 2 35B unable to generate diffs for angular/typescript code using element.innerHTML.
    • Another member had success with Qwen on Rust but failed with Kotlin due to immutable map issues, even with type information readily available, referencing Paul’s blog post about Qwen3.
  • Aider Shell Execution Requires Human in the Loop: A user inquired about executing shell commands like listing staged files within Aider, but discovered that the --yes-always option doesn’t bypass the need for manual approval.
    • According to issue #3903, this is a deliberate design choice.
  • Aider Admits Default YAML Configs Work: When asked about optimized YAML configurations for Aider, users learned that Aider is designed to function optimally with default configurations, with the model choice being the most critical factor.
    • A link to sample configs was provided, recommending GLOBAL_DEFAULTS.yml for those wanting to deviate from the standard Aider setup.
  • Gemma 3n 4B Scores Respectably on Polyglot: The new Gemma 3n 4B model achieved a 44.4% score on the polyglot benchmark, although the settings used to achieve this are still unknown.
    • One user noted that the 4B model scores higher than Gemini 2.0 Pro.

Manus.im Discord Discord

  • Manus AI Agent Debuts with Dedicated Compute: Manus is an AI agent with its own computer designed to build websites, write reports, and conduct research tasks, accessible via invitation links such as this one.
    • User feedback highlights its capabilities and also points to some issues with the credit system/usage and invitation code abuse.
  • Website Creation confirmed as Manus Capability: Users confirmed that Manus can indeed create websites, with references to Manus use case official collection and Manus use cases from users.
    • The agent’s website-building feature stands out as a practical application of its AI capabilities.
  • Manus Plagued by Network Connection Errors: Users have reported experiencing network connection errors while using Manus.
    • The Manus team is actively investigating these issues, requesting account emails and session links to diagnose and resolve the connectivity problems.
  • Members Inquire about Manus Tech Stack: Community members have shown interest in the tech stack behind Manus.
    • The response was that the team is exploring various innovations without specifying exact technologies.

Cohere Discord

  • Category Theory Interest Catalyzed: Members expressed interest in learning about the intersection of category theory and AI.
    • The goal is to explore techniques and methods for prompt engineering tokens related to this combination.
  • Cohere Labs Launches New Grants Chapter!: A user inquired about the status of the Cohere Research Grants program, as the application link appeared inactive (https://share.hsforms.com/1aF5ZiZDYQqCOd8JSzhUBJQch5vw).
    • A Cohere representative clarified that Research Grants are now part of Cohere For AI, known as Cohere Labs.
  • Cohere Counters Customers’ Control Craving: A user inquired about private deployment options for Cohere models on-prem, seeking full ownership and control due to data/LLM sovereignty interests.
  • Command A Crawls, Confounds Customers!: A user reported slower response times with command A, especially with the structured response parameter.
    • A Cohere representative acknowledged the issue, found no known problems, and requested details be sent to [email protected] for investigation.
  • Vitalops Vamps with datatune: Vitalops released datatune, an open source tool for data transformation via natural language with LLMs, which can be found on GitHub.
    • The tool simplifies data transformations, making them more accessible.

MCP (Glama) Discord

  • Deep Divers Seek MCP’s Best Practices: Members are seeking deep dives into MCP best practices, specifically around tool design and deployment in production, linking to a Windows blog post about securing the Model Context Protocol.
    • Discussion arose around browsermcp, allowing an agent to visualize what you are seeing in the browser.
  • MCP Meets A2A Protocol Agents: An open-source server was released that bridges MCP with A2A protocol Agents, allowing Claude to interact with A2A agents, with code available on GitHub.
    • The discussed use case for A2A involves deploying agents to a domain and exposing them like tools.
  • Wallet MCP Enables AI Wallet Integration: The TokenPocket team released Wallet MCP, a lightweight bridge enabling integration between AI clients and encrypted user wallets, supporting multi-chain asset management, transaction signing, and smart contract interactions, with code available on GitHub.
    • Joining requires captcha due to the high volume of scammers and spammers.
  • Public SearXNG MCP Server Addresses Unreliability: A member created a public SearXNG MCP server at GitHub to address the unreliability of public SearXNG servers and the lack of JSON support.
    • The implementation randomizes which instance is called to avoid effectively DoS-ing any single public server, making it suitable for private users who call internet searches sparingly.

tinygrad (George Hotz) Discord

  • AMD’s Enum Decision Irks Developers: A developer expressed frustration over AMD’s decision to change enums in their repo.
    • The developer joked that AMD acted like they were gonna run out of numbers or something.
  • Flash Attention Bounty Limited to 7900XTX initially: The flash attention bounty is locked and is being tested on both the 7900XTX and 9070XT, but currently only supports the 7900XTX.
    • The developer stated that if RDNA4 added new wmma instructions, they could include them, but they do not have a 9070XT for testing.
  • BERT Training Bounty Aims for Nvidia/AMD Supremacy: The BERT training bounty seeks flash attention in tinygrad that outperforms normal attention, and it is compatible with any Nvidia/AMD hardware that can run the trainer.
    • The bounty poster is using chatgpt to write their AGENTS.md file, but it turned out bad.
  • tinygrad’s Job Application Process Explained: In response to a job inquiry, a member clarified that the primary route to a job at tinygrad is through bounties, suggesting starting with small PRs.
  • Control Flow Conundrums in Tinygrad: A member inquired about the existence of control flow operators in tinygrad, similar to those found in jax.lax, such as jax.lax.cond.
    • Another member suggested using Tensor.where as a possible alternative to jax.lax.cond, however jax.lax.cond allows you to determine which branch of code to execute, whereas Tensor.where is applied to a tensor specifically, so both paths will still execute.
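To make the distinction above concrete, here is a small sketch of tinygrad's Tensor.where (an elementwise select where both branches are computed) with the jax.lax.cond contrast noted in comments. The example values are illustrative.

```python
from tinygrad import Tensor

x = Tensor([-2.0, -1.0, 0.5, 3.0])

# Elementwise select: (x > 0) is a boolean mask; both x * 2 and x * -1 are
# evaluated for every element, then merged by the mask.
y = (x > 0).where(x * 2, x * -1)
print(y.numpy())   # [2. 1. 1. 6.] -> negatives flipped, positives doubled

# The jax.lax.cond analogue would look like:
#   jax.lax.cond(pred, lambda v: v * 2, lambda v: v * -1, x)
# where pred is a scalar and only the chosen branch runs, so it can guard
# expensive computation in a way an elementwise where cannot.
```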

LlamaIndex Discord

  • LlamaIndex Hosts Finance Workshop: LlamaIndex is hosting a hands-on workshop in NY on May 29th, with @jerryjliu0 leading sessions on building agent workflows for financial analysis and due diligence, sign up here.
    • The event will offer exclusive insights into leveraging LlamaIndex for advanced financial applications.
  • AWS Backs Multi-Agent Communication Protocol: AWS announced they are joining the MCP steering committee and contributing to its evolution for better inter-agent communication, and collaborating with frameworks like LlamaIndex.
    • More details about MCP and AWS involvement can be found here.
  • Members Request Agent Handoff Examples: A member asked for agent handoff examples, and another member provided a link to the LlamaIndex documentation.
    • The documentation details how to implement Agent Workflows.
  • Llama Parse Encounters Problems: A member reported issues with the Llama Parse service using the Parse with Layout Agent, noting that jobs were taking upwards of 30 minutes and then failing without explanation, and that the service was getting stuck trying to load up.
    • They also shared a screenshot of the failed job.
  • Performance Comparison of FAISS: A member inquired about the performance differences between using a VectorStoreIndex and a local FAISS for storage in a RAG model.
    • They questioned whether the performance of a RAG model would be degraded by using one over the other.
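For context on the comparison question above, the main practical difference is where the vectors live and how similarity search is executed; the query interface stays the same. Below is a minimal sketch of plugging a local FAISS index into LlamaIndex via FaissVectorStore (assumes the llama-index-vector-stores-faiss and faiss packages, an embedding dimension matching your embedding model, and a placeholder documents directory).

```python
import faiss
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.faiss import FaissVectorStore

# Dimension must match the embedding model (1536 is a placeholder,
# e.g. for OpenAI text-embedding-ada-002).
faiss_index = faiss.IndexFlatL2(1536)
vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Same query interface as the default in-memory VectorStoreIndex; only the
# underlying vector storage and search backend change.
print(index.as_query_engine().query("What does this corpus cover?"))
```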

Torchtune Discord

  • Recipe Tutorials Enhance Docs: Members concurred that recipe tutorials should enhance documentation, offering end-to-end examples like grpo recipes showing improvements on math datasets, with direct links to the results on specific datasets.
    • However, automating recipe tutorials as part of CI was deemed suboptimal due to potential issues with needing to reconsider train and eval datasets for each new model for delta_in_performance > 0.
  • Llama2 Evaluation Proves Difficult: A member recounted their painful experiences evaluating Llama2 early on, particularly with approach (2).
    • They advocated for updating the contributing guide with best practices and including README.md files summarizing evals for each new model to aid sanity checks.
  • DistCp Conversion Issue Requested: An issue was requested for converting the DistCp format (from async checkpointing) to safetensors.
    • A member offered utils to facilitate the conversion, emphasizing its importance to the DCP team; a rough conversion sketch appears after this list.
  • Async GPRO Tinkerer Notices Dependency Issues: A member experimenting with async_grpo noted that async_rl currently relies on vllm==0.8.4, which in turn depends on torch==2.6.0.
    • A plan to update the vllm dependency to vllm==0.9.0 (requiring torch==2.7.0) was discussed, with the caveat that it hasn’t been tested yet, but should work.
  • Async RL Recipe Marked Experimental: The async RL recipe is currently experimental and pegged to a stable version of vllm.
    • Upgrading to the pre-release version of vllm 0.9.0 is not thoroughly tested.
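For the DistCp-to-safetensors request above, one possible conversion path is sketched below using torch.distributed.checkpoint's dcp_to_torch_save utility plus safetensors. The member's offered utils may take a different route; the paths are placeholders and the flattening step assumes a plain "model" state dict of tensors.

```python
import torch
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save
from safetensors.torch import save_file

# 1) Consolidate the sharded DCP checkpoint directory into a single torch.save file.
dcp_to_torch_save("checkpoints/step_1000", "checkpoints/step_1000_full.pt")

# 2) Load it back and keep only tensor leaves: safetensors stores a flat
#    name -> tensor map, so nested or non-tensor entries must be dropped.
state = torch.load("checkpoints/step_1000_full.pt", map_location="cpu")
model_state = state.get("model", state)
tensors = {k: v.contiguous() for k, v in model_state.items() if torch.is_tensor(v)}

save_file(tensors, "checkpoints/step_1000.safetensors")
```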

DSPy Discord

  • DSPy Teases Deeper Meaning Behind Ugly Duckling: A post on X features something ugly, yet the poster expresses a liking for it, teasing, If you get this, you get what DSPy is all about.
    • The message hints that understanding DSPy means appreciating unconventional or initially unappealing aspects, pointing to a potentially unique approach in its development and application.
  • DSPy’s Core Philosophy: Embracing the Unconventional: The recent DSPy X post suggests that appreciating the unconventional is key to grasping the project’s essence, focusing on potentially unique approach.
    • This could imply a development philosophy that values aspects often overlooked or considered unappealing in traditional AI development.

LLM Agents (Berkeley MOOC) Discord

  • AgentX Competition Opens Submissions: The AgentX competition submission forms are now open, featuring an Entrepreneurship Track (form link) and a Research Track (form link); the submission deadline is May 31, 2025, at 11:59 PM PT.
    • The Entrepreneurship Track requires a pitch deck, product demo video, and live product link; the Research Track requires a paper, video presentation, and a GitHub repo, with over $150K in prizes awaiting top teams; competition organizers ask for help spreading the word about AgentX on X and LinkedIn.
  • Students Use OpenAI API Keys for Labs: Students must use their own OpenAI API keys for the lab, but can exclude it for the actual submission.
    • The TA can answer whether there are alternative approaches that don’t require API calls.
  • Mastery Tier Downgrade for Certificates: Students can still apply for the Mastery Tier even if they struggle with labs, as they can be downgraded to the Trailblazer Tier if quizzes and articles are completed.
    • The downgrade happens at the staff end.

Nomic.ai (GPT4All) Discord

  • PDF Text Extraction Demands Custom Embedders: Achieving precise text extraction from PDF textbooks requires a specialized embedder model to maintain fidelity.
    • Standard models may not suffice for replicating the exact layout and content of complex PDF documents.
  • GPT4All API Key Installation Snafu Resolved: A user reported issues with the install button in GPT4All not functioning when a lengthy OpenAI API key was pasted.
    • The problem was troubleshot, suggesting a potential issue with how the software handles long API keys during installation.

MLOps @Chipro Discord

  • DataTune Simplifies Data Transformation: Vitalops has launched DataTune, a new open-source tool designed for data transformation.
    • It leverages natural language instructions alongside LLMs, aiming to simplify the data manipulation process.
  • DataTune Leverages Natural Language and LLMs: DataTune facilitates data transformations using intuitive natural language instructions powered by LLMs.
    • This open-source tool from Vitalops aims to streamline and simplify complex data manipulations for users.

The Codeium (Windsurf) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The Gorilla LLM (Berkeley Function Calling) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The AI21 Labs (Jamba) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.



Discord: Detailed by-Channel summaries and links

LMArena ▷ #general (1432 messagesđŸ”„đŸ”„đŸ”„):

Special Tokens, Gemma 3, Google I/O, OpenAI versus Google

  • Thinking tags trigger exploration of special tokens: Members discussed using special tokens and Anthropic’s thinking tags, with one member noting that these tokens are basically the same as antml:thinking.
    • They tried to figure out their purpose and usage.
  • Gemma 3 Released by Google: Google has released Gemma 3, with the Gemma 3N technical report available, although some noted it is initially available only in the US.
  • Google I/O Event disappoints with Demos over Releases: The Google I/O event, happening now, has been a source of both excitement and disappointment, with members expressing the sentiment that it’s a filler episode and complaining that the IO isn’t releasing anything.
    • New features and models like Veo 3, Imagen 4, and Gemini 2.5 Pro Deep Think were announced, but most are limited to trusted testers, US users, or are coming later.
  • OpenAI versus Google: The community thinks OpenAI is better, with multiple members recommending ChatGPT Pro over Google’s new $250/month Gemini Ultra plan, highlighting the former’s better value proposition.
    • There were also discussions about the likelihood of Google going bankrupt by the time Grok 9 is released.
  • Users Experiment with Local AI Zero-Shot Snake Game: Members discussed the possibility of local AI being able to zero-shot generate No Man’s Sky, and experimented with Qwen 3 4B/8B in LM Studio on their local computers.
    • However, the conversation quickly shifts to the challenges of running such models with limited VRAM, with one user noting that they are getting probably just 5 or so [tokens] per second with their setup.

Perplexity AI ▷ #announcements (1 messages):

Perplexity Updates, F1 Standings, Sidebar Shortcuts

  • Perplexity Ships New Features: Perplexity announced the release of several new features including Perplexify Me, Live Standings and Commentary for F1, F1 Scores in Answers, Improved File Attachments, and Sidebar Shortcuts.
  • Perplexify Me arrives!: A new feature called Perplexify Me was released.
    • No other details were given.
  • Live F1 Standings now available!: Perplexity launched Live Standings and Commentary for F1 and F1 Scores in Answers.
    • This allows users to stay updated on Formula 1 racing directly within the Perplexity platform.
  • File Attachment Improvements: Perplexity improved File Attachments.
    • No other details were given.
  • Sidebar Shortcuts improve Navigation: Sidebar Shortcuts were added to Perplexity.
    • No other details were given.

Perplexity AI ▷ #general (808 messagesđŸ”„đŸ”„đŸ”„):

Notebooklm releases on Android, GPTs Agents, OpenAI's sidebars, Perplexity AI Discord chatbot, Grok is sweet

  • Gemini 2.5 Flash May 20th is a workhorse: The Gemini 2.5 Flash May 20th Checkpoint is Google’s state-of-the-art workhorse model, specifically designed for advanced reasoning, coding, mathematics, and scientific tasks.
    • It includes built-in thinking capabilities, enabling it to provide responses with greater accuracy and nuanced context handling.
  • Grok is new unlimited reasoning: Grok is the preferred choice for users seeking strong, free, unlimited reasoning capabilities in AI models.
    • Grok is perceived to have a more casual tone, described by one user as having a chill black dude energy.
  • Gemini in Chrome is Death to Dia Browser: Google’s integration of Gemini in Chrome is seen as a move that effectively kills Dia Browser, with one user humorously posting a cat explosion GIF in response.
    • A member stated that they might be the only one using Dia rn.
  • Google Deep Search copies Perplexity: Members discussed Google’s new Deep Search feature, with some joking that it looks like they took all of Perplexity’s features and offers the best browsing experience.
    • The downside to perfecting something like PPLX did is becoming the baseline, and everyone copies the methodology.
  • Perplexity AI Discord chatbot channel disabled: It was mentioned that the Perplexity AI Discord chatbot channel has been disabled.
    • A member provided a link to a message confirming the channel’s deactivation.

Perplexity AI ▷ #sharing (4 messages):

Grok, Data, Github Copilot, India

  • Grok’s Data Under Microscope: A Perplexity page discusses the data footprint of Grok.
  • Bias and Implementation Discussed: A Perplexity page covers the bias towards action implementation.
    • No additional details were provided.
  • India’s Diplomacy Highlighted: A Perplexity page showcases India’s diplomatic global outreach.
  • Github Copilot Goes Open Source: A Perplexity page discusses GitHub Copilot going open source.

Perplexity AI ▷ #pplx-api (14 messagesđŸ”„):

Playground vs API Output Quality, Deep Research API Issues, Timeout Issues with Perplexity API, JSON schema via OpenAI Python library

  • Playground Output is Superior Compared to API?: Members are reporting that Perplexity Playground outputs are more direct, accurate, and less vague than those from the API, even after adjusting top_p and top_k values.
    • One user found that adding a sample example message in the conversation history for the assistant improved API results but significantly increased token usage.
  • Deep Research API Masquerading as Reasoning Pro?: A user reported that Perplexity Deep Research requests via the API are showing up as sonar-reasoning-pro in the dashboard, despite triggering the correct model and receiving Deep Research-like responses; a GitHub issue has been created to address this discrepancy.
    • Several other users have confirmed that they are experiencing the same issue.
  • API Timeouts plague users: Users are experiencing frequent timeout issues when using the Perplexity API, particularly when researching target companies via N8N and Google Sheets; this contrasts with manual browser usage, which works without issues.
    • N8N support has confirmed that the timeout problems originate from the Perplexity API side, prompting considerations of batching research into smaller groups as a workaround.
  • Perplexity API can handle JSON schema: A user asked whether Perplexity supports OpenAI JSON schema via the OpenAI Python library.
    • It is unknown from the given messages whether or not the feature works.
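In code, the JSON-schema question above boils down to calling Perplexity's OpenAI-compatible endpoint through the OpenAI Python client with a response_format schema, roughly as sketched below. Whether Perplexity actually honors the parameter was left unresolved in the discussion; the API key, model name, and schema are placeholders.

```python
from openai import OpenAI

# Perplexity exposes an OpenAI-compatible chat completions endpoint.
client = OpenAI(api_key="pplx-...", base_url="https://api.perplexity.ai")

schema = {
    "type": "object",
    "properties": {
        "company": {"type": "string"},
        "founded": {"type": "integer"},
    },
    "required": ["company", "founded"],
}

resp = client.chat.completions.create(
    model="sonar-pro",
    messages=[{"role": "user", "content": "When was Perplexity AI founded?"}],
    # Structured-output request; support on Perplexity's side is the open question.
    response_format={"type": "json_schema", "json_schema": {"schema": schema}},
)
print(resp.choices[0].message.content)
```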

Unsloth AI (Daniel Han) ▷ #general (538 messagesđŸ”„đŸ”„đŸ”„):

128k context on 32gb, VRAM Calculator, KernelLLM GGUFs, Vision Quants, Training VITS

  • Running 128k context is difficult on 32GB: A member realized running 128k context is more difficult than expected on a 32GB machine.
    • Another member linked to a VRAM calculator that could assist in determining memory requirements for large language models.
  • Unsloth releases KernelLLM GGUFs: The Unsloth team released KernelLLM GGUFs, providing a link to Hugging Face.
    • These releases were announced on the Unsloth subreddit, with a discussion on Qwen3 uploads and fixed chat templates.
  • Sesame Notebook Receives a Fixed-ish Update: The Unsloth team announced a fixed-ish Sesame notebook update that allows generating audio longer than 10 seconds, accessible via this tweet.
    • This update includes setting the max_new_tokens parameter to 125, where 125 tokens is equivalent to 10 seconds of audio.
  • Google Unveils Gemma 3n for Mobile Devices: Google launched Gemma 3n, designed for efficient execution on low-resource devices with multimodal input capabilities.
    • The model uses selective parameter activation to operate at an effective size of 2B and 4B parameters, with a demo app available on GitHub.
  • Unsloth Featured at Google IO Event: Unsloth was showcased at the Google IO event, as announced via this tweet.
    • New community members expressed excitement after discovering Unsloth through the Gemma+Unsloth+Collab demo during the event.

Unsloth AI (Daniel Han) ▷ #off-topic (9 messagesđŸ”„):

Mistral Small 3.1, Qwen2.5 VL benchmark, IBM Granite 4.0, Gemini Diffusion, Visual AI Learning App

  • Mistral Small 3.1 Released!: Members noted the Mistral Small 3.1 and Mistral Medium releases, pointing out that Mistral Medium isn’t open source.
  • IBM Silently Launches Granite 4.0: IBM quietly unveiled Granite 4.0, a tiny preview sneak peek that got users excited.
    • One member stated Hell nah that Gemini Diffusion is really cool, but others mentioned that they are not the first in the space.
  • AI Engineer seeks Marketer for Visual AI Learning App: An AI engineer with several years of experience is seeking a marketer for an app that helps people learn AI in a truly intuitive, visual way.
    • The engineer mentioned that countries like China are investing heavily in AI education and that access to intuitive AI learning tools should be global and open to everyone.

Unsloth AI (Daniel Han) ▷ #help (235 messagesđŸ”„đŸ”„):

Unsloth Model Merging, PPO Training, GRPO Training, Qwen 3 Models

  • Unsloth Model Merging Mishaps surface: Users reported that merging a 4-bit model and saving as 16-bit can cause performance issues, but loading the merged model in 4-bit mode further decreases the performance of the merged model.
    • A user suggested to “not load 4bit or 8bit when you are trying to merge” and merging can be done in normal RAM to avoid issues with precision. They shared a colab notebook outlining this process.
  • PPO Training Proves Perilous: A user expressed interest in using PPO training but couldn’t find a relevant notebook, leading to a discussion about the challenges of PPO with LLMs, particularly regarding VRAM usage and reward function limitations.
    • It was mentioned that the current trainer setup only supports AutoSequenceForClassification reward models, hindering custom reward functions, however further upgrades are in the works.
  • GRPO Glitches Grab Attention: A user reported an unnatural spike in KL divergence during GRPO training with a Qwen 3 model, prompting suggestions that a problematic data sample may have been to blame.
    • Another user successfully trained a few models with GRPO, but struggles with PPO, and wanted to compare GRPO vs PPO.
  • Qwen 3 Questions Get Quick Clarification: A user inquired about using Qwen 3 for classification tasks, asking if the notebook could be used for other LLMs.
    • It was confirmed that almost any model can be used. The user then encountered an error, which was resolved by removing lm_head from get_peft_model.
  • Intel GPU Integration In Progress: A user inquired about Unsloth support for Intel GPUs, specifically the Intel Arc Pro B60, and the community pointed to an ongoing pull request that indicates progress in this area PR#2388.
    • Other users cautioned that the integration is still a work in progress, and it may be premature to expect full functionality until the related issues are resolved.
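
For the merge advice above, here is a minimal sketch of the suggested flow, assuming Unsloth's save_pretrained_merged helper and that a saved LoRA checkpoint directory can be loaded directly; all paths are placeholders.

```python
from unsloth import FastLanguageModel

# Reload the saved LoRA checkpoint WITHOUT 4-bit/8-bit quantization so the
# adapter is merged at 16-bit precision in ordinary RAM (placeholder path).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/lora_checkpoint",
    load_in_4bit=False,
)

# Merge the adapter into the base weights and write a 16-bit checkpoint.
model.save_pretrained_merged("merged-16bit", tokenizer, save_method="merged_16bit")
```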

Unsloth AI (Daniel Han) ▷ #research (33 messagesđŸ”„):

Entropix Pruning, VLM Gemma3 evaluation metrics, OpenEvolve released

  • Entropix awakens from slumber with Multi-GPU plans: After a period of silence, the author of Entropix announced plans for multi-GPU deployments and testing with larger models, splitting the project into two repos, and shouted out to folks for the compute.
  ‱ Attention Pruning may still be worth it, depending on the model?: Members discussed pruning attention layers with negligible performance impact for speedup, but one cautioned against applying techniques from the Llama 2 era to newer models like Llama 3-8B, citing differences in robustness to pruning.
    • Another member chimed in and said this is why I get confused when people use llama 3.2 3B, it’s pruned from one of the least prunable models around and thus is ass for its size.
  ‱ Evaluating PEFT Fine-tuned VLMs requires CERtain benchmarks: A Master’s student asked about standard metrics for evaluating a PEFT fine-tuned VLM (Gemma3), suggesting CER and WER on OCR tests along with perplexity (see the sketch after this list).
    • Other members recommended lmms-eval and lm eval harness, a collection of 90+ tasks, and emphasized the need for a set of benchmark tests to assess VLM behavior.
  • OpenEvolve leaps from the primordial code soup!: A member announced the release of OpenEvolve, an open-source implementation of Google DeepMind’s AlphaEvolve system, which evolves entire codebases using LLMs to optimize algorithms, as detailed in this blog post.
    • The creator successfully replicated DeepMind’s results on circle packing (99.97% match!) and evolved a random search into a simulated annealing algorithm, using an LLM ensemble approach for improved results and multi-objective optimization.
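
As a concrete illustration of the CER/WER suggestion above, here is a minimal sketch using the jiwer package (the strings are placeholders); lmms-eval or the lm eval harness would wrap this kind of metric in full benchmark tasks.

```python
# pip install jiwer
from jiwer import wer, cer

reference = "Ground-truth transcription of the document"    # placeholder reference
hypothesis = "Ground truth transcriptoin of the document"    # placeholder model output

print(f"WER: {wer(reference, hypothesis):.3f}")  # word error rate
print(f"CER: {cer(reference, hypothesis):.3f}")  # character error rate
```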

LM Studio ▷ #general (158 messagesđŸ”„đŸ”„):

LM Studio API, RoPE Frequency Scale, Qwen 3 Speculative Decoding, Model Unloading via API, Sliding Window Attention

  • LM Studio API supports API access, but lacks web frontend: A user inquired about using LM Studio to host LLMs locally for web integration, similar to Stable Diffusion, but found that LM Studio only provides an API server, not a web frontend.
    • They suggested using a separate frontend to connect to the API to achieve the desired functionality.
  ‱ Configuring RoPE Frequency Scale: A user asked about setting the RoPE Frequency Scale when loading a model in LM Studio and was pointed to the settings cog next to the model loader.
    ‱ Users can also modify the GGUF metadata to include it, since the setting is supported in Ollama and llama.cpp.
  ‱ Qwen 3 Gets a Speed Boost with Speculative Decoding: Users reported a 12% performance increase with Qwen 3 using speculative decoding, achieved by adjusting the max draft size to 3 and increasing the drafting probability to 0.8; they reported these results after “a lot of fiddling”.
    • They cautioned that increasing the min draft size only increases electricity usage and generation times.
  ‱ Model Unloading Unveiled in LMStudio API: Users discussed how to unload models using the LM Studio API, referencing the official documentation, and found that model.unload() is the correct syntax (see the sketch at the end of this list).
    • The discussion clarified that the function is part of the API and doesn’t require creating a separate function.
  ‱ Sliding Window Attention Arrives in Llama.cpp: Sliding Window Attention (SWA) is now available in Llama.cpp (github.com/ggml-org/llama.cpp/pull/13194), which should later make its way into LM Studio, reducing memory usage particularly for long contexts.
    • Initial testing shows that enabling SWA reduced KV cache usage from 1700MB to 348MB for a Gemma 4B 15K context model, and further down to 184MB with Q8_0 quantization.
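
For the unloading item above, a minimal sketch assuming LM Studio's Python SDK (lmstudio-python); the model key is a placeholder and the exact helper names should be checked against the official documentation referenced in that thread.

```python
import lmstudio as lms

model = lms.llm("qwen3-8b")           # load (or attach to) a model by key -- placeholder
result = model.respond("Say hello")    # run a quick completion to confirm it is loaded
print(result)
model.unload()                         # free the model's memory via the API
```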

LM Studio ▷ #hardware-discussion (449 messagesđŸ”„đŸ”„đŸ”„):

Intel Arc IPEX support in LM Studio, AMD GPU drivers issues, AVX2 support in LM Studio, Dual GPU setup, PCIE5 vs SATA SSD speeds

  • Intel Arc IPEX Support Questioned for LM Studio: Members wondered if LM Studio will gain IPEX support if Intel Arc becomes popular, citing past driver issues with both mobile and desktop GPUs.
    • A member cited issues with mobile drivers on Zen4 systems with >64GB RAM causing black screens and problems with desktop GPUs requiring sideloading of drivers.
  • AVX2 Requirement Troubles LM Studio on Older Hardware: A user reported issues running LM Studio on an unsupported Core i7 3770K CPU, which lacks AVX2 support, despite having a supported Nvidia GTX 960 GPU.
    • A member suggested compiling llama.cpp without AVX2 requirement or trying Ollama, which recently removed the AVX2 requirement, and shared a repo for running llama.cpp on old hardware.
  • Debate Surfaces on Dual GPU Setup Viability via Thunderbolt: A member inquired about running a dual GPU setup with one GPU in an external Thunderbolt USB4 enclosure to avoid slowing down the primary PCI-e slot.
    • Another member confirmed it’s possible but cautioned that bandwidth might be limited to PCIe x4 (or even x1 in poor-quality cases), which may impact performance for tasks beyond LLM inference, also pointing out that Thunderbolt/USB4 requires 4 dedicated PCI-E lanes for the port.
  • Enthusiasts Compare PCIE5 vs SATA SSD speeds: A discussion ensued about the practical benefits of PCIE5 SSDs over SATA, with a member humorously noting that they didn’t realize SATA had the equivalent of 125 lanes of PCIE 5.0.
    ‱ The consensus leaned towards limited real-world advantages for average users beyond the jump from SATA to Gen3 NVMe, with diminishing returns for faster storage except for specific creative workloads or datacenters.
  ‱ Enthusiasts Consider A Consumer PC with Expander for LLMs: A member inquired about performance expectations for running DeepSeek R1 on a consumer-grade RAM expander with 512 GB of DDR5 in a PCIe slot, questioning if 10 tps would be achievable.
    • It was noted that the memory bandwidth through PCIe might not match direct DIMM slots, potentially limiting performance, drawing comparison to the bandwidth of dual-channel DDR4.

OpenRouter (Alex Atallah) ▷ #announcements (3 messages):

Provider slugs, Quantization slugs, Gemini Flash 2.5 release, Llama provider by Meta

  ‱ OpenRouter adds slugs for providers and quantizations: OpenRouter announced that providers and quantizations now have slugs, enhancing developer experience; see their tweet and the sketch after this list.
  • Google releases Gemini Flash 2.5, launches on OpenRouter: Google DeepMind launched Gemini Flash 2.5, and it is already available on OpenRouter; test it out here: google/gemini-2.5-flash-preview-05-20.
  ‱ Meta provides Llama, available only open-access on OpenRouter: Meta’s new Llama provider is now live, exclusively open-access on OpenRouter, and free to start; it includes a new 3.3 8B model (see their tweet).
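
As a rough sketch of how the new slugs might be used, the request below filters providers by a quantization slug while targeting the newly listed Gemini model. The exact shape of the "provider" block (and its "quantizations" field) is an assumption based on OpenRouter's provider-routing options, not something confirmed in the announcement.

```python
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer <OPENROUTER_API_KEY>"},
    json={
        "model": "google/gemini-2.5-flash-preview-05-20",
        "messages": [{"role": "user", "content": "Summarize today's I/O announcements."}],
        # Assumed field names: restrict routing to providers serving an fp8 quantization.
        "provider": {"quantizations": ["fp8"]},
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```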

OpenRouter (Alex Atallah) ▷ #general (235 messagesđŸ”„đŸ”„):

Gemini 2.5 Pro DeepThink, Veo 3, Imagen 4, Gemma 3n, audio support

  ‱ New Gemini, Imagen, and Veo Models Announced: Google accidentally rolled out pricing early for the Veo 3 video generation model with audio, as well as start/end frames, extension of existing videos, and camera control for Veo 2, along with Imagen 4 and Gemini 2.5 Pro Deepthink.
    • One member shared, There goes another startup trying to solve this problem, which was countered with the opinion that Realistically all the model wrappers were just selling shovels during the gold rush.
  • Free Gemini 2.5 Flash Preview Available: gemini-2.5-flash-preview-05-20 has been deployed.
    • Members discussed access to the free API, with one member confirming it still exists if you had previously spent $10, but another member stated It would be better to just pay as you go.
  • Google’s Gemma 3n Matches Claude 3.7?: A new Gemma model, Gemma-3n-4B, is supposedly as good as Claude 3.7 according to this blog post.
    • One member found this suspicious, while another stated Idk thats Chatbot arena, so just user preference, sounds possible to me.
  • New LLMs Lack Diffusion Architectures: One member questioned is there a reason why most LLMs aren’t diffusion by now?
    • The response suggested that Diffusion requires an extensive architecture rework.
  ‱ OpenRouter API Key Integrates AI Agents in TS: A member wants to use the OpenRouter API key and base URL to create an agent with the OpenAI API in TypeScript.
    • They hope to use the code editing tool with an AI agent in a Nest.js project.
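
The member's stack is TypeScript, but the same pattern (pointing the standard OpenAI client at OpenRouter's base URL) looks like this in Python; a minimal sketch, with the model slug taken from the announcement above and the API key left as a placeholder.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="<OPENROUTER_API_KEY>",
)

resp = client.chat.completions.create(
    model="google/gemini-2.5-flash-preview-05-20",
    messages=[{"role": "user", "content": "Suggest a refactor for this Nest.js service."}],
)
print(resp.choices[0].message.content)
```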

Eleuther ▷ #general (183 messagesđŸ”„đŸ”„):

Discord bot for message deletion, Definition of 'slop' in AI, Gemini Diffusion, Mentorship in AI/ML, ARC-AGI performance improvements inspired by compression

  • EleutherAI considers Discord Bot to Combat Spam: Members discussed implementing a Discord bot that automatically deletes comments containing excessive emojis, but considered rate limits and mod pings instead of direct deletion to prevent abuse, and proposed a mod-only react to flag content.
    ‱ Members suggested using a more general react, counting reacts from regulars only, and using a private mod-only channel to make review easier and avoid irreversible deletions.
  • Defining AI ‘Slop’ Causes Philosophical Debate: Members debated the meaning of ‘slop’ in AI-generated content, with definitions ranging from low effort, poorly comprehended material to content that fails the Turing test, referencing AI slop Wikipedia page.
    ‱ The discussion touched on whether slop is inherently tied to AI, with some arguing that human-generated content can also be slop, citing overproduced or nonsensical outputs and fragility failures as examples.
  • AI Mentorship Plea Leads to Open Source Research: A member requested mentorship and guidance in AI/ML, citing interest in agentic AI but lacking an OpenAI API balance, and was directed to online labs and open source research opportunities.
    • The mentor suggested focusing on out of distribution detection.
  • Novelty Search Drives ARC-AGI performance: Members discussed the connection between novelty and slop, suggesting that slop lacks novelty and is strongly tied to a generative model’s inability to extrapolate beyond its training data distribution.
    ‱ They then touched on certain methods from ARC-AGI that leverage compression techniques and could be applied to normal text generation.
  • Members ask about Gemini Diffusion: A member opened a discussion about Google’s Gemini Diffusion model.
    • He offered to take prompts for anyone without access.

Eleuther ▷ #research (50 messagesđŸ”„):

Yi Ma's talk on Intelligence, Autoencoders and Compression, SSL Methods like DINOv2, Paper Code Releases, OpenEvolve release

  • Yi Ma’s First Principles of Intelligence: Members watched Yi Ma’s ICLR talk and discussed his first-principles approach, which grounds work in information theory and cybernetics, contrasting it with the heuristics prevalent in current models.
    • One member found the talk slow initially but appreciated the point that compression alone isn’t enough; organization is key.
  • Autoencoders Don’t Capture All: It was argued that the latent space of autoencoders might not capture all information from the original input because multiple compressed representations can suffice for reconstruction.
    • As one member put it, Not all autoencoders that do perfect reconstruction contain the same information.
  • Minimal vs. Complete Codebases: The Great Paper Release Debate: Members debated the merits of releasing minimal vs. complete codebases, with one advocating for a minRF-style approach: a minimal version alongside a more extensive research codebase.
    • The consensus seemed to be that the ideal codebase would allow users to build on it, integrate it with existing infrastructure, and easily swap out components.
  • The OpenEvolve Emerges!: A member released OpenEvolve, an open-source implementation of Google DeepMind’s AlphaEvolve, which evolves entire codebases using LLMs to discover and optimize algorithms and replicated DeepMind’s results on circle packing.

Eleuther ▷ #lm-thunderdome (3 messages):

VLM Evaluation, Text-Only Evaluations, Codebase Conditionals

  • VLM Text-Only Evals Spark Debate: A member inquired about evaluating VLMs on text-only evaluations such as coding and math using the hf-multimodal or vllm-vlm models.
    • They questioned whether the provided PR reasonably implements a fallback to hf/vllm behavior when requests lack images.
  • Opaque Conditionals Prompt Refactoring: Another member critiqued a conditional statement if requests and len(requests[0].args) < 3 for being opaque and prone to future breakage.
    • They suggested directly checking the condition of interest instead of relying on such indirect checks, and the author offered to move it into a utility function.
  • Proposed Condition Simplification Floated: The original poster suggested a refined conditional if not (requests and len(requests[0].args) >= 3 and "visual" in requests[0].args[2]): to improve clarity.
    ‱ The aim is to trigger the text-only code path whenever the request lacks sufficient arguments or has no visual attribute.
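
A sketch of the utility-function refactor floated above: wrap the indirect check in a named helper so the text-only fallback reads clearly (the helper name is illustrative).

```python
def has_visual_input(requests) -> bool:
    """True if the first request carries a multimodal 'visual' argument."""
    return bool(requests) and len(requests[0].args) >= 3 and "visual" in requests[0].args[2]

# Text-only code path (e.g. coding/math evals) when no images are attached:
if not has_visual_input(requests):
    ...  # fall back to plain hf / vllm behavior
```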

Cursor Community ▷ #general (207 messagesđŸ”„đŸ”„):

25 tools limit kills the chat, DeepSeek-R1T-Chimera model breaks loop, MCPs refresh frequently, Gemini's thinking process changed, O3 Pro coming soon

  • Cursor has a tool limit bug: Hitting the 25 tools limit and accepting before continuing kills the chat, according to a Reddit post.
  • DeepSeek-R1T-Chimera Breaks Free of Endless Loops: The DeepSeek-R1T-Chimera model, fine-tuned between R1 and V3, is reportedly the only free model that managed to break the loop in a user’s prompt testing method and is available on Hugging Face.
  ‱ MCP Refreshes cause frustration: Users frequently need to refresh their MCPs because their status goes yellow, which ends up consuming most of their requests.
  ‱ Gemini thinking differently: Users are reporting a change in Gemini’s thinking process: Gemini now summarizes what it is thinking about instead of showing the actual thinking process, which makes it harder to write good prompts and project rules.
  • O3 Pro might be just around the corner: There’s anticipation for the release of O3 Pro, with some speculation based on a tweet from someone who worked at OpenAI hinting that it was coming soon, with the possibility of it being released in June.

Modular (Mojo đŸ”„) ▷ #general (24 messagesđŸ”„):

Running models without CUDA, MAX and HF models, Robotics models and data streams, Cosmos 'world model', Porting from PyTorch to MAX

  • MAXing Out Models Sans-CUDA: Users inquired whether models like GR00T-N1-2B or pi0 could run without the CUDA stack using MAX and Mojo.
  • MAX Graphs Needing Re-Implementation: While MAX interoperates with Hugging Face for tokenizers, preprocessing, weights, and hyperparameter configuration, the core model architecture needs to be built in a MAX graph.
  • Robotics Models’ Data Stream Demands: Wiring up robotics models requires specific data streams that are difficult to provide in a generic way, demanding piles of sensor inputs.
    • It was clarified that the focus was solely on the NN model, not the entire data pipeline.
  • Cosmos: A World (Model) of GPU Consumption: Cosmos is described as a world model that combines a physics engine and a video model, but it demands significant VRAM and is thus taxing on the GPU.
  • Modular Mapping PyTorch to MAX: Modular is documenting the porting process from a PyTorch core architecture to a MAX graph, with tutorials and reference documentation planned for docs.modular.com.

Modular (Mojo đŸ”„) ▷ #mojo (174 messagesđŸ”„đŸ”„):

False positive warnings in 25.3, Unused variable warnings in Mojo, fn() -> raises syntax, IO API design with parametric traits, DMA-based APIs

  • False Positive Warnings Appear in Mojo 25.3: A member reported an increased number of false positive warnings in version 25.3 related to unused assignments, specifically warning: assignment to 'suggested_max_length' was never used; assign to '_' instead, as described in this bug report.
  ‱ Mojo gives Unused variable warnings: The Mojo compiler incorrectly flags variables as unused when only their alias values are referenced; this is an issue with def function declarations, confirmed by switching to fn() -> raises, which resolved the false positive.
    ‱ For example, var r2 = r1.simplify() triggers the warning even when r2.N and r2.D are used later; the fix is to change the def declaration to fn() -> raises.
  ‱ fn() -> raises syntax is accepted by the Mojo parser: The parser accepts the syntax alias func = fn(), which is syntactic sugar for fn() -> None. Some members consider this a bad idea: it is kept for consistency with other return-type syntax but goes against other languages and their type-signature requirements, as seen in this code snippet.
  ‱ Parametric Traits Spark Debate for IO API Design: Members discussed designing an IO API using parametric traits to handle buffer ownership and genericity over owned or ref types, since io_uring requires passing ownership of buffers.
    • Concerns were raised about designing around features not yet available and the potential for creating overly complex or restrictive APIs, with some advocating for a pragmatic approach that addresses current capabilities while acknowledging future improvements.
  ‱ Modular transitions GPU Puzzles from Magic to Pixi: The Modular team deprecated Magic in favor of Pixi for the Mojo GPU puzzles; magic is just a wrapper around pixi, as seen in this commit.

Modular (Mojo đŸ”„) ▷ #max (7 messages):

Max vs Fireworks.ai/Together.ai/Groq.com, vLLM comparison, Optimize Max for lower latency and higher throughput, Max imports source code visibility, Enterprise solution with large scale disaggregated inference

  • Max Claims Full-Stack Control: A user inquired about how Max compares to Fireworks.ai, Together.ai, or Groq.com in terms of AI inference performance and uptime, especially given their claims of superior speed and platform maturity.
    • The response emphasized that Max provides full control over the stack, enabling custom work with performance, latency, accuracy control, and the ability to program at every layer, while also rivaling Dynamo across compute.
  • Debunking Perf-Marketing Claims on vLLM: It was noted that comparing vLLM to endpoint providers isn’t an apples-to-apples comparison, and users should be wary of perf-marketing claims, particularly those comparing against older versions of vLLM.
    • The message cited examples from Together.ai and Fireworks, with links (Together.ai, Fireworks.ai) where comparisons are made against older versions of vLLM, and different configurations.
  • Optimization Focuses on Latency and Throughput: A user indicated that their primary goals are lower latency and higher throughput, aiming to minimize costs, asking what the strongest selling points were for Max/Mojo.
    • The modular team invited them to discuss their enterprise solution, highlighting a large-scale disaggregated inference solution and suggested contacting them at [email protected].
  • Max Imports Code Remains Partially Hidden: A user inquired about the source code for Max imports, visible in example code, requesting to see where to find them (referencing an attached image).
    • It was clarified that the source code isn’t fully open-sourced but can be inspected in the user’s environment, with some parts calling into C++ code.

Notebook LM ▷ #announcements (3 messages):

NotebookLM mobile app release, Audio Overviews customization, Google I/O Keynote summary, Video Overviews feature preview

  • NotebookLM’s Mobile App Drops!: The NotebookLM mobile app is now officially live with an MVP feature set, prompting users to provide feedback and feature requests; learn more on the Google Blog.
  • Audio Overviews Now Adjustable!: Users can now control the length of Audio Overviews (English only) with short (~5+ min), long (~20+ min), and default (~10+ min) settings to customize the depth and length of AI-hosted discussions.
  • Google I/O 2025 Keynote Recapped: A notebook summarizing everything from this year’s #GoogleIO keynote has been made available for those who missed it; check out the summary here.
  • Video Overviews Teased!: A preview of the new Video Overviews feature was posted on X.

Notebook LM ▷ #use-cases (23 messagesđŸ”„):

Pronunciation issues in podcasts, Exporting timelines to Google Calendar, Integrating NBLM into Discord, AI Protocols to prevent source alteration, NotebookLM mobile app

  • Podcast Pronunciation Problems Plague Patrons: A user jokingly reported that despite providing clear instructions on the pronunciation of “RAGBRAI” for a generated podcast (notebook link), the AI managed to pronounce it in four different ways.
    • Other members suggested trying official pronunciation notation (IPA) and adding the pronunciation instructions to the prompt.
  • Preventing Protocol Problems: AI Anti-Hallucination Heroics: A user created custom system instructions (dubbed “LSTA Agent” and “Ultimate AI Anti-Hallucination Integrity protocol”) to prevent AI from altering source material.
    ‱ This was in response to another user modifying a document by replacing names with Star Wars, Detective Columbo, and Homer Simpson characters (notebook link).
  • NotebookLM mobile app arrives: Google launched the official NotebookLM mobile app (announcement link) for both iOS and Android, featuring offline Audio Overviews, interactivity, and seamless content sharing.
    • The app is designed to make NotebookLM more accessible and useful for a variety of users, including students, professionals, and anyone who works extensively with information.
  ‱ Language Limitations loom: Users asked if resources could be uploaded in languages other than English, and the answer was that only English and French are officially supported.

Notebook LM ▷ #general (153 messagesđŸ”„đŸ”„):

NotebookLM Android app feedback, Podcast generation, File size limits, Output language options, Sharing notebooks

  • NotebookLM Launches Android App: NotebookLM launched an Android app, but initial feedback suggests it lacks features compared to the web version, such as Discover Sources and notes, and users are wondering about compatibility for foldable devices.
    • Some users prefer the mobile website experience which, on Android, can be saved as an “app” for easy access.
  ‱ Podcast Feature Generates Buzz: Users are exploring NotebookLM’s podcast generation capabilities for creating content on new topics, but are facing issues with audio length limits, where there currently seems to be a 6-8 minute limit.
  • User Asks About File Size Limits for Uploads: Users are requesting an increase in the file size limit for uploads from 200MB to 400/500MB.
  • Language Options Limited: Users have noted that changing the output language is only possible on the web version and are looking forward to this feature being added to the app.
    • Some users pointed out that Chat responses and Audio Overviews will be generated using your device language, unless you have set the language override on the NotebookLM website.
  • Subscription Tiers Pricing Discussed: Users are discussing the pricing structure for NotebookLM subscriptions, with concerns raised over the cost of $250/month for the Plus tier after an introductory offer.
    • There is confusion about the offer, with some interpreting it as $125/month for the first 3 months, followed by the full price.

GPU MODE ▷ #general (12 messagesđŸ”„):

cutotune autotuner, FSDP1 vs FSDP2, Liger-Kernel, multihead GRU layers in cute-kernels

  • Cutotune Autotuner Tunes CUDA Kernels: A member introduced cutotune, a simple autotuner for tuning any python code calling kernels on an NVIDIA GPU, and claims that it should be easily extensible to other accelerators, working with PyTorch compile.
    • The autotuner allows overriding the autotuning parameters without commenting out the autotuner, and it allows tensor property-based and functional triggers based on stride/dtype, a superset of the key in the Triton autotuner.
  • FSDP2 uses less VRAM but slower: A member tested FSDP1 and FSDP2, observing that FSDP2 uses much less VRAM but is slower, with results from RTX 4000 Ada x4 showing different performance for full and lora setups.
    ‱ With a batch size of 4, the member measured: full fine-tune: FSDP1 15.801 GB at 7.3 s/step vs FSDP2 11.599 GB at 13.39 s/step; LoRA: FSDP1 13.783 GB at 8.58 s/step vs FSDP2 10.758 GB at 10.45 s/step.
  • Liger-Kernel Performance: A member tested the application of Liger-Kernel to both FSDP1 and FSDP2, where FSDP2 consistently resulted in slower performance.
    ‱ The results with Liger-Kernel were: full fine-tune: FSDP1 5.426 GB at 9.03 s vs FSDP2 3.92 GB at 15.41 s; LoRA: FSDP1 3.351 GB at 9.2 s vs FSDP2 2.639 GB at 10.74 s.
  • Multihead GRU Layers Added to Cute-Kernels: A member announced the addition of multihead GRU layers written in Triton to cute-kernels, enabling parallelization across SMs, and linked to the relevant directory in the cute-kernels repository.
    • The addition of multihead GRU layers allows for parallelization across SMs, enhancing the efficiency of kernel execution.

GPU MODE ▷ #triton (2 messages):

Triton CPU Support, TRITON_INTERPRET API, CPU Parallelism Limitations

  • Triton Lacks Direct CPU Parallelism: A member clarified that Triton doesn’t directly support CPUs for parallelism, limiting its multicore capabilities.
    ‱ The user added that, at the very least, it is not efficient for multi-core parallelism.
  • TRITON_INTERPRET API Provides Sequential CPU Simulation: The discussion highlighted using the TRITON_INTERPRET=1 API as an alternative for CPU execution, which mimics the parallel scheme sequentially.
    • It was implied that this approach, while sequential, imitates the parallel scheme almost perfectly for certain purposes.
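
A minimal sketch of the interpreter route: set TRITON_INTERPRET=1 before importing Triton, and the kernel runs sequentially on CPU tensors (a simulation of the parallel scheme, not multi-core parallelism).

```python
import os
os.environ["TRITON_INTERPRET"] = "1"   # must be set before importing triton

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x, y = torch.arange(8, dtype=torch.float32), torch.ones(8)
out = torch.empty(8)
add_kernel[(1,)](x, y, out, 8, BLOCK=8)
print(out)  # runs on CPU under the interpreter
```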

GPU MODE ▷ #cuda (6 messages):

CUDA Usage, CGO Impact, GPU Utilization

  • CUDA Usage Plummets to Zero: A member reported that CUDA usage showed around 0% when running code, but switching to CUDA increased it to 100%.
    • They sought help understanding the initial problem, showing 0% utilization, even with CUDA enabled.
  • CGO Suspected of Causing Issues: A member mentioned their program appears as a C program instead of C+G and is run through CGO, suspecting it may be the root cause.
    • They reported that the program shows up as C (not C+G), suggesting CGO might be interfering with proper GPU detection.
  • GPU Utilization Displays Anomalous Results: A member noted that the system displays N/A for GPU utilization across all processes, including explorer.exe.
    • The member also stated their program displays 0% utilization for everything.

GPU MODE ▷ #torch (1 messages):

CUDA graph model capture, Distributed operations in models

  • CUDA Graph Captures Model’s Distributed Ops: A member reported that when a model is captured by CUDA graph, all distributed operations within the model are placed in their own stream.
  • Stream Placement Inquiry: They inquired whether this behavior is expected, hinting at potential implications for CUDA graph model execution.

MAXSUN Arc Pro B60 Dual, SageAttention, Gemini Diffusion

  • MAXSUN Arc Pro B60 Dual Review: A YouTube review of the MAXSUN Arc Pro B60 Dual was shared.
  • SageAttention’s Clean Codebase: The clean codebase for SageAttention was highlighted, with a link to the GitHub repository.
  • Google Swallows Diffusion Pill with Gemini: Google’s Gemini Diffusion model was mentioned, indicating Google’s entry into the diffusion model space, with link to DeepMind’s models page.

GPU MODE ▷ #torchao (4 messages):

Axolotl QAT/PTQ Workflow, Llama3.2 Quantization, OpenAssistant/oasst1 Dataset Evaluation

  • Axolotl QAT/PTQ Workflow Examined: Members scrutinized an Axolotl QAT/PTQ workflow, comparing their respective configurations for fine-tuning and quantization.
    • The key difference between the two flows is the absence of prepare_model_for_qat.
  • Llama3.2 Quantization Commands Revealed: The commands for quantization, starting from the Llama3.2 config are axolotl train config.yaml, and then axolotl quantize config.yaml.
    • The same QAT config is passed to the quantize CLI to ensure an identical PTQ schema is applied.
  • OpenAssistant/oasst1 Dataset Shows Good Results: A member reported achieving good results with the OpenAssistant/oasst1 dataset using default settings in the same configuration.
    • The discussion suggests this dataset is a viable option for evaluating model performance.

GPU MODE ▷ #off-topic (6 messages):

Microsoft Build Conference, Network Connection Issues, LB broken

  • Build Conference Attracts Insane Aura: A member mentioned attending the Microsoft Build conference in Seattle and noted the speaker had an insane aura.
    • It was unclear which speaker they were referencing, but they seemed impressed by the presentation.
  • Spotty Connection During Keynote: A user joked about having a 2kbps connection during a keynote presentation.
    • This implies they were experiencing technical difficulties while trying to follow the event remotely.
  • LB broken: A user mentioned looking at all the broken stuff on their LB.
    • It is unclear what LB refers to, but likely Load Balancer.

GPU MODE ▷ #irl-meetup (1 messages):

CUDA Developer Meet Up, NVIDIA, UCL, London, Python-native GPU programming

  • NVIDIA & UCL throw CUDA Developer Meet Up in London: NVIDIA and UCL are hosting a CUDA Developer Meet Up on June 5 in London, inviting developers of all levels to connect with CUDA experts.
    • The event promises hands-on talks on Python-native GPU programming, discovery of new libraries like nvmath-python and cuda-core, and networking opportunities over pizza with chances to win NVIDIA swag (RSVP here).
  • Connect With CUDA Experts at London Meetup: Attend the CUDA Developer Meetup in London to connect with CUDA experts from NVIDIA and UCL.
    • The event will also explore Python-native GPU programming through hands-on talks and introduce new libraries like nvmath-python and cuda-core.

GPU MODE ▷ #self-promotion (1 messages):

OpenEvolve release, Evolutionary coding agents, LLMs for algorithm optimization

  • OpenEvolve Open-Sources Algorithm Evolution: A member released OpenEvolve, an open-source implementation of Google DeepMind’s AlphaEvolve system.
    • It’s an evolutionary coding agent that uses LLMs to discover and optimize algorithms; the author successfully replicated DeepMind’s results on circle packing (99.97% match!).
  • Evolve Codebases with OpenAI-compatible APIs: OpenEvolve evolves entire codebases (not just single functions) and works with any OpenAI-compatible API.
    • It uses a LLM ensemble approach for better results and multi-objective optimization; check it out on GitHub and in the blog post.

GPU MODE ▷ #🍿 (2 messages):

Reasoning Models, Pass @K

  ‱ Reasoning Model Heuristics: A member suggested that pass @20 is the poor man’s reasoning model.
    • Another member agreed, stating the original idea was that pass @20 is the equivalent of a reasoning model.
  • Pass @K Discussion: The discussion revolved around using a higher pass value (@20) as a heuristic for improving model reasoning.
    • It was implied that this approach could serve as a simplified substitute for more complex reasoning models.
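
For context on the heuristic, the standard unbiased pass@k estimator (as used in the Codex/HumanEval evaluation) can be computed as below, given n samples per problem with c correct; the example numbers are illustrative only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k given n samples with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 3 correct answers out of 40 samples -> estimated pass@20
print(pass_at_k(n=40, c=3, k=20))
```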

GPU MODE ▷ #thunderkittens (1 messages):

simran9493: https://www.youtube.com/watch?v=xcpEl0cGCC4


GPU MODE ▷ #reasoning-gym (1 messages):

rasdani: awesome! looking forward to the paper 🙂


GPU MODE ▷ #submissions (47 messagesđŸ”„):

MI300 Leaderboard Updates, AMD-FP8-MM performance, Histogram Leaderboard, MLA Decode Results, Mixture of Experts Leaderboard

  • MI300’s AMD-FP8-MM Race Heats Up!: A flurry of submissions hit the amd-fp8-mm leaderboard, with one member snagging 1st place on MI300 at 121 ”s and another at 132 ”s for 2nd place.
    • Several other submissions also landed successfully on MI300 at various speeds, showcasing a competitive push for optimization in the amd-fp8-mm category.
  • Histogram Hits the Heights: One member achieved 1st place on the histogram leaderboard across multiple platforms, clocking in at 36.3 ”s on A100, 23.5 ”s on H100, and 68.5 ”s on L4.
  • MLA Decode Domination: A user secured 1st place on the amd-mla-decode leaderboard with a blazing 7351 ms on MI300, while others achieved 2nd and 3rd place at 7574 ms and 8875 ms, respectively.
    • These results highlight the ongoing advancements and competition within the amd-mla-decode space.
  • AMD Mixture of Experts Excellence Emerges: Submissions to the amd-mixture-of-experts leaderboard showed promise, with one user taking 1st place on MI300 in 9.45 ms and a 3rd place at 9.70 ms.
    • Further reports of successful submissions at 128 ms and 25.4 ms show a dynamic playing field.
  • Copy-Paste Sanity Check Implemented: After a member tested an issue, the team confirmed a “sanity check” is now in place to prevent accidental code exposure.
    • They thanked the user for the notification and stated they “will fix” any loopholes, ensuring code security.

GPU MODE ▷ #status (6 messages):

Leaderboard Explanations, Histogram Submission Error

  • Call for Leaderboard Solution Explainers: Organizers requested that top leaderboard participants write short explainers of their solutions for educational purposes, intending to share them on the GPU Mode site and at a workshop with AMD.
    • The goal is to aid others in learning to write fast GPU code, emphasizing that this competition is an excellent way to facilitate such learning.
  • Histogram Submission on L4 GPU Hits Snag: A member reported receiving an “unexpected error” when submitting a solution to the histogram leaderboard on the L4 GPU.
    • Admins requested more details, asking whether the user checked for common issues (like presence of \ in their code), and whether they used Discord or the CLI to submit.

GPU MODE ▷ #factorio-learning-env (14 messagesđŸ”„):

FLE Use-Cases and Evaluation, Colab Server for Agent Prototyping, Factorio TAS Generator, Gym Interface for FLE, Meeting Time Coordination

  • FLE Use-Cases and Evaluation Areas Expand: The initial goal is to understand the interesting use-cases and evaluation areas that an (unbounded) environment unlocks.
    • The original lab and open-play were the obvious starting points but members expressed they are just scratching the surface over here.
  • Colab Server Enables Agent Prototyping: A member will create a Colab server this week to make agent prototyping easier, citing tbench.ai as inspiration for easy contribution/testing of agents.
  • Factorio TAS Generator Draws Interest: A member is seeking a full-time project and this could be it, referencing Factorio TAS Generator (FTG) (github.com/MortenTobiasNielsen/Factorio-TAS-Generator/issues/76).
    ‱ The member created FTG to provide an interface for Tool Assisted Speedruns and used it for two world records; it is now used in the tas subchannel for speedrunners on the Factorio Discord server.
  ‱ Gym Interface is in the Works for FLE: FLE originally supported Gym, but moved away from a Markov Decision Process approach towards LLMs.
  • Meeting Time Coordination Averted: A member expressed initial scheduling conflicts with a meeting at 16:00 UTC.
    • They realized they had misinterpreted UTC vs Copenhagen time, and their meeting is actually from 13:00-16:00 UTC, so they might be able to join while on their way home.

GPU MODE ▷ #amd-competition (32 messagesđŸ”„):

MLA decode kernel, File Size Submission Limit, FP8-GEMM issues, MoE Submission down

  • MLA Decode Kernel Released: The MLA decode kernel is now available, with one user humorously claiming first place and challenging others to surpass their solution, as announced in this image.
  • File Size Submission Limit Troubles: Users encountered issues submitting code, with solutions exceeding 50KB, but managed to submit by stripping out comments and unused code. This image shows a screenshot of the error.
    • It was determined the file size limit was around 35KB, and a developer mentioned it was due to a GitHub Action payload limit and promised to push a fix to raise the limit tomorrow.
  • FP8-GEMM Submission Errors Plague Users: Several users reported issues with FP8-GEMM submissions, receiving errors even on previously working submissions, but benchmarking still reported OK. This screenshot details the error received.
    • A developer offered to investigate, requesting the failing files to reproduce the issue and find the root cause.
  • MoE Submission Totally Down?: Some users reported that MoE submissions were completely non-functional, and a developer hadn’t yet tried MoE submissions to confirm the issues, but stated they were out of time to investigate at that moment.

GPU MODE ▷ #cutlass (7 messages):

Cutlass DSL Python Windows support, CUTLASS thread tiling error, CUTLASS GTC slide outdated

  • CUTLASS Thread Tiling Troubles: A user encountered an error with thread tiling in CUTLASS, specifically an AssertionError arising from cute.local_partition within the transpose_naive_kernel function as described in this GitHub Gist.
    • This issue is recognized as a bug and is currently being addressed by the CUTLASS developers (issue #2314).
  • Outdated CUTLASS GTC Slide Alert: A user noted that a CUTLASS GTC slide is outdated, directing users to reference updated examples in the final release.
  • CUTLASS Yearns for Windows: A user inquired about planned support for CUTLASS DSL Python on Windows.
    • A developer confirmed that Windows support is needed but did not provide an estimated time of arrival (ETA).

GPU MODE ▷ #singularity-systems (2 messages):

Picograd, Rust implementation, Pedagogical Resource

  ‱ Picograd: From Zero to Hero: The singularity systems: zero -> hero course follows on from karpathy’s neural networks: zero -> hero, turning micrograd into picograd.
    • The project’s short term goals are to implement picograd that can train and inference basic networks (ffn, rnn, lstm, gpt) and record “from scratch line by line spelled out” video lectures.
  ‱ Rust Rules: Python Implementation cut for Picograd: The project is focused purely on the single Rust implementation for now.
  ‱ Pedagogical Picograd: The Learning Resource: If done right, picograd will become the pedagogical resource for pytorch, the same way clang has chibicc, LLVM has qbe, and linux has xv6.

Nous Research AI ▷ #general (95 messagesđŸ”„đŸ”„):

Google's Code Agent, Google I/O Announcements, Gemma 3n Model, Gemini Diffusion Model, Decentralized AI

  • Google’s Code Agent sparks Competition Frenzy: Google is launching a new code agent (real.azure), sparking excitement and comparisons within the AI community about its capabilities and potential.
    • Some members have been waitlisted, eagerly anticipating access, while others speculate on a formal announcement at Google I/O.
  • Gemma 3n debuts new Architecture: Gemma 3n, a new generative AI model optimized for edge devices, uses a new architecture as detailed in this paper, with innovations in parameter-efficient processing.
    • The model is designed to reduce memory footprint, includes audio input handling, and integrates text and visual data, marking it as a true edge model for smart devices, possibly for smart glasses.
  • Gemini 2.5 Flash and Gemma 3n Accessible in AI Studio: Members can now access Gemini 2.5 Flash 0520 and Gemma 3n within Google’s AI Studio, though some features, like Deep Think, are in closed beta.
    • A member noted that the Gemini diffusion model seems a bit shy, only generating functioning code in 1 out of 8 trials.
  • Decentralized AI cyberdreams: A member expressed a vision of a cyberpunk future enabled by Nous and Prime, where RTX 5090 could work with RTX 3060 for training and inference.
    ‱ They envision a global AI that exists everywhere at any time, is controlled by decentralized systems, and cannot be shut down.
  • Diffusion Models aim to accelerate text generation: Diffusion models are being explored for their ability to process chunks of text in parallel and generate text in a non-causal way, potentially speeding up text generation.
    • Members suggest diffusion models could enable parallelized chain-of-thought reasoning, and that they should theoretically be better at joke writing as they can consider the punchline from the outset, although the models may be limited by token limits.

Nous Research AI ▷ #ask-about-llms (3 messages):

Restricting model domains, AI models in education, Gemini Flash, AI as a teaching assistant

  • Model Domain Restrictions Proving Difficult: A member is researching which model is currently the best at closely following instructions, but another member stated that it’s very hard to restrict models to a domain or a task.
  • AI Teaching Assistants as a Solution: A member wants to create an environment where students can leverage the power of AI models to enhance their development and accelerate their learning, while still maintaining a healthy balance with human instruction.
    • Their goal is to restrict the model, at the very least, to avoid solving problems directly and instead act as a guide or assistant, much like a teacher who supports the student in reaching their own conclusions through individual effort.
  • AI Models used as teacher substitutes: One member observed a widespread issue in education today, specifically that students are increasingly using AI models as full substitutes for teachers, which is making in-person education seem dispensable.

Nous Research AI ▷ #research-papers (1 messages):

LLMs spontaneously generate social conventions, Collective biases in decentralized LLM populations, Adversarial LLM agents driving social change

  ‱ LLMs Spontaneously Socialize and Generate Conventions: According to a new paper, “Social conventions spontaneously emerge in decentralized LLM populations”, universally adopted social conventions arise through purely local interactions between agents.
  • LLMs Show Collective Bias: The paper also notes that strong collective biases can emerge during this process even when agents exhibit no individual bias.
  • Adversarial LLMs as Social Change Agents: The study found that committed minority groups of adversarial LLM agents can drive social change by imposing alternative conventions once they reach a critical threshold.

OpenEvolve Release, Evolutionary Coding Agents, Google DeepMind's AlphaEvolve, LLMs for Algorithm Optimization, Matformer architecture

  • OpenEvolve is unleashed on GitHub: An open-source implementation of Google DeepMind’s AlphaEvolve system, called OpenEvolve, has been released, replicating DeepMind’s results on circle packing with a 99.97% match.
    • The agent uses LLMs to discover and optimize algorithms, evolving entire codebases and working with any OpenAI-compatible API, with multi-objective optimization and an LLM ensemble approach, linked to its GitHub repository and blog post.
  • Matformer arch and Gemma 3n models focus renewed: With the new release of Gemma 3n models and a renewed focus on Matformer architecture, there is promising potential in algorithm optimization.

HuggingFace ▷ #general (52 messagesđŸ”„):

Xet file size limits, HuggingFolks role, LLM recommendations, Training Data Errors, Hugging Face collaboration

  • Xet Team Addresses File Size Limits with Aplomb: A Xet team member responded to a question about file size limits, indicating that while limits will exist, they are still determining the specifics, aiming for at least 200GB to accommodate 70B models.
    • They highlighted the ability to upload and download files >50GB using Xet-enabled paths (like this example), but full support on the web side is still in progress.
  • HuggingFolks Role Hunt Heats Up: A member asked if a user could be given the HuggingFolks role, pending permissions from Ben to set it up.
    • The user getting the role also wanted to know how to get a cat above their avatar (it’s an avatar decoration).
  • Members Seek Speedy LLM Suggestions: A member requested recommendations for a fast, light, and strong LLM to run locally.
  • Training Data Error Troubles Training: A user reported a consistent error: Maximum retry attempts exceeded for page when fetching training data, seeking advice on the cause and resolution.
  • Bioinformatics Expert Eyes Hugging Face Collaboration: A member working in biodata at EMBL-EBI (https://www.ebi.ac.uk/) expressed interest in collaborating with the Hugging Face team on automated curation using LLMs and RLVR with biocuration data.
    • Another member suggested contacting a specific user for potential collaboration.

HuggingFace ▷ #today-im-learning (1 messages):

Research Paper Reading Workflow, Summarization tools, YouTube Video Explanations, Paper Selection Criteria

  • User Seeks Efficient Research Paper Workflow: A user is seeking a more efficient workflow for reading research papers, especially for individuals who are not fast readers or mathematically inclined, and is using Gemini 2.5 Pro to summarize pasted article PDFs.
    • The user’s current workflow involves asking Gemini to explain each variable and refining understanding through iterative questioning, but is looking for alternatives like YouTube video explanations.
  • Leveraging YouTube for Paper Explanations: The user suggests using YouTube videos to understand research papers, while acknowledging the need to sift through low-quality content.
    ‱ They believe a great video explaining a specific paper would be valuable and wonder how others choose papers and how long it takes to fully understand them.
  • Paper Selection Criteria: The user asked how others are choosing what papers to read, and how long it takes to completely understand a paper.
    • They also mentioned that they are not super smart at math, or even programming for that matter.

HuggingFace ▷ #i-made-this (12 messagesđŸ”„):

Video dropping page, Browser AI tool calls, Data transformations with LLMs, MCP server support, Cyberdesk computer agent

  • Video Dropping Page Finishes After Two Days: After two days of work, a member successfully completed their video dropping page, which included research for FLAST and a new design.
    • The member expressed great satisfaction in completing the task without quitting, and attached a screenshot of the completed video dropping page.
  • Browser AI Supports Tool Calls: A new version of Browser.AI was released, which now supports tool calls.
    • Browser.AI is a prototype browser demonstrating the power of running open-source models on the device, and currently supports chat, tool calls, and embeddings; feedback is welcomed via browser.christophermckenzie.com.
  • Vitalops Creates Data Transformation Tool: Vitalops introduced a new open-source tool, DataTune, that performs data transformations with simple natural language instructions and LLMs.
    • The tool aims to alleviate concerns about context length and high API costs associated with data transformations using LLMs.
  • Cyberdesk Builds Computer Agent in a Weekend: A member introduced Cyberdesk, a tool to build a computer agent in a single weekend, utilizing any Hugging Face pixel-based model.
    • The ProductHunt launch can be found at ProductHunt, where upvotes are appreciated.
  • OpenEvolve System Released: An open-source implementation of Google DeepMind’s AlphaEvolve system called OpenEvolve was released.
    • This evolutionary coding agent, found on GitHub, uses LLMs to discover and optimize algorithms, successfully replicating DeepMind’s results on circle packing and evolving a random search into a simulated annealing algorithm, further explained in their blog post.

HuggingFace ▷ #reading-group (1 messages):

arpitbansal.: By any chance recording available for the recent session??


HuggingFace ▷ #computer-vision (6 messages):

Stanford CS231n lectures, Estimating bathymetry (sea depth) from Sentinel 1 SAR images, Object Detection, Segmentation Model

  • Stanford CS231n lectures spot-on: A member recommends going through the Stanford CS231n lectures by Andrej Karpathy due to their spot-on intuition, also suggesting classic computer vision and machine learning resources like Andrew Ng’s courses on YouTube.
  ‱ Segmentation Model Suggested for Object Outlines: A member suggests choosing a segmentation model like YOLOv11-seg for object outlines, since it provides a segmentation mask from which outlines can be drawn with cv2.findContours (see the sketch after this list).
  • Sentinel 1 SAR Images for sea depth: A member seeks guidance on estimating bathymetry (sea depth) from Sentinel 1 SAR images, using multiple snapshots of the same location at different times as input to deduce water depth, due to varying imaging conditions and tides.
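
A hedged sketch of the outline suggestion above, using the ultralytics package's YOLO11 segmentation weights and OpenCV; the checkpoint name and image path are placeholders, and it assumes at least one instance is detected.

```python
import cv2
import numpy as np
from ultralytics import YOLO

model = YOLO("yolo11n-seg.pt")               # small segmentation checkpoint (placeholder)
result = model("image.jpg")[0]                # placeholder image path

# Take the first instance mask, resize it to the original image, and binarize it.
mask = result.masks.data[0].cpu().numpy()
mask = cv2.resize(mask, (result.orig_img.shape[1], result.orig_img.shape[0]))
mask = (mask > 0.5).astype(np.uint8) * 255

# Trace the outline and draw it on a copy of the original image.
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
outlined = cv2.drawContours(result.orig_img.copy(), contours, -1, (0, 255, 0), 2)
cv2.imwrite("outlined.jpg", outlined)
```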

HuggingFace ▷ #NLP (1 messages):

BERT-style model inference, Logit Differences, Candle vs PyTorch

  • BERT Logit Variance Across Frameworks: A member inquired if it’s unexpected to get significantly different logits when running inference on the same BERT-style model and task in different libraries like Candle and PyTorch, even with identical tokenization and config files.
    • They noted that classification results remain largely consistent despite logit variations.
  • Inference Discrepancies: Candle versus PyTorch: Despite identical tokenization and configurations, running the same BERT-style model in Candle and PyTorch can yield significantly different logits.
    • The user highlighted that while logits differ, the final classification outcomes are largely the same.

HuggingFace ▷ #agents-course (30 messagesđŸ”„):

GAIA formatting issues, Ollama Setup Help, LiteLLMModel, InferenceClientModel, AI Agent Course Certificate Sharing on LinkedIn

  • GAIA Formatting Demands Exact Matches: Members discussed that GAIA requires an exact match for answers, which can be tedious, and suggested massaging the system message to improve results.
    • It was mentioned that smolagents, without system message modifications, submit in a more GAIA-friendly format.
  • Troubleshooting Ollama Setup: A course participant sought guidance on the next steps after setting up Ollama, running ollama pull and ollama serve, encountering and debugging an error, and seeing timestamps with CPU configuration.
    • Another member suggested replacing InferenceClientModel with LiteLLMModel in the course code to use Ollama effectively.
  ‱ LiteLLMModel is recommended in agent courses: When users encounter InferenceClientModel in the course code, they should change it to LiteLLMModel to use Ollama (see the sketch at the end of this list).
    ‱ A link to an updated notebook on GitHub that addresses bugs in the LoopEvent section was shared (notebook link).
  • Cloning Project Repo Fixes Bugs: A user reported solving an issue by creating a new clone of the project repo and copying files over, noting that a full rebuild in settings didn’t fix the problem.
    • They said something broken in the spaces that the full rebuild in settings won’t even fix.
  • Certificate Sharing Dilemma: A user who obtained the final certificate for the agents course inquired about the option to share it on LinkedIn.
    • No solution was found in the given messages.
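
As a rough sketch of the Ollama swap described above: point smolagents at a local Ollama server through LiteLLMModel instead of InferenceClientModel. The model id and port are assumptions; any model pulled with ollama pull should work.

```python
from smolagents import CodeAgent, LiteLLMModel

model = LiteLLMModel(
    model_id="ollama_chat/qwen2.5:7b",     # assumed local model name
    api_base="http://localhost:11434",     # default Ollama endpoint
)
agent = CodeAgent(tools=[], model=model)
print(agent.run("What is 7 * 6?"))
```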

Latent Space ▷ #ai-general-chat (92 messagesđŸ”„đŸ”„):

Perplexity Free Tier Costs, AI Builder Survey, Jules: Asynchronous Coding Agent, Coding Agents Comparison, Anti-AI Sentiment on Forums

  • Perplexity Spends Millions on Free Tier: Perplexity spends $33 million per year on its free tier, labeling the cost as R&D, according to this tweet.
  • Jules: Google’s Asynchronous Coding Agent Launches: Jules, an asynchronous coding agent from Google, has been rolled out, showcased in a post and on its website.
  • Coding Agents’ React Skills Put to the Test: Members discussed the limitations of current coding agents in handling complex React code, noting that models often produce ‘maximum update depth exceeded’ errors and struggle with refactoring existing codebases.
  • Anti-AI Sentiment Runs Rampant on Forums: A discussion arose about the prevalence of anti-AI sentiment on platforms like Hacker News and Reddit, with some members suggesting these forums reflect echo chambers rather than broad societal views.
    • One member suggested a need for ‘an AI-friendly Hacker News’ and another proposed that the current Discord channel serves that purpose.
  • Gemini Evolved into an AI Operating System: At Google I/O 2025, Gemini has evolved into a full AI operating system with a new suite of tools extending beyond chat functionality, according to this thread.

Yannick Kilcher ▷ #general (27 messagesđŸ”„):

MLOps Courses, Features trained on single image, GNN implementation with torch_geometric

  • MLOps Course Recommendations Sought: A member asked for recommendations for courses to learn MLOps, and expressed appreciation for the work of another member.
    • Another member responded that the questions didn’t seem dumb to them.
  • Single Image Training Sucks: A member stated that features trained on a single image suck and using a method with implicit heavy regularization on the features makes them more smooth.
    • This started a brief argument between two members of the channel.
  • GNN Implementation Question with torch_geometric: A member asked a question about implementing GNNs using torch_geometric for a node regression task, specifically how to structure the input data.
    • Another member offered to look at the question later, noting it was a bit specific and they were short on time, providing the following code sample: class GAT(torch.nn.Module):

Yannick Kilcher ▷ #paper-discussion (16 messagesđŸ”„):

Physics of Language Models, Knowledge Storage, Knowledge Extraction, Knowledge Manipulation, Out-of-Distribution Buzzword

  • Physics of Language Models Discussed: Members of the channel discussed Physics of Language Models: Part 3.1, Knowledge Storage and Extraction and Part 3.2, Knowledge Manipulation, focusing on how language models handle factual knowledge and its application to downstream tasks.
  • Knowledge Manipulation Critiqued: The discussion highlighted that while language models are proficient in knowledge retrieval, they encounter difficulties in tasks like classification, comparison, and inverse search, especially without the use of Chain of Thoughts (CoTs).
    • The paper introduces a controlled, synthetic experiment revealing inherent weaknesses in language models, where they struggle to manipulate pre-training data effectively, even when the knowledge is perfectly stored and despite sufficient training.
  • “Out-of-Distribution” Term Questioned: Participants debated the term “out-of-distribution,” with some suggesting it has become a buzzword PR term due to its ambiguous and often misused definition.
    • It was noted that in the discussed paper, the author took an atypical approach, controlling for data contamination more effectively than most studies.

Yannick Kilcher ▷ #ml-news (28 messagesđŸ”„):

Alpha Evolve, Google Codex Competitor, Labor Saturation Theory, LatentSeek vs COCONUT, Gemini Diffusion

  • Google Drops Codex Killer Called Jules: Google launched Jules, a competitor to OpenAI’s Codex, but access is currently waitlist-only in Europe (Jules).
  • Debate Erupts Over AI’s Impact on Labor: Discussion emerged around the implications of AI on the labor market, with one member arguing that if labor is a commodity, market theory implies producing less of it in the event of labor saturation.
    • Another pointed out that modern societies are already facing population stagnation or decline, thus naturally regulating the supply side.
  • LatentSeek Aims for Tokenless Decoding: LatentSeek employs per-token RL on the latent (and previous tokens) right before selecting the next token.
    • Members pointed out its distinction from COCONUT, which bypasses the tokenization step and runs autoregression in latent space without discretization, noting that LatentSeek improves marginally over Qwen 2.5 and might be a benchmark fine-tune.
  • Google Unveils Gemma 3N Model Details: Google released details for its Gemma 3N model series (Gemma 3N Docs).
    • This was followed by news of Google AI Edge’s support for small language models, multimodality, RAG, and function calling (Google AI Edge).

aider (Paul Gauthier) ▷ #general (54 messagesđŸ”„):

Qwen MoE 3, Qwen 2 35B Polyglot benchmark, Aider Notifications, Aider as Agent, Navigator PR

  • Qwen disappoints for Angular/TypeScript dev: A member found the Qwen 2 35B model unable to produce diffs for Angular/TypeScript code using element.innerHTML, despite repeated attempts, leading them to abandon the model for the task at hand.
    • They were using the same Alibaba Cloud endpoint as Paul Gauthier and experimenting with temperature parameters.
  • Qwen’s performance varies across languages: One member found Qwen to work well with Rust, while another experienced failures with Kotlin, specifically with immutable map issues, even with type information readily available in the code.
  • Aider Notifications configured: A user inquired about enabling notifications in Aider to be alerted when coding finishes or when file additions are requested, and was pointed to the configuration options.
    • They are working on adding a Navigator PR to extend Aider toward agentic coding.
  • Gemma 3n 4B achieves 44.4% on polyglot benchmark: The new Gemma 3n 4B model achieved 44.4% on the polyglot benchmark, but the settings used are currently unknown.
    • A member noted that a 4B model scored higher than gemini 2.0 pro.
  • Deep Think is new differentiator?: Following the announcement of Gemma 3n 4B model, one member posted a screenshot from the Google blog highlighting that the next model is Deep Think.
    • Others commented that it sounds like you just let them spend your money at their will.

aider (Paul Gauthier) ▷ #questions-and-tips (15 messagesđŸ”„):

Aider Shell Command Execution, Aider YAML Configuration, Aider Prompt Context, Gemini 2.5 Flash Benchmark

  • Aider Debates Shell Command Execution: Users discussed the possibility of executing shell commands within Aider, specifically to list staged files, noting that the --yes-always option doesn’t work as expected and the tool just shows the command and exits.
    • A link to issue #3903 was shared, indicating this is a design decision to require manual approval.
  • YAML Configs Explored: A user inquired about optimized YAML configurations for Aider and was told that Aider is designed to work well with its default configuration, with the choice of model being the most important setting.
    • A link to sample configs was shared, with a user recommending GLOBAL_DEFAULTS.yml as a starting point for deviating from aider’s defaults; a minimal illustrative config appears after this list.
  • Context by File for Prompts Pondered: A user asked about including a prompt preamble from a file as context and debated whether it should be specified in a prompt or if Aider could read the file directly using aider --read your_file.md.
    • There was no clear solution; the user was simply encouraged to test and see.
  • Gemini 2.5 Flash Gets Benchmarked: Users discussed updating benchmarks with the newly upgraded Gemini 2.5 Flash, with a link to benchmark results shared within the Discord channel.
    • The specific results and their implications were not detailed in the messages.
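Following up on the YAML configuration question above, here is a minimal illustrative `.aider.conf.yml`. The keys mirror aider's command-line flags; the values (and the commented-out notifications key) are assumptions for illustration, not verified recommendations:

```yaml
# Minimal .aider.conf.yml sketch -- keys mirror aider's CLI flags.
# Values are illustrative assumptions, not recommended settings.
model: gemini/gemini-2.5-flash-preview-05-20  # the model choice matters most
auto-commits: true        # let aider git-commit each edit it makes
dark-mode: true           # terminal color scheme
# read: CONVENTIONS.md    # optionally keep a context file loaded read-only
# notifications: true     # assumed key for end-of-response notifications
```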

Manus.im Discord ▷ #general (64 messagesđŸ”„đŸ”„):

Manus AI Agent, Credit System & Invitation Code, Manus Website Creation, Network Connection Errors, Manus Tech Stack

  • Manus—AI Agent with own computer: Manus is an AI agent with its own computer that builds websites, writes reports, and runs research tasks.
    • Users can access Manus via invitation links such as this one.
  • Credit System and Invitation Code System Discussed: One user expressed that they haven’t had any issues with Manus except for the credit system/usage and invitation code abuse.
    • Another user asked, how do you have so many credits?
  • Manus is able to create websites: A user asked if Manus can create websites, and another user confirmed that it definitely can, and linked to Manus use case official collection and Manus use cases from users.
  • Network Connection Errors plague users: A user reported getting network connection errors for Manus.
    • A Manus employee asked them to share their account email and a session link to help investigate the issue.
  • Inquisitive user seeks Manus Tech Stack: A member inquired about the tech stack being used, asking What tech stack are you leaning on atm for this?
    • Another member responded, I’m trying all kinds of innovations right now.

Cohere ▷ #💬-general (28 messagesđŸ”„):

Category Theory and AI, Cohere Research Grants Program, Private Deployment Options at Cohere, Command A and Structured Responses Slowdown, JSON Output Hanging Issues with Command-R

  • Category Theory Craze Ignites AI Interest!: Members expressed a desire to learn more about the intersection of category theory and AI.
    • The conversation aimed to explore prompt-engineering techniques and methods related to this pairing.
  • Cohere Labs Keeps Grants Program Going!: A user inquired about the status of the Cohere Research Grants program, noting that the application link appeared to be inactive (https://share.hsforms.com/1aF5ZiZDYQqCOd8JSzhUBJQch5vw).
    • A Cohere representative clarified that Research Grants are now part of Cohere For AI, which is currently called Cohere Labs.
  • Cohere caters to Customers Control Cravings: A user inquired about private deployment options with full ownership/control rights for Cohere models deployed on-prem, driven by data/LLM sovereignty interests.
    • Cohere confirmed that they offer private deployments as a core part of their solutions, with flexible deployment options and encouraged the user to contact [email protected] or [email protected] for more information.
  • Command A Crawls, Confounds Customers!: A user reported slower than usual response times with command A, particularly when using the structured response parameter.
    • A Cohere representative acknowledged the report, confirmed there were no known issues on their end, and requested that the user send details to [email protected] for investigation.
  • JSON Jams with Command-R, Just Needs Time!: A user reported that specifying json_object as output in requests hangs when using command-r-plus-08-2024, although it works fine in the web playground without explicitly specifying JSON output in the UI.
    • The user later clarified that the requests do complete, but take an excessively long time (almost 2 minutes), while text-based JSON output is much faster.
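For reference, the structured-output request in question goes through the chat API's response_format parameter. A rough sketch with the Python SDK follows; the ClientV2 usage and response access path are assumptions based on the public SDK, with the model name taken from the report above:

```python
# Rough sketch (assumed SDK shapes): request JSON-constrained output from
# command-r-plus-08-2024 via the response_format parameter.
import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")  # placeholder key

resp = co.chat(
    model="command-r-plus-08-2024",
    messages=[{
        "role": "user",
        "content": "Summarize the text below as a JSON object with keys "
                   "'title' and 'summary'.\n\n<text here>",
    }],
    response_format={"type": "json_object"},  # constrain output to valid JSON
)

# Assumed v2 response shape: generated text lives in message.content[0].text
print(resp.message.content[0].text)
```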

Cohere ▷ #💡-projects (1 messages):

Vitalops datatune, Open source data transformation tool

  • Vitalops releases datatune: Vitalops just created a new open source tool called datatune that does data transformations with simple natural language instructions and LLMs.
    • The creator encourages people to check it out on GitHub and hopes it’s useful.
  • Datatune simplifies data transformations: Datatune is an open-source tool by Vitalops that uses LLMs for data transformation via natural language.
    • It aims to simplify how data transformations are performed, making it accessible to more users.

Cohere ▷ #🎯-private-deployments (2 messages):

Private Deployment, Data Sovereignty, LLM Sovereignty, Cohere models on-prem

  • Customers eye private, on-prem Cohere model deployment: A customer inquired about options for private deployment of Cohere models on-prem, seeking full ownership and control due to data/LLM sovereignty interests.
  • Cohere Sales and Support teams ready to help: Cohere team acknowledged the inquiry and offered assistance.

MCP (Glama) ▷ #general (20 messagesđŸ”„):

MCP best practices, MCP and Cursor, crawl4ai mcp server, A2A protocol Agents, Wallet MCP

  • Deep Dives into MCP’s best practices sought: A member is looking for any deep dives on best practices around use of MCP, specifically around tool design and deployment in production.
    • They linked to a Windows blog post about securing the Model Context Protocol and building a safer agentic future.
  • MCP & Cursor workflows shared: A member shared a writeup on reddit of how they personally utilize MCP and Cursor for workflows to do things like build websites, make games in unity, and other similar tasks.
    • They also mentioned that browsermcp is cool because it allows an agent to actually view what you are seeing in the browser, screenshot/visualize, etc.
  • crawl4ai MCP Server Implementation: A member is seeking reference implementations of an MCP server in Docker that can work with out-of-container local files, crawling (ingesting and embedding) local markdown files rather than just URLs.
    • They have a crawl4ai MCP server for Windsurf that crawls URLs and embeds them for querying up-to-date coding context, and are running it with an SSE transport port in a Docker container.
  • Bridging MCP and A2A: A member released an open-source server that bridges MCP with A2A protocol Agents, allowing Claude to interact with A2A agents seamlessly, and shared a GitHub link.
    • Another member asked for a use case for A2A, and another member explained A2A’s use case is to deploy your agents to your domain, like a tool.
  • Wallet MCP Release: A team from TokenPocket released Wallet MCP - a lightweight bridge that enables seamless integration between AI clients and encrypted user wallets, which supports multi-chain asset management, transaction signing, and smart contract interactions, and shared a GitHub link.
    • It was explained that the server requires a captcha to join because there are too many scammers and spammers.

MCP (Glama) ▷ #showcase (9 messagesđŸ”„):

MCP-GraphQL issues, Public SearXNG MCP server, AI-friendly Data API

  • Fetch Error Plagues GraphQL Testing: A member testing the mcp-graphql server with Claude desktop encountered a “Failed to introspect schema: ReferenceError: fetch is not defined” error.
    • Another member suggested that this may be caused by an out-of-date Node version, since the global fetch API only became available by default in Node 18.
  • Public SearXNG MCP Server Emerges: One member created a “public SearXNG” MCP server at GitHub to address the unreliability of public SearXNG servers and the lack of JSON support.
    • They randomize which public instance is called to avoid DoSing any single host, making the server suitable for private users who only call internet searches sparingly.
  • AI-Friendly Data API Debuts: A member announced a serverless-friendly Data API at dapi-sandbox.adiom.io that creates an instant database backend with secure semantic endpoints.
    • It supports MCP, gRPC, and Connect RPC over MongoDB and PostgreSQL, offering a free sandbox limited to 50 active users.

tinygrad (George Hotz) ▷ #general (12 messagesđŸ”„):

AMD enum changes, 7900XTX vs 9070XT flash attention, RDNA4 wmma instructions, BERT training bounty, tinygrad gemm optimization

  • Annoyance over AMD Enum Changes: A member expressed frustration over AMD’s decision to change enums, complaining on their repo.
    • They joked that AMD acted like they were gonna run out of numbers or something.
  • Flash Attention Bounty: 7900XTX Only Initially: The flash attention bounty is locked and will be tested on both the 7900XTX and the 9070XT, but the current implementation only supports the 7900XTX.
    • The developer stated that if RDNA4 added new wmma instructions, they could include them, but they do not have a 9070XT for testing.
  • BERT Training Bounty Targets Nvidia/AMD: The BERT training bounty, seeking flash attention in tinygrad that outperforms normal attention, is compatible with any Nvidia/AMD hardware that can run the trainer.
    • The bounty poster is using chatgpt to write their AGENTS.md file, but said it turned out badly.
  • tinygrad GEMM Optimization Interest: A member inquired about ongoing work on GEMM optimization, referencing past discussions by George Hotz on leveraging tiles on AMD GPUs.
    • They asked who can I speak with about contributing.
  • tinygrad Job Application via Bounties: In response to a job inquiry, a member clarified that the primary route to a job at tinygrad is through bounties, suggesting starting with small PRs.

tinygrad (George Hotz) ▷ #learn-tinygrad (7 messages):

tinygrad control flow, jax.lax.cond equivalent, Tensor.where

  • Tinygrad Control Flow Questioned: A member inquired about the existence of control flow operators in tinygrad, similar to those found in jax.lax, such as jax.lax.cond.
    • The member noted that such control flow is essential for many Monte Carlo algorithms.
  • Tensor.where suggested as alternative: Another member suggested using Tensor.where as a possible alternative to jax.lax.cond.
    • The original poster responded that jax.lax.cond lets you choose which branch of code actually executes, whereas Tensor.where operates on tensors elementwise, so both branches are still evaluated.
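To make the distinction concrete, here is a small illustrative sketch (not from the original discussion; it assumes the documented jax.lax.cond and tinygrad Tensor.where signatures):

```python
# jax.lax.cond runs only the selected branch; tinygrad's Tensor.where builds
# both candidate tensors into the graph and selects elementwise, so neither
# branch is skipped.
import jax
import jax.numpy as jnp
from tinygrad import Tensor

# JAX: exactly one of the two functions executes for a given predicate.
def scaled_or_shifted(pred, x):
    return jax.lax.cond(pred, lambda v: v * 2.0, lambda v: v - 1.0, x)

print(scaled_or_shifted(True, jnp.array(3.0)))   # -> 6.0

# tinygrad: both operands are part of the computation; where() only selects.
x = Tensor([1.0, -2.0, 3.0])
doubled = x * 2.0
exponentiated = x.exp()                 # still evaluated for masked-out elements
out = (x > 0).where(doubled, exponentiated)
print(out.numpy())                      # -> [2.0, ~0.135, 6.0]
```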

LlamaIndex ▷ #blog (2 messages):

Financial Analysis Workshop, Model Context Protocol (MCP), AWS joins MCP steering committee

  • LlamaIndex Hosts Hands-On Financial Analysis Workshop: LlamaIndex is hosting a hands-on workshop in NY on May 29th, with @jerryjliu0 leading sessions on building agent workflows for financial analysis and due diligence, sign up here.
    • The event will offer exclusive insights into leveraging LlamaIndex for advanced financial applications.
  • AWS Commits to the Model Context Protocol (MCP): AWS announced that they are joining the MCP steering committee, contributing to its evolution for better inter-agent communication, and collaborating with frameworks like LlamaIndex.
    • More details about MCP and AWS involvement can be found here.

LlamaIndex ▷ #general (11 messagesđŸ”„):

Agent Handoff Examples, Llama Parse Service Issues, VectorStoreIndex vs Local FAISS

  • Agent Handover Handbook: A member asked for agent handoff examples, and another member provided a link to the LlamaIndex documentation; a rough sketch of the pattern appears after this list.
  • Llama Parse Pile-up?: A member reported issues with the Llama Parse service using the Parse with Layout Agent, noting that jobs were taking upwards of 30 minutes and then failing without explanation, and that the service was getting stuck trying to load.
    • They also shared a screenshot of the failed job.
  • FAISS Faceoff: A member inquired about the performance differences between using a VectorStoreIndex and a local FAISS for storage in a RAG model.
    • They questioned whether the performance of a RAG model would be degraded by using one over the other.
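On the FAISS question, the two setups differ mainly in which vector store backs the index. A rough sketch of wiring a local FAISS index into a VectorStoreIndex follows; the package names follow the LlamaIndex FAISS integration, and the embedding dimension is an assumption that must match whatever embedder is configured:

```python
# Rough sketch: back a LlamaIndex VectorStoreIndex with a local FAISS index.
# Requires `pip install llama-index llama-index-vector-stores-faiss faiss-cpu`
# and an embedding model configured (e.g. an OpenAI key for the default).
import faiss
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.faiss import FaissVectorStore

EMBED_DIM = 1536  # assumption: must match the embedding model's output size

faiss_index = faiss.IndexFlatL2(EMBED_DIM)             # exact L2 search, in-process
vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

query_engine = index.as_query_engine()
print(query_engine.query("What does the corpus say about retrieval latency?"))
```

With an exact (flat) FAISS index, retrieval results should be essentially equivalent to the default in-memory store; FAISS mainly pays off in speed and memory at larger scales or with approximate index types.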
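Returning to the agent-handoff question above, here is a rough, unverified sketch of the multi-agent handoff pattern described in the linked LlamaIndex docs. The class and parameter names (FunctionAgent, can_handoff_to, AgentWorkflow, root_agent) are assumptions; treat the documentation as the source of truth:

```python
# Unverified sketch of LlamaIndex-style agent handoff (names are assumptions).
# Assumes an LLM is already configured via Settings (e.g. an OpenAI API key).
import asyncio
from llama_index.core.agent.workflow import AgentWorkflow, FunctionAgent

async def lookup_docs(query: str) -> str:
    """Toy tool standing in for a real retrieval step."""
    return f"(stub) notes relevant to: {query}"

research_agent = FunctionAgent(
    name="ResearchAgent",
    description="Gathers background material for a request.",
    system_prompt="Research the topic, then hand off to WriterAgent.",
    tools=[lookup_docs],
    can_handoff_to=["WriterAgent"],   # agents this one may delegate to
)

writer_agent = FunctionAgent(
    name="WriterAgent",
    description="Writes the final answer from gathered notes.",
    system_prompt="Write a concise answer using the research notes.",
    tools=[lookup_docs],
)

workflow = AgentWorkflow(
    agents=[research_agent, writer_agent],
    root_agent=research_agent.name,
)

async def main():
    result = await workflow.run(user_msg="Summarize recent RAG evaluation practices.")
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
```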

Torchtune ▷ #general (2 messages):

Recipe Tutorials, Automated CI, Llama2 Evaluation

  • Recipe Tutorials Complement Documentation: Members agreed that recipe tutorials should complement the documentation, showcasing end-to-end examples like grpo recipes with improvements on math datasets.
    • They suggested linking these tutorials to showcase improvements on specific datasets.
  • Automated CI Avoided for Recipe Tutorials: There was agreement that automating recipe tutorials as part of CI would not be optimal.
    • One member noted that creating tests that pass for delta_in_performance > 0 might lead to issues where the appropriate train and eval datasets need to be reconsidered for each new model, and that it feels like too much ML work.
  • Llama2 Evaluation Proves Painful: A member mentioned that they had tried a version of approach (2) with Llama2 in the early days and it was quite a pain.
    • They suggested updating the contributing guide with best practices and adding README.md files with summaries of different evals for each new model to help others sanity check.

Torchtune ▷ #dev (2 messages):

DistCp, Safetensors, Async Checkpointing

  • DistCp to Safetensors Conversion Requested: A user asked about converting the DistCp format (produced by async checkpointing) to safetensors.
    • A member encouraged them to open an issue to track the request and offered utils to facilitate the conversion, noting it would provide valuable signal for the DCP team.

Torchtune ▷ #rl (3 messages):

async_grpo, async_rl, vllm dependencies, torch version compatibility

  • Experimenting With Async GRPO: A member started experimenting with async_grpo and noticed that async_rl currently depends on vllm==0.8.4, which in turn depends on torch==2.6.0.
  • Upgrading VLLM Dependency To 0.9.0: A member plans to update the vllm dependency to the pre-release version vllm==0.9.0, which requires torch==2.7.0.
    • The member asked about potential issues when running async_rl with this setup, and was told that it hasn’t been tested yet, but should work.
  • Async RL Recipe Remains Experimental: The async RL recipe is still pretty experimental at this point, so it was pegged to a stable version of vllm.

DSPy ▷ #general (3 messages):

DSPy X post, DSPy is all about, Getting what DSPy is all about?

  • DSPy’s Cryptic X Post: A post on X features something ugly, yet the poster expresses a liking for it.
    • The author teases, If you get this, you get what DSPy is all about, suggesting a deeper understanding of the project’s essence.
  • Deciphering the DSPy Enigma: The X post’s enigmatic message implies that grasping DSPy involves appreciating unconventional or initially unappealing aspects.
    • It hints at a potentially unique approach or philosophy behind DSPy’s development and application.

LLM Agents (Berkeley MOOC) ▷ #hackathon-announcements (1 messages):

AgentX Competition, Submission Forms, Judging Panel, Entrepreneurship Track, Research Track

  • AgentX Competition Submission Forms are OPEN!: Submission forms are now open for the AgentX competition, featuring a distinguished judging panel from top VCs and AI companies; submission deadline is May 31, 2025, at 11:59 PM PT.
  • Requirements for AgentX Submission Detailed: The Entrepreneurship Track requires a pitch deck, product demo video, and live product link while the Research Track requires a paper, video presentation, and GitHub repo.
    • Over $150K in prizes awaits top teams.
  • Help Promote AgentX on Social Media: Competition organizers requested help spreading the word about AgentX on social media platforms.

LLM Agents (Berkeley MOOC) ▷ #mooc-questions (2 messages):

OpenAI API Keys, Trailblazer Tier, Mastery Tiers

  • Students Provide OpenAI API Keys: Students must use their own OpenAI API keys for the lab, but can exclude them from the actual submission.
    • The course TA can confirm whether there are alternative approaches that don’t require API calls.
  • Downgrading Mastery Tier for Certificates: Students can still apply for the Mastery Tier even if they struggle with labs, as they can be “downgraded” to the Trailblazer Tier if quizzes and articles are completed.
    • The downgrade process happens on the staff end.

Nomic.ai (GPT4All) ▷ #general (2 messages):

PDF text extraction, GPT4All OpenAI API Key installation

  • PDF text extraction needs special embedder model: To extract an exact copy of parts from PDF textbooks, a special embedder model is needed.
  • GPT4All API key installation troubleshooting: A member reported that the install button in GPT4All does not work when pasting a long OpenAI API key.

MLOps @Chipro ▷ #general-ml (1 messages):

DataTune, Data Transformation, Open Source Tool, Natural Language Instructions, LLMs

  • Vitalops Launches DataTune for Easy Data Transformation: Vitalops introduced DataTune, a new open-source tool designed for data transformation.
    • It leverages natural language instructions alongside LLMs, aiming to simplify the data manipulation process.
  • DataTune Simplifies Data Transformation with LLMs: DataTune facilitates data transformations using intuitive natural language instructions powered by LLMs.
    • This open-source tool from Vitalops aims to streamline and simplify complex data manipulations for users.