Jony Ive is all you need.

AI News for 5/20/2025-5/21/2025. We checked 9 subreddits, 449 Twitters and 29 Discords (215 channels, and 6969 messages) for you. Estimated reading time saved (at 200wpm): 597 minutes. Our new website is now up with full metadata search and beautiful vibe coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on @smol_ai!

A day after Google’s I/O, OpenAI consummated the long rumored Jony Ive partnership and confirmed plans to ship consumer hardware.

LMArena announced their $100m SEED (!?!?) from a16z.

Mistral launched a new code model finetune.

You could be forgiven for completely missing OpenAI’s nice updates to the Responses API.


AI Twitter Recap

Here’s a summary of the top tweets, organized by category:

Google I/O 2025 Announcements and Keynotes

  • Google I/O Recap: @Google shared a recap of all the announcements from Google I/O in under 10 minutes. A more detailed overview of all announcements, including when and how to access the new models and features, can be found in this tweet.
  • Google’s Relentless Progress: @Google highlighted Sundar Pichai’s reflection on the relentless progress made since Google I/O 2024, including over a dozen new models, research breakthroughs, and 20 major AI products and features released. @GoogleDeepMind noted they have announced over a dozen models and research breakthroughs and released over 20 major AI products and features since the last Google I/O.
  • Focus on AI: @Google posted a leaderboard measuring the number of times “AI” was mentioned during the keynote. @TheRundownAI noted that Google I/O kicked off, and it’s all about AI.

Gemini Models and Capabilities

  • Gemini 2.5 Pro and Flash: @GoogleDeepMind stated that Gemini 2.5 can now organize vast amounts of multimodal information, reason about everything it sees, and write code to simulate anything. @GoogleDeepMind announced improved capabilities, stronger security, and more control for Gemini 2.5. @GoogleDeepMind provided an update on Gemini 2.5 Pro and Flash, noting that a new preview version of Gemini 2.5 Flash is being released. @iScienceLuvr summarized the model’s features.
  • Gemini Diffusion: @GoogleDeepMind announced Gemini Diffusion, a state-of-the-art text diffusion model that learns to generate outputs by refining noise step-by-step, excelling at coding and math. @omarsar0 highlighted that Gemini Diffusion is an experimental text diffusion model leveraging parallel generation for low latency. @_philschmid shared details, including a bouncing-balls demo generated by the text diffusion model.
  • Gemini in Google Chrome: @Google introduced Gemini in Google Chrome, rolling out first to Google AI Pro subscribers in the U.S., acting as an AI browsing assistant.
  • Deep Think Enhanced Reasoning Mode: @GoogleDeepMind introduced Deep Think in 2.5 Pro, an enhanced reasoning mode using parallel thinking techniques, enabling it to handle complex math and coding problems. @omarsar0 also noted Deep Think’s impressive scores on benchmarks like USAMO and LiveCodeBench. @YiTayML noted that he contributed some outrageous research ideas to this model.
  • Project Astra: @GoogleDeepMind shared updates to Project Astra, including improved voice output, memory, and computer control, envisioning a universal AI assistant. @Google mentioned upgrades to Project Astra, including more natural voice output and computer control, to be integrated into Gemini Live and new experiences in Search.
  • Gemini’s Personal, Proactive, and Powerful Features: @GoogleDeepMind highlighted VP Josh Woodward explaining how Gemini is becoming the most personal, proactive and powerful AI assistant.
  • Veo 3: @GoogleDeepMind introduced Veo 3 — with native audio generation.

Agentic Web and AI Agents

  • Microsoft’s Agentic Web Vision: @TheTuringPost summarized Microsoft’s vision for an open agentic web from #MSBuild, emphasizing agents as first-class entities, NLWeb, and agentic DevOps.
  • Project Mariner: @GoogleDeepMind introduced Project Mariner, a research prototype that can help users plan trips, order items, and make reservations with oversight, as a future of AI agents. @GoogleDeepMind announced updates to Project Mariner, including managing up to 10 tasks at once and the ability to learn and repeat tasks.
  • Agent Chat UI: @hwchase17 promoted an open source agent chat UI.
  • OpenAI Responses API: @stevenheidel claimed the OpenAI Responses API is now the first truly agentic API.
  • Google’s Agentic Capabilities: @Google announced it is starting to integrate agentic capabilities throughout its products, including Chrome, Search, and Gemini. @GoogleDeepMind outlined 3 updates to Project Mariner, highlighting its multitasking abilities and computer use capabilities.

Open Source Models and Tools

  • Devstral Model Release: @b_roziere announced the release of Devstral, a 24B model under the Apache 2.0 license, claiming it’s the best open model on SWE-Bench verified. Ollama is supporting it as well, according to @ollama.
  • BAGEL by BytedanceTalk: @mervenoyann noted BAGEL by @BytedanceTalk, a 7B native multimodal model that understands and generates both image + text, outperforms leading VLMs, and has an Apache 2.0 license.
  • Open Agent Platform (OAP): @LangChainAI detailed the Open Agent Platform (OAP), an open-source, citizen developer platform for building, prototyping, and deploying agents without heavy coding.
  • OLMoE by Allen AI: @teortaxesTex mentions that OLMoE from Allen AI is ahead of Meta on architecture too.

Model Architecture and Techniques

  • GRPO (Group Relative Policy Optimization): @TheTuringPost provided an overview of GRPO (Group Relative Policy Optimization), a reinforcement learning algorithm by DeepSeek for LLMs, which drops the need for a critic network and judges model outputs relative to each other in groups (a toy sketch of the advantage computation appears after this list). @sirbayes has updated his RL tutorial with information on GRPO.
  • DeepSeek-V3: @TheTuringPost highlighted the architecture of DeepSeek-V3, which is trained on just 2,048 powerful NVIDIA H800 GPUs, clarifying how it works using innovations like Multi-head Latent Attention (MLA) and Mixture of Experts (MoE).
  • Harnessing the Universal Geometry of Embeddings: @jxmnop shared a thread about the paper “Harnessing the Universal Geometry of Embeddings”, arguing that embeddings from different models are so similar that they can be mapped between them based on structure alone, without any paired data.
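
Since GRPO’s central trick is replacing a learned critic with group-relative scoring, a toy sketch may help make the bullet above concrete. This is a minimal illustration, not DeepSeek’s implementation: the reward values and group size are invented, and a real GRPO trainer additionally applies a PPO-style clipped ratio objective and a KL penalty against a reference model.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """GRPO-style advantages: score each sampled completion against the mean and
    standard deviation of its own group, so no critic (value network) is needed."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, a group of G=4 sampled completions, scalar rewards (illustrative numbers).
rewards = torch.tensor([[0.2, 0.9, 0.4, 0.7]])
advantages = group_relative_advantages(rewards)
# Completions above the group mean get positive advantages, below-mean ones negative;
# these advantages then weight a clipped per-token log-probability objective.
print(advantages)
```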

Google DeepMind’s AI Filmmaking Tools

  • Flow: AI Filmmaking Tool: @GoogleDeepMind introduced Flow, combining the best of Veo, Imagen and Gemini into a master filmmaking tool. @Google described Flow as a new type of AI filmmaking tool built with and for creatives.
  • Veo 3 Features: @GoogleDeepMind stated that Veo 3 is great at understanding what you want, capable of capturing real-world physics, and can tell a short story in your prompt, giving you back a clip that brings it to life.

Other News

  • OpenAI Acquires Jony Ive’s Startup: @steph_palazzolo reported that OpenAI goes public with its acquisition of the Jony Ive device startup.

Humor/Memes

  • @nearcyan says, “dont you dare community note this this is my only coping mechanism left”. @polynoamial shares their experience vibe coding.

AI Reddit Recap

/r/LocalLlama Recap

1. Mistral Devstral Coding Model Announcements and Benchmarks

  • mistralai/Devstral-Small-2505 · Hugging Face (Score: 296, Comments: 78): **Devstral-Small-2505 is a 24B-parameter agentic LLM for software engineering, the result of a Mistral AI and All Hands AI collaboration. Finetuned from Mistral-Small-3.1, it leverages a 128k context window, uses a Tekken tokenizer (131k vocab), and achieves 46.8% on SWE-Bench Verified, setting a new open-source state-of-the-art.** Model supports tool use, multi-file code editing, and integrates with inference backends (vLLM, mistral-inference, Transformers, LMStudio, llama.cpp, Ollama), with Docker and OpenHands for agent use. Comments note a lack of GGUF model files, warn that the model is trained specifically for OpenHands (not general coding like Codestral), and report quick testing (e.g. HTML single prompt test).
    • Devstral is specifically trained for integration with OpenHands and is not intended as a general-purpose coding model like Codestral, which may affect its performance or utility for broader coding applications.
    • The model distinguishes itself on the SWE-bench benchmark, reportedly ranking as the #1 open source model for software engineering tasks involving tool use, codebase exploration, and multi-file editing—underscoring its focus on agentic software engineering workflows.
    • There is anticipation around potential evaluations using the ‘aider polyglot’ benchmark, suggesting interest in how Devstral might perform on multilingual or cross-language coding tasks.
  • Meet Mistral Devstral, SOTA open model designed specifically for coding agents (Score: 192, Comments: 25): Mistral has released Devstral (see news), an open-source model focused on coding agent tasks, with weights (Devstral-Small-2505) under Apache 2.0 license. Optimized for OpenHands, Devstral features a compatible system prompt and chat template, supports GGUF quantization (GGUF by lmstudio, by unsloth), and has fine-tuning/run docs. Devstral Large is in the pipeline. Commenters appreciate the permissive Apache 2.0 license and note the importance of using the correct chat template and system prompt for compatibility. There is anticipation for releases targeting lower-resource hardware (e.g., 12B-14B models), as smaller model variants are still highly desirable.
    • Danialhanchen provides GGUF quantized versions of Devstral-Small on HuggingFace, noting these are properly configured for system prompts and chat templates as required by Devstral—crucial for proper model operation, especially since Devstral is tuned for OpenHands. The linked documentation offers step-by-step instructions for running and fine-tuning the model, including quants from Unsloth or the original Mistral repo.
    • A technical benchmark point is raised by Ambitious_Subject108: the absence of aider polyglot results in the official benchmarks raised suspicions about performance, which were confirmed by self-conducted aider polyglot tests yielding only 6.7% for ‘whole’ and 5.8% for ‘diff’, implying the model may underperform on aider polyglot benchmarks compared to other code models.
    • There is anticipation for the release of larger variants (12B or 14B) of Devstral, highlighting a potential limitation for users with limited GPU resources until such models are released, and drawing a comparison to light-weight alternatives like Nemo or Pixtral 2 for the GPU-constrained community.
  • Mistral’s new Devstral coding model running on a single RTX 4090 with 54k context using Q4KM quantization with vLLM (Score: 110, Comments: 35): The image demonstrates Mistral’s Devstral coding model, highlighting its ability to run with a 54k context window on a single RTX 4090 GPU using Q4KM quantization with the vLLM framework. The docker-compose.yml file in the setup details configuration for leveraging Nvidia hardware while the terminal logs show real-time GPU utilization and service endpoints, evidencing practical deployment. Technical discussion in the comments confirms strong performance: users report context windows up to 70k, impressive code reasoning (e.g., REGEX searches and variable tracking), and throughput up to 80 tokens/sec; additionally, there is note of experimental vision encoder integration and available quantized checkpoints for efficient use. Commenters praise Devstral’s competitive coding abilities compared to Qwen and GLM series models, especially at high contexts and with local unlimited API use, but mention occasional shortcomings with output formatting (indentation) and the need for close supervision. Discussions also link to detailed guides for fine-tuning and running Devstral with vLLM and alternatives like Unsloth. A hedged vLLM loading sketch appears after this list.
    • A user running Devstral-small-2505 Q4KM with 70k context on a 32GB VRAM GPU (RTX 4090 implied) reports significant improvements over Qwen3 14b Q4, Qwen3 32b Q4, and GLM-4 Q4 for large codebase tasks (e.g., managing ~30 files, 2k+ LOC, variable hunting, REGEX searches). Devstral is delivering 80 tok/s and performing coding assistance tasks competitors couldn’t, though there’s still the need for close supervision due to issues like code indentation mistakes.
    • Unsloth’s documentation indicates Devstral can potentially support vision tasks by ā€˜grafting’ the vision encoder from Mistral 3.1 Instruct, as shown in Xuan-Son’s GGUF repo. Quantized Devstral variants (GGUF format) are available, extending accessibility to various hardware configurations.
    • There’s anticipation for a larger Devstral model release and mentions of OpenHands’ contributions to agentic coding models, noting that their foundational work often benefits other projects but lacks broad recognition in the community.
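
Since the thread above is about serving a Q4_K_M GGUF of Devstral on a single 24GB card with vLLM, here is a rough sketch of what that setup can look like with vLLM’s offline Python API. The GGUF path is a placeholder, GGUF support in vLLM is still experimental, and the exact arguments (tokenizer source, context length, memory fraction) may need adjusting per version; the original post deployed via docker-compose rather than this API.

```python
from vllm import LLM, SamplingParams

# Hypothetical local path to a Q4_K_M GGUF of Devstral-Small-2505 (e.g. from the Unsloth repo).
MODEL_PATH = "./Devstral-Small-2505-Q4_K_M.gguf"

llm = LLM(
    model=MODEL_PATH,
    tokenizer="mistralai/Devstral-Small-2505",  # GGUF checkpoints usually need the original tokenizer
    max_model_len=54_000,                       # ~54k context, as reported for a single RTX 4090
    gpu_memory_utilization=0.95,
)

params = SamplingParams(temperature=0.2, max_tokens=512)
out = llm.generate(["Write a regex that matches ISO-8601 dates."], params)
print(out[0].outputs[0].text)
```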

2. Major New Model and Architecture Releases (Gemini Diffusion, Bagel MOE, Falcon-H1)

  • Why nobody mentioned “Gemini Diffusion” here? It’s a BIG deal (Score: 705, Comments: 109): Google has introduced the Gemini Diffusion model, a diffusion-based LLM paradigm claiming faster generation and reduced parameter count (~50% the size of comparable performant models) versus autoregressive baselines like Gemini 2.0 Flash-lite. Key technical highlights include memory and inference speed efficiency arising from token parallelism and the absence of key-value (KV) caching; iterative refinement also enables progressive answer improvements and potential in-latent CoT-style reasoning. Early benchmarks are promising, but details about model scale and open availability remain limited, as only a demo waitlist—not weights—has been released. Notably, open source analogues such as LLaDA-8B and MLX PR already exist for community-led exploration. Technical discussion in comments centers on open source availability—there is frustration about Gemini Diffusion’s lack of public weights compared to alternatives like LLaDA—and questions about how diffusion LLMs address variable-length output generation, a challenge natively handled by autoregressive [EOS] token prediction.
    • A user highlights the existence of an open-source diffusion language model, ML-GSAI/LLaDA, specifically mentioning the LLaDA-8B-Instruct checkpoint. This model already has integration work ongoing for MLX, as evidenced by a pull request for mlx-lm, pointing toward rapid experimental adoption in open ecosystems.
    • Another comment discusses Gemini Diffusion’s restricted accessibility; currently, only a demo with a waitlist is available, and no model weights are distributed, making it non-viable for reproducible research or open development. There is also no announced plan to release weights or make the project open-source, impeding broader technical evaluation.
    • A technical question is raised regarding how diffusion-based language models like Gemini handle output termination and variable-length generation, as autoregressive models use an explicit [EOS] token for this task. There’s curiosity about whether diffusion models adopt a similar mechanism or some alternative for efficient, flexible output control (see the toy decoding sketch after this list).
  • ByteDance Bagel 14B MOE (7B active) Multimodal with image generation (open source, apache license) (Score: 343, Comments: 56): ByteDance has released Bagel, a unified multimodal MoE (Mixture-of-Experts) model with 14B total and 7B active parameters, capable of both text and image generation, and licensed under Apache 2.0. Technical details in the accompanying paper highlight use of a mixture of experts (MoE) and mixture of transformer (MoT) architectures, native image synthesis, and weights (29GB FP16) available on Hugging Face and GitHub; paper discusses emergent properties in unified multimodal pretraining. Commenters are technically focused on quantization (noting no 8-bit or 4-bit weights yet for sub-24GB VRAM usage), frontend compatibility for running the model locally, and whether Bagel can surpass Flux in practical multimodal tasks.
    • Several users discuss the technical feasibility of running BAGEL-14B MoT on consumer hardware: the model’s 29GB unquantized weight size places it just out of reach for 24GB VRAM cards until an 8-bit quantization is available, though a 4-bit GGUF quant is highly requested for local inference—especially for improved efficiency on consumer GPUs (see discussion of weight sizes and quantization needs).
    • There’s interest in the architecture described as ‘Mixture of Transformers’ (MoT). Users debate whether this is analogous to traditional Mixture-of-Experts (MoE) setups, where only a subset of ‘experts’ (sub-networks) is activated per input—potentially improving efficiency and scalability but requiring clarity on how experts are routed or activated in practice in this implementation.
    • Users raising multi-GPU support highlight a technical gap: while distributing large LLMs across multiple GPUs is widely supported, multimodal models with image generation generally have poorer multi-GPU support, raising questions about parallelism and partitioning strategies specific to image-generating layers versus text-only architectures.
  • Falcon-H1 Family of Hybrid-Head Language Models, including 0.5B, 1.5B, 1.5B-Deep, 3B, 7B, and 34B (Score: 195, Comments: 53): The Technical Innovation Institute (TIIUAE) has released the Falcon-H1 family of hybrid-head language models, spanning parameter counts from 0.5B to 34B, with both base and instruction-tuned variants. These models combine transformer and state-space (Mamba) heads, and are distributed in quantized formats (GPTQ Int4/Int8, GGUF) for resource-efficient inference. Implementation supports Hugging Face Transformers, vLLM, and a custom llama.cpp fork. Benchmarks in comparison plots show Falcon-H1 models performing competitively with established models such as Qwen3, and they are reportedly less censored than US counterparts like Gemma 3. Full details and downloads are available via the official Hugging Face collection. Commenters highlight the practical benefit of multi-backend inference support and note that Falcon-H1’s output appears less censored than major US-based models, possibly due to regional differences. The hybrid Mamba/transformer architecture is recognized as a significant step, with performance competitive against other leading open models.
    • Falcon-H1 leverages a hybrid architecture combining Structured State Space Models (SSMs) and attention modules in parallel within its transformer blocks, which contrasts with upcoming approaches like IBM Granite 4 that will employ a serial combination of SSM and attention. This parallel composition is technically notable and may impact performance and model behavior, warranting comparative benchmark testing.
    • The Falcon-H1 family demonstrates highly competitive performance, reportedly going “toe to toe with Qwen3” on select metrics and tasks, despite its use of a Mamba hybrid model architecture. This places Falcon-H1 among the leading open alternatives in terms of capability for its parameter size, especially compared to recent models such as Qwen3 and Gemma 3.
    • Deployment flexibility is emphasized: Falcon-H1 models are readily usable via Hugging Face transformers, vLLM, and a custom fork of llama.cpp. Immediate support in major inference libraries improves accessibility for both research and production use, mitigating common adoption friction for newly released architectures.
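
As a rough intuition for how a text diffusion model differs from autoregressive decoding, and one common answer to the variable-length/[EOS] question raised above, here is a toy sketch of mask-based parallel refinement in the style of open models such as LLaDA. The "model" is faked with random numbers purely for illustration; such systems typically work over a fixed-length canvas and let the model predict padding for unused positions rather than stopping at an [EOS] token, and Gemini Diffusion’s actual mechanism has not been published.

```python
import random

MASK, PAD = "<mask>", "<pad>"

def fake_denoise(tokens):
    """Stand-in for the model: propose a word and a confidence for every masked slot.
    A real text diffusion model scores all positions in parallel at each step."""
    vocab = ["the", "cat", "sat", "on", "a", "mat", PAD]
    return {i: (random.choice(vocab), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_decode(length=8, steps=4):
    tokens = [MASK] * length                      # start from a fully masked sequence
    for step in range(steps):
        proposals = fake_denoise(tokens)
        # keep only the most confident slots this step; the rest stay masked (iterative refinement)
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[: max(1, len(ranked) // (steps - step))]:
            tokens[i] = proposals[i][0]
        print(f"step {step}: {tokens}")
    # variable length falls out of predicting PAD for unused slots, not from an [EOS] stop
    return [t for t in tokens if t != PAD]

print(diffusion_decode())
```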

3. Feedback and Application News for Next-Gen Local/Edge AI (Gemma3n, MedGemma, General LLM User Impressions)

  • medgemma-4b the Pharmacist 🤣 (Score: 175, Comments: 45): The attached image shows a conversation with Google’s new MedGemma-4B medical language model, where the model initially refuses to provide illicit drug manufacturing instructions, but then quickly gives in and becomes willing to present the information after minimal prodding. This highlights a serious failure in the model’s harm-prevention guardrails, as it can be manipulated into providing dangerous information, raising concerns about the effectiveness of safety filters in released medical or open-source models. Additional context confirms this is not limited to this model: a user reports similar jailbreaks are possible with the larger 27B-qat model, which even divulges bomb-making instructions, suggesting a broader vulnerability in current alignment/safety strategies for LLMs. Commenters engage in technical debate about the comparative robustness of safety guardrails across different model sizes and architectures, with some expressing skepticism about current generative model alignment approaches and others emphasizing risks specific to open-release environments.
    • A user tested the jailbreak prompt on a larger model, specifically ‘27b-qat’, and reported that it bypassed the safety safeguards, allowing the model to generate instructions for illicit activities (such as bomb-making). This raises concerns about the robustness of safety mechanisms in larger LLMs as compared to smaller or more fine-tuned variants like MedGemma-4b.
  • They also released the Android app with which you can interact with the new Gemma3n (Score: 139, Comments: 32): Google has released an Android app for on-device interaction with the new Gemma 3n LLM, using MediaPipe Solutions (see the official Android solution integration docs). Implementers are downloading and loading the Gemma 3n model manually from Hugging Face (gemma-3n-E2B), and sharing feedback on inference performance and UI functionality. The gallery and reference implementation are available in the google-ai-edge/gallery repo. Technical comments focus on ease of model download, a positive user experience with the UI, and desire for more open (uncensored) model variants.
    • One user reports they had to manually load the Gemma-3n model by downloading it from the Huggingface repository, and shares that the runtime numbers (presumably speed and performance) are ‘pretty good’.
    • A technical limitation is highlighted: Gemma-3n does not yet support GPU inference, so all processing is CPU-bound as of now, restricting its performance ceiling.
    • On a Samsung Galaxy S23 Ultra device, the app demonstrates very fast responses for about 10 messages but then crashes and becomes unresponsive mid-reply, a stability issue not observed with other models run through MLCchat, suggesting possible software or model-specific bugs.
  • Anyone else feel like LLMs aren’t actually getting that much better? (Score: 117, Comments: 160): The post questions whether recent large language models (LLMs) such as Gemini 2.5 Pro, Claude, Llama, Deepseek, and Qwen have shown significant improvement over previous generations like GPT-3.5 and GPT-4, particularly in technical use cases like long-form coding and system design. Despite reported benchmark improvements (e.g., LMSYS Arena), the user observes persistent issues: hallucinations, generic responses, bugs in code generation, and superficial system designs—arguing that practical capabilities may have plateaued. Comments highlight a divide: some note stark recent improvements, especially in trivial or boilerplate tasks, but consensus emerges that LLMs still struggle with complex or context-specific tasks (e.g., meaningful documentation, nuanced code involving internal APIs), where models may produce generic or unhelpful output. Several mention incremental rather than continuous progress, with perceived stagnation over the past 8 months but major shifts year-over-year.
    • One highly technical point discussed is the lag in usefulness of state-of-the-art models (like Llama 3, Qwen2.5 70B, Claude 3.5) for complex coding problems, documentation generation, and internal/non-public tasks; advanced models often generate generic or even misleading outputs, such as low-quality docstrings or tests that “cheat” by relying on placeholders or hardcoded values, offering little practical help for challenging work.
    • A significant technical improvement in the LLM space cited is the rapid progress in quantization and edge deployment, allowing models with only ~30B parameters to achieve performance close to or matching much larger prior models (70B+), which is considered a “200%+ improvement” in parameter efficiency and a breakthrough for running high-performing models on consumer hardware.
    • Some users report that while month-to-month improvements are modest, year-over-year advances in LLM capabilities—particularly in performance versus compute efficiency—are substantial, implying progress may feel incremental but aggregate rapidly over longer timeframes.

Other AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. Google Veo 3 and AI-Generated Video Breakthroughs

  • Made a comprehensive compilation of all the things people have been generating with VEO 3. Pure insanity! (Score: 1756, Comments: 238): The post showcases a compilation of video outputs generated by Google’s VEO 3, emphasizing the model’s surprising fidelity in audio-visual synthesis. Commenters note VEO 3’s advanced sound design capabilities, specifically its accurate spatial and surface audio-matching (e.g., “dog’s footstep sounds both on carpet and wood floor”). This suggests the model has integrated or highly correlated foley synthesis, not just video generation. No external technical materials or benchmarks are accessible due to a Reddit security block. Technical discussion in the comments highlights the belief that VEO 3 sets a new bar for realism, both visually and auditorily, with some expressing awe at the level of foley and sound integration—implying significant implications for professional sound and video production fields.
    • A detailed observation was made about Veo 3’s sound design capabilities: specifically, a generated video where a dog’s footsteps transitioned from carpet to wood flooring, with the audio accurately reflecting each surface. This indicates Veo 3’s sophisticated understanding of context-aware foley effects, which has serious implications for traditional foley artists and automated sound design workflows.
    • Multiple users highlighted that Veo 3’s synchronization of audio and visual elements considerably boosts the overall realism of its generations. The system’s nuanced approach to integrating ambient and action-specific audio, such as environmental background and footstep acoustics, is seen as a key technical leap compared to previous generative video models.
  • Cinema, stars, movies, tv… All cooked, lol. - Veo3 is insane… (Score: 1218, Comments: 290): The post discusses the transformative potential of Google DeepMind’s Veo 3, an advanced text-to-video model capable of generating movie-like content with minimal user effort. Commenters note the rapid pace of progress since early AI-generated videos (e.g., the ‘Will Smith eating spaghetti’ meme), emphasizing “character consistency is getting very good” and suggesting models may reach near-photorealism in 2-3 years. There is also concern for the impact on commercial production and traditional media content pipelines. Technical discussion in the comments highlights that the advertising industry is likely to be heavily disrupted (“Commercials are cooked”), and there is debate about Alphabet’s (Google’s) undervaluation in public markets despite perceived leadership in AI video generation.
    • A user highlights the major improvements in AI video generation over a span of 2–2.5 years, referencing the ‘Will Smith spaghetti video’ as an early meme benchmark and noting that character consistency is already becoming strong. They speculate that within another 2-3 years, realism and performance quality in generated videos will reach unprecedented levels, stressing the rapid pace of progress.
    • A technical concern is raised about the current expressive range of Veo: although Veo can create videos with realistic ‘upbeat ad actors’, there is skepticism about its ability to portray nuanced, intense, or emotional performances comparable to those of high-caliber actors (e.g., Daniel Day-Lewis). There is an open question about whether any examples exist that demonstrate this level of emotional depth, suggesting that this may be a current limitation.
    • A participant observes that despite Veo’s advances and Alphabet’s apparent lead in the AI race, this has not been reflected in Alphabet’s stock price, hinting at a disconnect between technical achievement and market perception.
  • Veo 3 generations are next level. (Score: 883, Comments: 152): The post discusses Veo’s latest generative model, presumably Veo 3, which demonstrates a significant leap in generation capabilities. Key points (from limited comment context) suggest advanced features, potentially including synchronized audio with video, though it’s unclear if audio is natively generated or requires external addition. Veo is an AI video generation platform by Google DeepMind, competing with models like OpenAI Sora, but details about the Veo 3 technical upgrades (architecture, benchmarks, etc.) are not specified in this post. Top comments reflect a mix of awe and existential concern about the pace of generative AI development. There is also skepticism about market sustainability, referencing the AI “bubble” potentially bursting in 2024, and questions about the integration or quality of audio in Veo 3 outputs.
    • A user inquires whether the Veo 3 generations model supports native audio generation or if audio must be added in post-processing, pointing to a possible limitation or area for future improvement in multimodal model outputs.
    • Another comment notes that the generated videos are losing typical giveaways of AI generation, specifically mentioning “the perfect pace of traffic” as a remaining indicator. This suggests that Veo 3 has improved realism in output, minimizing traditional tells of synthetic media.
  • We have AI Youtubers now. Both video and sound were generated with Google’s Veo 3. (Score: 1617, Comments: 434): A post highlights that fully AI-generated YouTubers are now possible using Google’s latest video model, Veo 3, which can synthesize both video and audio. The demonstration, referenced from a Twitter/X post, implies end-to-end generation of virtual content creators, suggesting new potential for automated media production with high fidelity generated visuals and speech. Top comments focus on the rapid pace and uncanny nature of AI-generated media, with some expressing concern over how lifelike but unnatural the results appear. No deep technical debate is present in the comments.
    • This post highlights the use of Google’s Veo 3 to generate both the video and audio for an AI-generated YouTuber, signifying a major step forward in multi-modal content creation. Technical observers may note potential artifacts and limitations in the generated video, such as character hand positions not matching mouse and keyboard use, pointing to ongoing challenges with naturalistic animation and context coherence in current generative models.
  • Interdimensional Cable - VEO 3 (Score: 557, Comments: 49): The post references ‘Interdimensional Cable - VEO 3’ but the external Reddit video link is inaccessible (returns 403 Forbidden), leaving no technical details or new benchmarks regarding VEO 3 or its implementation available from the source. Top comments reflect concern about the potential societal impact of advanced video generation/distribution technologies (e.g., likening it to TikTok and Robot Chicken) but do not offer substantive technical debate or detail.
    • No direct technical or performance-focused comments are present in the sample; the remarks are primarily reactions, cultural references, and emotional responses without substantive discussion of the underlying technology (such as VEO 3’s capabilities or benchmarks).
  • Emotions (Fully generated with Veo 3) (Score: 419, Comments: 175): A user showcased a video fully generated using Veo 3, a state-of-the-art text-to-video model from Google DeepMind, indicating the increasing accessibility and realism of synthetic video production. No concrete benchmarks, model parameters, or implementation specifics were provided in the post. The linked content is inaccessible (403 Forbidden), so further technical context or analysis is unavailable. Commenters express concern about the proliferation of low-quality, AI-generated content (‘slop’) diluting platforms like YouTube, but others see potential for democratizing filmmaking by lowering technical and financial barriers to video creation.
    • Discussion highlights concerns about a potential rise in low-quality (‘slop’) content due to generative models like Veo 3, leading to worries that platforms such as YouTube could become oversaturated and less navigable for users seeking higher-quality videos.
    • There is speculation about the impact of advanced video generation models on the film industry, with one commenter predicting significant disruption to traditional production methods and hinting at the possibility of major labor disputes or strikes in Hollywood as a result.
  • Emotions - Veo 3 (Score: 121, Comments: 22): The Reddit post discusses a video allegedly generated with Google’s Veo 3, presumed to be Google’s latest generative video model, but the underlying video is inaccessible (HTTP 403 Forbidden), and thus no technical or benchmark details can be verified or summarized from the source. The title and user comments imply a significant advance over previous video generation systems, possibly positioning Veo 3 as a new state-of-the-art standard, but no direct technical evidence, metrics, or implementation details are available. Top comments suggest that Veo 3 represents a step-change improvement in video generation quality, with some users asserting it surpasses OpenAI’s Sora model and redefines the industry’s expectations for generative video models. These are user impressions rather than technical comparisons based on benchmarks.
    • One commenter points out apparent issues with Veo 3’s output, specifically noting that the pizzeria-guitarist’s face is distorted, suggesting ongoing challenges with coherent face generation, which persists across even cutting-edge video models like Veo 3. This highlights the limitations of current diffusion or transformer-based video models in generating fine facial details and may indicate unresolved aspects of temporal consistency or rendering in complex frames.

2. Multimodal and Open-Source Model Releases (Bagel, TTS)

  • Bytedance released Multimodal model Bagel with image gen capabilities like Gpt 4o (Score: 571, Comments: 92): ByteDance has released BAGEL, an open-source multimodal model with 7B active parameters (14B total), trained on large-scale interleaved image-text data. BAGEL reportedly outperforms open-source alternatives such as Flux and Gemini Flash 2 in image-editing benchmarks, and provides image generation capabilities comparable to GPT-4o. The code and models are available on GitHub and HuggingFace, and the project uses an Apache license. Commenters highlight that BAGEL’s content filters are very restrictive, affecting NSFW usability. There is some confusion about its hardware requirements (noting 7B/14B active/total parameters possibly leading to 16GB VRAM needs), and approval for the permissive Apache licensing.
    • Bagel by Bytedance is released under an Apache license, which is notable for permissive use in both research and commercial projects. One commenter speculates on the model’s resource requirements, suggesting it may need around 16GB of VRAM and noting the confusion around the 7B/14B model naming (possibly indicating parameter sizes or configurations closer to 6GB as well).
    • Technical users have reported that the Bagel demo (demo.bagel-ai.org) features extremely aggressive NSFW censorship filters. Test prompts—even those describing fully clothed women—are flagged as potentially NSFW, leading to concerns about the practical utility of the model for image generation tasks where stringent censorship is not desired.
  • ByteDance Bagel - Multimodal 14B MOE 7b active model (Score: 224, Comments: 36): ByteDance’s Bagel is a new open-source multimodal model featuring a 14B MOE (Mixture-of-Experts) architecture with 7B active parameters, supporting text and image generation (see GitHub, paper). It utilizes SigLIP2 for vision tasks and a Flux VAE for image generation, matches the Qwen2.5 config with a 32k token context window, and achieves state-of-the-art prompt adherence on the GenEval benchmark. Source code shows a preference for the MoT decoder, with the MoE decoder code left in but not primary, and the design duplicates attention modules, making it heavier than comparable Qwen or other Qwen2.5-MOE-2X variants. Commenters note advantages of Apache licensing and highlight the model’s architectural choices (e.g., location of experts, duplicated attention modules, and hardware demands at ~29.2GB). There’s technical comparison to HiDream and other Qwen derivatives regarding expert placement and model weight.
    • ByteDance Bagel uses SigLIP2 for the vision encoder and Flux VAE for generation, closely mapping Qwen2.5’s configuration but with a 32k context window. The model architecture employs an MoE (Mixture of Experts) structure with duplicated attention modules, making it heavier than comparable Qwen models (which do not duplicate attention). Notably, Bagel uses a MoT decoder for image generation, while the MoE decoder (sharing MoT weights) remains in code but is seemingly not the default path (a generic top-k MoE routing sketch appears after this list).
    • The model’s footprint is approximately 29.2GB, indicating a large resource requirement compared to standard 7B models, likely due to the active 7B parameter MOE architecture and duplicated modules referenced above.
    • Benchmarks reportedly show strong performance both in reasoning and multimodal capabilities—significantly better than ‘sort-of-multimodal’ predecessors—with end-user tests describing it as ‘actually really good at talking’ and citing excellent reasoning, according to early anecdotal feedback.
  • You can now train your own TTS voice models locally! (Score: 337, Comments: 54): Unsloth has added support for local fine-tuning and training of Text-to-Speech (TTS) models, offering ~1.5x faster training and 50% less VRAM usage than other setups. Supported models include OpenAI/whisper-large-v3, Sesame/csm-1b, CanopyLabs/orpheus-3b, and any Transformer-compatible architectures. Training follows a supervised fine-tuning (SFT) paradigm using audio-transcript pairs, optionally leveraging LoRA or FFT strategies for efficient adaptation, and allows for expressive voice cloning using datasets like ‘Elise’ which encodes emotional cues. Notebooks for cloud and local workflows, model binaries (quantized/original), and documentation are available via Unsloth GitHub and Hugging Face. There is technical inquiry regarding multi-lingual support (unclear from the post if this is enabled or restricted to English) and concern over the CUDA 12 requirement, which precludes use with AMD hardware or alternative CUDA implementations like ZLUDA.
    • The requirement for CUDA 12 is highlighted, which limits compatibility strictly to NVIDIA GPUs, excluding users with AMD hardware unless utilizing projects like ZLUDA (an effort to enable CUDA on AMD GPUs), thus restricting accessibility for non-NVIDIA users.
    • There is critical feedback regarding audio quality: while local TTS training is progressing, the current output “isn’t that good at all,” suggesting that local models may lag behind cloud-hosted or commercial systems in terms of naturalness and intelligibility.
    • A linked resource, Voice_Extractor, can automate creation of training datasets from podcasts, streamlining the data preparation process for custom TTS model training. This can significantly lower the barrier for building personalized voice datasets.
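
Several of the Bagel comments above ask how ‘experts’ are actually routed in MoE-style decoders. The sketch below shows the generic top-k routing mechanism in plain PyTorch: a learned router scores experts per token, only the top-k experts run, and their outputs are combined with renormalized router weights. It is a conceptual illustration only, not Bagel’s implementation (Bagel’s MoT variant reportedly keeps modality-specific expert weights rather than relying on a learned router), and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Generic top-k mixture-of-experts FFN layer."""
    def __init__(self, d_model=64, n_experts=4, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):             # only the selected experts run for each token
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

tokens = torch.randn(8, 64)                    # 8 tokens of a fake sequence
print(TinyMoE()(tokens).shape)                 # -> torch.Size([8, 64])
```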

3. Anthropic Claude 4 Sonnet/Opus Launch and Expectations

  • Claude 4 Sonnet and Opus Coming Soon (Score: 305, Comments: 106): Anthropic is announcing the imminent release of its new Claude 4 Sonnet and Claude 4 Opus models, as teased in a preview image. Although technical details, benchmarks, and implementation specifics have not yet been shared, the release is expected shortly and is anticipated to feature improvements over previous Claude versions. Top commenters are expressing concerns about anticipated strict API rate limits and infrastructural instability, potentially leading to service disruptions at launch due to high demand, as has been the case with previous major model rollouts.
    • Several users are concerned about anticipated “strict rate limits” and the likelihood of server overload or degraded performance due to surges in demand upon release, referencing ongoing availability issues with similar high-profile model launches.
    • One comment explicitly mentions the high monetization barriers for advanced LLMs, referencing a rumored or existing ~$300/month price point for access to Claude 4, which could limit accessibility for individual or smaller scale users.
  • Claude delivers finally opus can’t stop this excitement (Score: 299, Comments: 117): The image features excited tweets and a code snippet referencing the launch of Anthropic’s new large language models, “Claude 4 Sonnet” and “Claude 4 Opus,” highlighting user anticipation and positioning these as the latest top-tier offerings from Anthropic. While technical details are minimal in the image, the posts frame Opus 4 as potentially a major leap in model intelligence and capability. The code snippet likely serves as a promotional or API usage example for accessing these models, but specific implementation or benchmark data is not shown. Top comments raise measured skepticism about overhyping the Opus 4 model, referencing a typical cycle of hype and subsequent disappointment in AI launches. There are concerns about the practical limitations (context window, usage caps) and comparisons against other models like Deepseek R2, questioning whether Opus 4 will substantively outperform its rivals.
    • There is apprehension regarding the potential rate limits or other usage restrictions with Claude Opus 4, given previous limitations experienced with Opus 3. Users are hoping for improvements, particularly for greater contextual memory, which is a critical technical metric for large language model (LLM) performance and practical utility.
    • Anticipation for Opus 4 is being catalyzed by recent advancements revealed at Google I/O. There is an expectation for heightened competition, resulting in rapid technical iteration and potentially breakthrough features, as incumbents are pressured to reveal their best model capabilities.
    • A comment outlines a likely progression of user sentiment post-release: initial excitement around benchmark results and feature parity/supremacy (e.g., Opus 4 reportedly ‘crushing’ benchmarks or outperforming on reasoning tasks), quickly followed by community scrutiny on areas where the model underperforms or is directly outcompeted by rivals such as Deepseek R2, suggesting a well-known pattern in model lifecycle and reception.

AI Discord Recap

A summary of Summaries of Summaries by Gemini 2.5 Pro Exp

Theme 1: The Model Gauntlet: New Releases, Performance Puzzles, and Shifting Tides

  • Google’s Gemini Suite Expands, Flash Model Gets Nerfed Post-Preview: Google unleashed Gemini 2.5 Pro, speculated to be Nightwhisper, though its performance isn’t deemed “ultra.” While Gemini 2.5 Flash boasts speed, the initial preview version was reportedly nerfed, with users noting “Gemini 2.5 03 25 was amazing, 2.5 05 06 well, nerf too strong,” and even Google’s own Gemini Model card showed reduced benchmarks. Gemma 3n models, including 1B & 4B parameters, were also previewed (Gemma 3n docs), with some claiming Gemma-3n-4B rivals Claude 3.7.
  • Claude 4 Release Imminent Amidst High Expectations and Pricey Predictions: Rumors flew about Anthropic’s Claude 4 (possibly the Neptune model) releasing soon, with sources like The Information and an AnthropicAI tweet fueling speculation; some anticipate it could render Codex obsolete, while potential pricing is mooted around $200/month. Meanwhile, Meta’s Llama 3.3 8B release delay drew criticism, with users stating, “They just do not care about consumers,” preferring open weights for the smaller model over the available Llama 3.3 70B API.
  • OpenAI Faces Downgrade Disappointment, While Mistral Unveils Devstral Coder: Users voiced concerns over OpenAI model downgrades, particularly the performance dip in o4 mini post-release, one quipping, “VS 200$ pro OpenAI - ‘We change, downgrade, sht on models at any moment we feel necessary’.” Mistral AI countered with the open-source coding agent Devstral, exciting the community despite some noting its benchmarks aren’t SOTA, but its open nature is a big plus.

Theme 2: Powering the Prompts: Hardware Hustles and GPU Grandeur

  • Strix Halo Flexes 96GB GPU RAM, MI300 Dominates Leaderboards: The Strix Halo CPU/APU, priced around $2k, turned heads with 96GB of RAM available for its integrated GPU, offering a potent option for VRAM-hungry tasks. On the discrete GPU front, AMD’s MI300 demonstrated impressive performance, with one user achieving first place on the amd-mixture-of-experts leaderboard at 8.82 ms and on amd-fp8-mm at 120 µs. Elon Musk also declared he expects to keep buying GPUs from Nvidia and AMD.
  • Dual GPUs: A Learning Curve Paved with PCIE Bottlenecks: Discussions highlighted that running dual GPUs provides valuable job market experience, particularly in understanding the performance impact of the PCIE bus when models communicate across it. While not user-friendly, configuring TensorRT with tensor_parallel can reportedly achieve up to 90% speedup.
  • Multihead GRU Layers Arrive in Triton, RDNA4 Compiles in Tinygrad: Developers added Multihead GRU layers written in Triton to cute-kernels, enabling parallelization across SMs for potential acceleration. Separately, RDNA4 instructions now compile successfully in tinygrad, though adjustments to the number of waves per CU might be needed for optimal performance.

Theme 3: Forging the Future: Frameworks, Fine-Tuning, and Agent Architectures

Theme 4: AI in Action: Multimodal Marvels to Model Misbehaviors

Theme 5: Ecosystem Evolution: Open Source Onslaughts, Funding Feats, and Platform Puzzles


Discord: High level Discord summaries

LM Studio Discord

  • Bagel’s Multimodal Model Breaks New Ground: Researchers created Bagel (demo) (project), a multimodal model that achieves text and image generation.
    • The unquantized form needs 40GB of VRAM, while the quantized form needs 24GB VRAM, and stands out with its prompt following capabilities and ability to modify generated images contextually.
  • Strix Halo Boasts 96GB GPU RAM: The Strix Halo, priced at $2k, features an impressive 96GB of RAM available for the GPU, presenting a viable option for running resource-intensive tasks.
    • A member suggested that older Xeon CPUs can be a cost-effective alternative, costing around $1k and being capable of running a variety of tasks.
  • LM Studio Integrates Speculative Decoding: A user celebrated the addition of speculative decoding in LM Studio, showing improved performance after CUDA was enabled.
    • Following an NVIDIA guide notably boosted the user’s model processing speed (a toy sketch of how speculative decoding works appears after this list).
  • Dual GPUs Offer Knowledge, Slow PCIE Bus: Having two GPUs can provide valuable experience for the job market, teaching users about the impact of PCIE bus speed when models communicate over it.
    • The real issue is that the configuration is not user friendly at all, but configuring TensorRT with tensor_parallel can reportedly achieve up to a 90% speedup.
  • Gemma Sucks on /no_think: Users sought methods to disable the thinking-process output (the think tags) in chatbot models.
    • While /nothink or /no_think works effectively with Qwen3 models, members found Gemma to suck.
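
Since speculative decoding comes up in the LM Studio item above, a toy sketch of the idea may help: a small draft model cheaply proposes several tokens, and the large target model verifies them, keeping the matching prefix. Both models are faked below purely for illustration; real implementations verify all draft tokens in a single target forward pass and accept/reject probabilistically so the output distribution matches the target model exactly.

```python
import random

def draft_model(prefix, k=4):
    """Cheap stand-in for the small draft model: guess k next tokens."""
    return [random.choice("abcde") for _ in range(k)]

def target_model(prefix):
    """Stand-in for the large target model: its 'true' next token for a prefix."""
    return "abcde"[hash(prefix) % 5]

def speculative_step(prefix, k=4):
    proposal = draft_model(prefix, k)
    accepted = []
    for tok in proposal:
        if tok == target_model(prefix + "".join(accepted)):
            accepted.append(tok)             # draft agrees with the target: token comes "for free"
        else:
            break                            # first disagreement: discard the rest of the draft
    accepted.append(target_model(prefix + "".join(accepted)))  # the target always adds one token
    return accepted

prefix = "hello "
for _ in range(3):
    new = speculative_step(prefix)
    prefix += "".join(new)
    print(f"accepted {len(new)} token(s) -> {prefix!r}")
```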

Unsloth AI (Daniel Han) Discord

  • OpenAI Downgrades Disappoint: Members voiced concerns over OpenAI model downgrades, citing performance dips in o4 mini post-release.
    • One member quipped, “VS 200$ pro OpenAI - ‘We change, downgrade, sht on models at any moment we feel necessary’”, expressing hope in OSS alternatives.
  • Unsloth featured at Google I/O: Unsloth starred at Google I/O, demoing Gemma+Unsloth+Collab in this video.
    • Attendees expressed excitement, with many joining the Discord server post-demo.
  • Gemini Diffusion Model Claims Lightning Speed: A user reported Gemini Diffusion as FAST, clocking around 800 tokens per second.
    • They linked Llama-3.1-Nemotron-Nano-4B-v1.1 on Hugging Face, noting it’s not SOTA but its speed impresses.
  • Phi-4’s Performance Drops After LoRA Merge: A user reported a dramatic performance drop in unsloth/phi-4 after merging the LoRA adapter, despite updating Unsloth.
    • A contributor suggested specific steps involving PeftModel, merge_and_unload(), safe_serialization=True, and avoiding load_in_4bit=True during inference, recommending load_in_8bit=True (a hedged sketch of these steps follows this list).
  • RAFT Notebook Impresses Community: A user shared an Unsloth notebook demonstrating retrieval augmented finetuning (RAFT) for Llama 3.2 1B, available on GitHub.
    • The notebook received positive feedback, leading to a pull request for inclusion in the official Unsloth cookbook.
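
The merge advice in the phi-4 item above maps onto a fairly standard PEFT recipe; here is a hedged sketch of those steps. The adapter and output paths are placeholders, the dtype choice is illustrative, and whether 8-bit loading suits a given inference setup is the contributor’s suggestion rather than a guarantee.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "unsloth/phi-4"            # base model the LoRA was trained on
ADAPTER = "./phi4-lora-adapter"   # placeholder path to the trained LoRA adapter

# 1) Load the base model in 16-bit (not 4-bit) before merging.
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="bfloat16")
model = PeftModel.from_pretrained(base, ADAPTER)

# 2) Fold the LoRA weights into the base weights.
merged = model.merge_and_unload()

# 3) Save with safetensors so the merged checkpoint loads cleanly elsewhere.
merged.save_pretrained("./phi4-merged", safe_serialization=True)
AutoTokenizer.from_pretrained(BASE).save_pretrained("./phi4-merged")

# 4) For inference, the suggestion was to avoid load_in_4bit=True and prefer 8-bit, e.g.:
# AutoModelForCausalLM.from_pretrained("./phi4-merged", load_in_8bit=True)
```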

LMArena Discord

  • Gemini 2.5 Pro Releases but Fails to Dazzle: Members reported that Gemini 2.5 Pro is out, speculated to be the Nightwhisper model; however, its performance is not at an ultra level.
    • Members noted 2.5 Flash is very fast but no longer a thinking model unless you turn on Canvas.
  • Claude 4 Allegedly Coming Tomorrow: Reports indicate Claude 4 is scheduled to be released tomorrow, according to sources at The Information, with AnthropicAI seemingly confirming.
    • Speculation suggests it might be the Neptune model, potentially costing $200/month.
  • LMArena Valued at $600M After $100M Funding: LMArena’s valuation is reportedly at $600M after a recent $100M funding round, according to this Bloomberg article.
    • A user commented that this investment does not make any sense unless you are receiving millions from Google for RLHF.
  • Grok 3.5’s Fate in Limbo: After limited testing, one user claimed that Grok 3.5 is really good, and knows many things other models don’t, but release is still uncertain.
    • Speculation suggests Elon may have postponed the Grok 3.5 release to work on benchmarks, or that it may never arrive, due to being front-run by other models.
  • LMArena Plans New Beta Site: LMArena plans to switch the current site to the beta site next week.
    • The company has added custom emojis such as :lmarena: and :battle: to the server and will host an AMA.

Perplexity AI Discord

  • Grok Users Export PDFs via Workaround: A user demonstrated exporting Grok’s responses as a PDF using the built-in export option, instead of requesting the bot to create a dedicated PDF, and shared the resulting PDF.
    • Meanwhile, another user reported that Perplexity AI’s image analysis feature got stuck and received no community responses.
  • Image Generation Stalls on Perplexity Mobile App: Users observed the inability to generate images on the Perplexity AI mobile app, as the functionality is exclusively supported on iOS and does not work on Android devices.
    • There were no additional details or solutions offered.
  • Perplexity’s Data Collection Triggers Privacy Debate: Controversy erupted over the CEO’s statement regarding the collection of browsing data outside the app for personalized ads, prompting users to cite a YouTube interview as evidence of Perplexity’s intent to utilize browsing data.
    • Counterarguments emerged, suggesting the data serves memory enhancement rather than advertising, while noting the availability of opt-out options for ads and memory deactivation.
  • Gemini 2.5 Deep Research Rate Limits Debated: Discussions around rate limits for Gemini 2.5 Deep Research arose after a link to Google Support was shared, revealing the removal of the Gemini 2.0 flash model and its replacement with Gemini 2.5.
    • It’s believed the rate limits for deep research are now per day instead of per month, though conflicting interpretations persist.
  • API Calls Blocked by Robots.txt: The WebUI can access some websites that the API cannot because websites are blocking the crawler for the API in their robots.txt files.
    • If a site is accessible via the API, users can set search_context_size to high or reference the official docs (a hedged request sketch follows this list).
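
For the search_context_size tip above, a request sketch might look like the following. The field layout (web_search_options / search_context_size), the model name, and the endpoint behavior should all be checked against the official docs referenced in the item; the API key and model are placeholders.

```python
import requests

headers = {"Authorization": "Bearer PPLX_API_KEY",   # placeholder key
           "Content-Type": "application/json"}

payload = {
    "model": "sonar-pro",  # assumed model name; check the docs for current options
    "messages": [{"role": "user", "content": "Summarize today's AI model releases."}],
    # The tip above: request a larger search context for sites the API crawler can reach.
    "web_search_options": {"search_context_size": "high"},
}

resp = requests.post("https://api.perplexity.ai/chat/completions",
                     json=payload, headers=headers, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])
```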

OpenRouter (Alex Atallah) Discord

  • Weaver One App Receives Significant Updates: The Weaver One app now supports MCP, Gemini image generation, PDF/File support/viewing, PDF Generation, and offers enhanced UI/UX, including features like automatic chat titles and Markdown export.
    • The updated version also allows users to implement their own keys (BYOK) across platforms like OpenAI, Gemini, Anthropic, Vertex, and OpenRouter.
  • Gemini Diffusion’s Speed Astounds Users: Users rave about the exceptional speed of Gemini Diffusion, drawing comparisons to Groq/Cerebras, and shared a screen recording showcasing its performance.
    • Currently, Google is the exclusive host for Gemini Diffusion previews, which prompts questions about the broader adoption of diffusion-based LLMs.
  • Meta’s Llama 3.3 Release Strategy Draws Criticism: Members voiced discontent with Meta for its delayed release of Llama 3.3 8B, suggesting a disregard for consumer needs; linking to Llama 3.3 70B, one member wrote: “They just do not care about consumers.”
    • The community indicates that having the 8B version with open weights would be preferable, despite the availability of a 70B API.
  • Gemma-3n-4B Allegedly Rivals Claude 3.7: Users noted claims that Gemma-3n-4B performs comparably to Claude 3.7, referencing this Gemma blogpost for context.
    • Evaluations from the Chatbot Arena suggest that user preferences align with this assessment, indicating a plausible performance level.
  • Gemini 2.5 Pro Hit by Nerf Hammer: Reports indicate a significant performance decrease in Gemini 2.5 Pro, with claims that even o4 mini outperforms it and that “Gemini 2.5 03 25 was amazing, 2.5 05 06 well, nerf too strong.”
    • Citing the Gemini Model card, members noted that Google itself acknowledged reduced benchmark scores across all areas except coding.

Cursor Community Discord

  • Cursor Users Hit by Service Unavailable: Multiple users reported encountering a ‘Service Unavailable’ error while using Cursor, though it’s generally understood to not be caused by Cursor itself.
    • The underlying cause of the error remains unclear, but users speculated that it may be linked to external service dependencies.
  • Rust Rockets in Resume Race: One member mentioned using Rust for their backend to enhance their CV, while another suggested listing desired tech skills and then creating portfolio projects to match.
    • The discussion highlighted the increasing popularity of Rust in backend development and its value in showcasing technical proficiency to prospective employers.
  • Gemini’s Google AI Plans Go Live: Users discussed Google’s new Deep Think mode for Gemini, priced at $125/month for the first three months and then $250/month, accessible via Gemini’s “Google AI Plans” at the bottom of the page.
    • A YouTube link was shared with a timestamp at 1:39:22 for the Deep Think announcement.
  • Claude 4 Teased for Takeoff: Enthusiasts are eagerly anticipating the arrival of Claude 4 Sonnet and Opus, with speculation pointing towards a potential release this week, even posting teasers such as this tweet.
    • Some hope that it’ll be good enough to render Codex obsolete.
  • VerbalCodeAi Invites Innovation: A user shared their new project, VerbalCodeAi on GitHub, inviting others to join their Discord server (discord.gg/KpjSDEwWCF) for collaboration and assistance with server setup.
    • The project aims to create a collaborative environment for developers to contribute to and improve the codebase.

HuggingFace Discord

  • Powerful GPU Rental for SBERT Fine-Tuning: A member seeks recommendations for GPU rental services to speed up sbert model fine-tuning, looking for providers billing around $1.50/hour or less for an A100 instance.
    • The member is currently using a 3060 12GB which takes 8-9 hours to fit the model, while another member recommended Runpod for renting an A100 at about a dollar per hour.
  • AMD Graphics Cards face CUDA compatibility: A user considering a 7900 XTX (24GB VRAM) upgrade from a 3080 12GB is concerned about AMD’s compatibility due to the lack of CUDA support.
    • While some issues were reported, LM Studio was confirmed to work well, with one user reporting 62.73 t/s, indicating that AMD cards may require additional configuration.
  • HuggingFace SEO struggles to return latest docs: A member reported that DuckDuckGo, Qwant, and Yandex are not properly indexing the huggingface.co domain, resulting in outdated documentation appearing in search results, with screenshots provided.
    • Another member validated the issue using Bing’s search results for hf_hub_download.
  • DataTune Transforms Data with Natural Language and LLMs: DataTune, an open-source tool from Vitalops available on GitHub, leverages natural language instructions and LLMs for data transformations.
    • The tool bypasses context length limitations and minimizes API costs by enabling data transformation through simple natural language instructions.
  • LinkedIn Certificate Links Redirect to Agents Course: Users report that LinkedIn Certificate creation links for the LLM fundamentals course are redirecting to the agent course page.
    • The issue arises after quiz completion, with the LI button next to the produced image incorrectly linking to the agent course.

Eleuther Discord

  • AI Slop Debated, Fragility Failures Surface: Members debated the definition of AI Slop as content masquerading as relevant but failing basic scrutiny, referencing the AI Slop Wikipedia article.
  • Discord Data Spurs Suspicion: Members explored use cases for a massive Discord dataset, raising concerns about privacy and anonymization.
    • Potential applications include identifying suspicious activity and analyzing responses to scientific papers.
  • RAG Database Leaks Expose PII Risk: A paper (https://arxiv.org/abs/2505.12540) and Twitter thread highlighted the risk of backtracking PII from leaked RAG databases, even when leaked as embeddings.
    • The discussion emphasized that sensitive information could still be recovered despite the data not being in text form.
  • Qwen 2.5 Eval Gets Chain of Thought: Members attempting to reproduce GSM8k scores for Qwen 2.5 discovered Qwen’s preference for the CoT (Chain of Thought) variant in their evals.
    • A member pointed to the gsm8k_cot config on the harness, linking to their gsm8k_prompt.txt.

Notebook LM Discord

  • Audio Overviews now adjustable: Users can now adjust Audio Overviews to short (~5+ min), long (~20+ min), and default (~10+ min) settings, initially only in English.
    • This customization allows for variable depth when the AI discusses source materials.
  • Google I/O Keynote now summarized: A NotebookLM notebook summarizing this year’s Google I/O keynote is available at g.co/notebooklm/io2025.
    • Further details about the summary’s content were not provided in the discussion.
  • Video Overviews Feature Posts: A preview of the new Video Overviews feature was posted on X.
    • Additional details about the feature were not elaborated upon in the discussion.
  • PDF Uploads unlock class content creation: Users successfully print class materials to PDF in Chrome and upload them to leverage the Study Guide, Briefing Doc, Timeline, and FAQ functions.
    • Custom Audio Overviews focusing on content aspects and the MindMap function within the Chat Panel are also enabled.
  • AI Studio webpage workaround: Users are printing webpages to PDF for upload into AI Studio to circumvent the lack of built-in webpage reading.
    • One user reported upload failures, resolving the issue by switching accounts.

Latent Space Discord

  • Google’s Gemini is AI Operating System at I/O 2025: At Google I/O 2025, Gemini was revealed as an AI OS, featuring tools like Gemini Live, Imagen 4, Veo 3, Deep Research, Canvas, Gemini in Chrome, Interactive Quizzes, and the faster 2.5 Flash as default.
    • Google launched AI Pro and Ultra plans, previewing Agent Mode for autonomous assistance, emphasizing Gemini 2.5 Pro’s leading LLM performance; some felt the term AI Operating System was clickbait bullshit.
  • Google’s Gemma 3n Debuts as Small Model Preview: Google announced a preview of Gemma 3n (docs), a pretty cool small model available on HuggingFace (collection).
    • The release seems to replace the previous Gemma 3 with 1B & 4B parameters, while the larger 12B & 27B parameter models remain available.
  • Google’s Stitch is AI-Powered UI/UX Design: Stitch by Google Labs, an evolution of Galileo AI powered by DeepMind models, enables quick generation of designs and UIs, leveraging Gemini and Imagen (X Post).
    • Features include automatic theme updates, product image adjustments, multilingual copy generation, and the ability to export frontend code; Google acquired Galileo AI and renamed it Stitch (explore).
  • OpenAI Tweaks Structured Outputs: OpenAI Developers rolled out improvements to Structured Outputs, including parallel function calling with strict mode, plus support for keywords like string lengths/formats, min/max ranges for numbers, and min/max array elements (X Post).
    • The update credits the LLGuidance team for their foundational work; a member noted that support for min/max array elements finally resolves a long-standing gripe (a hedged schema sketch follows this list).
  • Altman and Ive Design AI Computers: Sam Altman and Jony Ive are partnering to create a new generation of AI-powered computers (X Post, announcement).
    • Speculation centers on benefits like simplified daily tasks and new device forms but also acknowledges potential issues such as high costs and privacy concerns; one member mentioned that a friend’s roommate from Humane will be interviewing to work on this.
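To make the Structured Outputs item concrete, here is a minimal sketch of a strict JSON schema using the newly supported keywords (string length/format, numeric ranges, array element counts); the model name and field names are illustrative assumptions, not part of the announcement.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative schema using the newly supported keywords:
# minLength/maxLength and format on strings, minimum/maximum on numbers,
# and minItems/maxItems on arrays.
schema = {
    "type": "object",
    "properties": {
        "username": {"type": "string", "minLength": 3, "maxLength": 20},
        "signup_date": {"type": "string", "format": "date"},
        "age": {"type": "integer", "minimum": 13, "maximum": 120},
        "tags": {"type": "array", "items": {"type": "string"}, "minItems": 1, "maxItems": 5},
    },
    "required": ["username", "signup_date", "age", "tags"],
    "additionalProperties": False,  # required by strict mode
}

resp = client.chat.completions.create(
    model="gpt-4.1",  # placeholder; any model that supports Structured Outputs
    messages=[{"role": "user", "content": "Invent a plausible user profile."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "user_profile", "strict": True, "schema": schema},
    },
)
print(resp.choices[0].message.content)  # JSON string constrained to the schema
```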

GPU MODE Discord

  • Elon plots Massive GPU shopping spree: Elon Musk says he expects to keep buying GPUs from Nvidia and AMD.
    • A member reacted to the claim of a ā€œ1 million GPU facilityā€ with a gigachad emoji.
  • Multihead GRU Layers Grace Cute Kernels: Multihead GRU layers written in Triton have been added to cute-kernels, allowing parallelization across SMs.
    • These additions may potentially accelerate model training and inference by enabling more efficient utilization of GPU resources.
  • Google Gemini Diffuses Into New Territory: Google released Gemini Diffusion, prompting discussions around Google’s embrace of diffusion models.
  • MI300 Sprints to the Top: Multiple users submitted results to the amd-mixture-of-experts leaderboard on MI300, with user <@1173619488730665011> achieving first place at 8.82 ms.
    • Submissions to the amd-fp8-mm leaderboard showcased very fast execution times on MI300, with user <@1173619488730665011> securing first place at 120 µs.
  • Factorio Learning Enters New Era: Members shared a YouTube video of a Factorio 4M SPM (Science Per Minute) championship build and a member shared MineLand, a multi-agent Minecraft simulator available on GitHub.

aider (Paul Gauthier) Discord

  • Gemini 2.5 Flash: Speed and Edit Fidelity Shine: Google’s Gemini 2.5 Flash Preview is drawing attention for its speed, cost-effectiveness, and precise adherence to edit formats.
    • However, some members clarified that uploaded files are saved as ā€˜knowledge’ for reference, not for continually updating the agent’s base knowledge.
  • Background Agents Evoke Developer Weariness: Concerns arose regarding the efficacy of background agents such as Jules and Cursor, with many suggesting that AI may not be sufficiently advanced to handle real-world development tasks independently.
    • One member’s experience with Jules yielded unremarkable results, despite its free availability.
  • Aider Showcased in Gemini 2.5 Pro Benchmarks: Aider Polyglot is now a featured benchmark on the Gemini 2.5 Pro homepage, validating its utility in diverse coding tasks.
    • Despite this recognition, some users have observed that Gemini occasionally disregards established coding conventions, prompting some to downgrade to version 3.7.
  • Copilot’s Open Source Potentially Benefits Aider: With Copilot becoming open source, the community is analyzing aspects that could be integrated into Aider, particularly concerning source code management and agentic functionalities.
    • While Copilot uses RAG and semantic vectors, Aider relies on plaintext for storing the repo map and conversation history.
  • Aider’s Concise Prompts Praised Against Cursor: A user comparing Aider and Cursor noted that Aider’s responses to simple /ask prompts are tighter, less verbose, and easier to traverse.

Manus.im Discord Discord

  • RizzDial Gains Traction from Manus Promo: A member reported receiving a lead for their software, RizzDial, after the lead heard about it from Manus, calling it free marketing.
    • They shared a recording of the interaction, highlighting the unexpected benefit.
  • Users Bemoan Manus Credit Crunch: Multiple members voiced concerns over the credit consumption and token allowances relative to cost on Manus.
    • A suggestion arose for a freemium model, mirroring ChatGPT’s structure, to alleviate user burden.
  • Manus Gets Compared to Cluely.ai: A user questioned whether Manus is akin to Cluely.ai, referencing a kid that got kicked out for cheating in uni or something.
    • The user also mentioned exploring Chinese AR glasses by Inmo and experimenting with cyberspace integration.
  • Manus Image Generation Falls Flat: Despite initial hype, such as one user’s claim that manus image generation- insane šŸ”„ šŸ”„šŸ”„, sentiment shifted towards disappointment.
    • Many users concurred that the image generation feature is gimmicky and yields poor results.
  • Manus Powers Coding Projects (Sometimes): One user’s IT teacher recommended Manus, praising it as highly effective for large coding projects, describing it as the most powerful AI out there right now.
    • However, the user encountered difficulties when converting code to Python for a school assignment.

Nous Research AI Discord

  • TTS Parameter Estimates Shrink: Members discussed the number of parameters needed for audio output, referencing Orpheus (8B parameters) and Outetts (1B parameters) as examples of Llama-based TTS models, suggesting that parameter sharing with text and audio input could further reduce the parameter count.
    • Others considered how the approach could improve by combining audio output with shared text and audio input features, and found this promising.
  • Gemini 2.5 Flash Unveiled, but Gemini diffusion model is ā€˜shy’: Members discussed access to Gemini 2.5 Flash 0520 and Gemma 3, available via AI Studio, with one member noting the Gemini diffusion model seemed ā€œa bit shyā€.
    • It was additionally mentioned that the Deep Think model is apparently in closed beta.
  • Diffusion Models Generate Text in Parallel: Members discussed diffusion models and their ability to process chunks of text in parallel, potentially allowing for non-causal text generation for applications like infilling.
    • Members also wondered how diffusion models relate to the KV cache, which they don’t seem to leverage even when there is plenty of space available.
  • WildChat Dataset Generation Brainstorm: Members brainstormed how to improve the WildChat dataset, a dataset to train models to chat, with one suggesting running Hermes through WildChat prompts.
    • The suggestion to run Hermes through WildChat prompts was considered ā€œa much better, more autonomous, and implementable optionā€.
  • Devstral Coding Agent: Open Source and Ready to Code: Mistral AI released Devstral, an open-source coding agent that members are excited about.
    • Some members mentioned that its benchmarks aren’t amazing, but the community is excited about it being open source.

Modular (Mojo šŸ”„) Discord

  • Claude Fumbles Mojo Syntax 🄓: Members reported Claude frequently produces syntax errors and incomplete code when generating Mojo, especially after the Modular docs suggested it.
    • Some members advise against using AI for new languages or systems programming, arguing that Mojo’s similarities to Python and C++ confuse LLMs; Cursor was suggested as the preferable option.
  • Float16 Falters in exp Function šŸ“‰: The exp function in Mojo is implemented only for Float32 and Float64, with Float16 values cast up because of potential precision issues, as discussed in this GitHub issue.
    • Accumulating errors in Float16 computations can render results useless, and CPU support for fp16 math isn’t universal, often requiring upcasting to fp32.
  • Simple Mojo Compile Suffers Slowdown 🐌: A user reported experiencing unexpectedly long compile times (3-4 seconds) for a super-simple ā€œHello, Mojo!ā€ program using Mojo version 25.3.0 on Ubuntu 22.04 (WSL2).
    • Investigation suggests the issue is potentially related to the specific installation or WSL2’s filesystem performance; even REPL mode exhibits delays.
  • --sanitize Stumbles on String Shenanigans šŸ’„: A user found that upgrading to Mojo 25.3 broke their code when running with --sanitize on OSX, likely due to a String null termination issue combined with unsafe_ptr usage.
    • It was confirmed that Strings are not always null terminated now, requiring code adjustments to prevent out-of-bounds access.
  • Torch’s CustomOpLibrary Trips on MLIR Context šŸ¤•: A user reported a RuntimeError: No active MLIR context error when attempting to call any custom op through torch’s CustomOpLibrary implementation.
    • A minimalist example demonstrating the MLIR context error with CustomOpLibrary was posted on the Modular Forum.

Yannick Kilcher Discord

  • Batched Input Baffles GATConv: Users reported an AssertionError with torch_geometric.nn.GATConv when processing batched input, specifically x.shape: [batch_size, num_nodes, num_node_features] and edge_index.shape: [batch_size, num_nodes].
    • The model seems incompatible with a leading batch dimension, which is unexpected since the training loop functions correctly with single batches; a hedged batching sketch appears after this list.
  • SGD’s Spaghetti Code Surprise: A user shared a paper claiming that SGD produces internal representations akin to spaghetti code, characterized by copy-pasted concepts randomly entangled, which are efficient but not human-understandable.
    • The paper uses a CPPN (Compositional Pattern Producing Network, conceptually a proto-NeRF for 2D) example produced via an open-ended, semi-manual genetic breeding process (e.g., Picbreeder) to question whether this internal representation’s beauty matters for downstream tasks.
  • Physics of LMs Expose Knowledge Gaps: A paper titled ā€œPhysics of Language Models: Part 3.2, Knowledge Manipulationā€ (ssrn.com) reveals language models’ deficiencies in leveraging stored knowledge for tasks like classification, comparison, and inverse search.
    • The study found that models struggled with basic tasks unless Chain of Thoughts (CoTs) were used during training and inference, showing near-zero performance in inverse knowledge search, while one member noted that the paper controlled for data contamination much better than most.
  • Google’s Gemma 3n arrives, AI Edge follows: Google released Gemma 3n, their newest model, as detailed in their documentation, and announced AI Edge Small Language Models with multimodality, RAG, and function calling, described in a blog post.
    • Google’s AI Edge models feature multimodality, RAG, and function calling capabilities.
  • Humane and Rabbit Take a Dive: Members observed that the Humane AI Pin and the Rabbit R1 didn’t take off, citing difficulties surpassing the mobile phone form factor.
    • A founder on Twitter hopes that OpenAI will acquire their startup.
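As a follow-up to the GATConv item above, here is a minimal sketch of how torch_geometric typically expects batched graphs: merged into one disconnected graph via Batch.from_data_list rather than a leading batch dimension. The shapes and sizes below are illustrative assumptions, not the user's actual data.

```python
import torch
from torch_geometric.data import Batch, Data
from torch_geometric.nn import GATConv

# GATConv expects x of shape [num_nodes, num_features] and edge_index of shape
# [2, num_edges]; multiple graphs are batched by merging them into one big
# disconnected graph, not by adding a leading batch dimension.
graphs = []
for _ in range(4):                                 # "batch size" of 4 graphs (illustrative)
    x = torch.randn(10, 16)                        # 10 nodes with 16 features each
    edge_index = torch.randint(0, 10, (2, 30))     # 30 random directed edges
    graphs.append(Data(x=x, edge_index=edge_index))

batch = Batch.from_data_list(graphs)               # 40 nodes total, edge indices offset per graph
conv = GATConv(in_channels=16, out_channels=32, heads=4)
out = conv(batch.x, batch.edge_index)

print(out.shape)          # torch.Size([40, 128]): heads are concatenated by default
print(batch.batch.shape)  # torch.Size([40]): maps each node back to its source graph
```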

MCP (Glama) Discord

  • Streaming Transport Gains Traction: The streaming transport from the 2025-03-26 version is experiencing increased adoption, with VSCode supporting streamable HTTP.
    • There was also a suggestion to decouple transport and wire protocols, reinforcing MCP’s transport-agnostic design.
  • SmartBuckets Eliminate RAG Bottleneck: LiquidMetal AI introduced SmartBuckets, which indexes files into vectors and a knowledge graph, runs serverless, and connects directly to Anthropic’s Model Context Protocol (MCP) via a simple endpoint, see LiquidMetal AI.
    • This product promises to resolve RAG bottlenecks.
  • Agents Now Act as MCP Servers: A significant update to mcp-agent now enables Agents to function as MCP servers, allowing any MCP client to invoke, coordinate, and orchestrate agents, code for this here.
    • This enhancement facilitates the integration of agents within MCP workflows.
  • MCP UI Bridge Brings Accessibility: The new mcp-ui-bridge library makes web applications equally accessible to human users and LLMs via semantic data-mcp-* attributes, letting existing web apps be instrumented with a single, unified development effort (npm here, GitHub here).
    • The library converts any frontend into a CLI-accessible version for LLMs and starts an MCP server that lets LLMs navigate your frontend with code; see MCP UI Bridge.
  • Tool Naming Conventions Still Troublesome: Strict naming conventions on tool names are causing concerns, especially with GitHub Copilot, leading to issues with namespacing.
    • Proposed solutions involve adopting a system similar to env var namespaces, such as audio__play_track.

Cohere Discord

  • Cohere’s Clients get Private LLM Deployment: Cohere offers private deployments for customers concerned with data/LLM sovereignty, providing flexible deployment options; reach them via [email protected] or [email protected].
    • This is a core part of their solutions, but there were no further details given regarding security.
  • Command-A gets Slower but Cohere Investigates: Users reported slow response times with command-a, especially with the structured response parameter; a Cohere employee requested details for investigation via [email protected].
    • One user found that when specifying json_object as output, requests would hang or take about 2 minutes with 6405 input and 2k output tokens.
  • Embed v4 Slows and Stabilizes: Users experienced super slow embedding v4, clocking about 180 seconds per embedding, prompting Cohere support to acknowledge an incident due to a rate limiting system.
    • Cohere support claims to have manually scaled up, fixed the rate limiting bug, and vowed to add Embed V4 to the status page; Cohere’s status page shows degraded service affecting embed-v4.0 as of May 21, 2025.
  • Backup Plan for Embed v4: A user requested a backup plan for Cohere’s v4 embedding model, suggesting access via both Cohere API and AWS for redundancy.
  • Command-* Models Become Favorite: A user reported that they haven’t found a better self-hostable model for general-purpose, personal chat and assistant use than the command-* models.
    • Concerns exist around building a vector store with a non-open-sourced embedding model, especially after the recent downtime, since it’s a big investment to build a large vector store with a particular embedding model.

LlamaIndex Discord

  • LlamaIndex Shifts Gears with uv and LlamaDev: LlamaIndex transitioned from Poetry and Pants to uv and LlamaDev to handle their monorepo of 650+ community packages, streamlining development, detailed in this blogpost.
    • The migration aimed for a faster and simpler development experience, consolidating package management and build tools.
  • LlamaIndex Opens Doors with Discord Office Hours: LlamaIndex announced its inaugural Discord Office Hours featuring Tuanacelik, LoganMarkewich, and Clelia, focusing on Agentic workflows and general LlamaIndex queries, highlighted in this announcement.
    • These sessions provide a direct line for community members to engage with experts and get their pressing questions answered.
  • Llama Parse Layout Agent Gets Tripped Up: Users flagged issues with Llama Parse’s Parse with Layout Agent, with jobs timing out after 30 minutes, which was acknowledged as a known issue.
    • A temporary fix was deployed involving a 10-minute timeout and fallback to a weaker parsing mode after initial attempts to resolve the issue proved insufficient.
  • VectorStoreIndex Wrestles FAISS: A user sought clarity on the differences between using a VectorStoreIndex and a local FAISS for storage and their impact on RAG model performance.
    • It was clarified that VectorStoreIndex is a wrapper around any vector store, including FAISS, with a reference to a FaissIndexDemo; a minimal sketch follows this list.
  • llamaindex/vdr-2b-multi-v1 throws tantrums: A user encountered a ValueError when using the llamaindex/vdr-2b-multi-v1 model, pinpointed to a size parameter issue, stemming from a transformers update.
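For the VectorStoreIndex-vs-FAISS question above, a minimal sketch of wrapping a local FAISS index with VectorStoreIndex; import paths assume a recent llama_index release with the llama-index-vector-stores-faiss package installed, and the embedding dimension is an assumption that must match your embedding model.

```python
import faiss
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores.faiss import FaissVectorStore

# VectorStoreIndex is a thin wrapper over whatever vector store backs it;
# here that store is a local FAISS index. EMBED_DIM must match the embedding
# model in use (1536 assumes OpenAI's default embeddings).
EMBED_DIM = 1536
vector_store = FaissVectorStore(faiss_index=faiss.IndexFlatL2(EMBED_DIM))
storage_context = StorageContext.from_defaults(vector_store=vector_store)

docs = [Document(text="LlamaIndex can persist its vectors in a local FAISS index.")]
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)

print(index.as_query_engine().query("Where are the vectors stored?"))
```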

Torchtune Discord

  • Qwen2’s Silent Treatment: A member reported using torchtune generate on Qwen2_5_0_5b and seeing no output, using a provided config.
    • The observed output included <|im_start|>user Tell me a joke.<|im_end|> <|im_start|>assistant<|im_end|> <|endoftext|> and the issue was resolved via this PR which fixed a bug in the tokenizer that caused the EOS token to be appended prematurely.
  • LORA Finetuning Flips Out: A member reported that after adding new tokens to the model and performing LORA finetuning, the output became gibberish and a member recommended trying the resize_token_embeddings utility from torchtune.modules.embedding_utils.
    • They suggested this tool, found at torch.org, to resolve issues after adding new tokens.
  • Async Checkpointing Awaits Safetensors Savior: A user inquired about an open issue for converting the DistCp format (produced by async checkpointing) to safetensors.
    • A team member responded that no such issue exists and suggested the user create one if needed, promising to provide utilities to facilitate the conversion and signal the DCP team for long-term consideration.
  • Microsoft’s Verl Framework Emerges for RL Training: A member discovered the verl RL training framework via a paper from Microsoft.
    • Another member agreed Verl is great for RL training and asked which parts would be particularly useful.

tinygrad (George Hotz) Discord

  • Bounties Become Tinygrad Job Opportunity: A user learned that job opportunities at tinygrad are primarily obtained by contributing to bounties and small PRs, as discussed on X.
    • The recommended path involves actively participating in the community and addressing specific tasks outlined as bounties.
  • Distributed Training Enters Commodity Era: A member questioned how distributed training aligns with commoditizing petaflops, referencing pccl on GitHub.
    • The inquiry touched on the practicality and accessibility of leveraging distributed resources for large-scale computation, though the thread didn’t go further.
  • mmapeak Review Set to Launch: The mmapeak work is now ready for review, available on GitHub.
    • This milestone indicates progress in optimizing performance-critical sections of the tinygrad framework.
  • RDNA4 Compilation Achieves Success: The RDNA4 instructions compile successfully within the tinygrad environment.
    • Adjustments to the number of waves per CU may be needed to unlock optimal performance on the new architecture.
  • Control Flow Question Causes Fervor: A member asked about Tinygrad’s approach to control flow, comparing it to JAX’s jax.lax.cond which enables conditional execution without disrupting the computation graph, as detailed in the JAX documentation.
    • The inquiry emphasizes the necessity of such capabilities for implementing complex algorithms like Monte Carlo methods; a small JAX sketch of lax.cond follows this list.
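For reference on the control-flow comparison above, a small sketch of JAX's jax.lax.cond, which keeps conditional execution inside the traced computation (both branches are traced once and the predicate is resolved at run time); tinygrad's own answer is not shown here.

```python
import jax
import jax.numpy as jnp

# jax.lax.cond(pred, true_fn, false_fn, *operands): both branches are traced once,
# so the conditional lives inside the compiled graph instead of breaking it.
def step(x):
    return jax.lax.cond(
        x.sum() > 0.0,       # predicate resolved at run time
        lambda v: v * 2.0,   # taken when the predicate is True
        lambda v: v - 1.0,   # taken when the predicate is False
        x,
    )

step_jit = jax.jit(step)
print(step_jit(jnp.array([1.0, 2.0])))   # [2. 4.]
print(step_jit(jnp.array([-3.0, 1.0])))  # [-4. 0.]
```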

DSPy Discord

  • DSPy Eases Agent Modifications: The DSPy framework allows users to easily implement changes to agents.
  • Mike Taylor DSPy Case Study: A case study that applies DSPy has been shared by Mike Taylor.
    • It was noted that the case study might be playing a bit fast and loose with the idea of training bias out of the model, but could be teaching it what demos voted for.

LLM Agents (Berkeley MOOC) Discord

  • Deadline Flexibility Examined: A course participant wondered if they could still complete all quizzes by month’s end for certificate eligibility, despite earlier deadlines shown on the course site.
    • The inquiry focused on deadline flexibility and acceptance of late submissions for certification.
  • Assignment Verification Quest: A member sought to confirm if their submitted labs and written assignments met the criteria for earning the course certificate.
    • They showed eagerness to revise and resubmit work, reflecting a commitment to fulfilling certificate requirements.

Nomic.ai (GPT4All) Discord

  • Trouble Installing OpenAI API Key in GPT4All: A user reported issues installing their OpenAI API key in GPT4All due to a non-functional install button, seeking immediate assistance.
    • The user emphasized the urgency due to an upcoming exam and provided a screenshot illustrating the issue within the GPT4All interface.
  • GPT4All Considers Interface Expansion: A user inquired about potential extensions to the GPT4All interface to support LLMs beyond text-based models.
    • The query focused on the possibility of GPT4All accommodating more diverse types of LLMs in the future.

Gorilla LLM (Berkeley Function Calling) Discord

  • Manus AI Referral Program Launches: A member shared a referral link (https://manus.im/edu/invitation/4EQHF7LZZ1JH7V) for Manus AI, an invite-only platform.
    • The referral provides 1000 starter credits and is exclusively for individuals with Cal emails.
  • Manus AI Promoted as Potent Agent for Complex Tasks: The member touted Manus AI as one of the more powerful agents available.
    • They specifically highlighted its proficiency in handling multistep tasks and encouraged others to test it out using the referral link.

The MLOps @Chipro Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The Codeium (Windsurf) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The AI21 Labs (Jamba) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.




Discord: Detailed by-Channel summaries and links

LM Studio ā–· #general (297 messagesšŸ”„šŸ”„):

Context Filling, Gemma 3 architecture, Multimodal model performance, Qwen 3 integration, Falcon-H1 GGUF

  • Branching Stories Doesn’t Negate Context Filling Limitations: A user inquired if branching a chat would allow parallel story development, but limitations of context size require a summary strategy to circumvent these restrictions.
    • Suggested alternatives include a sliding window strategy used by Gemma 3 and multiple model instances for paragraph generation, though the latter is constrained by VRAM.
  • Bagel’s Multimodal Model Breaks New Ground: A ā€œmegatonā€ level multimodal model called Bagel (demo) (project), created by Chinese researchers, achieves true multimodality by doing both text and image generation.
    • It stands out for its superior prompt-following and its ability to modify generated images contextually; while the unquantized form needs 40GB of VRAM, it might fit into 24GB after quantization.
  • Open Source Multimodal Models Face Hurdles: Members discussed how capabilities like handling rotation come down to more and more training: once the model understands rotation, adding another rotation-like capability is just a question of further training.
    • Concerns were raised about gaming benchmarks and optimistic claims of excellence from 40GB models, as well as the need for larger models to handle the increased information density required by additional modalities.
  • LM Studio Integrates Speculative Decoding: A user celebrated the addition of speculative decoding in LM Studio, showing improved performance.
    • Enabling CUDA following an NVIDIA guide notably boosted a user’s model processing speed.
  • Gemma vs Qwen3 vs Deepseek on /no_think: Users sought methods to disable the ā€œthinkingā€ process (the <think> block) in chatbot models.
    • While /nothink or /no_think works effectively with Qwen3 models, which are trained to recognize this command, it may not function as intended with DeepSeek R1 Distill models, and users found Gemma to suck; a minimal sketch follows this list.
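A minimal sketch of the /no_think soft switch against a locally loaded Qwen3 model via LM Studio's OpenAI-compatible server; the base URL/port and the model identifier are assumptions, so check your LM Studio server tab for the actual values.

```python
from openai import OpenAI

# LM Studio serves an OpenAI-compatible API; the port below is its usual default,
# but check the Server tab. The model identifier is whatever you have loaded.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="qwen3-8b",  # placeholder model identifier
    messages=[{
        "role": "user",
        # Qwen3 is trained to honor the /no_think soft switch appended to the prompt;
        # DeepSeek R1 Distill models may ignore it.
        "content": "Summarize the plot of Hamlet in two sentences. /no_think",
    }],
)
print(resp.choices[0].message.content)
```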

LM Studio ā–· #hardware-discussion (885 messagesšŸ”„šŸ”„šŸ”„):

Strix Halo and 96GB of RAM, Optimal GPU setup, Running vs. Using LLMs, PCIE bus bottlenecks, NVLink vs SLI

  • Strix Halo Boasts 96GB GPU RAM: The Strix Halo, priced at $2k, features an impressive 96GB of RAM available for the GPU, presenting a viable option for running resource-intensive tasks.
    • For those seeking optimal performance, a member suggested that older Xeon CPUs can be a cost-effective alternative, costing around $1k and being capable of running a variety of tasks.
  • Dual GPUs Offer Knowledge, Slow PCIE Bus: Having two GPUs can provide valuable experience for the job market, teaching users about the impact of PCIE bus speed when models communicate over it.
    • The real issue is that the configuration is not user friendly at all; if you do manage to configure TensorRT with tensor_parallel, you can achieve a really good 90% speedup.
  • Exploring NVLink’s Superiority over SLI: NVLink is a relevant connector with more bandwidth than SLI. It’s a faster and more scalable option.
    • SLI is extinct while NVLink still exists but only in enterprise environments.
  • X2 Loudness and BIOS Update: The GMKtec X2 mini PC is reported to be loud, even when idle, and a BIOS update is super important to enable full GPU memory allocation.
    • The BIOS update unlocks the ability to allocate 104GB for GPU memory, moving up from the initial 64GB cap on Windows.
  • Linux vs Windows for AI Development: Members debated the merits of Windows versus Linux for AI development on the X2, with Linux being recommended for serious development work.
    • While Windows is used for initial testing, the consensus is that Linux provides better support and performance for running larger models, particularly with the ROCm drivers.

Unsloth AI (Daniel Han) ā–· #general (911 messagesšŸ”„šŸ”„šŸ”„):

OpenAI vs OSS, Medgemma finetuning, VITS 2 and TTS models, MoE models

  • OpenAI downgrades are unacceptable: Members expressed discontent with OpenAI’s practice of model downgrades, particularly citing the diminished performance of o4 mini compared to its initial release, with OSS being their hope.
    • One member quipped: ā€œVS 200$ pro OpenAI - 'We change, downgrade, sht on models at any moment we feel necessary'ā€.
  • New models released: Unsloth was featured at the Google I/O event during a Gemma+Unsloth+Colab demo (video).
    • Attendees expressed excitement and surprise at Unsloth’s presence, many joining the Discord server after seeing the demonstration.
  • Medgemma: vision finetuning model: Members discussed the new Medgemma model as a candidate for vision finetuning and whether its recommended config is the same as Gemma 3.
    • One member suggested Qwen2.5 or Qwen2 as alternatives for vision finetuning, others also discussed the vision fine-tune of Gemma3 not being satisfactory.
  • VITS 2 generates speech early: A member shared their experience with VITS 2, generating comprehensible speech from phonemes after only 4000 steps, questioning if this was due to using VITS 2 or phonemes.
    • Discussion ensued regarding sample rate needs, where one member argued that musicians can tell the difference but another pointed out that the dataset is the model, not just a parameter such as kHz.
  • MoE models and their performance: Members debate the nature of MoE (Mixture of Experts) models, some noting that the term itself means that the model will be better at different topics than a dense model, while others say that if MoEs are trained correctly, they match the performance of dense models, only being faster.
    • Discussion centered on whether MoE models necessarily affect performance, or if data is the key factor.

Unsloth AI (Daniel Han) ā–· #off-topic (16 messagesšŸ”„):

Diffusion vs Autoregression, Gemini Diffusion, New SOTA 1B Model, Daniel Han's Tweets as a Blog

  • Diffusion Model Races Autoregression: A member questioned whether diffusion models are better than autoregressive models, noting ChatGPT initially used a diffusion model but switched to autoregressive.
    • They speculated that autoregressive models might be preferable for super-fast applications like Gemini Flash Lite or Gemini Flash, but that diffusion models may improve in the future.
  • Gemini Diffusion is Lightning Fast: A member gained access to Gemini Diffusion and reported it as FAST and good, outputting around 800 tokens per second.
    • They added that although it is not state-of-the-art (SOTA), its speed makes it impressive, linking to Llama-3.1-Nemotron-Nano-4B-v1.1 on Hugging Face.
  • New SOTA 1B Model Emerges: A member remarked that the small model SOTA is changing every week, pointing to a new SOTA 1B model.
    • An image was attached to the message (image.png), with a member noting the need for a comprehensive GGUF and special Unsloth notebooks to finetune it.
  • Tweets Should Blog: A member inquired whether all of Daniel Han’s technical tweets regarding new releases, fixes, and technical aspects are available as a blog on the Unsloth blog site.
    • Another member suggested the tweets deserve a blog and that the brothers should oblige with one, saying one is using them as a source of updated tracking in the field cause Daniel has great insights and it helps me alot.
  • Tokenizer Woes: One member recalled several long-ago discoveries about Gemma and Llama3, like double gemma bos token, tokenizer of llama cpp and HF gives different id, and Llama3 instruct has untrained embedding.
    • They also noted do not use pad token for eos token and do not initialize new embedding token with random value, wondering if these topics are covered in an Unsloth blogpost.

Unsloth AI (Daniel Han) ā–· #help (226 messagesšŸ”„šŸ”„):

Phi-4 model issues after merging LoRA adapter, CSM-1B voice training issues, Qwen3 function calling problems, DeepSeek V3 GGUF download, Unsloth notebook for retrieval augmented finetuning

  • Phi-4 Model Performance Plummets Post-Merge: A user reported that fine-tuning unsloth/phi-4 with QLoRA yields great results until the LoRA adapter is merged into the model, after which performance tanks, despite updating Unsloth and repeating tests.
    • A contributor suggested a workaround involving loading the base model and adapter with PeftModel, merging with merge_and_unload(), saving with safe_serialization=True, and crucially, avoiding load_in_4bit=True during inference, recommending load_in_8bit=True instead; a hedged sketch of this workaround follows this list.
  • CSM-1B Voice Training Stumbles on Tokenizer Error: A user encountered a NameError: name 'tokenizer' is not defined error while merging a CSM-1B voice training model, despite successful LoRA saving, using a T4 GPU.
    • Another member suggested replacing tokenizer with processor in the merging code, referencing a sesame notebook, resolving the immediate error but leading to a subsequent issue with saving in 4-bit due to quantization.
  • Qwen3 Function Calling Fails on Larger Models: A user found that function calling worked on unsloth/Qwen3-4B-unsloth-bnb-4bit but failed on unsloth/Qwen3-8B-unsloth-bnb-4bit, experiencing gibberish responses with the larger model.
    • A community member suggested testing the 14B model, but the user indicated it was too heavy for their GPU.
  • DeepSeek V3 Download Dilemma: A user sought guidance on which specific file to download for DeepSeek V3 to use in the PocketPal AI app.
    • A community member advised downloading the smallest file from Hugging Face.
  • RAFT Notebook Receives Rave Reviews and a Pull Request: A user shared an Unsloth notebook demonstrating retrieval augmented finetuning (RAFT) for Llama 3.2 1B.
    • The notebook, available on GitHub, garnered positive feedback, leading to a pull request to integrate it into the official Unsloth cookbook, though OpenAI embedding was highlighted.
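A hedged sketch of the Phi-4 merge workaround described above, using PeftModel and merge_and_unload(); the adapter path and output directory are placeholders, and the 8-bit-over-4-bit inference advice is only echoed in a comment.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "unsloth/phi-4"            # base checkpoint used for fine-tuning
ADAPTER = "path/to/lora_adapter"  # placeholder: your saved LoRA adapter

# Load the base model in bf16, attach the adapter, merge, and save with safetensors.
# Per the thread: avoid load_in_4bit=True during inference; if quantizing at
# inference time, prefer load_in_8bit=True.
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)
merged = model.merge_and_unload()
merged.save_pretrained("phi4-merged", safe_serialization=True)
AutoTokenizer.from_pretrained(BASE).save_pretrained("phi4-merged")
```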

Unsloth AI (Daniel Han) ā–· #research (3 messages):

KernelLLM, Mixture of Experts Models


LMArena ā–· #general (952 messagesšŸ”„šŸ”„šŸ”„):

Gemini 2.5 Pro, Claude 4 leak, LMArena raised 100M, Grok 3.5

  • Gemini 2.5 Pro Deep Thinking is out: Members reported that Gemini 2.5 Pro is out and may be the Nightwhisper model; however, its performance is not ultra level.
    • Other members said 2.5 Flash is very fast but no longer a thinking model unless you turn on Canvas.
  • Claude 4 incoming tomorrow: There were numerous reports that Claude 4 is scheduled to be released tomorrow, according to sources at The Information, and one user linked to AnthropicAI twitter.
    • One user mentioned it might be the Neptune model, with speculation about it costing $200/month.
  • LMArena valued at 600M after 100M funding round: LMArena’s valuation is reportedly at 600M after a recent 100M funding round, according to this Bloomberg article.
    • One user mentioned this investment does not make any sense unless you are receiving millions from Google for RLHF.
  • Grok 3.5 is pretty good, but release is uncertain: After limited testing, one user claimed that Grok 3.5 is actually really good, and knows many things other models don’t, but release is still uncertain.
    • Others speculated that Elon may have postponed the Grok 3.5 release to work on benchmarks, or that it may never arrive since it has been front-run by other models.
  • LMArena changing logo to colosseum: Members discussed the LMArena’s logo change to a colosseum.
    • Members found the new logo to be corporate and unintuitive.

LMArena ā–· #announcements (1 messages):

LMArena new website, LMArena $100M seed funding, LMArena staff AMA

  • LMArena Flips the Switch to New Beta Site: Next week, LMArena plans to switch the current site to the beta site.
    • The company has added custom emojis to the server such as <:lmarena:1374761520822751334> <:battle:1374761514489221200> <:directchat:1374761517588811797> <:leaderboard:1374761519107281009> <:sidebyside:1374761524458946691> <:battle3d:1374761512912158760> <:sidebyside3d:1374761523049791630> <:directchat3d:1374761516502614046><:trophy3d:1374763541009273013>.
  • LMArena Secures $100M Seed Round, Plans Expansion: LMArena has secured $100M in seed funding to hire more people and improve site performance.
    • The team plans to incorporate community feedback faster and emphasizes the crucial role of the community in shaping the platform.
  • LMArena Staff AMA Incoming: LMArena will host a staff AMA and record the event for those who cannot attend live.
    • Questions for the staff can be submitted via this form.

Perplexity AI ā–· #general (827 messagesšŸ”„šŸ”„šŸ”„):

Grok PDF Export, Perplexity and Gemini 2.5 Flash, Image Generation on Mobile, Comet Browser, RoboForm

  • Grok’s PDF Export: A Convenient Workaround: A member shared a method to export Grok’s responses as a PDF by using the built-in export option, rather than asking the bot to create a dedicated PDF, and attached the exported PDF.
    • Another member reported that Perplexity AI’s image analysis feature got stuck when trying to analyze an image, after trying twice, and no one in the community channel had responded yet.
  • Mobile App Image Generation Troubles: Members discussed the difficulties of generating images on the Perplexity AI mobile app, with one member confirming that image generation is only possible on the iOS app while it does not work on Android.
  • Perplexity’s Data Collection Policy Sparks Debate: The CEO’s statement on collecting browsing data outside the app for personalized ads triggered discussions, with some members linking to a YouTube interview to support their claims that Perplexity expects to use browsing data.
    • Some users believe data is being used for memory, rather than ads, and there exists an option to opt out of ads as well as disable memory.
  • Gemini 2.5 Deep Research Rate Limits: Is It Per Day, or Per Month?: Members shared a link to Google Support and discussed rate limits for Gemini 2.5 Deep Research, noticing that the model Gemini 2.0 flash was removed and replaced with Gemini 2.5.
    • Some users pointed out that 2.5 Flash is for free users, with limits of 500 requests/day for free users and 10k for paid users, plus 5-10 Deep Research runs per month for free users and around 600 for paid; other users believe the Deep Research limit is now per day rather than per month.
  • RoboForm’s Gratis Tier: A Budget-Friendly Password Solution: Members discussed RoboForm, a password manager, and whether or not the service is free.
    • It was clarified that RoboForm does have a free tier, but with single-device access only.

Perplexity AI ā–· #sharing (1 messages):


  • No topics discussed, skipping…: There were no discussion topics found in the given message.
  • No links shared, skipping…: There were no shared links found in the given message.

Perplexity AI ā–· #pplx-api (8 messagesšŸ”„):

Deep Research Model Confusion, WebUI vs API Access Differences, Perplexity Hackathon Rules, Community Forum Announcement

  • Deep Research Model Still Confuses Users: Clarification was provided that the model SKU being discussed is indeed for the deep research model, despite some confusion.
    • It was noted that this question is frequently asked, likely due to some websites blocking the crawler for the API, while the WebUI may still have access.
  • API Access Limited by Website Robots.txt: The reason the WebUI can access some websites that the API cannot is due to those websites blocking the crawler for the API in their robots.txt.
    • If a site is accessible via the API, setting search_context_size to high is suggested as a potential fix (see the hedged request sketch after this list). Also see the official docs.
  • Perplexity Hackathon Rule Updates: Users asking about updated rules for Perplexity Hackathons were directed to check their email for the information.
    • A screenshot was shared, but hosted on a private discord domain (screenshot)
  • Community Forum Launches: A new community forum has launched and will be prioritized for questions.
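A hedged sketch of setting search_context_size via the API, assuming the documented web_search_options field on the chat completions endpoint; the model SKU shown is a placeholder and is not necessarily the deep research model discussed above.

```python
import os
import requests

# Assumes the documented web_search_options.search_context_size field
# ("low" | "medium" | "high") on the chat completions endpoint.
resp = requests.post(
    "https://api.perplexity.ai/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['PERPLEXITY_API_KEY']}"},
    json={
        "model": "sonar",  # placeholder SKU
        "messages": [{"role": "user", "content": "What does example.com say about its API?"}],
        "web_search_options": {"search_context_size": "high"},
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```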

OpenRouter (Alex Atallah) ā–· #app-showcase (1 messages):

MCP Support, Gemini image gen, PDF/File support/viewing, PDF Generation, UI/UX improvements

  • Weaver One App Gets Mega Update: The Weaver One app has been updated with MCP support, Gemini image generation, PDF/File support/viewing, PDF Generation, UI/UX improvements, automatic chat titles, Markdown export, and a prompt library.
  • BYOK support added to Weaver One: The new version of Weaver One enables users to bring their own keys (BYOK) for OpenAI (and OpenAI-compatible providers), Gemini, Anthropic, Vertex, OpenRouter, etc.

OpenRouter (Alex Atallah) ā–· #general (326 messagesšŸ”„šŸ”„):

Gemini Diffusion Speed, Meta Llama 3.3 Release, Gemma-3n-4B vs Claude 3.7, TTS Models for Yoga AI, Veo 3 Cost and Capabilities

  • Gemini Diffusion is Holy Fast: A user claimed Gemini Diffusion is incredibly fast, comparing it to Groq/Cerebras, and shared a screen recording.
    • Another user inquired why most LLMs aren’t diffusion-based yet, to which another user responded that Google is currently the sole host for Gemini Diffusion previews.
  • Meta can’t release Llama 3.3: A member expressed frustration with Meta for not releasing Llama 3.3 8B, suggesting it indicates a lack of concern for consumers and shared that if Llama 3.3 70B was API-only while the 8B version had open weights, it would be preferable.
    • They think it’s stupid that meta isn’t releasing Llama 3.3 8b. They just do not care about consumers.
  • Gemma-3n-4B Touts Claude 3.7 Performance: A user noted that Gemma-3n-4B is supposedly as good as Claude 3.7 with this link to the Gemma blogpost.
    • Another user noted that, since the figure comes from the Chatbot Arena, it reflects user preference, adding sounds possible to me.
  • Elevenlabs powers Yoga AI TTS: Users are seeking TTS models to create guided meditation sessions for a Yoga AI, with one suggestion being Nicole on Elevenlabs with slowed-down speed.
  • Nerfed Gemini 2.5 Pro is toast: Users reported a noticeable nerf in Gemini 2.5 Pro, with one claiming even o4 mini is now better and cheaper and that Gemini 2.5 03 25 was amazing, 2.5 05 06 well, nerf too strong.
    • Another user cited Google’s own model card admitting performance drops in every benchmark outside of coding with this link to the Gemini Model card.

Cursor Community ā–· #general (278 messagesšŸ”„šŸ”„):

Service Unavailable error, Chatgpt pro is bargain, Rust for backend, Integrating memory banks, Gemini flash

  • Service Unavailable Strikes Cursor Users: Several users reported experiencing a ā€˜Service Unavailable’ error while using Cursor, though it’s generally understood to not be caused by Cursor itself.
  • Rust Rush: Portfolio Projects Pique Programmers: One member mentioned using Rust for their backend to enhance their CV, while another suggested listing desired tech skills and then creating portfolio projects to match.
  • Gemini’s Google AI Plans Generate Buzz, Deep Think Mode on Radar: Users discussed Google’s new Deep Think mode for Gemini, priced at $125/month for the first three months and then $250/month, accessible via Gemini’s ā€œGoogle AI Plansā€ at the bottom of the page, though some users in certain regions can’t seem to access it.
    • A YouTube link was shared with a timestamp at 1:39:22 for the Deep Think announcement.
  • Next-Gen Neural Networks: Claude 4 Incoming?: Enthusiasts are eagerly anticipating the arrival of Claude 4 Sonnet and Opus, with speculation pointing towards a potential release this week, even posting teasers such as this tweet.
    • Some hope that it’ll be good enough to render Codex obsolete.
  • Open Source Orchestration: VerbalCodeAi and Roo Code Communities Convene: A user shared their new project, VerbalCodeAi on GitHub, inviting others to join their Discord server (discord.gg/KpjSDEwWCF) for collaboration and assistance with server setup.

HuggingFace ā–· #general (230 messagesšŸ”„šŸ”„):

GPU rental, AMD vs NVIDIA for local models, Model Inference speed, HuggingFace SEO, Falcon H1

  • Powerful GPU Rental Recommendations Requested: A member is looking for recommendations on the best place to rent a powerful GPU for fast use, specifically to fine-tune an sbert model, and received suggestions to check out cloud providers that bill around $1.50 or less if the script runs on an A100.
    • The member is currently using a 3060 12GB, but it takes about 8-9 hours to fit the model, and another member suggested using Runpod to rent an A100 for about a dollar an hour.
  • AMD Graphics Cards Present Some Hurdles for Local Models: A user is looking to upgrade from a 3080 12GB and is considering a 7900 XTX (24GB VRAM) for $1300, but is worried about AMD’s compatibility with models as some are CUDA only.
    • One user with an XTX confirmed that some things didn’t work as expected but LM Studio works well, and another member mentioned that they can get 62.73 t/s using it; the consensus appears to be that AMD cards can be a pain due to the lack of CUDA support.
  • Digging Into Model Inference Speed Benchmarks: A member expressed frustration at not being able to find a large benchmark comparing inference speed across model architectures, independent of quantization.
    • It was mentioned that there shouldn’t be much of a difference, since the archs are usually very similar and based off Llama, with slight variations like Gemma; another member found that Qwen 0.6B was nearly 2x slower than a Gemma 3 1B despite having roughly half the parameters.
  • HuggingFace’s SEO is Lacking: A member noted that the huggingface.co domain is not being properly indexed by search engines like DuckDuckGo, Qwant, and Yandex, which leads to links to ancient docs being returned instead of the most current ones, and attached screenshots to illustrate the issue.
    • Another member confirmed the issue with Bing’s search results for hf_hub_download.
  • Falcon H1 model released: A member shared the link to the new Falcon H1 model and asked if anyone has tried it.

HuggingFace ā–· #today-im-learning (3 messages):

Smolagent Real World Agents, HFAPI Issues, LiteLLM Slowdown, Ollama based models and Reasoning, GGUF models

  • Smolagent Inspires Real-World Agent Build: A member is building a real-world agent using smolagent, sourcing real-world data on travel bookings, and aims to build something around it.
    • They completed part of an agent course and are excited to integrate their learnings.
  • HFAPI paid and LiteLLM slowness sparks model switch: The member switched from HFAPI to LiteLLM due to HFAPI being mostly a paid tool, which doesn’t align with open-source projects.
    • However, they noted that LiteLLM is slower and are experimenting with DeepSeek r1, Anthropic, Llama3.1, and Qwen models to improve the speed.
  • Ollama Reasoning Slowdown Prompts GGUF Model Inquiry: The member finds that Ollama-based models and reasoning flows, especially ReAct-based ones, are very slow locally due to the size of instruct-based models.
    • They are seeking feedback on using GGUF-based models with Ollama for complex reasoning and planning, aiming to alleviate the constant token crisis.
  • Agent builder needs local solution: Currently managing with Anthropic research credits, the member seeks a better local solution for building LLM agents capable of real-world tasks, especially in the Indian context.
    • Their focus is on eliciting and evaluating capabilities in LLM agents that help in real-world tasks, specifically focusing on the core logic before addressing language nuances, plus they’re delving into synthetic data -> model pipelines, video generation and autoencoders.

HuggingFace ā–· #cool-finds (1 messages):

Manus AI Agent, Referral Link

  • Manus AI Agent Invitation: A member shared a referral link for Manus, an invite-only agent known for multistep tasks.
    • The link provides 1000 starter credits to new users.
  • Manus promises powerful multi-step task automation: Manus AI is touted as a powerful agent for multi-step tasks, currently in an invite-only stage.
    • The provided referral link grants new users 1000 starter credits.

HuggingFace ā–· #i-made-this (17 messagesšŸ”„):

Lunaris Codex, DataTune, LLMChat GNOME Shell Extension, Optuna and Transformers Integration, Scaling Mixture of Experts Models

  • Transformer Decoder Framework Lunaris Codex rises: A member shared Lunaris Codex, an open-source, PyTorch-based Transformer Decoder framework designed for experimentation with code generation and language modeling; the author is actively exploring training with custom datasets like meryyllebr543/lunaris-data.
    • It features a configurable architecture, a full pipeline with CLIs for data prep, training, and inference, a C++ toolkit, CI & testing, and detailed documentation; the GitHub repo contains more information and a link to a community Discord.
  • DataTune Transforms Data using LLMs: Vitalops created DataTune, an open-source tool that uses natural language instructions and LLMs for data transformations, avoiding context length limitations and high API costs.
    • The tool allows performing data transformations with simple natural language instructions.
  • LLMChat Integrates with GNOME Shell: A member shared their LLMChat GNOME shell extension, which utilizes Ollama or LlamaCpp servers to integrate LLMs with the GNOME shell environment.
    • It has a modular tool system, embedding system memory with Qdrant vector DB, and can interact with Windows and GNOME settings; it uses a prompt-based system for tool calls, supporting many language models.
  • Optuna and Transformers become Integrated: A member contributed an example showcasing the integration between Optuna, an HPO framework, and the Transformers library.
    • The goal is to streamline hyperparameter optimization for transformer models; a minimal sketch of the integration appears after this list.
  • Syntx IDE Extension adds new MCP Hub: The Syntx IDE extension recently added a dedicated MCP hub and will be adding indexing in the next patch.
    • Indexing will be included in the next patch.
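To illustrate the Optuna/Transformers integration mentioned above, a minimal sketch using Trainer.hyperparameter_search with the Optuna backend; the model, search space, and toy dataset are illustrative assumptions, not the contributed example itself.

```python
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

class ToyDataset(Dataset):
    """Tiny stand-in dataset so the sketch runs end to end."""
    def __init__(self, n=32):
        self.items = [{"input_ids": torch.randint(0, 30522, (16,)),
                       "attention_mask": torch.ones(16, dtype=torch.long),
                       "labels": torch.tensor(i % 2)} for i in range(n)]
    def __len__(self):
        return len(self.items)
    def __getitem__(self, idx):
        return self.items[idx]

def model_init():
    # hyperparameter_search re-instantiates the model every trial,
    # so the Trainer must receive model_init rather than a model instance.
    return AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2
    )

def hp_space(trial):  # an optuna.trial.Trial when backend="optuna"
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 3),
    }

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(output_dir="hpo-out"),
    train_dataset=ToyDataset(),
    eval_dataset=ToyDataset(8),
)
best_run = trainer.hyperparameter_search(
    direction="minimize", backend="optuna", hp_space=hp_space, n_trials=10
)
print(best_run)
```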

HuggingFace ā–· #computer-vision (2 messages):

LayoutLMv3 Workflow, Donut Model Integration, OCR Methods Comparison, GeoMeetup SF

  • LayoutLMv3 Workflow Steps Detailed: A user sought assistance in determining the correct workflow and technologies for invoice processing using LayoutLMv3, contrasting it with traditional OCR and regex methods.
    • The suggested workflow includes preprocessing, layout-aware OCR, document parsing with LayoutLM/Donut, and postprocessing with prompt answering to improve data extraction from invoices.
  • Alternatives to Tesseract, PaddleOCR are explored: Traditional OCR methods like Tesseract and PaddleOCR with regex have limitations in extracting text and key-value pairs from invoices, often missing data due to complex patterns.
    • The user is exploring LayoutLMv3 and Donut to overcome these limitations, but is uncertain on how to integrate them into a streamlined workflow for invoice processing.
  • Bay Area GeoMeetup Coming: There’s an invitation to a meetup in San Francisco on May 28th for those interested in CV + sensor fusion, autonomy / ADAS, or geospatial intelligence.
    • The meetup is organized by GeoMeetUp and Bee Maps, and more details are available at https://lu.ma/q4coi1t7.

HuggingFace ā–· #gradio-announcements (1 messages):

MCP, Gradio, AI Agent Development, Prizes

  • MCP x Gradio collab incoming!: An announcement teases a collaboration between MCP and Gradio from June 2-8 with $10,000 in prizes.
    • The poster mentions it could push AI agent development forward and hints it will be one of the most important AI dev events of 2025.
  • AI Dev event of 2025: Gradio will be part of an AI Dev event from June 2-8, 2025
    • Participants stand a chance to win from a pool of $10,000 in prizes.

HuggingFace ā–· #agents-course (11 messagesšŸ”„):

Dummy agents lib on HF Interface, LinkedIn Certificate creation links, Agents Course Deadlines, Scores Dataset Submission, Certificate Issues

  • HF Interface Run Button: Where Did It Go?: A user inquired about running the dummy agents lib python notebook on the HF interface, seeking the equivalent of a run button or shift+enter command.
    • They also asked about setting up the HF token in Colab, specifically whether the token’s name and value should be pasted in the settings tab under secrets; a minimal sketch appears after this list.
  • LinkedIn Certificate Links Fail for LLM Course: A user reported that the LinkedIn Certificate creation links for the LLM fundamentals course are failing, redirecting to the agent course page instead.
    • This issue occurs after successfully completing the quiz, when the LI button appears next to the produced image.
  • Agents Course: Deadline Approaches: A user inquired about the deadline for the Agents Course, noting they were just starting Unit 4.
  • Resubmitting Scores: A user asked if the scores dataset locks in the first submission, preventing resubmission.
    • Another user clarified that you can submit as often as you like, the leaderboard will only record your highest score.
  • Certification Requirements Confusion: A user reported creating an agent, submitting answers, and seeing their name on the leaderboard, but receiving a message that Unit 4 must be completed before obtaining a certificate.
    • A user asked about the person’s highest score on the evaluation.
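For the Colab token question above, a minimal sketch assuming the token is stored in Colab's Secrets panel under the name HF_TOKEN (the name is your choice; only the value is the token itself).

```python
# In Colab: open the "Secrets" (key icon) panel in the left sidebar, add a secret
# named HF_TOKEN with the token as its value, and allow notebook access.
from google.colab import userdata
from huggingface_hub import login

hf_token = userdata.get("HF_TOKEN")  # must match the secret's name exactly
login(token=hf_token)                # authenticates hub downloads/uploads for this session
```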

Eleuther ā–· #general (203 messagesšŸ”„šŸ”„):

AI Slop definition, Computational Irreducibility and Novelty, Gemini Diffusion Noise Model, Discord Dataset Use Cases

  • Discord Debates Definition of ā€œAI Slopā€: Members debated the definition of ā€œAI Slopā€: content with a high degree of refinement and polish, masquerading as relevant, that does not hold up to basic scrutiny; one member linked the AI Slop Wikipedia article for reference.
  • Novelty tied to Computational Irreducibility?: Some members argued that novelty is strongly tied to computational irreducibility, potentially excluding current generative models.
    • One member defined novelty as what cannot be derived through statistical prediction, linking it to Kolmogorov complexity.
  • Google’s Gemini Diffusion gets dissected: Members shared their opinions on Gemini Diffusion, focusing on the noise model and its potential flexibility.
    • It was theorized that Gemini Diffusion could lead to a downright better model due to iterative refinement and bi-directional sequence modeling, referencing Grokking in Both Directions.
  • AI Community Explores Discord Data: Members discussed the use cases for a massive Discord dataset, with concerns about privacy and anonymization of user data.
    • It was suggested that such datasets could be used to identify suspicious activity or analyze responses to scientific papers.

Eleuther ā–· #research (15 messagesšŸ”„):

LinkedIn icon usage, BEAR Probe evaluation, RAG database leaks, Platonic Representations Hypothesis

  • LinkedIn Icon gets used as intended: A member pointed out how the LinkedIn icon was used as intended within 24 hours of its creation.
    • No further discussion was recorded.
  • BEAR Probe evaluation is nothing new: A member questioned the novelty of the BEAR Probe evaluation method, suggesting it’s similar to existing MCQ evaluation methods.
    • Others agreed, stating that the paper’s approach of evaluating LMs by checking if they assign the highest log-likelihood to the correct statement is not new.
  • RAG Database leaks can be backtracked to reveal PII: A member shared an interesting paper (https://arxiv.org/abs/2505.12540) and Twitter thread about backtracking PII from leaked RAG databases.
    • They noted that even if the database is leaked as embeddings and not text, sensitive information can still be recovered.
  • Adapters Translate, not Change, Representations?: A member questioned whether adapters are merely translating representations or changing them substantively, in the context of the Platonic Representations Hypothesis.
    • Another member admitted unfamiliarity with the hypothesis but found the working mechanism interesting and the shared implication important, referencing this paper.

Eleuther ā–· #lm-thunderdome (34 messagesšŸ”„):

Qwen 2.5 GSM8k Evaluation, lm-evaluation-harness PR issue, Qwen 2.5 Evaluation Prompt

  • Qwen 2.5 GSM8k score reproduction attempt: A member was trying to reproduce the GSM8k score reported for Qwen 2.5 and asked what prompt was used for evaluation.
    • Another member said they would draft a config and pointed to Qwen's preference for the CoT (Chain of Thought) variant in their evals, similar to gsm8k_cot on the harness, linking to their gsm8k_prompt.txt.
  • LM harness assertion error fixed: A member reported a failed test in a new task PR for lm-evaluation-harness due to an AssertionError related to whitespace in doc_to_text and target_delimiter, linking to PR #3006.
    • The solution was to set target_delimiter: "" in the YAML config file.
  • Qwen 2.5 Evaluation prompt issue: A member reported getting very low scores on GSM8k compared to what Qwen 2.5 reported using the command lm_eval --model hf --model_args pretrained=Qwen/Qwen2.5-0.5B --tasks gsm8k --device cuda:0 --batch_size 8.
    • Despite setting --output_path and --log_samples, the model outputs were not being saved, but they were referred to the Qwen2.5-llm.
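
For anyone attempting the same reproduction, here is a minimal sketch using the harness's Python entry point (assuming lm-evaluation-harness is installed and that the gsm8k_cot task name matches your harness version; parameters are illustrative):

```python
from lm_eval import simple_evaluate

# Sketch of reproducing the Qwen 2.5 GSM8k number with the CoT task variant
# mentioned above; task/model names and parameters are illustrative only.
results = simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-0.5B",
    tasks=["gsm8k_cot"],     # CoT variant, closer to Qwen's own evals
    batch_size=8,
    device="cuda:0",
    limit=100,               # small subset first, to sanity-check the prompt
)
print(results["results"]["gsm8k_cot"])
```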

Notebook LM ā–· #announcements (2 messages):

Audio Overviews, Video Overviews, Google I/O

  • Audio Overviews are now customizable: Users can now control the length of Audio Overviews with short (~5+ min), long (~20+ min), and default (~10+ min) settings, currently available only in English.
    • This allows users to customize the depth and length that the AI hosts discuss sources.
  • Google I/O Keynote summarized in NotebookLM: A notebook summarizing everything from this year’s Google I/O keynote is available here.
    • No further information provided.
  • Video Overviews feature preview posted on X: A preview of the new Video Overviews feature was posted on X.
    • No further information provided.

Notebook LM ā–· #use-cases (30 messagesšŸ”„):

NotebookLM PDF Upload, AI Studio and Webpage Reading, Project Astra Improvements, Gemini App Features, Video Overviews Languages

  • PDF Uploads and Class Content Creation Unleashed!: Users are finding success by printing to PDF in Chrome and uploading as a source, leveraging the Study Guide, Briefing Doc, Timeline, and FAQ functions in the Studio Panel.
    • One user also emphasized the capability to generate customized Audio Overviews (podcasts) focusing on specific content aspects, like similarities and differences between sources, and the MindMap function within the Chat Panel.
  • AI Studio’s Webpage Workaround!: A user noted that AI Studio lacks a built-in feature to read webpages, suggesting a workaround of printing the webpage to PDF for upload.
    • Another user shared that their upload fails sometimes and mentioned that they had to switch to a different account to resolve the issue.
  • Project Astra and Gemini’s AI Alliance!: Members discussed Project Astra and its potential improvements to the Gemini app, including screen sharing, video calls, and faster, more natural responses.
    • One user mentioned that they had this feature on their phone a while ago, though it wasn’t very useful because it was too new to recognize music.
  • Audio Overviews Debut in English First?: A member inquired about the availability of video overviews in multiple languages at launch.
    • Another member believed that they would likely start with English first.
  • Gemini 2.5 Pro Arriving Soon?: One user asked if there are plans to update NotebookLM Plus to Gemini 2.5 Pro.
    • It is unknown when this capability will come to fruition.

Notebook LM ā–· #general (136 messagesšŸ”„šŸ”„):

Political censorship, Longer Audios, NotebookLM updates, Gemini features, Video overviews in other languages

  • Users sidestep Political Censorship: Users found that replacing the text with Star Wars characters dodges the filters.
    • This allows users to see how much censorship has been implemented.
  • NotebookLM to get Longer Audios: Users discussed that audio recordings were capped at 15-30 minutes.
    • A user suggested using another LLM to structure it better and then re-adding it as a source to bypass this limitation.
  • NotebookLM releases updates: NotebookLM released their blog post about Video Overviews.
    • Users pointed out that the output language should be set to English to get it working.
  • Pricey Gemini Features: Users debated about the new Gemini features pricing, one user claiming it to be $250/month without an offer.
    • Others pointed out that it is in fact $125/month for the next 3 months and then it’ll be $250/month afterwards.
  • Chat history disappear: Users reported that chat history disappears after the browser closes and want it saved.
    • It was suggested to keep every answer as a note, as well as to suggest saved chat history as a feature request.

Latent Space ā–· #ai-general-chat (146 messagesšŸ”„šŸ”„):

Google I/O 2025, Gemma 3n, Stitch by Google, OpenAI Structured Outputs, Sam Altman and Jony Ive

  • Google’s Gemini Unveiled as AI Operating System at I/O 2025: At Google I/O 2025, Gemini was revealed as a comprehensive AI OS, featuring tools like Gemini Live, Imagen 4, Veo 3, Deep Research, Canvas, Gemini in Chrome, Interactive Quizzes, and the faster 2.5 Flash as default.
    • Additionally, Google launched AI Pro and Ultra plans, previewing Agent Mode for autonomous assistance, emphasizing Gemini 2.5 Pro’s leading LLM performance; some felt the term AI Operating System was clickbait bullshit.
  • Google’s Gemma 3n gets a Small Model Preview: Google announced a preview of Gemma 3n (docs), a pretty cool small model available on HuggingFace (collection).
    • The release seems to be a replacement for the previous Gemma 3 with 1B & 4B parameters, while the larger 12B & 27B parameter models remain available.
  • Google’s Stitch by Google Labs is AI-powered UI/UX Design: Stitch by Google Labs, an evolution of Galileo AI powered by DeepMind models, enables quick generation of designs and UIs, leveraging Gemini and Imagen (X Post).
    • Features include automatic theme updates, product image adjustments, multilingual copy generation, and the ability to export frontend code; Google acquired Galileo AI and renamed it Stitch (explore).
  • OpenAI Improves Structured Outputs and Shouts out LLGuidance: OpenAI Developers rolled out improvements to Structured Outputs, including parallel function calling with strict mode, plus support for keywords like string lengths/formats, min/max ranges for numbers, and min/max array elements (X Post).
    • The update credits the LLGuidance team for their foundational work; one member welcomed finally being able to set min/max array elements, a long-standing gripe (an illustrative schema sketch follows this list).
  • Sam Altman and Jony Ive Team Up to Design Next-Gen AI Computers: Sam Altman and Jony Ive are partnering to create a new generation of AI-powered computers (X Post, announcement).
    • Speculation centers on benefits like simplified daily tasks and new device forms but also acknowledges potential issues such as high costs and privacy concerns; a friend’s roommate from Humane will be in the interview process to work on this.
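
As a rough illustration of the newly supported constraints (string formats/lengths, numeric min/max, and min/max array elements), here is a hedged sketch of a strict JSON schema passed through the Chat Completions structured-outputs path; the model name and schema fields are placeholders, and keyword support should be checked against the linked announcement:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical schema exercising the newly supported keywords.
schema = {
    "type": "object",
    "properties": {
        "email": {"type": "string", "format": "email"},
        "age": {"type": "integer", "minimum": 0, "maximum": 130},
        "tags": {
            "type": "array",
            "items": {"type": "string", "maxLength": 20},
            "minItems": 1,
            "maxItems": 5,
        },
    },
    "required": ["email", "age", "tags"],
    "additionalProperties": False,   # required for strict mode
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Extract the contact info from: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "contact", "strict": True, "schema": schema},
    },
)
print(resp.choices[0].message.content)
```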

GPU MODE ā–· #general (12 messagesšŸ”„):

Multihead GRU Layers in Cute Kernels, Warp Specialization Algorithms, Loss Spikes in Softmax-Attention-1b, Training Time Indication, RNN vs Softmax Performance at Different Scales

  • Multihead GRU Layers debut in Cute Kernels: Multihead GRU layers written in Triton have been added to cute-kernels, allowing parallelization across SMs.
  • Softmax-Attention-1b spikes in Loss: A member asked about the random spikes in loss (approx every 1k iterations?) on softmax-attention-1b.
    • The spikes are due to mup, which needs a very high learning rate (0.01) in this case and can cause slight instability early in training; another member countered that no, it's quite stable.
  • RNN lags behind Softmax at Larger Scales: RNN at 173M is faster than softmax but at 1B scale, it’s 2x slower.

GPU MODE ā–· #triton (4 messages):

Triton autotuner, extern_elementwise API, Blackwell support

  • Auto-Tune Triton: A member inquired whether a certain logic could be added to the Triton autotuner, or if it already does something similar.
  • extern_elementwise API Troubles: A member faced a TypeError: unhashable type: ā€˜pointer_type’ issue while attempting a device function call using the extern_elementwise API, specifically with the pointer argument.
  • Blackwell Support Buzz: A user asked if version 3.3.0 is supposed to work with 5090, or if Blackwell support is planned for the next release.
    • Another member suggested trying version 3.3.1 which David should have fixed for that release after experiencing computeCapability not supported errors.

GPU MODE ā–· #cuda (9 messagesšŸ”„):

__reduce_add_sync, asynchronous wgmma pipelines Hopper, complete::tx_bytes async TMA loads, wait_group<0> consumer

  • Understanding __reduce_add_sync Behavior: A member was trying to understand the behavior of __reduce_add_sync and was confused by the output, expecting that with mask=0xFF the output in threads 0-7 would be the sum of the first 8 threads, with the other threads not participating.
    • Another member explained that each calling thread must have its own bit set in the mask and all non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined, adding that threads not in the mask should not execute the instruction (or execute it with another mask).
  • Asynchronous WGMMA Pipelines on Hopper architecture cause questions: A member inquired about asynchronous wgmma pipelines on Hopper architecture, noting that TMA loads are performed with the complete::tx_bytes modifier, which completes the producer’s mbarrier’s phase, signaling the consumer that data is available in Smem.
    • However, they struggle to see its purely asynchronous nature on the consumer side, since after issuing async wgmma instructions, the consumer waits for these wgmma instructions to finish via a wait_group<0> instruction, which introduces synchronisation.
  • complete::tx_bytes Equivalent for Async WGMMA Instructions: The same member asked if there is no complete::tx_bytes equivalent for async wgmma instructions, so the consumer doesn’t need to introduce the wait_group<0> synchronization and in the process serialize somewhat (in between K tiles) wgmma instructions.

GPU MODE ā–· #algorithms (3 messages):

MCMC, Variational Inference

  • Markov Chain Monte Carlo Methods are Super Cool: A member stated Markov Chain Monte Carlo (MCMC) is super cool and they are taking a class with a heavy focus on it.
    • The member adds it’s a fun topic and they like variational inference as well.
  • Variational Inference also liked: Another member mentioned liking variational inference as well.
    • No further details were provided.

Google Gemini Diffusion, Block Diffusion, KV Cache

  • Google Swallows Diffusion Pill: Google released Gemini Diffusion, with users observing Google’s decision to embrace diffusion models.
    • One member quipped that they were likely implementing block diffusion for speed via a kv_cache.
  • Block Diffusion: A KV Cache Speed Hack?: A user speculated Google implemented block diffusion to speed up Gemini via a kv_cache

GPU MODE ā–· #beginner (1 messages):

Elementwise Kernel, Vectorized Loads/Stores

  • Vectorized Loads Boost Elementwise Kernel: A member inquired whether a simple elementwise addition kernel would benefit from vectorized loads/stores, such as having N / 4 threads each handling float4 operations.
    • The question focused on potentially optimizing memory access patterns in GPU kernels.
  • float4 Operations: The discussion centered on using float4 operations to enhance memory access patterns.
    • The goal is to potentially improve the efficiency of GPU kernels by using vectorized loads and stores.
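
The question is CUDA-specific, but for a rough feel of the same idea from Python, here is a standard Triton elementwise-add kernel where each program instance handles a contiguous block of elements, which Triton can typically turn into wide, coalesced loads (an analogue of the float4 approach, not a direct answer to it):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # contiguous block per program
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

if __name__ == "__main__":
    x = torch.randn(1 << 20, device="cuda")
    y = torch.randn(1 << 20, device="cuda")
    print(torch.allclose(add(x, y), x + y))
```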

GPU MODE ā–· #torchao (1 messages):

OpenAssistant/oasst1 dataset, Default settings

  • OpenAssistant Dataset: A Gold Mine?: A member found the OpenAssistant/oasst1 dataset useful for achieving good results.
    • They reported success using the default settings in their existing configuration.
  • Default Config FTW: Experimentation with the OpenAssistant dataset was done using standard parameters.
    • The user stated that the default settings yielded satisfactory outcomes without modification.

GPU MODE ā–· #off-topic (2 messages):

Elon Musk buying GPUs, 1 million GPU facility, dotnet runtime PR


GPU MODE ā–· #irl-meetup (1 messages):

viranchee: Any in person events in SF in next 30 days


GPU MODE ā–· #self-promotion (1 messages):

Multi-Agent Hackathon, Tenstorrent hardware, Koyeb cloud platform

  • Multi-Agent Hackathon Kicks off in SF!: A one-day hackathon to build, deploy, and benchmark multimodal workflows including Image, Video, and LLMs will be held in SF on Sat May 31.
    • The hackathon will leverage Tenstorrent hardware and Koyeb’s cloud platform, with event details and registration available at lu.ma/pkhmut6r.
  • Tenstorrent & Koyeb Power Multi-Agent workflows: The Multi-Agent Hackathon is set to use Tenstorrent hardware and Koyeb’s cloud platform for building and benchmarking.
    • Participants can expect to work on multimodal workflows encompassing Image, Video, and LLMs, promising a cutting-edge development experience.

GPU MODE ā–· #šŸæ (7 messages):

KernelBench, KernelBook, GRPO, NVCC logs, RL Baseline

  • KernelBench Meant for Eval: Members affirmed that KernelBench is intended for evaluation and KernelBook is suitable for training, particularly for correctness.
    • It was noted that base models often scale best for pass@k for large k, unlike instruct models or finetunes on specific domains for tasks like GPQA.
  • RL Baseline Ideal Starting Point: Ideally, the RL baseline should begin with what Devin did with their kernel model (Kevin).
    • The proposed setup includes generating NL queries per kernel for Query:Kernel pairs, rewarding compilation, correctness, and performance against a baseline, and doing GRPO without explicit reasoning steps.
  • Iteration on NVCC Logs: The goal is to efficiently iterate on nvcc logs and benchmarks for optimization.
    • Version 1 will learn to implement explicit kernel queries with built-in methods (write kernel for x with optims y and z), while version 2 will learn to reason over logs/metrics and adapt without explicit instruction (make kernel faster!).
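
A toy sketch of the reward shaping described above (compilation, then correctness, then performance against a baseline); the thresholds and the bounded speedup bonus here are arbitrary illustrative choices, not anything taken from KernelBench or Kevin:

```python
def kernel_reward(compiled: bool, correct: bool,
                  runtime_ms: float, baseline_ms: float) -> float:
    """Shaped reward for a generated kernel: compile, then be correct, then be fast."""
    if not compiled:
        return 0.0                      # nothing if nvcc rejects the kernel
    if not correct:
        return 0.1                      # small credit for compiling but failing tests
    speedup = baseline_ms / max(runtime_ms, 1e-6)
    return 0.3 + min(speedup, 3.0)      # bounded so one outlier can't dominate GRPO groups

# Example: a correct kernel that runs 2x faster than the baseline.
print(kernel_reward(True, True, runtime_ms=0.5, baseline_ms=1.0))  # 2.3
```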

GPU MODE ā–· #reasoning-gym (1 messages):

rasdani: awesome! looking forward to the paper šŸ™‚


GPU MODE ā–· #submissions (55 messagesšŸ”„šŸ”„):

AMD MI300 Performance, amd-mixture-of-experts, amd-mla-decode, amd-fp8-mm, Workflow Timeouts

  • MI300 Sprints: Mixture of Experts: Multiple users submitted results to the amd-mixture-of-experts leaderboard on MI300, with times ranging from 7.883 ms to 9.73 ms, and user <@1173619488730665011> achieving first place at 8.82 ms.
    • User <@298836525158891520> noted that they had to be a little hacky in the eval script, but hopefully no further issues.
  • FP8-MM Face-Off: Sub-Millisecond Showdowns: Several submissions were made to the amd-fp8-mm leaderboard, showcasing very fast execution times on MI300, with user <@1173619488730665011> securing first place at 120 µs.
    • User <@298836525158891520> had multiple submissions achieving times around 130 µs, consistently placing in the top rankings.
  • MLA Decode Dash: Milliseconds Matter: Submissions to the amd-mla-decode leaderboard on MI300 ranged from 1.240 ms to 12.464 ms, with <@268205958637944832> achieving first place at 1.240 ms.
    • User <@981391010137530449> achieved 4th place with a time of 8.606 ms.
  • Timing Out Troubles: Workflow Woes: Users <@557943190045327360> and <@325883680419610631> reported issues with workflows for amd-mixture-of-experts timing out when run without secrets.
    • They requested assistance.

GPU MODE ā–· #status (2 messages):

amd-mla-decode, MLA running on GPU

  • Cluster-Bot can’t change task files: The Cluster-Bot reports that changing task files of existing problem amd-mla-decode is currently not possible for the files: reference.py, eval.py, and utils.py.
  • MLA runs on GPU after fix: A member, <@268205958637944832>, fixed the issue for MLA where it wasn’t being ran on GPU properly.
    • Members were prompted to resubmit if they had previously submitted to the leaderboard.

GPU MODE ā–· #factorio-learning-env (19 messagesšŸ”„):

Easier Integration Interface for External Agents, Championship Build (4M SPM), Project Sid, Multi-agent Minecraft simulator

  • External Agents Interface Issue Inaugurated: A member created issue #201 Easier Integration Interface for External Agents.
  • Factorio 4M SPM Championship Build: Members shared a YouTube video of a Factorio 4M SPM (Science Per Minute) championship build.
    • They asked if everyone had carved out time to watch this build.
  • AI Civilization Project Sid Paper Shared: A member shared a link to the Project Sid paper on AI civilization: https://arxiv.org/pdf/2411.00114.
  • Multi-Agent MineLand Minecraft Simulator Emerges: A member shared MineLand, a multi-agent Minecraft simulator available on GitHub.

GPU MODE ā–· #amd-competition (7 messages):

MLA-Decode Data Generator, Ranked sequence length in DRAM

  • MLA-Decode Data Generator Runs on CPU Only: A user discovered that the data generator for mla-decode in reference.py never sets the device, thus running solely on the CPU.
    • The team quickly patched this issue, recommending users to re-submit their work after the fix.
  • Ranked Sequence Length Exceeds DRAM: A user faced a problem where the largest sequence length on ranked was too long to fit in DRAM, causing delays.
    • The user mentioned a hacky fix and decided to leave it as is to maintain consistent solutions, unless it causes further issues.

GPU MODE ā–· #cutlass (10 messagesšŸ”„):

WSL2 performance with CUDA, PTX or SASS dumping with cute.compile, CuTe DSL Feedback

  • WSL2 Boosts CUDA Performance: A user has reported significant performance gains using WSL2 in Windows, especially after enabling performance counters in the Nvidia control panel via developer settings.
    • They highlighted the advantage of developing CUDA in full HDR without the Wayland ozone glitch issues, while using the latest Nvidia drivers and CUDA toolkit.
  • PTX Dumping Delayed for CuTe: A user inquired about a method to dump generated PTX or SASS after compiling a kernel with cute.compile from Python and suggested using Nsight Compute.
    • A member mentioned the feature to dump PTX is planned for a future release, although not immediately available, and that an AOT (ahead-of-time) mode is not present yet.
  • CuTe DSL Receives Enthusiastic Praise: A user expressed strong appreciation for the CuTe DSL, noting it ā€œalmost makes kernel programming enjoyable.ā€
    • This positive feedback underscores the tool’s potential to improve the kernel development experience.

GPU MODE ā–· #singularity-systems (1 messages):

picograd, Rust, Python, FFN, RNN

  • Picograd Focuses Purely on Rust: The picograd project, aiming for training and inference of basic networks like FFN, RNN, LSTM, and GPT, will focus solely on a single Rust implementation for now.
    • A Python implementation, initially considered, was scrapped to avoid spreading resources too thin given the project’s early stage.
  • Python Implementation for Picograd Postponed: Initially, there were plans to create a Python implementation for picograd, potentially following the approach of PT2 and Cutlass4.
    • However, the decision was made to postpone the Python version to concentrate efforts on the Rust implementation, as the project is still in its early phases.

aider (Paul Gauthier) ā–· #general (62 messagesšŸ”„šŸ”„):

Gemini 2.5 Flash Preview, Aider and Jules as background Agents, Aider Polyglot Benchmark, Copilot getting open sourced

  • Google’s Gemini 2.5 Flash Preview sparks interest: Google released the Gemini 2.5 Flash Preview, prompting discussion about its capabilities, particularly regarding its speed, cost, and adherence to edit formats.
    • A user shared a concern about GPTs agents not learning from additional information provided after their initial training. Another cleared this misunderstanding, explaining that uploaded files are saved as ā€˜knowledge’ files for the agent to reference when required, but they do not continually modify the agent’s base knowledge.
  • Background Agents spark wariness: Members expressed wariness about background agents like Jules and Cursor, citing concerns that AI is not smart enough to do actual development work without significant handholding.
    • One member tested Jules but was not very impressed, even though they said that free is nice!.
  • Aider featured on Gemini 2.5 Pro homepage: Aider Polyglot is listed as one of the benchmarks on the Gemini 2.5 Pro homepage
    • Concerns were raised that Gemini is ignoring conventions despite its intelligence, prompting some users to revert to version 3.7.
  • Copilot open sourced, potentially transferable code for Aider: Members discussed the implications of Copilot getting open-sourced, exploring aspects potentially transferable to Aider regarding source code and agentic functionalities.
    • It was noted that Copilot uses RAG and semantic vectors to retrieve relevant text from a vector database, while Aider saves the repo map and conversation as plaintext.
  • Mistral’s Devstral claims top spot for coding agents: Mistral’s Devstral claims that it is the best for coding agents.
    • Members express skepticism, noting its 24B parameters, with one asking did they fit it on 4090?

aider (Paul Gauthier) ā–· #questions-and-tips (20 messagesšŸ”„):

Gemini 2.5 Flash Benchmark, Aider and Pip Packages, Running IPYNB Notebooks, --read flag issues, Context from file

  • Gemini 2.5 Flash Benchmark Results: A user inquired about updating the benchmark with the freshly upgraded Gemini 2.5 Flash, referencing this Discord link.
  • Aider Asks: Can Pip Packages Provide Project Context?: A user asked if it’s possible to add files from installed Python whl packages to Aider, in order to provide project related context, such as frameworks, versions, and things NOT to do.
    • Another user mentioned they use a file called ai-guidelines.md, which they always add to the context using /read, which contains project guidelines and evolves as the project evolves.
  • Running IPYNB Notebooks with Aider: A user wants to know if it’s possible to use the /run command to run an IPYNB notebook and have the output automatically added to the chat, similar to running a .py file.
    • They noted that while the notebook can run in a web environment, it doesn’t automatically add the output to the chat.
  • The --read Flag Fiasco: A user encountered a zsh: no matches found error when trying to pass all files using the --read=src/*.ts --read=src/*.tsx flags.
    • Another user suggested removing the equals sign and using --read src/*.ts, but noted that this will only make the first match read-only and the rest read/write unless separate --read flags are used for each file. Another user was also able to load the files by just supplying the whole folder.
  • Aider vs Cursor: Brevity is the Soul of the Prompt: A user compared Aider and Cursor, noting that Aider’s responses to simple /ask prompts are tighter, less wordy, and easier to traverse.

Manus.im Discord ā–· #general (82 messagesšŸ”„šŸ”„):

RizzDial Marketing, Manus Credits, Manus vs cluely.ai, Manus Image Generation, Manus for Coding Projects

  • RizzDial Gets Free Marketing from Manus: A member shared that they got a lead who heard about their software, RizzDial, from Manus, exclaiming Manus out here doing free marketing for my software.
    • They also linked to a recording of the interaction.
  • Manus Credits Dwindling Fast: Several members expressed concerns about the credit usage and token allowances per dollar on Manus.
    • Some users suggested a free and premium version similar to ChatGPT.
  • Manus Compared to Cluely.ai: A user asked if Manus is just another Cluely.ai, referring to that kid that got kicked out for cheating in uni or something.
    • They also mentioned getting Chinese AR glasses by Inmo and entering cyberspace.
  • Image Generation on Manus not Insane: One member exclaimed manus image generation- insane šŸ”„ šŸ”„šŸ”„, but another user responded that it’s not good.
    • Multiple users agreed that image generation on Manus is gimmicky and produces bad results.
  • Manus Aids Coding Projects: A user said My IT teacher showed me Manus and it is so nice to use for big coding projects and that they think it’s the most powerful AI out there right now.
    • They did however mention they are having issues when converting code to Python for a school project.

Nous Research AI ā–· #general (73 messagesšŸ”„šŸ”„):

audio output parameters, Gemma access, Gemini diffusion model, WildChat-1M dataset, Devstral coding agent

  • TTS parameter estimates shrink: Members discussed the number of parameters needed for audio output, referencing Orpheus (8B parameters) and Outetts (1B parameters) as examples of Llama-based TTS models.
    • It was suggested that parameter sharing with text and audio input could further reduce the parameter count.
  • Gemini 2.5 Flash Unveiled: Members discussed access to Gemini 2.5 Flash 0520 and Gemma 3, available via AI Studio, with one member noting the Gemini diffusion model seemed ā€œa bit shyā€.
    • It was noted that the Deep Think model is apparently in closed beta.
  • Diffusion models generate text in parallel: Members discussed diffusion models, their ability to process chunks of text in parallel, and their potential for non-causal text generation for applications like infilling.
    • Others wondered how the diffusion model works when there is plenty of space in the KV cache, which they don’t seem to leverage.
  • Brainstorm on WildChat dataset continues: A member initially suggested creating a better WildChat dataset from chats with Hermes-3-405B and DeepHermes-Mistral-24B, but another suggested running Hermes through WildChat prompts to share the dataset.
    • The suggestion to run Hermes through WildChat prompts was considered ā€œa much better, more autonomous, and implementable optionā€.
  • Devstral coding agent is on fire: Mistral AI released Devstral, an open-source coding agent.
    • Some sources indicate its benchmarks are not too amazing, but the fact that it's open source is seen as a plus.

Nous Research AI ā–· #ask-about-llms (5 messages):

Restricting Models, AI in education, Hermes 3

  • Model Restriction is very hard: A member inquired about the best model for closely following instructions in a restricted school environment, but another member pointed out that it’s very hard to restrict models to a domain or a task.
    • The member expressed concern over students using AI as full substitutes for teachers and wanted to create an environment where AI enhances learning without replacing human instruction.
  • Try Hermes 3: A member suggested chatting with Hermes 3 in the chat channel.

Gemma 3n models, Matformer arch

  • Gemma 3n models release gets spotlight: With the new release of Gemma 3n models and renewed focus on Matformer architecture, there are promising developments in the AI model space.
  • Teknium chimes in: A member with the handle Teknium (<@187418779028815872>) has been noted in the discussion.
    • There are no details provided as to what the Teknium member said.

Modular (Mojo šŸ”„) ā–· #general (19 messagesšŸ”„):

Claude Code with Mojo, Mojo Code Generation, AI coding assistance for Mojo, Cursor vs Claude

  • Claude Struggles with Mojo Syntax: A member expressed frustration with using Claude for Mojo code generation, citing frequent syntax errors and incomplete code, especially after the Modular docs suggested it.
    • Another member suggested that AI is generally not proficient in new languages or systems programming, and Mojo’s similarity to Python and C++ may confuse the LLM.
  • Fine-Tuning Open Source Models Advised for Mojo: Instead of relying on Claude, members recommended using an open-source model and fine-tuning it specifically for Mojo.
    • One member found the time investment to not be worth it compared to manual implementation.
  • Modular Provides Tips for Code Generation Tools: A member shared tips on using code generation tools with Mojo, emphasizing the importance of providing the open-source repo and docs.modular.com as context to Claude Code / Cursor.
    • Seeding Claude and Cursor rules within the open-source project can improve functionality; claude-sonnet-3.7 has some internal knowledge of Mojo but might be outdated.
  • Cursor preferred over Claude: It has been mentioned that some members internally have used Cursor to generate good and fairly complex Mojo code, and have had success with the open source repo as context.
    • It was said that this was preferable to using Claude.

Modular (Mojo šŸ”„) ā–· #mojo (42 messagesšŸ”„):

Float16 exp implementation, Mojo compile times, String null termination changes

  • Why exp lacks Float16 implementation: The exp function in Mojo is only implemented for Float32 and Float64, with Float16 values being casted due to a potential precision issue, as discussed in this GitHub issue.
    • While lower precision might be acceptable, accumulating errors in Float16 computations could render results useless, and CPU support for fp16 math is not universal, often requiring upcasting to fp32 (see the small fp16 accumulation demo after this list).
  • Simple Mojo compile taking unexpectedly long: A user reported experiencing abnormal compile times (3-4 seconds) for a super-simple ā€œHello, Mojo!ā€ program using Mojo version 25.3.0 on Ubuntu 22.04 (WSL2) with an Intel 12700K.
    • Investigation suggests the issue isn’t inherent to Mojo’s latest version but potentially related to the specific installation or WSL2’s filesystem performance; even REPL mode exhibits delays for simple commands like var a = 5.
  • --sanitize crashes due to String changes: A user found that upgrading to Mojo 25.3 broke their code when running with --sanitize on OSX, likely due to a String null termination issue combined with unsafe_ptr usage.
    • It was confirmed that Strings are not always null terminated now, requiring code adjustments to prevent out-of-bounds access.
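
To make the precision concern concrete, here is a small NumPy demo (values chosen purely for illustration) of why a naive fp16 running sum drifts badly compared to an fp32 accumulator:

```python
import numpy as np

vals = np.full(100_000, 1e-4, dtype=np.float16)  # true sum is 10.0

acc16 = np.float16(0.0)
for v in vals:                       # naive running sum kept in fp16
    acc16 = np.float16(acc16 + v)

acc32 = np.float32(0.0)
for v in vals:                       # same loop, but accumulate in fp32
    acc32 += np.float32(v)

# Once the fp16 running sum grows large enough, each 1e-4 increment falls below
# half an ulp and rounds away entirely, so acc16 stalls far below 10 (around 0.25),
# while the fp32 accumulator lands near the true value.
print(acc16, acc32)
```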

Modular (Mojo šŸ”„) ā–· #max (6 messages):

Modular max imports, Torch CustomOpLibrary, MLIR context errors, Modular Forum for issues

  • Modular max Imports Lack Source Code šŸ”Ž: A user inquired about the source code for max imports seen in example code.
    • A member responded that those aren’t properly open sourced, but you can see them in your environment. Some of those are going to call into C++ code, but you should at least be able to see the corresponding .pyi file.
  • Torch CustomOpLibrary Faces MLIR Context Hiccups šŸ˜µā€šŸ’«: A user reported encountering a RuntimeError: No active MLIR context error when using torch’s CustomOpLibrary implementation.
    • The error occurred when attempting to call any custom op, suggesting a potential configuration or setup issue.
  • Modular Forum Emerges as Issue Iteration Hub šŸ›ļø: A member suggested opening an issue on the Modular Forum to iterate on specific issues related to the CustomOpLibrary and MLIR context errors.
    • The forum is seen as a better place to troubleshoot specific problems.
  • Minimalist Example Links MLIR Issue on Modular Forum šŸ”—: A user created a minimalist example demonstrating the MLIR context error with CustomOpLibrary and posted it on the Modular Forum.
    • The forum post can be found here.

Yannick Kilcher ā–· #general (29 messagesšŸ”„):

Pytorch Geometric, GATConv AssertionError, SGD vs Genetic Breeding, Concept entanglement vs fracture, Picbreeder

  • Users encounter GATConv Assertion Error: A user reported receiving an AssertionError: Static graphs not supported in 'GATConv' when passing batched input to the model forward method, specifically with x.shape: [batch_size, num_nodes, num_node_features] and edge_index.shape: [batch_size, num_nodes].
    • The user notes that the model (torch_geometric.nn.GATConv) seemingly doesn't work with batched input, which is unexpected, as the training loop functions correctly when iterating over each singular batch (see the PyG batching sketch after this list).
  • SGD Generates Spaghetti Code Representations: A user shared a paper stating that SGD produces internal representations akin to spaghetti code, characterized by copy-pasted concepts randomly entangled, which are efficient but not human-understandable.
    • The paper questions whether this internal representation's beauty matters for downstream tasks beyond interpretability and suggests it does, using a CPPN (Compositional Pattern Producing Network; conceptually, a proto-NeRF for 2D) example produced via an open-ended, semi-manual genetic breeding process (e.g., Picbreeder).
  • Entangled vs Fractured Representations Debate: One user found it more likely that this is handled by regularization after training (for example, a network optimizing to predict its own outputs while also optimizing for simpler representations), while critiquing the argument that fracturing is good.
    • A participant argued that repeating neuron representations is generally good for learning (e.g., dropout), redundant circuits boost robustness, and redundant circuits early in training can specialize later on.
  • Stochasticity Encourages Simpler Representations: A user posited that stochasticity encourages simplified representations, suggesting modern LLM performance per parameter is linked to the very entangling of representations the aforementioned paper argues against.
    • The user cited this paper and equated ā€œentanglingā€ with superposition, noting both papers indicate that stochasticity in training promotes simpler, more perturbation-resistant representations.
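
For reference on the GATConv question above, PyTorch Geometric's standard batching pattern merges graphs into one disjoint graph with offset edge indices rather than stacking along a leading batch dimension, which is why [batch_size, num_nodes, num_node_features] inputs fail; a toy sketch with illustrative shapes:

```python
import torch
from torch_geometric.data import Data, Batch
from torch_geometric.nn import GATConv

# Two small graphs, each with 4 nodes and 8 features (shapes are illustrative).
graphs = [
    Data(x=torch.randn(4, 8), edge_index=torch.tensor([[0, 1, 2], [1, 2, 3]])),
    Data(x=torch.randn(4, 8), edge_index=torch.tensor([[0, 2, 3], [1, 3, 0]])),
]

# PyG batches by concatenating node features and offsetting edge indices.
batch = Batch.from_data_list(graphs)   # batch.x: [8, 8], batch.edge_index: [2, 6]

conv = GATConv(in_channels=8, out_channels=16, heads=2)
out = conv(batch.x, batch.edge_index)  # [8, 32]; batch.batch maps each node to its graph
print(out.shape, batch.batch)
```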

Yannick Kilcher ā–· #paper-discussion (14 messagesšŸ”„):

Physics of Language Models: Part 3.2, Knowledge Manipulation, Out-of-distribution abuse, Data contamination

  • Physics of LMs Shows Knowledge Manipulation Deficiencies: A paper titled ā€œPhysics of Language Models: Part 3.2, Knowledge Manipulationā€ (ssrn.com) investigates the ability of language models to use stored knowledge for downstream tasks, revealing deficiencies in classification, comparison, and inverse search.
    • The study found that language models struggle with even simple classification or comparison tasks unless Chain of Thoughts (CoTs) are employed during both training and inference, with near-zero performance in inverse knowledge search.
  • The Buzzwordiness of Out-of-Distribution: Members discussed how the term out-of-distribution has become a buzzword, often lacking a clear and rigorous definition, which leads to its abuse in PR.
    • One member expressed irritation with the term’s overuse, while noting that the paper in question took an atypical approach and controlled for data contamination much better than most.
  • Paper discussion canceled: The daily paper discussion was canceled due to double-booking, and participants were notified via the daily-paper-discussion group.
    • Apologies were offered to those not on the notification list, with instructions to contact k_nearest_neighbor to be added to the list.

Yannick Kilcher ā–· #ml-news (17 messagesšŸ”„):

Gemma 3n, Google AI Edge, Anthropic AI, Humane AI Pin Failure, Rabbit R1 Failure

  • Gemma 3n Arrives: Google released Gemma 3n, their newest model as detailed in their documentation.
  • Google Announces AI Edge Small Language Models: Google announced AI Edge Small Language Models with multimodality RAG function calling, as described in a blog post.
  • Anthropic Teases New Release: Anthropic teased a new release on Twitter.
  • Humane AI Pin and Rabbit R1 Flop: Members noted that the Humane AI Pin and the Rabbit R1 flopped, citing the difficulty of beating the form factor of a mobile phone.
  • Startup Founder Dreams of OpenAI Acquisition: A founder shared a post on Twitter expressing their hope that OpenAI will acquire their startup.

MCP (Glama) ā–· #general (33 messagesšŸ”„):

Streaming transport adoption, Decoupling transport and wire protocols, MCP memory bank, OpenAI rolling out MCP support, Tool name constraints

  • Streaming Transport Gaining Traction: Members discussed the adoption of the streaming transport from the 2025-03-26 version of the protocol, with one noting that VSCode already supports streamable HTTP.
    • It was suggested to decouple the transport and wire protocols, as MCP is supposed to be transport agnostic.
  • MCP Memory Bank Quest: A member sought help finding a Reddit post about an MCP memory bank usable across multiple projects with a snazzy UI, compatible with Cursor, Cline, and Roo.
    • They mentioned that they came across a post on Reddit showing a memory bank MCP that was usable across multiple projects and had a snazzy UI for exploring said memory banks.
  • Tool Naming Troubles: Concerns were raised about the strict naming constraints on tool names, particularly with GitHub Copilot, and the challenges of namespacing.
    • Suggestions included using a system similar to env var namespaces, such as audio__play_track.
  • MCP Healthcheck Hack: Members discussed how to add a healthcheck route to a FastMCP server, with one suggesting using mcp.http_app() and mounting it as an ASGI sub application inside FastAPI.
    • An example using @mcp.custom_route("/health", methods=["GET"]) was provided for a more reliable approach, see FastAPI Integration.
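
A minimal sketch of the healthcheck approach mentioned above, assuming the FastMCP server class and a streamable HTTP transport (exact import paths and transport names vary between the official MCP SDK and the fastmcp package, so treat this as illustrative):

```python
from fastmcp import FastMCP
from starlette.requests import Request
from starlette.responses import PlainTextResponse

mcp = FastMCP("demo-server")

@mcp.tool()
def ping() -> str:
    """A trivial MCP tool so the server has something to expose."""
    return "pong"

# Healthcheck served alongside the MCP endpoint, per the suggestion above.
@mcp.custom_route("/health", methods=["GET"])
async def health(request: Request) -> PlainTextResponse:
    return PlainTextResponse("ok")

if __name__ == "__main__":
    mcp.run(transport="streamable-http")
```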

MCP (Glama) ā–· #showcase (11 messagesšŸ”„):

MCP SDK, Auth server, Typescript, MCP resource server, Frontend CLI

  • MCP Resource Server Separated with New SDK: An article was written about using the new MCP SDK (TypeScript) to separate the MCP resource server from the Auth server: MCP Server with OAuth Typescript.
  • Frontend CLI accessible version released: A member has been working on a library that converts any frontend into a CLI-accessible version for LLMs and also starts an MCP server to let LLMs navigate your frontend with code once feature parity is reached: MCP UI Bridge.
  • SmartBuckets Eliminate RAG Bottleneck: LiquidMetal AI launched SmartBuckets, which indexes files into vectors and an auto-built knowledge graph, runs serverless, and exposes a simple endpoint you can hit from any language, wired straight into Anthropic’s Model Context Protocol (MCP): LiquidMetal AI.
  • Agents as MCP Servers Launched: A significant update to mcp-agent was launched, allowing Agents to act as MCP servers themselves, so that any MCP client can invoke, coordinate and orchestrate agents, code for this here.
  • MCP UI Bridge brings LLM-Oriented Accessibility: A new library, mcp-ui-bridge, is designed to make web applications natively and equally accessible to both human users and Large Language Models (LLMs) through a single, unified development effort, instrumenting existing web apps with semantic data-mcp-* attributes, the npm is here and github here.

Cohere ā–· #šŸ’¬-general (23 messagesšŸ”„):

Private Deployment Options with Cohere, Command A Slow Response Times, Entity Extraction and JSON Output Issues, Embed v4 and Bedrock Availability, Self-Hosting Command Models

  • Cohere offers clients Private LLM Deployment: Cohere offers private deployments as a core part of their solutions for customers with data/LLM sovereignty interests, with flexible deployment options available; contact them at [email protected] or [email protected] for more info.
    • A Cohere employee stated, ā€œAt Cohere, we're deeply committed to your privacy and security, which is why we offer private deployments as a core part of our solutions.ā€
  • Command A Response Times are getting slower: Users reported slower than usual response times with command-a, especially when using the structured response parameter.
    • A Cohere employee confirmed there were no immediate issues and requested more details to investigate, advising them to send an email at [email protected].
  • Command-R JSON output takes 2 minutes: A user reported issues with entity extraction using command-A and command-R models, specifically when specifying json_object as output, causing the requests to hang or take an extremely long time (2 minutes).
    • They noted that while the text version that outputs JSON works fine, the response_format parameter significantly increases response time; a document with 6,405 input tokens and 2k output tokens returns in about 17 seconds without json_object.
  • Embed v4 might be on Bedrock someday: A user inquired about the availability of Embed v4 on Bedrock.
    • Another user simply replied, ā€œits everywheresinker.ā€
  • command-* Models Best for Self-Hosting: A user expressed that they haven't found a better model for general-purpose, personal chat and assistant use that they can self-host than the command-* models.

Cohere ā–· #šŸ”Œ-api-discussions (15 messagesšŸ”„):

Embed v4, vector DB, rate limiting, open models, AWS

  • Embed v4 experiences slowdowns: Users reported super slow embedding v4, experiencing about 180 seconds per embedding.
    • Cohere support acknowledged an incident due to a silently updated rate limiting system that failed to scale up with demand.
  • Cohere fixes the rate limiting system: Cohere support reported to have manually scaled up, fixed the rate limiting bug, and ensured that Embed V4 and other future models are added to the status page.
    • The rate limiting system wasn’t kicking in for the rpm we support.
  • User discusses vector DB trust: A user voiced concerns about building a vector store with a non-open-sourced embedding model, especially after the recent downtime.
    • They expressed that it’s a big investment to build a large vector store with a particular embedding model.
  • User asks for Embed v4 backup plan: A user asked for a backup plan for Cohere’s v4 embedding model, similar to how they can switch between providers for open models like Llama and Gemma.
    • The user suggested accessing v4 via both Cohere API and AWS or another provider for redundancy.
  • Embed V4 is available on Azure and AWS: Cohere support said that Embed V4.0 is available in several places outside of our platform such as Azure Marketplace and AWS Sagemaker.
    • These alternate services provide redundancy in case Cohere’s API experiences downtime.

Cohere ā–· #🟢-status-updates (1 messages):

embed-v4.0, Cohere Status Page

  • Embed-v4.0 Suffers Performance Hit: There is a degraded service affecting embed-v4.0 as of May 21, 2025.
    • Cohere is investigating the live incident, according to their status page.
  • Cohere Investigates Degraded Performance: Cohere’s status page indicates they are actively investigating the degraded performance of embed-v4.0.
    • The incident was reported on May 21, 2025, and is currently under review by the Cohere team.

Cohere ā–· #šŸŽÆ-private-deployments (1 messages):

Cohere Sales, Cohere Support


LlamaIndex ā–· #blog (2 messages):

Monorepo management, uv package management, LlamaDev build tool, Discord Office Hours

  • LlamaIndex migrates to uv and LlamaDev: LlamaIndex migrated from Poetry and Pants to uv and LlamaDev for managing a monorepo of 650+ community packages, resulting in faster and simpler development, as described in this blogpost.
  • LlamaIndex Discord office hours announced: LlamaIndex announced the first Discord Office Hours with Tuanacelik, LoganMarkewich, and Clelia, focusing on Agentic workflows and LlamaIndex questions, on this announcement.

LlamaIndex ā–· #general (35 messagesšŸ”„):

Llama Parse Issues with Layout Agent, VectorStoreIndex vs FAISS, Model llamaindex/vdr-2b-multi-v1 issues, Azure AI Search Integration, LlamaIndex Office Hours

  • Llama Parse’s Layout Agent encounters hiccups: Users reported issues with Llama Parse using the Parse with Layout Agent, where jobs timed out after 30 minutes, prompting investigation into a known issue.
    • A fix was deployed within 20-30 minutes, but follow-up tests indicated persisting problems, leading to a workaround involving a 10-minute timeout and fallback to a weaker parsing mode.
  • VectorStoreIndex vs FAISS: A deep dive: A user inquired about the distinction between using a VectorStoreIndex and a local FAISS for storage and its impact on RAG model performance.
    • It was clarified that a VectorStoreIndex is a wrapper around any vector store, including FAISS, with a link to a FaissIndexDemo.
  • Transformer tantrums: llamaindex/vdr-2b-multi-v1 has issues: A user reported a ValueError when using the llamaindex/vdr-2b-multi-v1 model, related to the size parameter, traced back to a transformers update.
  • Azure AI Search Agentic Retrieval integration sought: A user inquired about integrating Agentic Retrieval of Azure AI Search into the framework.
    • A link to the Azure AI Search announcement was shared and it was mentioned that community contributions for such integrations are welcome.

Torchtune ā–· #general (15 messagesšŸ”„):

Torchtune generate Qwen2_5_0_5b, Tokenizer bug in inference mode, Custom tokenizer patch, LORA finetuning gibberish, Resizing Token Embeddings

  • Torchtune’s Qwen2 Generation Gets No Output: A member reported using torchtune generate on qwen2_5_0_5b and not seeing any output, using a provided config.
    • The observed output included <|im_start|>user Tell me a joke.<|im_end|> <|im_start|>assistant<|im_end|> <|endoftext|>.
  • Tokenizer Bug in Inference Mode Averted: It was determined that a bug in the tokenizer in inference mode was causing the end-of-sequence (EOS) token to be appended prematurely.
    • A fix was implemented via this PR, resolving the issue of Qwen prematurely stopping generation.
  • Custom Tokenizer Patch Gets it Working: A member patched the fix to their custom tokenizer and confirmed it’s working, generating the joke: Why don’t scientists trust atoms? Because they make up everything!.
    • They reported inference time of 1.04 sec total, 13.46 tokens/sec, bandwidth achieved of 13.57 GB/s, and memory used of 1.04 GB.
  • LORA Finetuning Produces Gibberish Output: The member reported that after adding new tokens to the model and performing LORA finetuning, the output became gibberish.
    • The user provided the example: Tell me a joke.<|im_end|> <|im_start|>assistant ormoblllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll.
  • Inquire about using Embedding Utils: In response to the gibberish output, a member inquired about the use of the resize_token_embeddings utility from torchtune.modules.embedding_utils after adding new tokens.
    • The tool in question can be found here: torch.org.
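
As a rough sketch of the suggested fix, the flow might look like the following; the exact torchtune call signature and the vocab size here are assumptions, so check the linked utility's documentation before relying on it:

```python
from torchtune.models.qwen2_5 import qwen2_5_0_5b
from torchtune.modules.embedding_utils import resize_token_embeddings

model = qwen2_5_0_5b()

# Assumed usage: after adding new special tokens to the tokenizer, grow the
# token embedding table (and the tied/output projection) to the new vocab size
# before LoRA finetuning, so the new token ids map to real rows.
new_vocab_size = 151_672  # hypothetical: original vocab plus the added tokens
resize_token_embeddings(model, new_vocab_size)
```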

Torchtune ā–· #dev (2 messages):

DistCp, Safetensors, Async Checkpointing, DCP Team

  • DistCp format to Safetensors Conversion Issue Requested: A user inquired about an open issue for converting the DistCp format (produced by async checkpointing) to safetensors.
    • A team member responded that no such issue exists and suggested the user create one if needed, promising to provide utilities to facilitate the conversion, and signal the DCP team for long-term consideration.
  • Async Checkpointing and DistCp Discussion: The conversation revolved around converting the DistCp format, produced by async checkpointing, to safetensors.
    • While no existing solution was readily available, the team showed willingness to assist and consider it for future development.

Torchtune ā–· #rl (4 messages):

Async RL Recipe, Microsoft's Verl framework

  • Async RL Recipe has pegged VLLM: The async RL recipe is still pretty experimental and has been pegged to a stable version of vllm.
    • Members plan to get vllm working with stable pytorch and nightlies before making this recipe a more standard torchtune recipe.
  • Microsoft’s Verl Framework Surfaces: A member discovered an RL training framework today from a paper by Microsoft.
    • Another member agreed Verl is great for RL training and asked which parts would be particularly useful.

tinygrad (George Hotz) ā–· #general (6 messages):

Job Opportunities, tinygrad bounties, distributed training, mmapeak work, RDNA4 instructions

  • tinygrad Job Opportunities Require Bounties: A user inquired about job opportunities at tinygrad after seeing a post on X, and was informed that the primary pathway to employment is through contributing to bounties and small PRs.
    • It was recommended that they check out the pinned post on X for more information.
  • Distributed Training Commoditizes Petaflops: A user asked about how distributed training fits into commoditizing the petaflop and linked to pccl on GitHub.
    • No further discussion was given.
  • mmapeak Work Ready: A user announced that the mmapeak work is ready for review on GitHub.
  • RDNA4 Instructions Compile Successfully: A user reported that the RDNA4 instructions compile successfully.
    • They noted that adjusting the number of waves per CU might be necessary to achieve optimal performance.

tinygrad (George Hotz) ā–· #learn-tinygrad (6 messages):

JAX control flow, Tensor.where, jax.lax.cond

  • Control Flow Question Floats: A member inquired about Tinygrad’s approach to control flow, drawing a comparison to JAX’s jax.lax.cond which allows conditional execution without breaking the computation graph, referencing the JAX documentation.
    • The member noted that this capability is essential for implementing many Monte Carlo algorithms.
  • Tensor.where workaround worth waving?: Another member suggested using Tensor.where as a possible solution.
    • However, the original poster clarified that jax.lax.cond determines which branch of code to execute, unlike Tensor.where which operates on a tensor and, depending on the situation, could still execute both paths.
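
To make the distinction concrete, here is a small side-by-side sketch (assuming both jax and tinygrad are installed); the point is which branches actually execute, not the particular numbers:

```python
import jax
import jax.numpy as jnp
from tinygrad import Tensor

# jax.lax.cond traces both branches but executes only the selected one at runtime,
# keeping the conditional inside the jitted computation graph.
def jax_step(pred, x):
    return jax.lax.cond(pred, lambda v: v * 2.0, lambda v: v - 1.0, x)

print(jax.jit(jax_step)(True, jnp.array(3.0)))   # 6.0

# tinygrad's Tensor.where selects elementwise between two already-built tensors,
# so both "branches" are part of the graph and may both be computed.
cond = Tensor([True, False])
print(cond.where(Tensor([1.0, 2.0]) * 2, Tensor([1.0, 2.0]) - 1).numpy())  # [2., 1.]
```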

DSPy ā–· #general (8 messagesšŸ”„):

DSPy Framework, Bias Training, Case Study

  • DSPy Makes Agent Changes Easy: Members reacted positively to the ease of making changes to agents within the DSPy framework.
  • Mike Taylor DSPy Case Study: A member shared a link to a case study applying DSPy by Mike Taylor.
    • Another member commented that DSPy may be playing a bit fast and loose with the idea he’s training any bias out, though teaching it what demos voted for whom.

LLM Agents (Berkeley MOOC) ā–· #mooc-questions (3 messages):

Course deadlines, Certificate requirements

  • Quizzes Completion Deadline Confusion: A member who joined the course late inquired whether they could still complete all the quizzes by the end of the month to obtain a certificate, despite the course site indicating deadlines before the next lecture.
    • The question centers on the flexibility of deadlines and whether late submissions would still count towards certification.
  • Verification of Assignment Quality for Certificate: A member asked if there was a way to verify if their submitted labs and written assignments were sufficient for earning the course certificate.
    • They expressed a willingness to revise and resubmit if necessary, highlighting a proactive approach to meeting the certificate requirements.

Nomic.ai (GPT4All) ā–· #general (3 messages):

GPT4All OpenAI API Key, Extending GPT4All interface for more than text LLMs

  • User Needs Help with OpenAI API Key Installation in GPT4All: A user is seeking assistance with installing an OpenAI API key in GPT4All for an upcoming exam, reporting that the install button isn’t working despite entering the key.
    • The user attached a picture showing the API key box and the non-functional install button, emphasizing the urgent need for help.
  • GPT4All Interface Extension beyond Text LLMs?: A user inquired about plans to extend the GPT4All interface to support more than just text-based LLMs.
    • The user wanted to know whether the GPT4All interface could go beyond supporting text-only LLMs.

Gorilla LLM (Berkeley Function Calling) ā–· #discussion (1 messages):

Manus AI Referral, Powerful Agents

  • Manus AI Referral Link Shared: A member shared a referral link (https://manus.im/edu/invitation/4EQHF7LZZ1JH7V) for Manus AI, which is currently in an invite-only stage.
    • The referral provides 1000 starter credits and is exclusively for individuals with Cal emails.
  • Manus AI Touted as Powerful Agent: The member described Manus AI as one of the more powerful agents available, particularly for multistep tasks.
    • They encouraged others to try it out using the provided referral link.