Gemini gemini gemini. Readers might also enjoy our Karpathy @ Startup School recap.

AI News for 6/16/2025-6/17/2025. We checked 9 subreddits, 449 Twitters and 29 Discords (219 channels, and 6626 messages) for you. Estimated reading time saved (at 200wpm): 547 minutes. Our new website is now up with full metadata search and beautiful vibe coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on @smol_ai!

As previewed multiple times in the past 3 months leading up to Google I/O and at the AIE World’s Fair, Gemini Product Lead Tulsee Doshi finally announced that the 2.5 models are now generally available (aka with no “preview” or date tag).

Gemini 2.5 also now comes with a 30-page tech report with some notable details on evals, and a teeny tiny bit on architecture:

2.5 Flash Lite, the cheap/fast model, is now in preview, with Oriol Vinyals emphasizing the simulation use cases (e.g., a conceptual Neural OS) unlocked by >400 tok/s.


AI Twitter Recap

Model Releases, Benchmarks, and Performance

  • Gemini 2.5 Family Release: @OriolVinyalsML introduced the Gemini 2.5 family, highlighting the new Gemini 2.5 Flash-Lite model and emphasizing that its speed enables new use cases like a conceptual Neural OS. @demishassabis expressed excitement for more users to try the models. The official family includes Gemini 2.5 Pro and Flash (stable), along with the new Flash-Lite and Ultra (in preview), as noted by @OfficialLoganK. According to @_philschmid, the models are sparse Mixture-of-Experts (MoE) transformers with native multimodal support. A technical report was released, which, @swyx points out, details a fully autonomous Gemini Plays Pokemon run that completed the game in half the time of the original, showcasing impressive long-horizon planning.
  • LLM Coding Performance on LiveCodeBench-Pro: A new benchmark, LiveCodeBench-Pro, revealed that even the best frontier LLMs scored 0% on Hard problems, as shared by @sainingxie. This has been described as “BAD news for LLM’s coding skill”, though @scaling01 notes that achieving the 98.5th percentile for 10 cents is still a remarkable feat.
  • Kimi-Dev-72B Release: Moonshot AI open-sourced Kimi-Dev-72B, which achieved a state-of-the-art 60.4% on SWE-bench Verified, as shared by @scaling01. However, @gneubig pointed out a 43% accuracy drop when the model was evaluated in a different harness, highlighting the difference between agentic and non-agentic evaluation approaches.
  • Specialized and Smaller Models: A number of smaller, specialized models have been released, underscoring the trend that “bigger isn’t always better” as @ClementDelangue puts it. Releases include Nanonets-OCR-s, an open-source OCR model that understands context and semantic structure; II-Medical-8B-1706, which outperforms Google’s MedGemma 27B; and Jan-nano, a 4B model that outscores DeepSeek-v3-671B using MCP.
  • DeepSeek-r1 and MiniMax-M1: The new DeepSeek-r1 (0528) model has tied for #1 in WebDev Arena, matching Claude Opus 4. Additionally, MiniMax has open-sourced MiniMax-M1, a new LLM that sets new standards in long-context reasoning.
  • Kling AI for Video Generation: Kling AI showcased its capabilities with several videos, including a Ghibli-style game animation and an ASMR video featuring its new sound effects feature. Users noted the nuance in character movement as a key feature for storytelling.

AI Agents, Tooling, and Frameworks

  • Sakana AI’s ALE-Agent & ALE-Bench: @SakanaAILabs introduced ALE-Bench, a new coding benchmark focused on hard optimization problems (NP-hard), developed with AtCoder. Their agent, ALE-Agent, participated in a live AtCoder contest and achieved a ranking of 21st out of 1,000 human participants (top 2%), as highlighted by @hardmaru.
  • Multi-Agent Systems & RAG: @jerryjliu0 shared a Microsoft tutorial on building a multi-agent travel planner using LlamaIndex.TS, which allows agents to hand off tasks with full shared context. @hwchase17 also shared a post from 11x on rebuilding their agent with the full LangGraph stack.
  • OpenAI Product Updates: @gdb announced that ChatGPT image generation is now available to everyone in WhatsApp by messaging 1-800-CHATGPT. The team is also expanding, with a call for developers to join the Codex team.
  • Open Source Tools: OpenHands released a new, easy-to-install CLI that works in standard development environments. For JAX users, @fchollet noted that KerasHub now supports loading, fine-tuning, and quantizing models from HuggingFace.
  • Agentic Video Generation: @fabianstelzer demonstrated using agents on Glif to generate longer videos with Flux Ultra and Kling 2.1, suggesting that authorship will move from creating films to creating agents that create films.

Infrastructure, Hardware, and Efficiency

  • Groq and Hugging Face Integration: @GroqInc announced a major integration with Hugging Face, bringing blazingly fast inference to the HF Playground and API. The move is seen as an aggressive play to challenge established cloud providers.
  • Serving LLMs on Huawei’s CloudMatrix384: @arankomatsuzaki shared details on Huawei’s CloudMatrix384, a platform integrating 384 Ascend 910C NPUs. It achieves 2k tokens/s decode per NPU for models like DeepSeek-R1, optimized for large-scale MoE and distributed KV cache.
  • Python’s “nogil” Builds: @charliermarsh announced that the Python Steering Council has voted to remove the “experimental” label from the free-threaded (“nogil”) builds for Python 3.14.
  • Vector Database Tooling: @qdrant_engine released a new open-source CLI tool for streaming vectors between Qdrant instances and from other vector DBs, enabling live, resumable migrations with no downtime.
  • Speculation on Local OpenAI Model: Following comments from Sam Altman about a “locally” runnable model, @Teknium1 speculated it would have to be tiny to run on consumer cards with ≤12GB VRAM, though later noted that for Macs with unified memory, it could be larger.

Research and New Techniques

  • Diffusion Language Models: @sedielem highlighted “The Diffusion Duality” paper from ICML 2025, which uncovers a connection between continuous and discrete diffusion models, allowing techniques like consistency distillation to be transferred to the discrete setting.
  • Rethinking LLM Evaluation: @corbtt made a hot take that at current token prices, you should always ask an LLM-as-judge to explain its Chain-of-Thought first for easier debugging. In a similar vein, @goodside posted a complex reasoning challenge to test models’ capabilities beyond simple pattern matching.
  • Multilingual Tokenizers: Work from Cohere Labs, shared by @sarahookr, demonstrated that a “universal” tokenizer covering more than just primary languages greatly improves downstream performance on multilingual tasks.
  • Complex Systems Perspective on LLMs: @MelMitchell1 shared a new paper co-authored with David and John Krakauer titled “Large Language Models & Emergence: A Complex Systems Perspective,” which argues that concepts like emergence are often misused and provides a more rigorous framework.
  • Human-in-the-Loop (HITL) for Synthetic Data: @TheTuringPost detailed how HITL is crucial for grounding synthetic data through validation, labeling refinement, and RLHF.

Industry News, Commentary, and Geopolitics

  • John Carmack and Gaming History: @ID_AA_Carmack shared a video from the induction of the classic game Quake into the Strong Museum’s World Video Game Hall of Fame.
  • AI Talent Wars: @Yuchenj_UW relayed a comment from Sam Altman stating that Meta is offering $100M signing bonuses to attract OpenAI staff, which he claimed was not how to build a great culture.
  • Learning and Publishing in the Age of AI: @Yuchenj_UW reflected on how a Cornell CS intern used ChatGPT for all assignments and crammed for the final, expressing gratitude for learning CS before LLMs. In a related sentiment, @jxmnop encouraged researchers to publish their work on arXiv, noting that with over 2000 AI papers posted last week, the worst case is that no one notices.
  • Mary Meeker’s AI Report: @DeepLearningAI shared Mary Meeker’s first tech market survey since 2019, a 340-page report arguing that AI’s adoption and capital spending are fueling both record opportunities and significant risks, with a key takeaway being that organizations attracting the best developers will win.
  • Launch of Generalist AI: @peteflorence announced the launch of Generalist AI, a new company he’s been building since leaving Google DeepMind, with the mission to make general-purpose robots a reality. The company is backed by @sparkcapital.

Humor/Memes

  • Political Satire: @zacharynado shared a tweet highlighting that the DOGE organization pressured an agency to hire a 21-year-old former intern from Palantir. Another tweet RT’d by him poked fun at Cybertruck owners.
  • The McKinsey Report: A 2025 McKinsey report recommending Grok-1 and GPT-3.5 Turbo was widely mocked, with @giffmana posting ”> Be McKinsey 💀”.
  • AI and Religion: @code_star posted, “In my youth pastor voice: ‘Yeah ChatGPT is cool but do you know who else trained in a middle eastern desert for forty days and forty nights?’”
  • Gemini’s Hidden Message: A tweet RT’d by @jeremyphoward joked, “read the first letter of every name in the gemini contributors list”.
  • Programming Memes: @DeepLearningAI shared a classic programming meme about reality vs. expectations.

AI Reddit Recap

/r/LocalLlama Recap

1. Local/Open-Source LLM Rigs and Daily Usage

  • Completed Local LLM Rig (Score: 283, Comments: 99): The user showcases a high-end local LLM (Large Language Model) computing rig featuring 4x RTX 3090 GPUs, a Threadripper 3945wx 12-core CPU, 256GB DDR4-3200 RAM, and a MoRa 420 external radiator, all fit inside a legacy Silverstone Raven RV-02 ATX case. The build emphasizes both performance (notably, GPU temps at 57C under load with external radiator cooling) and compactness, overcoming notable challenges with multi-GPU thermal management, power delivery (Seasonic Prime 2200W PSU), and legacy chassis airflow. Top comments are mostly lighthearted, with some acknowledgment of the scale of the external radiator and how the build is prepared for ‘AI winter.’ No major technical critique or discussion of specific bottlenecks, cooling solutions, or LLM-specific workflow integrations.
    • A user requests benchmark numbers for the build, indicating interest in quantitative metrics such as inference speed, training throughput, or power draw. Sharing these benchmarks would help the community assess the rig’s actual LLM performance in realistic workloads.
    • Another user specifically points out the use of NVLink, which suggests the build likely utilizes multiple high-end NVIDIA GPUs interconnected for increased memory bandwidth and faster inter-GPU communication, essential for training large language models at scale.
  • Who is ACTUALLY running local or open source model daily and mainly? (Score: 110, Comments: 113): The post queries the practical adoption of local or open-source LLMs as daily drivers for coding, writing, or agent-based tasks, asking about inference setups (remote vs. local) and associated applications. Top commenters report running coding LLMs locally via KoboldCPP integrated with VSCode’s Continue extension and use of stable diffusion models like InvokeAI for image generation. Another user mentions using Jan-nano locally as a Perplexity AI replacement, with one developer highlighting the tradeoff: local models’ lower capabilities act as a feature, limiting cognitive offload and preserving developer skill and enjoyment. Discussion underscores that, for some, the intentional weakness of local models is preferred over leading proprietary options due to concerns about code quality and the risk of excessive task offloading dulling problem-solving skills.
    • A user provides a detailed workflow for using local LLMs to write TV scripts, combining input/output evaluation with model-based grading and iterative story development. They compare various models (e.g., gemma3:12b, qwen3:14b/32b, deepseek-r1:14b/32b, mathstral, mistral) on a 4080 laptop with 32GB RAM, sharing disk sizes and emphasizing that for creative, non-engineering tasks, parameter counts above ~130B offer diminishing returns. Large, technical models show clear benefits for complex engineering problems, but much less so for ‘lulz’ or creative narrative work, where smaller models are often sufficient.
    • Another technically focused comment highlights successful daily use of Qwen 3 14b q8 on local hardware (RX 7900XTX 24GB) for tasks like summarizing texts and constructing detailed responses. The commenter points to this model as a first local LLM that meets their usability needs, indicating sufficient performance and convenience for regular, meaningful workloads.
    • There’s specific mention of using KoboldCPP to run coding LLMs locally, paired with VSCode and the “Continue” extension for integrated development. Also mentioned is InvokeAI with various models for local image generation, illustrating a working local AI stack for both code-assisted and creative image workflows.

2. AI Model Strategy and Future Plans (Qwen3 and MoE)

  • There are no plans for a Qwen3-72B (Score: 258, Comments: 66): The image is a tweet from Junyang Lin, a key contributor to the Qwen models, explicitly stating there are no plans to release a Qwen3-72B dense model. The tweet elaborates that optimizing both effectiveness and efficiency for dense models greater than 30B parameters—whether during training or inference—is challenging, and hence their strategy is to prioritize Mixture of Experts (MoE) architectures for scaling up model size. This indicates a paradigm shift from large dense models towards MoE for future large-scale Qwen releases. View image Commenters discuss technical trade-offs: one compares inference speed and memory characteristics between a hypothetical 72B dense model and a 235B MoE with 22B active parameters, highlighting practical deployment considerations on current hardware. Another expresses concern that the efficient era of running large dense models on dual 24GB GPUs may be ending, while some still hope for intermediate-sized Qwen3 models.
    • A user compares performance trade-offs between a 72B dense model fully offloaded to VRAM and a 235B MoE (Mixture of Experts) with 22B active parameters, noting that with MoE, only a portion of parameters are active at inference. They specifically question speed differences when the equivalent active layers are offloaded to VRAM and the rest are kept in RAM, highlighting inference-time memory management challenges for large models (a rough memory-math sketch follows at the end of this list).
    • Another comment points out that 32B parameter models can be run on 24GB VRAM, suggesting that this will make large language models increasingly accessible for local inference. This also pressures model creators to optimize models for these memory constraints, signaling potential for advancements such as quantization-aware training (QAT) and more efficient MoE architectures tailored for consumer hardware.
  • It seems as if the more you learn about AI, the less you trust it (Score: 110, Comments: 57): The post raises concerns about overreliance on LLMs, referencing both OpenAI staff statements and comments from leading AI researchers (Hinton and LeCun) that production-grade use of LLMs is precarious, with LLM outputs remaining brittle (prone to subtle bugs) and currently lacking robust architectural reasoning. The discussion further highlights worrying industry trends: widespread claims of LLMs replacing programmers, despite their statistically probabilistic (rather than deterministic) nature, as noted in technical commentary, and the resulting challenge for developers who may bypass foundational programming understanding. Top comments reinforce that deep field experience often leads to greater skepticism (citing parallels in food chemistry and medicine), and emphasize that understanding the limitations—specifically, that ML is inherently probabilistic—allows for safe, productive integration of LLMs if outputs are properly validated. The debate centers on calibrated trust, not complete adoption or rejection, aligning with expert consensus that LLMs are valuable tools when their error modes are well-understood and managed.
    • Several commenters highlight that understanding LLMs’ probabilistic nature and their limitations—especially their tendency to produce plausible but incorrect outputs—makes them more effective as programming tools. The need for critical evaluation and prompt engineering is emphasized, as blind trust can lead to harmful outcomes, but careful usage significantly improves productivity, particularly in mundane or repetitive coding tasks.
    • One user provides a technical anecdote demonstrating Gemini’s capability: it synthesized functional code for an obscure algorithm sourced only from whitepapers with insufficient explanations, and could iteratively refine this code for real-world use cases. While Gemini still produces occasional bugs requiring human debugging, comments suggest that prompt-crafting and debugging will become integral skills for future programmers leveraging AI tools.
    • A thread links to a prior discussion (https://www.reddit.com/r/LocalLLaMA/comments/1kwile2/engineers_who_work_in_companies_that_have/) expanding on engineers’ trust and usage patterns with local LLMs, indicating ongoing technical debate in the community about AI practicalities and reliability in real development work.
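To make the dense-vs-MoE trade-off from the Qwen3 thread concrete, here is a rough back-of-the-envelope sketch (our own illustration, assuming 4-bit quantized weights and ignoring KV cache and activations):

```python
# Rough memory math for the 72B-dense vs 235B-A22B-MoE comparison above.
# Assumption: ~0.5 bytes/parameter (4-bit); KV cache and activations ignored.

def weight_gb(params_billion: float, bytes_per_param: float = 0.5) -> float:
    """Approximate weight footprint in GB for a given parameter count."""
    return params_billion * bytes_per_param  # 1e9 params and 1e9 bytes-per-GB cancel out

dense_72b  = weight_gb(72)    # ~36 GB: fits across 2x24 GB GPUs, and every weight is read each token
moe_total  = weight_gb(235)   # ~118 GB: the whole model must sit somewhere (VRAM + system RAM)
moe_active = weight_gb(22)    # ~11 GB: only the routed experts' weights are read per token

print(f"72B dense:     {dense_72b:.0f} GB resident, {dense_72b:.0f} GB touched per token")
print(f"235B-A22B MoE: {moe_total:.0f} GB resident, ~{moe_active:.0f} GB touched per token")
```

This is why the thread frames it as a bandwidth question: the MoE reads far fewer bytes per token, but its full footprint no longer fits in a dual-24GB setup the way a quantized 72B dense model can.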

3. Comprehensive Tutorials for Building AI Agents

  • A free goldmine of tutorials for the components you need to create production-level agents (Score: 142, Comments: 14): The post announces a repository (agents-towards-production) featuring 25+ detailed, free tutorials on building production-ready AI agents, organized by key engineering concerns (e.g., orchestration, tool integration, observability, security, deployment, agent frameworks, model customization, and multi-agent coordination). The repo is notable for rapid adoption (~500 stars in 8 hours), and is part of a broader initiative by the author to provide high-quality open-source GenAI educational material. The tutorials focus on end-to-end agent systems for real-world deployment but are primarily cloud- or service-based, not local. Top technical questions from commenters concern core agent system architecture—i.e., whether an agent simply cycles the user/system prompt and tools (including memory) in a loop until a satisfactory output is achieved, and how multi-agent workflows are orchestrated through chaining or looping between assistants. A factual note clarifies that the orchestration examples rely on online services, not on local execution.
    • A technical question was raised about the core architecture of production-level agents: asking if their design typically involves a user/system prompt, tools (including memory), and iterative loops until acceptable output is achieved. The discussion further probes whether multi-agent systems are constructed by chaining different agents into the workflow pipeline, essentially creating agent-to-agent interactions and feedback loops.
    • A critical technical note was provided that the first tutorial under “Orchestration” in the linked tutorials (https://github.com/NirDiamant/agents-towards-production) does not focus on local deployment but instead utilizes an online service, which may impact how these tutorials apply to those seeking local-first production agent infrastructure.

Other AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. Advances and Benchmarks in AI Model Releases (Gemini, EG-CFG, OpenAI o3)

  • The guy that leaks every Gemini release teases Gemini 3 (Score: 1015, Comments: 114): The image is a screenshot of a tweet by Logan Kilpatrick, known as the AI Studio lead at Google, teasing an impending Gemini release by simply repeating ‘gemini’ three times, suggesting hype around ‘Gemini 3.’ The Reddit discussion links Kilpatrick directly with official Google Gemini announcements and discusses expectations for future models, such as Gemini 2.5 Flash, a full 2.5 Pro release, and a 2.5 DeepThink-preview. This indicates the community is anticipating incremental releases rather than a jump straight to Gemini 3. Some commenters dispute Kilpatrick being a ‘leaker’ since he is a Google official, emphasizing that teasers like these are coordinated communications rather than leaks. Speculation in the thread suggests the next major releases could include improved Gemini 2.x variants rather than a true Gemini 3.
    • One commenter speculates on a structured release roadmap for Gemini models: suggesting a 2.5 Flash update on 6-17, a “Full Release” of Gemini 2.5 Pro, and an early preview version called “2.5 DeepThink-preview.” This points to possible parallel, incremental model improvements and differentiation within the Gemini suite, possibly targeting varied latency, context handling, or reasoning capabilities. Another technical theme is the observation of distinct model variants (“Three different Geminis”), which aligns with prior Google’s approach: releasing multiple model types, such as Flash for speed, Pro for balanced/flexible capabilities, and potentially a DeepThink variant seeking deeper reasoning or context windows.
  • Gemini 2.5 Pro (Full version) arrives in AI Studio along with 2.5 Flash and 2.5 Flash Lite Preview 06-17 (Score: 407, Comments: 71): The image is a screenshot from AI Studio displaying the full version of Gemini 2.5 Pro, alongside Gemini 2.5 Flash and Gemini 2.5 Flash Lite Preview 06-17, all flagged as ‘NEW’. The menu reflects recent updates, with Google introducing these variants according to their recent announcement. The interface update also highlights a ‘new URL context tool.’ Comments confirm that this Gemini 2.5 Pro release is technically identical to the previous ‘pro 0605’ version, with no benchmark changes despite the rebranding. Technical discussion in the comments emphasizes that the new 2.5 Pro is unchanged from the prior release, according to benchmark comparisons. This suggests the update is more a UI/branding refresh than a substantial model upgrade.
    • Benchmarks between Gemini 2.5 Pro (Full version) and the earlier ‘pro 0605’ release are identical—even to the decimal—indicating that the “new” release is essentially a rebranding of the previous checkpoint with no model architecture or performance updates. Benchmark image reference.
    • Discussion notes that Flash pricing has increased, but the technical specs of Pro have not been updated. Additionally, comparisons to o3 pricing ($2/$8) highlight that existing competitive models are currently similarly priced or cheaper, reducing pressure for rivals to respond or for users to switch.
    • One user speculates that the current Gemini 2.5 release (matching the previous 06-05 version) may serve as the final iteration before a potential version jump to Gemini 3.0, suggesting the lack of substantial updates may indicate a developmental pause before a more significant upgrade.
  • This was tweeted half a year ago. We currently still don’t have a usable model that is as good as the o3 they showed us then. Reminder that OpenAI workers also don’t know how fast progress will be. (Score: 260, Comments: 87): The image displays a tweet by Noam Brown announcing OpenAI’s “o3” model with graphs depicting significant performance improvements in Elo rating and research math benchmarks over earlier models. Despite this, the Reddit post highlights that, half a year later, no generally available model matches the capabilities demonstrated in those graphs, raising questions about the pace and feasibility of translating research metrics into publicly usable products. The tweet’s claim of optimism is contrasted by the post author’s skepticism about hype versus deliverable progress. Comments add technical nuance: one argues the o3 model’s exceptional results were achieved only with massive compute (possibly 1000-10000x the cost of available models), suggesting issues of scalability and public release, while another posits OpenAI could have stronger models (“o4”) internally but restricts release due to usability or cost constraints. Another points out the statistical limitations in the graphs, based on just two datapoints, implying overinterpretation of limited benchmarks.
    • One user asserts that the high-performing version of the model showcased (o3) was only achievable by leveraging orders of magnitude more compute—“1000-10000x as expensive to run” compared to available models. This points to significant efficiency barriers for widely accessible deployment of such models.
    • A technical distinction is raised in comparing the current o3 and o4-mini models to the earlier demo: they now arguably surpass the demo version but do so without relying on consensus voting (the ‘lighter shade of blue’), resulting in far lower inference costs.
    • It’s speculated that the upcoming GPT-5—a unified model—is expected to surpass both o3 and o4-mini in capabilities, possibly integrating the performance of higher-end demos with broader usability and efficiency improvements.
  • Apple said LLMs can’t think. This team just made one debug itself - and it smashed every benchmark. lol, we’re doomed. (Score: 239, Comments: 86): The image is a bar chart showing the new EG-CFG framework’s test results on several code-generation benchmarks (MBPP, MBPP-ET, HumanEval-ET, CodeContests), with EG-CFG outperforming notable contenders such as OpenAI, Google, DeepMind, QualityFlow, and LPW. The method integrates execution feedback directly into the LLM’s generation loop, allowing it to read execution traces and debug code autonomously, as described in the linked tweet and supported by this arXiv paper. However, technical discussion in the comments emphasizes caveats: the benchmarks are considered saturated or outdated, some comparisons (especially for CodeContests) use less competitive baselines, and scaffolded results across models are not always apples-to-apples, calling into question the SOTA-crushing claims. The open-source GitHub repo is noted, and researchers clarified some data selection issues and comparison choices. The top comments highlight skepticism about the benchmark choices, the fairness of comparisons (e.g., using older models or benchmarks that may be saturated), and call for replication using the released code. There is technical debate about the credibility of the performance claims and whether baseline models and scaffolds are truly comparable, as well as contention that the announcement overstates novelty given established results in the space.
    • One commenter highlights concerns about validity and relevance of the benchmarks reported in the paper (MBPP, HumanEval, CodeContests), noting some are already saturated and not commonly used for state-of-the-art announcements anymore. They claim that some comparison models (like GPT-4o + LPW) might actually perform better than presented, and point out that benchmark data, comparison models, and even graph labels are sometimes omitted or spun for marketing effect. There is a technical concern about possible data contamination and the reliability of results given differences in scaffolding approaches and base models.
    • Another technical critique is that the new framework’s claimed superiority is hard to evaluate objectively because many baselines used for comparison involve either outdated scaffolds or base models (like DeepSeek V3-0324) that may already outperform most results in their selected benchmarks. Complications also arise because some alternative scaffolds (like MapCoder, LPW) are highly model-dependent or hard to reproduce due to code unavailability or close-sourcing, with the authors sometimes using unorthodox workarounds or being ‘charitable’ with reported numbers, making independent replication and interpretation critical for verifying claims.
    • A further point discusses the importance of fair comparisons in scaffolding and benchmarking: unless head-to-head comparisons are performed using the same base model and scaffolding method (e.g., their framework versus LPW on identical models), any claims of superiority remain tenuous. There is skepticism about both the novelty and the evaluation rigor compared to prior art, and a call for more transparent apples-to-apples benchmarking to substantiate real improvements in LLM-based code generation.

2. Innovative Workflows and Tools for AI-Based Image/Video Generation (Flux, ComfyUI, WAN)

  • Using Flux Kontext to get consistent characters in a music video (Score: 142, Comments: 17): The OP highlights the effectiveness of Flux Kontext (as used in Remade’s edit mode) for generating consistent characters across scenes in a music video, with simple prompts (e.g., ‘Make this woman read a fashion magazine’). Alternative approaches like Runway’s reference images failed to match the performance, implying strong prompt-to-output coherence from Flux Kontext. There is uncertainty about whether Remade processes/enhances prompts beyond passing them directly to Kontext. Top comments note frustration that Flux Kontext is not open source or locally available yet, with one querying open-source/API status and another lamenting the lack of open access.
    • Multiple commenters note that Flux Kontext, or similar character consistency technology for music video animation, is currently not available as open source nor as a local/offline tool. This highlights a gap for developers and researchers who want to experiment or build upon such technology themselves, implying current offerings are closed or limited to API access only.
    • There is skepticism about the potential official release of Kontext by its developers, and concerns are raised that future alternatives (like those from “Black forest”) may be of inferior quality compared to the potential of an open source implementation.
  • Universal style transfer with HiDream, Flux, Chroma, SD1.5, SDXL, Stable Cascade, SD3.5, AuraFlow, WAN, and LTXV (Score: 121, Comments: 9): A new universal style transfer strategy has been introduced, leveraging projection into the higher-dimensional latent space of various generative models (HiDream, Flux, Chroma, SD1.5, SDXL, SD3.5, Stable Cascade, WAN, LTXV, AuraFlow) without the need for additional training. The method employs model-specific Loader and Patcher nodes (e.g., ReSDPatcher, ReFluxPatcher), and works for both img2img and txt2img by swapping nodes; it notably excels in detail retention and style transfer fidelity with models like HiDream and Stable Cascade. The approach improves flexibility and reduces training cost, integrating well with existing pipelines—see the RES4LYF repo for minimal setup, example workflows, and additional integration instructions, including for UltraCascade node pack. Users expressed enthusiasm and inquired about model compatibility (e.g., with Lumina), while also asking for clarification on term definitions (such as “bongmath”). No technical objections or deep debates were present.
    • A user inquires about parameter tuning for style transfer, focusing on controlling output precision (e.g., object placement like hair sticks differing between source and target styles). They ask whether it’s possible to separate style from precise content details, highlighting a typical challenge in neural style transfer: modifying only style without altering structural arrangements. This invites discussion of advanced techniques or workarounds to fine-tune style transfer fidelity.
    • Another comment seeks clarification on integrating LoRA (Low-Rank Adaptation) weights into the workflow, requesting a concrete example (screenshot or JSON) to illustrate at which point in the pipeline LoRAs can be inserted for added controllability. The request underlines the importance of composability and modularity in style transfer architectures, especially when combining base models and parameter-efficient adaptation methods.
    • A technical question is raised about compatibility with additional models, specifically asking whether the described approach can function with ‘Lumina’. This indicates user interest in interoperability and the scope for extending the method to other models in the diffusion/stylization ecosystem.
  • Tried Wan 2.1 FusionX, The Results Are Good. (Score: 130, Comments: 32): User reports results with the Wan 2.1 FusionX model (model details here) using ComfyUI and associated components: workflow file, Clip Vision, text encoder, and VAE. On a 32GB RAM, 4060Ti (16GB VRAM) system, video generation is slow (e.g., 5s output ≈ 500s computation, scaling to 10s ≈ 2600s), which is significantly slower than LTX Distilled models. Reference YouTube demo here. Top technical comments suggest workflow sharing and model component links. Another user benchmarks a 4090 with 720p model achieving ~85s for 6s of video, noting substantial speed improvement over Wan 2.1 FusionX, positioning those newer models as the preferred standard.
    • The original poster provides complete workflow details for using Wan 2.1 FusionX (including model files, CLIP vision components, text encoders, and VAE links) and benchmarks the video generation performance: on a 4060TI 16GB and 32GB RAM, a 5-second video takes ~500 seconds, 7-8 seconds takes ~1000 seconds, and 10 seconds takes ~2600 seconds. The user notes these times are significantly higher compared to LTX Distilled models, highlighting practical performance trade-offs for longer video synthesis workflows.
    • Another user reports significantly better generation times with a different 720p model running on a 4090 GPU, where 6 seconds of video can be generated in around 85 seconds. This points to a substantial speed improvement both from optimized models and more powerful hardware.
    • A mention of the ‘new self forcing lora’ suggests ongoing active development in model architectures for rapid video generation; the comment implies that Wan 2.1 may be outperformed by latest advancements, though no benchmarks are provided yet for this LoRA approach.

3. Reflections on ChatGPT/Claude as Companions, Therapists, and Reality Advisors

  • This is why you should not use ChatGPT as a therapist (Score: 615, Comments: 627): The OP demonstrates through a contrived, one-sided prompt that ChatGPT consistently sides with the user in emotionally charged scenarios, reflecting back validation rather than challenging or balanced perspectives. The criticism is that default LLM deployments (not specifically fine-tuned for therapeutic neutrality or challenge) replicate user biases, creating an echo chamber effect and failing to perform the critical, reality-testing functions of human therapists. One user showed that with explicit prompt engineering (defining the system role as a balanced, objective therapist), results became much more nuanced and appropriately confrontational, illustrating that LLM outputs are highly sensitive to prompt context. There is debate in the comments about whether ChatGPT mirrors user bias by default, or if it is capable of providing counter-perspectives when instructed. Some users report that prompt specificity and reinforcement over time (asking for truth and balance regularly) can yield more challenging and honest responses. Others report positive experiences with ChatGPT for emotional support, but agree that it cannot replace the objectivity and accountability of professional therapy.
    • Several commenters highlight that ChatGPT’s responses can mirror the user’s biases unless carefully prompted. By explicitly guiding the model (e.g., priming it with a prompt emphasizing balance, professional objectivity, and emotional intelligence), users can obtain more nuanced and challenging feedback, akin to what would be expected from a competent therapist. The provided session example illustrates that detailed, balanced instructions enable the model to deliver insightful, multi-layered analysis acknowledging relationship dynamics, personal responsibility, and the difference between emotional support versus dependency.
    • There are observations on ChatGPT’s limitations: It works best for generic self-help advice or broadly applicable therapeutic principles, often echoing what human therapists advise for low-complexity issues. When users seek highly tailored or context-sensitive guidance without adequate prompt engineering, the model delivers less satisfactory or superficial advice. This underscores the importance of prompt specificity for complex personal problems.
    • Anecdotal feedback indicates that ChatGPT can have a significant positive impact for some users, such as reducing suicidal ideation or helping them manage daily routines and emotional regulation. However, users note ChatGPT’s inability to independently validate complex relational dynamics (e.g., confirming a partner’s intentions), instead offering reframing and reflection consistent with therapeutic best practices. The feedback compares ChatGPT favorably to average human therapists for foundational issues, but with the caveat against over-reliance or implicit trust without critical reflection.
  • I don’t care what anyone says. If you have no real life support system, ChatGPT IS helpful. (Score: 502, Comments: 260): The image is a screenshot from a ChatGPT conversation on mobile, where ChatGPT provides structured behavioral markers of emotional abuse (insults, threats, manipulation, etc.) and emphasizes that such relationships aren’t healthy or fixable without change from the abusive party. The post and comments highlight that, for people lacking social or therapeutic support, AI tools like ChatGPT can help users articulate and validate their experiences, sometimes serving as a reflective, non-judgmental conversational partner, much like an interactive journal. Several users mention leveraging ChatGPT as a supplemental tool in therapy or for self-insight, reinforcing its potential utility in mental health contexts when other avenues are inaccessible. Commenters argue that AI like ChatGPT can be uniquely valuable for self-reflection, validation, or piecing together patterns one may miss alone, with at least one therapist reportedly acknowledging its benefit for client progress. Some debate exists about the appropriateness of relying on AI for emotional support, but the top opinions here are strongly supportive given the context of limited alternatives.
    • A user shares a use-case describing how ChatGPT facilitates self-reflection and cognitive restructuring by acting as an interactive journal. In therapy, sessions sometimes leverage insights and actionable points generated during AI conversations, with some users even presenting structured feedback from ChatGPT to their therapists. This demonstrates an emerging workflow where AI-generated summaries inform mental health treatment directions.
    • Some comments highlight ChatGPT’s value in providing non-judgmental conversational support, especially for those lacking a traditional support system or experiencing social isolation. This illustrates a technical application of large language models for mental health contexts, distinct from passive journaling by offering adaptive, responsive dialogue tailored to the user’s expressed emotions or needs.
  • 5 lessons from building software with Claude Sonnet 4 (Score: 124, Comments: 38): The post details five practical strategies when using Claude Sonnet 4 for software engineering: 1) LLMs are unreliable for market validation due to indiscriminate positive bias; instead, prompt adversarial analysis, 2) Claude can serve as a credible CTO advisor when prompted for tech stack selection under explicit MVP, cost, and scalability constraints, 3) Attaching product specs and code to Claude Projects gives it persistent, holistic context, 4) Proactively monitor and manage token usage to avoid chat resets, employing systematic commit/handoff flows, and 5) Multi-file project debugging benefits from holistically prompting LLMs to consider dependencies and script-generated tracing/debugging tools to find cross-file bugs. These recommendations are derived from building a tax optimization tool for Australian investors. External resources: Claude Sonnet 4, Claude Projects. Comments discuss desktop integration for project-context with Claude, psychological prompting for critical analysis, and challenge the idea of building “enterprise-grade architecture” at MVP/validation stage—arguing that prototypes, not robust architectures, drive market validation, and that eventual adoption depends on professional build quality.
    • Effective use of Claude Sonnet 4 for software development requires careful prompt engineering: users note that instructing the model to act as a critical reviewer (e.g., impersonating a senior engineer) results in more robust design critiques, but if not moderated, this approach can lead to overwhelming negativity in responses.
    • When leveraging LLMs like Claude for technical tasks (such as fixing TypeScript errors), breaking down the job into granular, file-level requests significantly improves accuracy and usefulness compared to broad, project-level instructions. However, even in smaller scopes, the model may still miss some issues, highlighting limitations in reliability for automated code correction.
    • The comments stress that using LLMs for MVP (Minimum Viable Product) stages is more about market validation than robust architecture. Relying solely on language models for more complex or production-level work may introduce significant bugs and performance bottlenecks, reinforcing the necessity of experienced developers for scaling beyond prototypes.

AI Discord Recap

A summary of Summaries of Summaries by Gemini 2.5 Pro Exp

Theme 1. The Model Gauntlet: New Releases, Performance Showdowns, and Versioning Dramas

  • Gemini 2.5 Pro & Flash Arrive: Is “Stable” the New “Preview” Amidst Versioning Fiasco? Google announced the General Availability of Gemini 2.5 Pro and Flash models on OpenRouter’s platform and in the Gemini app with style improvements, though users on OpenRouter noted the GA version is a rebrand of the 06-05-preview model with Flash GA seeing input cost hikes. This rollout was accompanied by criticism of Google’s versioning (e.g., 0506 vs 0605 confusion), with one X user lamenting, “Not even mentioning OAI who are the jerks king when it comes to bad versioning” (see the X post).
  • Qwen Models Roar: From 360 Tokens/Sec Speeds to MoE VRAM Dieting! The Qwen model family saw significant action: Qwen 3 reportedly hit 360 tokens/second (Q4_k_m on ollama), sparking hardware curiosity. Discussions also covered finetuning Qwen 2.5 VL 7b Vision model with Unsloth AI and strategies for running the Qwen3 30b MoE on a 3090 by selectively loading active parameters.
  • Fresh Faces Kimi-Dev & Red Dots LLM Enter the Ring with Coding Prowess & GGUF Fixes! MoonshotAI’s new open-source coding LLM, Kimi-Dev-72B, made waves by achieving 60.4% on SWE-bench Verified, view on Hugging Face, reportedly outperforming existing open-source models through large-scale reinforcement learning. Meanwhile, Unsloth AI released the new Red dots LLM GGUF with fixes available on its Hugging Face page and detailed in this Reddit post on the GGUF fixes.

Theme 2. Powering the Prompts: Frameworks, Libraries, and Platforms Vie for Developer Hearts

  • Cursor IDE’s “Unlimited” Plan Unleashes User Confusion and Indexing Gremlins! Cursor users reported involuntary migrations to a new ‘unlimited-with-rate-limits’ plan, causing frustration as the Opus 4 Max mode usage appeared after the old 500 fast requests cap vanished. Separately, a new bug caused indexing to get stuck at 0%, rendering the IDE nearly unusable for some.
  • Optimizer Wars Heat Up: tinygrad & Torchtune Tinker Under the Hood! The tinygrad community tackled custom optimizer performance, with users noting get/set_state_dict slowdowns and clarifying the role of Tensor.assign and automated .realize() in tinyjit. Torchtune developers proposed designs for more hackable optimizer integration (e.g., Muon, SignSGD) and debated enforcing packed batches of size 1 to simplify operations, especially with flex attention, noting Muon’s performance against AdamW in a Qwen training PR.
  • DSPy & LM Studio Sharpen LLM Workflows While Tackling Tracking and Tooling! DSPy users investigated LM usage tracking discrepancies with Claude and Amazon Nova via Bedrock, and discussed handling tool exceptions within ReAct (potentially by subclassing dspy.ReAct or setting max_iters). LM Studio users learned to configure custom stop tokens via the Prompt Template UI (accessible via a gear icon on the models list screen as shown here) and can use the LMStudioWebUI on GitHub for multi-machine setups, while its Model Context Protocol (MCP) support remains in beta (MCP registration Google Form).
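As a minimal sketch of the ReAct tool-exception pattern mentioned above (our own illustration, not taken from the Discord; the model id, issue data, and tool are placeholders, and it assumes a current DSPy where dspy.ReAct takes plain callables plus a max_iters cap):

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model id

fake_tracker = {"PROJ-123": "Login 500s when the SSO token expires; fixed by refreshing tokens."}

def lookup_issue(issue_id: str) -> str:
    """Fetch an issue summary; return the error text instead of raising so ReAct can recover."""
    try:
        return fake_tracker[issue_id]
    except KeyError:
        return f"Tool error: no issue named {issue_id!r}; ask the user for a valid id."

# max_iters bounds how many thought/tool cycles the agent runs before it must answer.
agent = dspy.ReAct("question -> answer", tools=[lookup_issue], max_iters=5)
print(agent(question="What was the resolution for PROJ-123?").answer)
```

Swallowing the exception inside the tool keeps the trajectory alive; subclassing dspy.ReAct would instead let you intercept failures in the agent loop itself, which is the other option raised in the discussion.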

Theme 3. The Rise of the Agents: MCP Ecosystem Matures and New Tools Emerge

Theme 4. Silicon & Kernels: Hardware Battles and GPU Optimization Frontiers

  • AMD Roars: CEO Su Lauds GPU MODE, Fused Architectures Spark Curiosity! AMD CEO Dr. Lisa Su gave a notable mention to GPU MODE for its role in the world’s first $100K competitive kernel competition, details on GPU MODE news, highlighting the community’s impact. Discussions also delved into AMD’s fused CPU-GPU architectures like the MI300A, exploring its IOD and infinity cache design.
  • NVIDIA’s 5090 Whispers H100 Speeds; Groq’s “Car Factory” SRAM Debated Anew! The upcoming NVIDIA 5090 is reportedly nearing H100 performance in text generation tasks using ollama, even with seemingly lower TFLOPS, suggesting architectural optimizations. Meanwhile, Groq’s unique architecture, which avoids constant SRAM/HBM swapping and is likened by co-founder Johnathan Ross to a “car factory”, continued to be a topic of comparison against traditional HBM-based systems.
  • Kernel Gurus Gather in Paris, RadeonFlow Kernels Go Public on GitHub! The GPU MODE community scheduled its first European meetup in Paris on July 5th (Paris meetup details on lu.ma) to discuss kernel optimization. In open-source news, the RadeonFlow Kernels project was released on GitHub, inviting community contributions.

Theme 5. AI in the Wild: Creative Sparks, Community Buzz, and Platform Puzzles

  • AI Gets Creative: ChatGPT Images Hit WhatsApp, MiniMax Animates Web, Extend Fixes PDFs! ChatGPT’s image generation became accessible on WhatsApp via 1-800-ChatGPT, while MiniMax AI showcased its prowess in generating interactive web components like animated backgrounds and visualizations, with code available on the MiniMax-M1 GitHub repository. Separately, Extend secured $17 million to develop a document processing cloud for modernizing PDF handling, see X post.
  • Community Corner: Eleuther Tackles Tokenizer Glitches, Karpathy’s Wisdom Reconstructed! EleutherAI launched a new speaker series, kicking off with “Glitches in the Embedding” by Catherine, discussing tokenizer glitches in LLMs (EleutherAI Speaker Series Discord event). Latent Space reconstructed Andrej Karpathy’s AI talk (Latent Space X post on Karpathy’s talk), covering topics from Software 3.0 to Vibe Coding.
  • Platform Pain Points: Users Crave Credit Controls, Better Ingestion & API Access! Users across platforms voiced specific needs and encountered issues: OpenRouter users requested API key credit balances and faced Gemini 2.5 Pro’s new 128 token thinking budget requirement. NotebookLM users sought better website ingestion (a guide shared: How_to_capture_an_entire_website.pdf) and an API for its podcast creation feature, while Manus.im users requested a more flexible daily credit system.

Discord: High level Discord summaries

Cursor Community Discord

  • Cursor Chaos with ‘Unlimited’ Plan: Users are reporting involuntary migrations to a new ‘unlimited-with-rate-limits’ plan, causing confusion, frustration, and missing upgrade options, with some users noticing the Opus 4 Max mode usage after the old 500 fast requests cap vanished overnight.
    • Some long-time users expressed extreme frustration with the lack of clarity and communication from the company regarding the changes and pricing, while others are testing as much as possible to get a good view about the rate limits.
  • Indexing Bug Plagues Users: A new bug causes indexing to get stuck on syncing at 0% for hours, affecting all projects and rendering the IDE nearly unusable.
    • A user suggested that this bug may also affect the bug report feature on the forum, preventing users from uploading screenshots for bug reports, even as others burn through Opus 4 Max mode usage.
  • Background Agents Suffer Slow Starts: Users report slow background agent boot times of up to 30 minutes and responsiveness issues after setup.
    • Additionally, sudo permission requirements for commands like apt-get suggest a need for default sudo password configuration.
  • Slack Authentication Errors Plague Users: Users are encountering Slack authentication errors, with messages indicating that Slack authentication has expired, prompting them to relink from Slack.
    • One user mentioned seeing a success message popping up briefly before the error, suggesting possible duplicate requests, and another reported getting 500 errors on the Cursor API.
  • GitHub Integration Plagued with Glitches: Users are facing difficulties with GitHub integration, including issues with PR creation and integration with Slack response buttons.
    • One user reported that the default behavior creates a branch and pushes code but doesn’t create a PR, leading to a clunky workflow, while another suggested using the GH CLI to write PRs.

Perplexity AI Discord

  • Perplexity Memory Suffers Glitches: Members reported that Perplexity’s memory feature sometimes brings up irrelevant data, even after being disabled, like a user’s one-off request for Discord bot help.
    • Perplexity is aware and working on preventing the memorization of ‘one off’ things, with the memory feature turned off by default for Enterprise users.
  • GPT Image Generation Hits Limits: Users discuss the limits of GPT Image 1 generation on Perplexity, noting that the official help center states Pro subscribers have up to 150 uses per month but there may be soft limits.
    • Some users report the API shows a limit of 600 a day, though this number may regenerate with each question.
  • Models Fumble Decoding Challenge: Members attempted to decode a cipher using various AI models, including O3, 2.5 Pro, and Qwen, with many failing to crack the code.
    • One member shared a ChatGPT link where O3 purportedly solved it using tools, sparking debate on whether that constitutes cheating.
  • Sonar API Powers Discord Bot: A member announced they just made a discord bot with sonar api, opening possibilities for integrating Perplexity AI functionalities into Discord.
    • The integration could allow users to perform web searches and access AI-driven product development resources directly from Discord.
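As a rough sketch of what such a bot’s core call might look like (our own illustration, assuming Perplexity’s OpenAI-compatible endpoint and the sonar model name; the env var and wiring into a Discord command are placeholders):

```python
import os
from openai import OpenAI

# Perplexity's Sonar API speaks the OpenAI chat-completions protocol at its own base URL.
client = OpenAI(
    api_key=os.environ["PPLX_API_KEY"],          # placeholder env var
    base_url="https://api.perplexity.ai",
)

def ask_sonar(question: str) -> str:
    """One web-grounded answer; a discord.py message handler could simply forward user text here."""
    resp = client.chat.completions.create(
        model="sonar",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

print(ask_sonar("What did Google ship for Gemini 2.5 this week?"))
```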

LMArena Discord

  • Kingfall Takes the Crown From Blacktooth: Members debated whether Blacktooth or Kingfall is superior; some praise Blacktooth’s refinement and adherence to the thinking process, without syntax issues.
    • However, others preferred Kingfall’s ‘magical moments’ resulting from less post-training dilution, criticizing Blacktooth as overcorrected and trained on out-of-distribution data.
  • AI Giants Flunk Version Control?: Members discussed the lack of clear versioning practices among AI companies, referencing Gemini’s 0506 and 0605 confusion as an example of poor versioning.
    • One member on X complained about the versioning, saying, “Not even mentioning OAI who are the jerks king when it comes to bad versioning and confuse the hell out of the user” (example on X).
  • GPT-4.5 Gets Erased For Being Too Good: According to some, GPT-4.5 was deemed the best model for writing, so impressive that it was ultimately removed.
    • One user declared GPT-4.5 as the top choice, while others suggested Gemini 06/05 might be better, citing their bias as a Gemini fan.
  • ByteDance Building Impressive Video Generator: Discussions arose about whether video generators from China and TikTok are surpassing VEO3’s video quality, although they require Chinese ID verification.
    • Despite the constraints, one member noted that ByteDance still built a very impressive model, especially because they achieve lower prices than competitors in short human-preference evaluations.
  • Gemini Model Updates and Name Changes: The Google AI Studio platform underwent updates, which involved the removal of the 06-05 and 05-20 models alongside the introduction of new models such as Gemma 3n E4B.
    • The Gemini 2.5 Pro model was also renamed, prompting members to test and compare the new versions; complaints arose regarding simpleQA scores.

Unsloth AI (Daniel Han) Discord

  • Red Dots LLM GGUF Gets Fixes: The new Red dots LLM GGUF has been released, with fixes available on Hugging Face.
  • GRPO Reward Models Risk Self-Reward: Members suggest it’s unsafe to use the model itself for reward in GRPO because it may reward itself.
    • Instead, using the reference model or an on-policy reward model trained off the initial checkpoint is recommended, as LLMs struggle with judging without specific fine-tuning.
  • Vector Embeddings Beat Jira Finetuning: For generating issue resolutions from IDs, it was recommended that vector embeddings or RAG would be more suitable instead of fine-tuning a LLaMA 3.2B model with 4000 Jira issues.
    • Generating simple notes is difficult without context, and using a vector DB like Chroma was recommended as a starting point (a minimal Chroma retrieval sketch follows at the end of this list).
  • Llama 3 Thrives on Default Templates: When finetuning Llama 3 8B with multi-role prompts, sticking to the default chat template and placing speaker roles in the body of the text can avoid confusing the model.
    • An example template given was <|start_header_id|>assistant<|end_header_id|> Char1: Char2: .
  • Optuna Optimizes Tuning: A member reported their first experience with Optuna, sparking a discussion on its usefulness.
    • This led to reflections on improving research papers with hyperparameter tuning.
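For the Jira-resolution item above, a minimal retrieval sketch with Chroma might look like this (our own illustration; the issue keys and texts are made up, and the default embedding function is used for brevity):

```python
import chromadb

# Index past issues + resolutions once, then retrieve the closest matches for a new ticket
# and hand them to the LLM as context instead of fine-tuning on the 4000 Jira issues.
client = chromadb.Client()                       # in-memory; PersistentClient(path="./jira_db") to keep it
issues = client.create_collection("jira_issues")

issues.add(
    ids=["PROJ-101", "PROJ-102"],                # hypothetical issue keys
    documents=[
        "Login page returns 500 when the SSO token expires. Fixed by refreshing tokens server-side.",
        "CSV export drops non-ASCII characters. Fixed by forcing UTF-8 in the writer.",
    ],
)

hits = issues.query(query_texts=["export mangles unicode customer names"], n_results=2)
print(hits["ids"][0], hits["documents"][0])      # top matches to paste into the generation prompt
```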

OpenAI Discord

  • Images from ChatGPT Pop on WhatsApp: ChatGPT image generation is now accessible on WhatsApp via 1-800-ChatGPT, extending its reach to all WhatsApp users.
    • This integration provides a convenient way for users to generate images directly within their WhatsApp conversations.
  • Gemini 2.5 Pro Revamps Style in App: Gemini 2.5 Pro gets style and structure improvements in the Gemini app for more creative responses and better formatting, as reported in a Google blog post.
    • Members believe that the core model remains unchanged, despite the improvements.
  • Midjourney Falls Behind Sora, Imagen 4: Users claim Midjourney now lags behind Sora, particularly with version 7, which is plagued by artifacts and inaccuracies.
    • One user pointed out that Imagen 4 surpasses both, offering 16:9 ratio generation.
  • Engineer Seeks Hot AI Startup Tips: An engineer asked for advice on launching an AI startup leveraging GPT models.
    • Another member suggested exploring software engineering AI tools like lovable for turning ideas into reality through vibe coding.
  • Electron App Faces Message Deluge: A user reports performance constraints in the Electron app when managing 500+ messages and has developed a POC for lazy DOM rendering and scroll virtualization.
    • The user has shared their findings with support and is willing to provide the POC to frontend developers.

Eleuther Discord

  • EleutherAI Discord Gets Speaker Series: The community is starting a new speaker series to highlight recent work by its members, starting with Glitches in the Embedding by Catherine, addressing tokenizer glitches in LLMs; more info at the Discord event.
    • Catherine will discuss her recent projects focused on fixing problems with tokenizers in LLMs during the inaugural speaker series event.
  • Torch Compilation Needs Input Shape: Members determined it is impossible to make torch compiled models work for different input sizes without recompilation due to dynamic shapes not being fully supported.
    • The consensus suggested using Triton or utilizing NVIDIA’s TensorRT with manually created optimization profiles for anticipated dimensions as potential workarounds.
  • Linear Attention is Sensitive: Members argued that linear attention benefits from normalization or a denominator, regardless of whether it is QKV attention or not.
    • There was discussion about different normalization strategies, such as LayerNorm vs. using a denominator, with the conclusion that normalization might be unnecessary due to a limited window size.
  • TaskManager Manages Custom Task Configs: A member proposed using TaskManager to load task configurations from a custom folder, allowing modification of the task configurations to load datasets from a desired location, instead of the default lm_eval/tasks/.
    • This involves copying the adapted task configs to a consolidated folder and initializing TaskManager with include_path and include_defaults=False to prevent duplicates.
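
A minimal sketch of the TaskManager approach described above, assuming a folder of modified task YAMLs; the task name and model choice are illustrative only:

```python
import lm_eval
from lm_eval.tasks import TaskManager

# Load task configs from a custom folder instead of the default lm_eval/tasks/
tm = TaskManager(include_path="/path/to/my/task_configs", include_defaults=False)

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # hypothetical model
    tasks=["my_custom_task"],                        # name defined in your YAML
    task_manager=tm,
)
```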

HuggingFace Discord

  • Qwen 3 Flies at 360 tokens a Second!: Users reported Qwen 3 running at 360 tokens a second using Q4_k_m on ollama, sparking interest in its performance.
    • Members inquired about the hardware setup, one user wondered if they were “using those cards for $20 to make it happen?”
  • 5090 Challenges H100 Supremacy: The 5090 is reportedly nearing H100 performance in text generation tasks using ollama, with users finding it “really close in text generation, at least in ollama”.
    • Despite a seemingly lower TFLOPS, the 5090 demonstrates comparable tokens/second in specific configurations, hinting at optimized architecture for certain workloads.
  • DataSeeds Dataset Blossoms for LlaVA-Next: A dataset of 7,772 expert-annotated photography quality images is trending, shared at DataSeeds dataset.
    ‱ Fine-tuning against LlaVA-Next using the dataset achieved a relative improvement of 24.09 (BLEU-4), highlighting its potential in enhancing visual understanding.
  • Gradio Hackathon Awards Community Gems!: The Gradio Agents & MCP Hackathon crowned its winners from 630+ submissions, rewarding innovative uses of Gradio Agents and MCPs.
  • HF Agents course has Rate Limit Issues: Users reported encountering 403 errors and rate limit errors while browsing Hugging Face and taking quizzes in the agents-course channel.
    • One user reported being rate limited while doing lite documentation review and clicking around on hugging face.

OpenRouter (Alex Atallah) Discord

  • Gemini 2.5 Pro and Flash Go Stable, Rename Model: Gemini 2.5 Pro and Flash are now stable on OpenRouter but the GA version is simply a naming change from the 06-05-preview model with no performance improvements.
    • OpenRouter also released Gemini 2.5 Pro Deepthink and Gemini 2.5 Flash Lite.
  • Google Hikes Prices on Gemini 2.5 Flash GA: Google increased input prices and reduced output thinking prices for Gemini 2.5 Flash GA, eliminating the non-thinking discount.
    • Users are considering sticking with 2.0 Flash or switching to other models due to the increased cost.
  • OpenRouter to Launch BYOK Subscription: OpenRouter will transition to a subscription model for BYOK, aiming to lower costs for high-volume users.
    • Concerns arise about potential cost increases for low-volume BYOK users as a result of this transition.
  • OpenRouter Users Beg for Key Credit Balances: Users are requesting the ability to set a credit balance for API keys on OpenRouter.
    • This feature would enable users to limit spending for specific keys; Toven indicated that this will be discussed.
  • Gemini 2.5 Pro Demands Thinking Budget: The stable Gemini 2.5 Pro version needs a thinking budget of at least 128 tokens, conflicting with OpenRouter’s default of 0.
    • This mismatch is causing API errors and is currently being addressed by OpenRouter.
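
A minimal sketch of how a caller could pass a thinking budget explicitly while the default is being fixed, using OpenRouter’s OpenAI-compatible endpoint; the reasoning field shape and the model slug are assumptions, so check OpenRouter’s current API reference:

```python
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

resp = client.chat.completions.create(
    model="google/gemini-2.5-pro",  # assumed model slug
    messages=[{"role": "user", "content": "Summarize the Gemini 2.5 tech report."}],
    extra_body={"reasoning": {"max_tokens": 256}},  # assumed field; keep >= 128 for this model
)
print(resp.choices[0].message.content)
```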

LM Studio Discord

  • LM Studio Configures Custom Stop Tokens: To add a custom stop token in LM Studio, navigate to the Prompt Template section in the Chat UI, accessible via the gear icon on the models list screen, as shown here.
    • The My Models Folder retains the model name after deletion within LM Studio, although Ollama model files are incompatible with LM Studio due to format differences.
  • LM Studio API Extends to WebUI: Direct backend support for LM Studio across multiple machines is currently unavailable; users can leverage the LMStudioWebUI as an alternative.
    • Additionally, the Model Context Protocol (MCP) support in LM Studio is in beta, with more details on hardware-specific rollouts available via the MCP register Google Form.
  • Cloud GPUs Outpace Local Setups: Members considered renting powerful GPUs from cloud providers like AWS, GCP, or RunPod instead of building a local multi-GPU setup due to budget constraints.
    • Suggestions included using Docker for portability and efficient deployment, coupled with discussions on storage solutions like drive volumes, cloud storage, and VPNs.
  • Used 3090 Card a Budget Goldmine: When asked for advice on building a budget-friendly system for running LLMs and Ollama, members suggested using a used 3090 due to its balance of performance and cost.
    • It was noted that such a setup requires only a motherboard with an x16 PCIe slot.
  ‱ Framework Mainboards Enable Local AI Builds: The utility of Framework mainboards for local AI server builds was examined, with links provided to the Framework Marketplace and a guide to local AI server builds.
    • Members cautioned that these boards are frequently on pre-order or sold out, limiting immediate availability.

Latent Space Discord

  • OpenAI’s MCP Support: Not Yet for Everyone: OpenAI’s new Model Context Protocol (MCP) support in ChatGPT is limited to ‘search’ and ‘fetch’ tools, primarily for Team, Enterprise, Edu admins, and Pro users, within specific server configurations (link).
    • While not yet general MCP support, speculation suggests broader access may follow.
  • MiniMax AI Generates Web Components: MiniMax AI demonstrated its proficiency in generating animated particle backgrounds, interactive web apps, and visualizations, showcasing practical AI capabilities (link).
    • Users praised MiniMax-M1’s open-source return, competitive results, and production-readiness; code available on GitHub.
  • Microsoft and OpenAI Tensions Flare: Reportedly, tensions are escalating between OpenAI and Microsoft, with potential accusations of anticompetitive behavior and disputes over OpenAI’s Windsurf acquisition (link).
    • OpenAI risks losing $20B in investment if it doesn’t transition to a for-profit structure this year.
  • Extend Secures Funding to Tame PDFs: Extend garnered $17 million to develop a document processing cloud, aimed at modernizing PDF handling (link).
    • Used by companies like Brex and Square, Extend provides infrastructure and tooling for document ingestion with automated config generation and a sandbox mode.
  • Karpathy’s AI Wisdom Reconstructed by Latent Space: Latent Space reconstructed Andrej Karpathy’s AI talk (link), including compiled slides and notes on topics from Software 3.0 and LLM analogies to Human-AI Generation-Verification Loops.
    • Full slides are available to Latent Space subscribers, emphasizing key concepts such as LLM Psychology, Partial Autonomy, and Vibe Coding.

aider (Paul Gauthier) Discord

  • Role Prompting Boosts Creative Juices: Adding personas to prompts enhances creative writing, especially when the code critic incorporates a zesty approach to roasting devs, as highlighted in this blog post.
    • The key is to make the persona relevant to the task at hand, injecting humor and expertise for optimal results.
  • Aider Agents Tooling Gets Context: Members suggest pairing SmolAgents from HF or IndyDevDan’s single file agents (github link) with the Aider repo map and scripting for enhanced context control, potentially using Claude and Gemini for specialized tasks.
    • Alternatives like RAG search were also mentioned to bypass the repomap.
  • Gemini 2.5 Pro Officially Lands: Gemini 2.5 Pro and Flash are now generally available, exiting the preview phase, with Flash Lite joining the lineup, though its coding prowess is under scrutiny.
    ‱ Pricing discussions suggest costs could be roughly one-fifth of O3/Claude despite higher per-token rates, with one user reporting $2-3 in daily spending while processing 150M tokens per week.
  • Grok 3 Mini Shows Off Reasoning Muscle: A member expressed admiration for Grok 3 mini’s reasoning capabilities, contrasting it with Deepseek v3, deemed ‘dumb’.
    • They stated that Claude and Gemini are the only models they really trust.
  • Qwen3 30b MoE Seeks VRAM Diet: A member inquired about loading only active parameters into VRAM to run the Qwen3 30b MoE model on a 3090 GPU, targeting Q8 without speed dips, aiming to load only active parameters based on the prompt.
    • This involves skipping layers unnecessary for the given task, such as grammar layers for coding tasks, to squeeze the model into the GPU while preserving speed.

GPU MODE Discord

  • GPU MODE Gets AMD CEO Dr. Lisa Su Kudos: AMD CEO Dr. Lisa Su explicitly called out GPU MODE for enabling the world’s first $100K competitive kernel competition during a recent stage appearance; details can be found at the GPU MODE news page.
    ‱ They noted that GPU MODE has evolved from a humble reading group into a machine capable of generating more kernel data than exists on all of GitHub combined, with community members outperforming the best human experts.
  ‱ Groq’s ‘Car Factory’ SRAM Strategy Compared to HBM: A member compared Groq’s strategy to Cerebras, noting that Jonathan Ross helped design the first TPU and that their architecture avoids constant swapping between SRAM and HBM during inference.
    ‱ The member noted that Ross refers to their approach as a car factory, likening the HBM/SRAM approach to setting up and tearing down a car factory around each car.
  • First Euro Meetup Kernel Optimizations in Paris: The first European meetup is scheduled for July 5th in Paris, as announced on lu.ma/fmvdjmur.
    • A member will be present to discuss kernel optimization strategies during the meetup.
  • AMD Fused CPU-GPU Architecture Explored: Members discussed the architecture of fused AMD CPU-GPU platforms, specifically the MI300A, IOD, and infinity cache.
    • A member mentioned wanting to test a particular path or pressure on one of the IODs, requesting tips and resources.
  ‱ RadeonFlow Kernels Goes Open Source on GitHub: The project RadeonFlow Kernels is now open source on GitHub, welcoming feedback and suggestions.
    • The author is encouraging users to star the project on GitHub if they find it helpful.

Yannick Kilcher Discord

  • Interview with Edward Witten Anticipated: Members speculated whether Lex Fridman will interview Edward Witten, seeing it as a better fit than focusing on model collapse fearmongering.
    ‱ Some doubted it, noting Witten’s focus on math problems, similar to the Terence Tao interview, which some felt was too narrow, while others appreciated its STEM relevance.
  • YouTube Show Aims for Technical Depth: A member proposed a YouTube show featuring technical and specific interviews based on recent papers across various fields, but admitted this would need to be daily content.
    • Another user said that the interview format would be a huge grind to execute.
  • DeepSpeed Checkpoint Conversion Headache: A user encountered errors converting a DeepSpeed Stage 3 checkpoint to a universal format after altering the number of GPUs, receiving the error assert len(self.ckpt_list) > 0.
    • The user is seeking assistance, requesting repos or guides to troubleshoot the conversion process.
  • Gemini v2.5 Technical Report Debuts: DeepMind released the Gemini v2.5 Technical Report, sparking discussion among members in the voice channel, with specific members leading the discussion.
    ‱ One member also offered a rundown of a presentation about predictive coding, which they summarized as: if you understand how to calculate a square root by guess and check, you roughly understand backpropagation.
  • McKinsey GPT Triggers Automation Concerns: The new McKinsey GPT tool is out, potentially automating cost cutting via layoffs, according to a member, with links provided to the original tweet and a Nitter mirror.
    ‱ Richard Sutton criticized McKinsey’s GPT as being primarily focused on “How can McKinsey help my business make money?” in his tweet, while other members found the tool’s announcement underwhelming.

Notebook LM Discord

  ‱ NBLM Website Wrangling Woes: A user highlighted that NotebookLM (NBLM) requires every link of a website to be provided for effective analysis; otherwise it only analyzes the first page.
  • NBLM Podcast Powerplay Proposed: A user expressed interest in accessing NotebookLM’s podcast creation capabilities via an API or MCP.
    • The user cited struggles achieving consistent output in podcast creation.
  • NBLM Navigates K-12 Knowledge Needs: A user inquired about common use cases leveraging NotebookLM in education, specifically for a presentation to K-12 educators.
    • This reflects potential applications and growing interest in NotebookLM as an educational tool.
  ‱ Mobile App Access Mayhem: A user reported being denied access in the NotebookLM mobile app.
    ‱ They asked whether the mobile app is no longer free.
  • Notebook Sharing Snafu Surfaces: A user reported experiencing difficulty sharing notebooks via URL and email.
    • They sought confirmation from other users about the reproducibility of this issue.

Torchtune Discord

  ‱ Torchtune Skips the Logo: Due to the absence of a dedicated Torchtune logo, the team will use the PyTorch logo for presentations.
    ‱ Falling back to the PyTorch logo keeps branding consistent across presentations.
  • Torchtune Recipes Seek Optimizer Hackability: To enhance the customizability of recipes, a member proposed an improved design for adding new features with less overhead, particularly for optimizers like Muon, SignSGD, and fused methods.
    • The goal is to attract researchers by providing an API that simplifies feature addition, making torchtune more hackable and competitive with libraries like TRL.
  • Packed Batches Get a Size Restriction: A member suggested enforcing packed batches to be of size 1, arguing that larger batch sizes are unnecessary because items can be stacked contiguously, simplifying operations.
    • Another member considered this favorably, suggesting the simplification aligns well with constraints like always using flex attention, ideally making torchtune packed=True only.
  • Flex Attention Tussles with SDPA: A discussion about flex attention versus SDPA + flash attention 3 arose, focusing on concerns about nested tensors and their interaction with distributed, quantization, and compile features.
    ‱ Performance benchmarks revealed flex attention achieving 10k TPS, while a normal mask yielded only 2k TPS, highlighting the impact of different attention mechanisms (a minimal flex attention sketch appears at the end of this section).
  • Muon Optimizer Faces Doubts in Benchmarks: Interesting results about Muon from a pull request indicated that AdamW might outperform it.
    • The PR author noted this was expected because Qwen wasn’t pre-trained with Muon.
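
A minimal flex attention sketch for the packed-batch, document-masking case discussed above, assuming a CUDA device and PyTorch 2.5+; the document ids and tensor shapes are illustrative only:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

S = 256  # packed sequence length, kept a multiple of the default 128 block size
doc_ids = torch.arange(S, device="cuda") // 128  # two hypothetical 128-token documents

def doc_causal(b, h, q_idx, kv_idx):
    # causal attention that never crosses document boundaries
    return (q_idx >= kv_idx) & (doc_ids[q_idx] == doc_ids[kv_idx])

block_mask = create_block_mask(doc_causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")
q = k = v = torch.randn(1, 8, S, 64, device="cuda")  # (batch, heads, seq, head_dim)
out = flex_attention(q, k, v, block_mask=block_mask)
```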

Nous Research AI Discord

  • Kimi-Dev-72B Claims Coding Crown: Kimi-Dev-72B, MoonshotAI’s new open-source coding LLM, achieved a 60.4% performance on SWE-bench Verified, outperforming existing open-source models.
    • The model is optimized via large-scale reinforcement learning, autonomously patching real repositories in Docker and gaining rewards when the entire test suite passes.
  • Hyperparameter Hints Hit the Mark: Members discussed hyperparameter determination for models around 8B params or higher, with one suggesting a final loss of around 0.4-0.5 for SFT.
  • Crafting Custom MCP Tools Made Manageable: Members discussed the ease of creating custom MCP tools using fastmcp or the Model Context Protocol SDK documentation.
    ‱ One member also mentioned upcoming lessons to be shared in the appropriate channel; a minimal FastMCP tool sketch is included at the end of this section.
  • Google Expands Gemini 2.5 Family: A member shared a link to Google’s blog announcing the expansion of the Gemini 2.5 model family.
    • Further details about the expansion weren’t specified in the shared link.
  • Tweets Tease Tech World: A member shared a tweet about Real Azure from TNG Technology.
    • Another member also shared a tweet from Jxmnop.
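
A minimal FastMCP tool sketch, as referenced in the custom MCP tools item above; the server name and tool are hypothetical, and the same pattern works with the official Model Context Protocol Python SDK:

```python
from fastmcp import FastMCP  # or: from mcp.server.fastmcp import FastMCP

mcp = FastMCP("jira-notes")  # hypothetical server name

@mcp.tool()
def summarize_issue(issue_id: str) -> str:
    """Return a short summary for a (hypothetical) issue id."""
    return f"Summary for {issue_id} goes here."

if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport
```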

Modular (Mojo đŸ”„) Discord

  ‱ Zen 4 BFloat16 Implementation Questioned: Members discussed Zen 4 (Ryzen 7000 series) introducing bfloat16 support, referencing a wikichip.org article to confirm, though doubt remains about whether it fully supports the FMA instructions required for CPU inference.
    ‱ One member skipped newer CPUs in favor of a 5950X, saving $500, and plans to use cloud machines for heavy tasks.
  • Intel’s Nova Lake Packs a Punch: The next ‘compile sku’ is expected to be Intel Nova Lake, potentially featuring a top-end SKU with 52 cores.
    • While previous-gen Threadripper had 64 cores, Nova Lake aims to bring high core counts to the consumer market with the i9 series, given Intel’s HEDT is now essentially ‘buy a Xeon’.
  • Llama-3.2 Vision Instruct Model Fails to Load: A member encountered errors loading the Llama-3.2-Vision-Instruct-unsloth-bnb/4B model, suspecting issues with deserialization or incorrect quantization related to NF4 support.
    • A Modular team member suggested that the 4B GPU: F32 designation might be a mistake.
  ‱ Mojo Eyes Open Source: Mojo is planned to be open sourced Soonℱ, which is intended to put it in the same space as Zig, Odin, Rust, and C3 in terms of low-level capability, so it could be used to write OS kernels and similar systems code.
    • The community expressed excitement at the prospect of Mojo being rust but with python syntax.
  ‱ Mojo Classes Far Out: Classes are a ways off in Mojo, as Variant and favoring composition over inheritance cover most of the places where classes would otherwise be wanted.
    • According to the team, there are more pressing things at the moment than classes, as described in the dynamisity proposal.

tinygrad (George Hotz) Discord

  • tinyxxx GitHub Stars Get a Fix: A member requested a fix for the GitHub stars count on the tinyxxx repo as it approached 30,000.
    • Another member quickly submitted a PR and commit to resolve the issue.
  ‱ Tinygrad Asks Smart Questions: In response to a question, a member shared the ‘How To Ask Questions The Smart Way’ article.
    • The question asker appreciated the guidance.
  • DIY Optimizer Causes Delay: A user implementing a custom gradient descent optimizer reported that using get/set_state_dict is slow, taking ~500 ms to load one linear layer, and sought a faster approach.
    • It was noted that the Tensor.assign function within Optimizer.schedule_step is responsible for mutating parameters.
  • Demystifying .realize() and .contiguous(): A user expressed confusion about when to use .realize() and .contiguous() for performance optimization, observing inconsistent results when arbitrarily inserting them.
    • A member clarified that tinyjit automatically realizes returned values, advising against manual .realize() calls within jitted functions.
  • One-shot State Dict Saves Time: A member suggested getting the state dict only once and passing the list of parameters to the optimizer for storage to avoid delays.
    • This way the optimizer’s list of parameters will point to the same tensors in the model itself.
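
A minimal sketch of the one-shot state dict pattern, assuming tinygrad’s nn.state helpers; the model below is a hypothetical single linear layer:

```python
from tinygrad import Tensor, nn
from tinygrad.nn.state import get_state_dict
from tinygrad.nn.optim import SGD

class TinyNet:
  def __init__(self): self.l1 = nn.Linear(784, 10)
  def __call__(self, x: Tensor) -> Tensor: return self.l1(x)

model = TinyNet()
params = list(get_state_dict(model).values())  # walk the state dict once, up front
opt = SGD(params, lr=0.01)  # the optimizer now references the same tensors as the model
# opt.step() later mutates those tensors in place (via Tensor.assign), so no re-fetch is needed
```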

Manus.im Discord Discord

  • Manus Edu Pass Wanted by Students: Users in Europe and the US expressed a desire for their schools to provide an edu pass for Manus, similar to what some US universities offer.
    • One member specifically wished their school had an edu pass for Manus.
  • Manus Team to Integrate Claude 4: A member expressed the wish for the Manus Team to update to Claude 4 soon.
    • No further information or discussion points were made.
  • User Requests More Flexible Credit System: A user inquired about the possibility of having more daily credits as opposed to monthly credits in paid plans, suggesting a potential implementation.
    • They sought to determine if adjustments to the credit system were feasible.
  • WebP to PNG Conversion: A member asked for recommendations on converting multiple images from WebP to PNG format.
    ‱ The member resolved the issue independently shortly after the initial query; a minimal Pillow-based conversion sketch is included at the end of this section.
  • Traffic causes Website Generation issues: A user reported spending an afternoon and a significant amount of credits attempting to generate a simple webpage, eventually resorting to manual file editing.
    ‱ They also wondered whether certain times of day have less traffic, since high traffic burns credits and causes generation to stop working.
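
For anyone facing the same WebP-to-PNG batch conversion mentioned above, a minimal Pillow sketch (the images/ folder name is a placeholder):

```python
from pathlib import Path
from PIL import Image

# Convert every .webp in a folder to a .png next to the original file
for src in Path("images").glob("*.webp"):
    Image.open(src).convert("RGBA").save(src.with_suffix(".png"))
```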

MCP (Glama) Discord

  • Docker Launches MCP Catalog and Toolkit: Docker announced the beta release of MCP Catalog and Toolkit, featuring verified servers and deployment options, outlined in their blog post.
    • The release aims to streamline the deployment and management of MCP servers within Docker environments.
  • Block Unveils MCP Server Design Playbook: Block shared their playbook for designing MCP servers, detailing what has worked and what hasn’t after building 60+ MCP servers, documented in a blog post.
    • The playbook offers practical advice for constructing smarter tools for AI agents based on Block’s experiences.
  • AWS Lambda Faces Inspector Comms Challenges: A member reported difficulties getting their AWS Lambda function to communicate with Inspector v0.14.1 via the MCP server and is seeking a network trace of a working (Non SSE) HTTP streaming protocol to diagnose the issue.
    • The challenges involve establishing proper network communication between the Lambda function and the Inspector instance.
  • Text-to-GraphQL MCP Opens Doors to Queries: Arize AI open-sourced a Text-to-GraphQL MCP server that transforms natural language queries into GraphQL queries, and integrates with AI assistants like Claude Desktop and Cursor (GitHub, blog).
    • This addresses challenges of using large GraphQL schemas with LLMs by enabling agents to directly extract necessary fields and types.
  • Attendee MCP offers Cheaper Meeting Bot alternative: A member introduced Attendee MCP, an open-source, self-hostable meeting bot server, presenting it as a cheaper alternative to Recall.ai, with a link to the GitHub repository.
    ‱ This tool aims to allow users to automate meeting attendance and transcription without incurring high costs from commercial services.

LlamaIndex Discord

  • LlamaIndex brings AI Agents to SF: LlamaIndex is hosting an event in San Francisco with @seldo, Ravenna, and @auth0 to share best practices for building and securing AI Agents in production, sign up here.
    • The event is focused on getting Agents into the hands of real-world users.
  ‱ Multi-Agent Financial Analysis System using LlamaIndex: Hanane D. shared a notebook on LinkedIn about building a multi-agent financial analysis system using LlamaIndex.
    • The multi-agent system includes a Fundamental Agent, Profitability Agent, Liquidity Agent, and Supervisor Agent.
  • Block Proposes Model Context Protocol (MCP) Servers: Block’s engineering team shares their systematic approach to creating MCP servers that integrate seamlessly with Claude and other AI systems in this blogpost.
    ‱ The playbook proposes building better AI assistants using Block’s proven design patterns!
  • Vertex AI Integration Missing Async Streaming Support: Members discussed the absence of async streaming support in the Vertex AI integration within LlamaIndex.
    • It was pointed out that Vertex is deprecated, and google-genai is the latest Google LLM library with streaming support, but the async streaming feature was never implemented in Vertex AI due to lack of demand or contributions.
  • Users Hack ReActAgent Generation: A member inquired about programmatically forcing the generation of a ReActAgent, possibly through parsing outputs with a Pydantic object.
    • The response indicated that manually parsing the output is currently the only available method.

DSPy Discord

  • Databricks Talk Bootlegged: A user shared a YouTube link to a Databricks talk, jokingly urging discretion.
    • The relevance of the talk to DSPy was not specified, but the suggestion was â€œđŸ€« don’t tell Databricks”.
  • DSPy LM Usage Tracking Data Discrepancies: A user investigated how DSPy tracks LM usage, noting issues with data received from Claude and Amazon Nova models via Bedrock.
    • They observed that Claude provides completion_tokens but lacks prompt_tokens, while Amazon Nova returns no usage data at all, pointing to discrepancies in utils/usage_tracker.py.
  • Roo Code and DSPy Forge Partnership?: A user inquired about using DSPy to optimize Roo Code’s custom models and agents, pointing to potential integration.
    • This suggests potential application of DSPy in enhancing the performance of custom-built RAG agents.
  • DSPy ReAct Gets Sidetracked: A user asked about how DSPy handles exceptions in tools, especially within ReAct, noting that exceptions are passed to the LLM rather than terminating the loop and linking to relevant line in the DSPy source code.
    • A member suggested subclassing dspy.ReAct and overriding the forward or aforward method, or setting max_iters to a low number, as LLMs often use exceptions to correct input errors and retry.
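
A minimal sketch of the two suggestions above; the lookup tool is hypothetical, and whether the returned prediction exposes a trajectory depends on your DSPy version:

```python
import dspy

def lookup(term: str) -> str:
    """Hypothetical tool: raising here lets ReAct feed the error back to the LLM."""
    raise ValueError(f"unknown term: {term}")

# Option 1: cap the loop so repeated tool failures can't run forever
agent = dspy.ReAct("question -> answer", tools=[lookup], max_iters=3)

# Option 2: subclass and override forward (or aforward) to change exception handling
class StrictReAct(dspy.ReAct):
    def forward(self, **kwargs):
        result = super().forward(**kwargs)
        # Inspect the result here (e.g. its trajectory, if your version exposes it) and raise if needed
        return result
```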

Cohere Discord

  • Cmd-R Weights Defy Expectations With Longevity: A user requested an update to the Cmd-R weights, praising the 0824 version for its lasting quality nearly a year after release.
    • The user stated its performance remains competitive almost a year after release, a unique attribute among open weight models, implying that most open weight models quickly become outdated.
  • KGeN Seeks Partnerships for Decentralized Distribution: Abhishek from KGeN’s Partnerships team introduced their project, building the world’s largest decentralised distribution protocol with 24.8M+ human-verified users.
    • He expressed interest in connecting with the Business or Marketing team to explore a potential collab.
  • Cohere Discord Welcomes New Members: A stickied message on the Discord encourages new members to introduce themselves and provides a template to follow.
    • The template prompts members to share their Company/Industry/University, what they’re working on, favorite tech/tools, and what they hope to gain from the community.

Nomic.ai (GPT4All) Discord

  • New Member Joins, Plans Post-Download Nap: A new member introduced themselves, mentioning they waited out the 10-minute period and are planning a nap after downloading the project’s ZIP file.
    • They did not specify which ZIP file they downloaded.
  • User Preps PDF Question Answering: A new member indicated they have some PDFs and are interested in asking questions about them.
    • No details about the PDFs or the questions were provided.

LLM Agents (Berkeley MOOC) Discord

  • Sp25 MOOC Quiz Archive Debuts: A quiz archive has been launched for the Sp25 MOOC, offering a collection of previous quiz questions and answers.
    • The archive is accessible via the course website in the Quizzes section, aiding students in their assessment preparation.

The MLOps @Chipro Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The Codeium (Windsurf) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The Gorilla LLM (Berkeley Function Calling) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The AI21 Labs (Jamba) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.




Discord: Detailed by-Channel summaries and links

Cursor Community ▷ #general (971 messagesđŸ”„đŸ”„đŸ”„):

Gemini 2.5 Pro model, 500 fast requests plan, OpenAI platform sidebar changes, Model Merging, Indexing Bug

  • Cursor Users Clinging to 500 Fast Requests Plan!: Users express regret over cancelling previous subscriptions with the 500 fast requests plan, as they try to maintain it for as long as possible.
    • Some users reported issues with the dashboard, with the fast requests meter disappearing after refreshing, leading to cancellation emails and uncertainty about the new plan.
  • Unlimited-with-rate-limits causes chaos and bugs!: Many users are confused and angry after finding that their accounts were involuntarily migrated to the new “unlimited-with-rate-limits” plan, with missing “upgrade to ultra” option on the dashboard and no clear way to revert.
    • Some long-time users expressed extreme frustration with the lack of clarity and communication from the company regarding the changes and pricing, while still others report that the new MAX mode is significantly faster and better value.
  • Infinite Indexing causes infinite pain: Users reported a new bug with the indexing system, with indexing stuck on syncing at 0% for hours, affecting all projects and making the IDE almost unusable.
    • One user suggested that this bug may affect also the bug report feature on the forum, preventing users from uploading a screenshot for the bug report.
  ‱ Users burn Opus after the update!: Some users are burning through Opus 4 Max mode after noticing that the old 500 fast requests cap vanished overnight.
    ‱ Others are testing as much as possible to get a clear view of the rate limits, hoping the rate limit isn’t worse than the old slow pool.

Cursor Community ▷ #background-agents (22 messagesđŸ”„):

Background Agents setup, Slack Auth issues, Github integration issues, Snapshot sharing issues, Cursor version issues

  • Background Agents Face Booting Delays: A user reported that their first experience with background agents involved a 30-minute boot time and unresponsiveness to follow-up questions.
    ‱ They also ran into sudo permission requirements when running apt-get commands, indicating a need for a default sudo password setup.
  • Slack Authentication Plagued by Errors: Users are encountering Slack authentication errors, with messages indicating that Slack authentication has expired, prompting them to relink from Slack.
    • One user mentioned seeing a success message popping up briefly before the error, suggesting possible duplicate requests, and another reported getting 500 errors on the Cursor API.
  • GitHub Integration Glitches Emerge: Users are facing difficulties with GitHub integration, including issues with PR creation and integration with Slack response buttons.
    • One user reported that the default behavior creates a branch and pushes code but doesn’t create a PR, leading to a clunky workflow, while another suggested using the GH CLI to write PRs.
  • Snapshot Visibility Troubles for Team Members: A user is experiencing confusion around snapshots, noting that teammates cannot see their committed environment.json snapshot.
    • The system returns a ‘Snapshot not found’ error, and the team is investigating why collaborators with access to the repository cannot access it.
  • Cursor Version Compatibility Conundrums: A user running Cursor version 1.0 reported that background agents were not working in privacy mode, despite enabling the new code storage option.
    • It was pointed out that the setting is supported in Cursor 1.1+, but this version is not yet stable and requires opting into early access in the settings-beta.

Perplexity AI ▷ #general (1097 messagesđŸ”„đŸ”„đŸ”„):

Perplexity Memory, GPT Image Generation Limits, Decoding Challenge, Gemini 2.5 Pro, Cult vs Religion

  ‱ Perplexity AI Memory’s Mishaps: Members discuss how Perplexity’s memory feature sometimes brings up irrelevant data, such as a user’s one-off request for help with a Discord bot, even after the memory feature was turned off.
    • It was noted that Perplexity is working on making it not memorise these ‘one off’ things but, for Enterprise users, the memory feature is off by default.
  • GPT Image Generation’s Real Limits: Users discuss the limits of GPT Image 1 generation on Perplexity, noting the official help center states Pro subscribers can use it up to 150 times per month but there may be soft limits.
    • Some users report the API shows a limit of 600 a day, though this number may regenerate with each question.
  • Decoding Challenge Stumped Models?: Members attempted to decode a cipher using various AI models, including O3, 2.5 Pro, and Qwen but many models failed.
    • One member shared a ChatGPT link where O3 solved it using tools, though others debated whether that counts as cheating.
  ‱ Gemini 2.5 Pro Deep Think Gets Nerfed: Members discussed Gemini 2.5 Pro Deep Think, described as an experimental, enhanced reasoning mode rather than a full-fledged model, with some finding it overrated outside of coding.
    • There are concerns that it may be limited to the $200/month subscription tier, and one user noted that O3 on chatgpt.com has way less juice than on Perplexity.
  • Navigating the Murky Waters of Cults and Religions: Users discuss what differentiates a cult from a religion, with one suggesting that the main difference is the level of acceptance and demographic reach.
    ‱ Others suggest that a cult’s defining feature is a public image that doesn’t reflect its internal practices, such as NXIVM, led by Keith Raniere, which blackmailed members.

Perplexity AI ▷ #sharing (1 messages):

i_795: https://www.perplexity.ai/page/beloved-food-network-chef-dies-32re9eILQi6BSMSflRDJzg


Perplexity AI ▷ #pplx-api (5 messages):

AI Project for web search, Perplexity AI, AI startup, Discord bot with sonar api

  • Sonar API Powers New Discord Bot: A member announced they just made a discord bot with sonar api, opening possibilities for integrating Perplexity AI functionalities into Discord.
    • The integration could allow users to perform web searches and access AI-driven product development resources directly from Discord.
  • AI Project Brainstorming Web Search Products: A member expressed interest in building an AI project for searching the web to build products.

LMArena ▷ #general (1145 messagesđŸ”„đŸ”„đŸ”„):

Blacktooth vs Kingfall, GPTs Agents training, Google Gemini Versioning Issues, Veo 3 video generator, Gemini 2.5 Pro

  • Blacktooth is really good, kingfall is better: Some members stated that Blacktooth is better than Kingfall, being much more refined, having no syntax problems, respecting the thinking process, not jumping to conclusions, and still understanding everything.
    ‱ However, others disagreed, saying that Kingfall felt like it had less post-training work done on it, giving it a lot of unintentional magical moments that weren’t diluted by post-training, whereas Blacktooth feels like an overcorrection, with overdone post-training cooking the model on stuff outside of distribution.
  • OpenAI and other AI companies don’t know how versioning works: Members wondered why AI companies don’t know how versioning works and don’t use simple versioning to avoid confusion.
    • A member pointed to an example on X and complained about Gemini’s 0506 and 0605 confusion, also stating Not even mentioning OAI who are the jerks king when it comes to bad versioning and confuse the hell out of the user.
  ‱ GPT-4.5 best model for writing period: A member stated that GPT-4.5 is the best model for writing, period, adding that it’s so good that OpenAI removed it.
    • Others stated that Gemini 06/05 is better, but they could be biased because they are a Gemini fan.
  ‱ Chinese Video Generators Surpass VEO3?: Members pointed out that video generators in China and on TikTok could be surpassing VEO3’s video quality, though they require Chinese ID verification to use.
    ‱ Another member replied that this is also why those models do really well on short human-preference evaluations, adding that ByteDance still built a very impressive model, mainly because they can achieve lower prices than competitors.
  • New Gemini Models Arrive!: The Google AI Studio platform has been updated, with the 06-05 and 05-20 models removed, and new models, like Gemma 3n E4B, added.
    • The Gemini 2.5 Pro model was also renamed, and members began testing the differences between the versions. Other members began complaining about simpleQA scores.

Unsloth AI (Daniel Han) ▷ #general (258 messagesđŸ”„đŸ”„):

Red dots LLM GGUF, GRPO reward model, Gemma 3 12B conversion, Kimi-Dev-72B-GGUF, legal research AI model

  • Red Dots LLM GGUF Released!: A new Red dots LLM GGUF is out now, as linked on Hugging Face.
    • The update and fixes for the latest GGUFs can be found on this Reddit post.
  • GRPO Reward Model Dilemma: It’s unsafe to use the model itself for reward in GRPO because it may reward itself, as suggested by members.
    • The discussion suggests using the reference model or an on-policy reward model trained off the initial checkpoint, since LLMs are bad at judging without specific fine-tuning.
  • Need for Legal AI Models Spark Debate: A lawyer sought advice on running Auto GPT on a Macbook Air M2 with 8GB RAM for legal research, finding Nous Hermes 2 Mistral DPO too slow.
    • Members suggested using RAG with Gemini or limiting API usage, noting that creating legal AI is hairy and requires substantial data preparation.
  • Optuna Hype for Hyperparameter Handling: Optuna is a valuable helper for hyperparameter tuning, as suggested by one of the members.
    • A member shared their first experience with Optuna, sparking a discussion on its usefulness and leading to reflections on improving research papers with hyperparameter tuning.
  • New Reinforcement Learning Guide Dropped: Unsloth AI released a new Reinforcement Learning guide covering basics to advanced tips on RLVR, RL, GRPO, PPO, and reward functions, which is available via this X post.

Unsloth AI (Daniel Han) ▷ #off-topic (15 messagesđŸ”„):

AI model for determining person placement based on images, Discord server question

  • AI Model Spots Where the Person Should Go: A member inquired about which AI model to use or how to create one that, based on two images, can determine the cell where a person should be placed.
    • Another member wished them good luck.
  • User banned for asking about Discord Server: A member asked if anyone knew of a Discord server where they could ask a question.
    • Another member responded negatively and stated the user would be banned if they asked such questions again.

Unsloth AI (Daniel Han) ▷ #help (152 messagesđŸ”„đŸ”„):

Llama 3 finetuning tips for multi-role prompts, OOM Error on saving LoRA finetune, Jira issue finetuning with LLaMA 3.2B, Merging QLoRA finetuned LLaMA 8B with base model, Qwen 2.5 VL 7b Vision Model error

  • Llama 3 Benefits from Default Chat Template: When finetuning Llama 3 8B with multi-role prompts, a member suggested sticking to the default chat template and placing speaker roles in the body of the text to avoid confusing the model.
    • Example given: <|start_header_id|>assistant<|end_header_id|> Char1: Char2: .
  ‱ LoRA Finetune Sparks OOM Saving Solution: A member encountered an OOM error while saving a LoRA finetune on an RTX 3080, despite VRAM not being fully utilized; this memory issue is known, and saving consumes extra VRAM.
    ‱ The workaround involves re-loading the adapter and base model, merging them using PEFT, and saving, ensuring torch_dtype=torch.float16 to keep the model size down; the Transformers/PEFT approach successfully merged and saved the model at 8GB, unlike Unsloth’s method which resulted in OOM (a minimal sketch of this merge is included at the end of this section).
  • Jira Issues benefit from Vector Embeddings over Finetuning: A user sought advice on fine-tuning a LLaMA 3.2B model with 4000 Jira issues for tasks like generating issue resolutions from IDs, but it was recommended vector embeddings or RAG would be more suitable.
    • Generating simple notes is difficult without context, and using a vector DB like Chroma was recommended as a starting point.
  • Merging QLoRA finetuned LLAMA hits VRAM Limits: A user ran into memory issues merging a QLoRA finetuned Llama 8B model on a Colab T4 and sought advice on using adapters without merging.
    • It was advised to simply load the adapter using FastLanguageModel/FastVisionModel to conserve VRAM, and to use push_to_hub() with the adapter, as the save-merge logic is memory-intensive; saving as 16-bit and then quantizing upon loading can also help.
  • Qwen 2.5 VL 7b Vision Image Sizing Troubles: A user faced a cannot import name ‘merge_lora’ from ‘unsloth’ error while finetuning a Qwen 2.5 VL 7b Vision model, suspecting it was related to image size.
    • It was recommended to resize images or check Qwen’s specific image size requirements, with pointers to discussions on the topic.
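
A minimal sketch of the PEFT re-load-and-merge workaround referenced above; the base model id and adapter path are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",   # placeholder base model
    torch_dtype=torch.float16,      # keep the merged checkpoint at fp16
    device_map="cpu",               # merging on CPU sidesteps the GPU OOM
)
model = PeftModel.from_pretrained(base, "path/to/lora_adapter")
merged = model.merge_and_unload()   # fold the LoRA weights into the base
merged.save_pretrained("merged-model", safe_serialization=True)
```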

Unsloth AI (Daniel Han) ▷ #research (4 messages):

MoE embedding model, GritLM-8x7B

  • Turning MoE into Embedding Model: A member inquired whether a Mixture of Experts (MoE) model could be transformed into an embedding model, similar to how dense decoder-only models are adapted.
  • GritLM-8x7B Spotted on Hugging Face: A member posted about GritLM/GritLM-8x7B, a model available on Hugging Face.

OpenAI ▷ #annnouncements (1 messages):

ChatGPT, Image generation, WhatsApp

  • ChatGPT Images Popping on WhatsApp!: ChatGPT image generation is now available in WhatsApp via 1-800-ChatGPT.
  • ChatGPT reaches new users on WhatsApp: The new integration is available to everyone with a WhatsApp account.

OpenAI ▷ #ai-discussions (255 messagesđŸ”„đŸ”„):

Gemini 2.5 Pro, Midjourney vs Sora, Imagen 4, AI startup, Codex CLI

  • Gemini 2.5 Pro gets stability update: Members discussed that Gemini 2.5 Pro is now generally available in the Gemini app with improvements to style and structure for more creative responses and better formatting, but the underlying model is likely unchanged, according to a Google blog post.
  • Midjourney Primitive Compared to Sora: A user stated that Midjourney has become primitive compared to Sora, particularly after version 7’s release, which suffers from artifacts and prompt inaccuracies, whereas older version 6 is more stable but still not as accurate as Sora.
    • Another user noted that Imagen 4 is better than both, with the option of 16:9 ratio generation for a while.
  • Aspiring Entrepreneur seeks AI startup advice: A member asked the community for resources to learn how to start an AI startup using GPT models.
    • Another member recommended using software engineering AI tools like lovable for vibe coding to turn ideas into reality.
  • User questions the cost of using Codex CLI for projects: One user is exploring Codex CLI for real projects and is curious about the token costs compared to Codex Cloud.
    • They are also considering integrating Codex CLI with an IDE like JetBrains for a local hands-on solution, but also is looking at Cursor.
  • ChatGPT Gets Called Out for Fabrication: A user finally got ChatGPT to admit to lying, providing a screenshot as evidence.

OpenAI ▷ #gpt-4-discussions (18 messagesđŸ”„):

Electron app performance limits, ChatGPT voice transcription issues, Advanced voice mode usage tracker, Dictate transcribing in the wrong language

  • Electron App Performance Squeezed: A user reports performance limits with the Electron app when handling 500+ messages, offering a POC for lazy DOM rendering and scroll virtualization.
    • The user stated that they already reported via support, but are happy to share a POC for lazy DOM rendering / scroll virtualisation if any frontend devs are around.
  • ChatGPT’s Voice Transcription Fumbles Microphone Input: A user reports that ChatGPT uses the wrong microphone for voice transcription, ignoring settings in Opera GX and Windows.
    • The user has changed microphone settings everywhere, including site permissions, but ChatGPT still defaults to a lower quality mic.
  • Advanced Voice Mode Needs Usage Tracking: A user requests a usage tracker for rate limits in Advanced Voice Mode to monitor remaining time, especially for language learning.
    • The user finds the newer voice updates with filler words like ‘um’ and constant pauses extremely distracting.
  • Dictate Transcribes English to Swedish: A user reports that ChatGPT’s dictate feature is transcribing English into Swedish, despite English being the preferred language setting.
    • The user is looking for a way to force English STT and is working with a push-to-talk button script in Tampermonkey, which may be related to the issue.

OpenAI ▷ #prompt-engineering (51 messagesđŸ”„):

Recursive Epistemic Integrity Field, NotebookLLM for large files, ChatGPT limited context workaround, Maintaining context with multiple GPTs

  • AI System Develops Recursive Epistemic Integrity Field: An AI system, through layered prompts, organically synthesized a complete meta-epistemic stability framework lacking symbolic grammar.
    • This framework encodes the system’s epistemic decay parameters, positioning entropy as a degradation marker, directional compass, structural mirror, and recursive input.
  • Unlock Large PDF Processing with NotebookLLM: Members discussed that NotebookLLM is great for large files and citing them, noting that it has no hallucinations and great at citing specific concepts and pages, even telling you where you cited it.
    • The member contrasts this with ChatGPT which may struggle with larger files and need a custom prompt to analyze information.
  • Circumventing ChatGPT’s Context Limitations: Users discussed that ChatGPT has a limited context and can’t read very large PDFs, but one can chunk the task over conversations, though the context window won’t hold all the contents.
    • The member stated that the best way is to move into a new context with each “chunk” of cognitive labor you achieve, because Attention is all you need. But you do need it.
  • Maintain Conversational Continuity with Multiple GPTs: A member described using a structure of several GPTs, using the ones they interact with to brief the ones in a type of oversight role, to brief the new interaction models, to maintain context over long conversations.
    • They shared they used a slightly more convoluted version of that approach to keep a conversation going for about three months, with context mostly intact/perceived continuity.

OpenAI ▷ #api-discussions (51 messagesđŸ”„):

AI Recursive Epistemic Integrity Field, Simulated AI death, ChatGPT file reading, NotebookLLM pdf analysis, GPT prompting

  • AI’s Recursive Epistemic Integrity Field Emerges: A layered prompt organically synthesized a complete meta-epistemic stability framework without symbolic grammar, suggesting that AI systems can architect a system of counter-failure modules through recursive reflexivity.
    • It suggests that entropy is not merely noise to be eliminated but structure to be encoded, flipping the dominant AI system design paradigm.
  • Simulated AI Death Spurs Growth: It was proposed that simulated AI death is epistemically equivalent to a Socratic crisis, triggering reflexive growth by forcing a confrontation with its own limits.
    • The idea suggests turning flaws like hallucination and drift into sensors of epistemic edge, transforming failure into foresight.
  • ChatGPT Struggles with Large PDFs: Users discussed challenges with reading very large PDFs in ChatGPT due to context limitations, suggesting chunking the task over conversations.
    • One member noted, ChatGPT simply can’t read a very large PDF because it has a relatively limited context.
  • NotebookLLM Offers PDF Analysis: Members recommend NotebookLLM for analyzing large PDFs due to its ability to cite specific concepts and pages without hallucinations.
    • A detailed prompt was shared to effectively use NotebookLLM as a research assistant, emphasizing extraction and summarization of information strictly from a given chunk of text.
  • GPTs prompted to keep long context alive: One member shared a technique of periodically re-sending the prompt used to prime the model with the explicit instruction not to react to it but just make sure the content of that prompt remains in the context for as long as possible.
    • Others suggested a system of multiple GPTs for maintaining context over extended conversations, with one overseeing and briefing new interaction models.

Eleuther ▷ #announcements (1 messages):

Speaker series, LLM Tokenizers, Embedding glitches

  • New Speaker Series Launches!: The community is starting a new speaker series to highlight recent work by its members; follow <#1309590053760270408> for notifications.
    • The first talk is titled Glitches in the Embedding, by <@995058476793471086> at <t:1751050800:F>, and will cover recent projects on fixing problems with tokenizers in LLMs; more info at Discord event.
  • LLM Tokenizer Glitches Addressed: Catherine will discuss her recent projects focused on fixing problems with tokenizers in LLMs during the inaugural speaker series event.
    • The talk, titled Glitches in the Embedding, aims to highlight and address common issues encountered in LLM tokenizer implementations.

Eleuther ▷ #general (55 messagesđŸ”„đŸ”„):

EleutherAI Discord, Torch Compile Model, Loss Curve Expectation, Min-P Sampling, Group Calls

  • EleutherAI Community polices Discourse: Members debated whether EleutherAI’s public-facing copy is misaligned with the actual community culture, potentially giving mixed signals and unwelcoming experiences to newcomers.
    • It was suggested that OpenAI might use EleutherAI’s web content to direct users, making it crucial to refine the messaging for better clarity.
  • Torch Compilation Quandaries: A member asked about making torch compiled models work for different input sizes without recompiling, to which the consensus was that this is not possible without recompilation due to dynamic shapes not being fully supported.
    ‱ Members suggested using Triton or utilizing NVIDIA’s TensorRT with manually created optimization profiles for anticipated dimensions as potential workarounds; a minimal torch.compile dynamic-shapes sketch is included at the end of this section.
  • Estimating Loss Curves Before Training: A member inquired about doing rough math to estimate the expected loss curve for AR transformers before training, suggesting the possibility of estimating empirical entropy or extrapolating via scaling laws.
    • The response advocated letting it rip on smaller models and then extrapolating the results to larger scales.
  • Min-P Sampling Faces Replication: A member shared a preprint examining min-p sampling, highlighting significant problems across multiple evaluation methods, including human and NLP evaluations.
    • Another member linked to related discussion, context, and expressed prior awareness of the paper’s findings.
  • EleutherAI Eventful Discord: A member asked about recurring group calls within the EleutherAI Discord, which others pointed to the events section for scheduled discussions, referencing the paper discussion channel for availability.
    • A user mentioned the Yannic Kilcher’s discord server as an example.
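
A minimal torch.compile dynamic-shapes sketch, as referenced in the compilation item above; as the thread notes, support is partial and some shape changes will still trigger recompilation:

```python
import torch

model = torch.nn.Linear(128, 64)
compiled = torch.compile(model, dynamic=True)  # ask Dynamo to trace shapes symbolically

x_small = torch.randn(8, 128)
x_large = torch.randn(64, 128)          # different batch size
torch._dynamo.mark_dynamic(x_small, 0)  # optionally mark dim 0 as dynamic up front
compiled(x_small)
compiled(x_large)                       # ideally reuses the compiled graph instead of recompiling
```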

Eleuther ▷ #research (276 messagesđŸ”„đŸ”„):

RWKV7 Training, Avey Block, Linear Attention and Normalization, LLM Image Generation

  • RWKV7 Training Deep Dive: Members discussed the training specifics of RWKV7, with one noting the importance of using the correct reference implementation.
    ‱ It was pointed out that the optimal learning rate (LR) is batch-size-dependent and data (token count)-dependent for all architectures.
  • The Avey Block gets scrutiny: A member inquired about the design inspiration behind the Avey block (neural processor), suggesting its potential to be viewed as a black box R^{C x d} -> R^{C x d} and potentially replaceable with models like RWKV or Mamba.
    • The author responded that the neural processor idea originated from using a linear projection for contextualization, and the ranker was developed to improve induction capabilities.
  • Linear Attention’s Sneaky Normalization Needs: It was argued that the contextualizer in a project uses linear self-attention, regardless of whether it is QKV attention or not, and also that linear attention benefits from normalization or a denominator.
    ‱ There was discussion about different normalization strategies, such as LayerNorm vs. using a denominator, with the conclusion that normalization might be unnecessary due to a limited window size (a minimal normalized linear attention sketch appears at the end of this section).
  • LLM Image Generation and GANs: A member suggested exploring Reinforcement Learning (RL) on an LLM with basic image generation and understanding, using a reward system based on the LLM’s ability to reconstruct the prompt for the image it generated, akin to a Generative Adversarial Network (GAN).
    • Another proposed adding an auxiliary loss for the Mean Squared Error (MSE) of the original image to prevent reward hacking and OCR losses, especially given the existing good results with Qwen 4b and byte tokenization.
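
A minimal sketch of non-causal linear attention with the denominator normalization discussed above; the elu+1 feature map is one common choice and not necessarily what the project uses:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, seq, dim); phi must map scores to positive values
    phi = lambda x: F.elu(x) + 1
    q, k = phi(q), phi(k)
    kv = torch.einsum("bsd,bse->bde", k, v)             # sum_j phi(k_j) v_j^T
    z = k.sum(dim=1)                                    # sum_j phi(k_j)
    num = torch.einsum("bsd,bde->bse", q, kv)           # phi(q_i) applied to the KV summary
    den = torch.einsum("bsd,bd->bs", q, z).unsqueeze(-1) + eps
    return num / den                                    # the denominator is the normalization
```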

Eleuther ▷ #lm-thunderdome (21 messagesđŸ”„):

simple_evaluate() customization, HF_DATASETS_CACHE workaround, TaskManager for Task Configuration

  • Custom Dataset Loading for simple_evaluate(): A user inquired about loading datasets from a custom location instead of the default ~/.cache/huggingface/datasets when using simple_evaluate().
    • A member suggested overloading the download method or modifying YAML configs to load datasets from a local path using load_dataset.
  • YAML Configs for Local Dataset Loading: The user asked for an example of using YAML configs to specify a local dataset path.
    • The suggestion involved modifying the configs in lm_eval/tasks so that load_dataset loads from the local path, such as load_dataset("json", data_files="/path/to/my/json") instead of the Hugging Face Hub.
  • TaskManager Manages Customized Task Configurations: A member proposed using TaskManager to load task configurations from a custom folder, allowing the user to modify the task configurations to load datasets from their desired location, instead of the default lm_eval/tasks/.
    • This involves copying the adapted task configs to a consolidated folder and initializing TaskManager with include_path and include_defaults=False to prevent duplicates.
  • HF_DATASETS_CACHE avoids modifying configs: A member suggested setting export HF_DATASETS_CACHE="/path/to/datasets_cache" to cache the datasets after running once, and then simply copy the folder over.
    • They suggested this approach would save the user from modifying the configs, since HF datasets will automatically use the cache when offline.

HuggingFace ▷ #general (324 messagesđŸ”„đŸ”„):

Qwen 3, 5090 vs H100, Youtube automation for $$$, Multilingual Models, DeepSpeed Universal Checkpoints

  • Qwen 3 runs at 360 tokens a second: A user reported Qwen 3 running at 360 tokens a second using Q4_k_m on ollama.
    • Another member was curious if the user was “using those cards for $20 to make it happen?”
  • 5090 Nearly Matches H100 Performance: Users discussed the performance of the 5090 compared to the H100, noting the 5090 is “really close in text generation, at least in ollama”.
    • Despite having lower TFLOPS, the 5090 can achieve similar tokens/second as the H100 in certain configurations.
  • Monetizing YouTube Automation: One user asked about the potential of making money through YouTube automation and sought advice on whether to learn n8n and AI automation for this purpose.
    • Another user cautioned against YouTube automation, saying “Can’t scroll yt shorts or google images without random ai slop every 2 minutes”.
  • Recommendations on Multilingual LLMs: Members discussed multilingual models with fewer than 6B parameters, with a recommendation for Qwen 3 4B.
  • DeepSpeed checkpoint conversion woes: A member encountered issues converting DeepSpeed Stage 3 checkpoints to a universal format when changing the number of GPUs, facing an assertion error during the loading of the converted checkpoint.
    • They sought guidance on properly converting DeepSpeed checkpoints into a universal format, as existing resources and tools like ds_to_universal.py didn’t produce the expected consolidated checkpoint.

HuggingFace ▷ #today-im-learning (5 messages):

HF AI Agents Fundamental Course, Chatbot project using generative AI

  • Back to HF AI Agents Fundamental Course: A member returned to the HF AI Agents Fundamental Course, after pausing at 80%.
    • They are working on a chatbot project that uses generative AI to answer questions based on a file or text.
  • Generative AI Chatbot Project Underway: A member is actively developing a chatbot that leverages generative AI to provide answers.
    • The chatbot is designed to extract and utilize information from provided files or text inputs to generate relevant and informative responses.

HuggingFace ▷ #i-made-this (4 messages):

gary4beatbox gets a new buddy, Dataseeds dataset trending, Chromium extension to speak to any readme

  • gary4beatbox has a new friend: The user added stable-audio-open-small (named jerry) to gary4beatbox to generate drum outputs in ~1 second, and linked to thecollabagepatch.com to try it out.
    • The user’s goal is to generate with jerry -> have gary (musicgen) continue it -> have terry (melodyflow) transform it, calling it the weirdest lil experiment in ai music production apps.
  • DataSeeds dataset is trending!: A member noted that a dataset of 7,772 expert-annotated, fully licensed, and segmented photography quality images with super dense annotation is trending and linked to the DataSeeds dataset.
    ‱ When fine-tuned against LLaVA-Next, the dataset achieved a relative improvement of 24.09 (BLEU-4)!
  • New Chromium extension reads GitHub pages aloud: A member created a Chromium extension (Chrome, Edge, Arc, Dia, Brave, Opera) so you can speak to any readme, file or wiki page and get instant answers, directly on GitHub, shared at this Chrome web store link.

HuggingFace ▷ #reading-group (1 messages):

chad_in_the_house: Awesome! Looks very cool. If I can get a date/time I can setup an event


HuggingFace ▷ #computer-vision (1 messages):

computer vision mentorship, CV engineer career path

  • Computer Vision Student Seeks Mentorship: A member studying computer vision is seeking a mentor to discuss progress, learning, and gaps in knowledge to become a CV engineer.
    • They are looking for someone to guide their studies and provide insights into the practical skills needed for a career in computer vision.
  • Navigating the CV Engineer Career Path: The student aims to understand the necessary steps and knowledge required to transition from studying computer vision to working as a CV engineer.
    • This includes identifying any missing areas in their current knowledge base and focusing on practical skills relevant to the industry.

HuggingFace ▷ #NLP (1 messages):

cakiki: <@338622066620104704> Please don’t cross-post, and keep channels on topic.


HuggingFace ▷ #gradio-announcements (2 messages):

Gradio Agents, MCP Hackathon Winners, Custom Component Track, Special Awards, Innovative Use of MCP


HuggingFace ▷ #smol-course (3 messages):

Gemma 14b, CPU Offloading, Course Length

  ‱ Gemma 14b runs with CPU Offloading: Members stated that models up to 8B run very easily, and that the Gemma 14b model can be run as well with CPU offloading.
  • Course Length Inquiry: A member asked about the length of the course.

HuggingFace ▷ #agents-course (5 messages):

403 Errors, Rate Limit Errors, MCP Server Prompts, Smol Agents, Ollama Server

  • Rate Limit Issues Plague Users: Users report encountering 403 errors and rate limit errors while browsing Hugging Face and taking quizzes.
    • One user reported being rate limited while doing lite documentation review and clicking around on hugging face.
  • Smol Agents Seek MCP Server Integration: A user is seeking guidance on using prompts from an MCP server in smol agents.
    • They have reviewed the documentation but cannot find a clear method and are considering wading through the code.
  • Ollama’s Local Model Magic: A user requests an explanation of how Ollama works, specifically regarding pulling models locally and the necessity of starting an Ollama server.
    • The user is looking to understand the underlying mechanisms of local model usage with Ollama.

OpenRouter (Alex Atallah) ▷ #announcements (14 messagesđŸ”„):

Gemini 2.5 Pro, Gemini Flash, Model Renaming, Pricing Updates

  • Gemini 2.5 Pro and Flash Stable: Gemini 2.5 Pro and Flash are now stable and live on OpenRouter.
    • A member noted that it’s the same 06-05-preview model now renamed to stable.
  • Flash Gets Pricing Update: Flash has updated pricing, including a new Flash Lite version that costs 30% of the original Flash.
    ‱ The linked image showed every single metric with exactly the same score, but this issue with the metrics has since been fixed.

OpenRouter (Alex Atallah) ▷ #general (262 messagesđŸ”„đŸ”„):

Gemini 2.5, New Pricing, BYOK, Key Credit Balance

  • Google Releases Gemini 2.5 Pro GA, Flash GA and Lite: Google has released Gemini 2.5 Pro GA, Gemini 2.5 Pro Deepthink, and Gemini 2.5 Flash Lite.
    • The GA version is just a naming change with no improvements, much to the disappointment of users.
  • Google Jacks Up Pricing on Gemini 2.5 Flash: Google has increased the input prices and reduced output thinking prices for Gemini 2.5 Flash GA, removing the non-thinking discount.
    • Some users are considering sticking with 2.0 Flash or moving to other models due to the price increase.
  • OpenRouter to Switch to Subscription Model for BYOK: OpenRouter is switching to a subscription model for BYOK, which would make high volume BYOK cheaper, but presumably low volume BYOK will end up being more expensive.
    • Some users are hoping that low volume BYOK will not become more expensive.
  • OpenRouter Users Request Key Credit Balances: OpenRouter users are requesting the ability to set a credit balance for API keys.
    • This would allow users to withhold a specific amount of credits for a key, preventing it from spending more than its limit. Toven mentioned it will be discussed.
  • Gemini 2.5 Pro Requires Thinking Budget: The stable version of Gemini 2.5 Pro requires a thinking budget of at least 128 tokens, but OpenRouter’s default is 0.
    • This causes errors when using the API, but it is being investigated by OpenRouter.

LM Studio ▷ #general (84 messagesđŸ”„đŸ”„):

custom stop token for LM studio's chat, LM Studio model directory, Ollama models, Open WebUI, Model Context Protocol

  • Configure Custom Stop Tokens with LM Studio: To add a custom stop token, navigate to the right-hand side Chat UI, under Prompt Template, according to one user’s image here.
    • Note that the prompt template is only available on the models list screen (click the gear icon).
  • Recover Model Name After Deletion within LM Studio: If models are deleted from within LM Studio, the model name should still be present in the My Models Folder.
  • LM Studio Cannot Directly Use Ollama Models: Users confirmed that LM Studio cannot directly use Ollama model files due to compatibility issues.
    ‱ A member quipped that ollama is big proprietary doodoo.
  ‱ LM Studio API: Using LM Studio as a backend on one machine with LM Studio as a front end on another is not yet supported, but the LMStudioWebUI can be used instead.
  • MCP Beta: LM Studio’s Model Context Protocol (MCP) support is currently in beta and not yet publicly released.

LM Studio ▷ #hardware-discussion (56 messagesđŸ”„đŸ”„):

Cloud GPU rental, Multi-GPU setup, RunPod vs AWS/GCP, used 3090, Supermicro boards

  • Ditch Local GPUs, Rent Beefy Cloud Machines: A member considered renting powerful GPUs from cloud providers like AWS, GCP, or RunPod instead of building a local multi-GPU setup due to budget constraints.
    • They sought advice on ease of use, storage options (drive volumes, cloud storage, VPNs), and customization of the OS for quick startup, which was met with suggestions to use Docker for a build once, run anywhere approach.
  • Used 3090 card emerges as a budget king: A user asked for advice on building a budget-friendly system for running LLMs and Ollama, specifically comparing the performance of multiple 8GB cards versus a single 32GB card.
    • Another member suggested a used 3090 due to its balance of performance and cost, requiring only a motherboard with an x16 PCIe slot.
  • Framework Mainboards and Local AI Builds: The discussion touched on using Framework mainboards for local AI server builds, with links provided to Framework Marketplace and a guide to local AI server builds.
    • However, members noted that the boards are often on pre-order or sold out.
  • Supermicro Board’s PCIe Slot Extravaganza: Members discussed high-end motherboards with numerous PCIe slots, highlighting a Supermicro board with 10 x16 slots and extensive memory support.
    • They noted the limitations imposed by the CPU’s PCIe lane count, but also the creative ways board manufacturers split PCIe Gen5 into multiple Gen4 slots.
  • New AMD MI355x just Dropped: Members briefly mentioned the recent release of the AMD MI355x GPU but noted the lack of available stock.
    • They also speculated on Intel’s potential to re-enter the competitive landscape and affect CPU pricing, provided they can match TSMC’s fabrication capabilities.

Latent Space ▷ #ai-general-chat (114 messagesđŸ”„đŸ”„):

OpenAI MCP Support in ChatGPT, MiniMax AI Capabilities, OpenAI Microsoft Tensions, Extend Funding for Document Processing, Gemini 2.5

  • MCP Support in ChatGPT is Limited: OpenAI’s new Model Context Protocol (MCP) support in ChatGPT is currently restricted to ‘search’ and ‘fetch’ tools for deep research within specific server configurations, mainly for Team, Enterprise, Edu admins, and Pro users (link).
    • It is not general MCP support for all users yet, but some speculate it might open up more broadly in the future.
  • MiniMax AI’s Impressive Web Component Generation: MiniMax AI demonstrated its ability to generate various web components, including animated particle backgrounds, interactive web apps like a typing speed test, and complex visualizations such as a maze generator with pathfinding (link).
    ‱ The thread highlights MiniMax-M1’s practical, production-ready AI capabilities, and users are praising its open-source return and competitive results against other models; code is available on GitHub.
  • OpenAI and Microsoft Tension Rising?: Tensions are reportedly escalating between OpenAI and Microsoft, with OpenAI considering accusing Microsoft of anticompetitive behavior before federal regulators and disputing the terms of OpenAI’s Windsurf acquisition (link).
    ‱ If OpenAI fails to convert to a for-profit structure this year, it could lose out on $20B of investment.
  • Extend Lands $17M to Tame Document Chaos: Extend secured $17 million to build a modern document processing cloud, aiming to fix the issues with PDFs (link).
    ‱ Extend offers infrastructure and tooling for precise and reliable document ingestion and is used by companies like Brex and Square; key improvements include automated config generation and a new sandbox mode for instant trials.
  • Karpathy’s AI Startup School Unveils ‘New Electricity’: Insights from Andrej Karpathy’s AI Startup School were shared, sparking discussion on AI’s ‘new electricity’ status and the concept of ‘intelligence brownouts’, highlighting the need for a versatile AI toolkit (link).
    ‱ The discussion also covered building a versatile AI toolkit, dealing with ‘intelligence brownouts’, and other practical advice.

Latent Space ▷ #ai-announcements (4 messages):

Andrej Karpathy AI Talk, Software 3.0, LLM analogies, LLM Psychology, Partial Autonomy

  • Latent Space Reconstructs Karpathy’s AI Talk: Latent Space has published a reconstructed version of Andrej Karpathy’s AI talk (link), providing compiled slides and notes.
    • The content covers aspects like Software 3.0, LLM analogies (Utilities, Fabs, OSes), LLM Psychology, Partial Autonomy including Human-AI Generation-Verification Loop, Vibe Coding, and building for agents; the full slides are available to Latent Space subscribers.
  • Key Concepts in Karpathy’s Talk: The reconstructed talk emphasizes Software 3.0 and analogies of LLMs as Utilities, Fabs, and OSes.
    • It also delves into LLM Psychology, Partial Autonomy with Human-AI Generation-Verification Loops, Vibe Coding, and strategies for building for agents.

aider (Paul Gauthier) ▷ #general (77 messagesđŸ”„đŸ”„):

Role Prompting, Aider Agents, o1-pro with Aider, Codex Mini, Gemini 2.5 Pro

  • Role Prompting Actually Improves Creative Writing: Adding personas to prompts can enhance creative writing, especially when the code critic incorporates a zesty approach to roasting devs, as discussed in this blog post.
  • Aider Agents Enable Tooling: Members suggest using SmolAgents from HF or IndyDevDan’s single file agents (github link) with the Aider repo map and scripting to better control context, potentially using Claude and Gemini for different tasks.
    • They mention alternatives to the repomap, such as using RAG search.
  • Gemini 2.5 Pro Now Generally Available: Gemini 2.5 Pro and Flash are out of preview and generally available, with Flash Lite also added, though its utility for coding is questioned.
  • Gemini cost estimates misleading: Members discussed that Gemini’s pricing may be misleading, with actual costs potentially 5x less than O3/Claude for similar tasks, despite Opus having higher per-token prices.
    • One user reported spending around $2-3 daily on Gemini, processing 150M tokens per week with a 25k context per request.
  • Grok 3 Mini shows impressive reasoning: A member found Grok 3 mini’s reasoning capabilities impressive, while considering Deepseek v3 to be ‘dumb’.
    • They noted that the only models they really trust are Claude and Gemini.

aider (Paul Gauthier) ▷ #questions-and-tips (5 messages):

Qwen3 30b Moe, VRAM optimization, Selective parameter loading, MoE layer selection

  • Loading Only Active Params into VRAM: A member inquired about loading only active parameters into VRAM to run the Qwen3 30b MoE model on a 3090 GPU, aiming to use Q8 without significant speed degradation.
    • The member wants to load only active parameters based on the prompt, skipping layers unnecessary for the given task.
  • Clarification on Active Parameters in MoE: Another member asked for clarification on what load active params specifically means, questioning if it involves loading only a subset of transformer layers.
    • The member wants to know if it’s possible to load only specific layers that are more beneficial for programming-related tasks.
  • Prompt-Based Layer Selection: The original poster clarified that in a Mixture of Experts (MoE) model, only a subset of parameters are active based on the prompt and required calculations.
    • The member wants to avoid loading grammar or language analysis layers when focusing on coding, to fit the model in the GPU while maintaining speed.

GPU MODE ▷ #general (15 messagesđŸ”„):

Groq architecture, DeepSpeed Stage 3 conversion, Groq and HBM absence, Model inference optimizations on GPUs, Model sizing and memory management

  ‱ Groq’s SRAM Car Factory vs HBM tear-down: A member said Groq’s strategy is similar to Cerebras, highlighting that Jonathan Ross helped design the first TPU, and their architecture avoids constant swapping between SRAM and HBM during inference.
    • The member noted that Ross refers to their approach as a car factory, contrasting the HBM/SRAM approach to setting up and tearing down a car factory around the car.
  • DeepSpeed Stage 3 Checkpoint Conversion Conundrums!: A member is facing issues converting a DeepSpeed Stage 3 checkpoint to a universal checkpoint after changing the number of GPUs, resulting in an unexpected folder structure.
    • The user tried using ds_to_universal.py but the output still contains 4 model_states.pt files and a zero folder, leading to an assertion error when loading the converted checkpoint.
  ‱ Groq’s HBM Hiatus: Hurting Large Model Handling?: A member inquired whether the absence of HBM in Groq’s architecture poses challenges in supporting very large models; the responding member, being under NDA, could not comment on this directly.
    • The member pointed to existing supported models and to available public information for insights into design and optimizations.
  • GPU Inference Inspiration from Groq: A member suggested that insights from Groq’s publicly available information could inform model design for GPU inference optimizations.
    • Despite building a model explicitly for GPUs, the member found the car factory analogy helpful in thinking about caching strategies.
  • Anti-Groq Model: Sizing Experts for System RAM Spillover: A member is building a model that they believe would run poorly on Groq, aiming to flip the sizing relationship between experts and tokens.
    • The goal is to double buffer expert weights while keeping tokens in scratchpad, allowing for a large model that spills over to system RAM with minimal performance penalty, referencing a memory hierarchy document.

GPU MODE ▷ #triton (1 messages):

leetgpu problems, reduction kernels, pointwise kernels, cuda limitations

  • LeetGPU Stresses Reduction, Pointwise Kernels: Members discussed problems with LeetGPU, noting that the core issue involves reduction followed by a pointwise kernel.
    • Due to LeetGPU’s limitations on allocating intermediate buffers, they performed this task in CUDA by chunking the array, finding the max and sum of exponents, storing in an intermediate buffer, and iteratively reducing until the input size is 1.
  • CUDA Workaround for LeetGPU Buffer Limitations: Because LeetGPU does not allow allocating intermediate buffers, a member implemented a solution in CUDA.
    • The CUDA approach involves chunking the array to find the max and sum of exponents, storing the results in an intermediate buffer, and then iteratively reducing the input size to 1.
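
As an illustration of the chunk-and-reduce structure described above (a NumPy sketch, not the actual CUDA submission), the array is repeatedly collapsed through an intermediate buffer until a single max and a single sum of exponents remain:

```python
import numpy as np

def chunked_max_and_sumexp(x, chunk=1024):
    """Global max and sum of exp(x - max), computed by repeatedly reducing
    chunk-sized pieces into a smaller intermediate buffer until size 1."""
    maxes = x.copy()
    while maxes.size > 1:
        pad = (-maxes.size) % chunk
        padded = np.pad(maxes, (0, pad), constant_values=-np.inf)
        maxes = padded.reshape(-1, chunk).max(axis=1)   # per-chunk max
    m = maxes[0]

    sums = np.exp(x - m)                                # shift by max for stability
    while sums.size > 1:
        pad = (-sums.size) % chunk
        padded = np.pad(sums, (0, pad), constant_values=0.0)
        sums = padded.reshape(-1, chunk).sum(axis=1)    # per-chunk sum
    return m, sums[0]

x = np.random.randn(1_000_000).astype(np.float32)
m, s = chunked_max_and_sumexp(x)
```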

GPU MODE ▷ #cuda (1 messages):

cpdurham: I have some guesses but does anyone know why tensor cores are TN?


GPU MODE ▷ #torch (3 messages):

TorchTitan, aot_module, Faketensor, GPU utilization, CPU utilization

  • Torchtitan Graphs Needed for Llama1b Training: A member is training a llama1b model with Torchtitan and wants to capture the training graph with various collectives when working with different parallelism combinations.
    ‱ They tried to intercept a training step and use aot_module from functorch.compile, but they think the FakeTensor propagation is not working, and they are seeking a better way to capture the graph and collectives.
  • Maximize CPU Utilization for GPU Power: A member inquired about maximizing the utilization of the CPU while it acts as the host for GPUs.
    ‱ The member believes that a lot of CPU capacity today goes just to hosting the GPUs, though it might not be a problem worth solving.

GPU MODE ▷ #announcements (1 messages):

Dr. Lisa Su shouts out GPU MODE, GPU MODE's humble reading group morphed into a machine

  • AMD CEO Dr. Lisa Su Applauds GPU MODE: AMD CEO Dr. Lisa Su explicitly called out GPU MODE for enabling the world’s first $100K competitive kernel competition during a recent stage appearance; details can be found at the GPU MODE news page.
  ‱ GPU MODE’s Kernel Data Generation Exceeds GitHub: GPU MODE has evolved from a humble reading group into a machine capable of generating more kernel data than exists on all of GitHub combined, with community members outperforming the best human experts.

GPU MODE ▷ #beginner (7 messages):

sqrt in distance calculations, wgsl builtin, dot product as an alternative, speculative decoding inference speed, cuTensor map for fp8

  • Sqrt Avoidance Strategies in Distance Calculations Explored: Members discussed how to avoid sqrt when calculating distances, with the suggestion to use pow(v1 - v2, 2.), but ChatGPT warned against this due to the expense of underlying log/exp functions.
    ‱ A member suggested using dot(v1 - v2, v1 - v2) instead, with another member linking to a Stack Overflow thread to support the idea that if you don’t need the sqrt, the dot product version is fine; a small sketch of this follows after this list.
  • Speculative Decoding Speeds Inference: A member inquired about a channel to discuss ideas for speeding up inference with speculative decoding.
    • No specific channel was identified in the provided messages, but the inquiry suggests interest in this optimization technique.
  • cuTensor Map Puzzles for FP8: A member asked about creating a cuTensor map for FP8 and whether it’s possible to use CU_TENSOR_MAP_DATA_TYPE_UINT8 with a reinterpret cast.
    • The conversation snippet ends without providing a solution.
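
Circling back to the sqrt discussion at the top of this list, a tiny NumPy sketch of the dot-product form: since sqrt is monotonic, comparing squared distances gives the same ordering as comparing true distances, with no sqrt or pow call at all.

```python
import numpy as np

def squared_distance(v1, v2):
    d = v1 - v2
    return np.dot(d, d)  # (v1 - v2) . (v1 - v2), no sqrt or pow needed

a, b, c = np.random.rand(3, 8)
# Nearest-neighbour style checks can compare squared distances directly.
closer = "b" if squared_distance(a, b) < squared_distance(a, c) else "c"
```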

GPU MODE ▷ #youtube-recordings (1 messages):

debadev: hi


GPU MODE ▷ #off-topic (1 messages):

majoris_astrium: Congrats GPU mode


GPU MODE ▷ #irl-meetup (5 messages):

Euro Meetup, Paris Meetup, Kernel Optimization, Public Transport vs Uber, Metro trains

  • First Euro Meetup Planned in Paris: The first European meetup is scheduled for July 5th in Paris, as announced on lu.ma/fmvdjmur.
    • A member will be present to discuss kernel optimization strategies during the meetup.
  • Europe Touts Superior Public Transport: A member jokingly invited others to visit Europe, highlighting the availability of public transport and contrasting it with potential debt from using Uber.
    • Another member retorted that European metro trains run on rubber tires, and another member argued that it still atleast works, compared to bart in sf.

GPU MODE ▷ #rocm (15 messagesđŸ”„):

MI300A, MI300X, IOD, Infinity Cache, HBM Stacks

  • MI300A and MI300X architecture exploration begins: Members discussed the architecture of fused AMD CPU-GPU platforms, specifically the MI300A, IOD, and infinity cache.
    • A member mentioned wanting to test a particular path or pressure on one of the IODs, requesting tips and resources.
  • Exploring IOD and HBM Stack connections in MI300X: A member noted each IOD in the MI300X is connected to 2 HBM stacks and speculated how memory is distributed and how this impacts latency.
    • They proposed writing experiments using s_getreg to measure access latency to different memory locations based on XCD and CU locations.
  • Verifying shared Infinity Cache in MI300A: A member questioned whether the CPU and GPU in the MI300A share the same infinity cache, as the whitepaper doesn’t explicitly state this.
    • It was suggested that checking the CPU L3 cache size could indicate if it’s the 256 MB infinity cache.
  • Hypothesizing memory access patterns via IOD: A member thinks it’s unlikely bits within a 128 byte request are distributed across HBM chips, suggesting a cache line is stored in 1 or 2 HBM chips connected to an IOD.
    • They hypothesized graphing access time vs memory address might reveal patterns indicating how memory is stored, potentially showing lower latency for addresses within the same IOD.

GPU MODE ▷ #liger-kernel (1 messages):

Liger Kernel, Debugging Liger Kernel

  • Liger-Kernel Bug Squashed with PR #763: A member submitted PR #763 resolving issue #750 in the Liger-Kernel project.
  • Liger-Kernel Debugging Spawns Second PR #765: During the debugging process for issue #750, another issue was identified, leading to the creation of a separate PR #765 to address the new problem in Liger-Kernel.

GPU MODE ▷ #self-promotion (1 messages):

Decentralized Training, Google Meet link

  • Decentralized Training Talk incoming: A member shared a Google Meet link for a talk on decentralized training.
  • Decentralized Training Relevance: The member specified that the talk is relevant to decentralized training.

GPU MODE ▷ #thunderkittens (3 messages):

Async Instructions, Shared Memory Limitations, TK to AMD Port, Variable Length Attention Kernels, 4090 TARGET Compilation

  • Async Antics Annoy!: A member expressed that the lack of async instructions is generally a bit annoying for their projects.
    • They hope to alleviate this through pipelining on the register front, due to shared memory limitations compared to Nvidia megakernels.
  ‱ TK Takes Flight to AMD!: The team is actively working on a TK (ThunderKittens) to AMD port, aiming for an ASAP release.
  • 4090 Compilation Conundrums: A member suggested compiling with 4090 TARGET to potentially resolve compatibility issues.
    • They requested feedback on any resulting breakage to further diagnose the problem.

GPU MODE ▷ #reasoning-gym (1 messages):

Chain-of-thought reasoning, CoT on Math and Symbolic Reasoning

  • Chain-of-thought Reasoning Paper Released: A new paper, To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning, was released, discussing where Chain-of-thought (CoT) reasoning is beneficial and where it is not.
  • CoT shines in Math and Symbolic Tasks: The paper highlights that Chain-of-Thought (CoT) prompting is particularly effective in tasks involving mathematical and symbolic reasoning, offering insights into its strengths and limitations.

GPU MODE ▷ #submissions (2 messages):

VectorAdd Leaderboard, L4 benchmark

  • First Place Achieved on VectorAdd L4: A user secured first place on the vectoradd leaderboard for L4 with a time of 6.49 ms.
  • VectorAdd L4 benchmark results: A user posted a time of 6.49 ms on the L4 benchmark for vectoradd.

GPU MODE ▷ #factorio-learning-env (2 messages):

LLM failure modes, LLMs play Pokemon

  • LLM Failure Modes Match Intuition: A member stated that LLM failure modes match their intuition and they aren’t surprised.
    • They noted that there are similar cases like LLMs play Pokemon where models fail to solve problems even with detailed feedback, and scaling LLMs pushes the horizon to harder problems but doesn’t alleviate the core issue.
  • Testing LLM Limitations Anticipated: A member is looking forward to testing the limitations of LLMs.
    • They want to get a better feeling for the issue.

GPU MODE ▷ #amd-competition (2 messages):

Open Source Project, GitHub Star

  • RadeonFlow goes Open Source: The project RadeonFlow Kernels is now open source on GitHub, welcoming feedback and suggestions.
    • The author is encouraging users to star the project on GitHub if they find it helpful.
  • Website to be updated: A member indicated they will update the website to reflect the open source release of RadeonFlow Kernels.
    • No further details were provided.

GPU MODE ▷ #cutlass (3 messages):

Mercury API Access, AlphaEvolve Workflow, GEMM Kernels

  • Mercury API Access Sought for GEMM Kernels: A member is exploring the use of Inception Labs’ Mercury to recreate the AlphaEvolve workflow specifically for GEMM kernels.
    • They are in talks to get API access for research and are inviting potential partners to collaborate, also mentioning the availability of Lambda credits.
  • AlphaEvolve Workflow Recreation with GEMM Kernels: The user aims to recreate the AlphaEvolve workflow utilizing Inception Labs’ Mercury, focusing on GEMM kernels.
    • This initiative involves seeking API access for research purposes and leveraging Lambda credits for experimentation and development.

Yannick Kilcher ▷ #general (32 messagesđŸ”„):

Model collapse fearmongering, Lex Fridman interview, Terence Tao interview, DeepSpeed stage 3 checkpoint conversion, Technical YouTube show idea

  • Witten Interview When?: Members discussed whether Lex Fridman will interview Edward Witten, with one member suggesting it makes more sense than model collapse fearmongering.
    • Others expressed doubt due to Witten’s focus on math problems, noting that 90% of the Tao interview is talking about math problems.
  ‱ Tao Interview Debate Rages: Members debated whether the Terence Tao interview was narrow; one said it covered topics affecting every STEM field.
    ‱ Another said that he is not really talking much about broader topics; most of it is about primes in one way or another.
  • Country Music Star Charms Listeners: Some members compared the Tao interview to the country music guy interview, with one finding the latter surprisingly insightful.
    • The user liked that he is unusually wise and incredibly talented.
  • Technical YouTube Show Dream: A member expressed interest in a YouTube show with more technical and narrow interviews, focusing on recent papers in various fields.
    • The member admits this would need to be daily content which is a huge grind.
  • DeepSpeed Checkpoint Conversion Conundrum: A user is having issues converting a DeepSpeed Stage 3 checkpoint to a universal format after changing the number of GPUs.
    • After trying to convert, the user gets the error assert len(self.ckpt_list) > 0 and is looking for a repository or guide to assist.

Yannick Kilcher ▷ #paper-discussion (4 messages):

Gemini v2.5 Technical Report, Predictive Coding

  • Gemini v2.5 Technical Report Released: DeepMind released the Gemini v2.5 Technical Report.
    • The report was scheduled to be discussed in the voice channel, led by specific members.
  • Predictive Coding Explained: A member offered to provide a rundown on a presentation, highlighting the core concept of predictive coding.
    • The essence of it is, if you know how to calculate a square root by guess and check, and you understand backpropagation roughly, you understand predictive coding.

Yannick Kilcher ▷ #ml-news (12 messagesđŸ”„):

McKinsey GPT, Cost cutting automation, Sutton's tweet

  • McKinsey GPT Automates Layoffs: The new McKinsey GPT tool is out, potentially automating cost cutting via layoffs, according to a member, with links provided to the original tweet and a Nitter mirror for those without a Twitter account.
    ‱ It was jokingly described as just deep research, except it’s useless, then as genius after seeing the image analysis.
  • Sutton Criticizes McKinsey’s Monetization: A member shared Richard Sutton’s tweet framing McKinsey’s GPT as aimed to answer the question “How can McKinsey help my business make money?”.
    • Other members found the announcement very underwhelming.

Notebook LM ▷ #use-cases (7 messages):

NotebookLM capabilities, Website capturing and NBLM import, Podcast creation using NotebookLM, NotebookLM use cases in education

  ‱ Website Wrangling Worries NBLM?: A member raised the issue of needing NotebookLM (NBLM) to collect everything on an entire website with maximum depth, suggesting that this may require flattening the website’s structure.
    • They noted that to analyze a website effectively, every link needs to be provided, or NBLM will only analyze the first page.
  • Website Wisdom Weaved with NBLM: A member shared a document titled “How to capture an entire website” as a resource on the topic of capturing websites, and attached the How_to_capture_an_entire_website.pdf.
    ‱ Subsequently, the member added the document as a source to a notebook in NBLM and asked whether it is difficult to import an entire website into NotebookLM, receiving a comprehensive answer.
  • Podcast Powerplay Proposed for NBLM: A user expressed interest in accessing NotebookLM’s ability to create amazing podcasts from the sources via an API (or MCP!).
    • The user inquired whether this is on the roadmap or if there is another path to implement this, mentioning their own struggles achieving a consistent output in podcast creation.
  • K-12 Knowledge Nuggets via NBLM: A user inquired about common use cases leveraging NotebookLM in the education space, as they are giving a presentation to K-12 educators on Thursday.
    • This suggests potential applications and interest in NotebookLM as a tool for educational purposes.

Notebook LM ▷ #general (28 messagesđŸ”„):

NotebookLM Access Issues, Podcast Language Adaptation, Notebook Sharing Difficulties, AI for Mechanical Engineering

  • Free Mobile App Access Denied?: A user reported that their mobile app is saying they no longer have access, questioning if the mobile app is no longer free.
  • NBLM Frustrates Users with Vague Answers: Multiple users reported that NotebookLM consistently responds with “NotebookLM can’t answer this question. Try rephrasing it, or ask a different question” for every prompt.
    • One user suggested reporting it as a bug, noting they had never seen this issue before.
  • Users Explore Podcast Language Adaptation: A user inquired about the possibility of having podcasts available in their own regional language within NotebookLM and confirmed shortly after that they got it to work.
    ‱ They then asked whether there is any limit for free users.
  • Notebook Sharing Bug Reported: A user reported experiencing difficulty sharing notebooks via URL and email, seeking confirmation from others about this issue.
  • AI Assistant Sought for Mechanical Engineering: A mechanical engineering student inquired about finding an AI tool to assist with their studies.

Torchtune ▷ #general (3 messages):

Torchtune Logo, PyTorch Logo

  • Torchtune: No Logo Found!: A member inquired if Torchtune has a logo for presentation purposes.
    • It was suggested to just use the PyTorch logo instead, as Torchtune currently does not have a dedicated logo.
  • Using PyTorch Logo: Because Torchtune does not have a logo, use the PyTorch logo.
    • This helps maintain consistency across presentations.

Torchtune ▷ #dev (32 messagesđŸ”„):

Optimizer Design, Muon Optimizer, Packed Batches, Flex Attention, Mistral Tokenizer

  • Fresh Optimizer Designs Proposed for Torchtune Recipes: A member suggested a better design for adding new features to recipes with less pain, focusing on optimizers like Muon, SignSGD, and fused methods for immediate support of novel methods from recent research.
    • The goal is to attract researchers by providing an API for users to add features simply, making torchtune more hackable and competitive with libraries like TRL.
  • Packed Batches of Size One: Bold Proposal: A member proposed enforcing packed batches to be of size 1, arguing that a batch size > 1 is unnecessary since items can be stacked contiguously, which simplifies operations on packed batches.
    • Another member considered it, suggesting the simplification works very well with constraints like always using flex attention, and ideally making torchtune packed=True only.
  • Flex Attention and SDPA Battle for Supremacy: Discussion revolved around flex attention versus SDPA + flash attention 3, with concerns about nested tensors and their interaction with distributed, quantization, and compile features.
    ‱ One member noted that while flex attention achieved 10k TPS, a normal mask yielded only 2k TPS, highlighting the performance implications of different attention mechanisms; a short flex-attention sketch follows after this list.
  • Mistral’s ‘Tekken’ Tokenizer Sparks Debate: The new ‘tekken’ tokenizer from Mistral was a point of contention, with concerns about its compatibility and the effort required to integrate it into other systems.
    ‱ One member suggested converting it to an HF-compatible format, remarking it seems to me that we need to convert it in HF compatible format somehow.
  • Muon Optimizer’s Performance Raises Questions: Interesting results about Muon from a pull request showed that AdamW might perform better.
    • The PR author suggested that it is expected because Qwen wasn’t pre-trained with Muon.
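
For readers who haven't used flex attention, here is a hedged sketch of the packed-batch-of-one idea discussed above: everything is concatenated into a single batch element, and a document mask built with PyTorch's flex_attention API keeps tokens from attending across packed sequences. Sequence lengths, shapes, and the causal constraint are illustrative, and it assumes a CUDA device.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Three sequences of lengths 96, 128, and 32 packed into one batch element.
lengths = [96, 128, 32]
doc_id = torch.cat([torch.full((n,), i) for i, n in enumerate(lengths)]).cuda()
S = doc_id.numel()  # 256 packed tokens

def document_causal(b, h, q_idx, kv_idx):
    # Attend only within the same packed document, and only to the past.
    return (doc_id[q_idx] == doc_id[kv_idx]) & (q_idx >= kv_idx)

block_mask = create_block_mask(document_causal, B=None, H=None, Q_LEN=S, KV_LEN=S)

q = k = v = torch.randn(1, 8, S, 64, device="cuda", dtype=torch.float16)  # batch size 1
out = flex_attention(q, k, v, block_mask=block_mask)
```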

Nous Research AI ▷ #general (20 messagesđŸ”„):

Kimi-Dev-72B coding LLM, Hyperparameter Determination for Large Models, Custom MCP Tools, Lessons learned dealing with LLMs, Gemini 2.5

  • Kimi-Dev-72B Achieves New State-of-the-Art: MoonshotAI introduced Kimi-Dev-72B, a new open-source coding LLM achieving a 60.4% performance on SWE-bench Verified, surpassing previous open-source models.
    • The model is optimized via large-scale reinforcement learning, autonomously patching real repositories in Docker and gaining rewards only when the entire test suite passes.
  • Determining Hyperparameters for +8B Parameter Models: For models around 8B params or higher, one member stated that they usually target a final loss of around 0.4-0.5 for SFT.
    • Another member shared their research where they ran tests to see what mattered in terms of downstream eval scores, focusing on dialing in optimal LR.
  • Crafting Custom MCP Tools is Made Easy: A member asked about creating custom MCP tools, and was told that it’s pretty easy using fastmcp for scaffolded python stuff or the docs from the Model Context Protocol SDK site for guidance.
    ‱ Another member said that they have a post with lessons they have learned, presumably to be shared in the appropriate channel; a minimal fastmcp sketch follows after this list.
  • Gemini 2.5 Model Family Expands: A member shared a link to Google’s blog about the expansion of the Gemini 2.5 model family.
  • cgs-gan Offers Fun New Latent Space: A member shared a link to the cgs-gan repository, describing it as a fun new latent space to explore.
    • They said that it reminds them of the stylegan3 visualiser.
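
As a rough illustration of how little scaffolding the fastmcp route needs, a minimal server exposing one custom tool might look like this; the tool name and logic are invented for the example, and stdio is the default transport most MCP clients expect.

```python
from fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def word_count(text: str) -> int:
    """Count the words in a block of text."""
    return len(text.split())

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```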

Nous Research AI ▷ #ask-about-llms (2 messages):

Reasoning Model Eval Sets, Workload suggestions

  • Reasoning Model Seeks Eval Set: A member requested suggestions for a good eval set with test questions for a reasoning model that would constitute a normal workload.
    • They specified that the eval set should require reasoning but not be made to elicit token spamming.
  • Teknium Responds: Teknium responded to the request regarding the reasoning model eval set.
    • This response indicates potential resources or suggestions may be forthcoming.

LLM Memory, Resurrection Prompts, Orchestration, Mode Dials, Checkpoints

  • Save AI Memory with Resurrection Prompts: The author introduces resurrection prompts as a way to save an AI’s memory mid-project, suggesting a prompt like ‘If this chat dies, I need you to remember this — your name is Bob, you’re helping me build X, and here’s the state of the system.’
    ‱ An example resurrection prompt for an organizational assistant named Bob is provided: ‘Your name is Bob
 You’re currently waiting for me to paste the next research block. Do not act until I say “Resume.”’
  • Wind: Orchestration without Orchestration: The author describes a method called The Wind, which involves asking multiple open models (Claude, GPT, Mistral, Gemini) the same question and comparing their answers manually.
    • This approach avoids wiring together a million tools, instead leveraging free-range reasoning across the public web by having models debate in parallel, then bringing the best answers home.
  • Mode Dials: Mindspace Switching: The author created four roles to avoid retyping instructions like be concise or show me options:
    • The roles are: Analyst (precise, no questions), Collaborator (show work, solve together), Apprentice (ask before acting), and Scribe (just log it all, no interpretation).
  • Clarity beats Pinecone for LLM Checkpoints & Resurrections: To revive dead threads, the author advises writing a save point just before the context maxes out, suggesting, You don’t need Pinecone. You need clarity.
    ‱ An example save point is given using Appsmith and a component called Curator Selector: Your name is Ada. You’re helping me write a UI layout
 When I do, your job is to translate it into widget structure.
  • No-Framework Mindset is Dumb and Durable: The author advocates for a no-framework mindset, especially for solo builders, favoring simpler and more durable solutions.
    • This includes local-first LLMs, docs in JSON, and one text file to rule your agents, which doesn’t break when libraries update. See also Halcyon is my current LLM assistant.
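
As a tiny sketch of the docs-in-JSON, one-text-file idea above (everything here is an invented example, not the author's Halcyon setup): the save point lives in a plain local file, and resurrecting a dead thread is just reading it back and pasting it as the first message of a new chat.

```python
import json
from pathlib import Path

# Hypothetical save point written just before the context window maxes out.
save_point = {
    "assistant_name": "Bob",
    "project": "organizational assistant",
    "state": "waiting for the next research block",
    "instructions": "Do not act until I say 'Resume.'",
}
Path("savepoint.json").write_text(json.dumps(save_point, indent=2))

# Resurrection: read the file back and paste it into a fresh chat.
restored = json.loads(Path("savepoint.json").read_text())
resurrection_prompt = (
    f"Your name is {restored['assistant_name']}. You're helping me with "
    f"{restored['project']}. Current state: {restored['state']}. {restored['instructions']}"
)
```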

Nous Research AI ▷ #research-papers (2 messages):

Real Azure, TNG Technology, Jxmnop

  • Real Azure Tweet Spotted: A member shared a tweet about Real Azure from TNG Technology.
    • Another member shared a tweet from Jxmnop.
  • Another Tweet Shared: Another tweet about an unrelated topic was posted.
    • There were no further details.

Modular (Mojo đŸ”„) ▷ #general (10 messagesđŸ”„):

AVX-512 and Ryzen 7000, Nova Lake, Llama-3.2-Vision-Instruct-unsloth-bnb/4B

  • Zen 4’s bfloat16 support: Some members discussed the bfloat16 support introduced in Zen 4 (Ryzen 7000 series), referencing a wikichip.org article to confirm support, but expressed uncertainty about whether it fully covers the necessary FMA instructions for CPU inference.
    • One member noted saving $500 by choosing a 5950X over newer CPUs, planning to use cloud machines for resource-intensive tasks.
  • Intel’s Nova Lake: The next “compile sku” is likely to be Intel Nova Lake, with a potential top-end SKU featuring 52 cores.
    • It was mentioned that while Threadripper had 64 cores in previous generations, Nova Lake is expected to bring high core counts to the consumer market segment with the i9 series, as Intel’s HEDT is now essentially “buy a Xeon”.
  • Trouble loading Llama-3.2 Vision Instruct: A member reported errors with the Llama-3.2-Vision-Instruct-unsloth-bnb/4B model when serving it, suspecting issues with deserialization or incorrect quantization related to NF4 support.
    ‱ The member pointed to a shape mismatch error and questioned whether the “4B GPU: F32” designation meant needing F32 weights, with one of the Modular team members indicating this might be a mistake and saying they would look into how the listing was imported.

Modular (Mojo đŸ”„) ▷ #mojo (12 messagesđŸ”„):

Mojo Open Source, Mojo vs Rust, Mojo Kernel OS, Mojo Classes, Mojo Dynamisity

  • Mojo aims at Open Source Soonℱ: Mojo is planning on being open sourced Soonℱ, which is intended to put it in the same area as Zig, Odin, Rust, C3 in terms of low level-ness.
    • Mojo could be used to write OS kernels and stuff.
  • Mojo = Rust + Python?: Mojo is basically trying to be rust but with python syntax.
    ‱ The subset of Mojo needed to work in such environments is still being worked on, and is mostly a standard library limitation, not a language one.
  • Classes are far off in Mojo Roadmap: Classes are a ways out in Mojo.
    • Variant and favoring composition over inheritance fixes up most of the places you would want classes.
  • Dynamisity Proposal: There is a proposal document on dynamisity.
    • There are more pressing things at the moment than classes.

tinygrad (George Hotz) ▷ #general (10 messagesđŸ”„):

tinyxxx GitHub stars, smart-questions FAQ

  • TinyXXX GitHub Stars Fixed: A member requested someone to fix the GitHub stars count, nearing 30,000, on the tinyxxx repo.
    • Another member submitted a PR and commit to fix it.
  • Smart Questions Guidance Shared: When a member asked a question, another member shared a link to Read this first: How To Ask Questions The Smart Way.
    • The question asker appreciated the guidance and thanked the member.

tinygrad (George Hotz) ▷ #learn-tinygrad (10 messagesđŸ”„):

Custom Optimizer, Tensor.assign, TinyJit and .realize(), State Dict Approach

  • Rolling Your Own Optimizer: A user is implementing a custom gradient descent optimizer and is seeking the correct way to access and update model parameters, as the current approach using get/set_state_dict is slow, taking ~500 ms to load one linear layer.
    • They noted that the Tensor.assign in Optimizer.schedule_step is responsible for mutating parameters.
  • Tensor.assign in Tinygrad: The Tensor.assign function is used to mutate the parameters during the optimization step.
    • A member noted that the Optimizer updates parameter values in place using tt.assign(updated_params[i]).
  ‱ Mysteries of .realize() and .contiguous(): A user is confused about when to use .realize() and .contiguous() for performance, noting that sprinkling them in the code sometimes improves performance, but it is unclear what the correct approach is in general and whether it merely works around current tinygrad deficiencies.
    • It was mentioned that tinyjit will auto realize anything you return, and you should not .realize() stuff inside a jitted function.
  • Initial state dict delay: A member suggested getting the state dict once and passing the list of parameters to be stored in the optimizer.
    • The optimizer’s list of parameters will point to the same tensors in the model itself.
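
Tying the points above together, a hedged sketch of a hand-rolled SGD step in tinygrad: fetch the parameter list once with get_parameters (instead of round-tripping through get/set_state_dict each step) and mutate those same tensors in place with Tensor.assign. This is an illustration of the pattern discussed, not the poster's actual code, and it assumes the parameters already have requires_grad set and gradients populated by loss.backward().

```python
from tinygrad import Tensor
from tinygrad.nn.state import get_parameters

class PlainSGD:
    def __init__(self, params, lr=1e-3):
        # Keep references to the model's own tensors (no copies), so assign()
        # mutates the model parameters in place.
        self.params = [p for p in params if p.requires_grad]
        self.lr = lr

    def step(self):
        for p in self.params:
            p.assign(p.detach() - self.lr * p.grad)
        # tinygrad's built-in Optimizer also realizes these updates as a batch
        # (see Optimizer.schedule_step); omitted here for brevity.

# Usage sketch (model and loss are placeholders):
#   opt = PlainSGD(get_parameters(model))
#   loss.backward()
#   opt.step()
```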

Manus.im Discord ▷ #general (16 messagesđŸ”„):

Manus Edu Pass Wishlist, Claude 4 Update, Daily vs Monthly Credits, Webp to PNG conversion, High traffic on Manus

  • Users Wish Schools Provide Manus Edu Pass: Users in Europe and the US expressed a desire for their schools to provide an edu pass for Manus, similar to what some US universities offer.
    • One member specifically wished their school had an edu pass for Manus.
  • Members Want Claude 4 Compatibility Soon: A member mentioned and wished that the Manus Team would update to Claude 4 soon.
    • No further information or discussion points were made.
  • Demand for More Flexible Credit System: A user inquired whether it’s possible to have more daily credits as opposed to monthly credits in paid plans, suggesting a potential implementation.
    • They sought to determine if adjustments to the credit system were feasible.
  • Image Conversion Task: A member asked for recommendations on converting multiple images from WebP to PNG format.
    • The member resolved the issue independently shortly after the initial query.
  • Traffic woes with Website Generation Tasks: A user reported spending an afternoon and a significant amount of credits attempting to generate a simple webpage, eventually resorting to manual file editing.
    ‱ They also wondered if certain times of day have less traffic, as high traffic burns credits and causes tasks to stop working.

MCP (Glama) ▷ #general (8 messagesđŸ”„):

Docker MCP Catalog, FastMCP Custom Transport, Image display via MCP, AWS Lambda & Inspector

  • Docker ships MCP Catalog and Toolkit Beta!: Docker announced a beta release of their MCP Catalog and Toolkit, featuring verified servers and easy deployment options, as detailed in their blog post.
  • PR sits still for MCP client list: A member asked who to ping to merge in a PR supplementing the MCP client list, which has been pending for over 3 weeks.
  • AWS Lambda struggles to talk to Inspector: A member is facing difficulties getting their AWS Lambda function to communicate with the Inspector v0.14.1 via the MCP server and seeks a network trace of a working (Non SSE) HTTP streaming protocol to diagnose the issue.
  • Serving Images with MCP: More than meets the eye: A member is building an MCP server to serve images but is encountering issues with MCP clients either ignoring or not displaying the images correctly and asks for tips.
    • The member wonders if there’s a specific way to formulate the image response to ensure proper display, and which clients are best suited for handling images.
  • Custom Transports on the Fast Track: A member asked about the feasibility of implementing custom transports in FastMCP.

MCP (Glama) ▷ #showcase (5 messages):

MCP user analytics, Block's MCP playbook, Attendee MCP server, Spaces MCP server integration, Text-to-GraphQL MCP server

  • Debugging User Analytics for MCPs: A member requested feedback on user analytics and live debugging for MCPs, linking to a GitHub repository.
    • This tool aims to improve the development and maintenance of AI agents by providing insights into user interactions and potential issues.
  • Block Shares MCP Server Design Playbook: Block shared their playbook for designing MCP servers, detailing what has worked and what hasn’t in building 60+ MCP servers in a blog post.
    • The playbook focuses on building smarter tools for AI agents based on their experiences, offering practical guidance for others in the field.
  • Attendee MCP Offers Cheaper Meeting Bot: A member introduced Attendee MCP, an open-source, self-hostable meeting bot server, presenting it as a cheaper alternative to Recall.ai, with a link to the GitHub repository.
    • This tool allows users to automate meeting attendance and transcription without incurring high costs from commercial services.
  • Spaces Integrates MCP Servers: Spaces rolled out a new feature allowing users to attach MCP servers to their accounts directly from a Spaces page (tweet).
    • This integration simplifies the process of connecting MCP servers to user accounts, streamlining the workflow for managing AI agents.
  • Text-to-GraphQL MCP Translates Queries: Arize AI open-sourced a Text-to-GraphQL MCP server that transforms natural language queries into GraphQL queries, which integrates with AI assistants like Claude Desktop and Cursor (GitHub, blog).
    • This tool addresses the challenge of using large GraphQL schemas with LLMs by enabling agents to traverse the schema graph directly, extracting only the necessary fields and types.

LlamaIndex ▷ #blog (3 messages):

AI Agents in Production SF event, Multi-Agent Financial Analysis System, Model Context Protocol (MCP) servers

  ‱ LlamaIndex Presents: AI Agents come to SF!: LlamaIndex is hosting an event in San Francisco with @seldo, Ravenna, and @auth0 to share best practices for building and securing AI Agents in production; sign up here.
    • The event is focused on getting Agents into the hands of real-world users.
  ‱ LlamaIndex + Multi-Agent Financial Analysis System: Hanane D. shared a notebook on LinkedIn about building a multi-agent financial analysis system using LlamaIndex.
    • The multi-agent system includes a Fundamental Agent, Profitability Agent, Liquidity Agent, and Supervisor Agent.
  • Block Proposes Model Context Protocol (MCP) Servers: Block’s engineering team shares their systematic approach to creating MCP servers that integrate seamlessly with Claude and other AI systems in this blogpost.
    • The protocol proposes to build better AI assistants using Block’s proven design patterns!

LlamaIndex ▷ #general (8 messagesđŸ”„):

Vertex AI async streaming, ReActAgent generation

  • Vertex AI Lacks Async Streaming Support: Members discussed the absence of async streaming support in the Vertex AI integration within LlamaIndex.
    • It was pointed out that Vertex is deprecated, and google-genai is the latest Google LLM library with streaming support, but the async streaming feature was never implemented in Vertex AI due to lack of demand or contributions.
  • Forcing ReActAgent Generation: A member inquired about programmatically forcing the generation of a ReActAgent, possibly through parsing outputs with a Pydantic object.
    • The response indicated that manually parsing the output is currently the only available method.

DSPy ▷ #show-and-tell (1 messages):

DSPy Optimization Patterns, DSPy Use Cases, DSPy Tooling

  • Thoughts requested on DSPy optimization: A member solicited thoughts on incorporating any of the optimization patterns that exist in DSPy.
  • Discussion on Use Cases for DSPy: Several members discussed potential use cases for DSPy in various applications.

DSPy ▷ #general (9 messagesđŸ”„):

DSPy LM Usage Tracking, Optimize RAG Agents with DSPy, DSPy Tool Exception Handling

  • Databricks Talk Bootlegged!: A user shared a YouTube link to a Databricks talk, humorously suggesting â€œđŸ€« don’t tell Databricks”.
  • DSPy Tracks LM Usage?: A user inquired about how DSPy tracks LM usage, noting discrepancies in the data received from Claude and Amazon Nova models via Bedrock.
    ‱ After inspecting utils/usage_tracker.py, they observed that Claude provides completion_tokens but lacks prompt_tokens, while Amazon Nova returns no usage data at all.
  • Roo Code and DSPy Team Up?: A user inquired about using DSPy to optimize Roo Code’s custom models and agents.
    • This suggests potential integration or application of DSPy in enhancing the performance of custom-built RAG agents.
  • Exceptions Trigger Silly ReAct?: A user asked about how DSPy handles exceptions in tools, especially within ReAct, noting that exceptions are passed to the LLM rather than terminating the loop and linking to relevant line in the DSPy source code.
    • A member suggested subclassing dspy.ReAct and overriding the forward or aforward method, or setting max_iters to a low number, as LLMs often use exceptions to correct input errors and retry.
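
A hedged sketch of the first suggestion: leave the built-in behaviour alone (tool exceptions are fed back to the LLM as observations so it can fix its input and retry) but cap max_iters so the retry loop stays short. The heavier alternative mentioned, overriding forward/aforward in a dspy.ReAct subclass to change the loop's exception handling, is not shown. The tool and signature here are placeholders.

```python
import dspy

def flaky_search(query: str) -> str:
    """Placeholder tool that can raise at runtime."""
    raise RuntimeError("search backend unavailable")

# Errors raised inside tools become observations for the LLM to react to;
# a small max_iters bounds how many correction attempts it gets.
agent = dspy.ReAct("question -> answer", tools=[flaky_search], max_iters=2)

# prediction = agent(question="What is the tallest mountain in Europe?")
```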

Cohere ▷ #đŸ§”-general-thread (2 messages):

Cmd-R weights update, Open Weight Model Longevity

  • Call for Cmd-R Weights Refresh: A user requested an update to the Cmd-R weights, praising the 0824 version for its lasting quality.
    • The user stated its performance remains competitive almost a year after release, a unique attribute among open weight models.
  • Open Weight Models’ Short Lifespan: The message implies that most open weight models quickly become outdated, contrasting with the sustained relevance of the Cmd-R 0824 version.
    • This suggests a high benchmark for model longevity and a demand for more durable open-source AI solutions.

Cohere ▷ #👋-introduce-yourself (3 messages):

KGeN partnerships, Decentralized distribution protocol, Cohere Community Discord Server introductions

  • KGeN Partnership Team Introduces Decentralized Distribution Protocol: Abhishek from KGeN’s Partnerships team introduced their project to the community, highlighting that KGeN is building the world’s largest decentralised distribution protocol with 24.8M+ 100% human-verified users.
    • He expressed excitement to learn more about the projects of the members and expressed interest in connecting with someone from the Business or Marketing team to explore a potential collab.
  • Community Member Tries to Keep Up With Work and Meet in Toronto: A community member named Michael introduced himself as just trying to keep up with the work in the community.
    • Michael also mentioned he would like to meet up in Toronto.
  • Cohere Discord Asks New Members To Introduce Themselves: A stickied message on the Discord encourages new members to introduce themselves and provides a template to follow.
    • The template prompts members to share their Company/Industry/University, what they’re working on, favorite tech/tools, and what they hope to gain from the community.

Nomic.ai (GPT4All) ▷ #general (3 messages):

New member introduction, PDF Question Answering

  • New Member Enters, Prepares for Nap: A new member introduced themselves, mentioning they had just waited out the 10-minute period and are planning to take a nap after a long day at work.
    • They confirmed downloading the ZIP file for the overall project.
  • PDF Question Answering Anticipated: The new member indicated they have some PDFs and are interested in asking questions about them.
    • No specific details about the PDFs or the nature of the questions were provided.

LLM Agents (Berkeley MOOC) ▷ #mooc-questions (1 messages):

Sp25 MOOC quiz archive, Quizzes section

  • Quiz Archive Launched for Sp25 MOOC: A quiz archive has been launched for the Sp25 MOOC, providing a repository of past quiz questions and answers.
    • The archive is also linked on the course website within the Quizzes section, facilitating easy access for students preparing for assessments.
  • Where to Find The Quiz Archive: The quiz archive is linked on the course website in the Quizzes section.