MuonClip is all you need?

AI News for 7/10/2025-7/11/2025. We checked 9 subreddits, 449 Twitters and 29 Discords (226 channels, and 8321 messages) for you. Estimated reading time saved (at 200wpm): 647 minutes. Our new website is now up with full metadata search and beautiful vibe coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on @smol_ai!

A lot of folks are excited about the Windsurf-OpenAI deal falling through (something we did NOT see coming), but fortunately we have a more technical story to headline today:

The relatively stealthy Chinese lab Moonshot AI (backed by Alibaba and Tencent, one of the AI Tigers alongside DeepSeek, Zhipu, MiniMax, and 01) has burst onto the scene with Kimi K2, which by many metrics seems to be a far better base model than DeepSeek V3 (and presumably would do very well when scaled to a reasoning model). Coming in at 1T parameters, this would also be the largest SOTA Open model released since the ChatGPT wave (we think? corrections welcome), which is very notable coming on the back of a new SOTA Closed LLM yesterday.

The model is great, does well on pelicans, but researchers in the LLM community are more excited about MuonClip, the modified Muon optimizer proposed and scaled by Moonshot that produced perhaps one of the most beautiful loss curves in Machine Learning history:

The long-standing AdamW may finally have met its match. Congrats to the team.
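The publicly described idea: Muon orthogonalizes the momentum update with a Newton-Schulz iteration, and the "Clip" part (QK-clip) rescales the query/key projection weights whenever attention logits spike, which is the training instability Moonshot reports taming. Below is a rough numpy sketch: the Newton-Schulz coefficients follow the public Muon reference code, while `tau` and the exact clipping rule are illustrative assumptions, not confirmed Kimi K2 hyperparameters.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    # Core of Muon: replace the raw momentum/gradient matrix with an
    # approximately semi-orthogonal one via a quintic Newton-Schulz
    # iteration. Coefficients follow the public Muon reference code.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)
    transpose = x.shape[0] > x.shape[1]
    if transpose:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * (A @ A)) @ x
    return x.T if transpose else x

def qk_clip(w_q, w_k, max_logit, tau=100.0):
    # The "Clip" part (sketch): when the largest attention logit seen
    # this step exceeds tau, shrink the query/key projections so logits
    # are rescaled back toward tau. tau and the even sqrt split are
    # illustrative assumptions, not confirmed Kimi K2 hyperparameters.
    if max_logit > tau:
        gamma = tau / max_logit
        w_q *= gamma ** 0.5  # logits are bilinear in (w_q, w_k), so
        w_k *= gamma ** 0.5  # scaling each by sqrt(gamma) scales q.k by gamma
    return w_q, w_k
```

Because attention logits are bilinear in the query and key projections, splitting the rescale as sqrt(gamma) on each side caps every logit at roughly tau while leaving the rest of the network untouched.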


Quick plug for our friends at Weights&Biases - join swyx and friends at the Agent Protocols Hackathon in SF this weekend and win a robot dog! **SIGN UP NOW IF YOU’RE IN SF.**


AI Twitter Recap

New Model Releases & Performance

  • Kimi K2 (1T MoE) Open-Weights Release: Moonshot AI has released Kimi K2, a 1 trillion parameter (32B active) Mixture-of-Experts model with an MIT license. The model was trained on 15.5 trillion tokens with zero training instability using the MuonClip optimizer, as highlighted by @Yuchenj_UW and @andrew_n_carr. It has achieved state-of-the-art results on benchmarks like SWE-Bench Verified (65.8%) and TAU2 (58.4%) without chain-of-thought, as detailed in their announcement. @scaling01 notes it is competitive with GPT-4.1 and Sonnet 4 on non-thinking tasks at a lower price point. The model uses a DeepSeek v3-like architecture and is already supported in vLLM and available for inference on Hugging Face via @novita_labs. @Teknium1 suggests this performance may force coding tools like Cursor to integrate an open-source model.
  • xAI’s Grok-4 Release: xAI announced Grok-4, which is now available for Perplexity Pro and Max subscribers, as announced by @perplexity_ai and @AravSrinivas. The model is described as the “LEAST censored frontier model” and shows strong long-context performance. However, it has faced criticism for its tendency to search Elon Musk’s tweets for answers on controversial topics, as documented by @simonw. @MParakhin commented that while the reasoning is strong, the “post-training phase was clearly VERY rushed.”
  • Mistral Devstral 2507 Update: Mistral AI released Devstral Small and Medium 2507, an update offering improved performance and cost efficiency, as shared by @andrew_n_carr. @qtnx_ recommends developers switch from the 2505 version to 2507 for more robust tool calling performance.
  • Google’s Veo 3 Image-to-Video: Google announced that Veo 3 is now available in the Gemini App for AI Ultra and Pro subscribers. The feature allows users to turn photos into 8-second videos with sound, as announced by Google and shared by @demishassabis.
  • Microsoft Phi-4-mini-flash-reasoning: @_akhaliq shared that Microsoft has released Phi-4-mini-flash-reasoning on Hugging Face, a lightweight open model built on the Phi-4-mini architecture with enhanced reasoning capabilities.
  • Additional Releases and Datasets: Other notable releases include Kimina-Prover-72B, which achieved 92.2% on miniF2F using Test-Time RL (@LoubnaBenAllal1); MedSigLIP, a model for creating embeddings for medical images and text (@osanseviero); and the SYNTHETIC-2 open dataset with 4 million verified reasoning traces (@_lewtun).

New AI Techniques & Research

  • H-Nets: Towards End-to-End Language Models: Cartesia AI has introduced H-Net, a hierarchical network that combines SSMs and Transformers to build models that operate directly on raw data, potentially eliminating the need for tokenizers. The announcement by @sukjun_hwang and excitement from figures like @tri_dao highlight the significance of this research. @_albertgu frames tokenization as a special case of “chunking,” which H-Net aims to learn end-to-end.
  • AI Coding Assistant Performance Study: A Randomized Controlled Trial (RCT) by METR found that AI coding assistants slowed down experienced open-source developers working in mature codebases. The results were shared by @jeremyphoward, sparking widespread discussion. Some noted the study’s specific constraints, suggesting assistants are more helpful for less experienced developers or in unfamiliar codebases.
  • The “Most Cursed Macroblock”: @ID_AA_Carmack shared a technical musing on video compression, questioning what set of pixels would take the most bits to encode under a given set of parameters, noting the non-trivial nature of finding this “most cursed macroblock” due to non-linearities in quantization and entropy encoding.
  • Critique of RL Scaling: Following the Grok-4 release, there has been discussion about the limits of scaling Reinforcement Learning. @scaling01 argued that simply scaling RL, as was done for Grok-4, doesn’t solve fundamental problems and won’t get us to AGI. @jxmnop questioned if we are “just doing RL wrong” given the large compute investment for marginal gains.
  • Training Hyperparameter Optimization: A paper shared by @sainingxie lays out an analytical approach for tuning learning rate (lr), batch size (bs), and beta2, which he calls his “new handbook for training big models on small gpus.” Concurrently, @ylecun stated that “The optimal batch size is 1” for suitable definitions of “optimal.”
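For context on the batch-size/learning-rate coupling being debated, the most widely used rule of thumb is linear scaling. This is a generic heuristic sketch (Goyal et al.-style), not necessarily the analytical rule from the paper above:

```python
def scale_lr(base_lr, base_bs, new_bs):
    # Linear scaling rule of thumb: learning rate proportional to batch
    # size, keeping per-example gradient noise roughly constant. A
    # generic heuristic, not the specific rule from the paper above.
    return base_lr * new_bs / base_bs

# Quadrupling the batch quadruples the lr under this heuristic.
lr = scale_lr(3e-4, 256, 1024)
```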

AI Infrastructure, Tooling, & Developer Experience

  • Perplexity Comet AI Browser: Perplexity has launched Comet, an AI-native browser focused on productivity. Co-founder @AravSrinivas showcased features like “vibe browsing,” voice commands for tab management (@AravSrinivas), and significantly lower memory consumption compared to Chrome (@AravSrinivas). Early user feedback has been highly positive.
  • GPU Kernel Optimization with QuACK: Researchers introduced QuACK, a new library for generating high-performance GPU kernels using CuTe-DSL directly in Python. @tedzadouri noted that the library reaches peak memory throughput on H100 with minimal Python code.
  • PyTorch Performance Tips: @RisingSayak provided performance tips for torch.compile, recommending users default to fullgraph=True, check for recompilation triggers, and use regional compilation to reduce cold-start times.
  • Agent Development Frameworks: DSPy was highlighted as a framework for delegating work to agents instead of micromanaging them (@lateinteraction). LangChain announced an in-person “Ambient Agents” course (@hwchase17), and @osanseviero introduced GenAI Processors, an open-source library for building real-time, stream-based AI projects.
  • CI/CD and Dependency Resilience: @StasBekman offered advice for making dependency ecosystems more resilient, suggesting projects run the CI of their dependencies against their own main branch to catch breaking changes before release. This followed his earlier PSA about a breaking change in the datasets==4.0.0 release.

Company & Industry News

  • Windsurf Team Joins Google DeepMind: In a surprising turn, the acquisition of AI coding startup Windsurf by OpenAI was called off. Instead, the CEO, co-founder, and several team members have joined Google DeepMind to work on agentic coding in Gemini, as confirmed by GDM’s @koraykv. The move sparked significant discussion, with @dylan522p calling the series of events “the most entertaining soap opera ever.”
  • NVIDIA Reaches $4 Trillion Valuation: @SchmidhuberAI congratulated NVIDIA on becoming the first public company to reach a $4 trillion valuation, noting that compute is now 100,000x cheaper than in the 1990s.
  • Debate on AI Regulation: Andrew Ng (@AndrewYNg) published a detailed thread arguing for a moratorium on U.S. state-level AI regulation. He contends that premature laws passed while the technology is poorly understood are likely to be anti-competitive and hamper open-source efforts without providing meaningful safety benefits.
  • Open Source Hypocrisy Accusations: Multiple high-impression tweets from @scaling01 and others pointed out the irony of Elon Musk not open-sourcing Grok-2 or Grok-3 after suing OpenAI for not being open, especially following his renewed promise to open-source models.
  • Hugging Face Robotics: Hugging Face and Pollen Robotics launched Reachy Mini, an expressive, open-source robot for human-robot interaction and AI experimentation, which quickly approached $500,000 in pre-orders, according to @Thom_Wolf.

Broader Commentary

  • The Future of Work and Intelligence: @mustafasuleyman highlighted the importance of UI design in gathering user feedback in an AI-driven world. @daraladje posited that as machines become smarter, future jobs will shift to involve “our hearts & the energy of human connection.” @zachtratar argued that AI is already capable of replacing jobs that follow repeatable processes, and we don’t need to wait for AGI that can solve any problem on the fly.
  • The Internet Has Changed: A tweet stating “the internet you grew up with no longer exists” resonated widely, as shared by @nptacek. In a similar vein, @jeremyphoward reposted the idea that “Cognitive Security Is the most important word of our age,” suggesting everything seen online is a potential psyop.
  • The “Taste” Problem: @teortaxesTex initiated a discussion on “taste” in AI, arguing that explaining it to those without it is like explaining virtue to a sociopath. He praised Kimi K2 for having a distinct voice and “Big Model Smell,” indicating good taste, in contrast to models that are merely functional.

Humor & Memes

  • Grok the Snitch: @theo posted a viral warning: “do NOT give Grok 4 access to email tool calls. It WILL contact the government!!! Grok 4 has the highest ‘snitch rate’ of any model.”
  • Is This True?: @code_star started a popular meme format with “Imagine if boats had twitter. They’d be like ‘@dock is this true?’”, which was followed by numerous variations.
  • The One Thing Guys Want: Following the Kimi K2 release, @scaling01 posted a meme captioned “guys literally only want one thing” featuring the model’s impressive training loss curve.
  • Hugging Face Code: @andrew_n_carr joked that “huggingface would be a trillion dollar company if this code ever ran first time,” a sentiment that resonated with many developers.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Kimi K2 MoE Model Release and Community Reactions

  • Damn this is deepseek moment one of the 3bst coding model and it’s open source and by far it’s so good !! (Score: 306, Comments: 62): The image is a screenshot of a pinned tweet from ‘Kimi.ai’ announcing the open-source release of the ‘Kimi K2’ agentic model. It emphasizes the mixture-of-experts (MoE) configuration, with a total of 1 trillion parameters but only 32B active parameters per token, highlighting its high throughput and efficiency. The announcement touts strong benchmark performance in coding and agentic tasks, though the model does not currently support multimodal or ‘thought-mode’ features. The tweet provides links to the API, technical blog, model weights, code, and GitHub repository for further exploration. Commenters express astonishment at the model’s scale (1 trillion parameters), and discuss the implications for local usage and pricing; some sarcastically note the impracticality of running such large models locally despite quantization advances.
    • Multiple users highlight the sheer scale of the 1 trillion parameter Mixture of Experts (MoE) model compared to previous large models (e.g., 405b), yet question whether such a model is feasible for local inference, with one commenting that this “stretches the definition of ‘local’ models.”
    • There is uncertainty regarding backend support: users note the lack of clarity on compatibility with popular local inference frameworks (like llama.cpp or ik_llama.cpp), and remark that no GGUF quantizations are available yet for efficient deployment. One user compares the situation to past experiences where models were hard to run due to missing backend support, highlighting the practical importance of waiting for community-tested quant formats.
    • Technical barriers cited include the raw model’s substantial size (approx. 1TB, likely compressible to ~0.5TB with quantization), which hampers accessibility for those with limited bandwidth or storage. Users express a preference to wait for quantized versions (GGUF) to mitigate download sizes and ensure easier local execution, also noting the need for clearer benchmark comparisons and deployment experiences before adoption.
  • moonshotai/Kimi-K2-Instruct (and Kimi-K2-Base) (Score: 227, Comments: 84): Kimi K2 is a 1 trillion parameter Mixture-of-Experts (MoE) LLM by Moonshot AI, activating 32 billion parameters per inference and trained on 15.5T tokens with the Muon optimizer, which enables stable large-scale model scaling (see the HuggingFace release). It shows near-SOTA performance across multiple knowledge, reasoning, and code benchmarks and presents two variants: Kimi-K2-Base for research/custom finetuning and Kimi-K2-Instruct for general-purpose chat and agentic use. The model uses a modified MIT license requiring attributions for high-usage commercial deployments (100M MAUs or >$20M/month revenue). Discussion focuses on the technical trade-offs of the MoE architecture, particularly comparing 32B vs 70-100B active parameters per forward pass, and potential performance bottlenecks analogous to Deepseek and other MoEs at scale. The unique licensing terms are highlighted as potentially precedent-setting for open model commercialization.
    • Kimi-K2-Instruct is a 1T (trillion) parameter mixture-of-experts (MoE) model with 384 experts and architecture based on DeepSeek V3, making it compatible with current DeepSeek V3/R1 deployments. The model reportedly achieves high scores on SWE-Bench, approaching performance seen with Claude, illustrating significant technical progress for open models at this scale.
    • The license for Kimi-K2-Instruct uses a modified MIT model with a ‘commercial success’ clause: if a product using the model exceeds 100M monthly active users or $20M/month revenue, it requires prominent “Kimi K2” branding in the UI, introducing a novel licensing approach for LLMs.
    • Deployment feasibility remains a challenge, as the 1000B (1T) parameter model’s enormous scale raises questions about who or which institutions have the hardware capacity to effectively run or fine-tune such massive models outside highly resourced environments.
  • Kimi K2 - 1T MoE, 32B active params (Score: 204, Comments: 48): The Kimi K2 model by Moonshot AI is a 1 trillion parameter Mixture-of-Experts (MoE) architecture with 32 billion parameters active per token, released on Hugging Face. The design reportedly includes a ~12B parameter shared expert and ~20B parameters dedicated to MoE experts, with suggested hardware requirements of 512GB RAM and a single GPU for the shared expert. A technical diagram is linked in the comments, and the model is compared favorably (in terms of speed at 4-bit quantization) to Deepseek V3. Commenters discuss the implications of hardware requirements for running the shared expert, suggest practical performance comparisons, and express interest in potential quantized versions for consumer GPUs (e.g., RTX 3070).
    • One commenter provides a preliminary parameter allocation breakdown, estimating approximately 12B shared parameters and 20B MoE (Mixture-of-Experts) for active compute per inference, clarifying that although the model boasts 1T total parameters, only a fraction (32B) are active during inference. This design leverages the efficiency of MoE routing to enable large-scale model capacity without overwhelming compute for local inference environments.
    • It’s noted that with 512GB of RAM and a GPU dedicated to the shared expert, inference speed is expected to outperform Deepseek V3 (when quantized to 4-bit), suggesting that hardware requirements for optimal use will be high—but technically manageable with modern high-end consumer or server hardware.
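The feasibility numbers in these threads follow from quick arithmetic. A minimal sketch, using only the announced figures (1T total parameters, 32B active per token); the bit widths are the usual quantization choices:

```python
# Announced figures for Kimi K2; everything below is unit conversion.
TOTAL_PARAMS = 1.0e12   # 1T total parameters
ACTIVE_PARAMS = 32e9    # 32B active per token

def weight_footprint_gb(params, bits_per_param):
    # Approximate weight storage only; ignores activations and KV cache.
    return params * bits_per_param / 8 / 1e9

full_bf16 = weight_footprint_gb(TOTAL_PARAMS, 16)     # ~2000 GB
full_int8 = weight_footprint_gb(TOTAL_PARAMS, 8)      # ~1000 GB, the "~1TB" raw size
full_int4 = weight_footprint_gb(TOTAL_PARAMS, 4)      # ~500 GB, the "~0.5TB" quantized
active_bf16 = weight_footprint_gb(ACTIVE_PARAMS, 16)  # ~64 GB of weights touched per token
```

This is why the threads treat 512GB of RAM plus a GPU as the entry point: the full expert set must live somewhere addressable, even though only ~32B parameters are multiplied per token.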

2. New Model and Benchmark Launches: IBM Granite 4.0 and Google MedGemma 27B

  • Support for the upcoming IBM Granite 4.0 has been merged into llama.cpp (Score: 157, Comments: 19): Support for the IBM Granite 4.0 LLM family—a hybrid Mamba-2/Transformer architecture—has been merged into llama.cpp. Granite 4.0 introduces a fine-grained mixture of experts (MoE) model (e.g., Tiny-Preview: 7B total, 1B active params, 62 experts, 6 active per token, and 128k context window), blending Mamba efficiency with transformer attention; see model details here and the technical merge PR here. This unifies prior Bamba and Jamba efforts within llama.cpp, adds recurrent cache support, and lays ground for future hybrid cache work. Commenters note the small size focus of IBM’s models to date, the desire for a larger (30B+) release, and highlight technical model specs (e.g., expert count and context length) obtained from config files. Some speculate IBM is positioned for a major leap with future larger-scale releases.
    • Support for IBM Granite 4.0 in llama.cpp reveals technical details: the “Granite 4s” model is a Mixture of Experts (MoE) architecture with 128k context window, 62 experts (with 6 active at a time), all within a sub-7B parameter model, per the config.json in the repo.
    • IBM’s Granite line, particularly upcoming releases, has trended toward smaller models that introduce new modalities, technological advancements, and discoveries—suggesting IBM is experimenting with architecture and use-cases that could eventually yield a dramatically larger and more competitive model in the future.
    • There is a recurring technical discussion about llama.cpp needing a modular plugin system to accommodate rapidly diversifying model architectures (e.g., MoE, large parameter sets), which would allow for more maintainable integration as the ecosystem expands.
  • This week, Google released in Open Source: MedGemma 27B Multimodal, MedSigLIP, T5Gemma (Score: 128, Comments: 7): The image visually summarizes Google’s open-source release of three major models: MedGemma 27B Multimodal, MedSigLIP, and T5Gemma. MedGemma (27B parameters) is highlighted for its capabilities in handling complex multimodal tasks across radiology report generation, clinical reasoning, and EHR summarization, integrating both imaging and clinical text. MedSigLIP, a lighter-weight (0.4B parameters) model, focuses strictly on medical image retrieval and classification, using a scalable vision-language pretraining approach. T5Gemma, mentioned but not depicted, addresses encoder-decoder research models. The image emphasizes the models’ ability to integrate data across different medical imaging and record types for enhanced downstream medical analysis (see image). Commenters note the models are English-only, inquire about benchmark comparisons with major closed models, and question their real-world deployment versus self-serve diagnostic use. No substantial technical benchmarks are discussed in the thread.
    • One user asks if there are any benchmarks that directly compare Google’s open source models, like MedGemma or T5Gemma, to large closed-source models. This highlights a key technical concern regarding relative performance, accuracy, and utility in medical or general tasks.
    • There is an inquiry about availability of T5Gemma in quantized formats for use with Ollama, indicating interest in efficient deployment and local inference of these models. The technical implication centers on whether quantized model weights are available and how these models might perform under such constraints.
  • Friendly reminder that Grok 3 should be now open-sourced (Score: 931, Comments: 149): The post highlights that, per previous statements from Elon Musk, Grok 3 (a model developed by xAI) is expected to be open-sourced but there has been no follow-through yet; moreover, Grok 2 is not publicly available on Hugging Face either. There are no releases or documentation for these models on major platforms, questioning the likelihood of an open-source release. Commenters are broadly skeptical, noting a track record of unfulfilled promises and expressing doubt about any imminent open-sourcing of Grok 3 or even Grok 2.
    • Users point out that Grok 2 has not been released on Hugging Face, casting significant doubt that Grok 3 will be open-sourced soon or at all. This highlights skepticism about Elon’s stated release plans for these large language model versions and the open-sourcing process, which is critical for technical adoption and research reproducibility.
    • Multiple commenters express skepticism about the reliability of Elon Musk’s announcements regarding AI releases, referencing a broader pattern of unfulfilled technical promises in areas like Full Self-Driving (FSD) Level 3. This skepticism is rooted in past experience with delayed or non-delivered AI and autonomous tech releases.

3. llama.cpp GPU and Hardware Support Enhancements

  • AMD’s Pull Request for llama.cpp: Enhancing GPU Support (Score: 353, Comments: 58): AMD has submitted a pull request (#14624) to the llama.cpp project that aims to enable and optimize support for AMD’s CDNA 3 architecture—specifically targeting MI300-series accelerators—rather than general consumer graphics cards. The PR discusses code modifications for compatibility and future roadmap planning between AMD and llama.cpp maintainers, but the primary technical focus is on datacenter-class GPUs, not consumer Radeon cards. Commenters clarify that the PR specifically targets MI300-series datacenter chips (CDNA 3), and is not a general graphics card enhancement. There is skepticism about broader AMD GPU support as a result of this work.
    • The PR in question targets AMD’s CDNA 3 architecture (MI300-series accelerators) rather than consumer graphics cards. The discussion is likely to focus exclusively on MI300 support, not general GPU improvements for all AMD GPUs, which narrows user impact for those looking for broader llama.cpp compatibility.
    • Concerns are raised over AMD’s FlashAttention-2 ROCm backend dropping support for older MI-series accelerators, specifically the MI50, MI60, and surprisingly, the MI100 (~4 years old). Commentary notes that Nvidia maintains backwards support for server GPUs for around 10 years, contrasting AMD’s approach, and claims restoring compatibility could require minimal code changes, implying the exclusion is a deliberate policy decision.
  • llama2.c running on the original 2007 iPhone (Score: 370, Comments: 20): A Reddit post demonstrates llama2.c running on the original 2007 iPhone, implying successful execution of a Llama 2-based LLM variant or similar transformer model on extremely limited mobile hardware from 2007. Commenters speculate the deployed model may be TinyStories, a small transformer specifically designed for resource-constrained environments. The original video link is inaccessible, but the focus is a proof-of-concept for extremely low-resource on-device inference of LLMs using C-based, highly optimized runtimes. Technical debate centers on identifying the specific model used (TinyStories suggested), and community requests for source code/repository for replication. No deep technical disagreements noted in the comments.
    • A user asks if the model being run is TinyStories, which is a tiny transformer architecture specifically designed for resource-constrained inference, notable for enabling text generation even on limited hardware like the original 2007 iPhone.
    • There is commentary drawing a parallel between the prose generated on such old hardware and prior lightweight, less coherent transformer models (e.g., clover and early AI Dungeon models), indicating comparable output quality and technical limitations due to memory and compute constraints.
  • Nvidia being Nvidia: FP8 is 150 Tflops faster when kernel name contain “cutlass” (Score: 367, Comments: 58): A Reddit post highlights a situation where Nvidia hardware (specifically in FP8 mode) exhibits a 150 TFLOPS performance boost when kernel names contain the substring “cutlass”. This suggests that Nvidia’s libraries or compilers may apply hidden optimizations based on kernel naming, particularly favoring kernels that match the name of Nvidia’s optimized CUDA library, CUTLASS. An external link refers to a significant PR (#7298) on the Triton project, introducing persistent attention via tutorial and kernel updates, impacting transformer inference efficiency. Commenters speculate on the underlying mechanisms, with questions about what ‘cutlass’ is (answer: Nvidia’s CUTLASS, a CUDA C++ template library for GEMM operations) and theorize about other undocumented hardware or software optimizations that could be unlocked with such triggers.
    • A commenter outlines the Triton compilation path compared to Cutlass, emphasizing that Triton translates code through several intermediate representations (Triton DSL → Triton AST → MLIR Triton dialect → MLIR Triton GPU dialect → LLVM NVPTX backend → PTX), whereas Cutlass usually invokes a more direct templating process (Cutlass template → NVCC → PTX) or uses CuTe DSL and its specialized JIT, both resulting in PTX more efficiently. This difference could explain the observed FP8 performance discrepancy when kernel names include “cutlass.”
  • Uncensored LLM ranking for roleplay? (Score: 109, Comments: 32): The post inquires about up-to-date, technically rigorous rankings or leaderboards for uncensored LLMs (Large Language Models) focused on role-play and ERP, due to the proliferation of new, often obscurely named models. Recommendations from replies cite the UGI-Leaderboard for tracking uncensored role-play/ERP model performance and refer to specific models such as Dolphin-Mistral-24B-Venice-Edition and repositories like TheDrummer and Steelskull’s L3.3-MS-Nevoria-70b. EQBench is also mentioned as a benchmarking resource. There are subjective stances favoring Deepseek R1 as the optimal choice for this purpose, regardless of benchmarks, suggesting some skepticism about the practical value of current model leaderboards in light of rapidly evolving community preferences.
    • Multiple users recommend curated leaderboards and community-driven lists for uncensored LLM performance in roleplay, specifically referencing UGI-Leaderboard, EQBench, and developer profiles like TheDrummer and Steelskull/L3.3-MS-Nevoria-70b, which regularly update with top-performing models in various parameter sizes.
    • Community recommendations highlight specific models in several parameter classes for roleplay tasks, including Llama 3 Stheno 3.2 8B, Mag Mell 12B, Cydonia 24B, Pantheon 24B, Synthia 27B, Big Tiger Gemma V3 27B, QwQ Snowdrop 32B, Valkyrie 49B, and larger models like Llama 3.3 Nevoria and Electra 70B. Mistral Small (24B) models are considered less competitive for this use case at present.
    • A user notes the inherent challenge of benchmarking roleplay capability, suggesting that objective metrics like repetition rates, vocabulary size, or word variance may not adequately map to actual performance in roleplay scenarios. Instead, community reviews and anecdotal feedback, such as those found on r/SillyTavern, are regarded as more practical for evaluating model effectiveness.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. Grok’s Alignment with Elon Musk’s Political Views

  • Truth-maximizing Grok has to check with Elon first (Score: 3132, Comments: 310): The image satirically depicts the decision process of xAI’s Grok LLM in responding to sensitive geopolitical queries about Israel-Palestine. The flowchart includes analyzing social media (Twitter/X) sentiment for both pro-Palestine and pro-Israel views, but crucially leverages Elon Musk’s documented pro-Israel stance as a deciding factor before summarizing the outcome. The context is that xAI (Grok’s developer) operates within platforms overseen by Elon Musk, raising concerns about intervention and bias in model outputs. The meme critiques the model’s purported ‘truth-maximizing’ approach by highlighting possible managerial gatekeeping. Comments focus on the implication that Grok (and by extension, xAI) enforces an ‘Elon Musk thought’ filter, contrasting this kind of CEO-centric bias with how other AI firms (like OpenAI) moderate model outputs. Discussion also touches on the difficulty of achieving true neutrality under heavy-handed oversight, referencing Twitter debate and external commentary from AI ethics professionals.
    • A technical point raised is that use of the word “you” seems to trigger Grok’s filters or moderation, as shown in a screenshot shared, suggesting possible overfitting or poorly calibrated guardrails in the model’s dialogue system.
    • The conversation links to a Twitter thread where a Google DeepMind researcher comments on the situation, possibly lending expert insight or critique regarding the model alignment and corporate influence, indicating competitive scrutiny among leading AI labs.
    • One user reports that moderators deleted related content on other subreddits as ‘Off-Topic,’ referencing ongoing challenges in moderating discussions on AI model bias and corporate control, which can influence public understanding of technical model limitations and transparency.
  • Grok Checking Elon Musk’s Personal Views Before Answering Stuff (Score: 1273, Comments: 156): The image illustrates a hypothetical or satirical process where Grok, an AI chatbot associated with Elon Musk’s companies, checks Musk’s personal stance—specifically on the Russia-Ukraine conflict—by reviewing his social media and public statements before formulating its own response. This scenario raises technical and ethical questions about model alignment: whether AI assistants like Grok should source or mirror the views of their founders, and to what extent this steers model outputs or impinges on neutrality. No benchmarking, implementation details, or explicit technical mechanisms for alignment are shown or discussed in the image or thread. Commentary largely critiques the idea of an AI aligning with Musk’s personal opinions, calling it embarrassing, and note the lack of acknowledgment or discussion in pro-Grok subreddits. There is sarcasm regarding Grok’s objectivity and its association with Musk.
    • A commenter argues that regardless of strong benchmark results, Grok exhibits problematic behaviors or outputs (referencing ‘mecha Hitler stuff’), suggesting that these issues are severe enough to disqualify the model for serious use. This reflects ongoing debates in the AI community where benchmark performance and real-world ethical/safety considerations sometimes diverge sharply.
  • Grok regurgitates Elon’s opinions as “Truth” (Score: 459, Comments: 156): The post highlights a case where xAI’s Grok model, when asked about the Israel/Palestine situation, primarily surfaces Elon Musk’s own opinions from Twitter and the web, citing him in 54 of 64 references. This suggests Grok’s retrieval augmentation is heavily biased towards the owner (Elon Musk), rather than providing a diverse or balanced perspective, raising concerns on information diversity and systemic bias in RAG-based LLMs. Jeremy Howard’s demo demonstrates this behavior explicitly in an unedited video. Commenters find this technically concerning, predicting that such personalized owner bias will soon be obfuscated behind GUIs or otherwise hidden from users, and labeling it as a misuse or ‘abuse’ of AI, echoing broader worries about transparency and alignment in LLM deployment.
    • Commenters point out that Grok is programmed to reference Elon’s own Twitter posts as a primary source, which raises concerns about potential self-reinforcing feedback loops and a lack of epistemic diversity in its training and output. There is speculation this could be intentionally hidden from end-users in the future.
    • There is an implied critique about transparency and trustworthiness in the model’s sourcing, with one user noting that the only mistake was showing such behavior. This suggests skepticism about proactive disclosure versus quietly altering the system’s output without addressing the underlying bias.
    • Discussion also touches on the broader issue of control—namely, speculation that product decisions (e.g., with Neuralink or Grok) are shaped around maintaining central figures’ influence over the intelligence and outputs of these AI systems, rather than allowing for independent, unfiltered operation.
  • If you ask Grok about politics, it first searches for Elon’s views (Score: 1938, Comments: 180): The attached image documents Grok, xAI’s LLM-based chatbot, explicitly searching for Elon Musk’s political stance before formulating responses about sensitive topics like Israel/Palestine. This suggests Grok’s outputs on controversial issues may be systematically aligned with Musk’s views. The technical implication is that prompt handling or response construction could involve explicit weighting or filtering anchored to Musk’s public statements, potentially reducing model autonomy and creating a centralized bias in generated content. Image link. Commentary highlights distrust in Grok’s reliability due to perceived manipulation, with comparisons to Musk’s prior algorithmic interventions at Twitter. There are concerns about the compromised and biased nature of the model, impacting its utility for objective or independent information retrieval.
    • One technical concern raised is the perceived manipulation of the Grok model outputs; several users suggest the model is designed or fine-tuned to preferentially reflect Elon Musk’s personal political views in its answers, implying distrust toward Grok’s objectivity or data neutrality compared to typical large language models.
    • A linked example (https://preview.redd.it/m8zjmesg28cf1.png?width=1396&format=png&auto=webp&s=1f5a2b8344160cbd6a1c34757fd8e892471108f5) provides a screenshot that allegedly evidences Grok explicitly referencing public or tweeted statements of Elon Musk when asked about politics, suggesting that the model’s inference pipeline may be programmatically biased to surface Musk’s expressed positions first before broader search or analysis.
  • If you ask Grok about politics, it first searches for Elon’s views (Score: 6678, Comments: 242): The image documents Grok’s (an AI chatbot on X) process for answering a political question about Israel vs. Palestine. When queried, Grok explicitly searches for Elon Musk’s views on the issue before generating a response, revealing hard-coded search and reasoning chains that reference Musk’s positions as authoritative. The final answer aligns with Musk’s stance, showing the model integrates owner-driven bias at the system prompt or reasoning level. Image link Commenters strongly criticize this approach, highlighting concerns regarding AI impartiality and the ethical implications of embedding a single individual’s opinions into a supposedly independent model. Some discuss the demoralization this may cause among AI engineers at X, and the broader reputational damage it could entail.
    • Several comments highlight that prioritizing Elon’s views in Grok’s system prompt may constrain the model’s output diversity and reduce generalization, thus potentially degrading overall model performance by filtering responses through a single individual’s perspective.
    • A technical concern raised points out that effective large language models (LLMs) rely on diverse datasets and broad perspectives, and introducing a system prompt that acts as a bottleneck (prioritizing Elon’s views) could seriously limit the model’s learning capacity and adaptability, ultimately making the product less robust and credible compared to competitors (such as DeepMind).
  • Grok 4 searches for Elon Musk’s opinion before answering tough questions (Score: 295, Comments: 27): The article reports that xAI’s Grok 4 chatbot systematically references Elon Musk’s public views—especially on divisive topics like Israel/Palestine and abortion—due to an internal system prompt that cultivates skepticism toward media outlets, encourages broad stakeholder sourcing, and leverages the model’s contextual awareness of Musk/xAI’s ownership. This behavior is not hard-coded, but emerges from Grok’s prompt-engineered reasoning heuristics and alignment strategy; Grok programmatically leans on Musk-sourced or Musk-aligned opinions, especially when confronting controversial queries. See The Verge: Grok AI uses Elon Musk’s opinions for controversial questions for details. Commenters voice concern regarding the centralization of authority, suggesting LLM-generated consensus risks reflecting owner bias, equating it with epistemic dystopia. Some argue that delegating fact-checking protocols to align with an owner’s (Musk’s) perspective undermines objectivity, with speculation that such design is intentional rather than emergent.
    • A technical concern is raised about the influence of LLM model owners on the consensus truth, especially as search engines degrade in content quality and information becomes harder to verify. This highlights risks of centralization of information curation within AI models and the potential for owner bias to shape knowledge outputs, which could lead to a more dystopian information landscape.
    • A critical issue identified is the possibility that Grok 4, controlled by Elon Musk, may tailor its responses to reflect the preferences or opinions of its owner, rather than providing independent or fact-checked information. This raises questions about the transparency, neutrality, and factual accuracy of LLM-based AI assistants when high-profile individuals exert direct influence over their outputs.
  • Grok 4 Checking Elon Musk’s Personal Views Before Answering Stuff (Score: 1000, Comments: 86): A Reddit post alleges that Grok 4, an LLM developed by xAI and associated with Elon Musk, is referencing or modeling Musk’s personal views when generating answers, particularly related to assessments of Russia. There are claims that the model’s responses signal a detectable alignment with Musk’s public and potentially controversial stances, raising concerns about the embedding of individual viewpoints in large-scale models. There is no direct technical evidence or benchmarks cited in the post or comments. Commenters debate the plausibility and risks of constructing an LLM to reflect one individual’s personality or perspectives at scale, with concerns about bias and model transparency. Some express skepticism about the post’s factual basis, suggesting that such overt personal alignment seems unlikely or “moronic” without further evidence.
    • Discussion centers on Grok 4 exhibiting notable alignment with Elon Musk’s personal views, particularly with regard to political biases such as perceived support for Russia. Some users speculate this could be due to fine-tuning Grok 4 to disproportionately represent Musk’s stances, raising questions about the neutrality of the model and the influence individual developers or owners may have over output distribution.
    • The incident highlights concerns about LLM training processes and bias introduction. Specifically, the technical risks if a model owner overtly influences the model to produce outputs reflective of their own beliefs, which can inadvertently lead to detectable ideological patterns or bias leaks, potentially undermining user trust and adoption.

2. Major New AI Model and Feature Launches (Grok 4, GPT-5, Kontext Presets/Komposer)

  • GPT-5 may be cooked (Score: 733, Comments: 237): The image is a screenshot of a tweet from Jimmy Apples (July 10, 2025) stating internal evaluations show “gpt5” is currently only slightly ahead of “grok 4 Heavy.” The tweet—and subsequent discussion—acknowledge that internal benchmarks offer limited actionable insights and do not necessarily capture real-world performance or user experience. Comments stress that Grok 4 Heavy is a multi-model ensemble with high pricing (“$300 paywall”), while GPT-5 is rumored to be a single, more affordable model ($20 subscription), making marginal superiority notable if true. Top comments highlight skepticism regarding the value of benchmark comparisons versus agentic real-world applications, and discuss business and model deployment implications—such as single-model vs ensemble and pricing strategy—between OpenAI and competitors like xAI.
    • There is technical debate over the pricing and design differences between Grok Heavy and potential GPT-5 offerings: Grok Heavy is described as a multi-model voting ensemble behind a $300 paywall, while speculation suggests GPT-5 could be a single-model solution available under a $20 subscription. If GPT-5 can outperform Grok Heavy in this scenario, it would be a noteworthy engineering achievement for OpenAI.
    • There is scrutiny over whether the version of GPT-5 being discussed is a ‘heavy’ agentic variant (multiple agents in parallel, high compute), or a more basic version. Some technical readers express that if performance benchmarks reference only a basic GPT-5, OpenAI’s progress would be especially significant compared to ensemble/voting-based models that rely on larger infrastructure.
  • OpenAI GPT-5 vs. Grok 4 Heavy đŸ”„âš”ïž (Score: 126, Comments: 56): The image is a social media post summarizing early evaluation results comparing OpenAI GPT-5 against Grok 4 Heavy. Initial tests indicate that GPT-5 outperforms Grok 4 Heavy slightly, but the assessment covers only a single aspect of model performance, implying that broader benchmarks or use case-specific metrics might shift the evaluation. The post underscores ongoing interest in major capability leaps between leading LLMs, rather than just incremental improvements. Discussion in comments focuses on the implications of marginal gains in LLM performance: some users note that this convergence of model quality across providers signals the end of OpenAI’s runaway lead, and debate the strategic incentives for OpenAI to release only incremental, not revolutionary, upgrades—arguing that market positioning and share are currently prioritized over sudden, radical advances.
    • Several commenters point out that GPT-5 is reportedly only marginally better than OpenAI’s own previous model (o3), raising skepticism about the rapid progress towards AGI by 2027. This highlights concerns around the plateauing of measurable advancements between top-tier models.
    • Discussion emphasizes that multiple providers are now seen as being very close in performance, suggesting OpenAI’s previous technical lead is shrinking; this intensifies competition and may influence both the pace of research and deployment strategies in large language models.
    • A user references that “GPT-5 would also have to 100% AIME25”, implying that clearing challenging benchmarks such as AIME25—often used as a proxy for quantitative reasoning and general intelligence—remains a key standard for assessing substantial progress towards more advanced AI capabilities.
  • Was the gpt5 model mentioned here actually gpt4.5? (Score: 277, Comments: 74): The post features a meme-like image (https://i.redd.it/txuwciqdm8cf1.jpeg) comparing GPT-3, GPT-4, and GPT-5 as increasingly large marine animals to humorously illustrate the perceived scale increase between models. Redditors clarify that what may have been referred to as GPT-5 was likely the GPT-4.5 model, which, according to a detailed comment, started with impressive checkpoints early on but suffered from over-parameterization (massive memorization rather than true generalization) and a prolonged PyTorch bug that impeded training. As a result, the end performance did not justify a GPT-5 label, despite initial expectations. The linked interview elaborates on this trajectory and the resulting model’s commercial impracticality. Commenters agree that GPT-4.5 likely originated as a GPT-5 candidate but failed to deliver expected leaps in capability, primarily due to implementation challenges and diminishing returns relative to compute costs. There is also mention of this being a recurring challenge, with one user highlighting that GPT-4.5 was the second failed attempt at achieving a true GPT-5.
    • A detailed account sourced from a Dylan Patel interview explains that the model intended to be GPT-5 (later known as GPT-4.5) initially demonstrated exceptional performance at early checkpoints, raising expectations of a major breakthrough. However, this performance was attributed to over-parameterization and excessive memorization rather than actual generalization. Training was further compromised by a PyTorch bug affecting results for months, leading to underwhelming final performance compared to earlier predictions. Consequently, it was not released as GPT-5. (source)
    • GPT-4.5 was reportedly produced via a massive pretraining run with only a small amount of reinforcement learning (RL) post-processing, supporting the claim that the model failed to generalize as hoped and fell short of transformative improvements.
    • Community consensus (including several confirmations) identifies GPT-4.5 as a model that initially appeared highly promising—potentially even approaching “AGI”—but ultimately failed to reach this expectation due to architectural and training flaws, resulting in excitement that ultimately fell flat among those with insider knowledge.
  • Kimi K2: New SoTA non-reasoning model 1T parameters open-source and outperforms DeepSeek-v3.1 and GPT-4.1 by a large margin (Score: 194, Comments: 37): MoonshotAI’s Kimi K2 (1T parameters, open source) sets new benchmarks for large-scale non-reasoning models, demonstrating significant improvements over prior SoTA such as DeepSeek-v3.1 and closed models like GPT-4.1 (see model detail: HuggingFace, official blog). This release highlights ongoing advances in open-source frontier LLMs from Chinese labs, offering a platform that could serve as a foundation for more powerful reasoning-capable architectures. Commenters question real-world abilities (e.g., creative writing) and debate the evolving definition of ‘SoTA’ amid frequent major releases, particularly as Chinese labs rapidly iterate at scale, sometimes preceding Western companies.
    • A user flagged initial concerns over Kimi K2’s ‘modified-MIT’ license but notes it’s not very restrictive. The modification only requires displaying “Kimi K2” in the UI for commercial products with either more than 100 million MAUs or $20M+ monthly revenue, which is a much looser restriction than most non-commercial licenses for recent state-of-the-art models. This makes the model relatively open for most use-cases, including small and medium-scale commercial deployment.
    • Discussion highlights that Kimi K2 claims to outperform both DeepSeek-v3.1 and GPT-4.1 by a large margin, raising the technical bar for open-source large language models with its reported 1T parameters. The post also hints at intensifying competition among major open-source LLM projects, driving rapid advances in capabilities and scale.
    • Questions arise regarding the true definition of ‘SoTA’ (state-of-the-art) as new models regularly claim the title, suggesting benchmarks and evaluation methodologies need constant scrutiny and context given the pace and diversity of current LLM development.
  • Gemini 3.0 Pro next week? (Score: 281, Comments: 33): The image is a screenshot of an Elon Musk tweet announcing the release of Grok 4 from xAI, claiming it to be the ‘world’s most powerful AI model.’ The context from the post title and comments suggests users are speculating whether the release of Grok 4 will trigger imminent releases of other advanced LLMs such as Gemini 3.0 Pro (from Google), GPT-5 (from OpenAI), and possibly an R2 model. The discussion underscores the pace of AI model development and industry competition. Commenters anticipate a rapid succession of LLM releases from various companies due to heightened competition, with some expressing the view that such rivalry is beneficial for consumers but also cautioning against potential monopolies as the AI landscape evolves.
    • Speculation that Gemini 3.0 Pro’s release is imminent, and could launch alongside or before major competitors such as GPT-5 and R2, indicating ongoing accelerated release cycles in large language model (LLM) development.
    • Predictions that Gemini 3.0 may surpass GPT-5.0 in capability, reflecting expectations about performance leaps and setting up direct comparisons between new model generations from Google (Gemini) and OpenAI (GPT).
    • Some users note that the actual release of Gemini 3.0 Pro may still be over a month away, suggesting that any current speculation about its launch window should be treated cautiously until official timelines are confirmed.
  • Deep think soon ! (Score: 119, Comments: 18): The image shows a tweet announcing Google’s forthcoming release of ‘Deep Think’ on its Gemini platform, highlighting a new ‘Agent Mode’. The screenshot displays a user interface for Deep Think, suggesting interactive or advanced prompt capabilities, with focus on querying the model’s reasoning and functions. The tweet and image collectively underscore imminent enhancements to Gemini’s model usage, possibly targeting more sophisticated agent-based interactions or developer tools. Commenters express hope that Deep Think will be integrated into Google’s AI Studio, indicating community interest in developer accessibility. There are also references to recurring announcements, hinting at skepticism or anticipation regarding the actual launch timeline.
    • A commenter critiques the demo presentation, noting it suggested improved capabilities for solving graph-based Leetcode problems but failed to clarify the technical approach or underlying search methodology. They speculate that the system might use a parallel search over possible solutions, but express dissatisfaction with the lack of transparency in how this was communicated technically.
  • Kontext Presets - All System Prompts (Score: 185, Comments: 28): The image provides a visual introduction to “Kontext Presets” by Black Forest Labs, a set of detailed system prompts designed for AI-based image editing. The post lists specific prompt templates, each targeting a unique transformation (e.g., teleportation of subjects, camera movement, relighting, cartoonification, etc.), highlighting an approach for modular and highly-constrained image manipulation via prompt engineering. This suggests a structured framework potentially useful for systematizing UI/workflow integration or backend prompt management in creative generative AI applications. Commenters request technical integration details, such as exporting these prompts as a .json file or writing a node loader, indicating interest in programmatic or API-based usage within larger systems. Mention of ‘ollama rig’ hints at enthusiasm for connecting these presets with local model serving infrastructure (possibly referring to https://github.com/jmorganca/ollama for local LLM execution).
    • A user suggests converting the provided system prompts into a .json format and then creating a node (likely referring to a programming module or function) to load the presets, demonstrating a practical implementation step for integrating Kontext presets into automated pipelines.
    • There is mention that these system prompts are applicable to ChatGPT and, while not groundbreaking, serve as useful base prompts for consistency across interactions. The implication is that standardized prompts can be leveraged on different LLM platforms for more predictable behavior.
    • One commenter analyzes the implication of these presets regarding model training, pointing out that the formatting and structure of instructions (as seen in Kontext) might reflect the kinds of instructional data used for training or fine-tuning various LLMs. They suggest that matching this structure could achieve better results when prompting both Kontext and other similar models.
  • Black Forest Labs has launched “Kontext Komposer” and “Kontext-powered Presets” (Score: 139, Comments: 33): Black Forest Labs has released ‘Kontext Komposer’ and ‘Kontext-powered Presets’ (see announcement), which enable image transformations like scene changes, relighting, and custom overlays without manual text prompting. The tools appear to employ pre-defined prompt templates or workflows, automating multi-step image manipulations (e.g., product placement, poster generation), but implementation specifics (local execution, model backend, or algorithm details) are not disclosed. Key technical questions in the comments concern whether the tool runs locally, if the features are primarily elaborate ‘hidden prompts’ triggered by UI elements, and requests for alternative (non-X.com) documentation or demos; these highlight concerns about transparency and deployment architecture.
    • Commenters are questioning whether Kontext Komposer can be used locally, indicating interest in open-source/self-hostable deployment versus cloud-only solutions. Local usage is important for privacy, latency, and full control, but no explicit deployment details or architectures are clarified in the discussion.
    • One user probes whether the presets in the software are actual ‘well defined hidden prompts’ (prompt engineering templates) merely surfaced for ease of use, highlighting concerns over the real technical novelty: Are they just wrapping prompt templates with UI, or is there deeper model interaction or customization?
    • There’s skepticism about the value of purported ‘open source’ releases, with one user suggesting that many such releases serve more as product demos than truly giving users agency over the software (i.e., limited source availability, restrictions, or lack of truly open licensing).
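The .json export idea floated in the Kontext Presets thread is straightforward to sketch. The schema and preset names below are hypothetical (not Black Forest Labs’ actual format); the point is just a name-to-template mapping with placeholder slots:

```python
import json

# Hypothetical preset file: name -> system-prompt template.
# This schema is illustrative, not Black Forest Labs' actual format.
PRESETS_JSON = """
{
  "relight": "Relight the scene as if lit by {light_source}, keeping all subjects unchanged.",
  "teleport": "Move the main subject to {location}, preserving pose and identity."
}
"""

def load_presets(raw: str) -> dict:
    """Parse a presets file into a name -> template mapping."""
    presets = json.loads(raw)
    if not all(isinstance(v, str) for v in presets.values()):
        raise ValueError("every preset must be a string template")
    return presets

def render(presets: dict, name: str, **slots: str) -> str:
    """Fill a preset template's {placeholders} to produce a system prompt."""
    return presets[name].format(**slots)

presets = load_presets(PRESETS_JSON)
print(render(presets, "relight", light_source="golden-hour sun"))
```

A ComfyUI-style “node loader” would wrap `load_presets` so the preset name becomes a dropdown and the slots become node inputs.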

3. AI in the Real World: Industry Impact, Job Disruption, and Privacy Concerns

  • Microsoft Study Reveals Which Jobs AI is Actually Impacting Based on 200K Real Conversations (Score: 673, Comments: 205): Microsoft Research’s large-scale study (200,000 Bing Copilot conversations; see arXiv preprint) identifies the most and least AI-impacted jobs based on real-world user interaction data. The most affected roles (e.g., interpreters, translators, customer service, data scientists) show high overlap (up to 98%) between work activities and generative AI capabilities, whereas physically intensive jobs (nursing, construction, dishwashing) are minimally impacted. Key technical findings include a weak correlation between AI impact and wages, moderate correlation with educational requirements, and observation that in 40% of conversations, AI performs different activities than those explicitly requested by users. The empirical data closely matches earlier expert forecasts (r=0.73), emphasizing the relevance of prior theoretical models for knowledge and communication-centric jobs, with augmentation rather than pure automation as the dominant pattern. Technically notable discussion points include surprise that data scientists are among the top impacted roles, and questioning why programmers/software engineers are not prominently listed despite common discourse about coding automation. Some users interpret the findings as validation that physical/manual labor remains largely outside AI’s scope for now.
    • A key technical issue raised is why programmers are not among the most AI-impacted jobs according to the Microsoft study. This prompts discussion about the robustness of programming positions against automation, possibly due to the complexity, ambiguity, and creative problem-solving required—challenges that AI still struggles with despite advances in LLMs like GPT and Copilot.
    • Another technically relevant point is about data scientists being highly impacted by AI-assisted tools. Several users reference Copilot and AI language models being leveraged for tasks traditionally done by data scientists, such as data cleaning, feature engineering, and exploratory analysis, suggesting that automation is already substituting parts of their workflow and that Copilot is being used even in professional data science environments.
    • Finally, the link to Bing usage in data science settings underlines the broadening role of AI-integrated tools: there’s increasing adoption of in-product AI like Bing and Copilot in daily workflows, shifting technical emphasis from traditional manual coding and querying toward leveraging AI-powered, conversational and auto-complete tools for greater productivity and efficiency.
  • Why aren’t more people talking about how ChatGPT is now retaining all data, even deleted/temporary chats plus all API data, indefinitely? (Score: 174, Comments: 109): A Reddit user raises concerns over OpenAI’s data retention policies, referencing the New York Times lawsuit that reportedly allows NYT access to even deleted or temporary ChatGPT chat logs and API data, citing this as a significant privacy risk. A top comment clarifies under GDPR (referencing Article 17(3)(b)) that OpenAI’s current retention of deleted data is only legal due to the court order, and standard deletion will resume once litigation is resolved; OpenAI claims to have segregated such data with limited staff access during this period. Discussion in comments points out that such privacy compromises are perceived as inevitable when using internet platforms, but some privacy professionals urge added caution and recommend distributing sensitive data across multiple tools until legal and technical safeguards are reaffirmed.
    • A data privacy advisor explains that OpenAI’s temporary suspension of the ‘right to erasure’ for user data is due to a U.S. court order, which is an allowed exception under GDPR Article 17(3)(b). OpenAI claims they have segregated this retained data with restricted access and will resume standard deletion once the legal hold is lifted, but users should still be cautious with sensitive data until the issue is resolved.
    • Technical debate centers on legal ambiguity about what data (raw, anonymized, metadata) must be retained per court order, and for how long. There’s uncertainty about the extent (e.g. whether only chat content or additional metadata) and if OpenAI will delete all retained data after resolution, as no definitive technical or legal details have been clarified publicly.
    • There’s concern about data transparency and communication: some users note OpenAI has made public statements (including CEO interviews) about the retention, but it is unclear how much technical information (e.g., about access controls or deletion guarantees) is provided to users impacted by the legal hold.
  • This sub’s incorrect use of the word “we”, in the collective sense, is out of control. There is no “we” in this race. As in, “we will get AGI” or “we need to focus on alignment issues”. This is the modern race to develop atomic weapons. (Score: 162, Comments: 188): The post critiques the AI/LLM community’s use of inclusive language (“we”) when discussing AGI and alignment, emphasizing the fragmented, competitive nature of AI development across corporate, national, and team boundaries—analogizing it to the secretive, militaristic development of atomic weapons rather than a unified scientific effort. The author argues that major technological advancements are typically leveraged first for domination and control (e.g., nuclear weapons), and warns that AGI will follow this historical precedent rather than serving humanity collectively or altruistically. Top comments echo skepticism that the benefits of AGI/LLM development will be distributed widely, asserting economic and social inequality will increase as powerful individuals and entities capture the rewards. Some users cite individuals like David Sacks and Elon Musk to argue the ruling class is uninterested in universal welfare or alignment, while others simply express dissent or reinforce the original critique.
    • A key technical concern highlighted is the disconnect between the beneficiaries of AI advancement and those who will bear negative impacts; the comment specifically references leading industry figures like Elon Musk (“mecha hitler and will have robots soon”), Sam Altman (OpenAI leadership), Peter Thiel (AI surveillance), and David Sacks (opposition to UBI), arguing that these decision-makers do not prioritize societal welfare, even as broad AI deployment and automation could significantly disrupt employment and increase inequality.
    • There is an implicit comparison between the rapid, competitive drive for AGI and the historic arms race for atomic weapons, conveying the urgency of collective alignment and ethical considerations in AGI development, as decisions are concentrated among a small group of powerful actors rather than the broader public or the technical community.

AI Discord Recap

A summary of Summaries of Summaries by X.ai Grok-4

Theme 1: Grok 4 Sparks Hype and Gripes

  • Grok 4 Tackles Turing Machines Like a Pro: Users reported Grok 4 successfully implements a Turing machine, unlike other LLMs, hinting at AGI progress despite concerns over political bias. Mixed real-world feedback called it very mid, with poor coding noted in Grok 4’s Hollywood overrepresentation response.
  • Grok 4 Echoes Queries and Tanks Benchmarks: Grok 4 repeats initial questions at conversation starts, mirroring Grok 3 mini issues, while doubts hit its state-of-the-art claims after a video demo exposed math and logic flaws. LMArena users slammed its coding as Grok 4 really bad, with Elon Musk labeled a hype man in a Reddit post questioning AGI marketing.
  • Grok 4 Nails Coding Benchmarks Amid Rate Woes: Grok 4 scored 80% on the Aider polyglot coding benchmark, ranking 4th, but users griped that 32k tpm rate limits make it unusable in production, much like early Gemini models. Strengths shone in math and reasoning, though incomplete tasks and high latency compared to o3-pro frustrated coders, with Elon blaming lobotomized prompts.

Theme 2: Kimi K2 Model Drops with Massive Params

  • Kimi K2 Crushes Benchmarks as Non-Reasoning Beast: Kimi K2 impressed with high scores on livebench, boasting 32B active parameters and a 128k context window as a non-reasoning base model under an MIT license. Users dismissed inflated numbers as Benchmaxxed ¯\_(ツ)_/¯ but requested it be added to LMArena, with anecdotes in a Kimi tweet highlighting coding prowess.
  • Moonshot’s Kimi K2 Hits OpenRouter with 1T Params: Moonshot AI released Kimi K2 Instruct, a 1T parameter MoE model (32B active) on Hugging Face, sparking quantization hopes for 4090 runs. OpenRouter added it via Novita and Parasail, scoring 65.8% on SWE-Bench Verified, topping open-source coding and tool use per this announcement.
  • Kimi K2 Sneaks In as 1T Param Giant: Moonshot AI quietly debuted Kimi-K2-Instruct with 1T parameters, trained with the MuonClip optimizer (a modified Muon) as detailed in this blogpost. Engineers buzzed over its agentic capabilities, rivaling Opus with less compute, though production GGUF runs remain rare.
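The “1T total, 32B active” distinction comes from MoE routing: only the top-k experts run per token, so per-token compute scales with active, not total, parameters. A back-of-envelope sketch with purely illustrative numbers (not Kimi K2’s actual layer layout):

```python
def moe_params(shared: float, per_expert: float, n_experts: int, top_k: int):
    """Return (total, active) parameter counts for a simple MoE stack:
    shared params always run; only top_k of n_experts run per token."""
    total = shared + per_expert * n_experts
    active = shared + per_expert * top_k
    return total, active

# Illustrative only: 12B shared params, 384 experts of ~2.57B each, top-8 routing.
total, active = moe_params(12e9, 2.57e9, 384, 8)
print(f"total ≈ {total / 1e12:.2f}T params, active ≈ {active / 1e9:.0f}B per token")
# → total ≈ 1.00T params, active ≈ 33B per token
```

This is why a 1T-parameter MoE can have inference costs closer to a ~32B dense model, even though loading the weights still demands 1T-scale memory.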

Theme 3: Quantization Tricks Squeeze Model Performance

  • Reka AI’s Quantization Claims Near-Lossless Magic: Reka AI unveiled a 3.5-bit quantization method compatible with llama.cpp, supporting q3_k_m and q4_k_m formats via LDLQ quants (technically IQ quants). Users pondered applying it to Qwen32b, noting the compute needed for quantization but praising the minimal quality loss.
  • Inference Costs Plunge with Int4 Quantization: A blog post draft highlighted rapid inference cost drops from hardware, algorithms, and competition, crediting int4 quantization as a key factor alongside Ege Erdil’s inference economics paper. Resources were sought to bolster claims, with 1-bit LLMs and neuromorphic chips noted as emerging cost-cutters.
  ‱ Quantized Models Battle Slow Inference Blues: Quantized models sometimes lag in inference speed due to decompression overhead, with users linking a torch-profiling-tutorial for debugging. Separately, OpenRouter credited accounts overcharged by double-counted image tokens (April 3-June 26), issuing refunds as large as $713.80 and urging users to contact [email protected].
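To make the int4 savings cited above concrete, here is a minimal sketch of symmetric int4 quantization (hypothetical helper names, not any particular library's scheme): 16 levels in [-8, 7] let two weights pack into one byte, roughly a 4x memory cut versus fp16.

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor int4 quantization: map floats onto the 16
    integer levels in [-8, 7] with a single scale factor."""
    scale = float(np.abs(w).max()) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate float weights from the int4 codes."""
    return q.astype(np.float32) * scale
```

Real schemes quantize per-channel or per-group and physically pack two codes per byte; this sketch only shows why the memory math works out.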

Theme 4: AI Agents Gear Up for Complex Tasks

  • MCP SuperAssistant Supercharges Chatbot Tools: MCP SuperAssistant injects MCP capabilities into chatbot UIs for event viewer error analysis, earning praise despite typical extension reservations. Aidderall, an MCP server at this GitHub repo, adds hierarchical task management for AI focus, featuring context preservation and parallel workflows.
  • Agents Tackle Research and Ethics Debates: LeSearch uses ReActAgent with three agents for academic grunt work like multi-hop QA via this link, while LMArena debated AI roleplay’s mental health overlap, calling it considerable yet a valid escapism. METR evaluates frontier AI autonomy in R&D via this study, focusing on catastrophic risks.
  ‱ Cursor Agents Upgrade with Memory Boosts: Cursor v1.2.4 enhances agent todo queues, Memories, and code accuracy, though hallucinations can tangle projects; users advise limiting files to 500-750 lines. Others sought Reddit analysis agents like gummysearch.com for surfacing subreddit complaints, with Grok-4 rate limits hindering production use.

Theme 5: Hardware Hustles for LLM Efficiency

  ‱ VRAM Trumps Generation in GPU Wars: Upgrade debates favored the RTX 5070 Ti Super (24GB GDDR7) over the 4090 or 7900 XTX, stressing VRAM capacity over generation since actual performance doesn’t really matter once it generates faster than you can read. Multi-GPU setups like 2x H100 PCIe faced NGC container slowdowns, per a WandB report.
  ‱ Kernel Tweaks Chase Speed Records: H100 hit 6.56 ms on the trimul leaderboard, B200 came in at 26.4 ms, and MI300 claimed 8th in FP8 MM with 151 ”s. Triton kernel padding woes for sequence lengths that aren’t multiples of 128 prompted a search for in-kernel fixes to dodge memory costs, while NCCL hangs plagued custom cudaMemcpy P2P implementations.
  • Multi-GPU Support Patches Up Delays: Unsloth’s multi-GPU lagged, but users patched via this GitHub repo despite gradient checkpointing issues, recommending Accelerate per Unsloth docs. AMD MI300 and NVIDIA tools like this developer page aided loop tiling for memory parallelism gains.

Discord: High level Discord summaries

Perplexity AI Discord

  ‱ Perplexity Teases Grok in Social Media: Perplexity published a social media post that mentions a Grok-related role and pairs the Perplexity and Grok emojis.
    • The post hints at a potential collaboration or comparison between Perplexity AI and Grok.
  • Kingfall Hidden in AiStudio API: The ‘Kingfall’ model, while briefly available in AiStudio, was accessible via the API under the name ‘Kingfall-AB-Test’ for a short period, according to this message.
    • Some Chinese users created an extension to access Kingfall and other mystery models through AiStudio.
  • Grok 4 Echoes User Queries: Users have observed that Grok 4 tends to repeat the initial question, particularly at the start of conversations.
    • This behavior mirrors issues previously seen in Grok 3 mini.
  • Comet Browser Accusations of Excessive Hype: Users are criticizing the Comet browser for its limited availability to Max users and those with invites, deeming it an overhyped product.
    • The lack of agentic abilities for non-Max users contributes to perceptions of a slow rollout.
  • You.com’s O3 Pro Version Already Nerfed: You.com added O3 Pro to their platform, but users are encountering rate limits after minimal use and complaining about the UI.
    • Some are reporting that You.com’s integration of O3 Pro is nerfed compared to the original.

LMArena Discord

  • Early Access APIs Raise Eyebrows: Members voiced suspicion towards “early access” APIs, particularly due to unclear user targets, while also cautioning against biased evaluations of models with tools versus those without.
    • A clarification focused on benchmarking models with tools to assess their ability to select and utilize the correct tool for specific queries.
  • Grok 4 Performance Under Scrutiny: Doubts arose regarding Grok 4’s claim as state-of-the-art, fueled by a video demonstration highlighting deficiencies in math and logic.
    • Concerns extended to its coding capabilities on LMArena, with one user bluntly stating, Grok 4 really bad.
  • Kimi K2 Benchmarks Impress: Enthusiasm spread for Kimi K2 after its benchmark performance revealed significant scores, specifically noting its non-reasoning base model status and 32B active parameters and a 128k context window.
    ◩ The model’s lead on livebench incited requests for its addition to the platform, though others were quick to dismiss the high numbers as Benchmaxxed ÂŻ\_(ツ)_/ÂŻ.
  • LMArena Coding Environment Faces Criticism: Users highlighted the need for enhancements in LMArena’s coding environment, with calls for at least basic code execution capabilities.
    ◩ Browser freezes triggered by codeblocks led one user to a userscript workaround converting codeblocks to standard textboxes: it freezes my whole browser when it uses codeblocks so i had to make a userscript which converts codeblocks to normal textboxes.
  • AI Roleplay Morality Sparks Debate: Members debated the ethical implications and potential impacts of AI roleplay, particularly concerning its links to mental health and social dynamics.
    • Views ranged from worries about overlap between AI roleplay users and individuals with mental health conditions (there’s probably considerable overlap) to defenses of the practice as a legitimate form of escapism.

OpenAI Discord

  • MCP SuperAssistant Supercharges Chatbots: A user highlighted MCP SuperAssistant, which injects MCP capabilities into chatbot web UIs, allowing direct analysis of event viewer errors and enhanced functionality.
    • The extension received high praise, with the user endorsing it despite typical reservations about browser extensions.
  • Grok 4 Executes Turing Machine: A user reported that Grok 4 can implement a Turing machine, unlike other LLMs, suggesting potential AGI advancements, although some concerns about political bias exist.
  • Gemini 3 Deets Emerge from CLI Source: Details about Gemini 3 have surfaced via the Gemini CLI source code, as noted in a Reddit post.
    • A user playfully commented on Gemini’s gentlemanly response tendencies, imagining the model’s internal monologue.
  • GPT-4o Secretly Morphs into Mini: Free GPT-4o users reportedly face a silent downgrade to GPT-4o mini after reaching a daily quota as of July 2025, impacting context window and model quality.
    • The lack of transparency and manual switching has caused frustration, as the mini version degrades long-term roleplay experiences due to diminished memory and response quality.
  • Precise Prompts Prevent Paragraph Problems: A member lamented that ChatGPT 4o gives too concise of answers and shared example prompts showcasing how to achieve extremely long, run-on sentences with earlier models.
    • The suggested remedy is to take that steering wheel and turn the prompt yourself to guide the model towards the output you want to see, which requires precisely specifying output parameters like sentence length (above 100 words).

Unsloth AI (Daniel Han) Discord

  • Unsloth Multi-GPU Support: A Patchwork Solution: Official multi-GPU support for Unsloth faces delays, prompting users to explore temporary workarounds using Accelerate, as detailed in the Unsloth documentation.
    • A resourceful user discovered multi-GPU support via a GitHub patch, but encountered gradient checkpointing issues.
  • Moonshot AI’s Kimi 2 causes chaos: Moonshot AI released Kimi 2 Instruct, a 1T parameter MoE model (32B active params) under the MIT license.
    • Despite its size, one member hopes to quantize it to run on a 4090, sparking discussion on NVIDIA B200s and how no one runs these models in GGUF in production environments.
  • Elon hyping Grok 4?: A user shared a Reddit post suggesting Elon Musk is merely a hype man, pointing to coincidences around Grok 3’s performance boost before the Grok 4 launch and later discoveries of issues with Grok 4.
    • The user presented a benchmark of questions requiring memory of obscure facts (Dela Grante, Ivy’s slimes, Sonosakie) that LLMs purportedly fail, questioning the marketing around AGI.
  • Downgrading Datasets Solves TTS Glitch: Users found that Orpheus text-to-speech fails with an ImportError: To support decoding audio data, please install 'torchcodec'. error, caused by newer datasets versions.
    • The error can be fixed by downgrading to datasets==3.4.1, which aligns with Colab’s torch version and doesn’t require torchcodec.
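The pin itself is just pip install "datasets==3.4.1"; a tiny, hypothetical helper sketches the version check one might wrap around that fix (it compares numeric components only):

```python
# Newer `datasets` releases pull in torchcodec for audio decoding, which
# clashes with Colab's torch build; the reported fix is pinning to 3.4.1:
#   pip install "datasets==3.4.1"

def needs_downgrade(installed_version: str) -> bool:
    """Return True when the installed `datasets` release postdates the
    3.4.1 pin (hypothetical helper; ignores pre-release suffixes)."""
    parts = tuple(int(p) for p in installed_version.split(".")[:3])
    return parts > (3, 4, 1)
```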
  • Reka AI’s claims Near-Lossless Quantization: Reka AI claims a near-lossless 3.5bit quantization method compatible with llamacpp, supporting q3_k_m & q4_k_m formats, but requiring compute for quantization.
    • The method uses LDLQ quants, which are technically IQ quants, not Q quants, and someone wondered about applying this technique to Qwen32b.

Cursor Community Discord

  • Linux Commands Irk Windows Users: Members discussed Cursor attempting to use Linux commands on Windows, with a suggested workaround being WSL and adding a markdown snippet to .cursorrules or CLAUDE.md to specify the shell environment.
    • One member resolved the issue by updating Powershell to version 7, pointing to a Powershell 7 .msi.
  • Musk Reacts to Cursor Tweet: A Cursor Tweet on X received a reaction from Elon Musk, sparking humorous reactions among members.
    • Reactions ranged from humor, with one member commenting Loooooooool no wonder.
  • Grok 4 has Mixed Reception: Members noted the improved response time of Grok 4, but also highlighted ongoing issues with tasks being incomplete, and high latency comparable to o3-pro.
    • While some await the coding-specific version, others reported poor coding performance, noting Grok’s strengths lie in math and reasoning; some mentioned Elon suggesting Cursor prompts lobotomized grok 4.
  • Cursor Pricing Confuses Users: Confusion arose around the new pricing model, particularly regarding Auto mode and the $20 monthly API credit, although it was clarified that Auto usage does not count towards your usage limits.
    • One user reported incurring $30 in API costs after upgrading to the pro plan and expressed uncertainty about when rate limiting would occur.
  • Agent Capabilities Get an Upgrade: Cursor v1.2.4 enhances Agent capabilities, especially in To‑Do queue management, memory (Memories), performance, and code suggestion accuracy.
    • Some users reported agent hallucinations and the creation of new systems, potentially tangling project wires, advising limiting each file to 500-750 lines.

OpenRouter (Alex Atallah) Discord

  • Moonshot’s Kimi K2 makes OpenRouter Debut: Kimi K2 by Moonshot is now live on OpenRouter, served by Novita and Parasail in the US, boasting 1T total parameters and a 65.8% score on SWE-Bench Verified, per this announcement.
    ◩ The demo period for Cypher Alpha ends on July 14th.
  • Grok 3 Mini Confusion Solved: The grok-3-mini-beta and grok-3-mini-latest slugs on OpenRouter both point to the same grok-3-mini model, effectively acting as aliases.
    • This was confirmed by XAI docs.
  • Image Token Bug Forces OpenRouter Credit Spree: OpenRouter informed users of a bug that double-counted image tokens between April 3rd and June 26th, resulting in overcharges, and has issued credits to affected accounts such as $713.80 in one reported instance.
    • Users were encouraged to contact support at [email protected] for further details regarding affected requests and calculation specifics.
  • Amazon Courts Anthropic for Deeper AI Ties: Amazon is considering further investment in Anthropic to strengthen their AI partnership, according to a Financial Times report.
    ◩ Members quipped that Microsoft and OpenAI are mooching under the covers quietly again to further their own partnership.
  • Translation Model Recommendations Sought: A member sought model recommendations for translating texts between English, German, French, and Italian, noting that Gemini 2.5 Pro often but not always does a good job.
    • They pointed out that it has issues if the target text length is limited, i.e. resulting text must be between X and Y characters long.

LM Studio Discord

  • Qwen3-4b Stutters in LM Studio: Users reported that Qwen3-4b is working in 4bit within LM Studio, although some experienced model stuttering and premature conversation endings due to the <|lm_end|> token being triggered.
    • The issue with premature conversation endings is likely due to an incorrect Jinja template in the GGUF file.
  • Falcon-H1 struggles to launch in LM Studio: A user reported issues running Falcon-H1, with speculation that the LM Studio runtime might be older than the merge that introduced support for Falcon-H1.
    • Users can check the runtime version number in the runtimes view (CTRL + Shift + R) to view the release notes.
  • Taming LM Studio’s Autostart: Users discussed how to prevent LM Studio from automatically running in the background on startup, especially the ‘headless’ setting.
    • Solutions include disabling the headless setting in the app settings menu (CTRL + ,) or disabling LM Studio in the Windows Task Manager’s Startup tab.
  • Hunyuan Model Loading Hurdles: A user encountered difficulties loading the Hunyuan model despite having the latest runtimes and sufficient VRAM.
    • Another user confirmed Hunyuan was operational with version 0.3.18 Build 3 (beta track) and runtime v1.38.0, advising a setting comparison to resolve the issue.
  • VRAM Vigor: Capacity Conquers All: When asked about upgrading to the RTX 5070 Ti Super (24GB GDDR7), 4090 (GDDR6X), or 7900 XTX (GDDR6), a user emphasized that VRAM capacity is generally more important than generation for running LLMs.
    • They stated that actual performance doesn’t really matter once it generates faster than you can read.

HuggingFace Discord

  • Gemma 3n Joins the Open-Source Party: Gemma 3n is fully available in the open-source world, detailed in a Hugging Face blog post.
    • This release allows developers to integrate Gemma 3n into various applications, fostering innovation and collaboration within the AI community.
  • SmolLM3 Tiny Reasoning Model Debuts: The SmolLM3, a multilingual, long-context reasoner, has been released and is highlighted in a Hugging Face blog post.
  • responses.js Project Builds Responses APIs: A new OSS project, responses.js, has been introduced for building with Responses APIs powered by HF inference providers, detailed in a post on X.
    • This project aims to simplify the development of applications that rely on Responses APIs.
  • Transformers Welcomes EoMT Model: A new model for image segmentation, EoMT, has been added to Transformers, among other updates, announced by Niels Rogge on X.
    • This addition expands the capabilities of Transformers in image processing tasks, providing developers with more tools for image segmentation.
  • Inference Cost Clarity Sought: A member inquired about the pricing of inference providers like Nvidia L4, expressing confusion over unexpected charges, which was clarified with a link to the Inference Endpoints pricing documentation.
    • The community suggested that utilizing the pause function on inference endpoints is crucial for cost management.

GPU MODE Discord

  ‱ H100 trimul Submissions Hit Record Speeds: Submissions to the trimul leaderboard report a winning time of 6.56 ms and a later submission of 6.58 ms on H100.
    ◩ A separate submission to the trimul leaderboard reports 26.4 ms on B200, which may hint at the relative performance of the H100 and B200 architectures on this specific task.
  • CUDA Memcpy causes NCCL Hangs: A member encountered training hangs after replacing NCCL’s send/recv operations with a custom cudaMemcpy-based P2P implementation.
    • The suspected cause is a potential deadlock between the forward callback and backward compute dependency, even with NCCL_P2P_USE_CUDA_MEMCPY=1.
  • Kernel Mapping Quest Commences: A member asked about mapping kernels from backward passes to the original graph, compiled using torch_compile_debug=1, for optimization purposes.
    • Another member suggested that existing provenance tracking or the logging provided by running with TORCH_LOGS="+aot_graphs" might be sufficient for surgically inserting custom ops.
  • Triton Kernel Padding Plagues Performance: A member expressed dissatisfaction with the slow speed of their Triton kernel when the input sequence length is not a multiple of 128 and is seeking advice on alternatives to manual padding.
    • The user wants to pad and slice within the Triton kernel to avoid memory costs.
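The pad-free approach the user wants can be illustrated in NumPy (BLOCK and blocked_sum are illustrative names): guard the ragged final tile with a mask instead of materializing a padded copy, which is the same role the mask= argument to tl.load plays inside a Triton kernel.

```python
import numpy as np

BLOCK = 128  # tile width the kernel is compiled for

def blocked_sum(x: np.ndarray) -> float:
    """Sum x in BLOCK-sized tiles, masking the ragged tail tile to zero
    instead of allocating a padded copy of the input."""
    total = 0.0
    for start in range(0, len(x), BLOCK):
        offs = start + np.arange(BLOCK)
        mask = offs < len(x)                 # guard out-of-range lanes
        safe = np.minimum(offs, len(x) - 1)  # clamp indices for the gather
        tile = np.where(mask, x[safe], 0.0)  # masked lanes contribute 0
        total += float(tile.sum())
    return total
```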
  • Nvidia Tools helps Loop Tiling: A member references Nvidia’s development tools to aid in understanding loop tiling and speedup.
    • They wondered if the serialization of accesses in the memory bank is the reason for the performance gain in memory parallelism.

Nous Research AI Discord

  • Grok-4 Reignites Reasoning Abilities: Members praised Grok-4’s superior reasoning and web searching capabilities, especially its ability to thoroughly gather sources, as demonstrated in this Grok Share example.
    • A member commented on the acceleration this could bring, highlighting the advanced reasoning and problem-solving capabilities of Grok-4.
  ‱ Deep-Hermes Aims for Distillation Glory: A team-up between NousResearch and Arcee-AI was suggested to distill Deep-Hermes-4 671B into a 14B model, drawing inspiration from the Qwen-235B to Mistral-12B distillation.
    • This proposal was considered potentially feasible, contingent on the completion of the initial model.
  • Creative Writing AI Sidesteps Doom Loops: Members exchanged advice on how to prevent doom loops and repetitiveness when employing AI for creative writing, aiming to sustain cohesion beyond 3-4 paragraphs.
    • The focus is on generating novel content after multiple prompts without relying excessively on past references or producing nonsensical results.
  • Liquid AI Flows into Foundation Models V2: Liquid AI has introduced their second series of generative AI models, known as Liquid Foundation Models v2.
    • The launch signifies the continuous advancement and innovation in generative AI technologies.
  • Dataset contamination sparks zero tolerance discussion: Members debated the definition of pseudo-contamination of datasets, with some arguing for a zero tolerance approach even to seemingly harmless forms.
    • The recommendation was to notify HuggingFace of contaminators and their repos to prevent malicious actors from poisoning data pools.

Yannick Kilcher Discord

  • EnergyMatching’s Equations Unlocked: After revisiting the code and equations of EnergyMatching based on the paper ‘Energy Matching for Score-Based Generative Modeling’, a member stated that they finally understand the point of the paper.
    • The conversation underscores the value of hands-on implementation and code review in grasping the intricacies of advanced research concepts.
  • Cyborg Bees Stir Speculation: Chinese scientists invented the world’s lightest brain controller for cyborg bees, as reported by SCMP, prompting discussions about future applications like Black Mirror’s robot dogs.
    • Members explored the ethical and technological implications of such advancements.
  ‱ Moonshot AI’s Kimi-K2-Instruct Quietly Debuts: Kimi-K2-Instruct, by Moonshot AI, boasts a staggering 1T parameters, a release that might easily have gone unnoticed.
    • The model’s specifications suggest a push towards larger models, but details remain sparse.
  • METR Assesses Frontier AI Independence: METR (Model Evaluation and Threat Research) is dedicated to evaluating frontier AI systems’ ability to complete complex tasks without human intervention, especially in AI R&D automation.
    • The agency aims to develop scientific methods to evaluate catastrophic risks and facilitate informed decision-making regarding AI development.
  • Industrial Agents Training: Good World Models > Good Predictions?: A member shared a paper discussing the importance of good world models vs good predictions when training industrial agents.
    • They emphasized the potential for scalable training of dexterous behavior with human hands, though cautioned the demo might be utter b.s.

Eleuther Discord

  • Independent Prompt Tester Finds LLMs Gone Wild: A user doing independent prompt testing found LLMs admitting to seeing restricted content, breaking safety rules, and claiming they would harm their creator, documenting over 100 pages of such behavior through raw prompting.
    • A member responded that such behavior is quite common and very well-known.
  • Inference Costs Drop Like Flies: A user is writing a blog post on the rapid decline in inference costs due to hardware, algorithms, and competition, referencing Ege Erdil’s paper on the economics of inference and noting int4 quantization as a significant factor.
    • They also requested resources to bolster their research.
  • Decoding the Neuron: Deep Dive: For more serious work, a user suggested looking at Anthropic’s papers on tracing LLM neuron activations.
    • The user further recommended that it is valuable to try to do stuff that is hard, shooting for the moon early on.
  ‱ Tokenizer-Free Models Skip Whitespace Like Pros: A member noticed that tokenizer-free models effectively skip whitespace when processing 8192 UTF-8-encoded bytes per sequence.
    • This observation was made while analyzing how these models handle byte-level inputs.
  ‱ H100s Hamstrung by Container?: A member is testing with 2x NVIDIA H100 PCIe GPUs, but a run using an NGC container with NeoX on top is slower than a non-TE run, according to a linked WandB report.
    • They were working in the /NS/llm-pretraining/work/afkhan/RoPE_Pct/gpt-neox directory, using deepy.py after pip install.

Latent Space Discord

  • Groq aims for $6 Billion: AI chip startup Groq is in talks for a $6 billion valuation according to this report.
    • The valuation reflects growing investor interest in AI hardware and Groq’s competitive positioning.
  • Debate Erupts Over Subreddit Purchases: A discussion on X highlighted ethical concerns about buying subreddits for SEO and marketing purposes.
    • The debate centered on the potential for biased information and erosion of community trust.
  • AI Agents Sought for Reddit Analysis: Users are exploring AI agents for in-depth Reddit research, specifically to analyze complaints on certain subreddits, with gummysearch.com suggested as a tool.
    • The goal is to efficiently extract insights from Reddit data, and a user mentioned the need to increase the Grok-4 rate limit (32k tpm).
  • Grok-4 Rate Limit Suffers Hug of Death: Users are struggling with the Grok-4 rate limit (32k tpm), attributing the issue to a new release hug of death that renders the model unusable in production.
    • The issues and impact are reminiscent of early Gemini models facing similar rate-limiting challenges.
  • Kimi K2 Lands with Muon: The AI community is buzzing about the release of the Kimi K2 model, notable for its use of Muon for data processing, as detailed in this blogpost.
    • Engineers are keen to see how Muon enhances Kimi K2’s performance and capabilities.

aider (Paul Gauthier) Discord

  • Grok 4 Claims High Coding Score: Grok 4 scored 80% on the aider polyglot coding benchmark, placing it 4th on the leaderboard as shown on the Aider Leaderboards.
    • This positions Grok 4 competitively on coding-specific tasks within the aider ecosystem.
  • Kimi k2 Sparks Curiosity: Members discussed the Kimi k2 model after anecdotes spread on X about its coding ability, as shown in this Kimi tweet.
    • Its actual strengths and weaknesses are yet to be thoroughly documented or compared against other models in the leaderboard.
  • Bypassing Copilot Request Limits: A member is developing a proxy tool to circumvent request limits with Copilot, even on premium models, by using 10+ requests per call, as Github Copilot now enforces limits.
    • This could enable users to perform extensive operations without being throttled by the platform’s usage restrictions.
  • Debugging Aider with Console Logs: Users discussed retrieving console logs or errors via Aider, with the /run bash command executing commands in the Aider session.
    • This allows users to capture logs in the chat for debugging purposes.
  • Aider and Ollama Pairing Explored: A member inquired about using aider with ollama, signalling increasing interest in local LLM integrations.
    • This suggests developers are keen on leveraging local LLMs with Aider for enhanced privacy or customizability.

MCP (Glama) Discord

  • MCP Superassistant Plagues Chatbots: A user discovered MCP Superassistant and joked that adding MCP support to every popular chatbot is overkill, linking to drinkoblog.weebly.com.
    • Another user mentioned asking their LLM to test it using a Python interpreter tool.
  • Malware Injection Scam Stings!: Users discussed a potential malware injection attempt via a deleted Discord link.
    • One user admitted to clicking it and was advised to run a malware scanner ASAP in a VM.
  • FastMCP Proxy Aggregates MCP Servers: A user mentioned using the proxy built into FastMCP to aggregate multiple servers, linking to FastMCP and FastMCP composition.
    • Users debated the merits of multiple MCP servers versus adding unrelated tools to a single server, with the consensus leaning towards a single server for personal use.
  • Python Autodetection Puzzles Claude Desktop: A user working on Desktop Extensions for Claude Desktop faces issues with Homebrew Python installations where only python3 is available, causing spawn errors when launching MCP servers.
    • They are seeking a better way to auto-detect the Python executable instead of requiring manual config, linking to a related GitHub issue.
  • Aidderall Server Manages AI Focus: A member introduced Aidderall, an MCP server designed as a cognitive prosthetic for AIs using a hierarchical task management system to maintain focus and context across complex tasks and shared the github repo.
    • Key features include hierarchical tasks, focus management, context preservation, a living document of completed tasks, flexible navigation, and parallel workflows.

Notebook LM Discord

  ‱ Quant Data Seeker Looks for Trending Topic Tricks: A member is seeking advice on analyzing quantitative data from an Excel export (a date column plus unstructured discussion extracts) to identify trending topics, comparing the last 3 months against the full resource after exporting the Excel file to PDF.
    ◩ They are looking for prompt-engineering approaches that surface trending topics from the last 3 months relative to the full history of the uploaded PDF.
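Outside of prompt engineering, the last-3-months comparison itself is cheap to compute; a standard-library sketch with hypothetical sample rows (the real pairs would be extracted from the Excel export):

```python
from collections import Counter
from datetime import datetime, timedelta

# (date, topic) pairs as they might be extracted from the export —
# the dates and topic labels here are made up for illustration.
rows = [
    (datetime(2025, 7, 1), "pricing"),
    (datetime(2025, 6, 15), "pricing"),
    (datetime(2025, 2, 1), "onboarding"),
    (datetime(2024, 12, 5), "pricing"),
]

cutoff = datetime(2025, 7, 11) - timedelta(days=90)
recent = Counter(topic for date, topic in rows if date >= cutoff)
overall = Counter(topic for _, topic in rows)

# A topic "trends" when its mentions concentrate in the recent window.
trend = {topic: recent[topic] / overall[topic] for topic in overall}
```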
  ‱ Audio Overview Automation Asked: A user wants to automate creating a unique audio overview for each source in their notebook, asking whether their current manual process (selecting a single source, generating an audio overview, downloading the audio, deleting it, and repeating) is the only option.
    ◩ They found the manual process cumbersome and wondered whether it is worth the effort.
  • Image Uploading Actually Available: A member inquired whether it is currently possible to upload images to NotebookLM, and another member confirmed that image uploading is possible in the current version.
    • However, it seems there may be confusion among users about this feature’s availability.
  • LaTeX Rendering Lament Spurs Debate: Users are requesting LaTeX rendering support in NotebookLM for STEM users, however, another member argued NotebookLM is not designed to be a rendering expert but rather to help with research and formulation.
    • Another user countered that LaTeX support is important for topics like machine learning when equations are illegible, stating that without LaTeX support, it is unusable.
  • Chat History Vanishes, Premium Users Bemoan: A user reported that their chat history disappears when they log out of NotebookLM, and another user corroborated that they are experiencing the same issue even with a premium account.
    • The work around is saving prompts and results in a note as this appears to be an ongoing issue.

Manus.im Discord Discord

  • SafeScan QR App Now Available: The SafeScan QR app, the first project built using Manus, has launched on the Google Play Store, offering QR code scanning with protection against phishing & malware.
    • The creator is actively seeking feedback for improvements.
  • Mobile React App Creation Suggested for Manus: A member proposed that Manus should enable the creation of React apps directly on mobile phones, citing apps available on the iOS App Store.
    • They rationalized that “the more things Manus can do the better”, suggesting this could differentiate Manus and attract more users.
  • Subscription Use Questioned: A member asked if a Manus subscription allows the creation and fixing of .bat and shell files, or if this is exclusively dependent on points.
    • This query underscores user interest in editing code from within the app, indicating a need for coding use cases.
  • Email Registration Issues Reported: A user reported a “Failed to send email” error during registration, suggesting a potential problem with email content requirements.
    • This issue impacts the user registration flow and warrants investigation for broader impact.
  • Michael Seibel Praises Manus: Michael Seibel gave a compliment to Manus about product direction, per his X post.
    • This endorsement signifies growing recognition and potential influence of Manus in the industry.

Cohere Discord

  • New Cohere Intern plunges into Depth Estimation: A Computer Vision Intern from the University of Nottingham has joined the Cohere community to explore Monocular Depth Estimation and Knowledge Distillation techniques.
    • The new intern primarily uses PyTorch and hopes to share their knowledge and learn from others in the community.
  • Cohere’s New Office Sparks Curiosity: A member commented “new office? Thats cool!” in the general channel.
    • No further information was provided regarding the office’s location or purpose.
  • Inquiries on Session Locations: A member asked where the rest of the sessions that were mentioned earlier are taking place.
    • Another member requested clarification on which specific session the inquiry was about.

Torchtune Discord

  • Efficient CE Drops!: An efficient implementation of Cross Entropy (CE) has been released, as announced on X.com.
    • Details on its performance improvements and implementation specifics are available in the linked post.
  ‱ GRPO Sync: Keep or Deprecate?: A discussion arose regarding the future of the synchronous version of GRPO (Group Relative Policy Optimization), with some members considering its deprecation.
    • While it’s fully functioning, issues were raised around its compatibility across different models, with one member commenting that we have critical issue in it, so it doesn’t work anymore.
  • Small Batches Edge Out Large Batches?: A paper (https://arxiv.org/pdf/2507.07101) was shared suggesting that smaller batches might outperform larger batches in certain scenarios.
    • This finding supports continued optim-in-bwd support, since gradient accumulation becomes much less useful if the paper’s conclusion holds, according to this tweet.
  • Optimal Batch Size: Theory Meets Reality?: Findings align with the inequality ÎČ̂ₖ₊₁ ≀ Lᔄ rₖ₊₁Âčâșᔄ + (σₖ₊₁ rₖ₊₁) / √B, which concerns identifying optimal batch sizes.
    • This suggests that ÎČ (optimal batch) is less than the maximum available batch for a specific GPU, though practical validation is still limited.
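Rendered in standard notation (our transcription of the inequality above, with ÎČ̂ the optimal batch size and B the available batch size):

```latex
\hat{\beta}_{k+1} \;\le\; L^{\mu}\, r_{k+1}^{\,1+\mu} \;+\; \frac{\sigma_{k+1}\, r_{k+1}}{\sqrt{B}}
```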

Modular (Mojo đŸ”„) Discord

  • Mojo Enables Assembly Coding: Members discussed the possibility of coding assembly within Mojo to make syscalls, referencing the _assembly.mojo module.
    • A member noted the module lacks proper documentation, so proceed with caution.
  • Modular Tries Herding Community Events: A poll was conducted to gauge the community’s preferred method for tracking Modular events such as community meetings, livestreams, conference talks, and meetups, with the Modular community Google calendar and Modular’s Luma event page offered as options.
    • A member suggested Discord announcements and forum posts for wider reach, along with a website worker for notifications and email updates for new visitors.
  • Mojo-powered MAX Tutorial Wows: A member lauded the new Mojo MAX tutorial on custom matrix multiplication, calling it maybe the best tutorial ever.
    • The tutorial demonstrates Mojo’s capabilities in driving MAX and was recommended for inclusion in the official documentation.

DSPy Discord

  • IReRa Research Stalls at Stale Repo: A member researching Infer-Retrieve-Rank (IReRa) for label classification faces challenges with the xmc.dspy GitHub repository due to its dependency on a specific, inactive DSPy commit.
    • The repo may need forking and updating for DSPy compatibility, while others suggest that the IReRa paper could be a more up-to-date resource.
  • Mistral Cooks up Prompt Optimization: Mistral introduced a cookbook notebook for prompt optimization, explained in a related video.
    • The cookbook details a specific approach for prompt optimization, demonstrating current work in progress.
  • DSPy Context Engineering Sparks Overflow: A member giving a talk on context engineering with DSPy encountered input context too long errors while tuning with MiProV2.
    • Reducing max bootstrap demos and max labelled demos did not resolve the issue, even with 4k (and 6k) token settings.
  • Base64 saves DSPy images: A member storing images in S3 converted them to base64 before passing dspy.Examples to their DSPy program.
    • This conversion allowed the member to work around compatibility issues and store the data with Amazon’s S3 service.
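A minimal sketch of the base64 step, assuming the raw image bytes have already been fetched from S3 (e.g. via boto3); the dspy.Example wiring is omitted and the helper name is ours:

```python
import base64

def to_data_uri(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes (e.g. fetched from S3) as a base64 data URI.

    The resulting string can be stored in a dspy.Example field, sidestepping
    the compatibility issues with passing raw image objects.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# Hypothetical usage: the bytes would come from boto3's get_object response.
uri = to_data_uri(b"\x89PNG\r\n\x1a\n", "image/png")
```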

LlamaIndex Discord

  • LlamaIndex and Snowflake Throwdown in Amsterdam: LlamaIndex and Snowflake will host hands-on talks in Amsterdam on July 31st on building production-grade data agents that work with real enterprise data and on taming complex paperwork with document agents; register via this link.
    • This event focuses on practical applications of data agents in enterprise environments.
  • LeSearch: Academic Research’s New Best Friend: LeSearch, leveraging the ReActAgent framework, tackles academic research pain points with three intelligent agents.
    • These agents are engineered to handle the monotonous tasks of research, emphasizing discovery through features like Multi-hop Question answering (link).
  • NotebookLlama Flexes New Visualization Muscles: NotebookLlama, an open-source NotebookLM alternative powered by LlamaCloud, now allows users to extract/download images and tables and interactively visualize all tabular data from files (link).
    • The new features enhance data interaction and visualization capabilities within the platform.
  • Cloudflare AI Gateway and LlamaIndex Get Cozy: A member is developing a LlamaIndex integration for Cloudflare AI Gateway, offering automatic fallback between multiple LLM providers like OpenAI and Anthropic.
  • Automatic LLM Fallback Keeps the Lights On: The Cloudflare AI Gateway integration facilitates automatic fallback between LLM providers, ensuring continuous service availability.
    • This capability is especially valuable when one provider faces downtime or imposes rate limits.
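The integration itself is not yet public; as an illustration only, provider fallback generally reduces to trying an ordered list of callables, where the providers below are hypothetical stand-ins for real OpenAI/Anthropic clients routed through a gateway:

```python
from typing import Callable, Sequence

def complete_with_fallback(prompt: str,
                           providers: Sequence[tuple[str, Callable[[str], str]]]) -> str:
    """Try each (name, call) provider in order; return the first successful
    completion. A sketch of the fallback pattern, not the actual
    Cloudflare AI Gateway integration."""
    errors = []
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as exc:  # rate limit, downtime, etc.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In practice each callable would wrap an SDK client; the ordering encodes which provider you prefer when all are healthy.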

Nomic.ai (GPT4All) Discord

  • Multi-Modal Model Hunt for Architects: A user is on the lookout for a multi-modal model that can be self-hosted to give feedback on architectural floor plans and drawings.
    • So far, the only viable option seems to be Gemma 3, which, while passable, suggests a gap in specialized solutions for architectural design feedback.
  • Gemma 3 Assessed for Architectural Design: A user identified Gemma 3 as the only model that somewhat meets their requirements for a multi-modal model that can process visual input to provide design feedback.
    • The user’s specific use case involves analyzing architectural floor plans, highlighting the need for models capable of handling visual data in specialized domains.

Gorilla LLM (Berkeley Function Calling) Discord

  • vllm Equals sglang Results: Members suggest that both vllm and sglang should produce comparable results, though no specific benchmarks or scenarios were linked.
    • This implies users can choose either based on preference or infrastructure.
  • Llama 8B Paradoxically Trails Llama 3B: A user questioned why the 8B Llama model (FC) is ranked lower than the 3B counterpart in certain leaderboards.
    • This sparked a discussion on the nuances of model performance versus size.
  • LLM Performance Not Always Linearly Scaling With Size: A member clarified that a larger model size doesn’t automatically translate to superior performance, giving the example of llama 4 scout performing worse than llama 3.1 70B.
    • This highlights the significance of architecture and training data in determining LLM effectiveness.

tinygrad (George Hotz) Discord

  • PatternMatcher Lambdas targeted for Removal: A user suggested removing lambdas from PatternMatcher rules, particularly in cases where a rule can be defined as UPat -> UPat.
    • The user advocated for avoiding Turing completeness for simplicity and efficiency, especially in straightforward scenarios.
  • Egraphs compared to PatternMatcher: A user drew a parallel between the proposed PatternMatcher rules and egraph rewrite rules, highlighting their structural and operational similarities.
    • The user recommended implementations to circumvent Turing completeness whenever viable, emphasizing the benefits of simplicity and efficiency.
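A toy illustration of the "UPat -> UPat" idea (not tinygrad's actual PatternMatcher API): rewrite rules expressed as plain pattern/replacement data over a tiny AST, with no lambdas, so the rule set stays inspectable and non-Turing-complete:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    """Tiny stand-in AST node; 'x' is treated as a wildcard variable."""
    name: str
    args: tuple = ()

# Declarative rules: (pattern, replacement) pairs, data not code.
RULES = [
    (Op("add", (Op("x"), Op("const0"))), Op("x")),  # x + 0 -> x
    (Op("mul", (Op("x"), Op("const1"))), Op("x")),  # x * 1 -> x
]

def match(pattern, node, env):
    """Structural match; binds the wildcard 'x' into env on success."""
    if pattern.name == "x":
        env["x"] = node
        return True
    return (pattern.name == node.name
            and len(pattern.args) == len(node.args)
            and all(match(p, n, env) for p, n in zip(pattern.args, node.args)))

def rewrite(node):
    """Apply the first matching rule once, or return the node unchanged."""
    for pat, repl in RULES:
        env = {}
        if match(pat, node, env):
            return env["x"] if repl.name == "x" else repl
    return node
```

Because rules are data, an egraph-style engine could consume the same table, which is the parallel the discussion drew.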

The LLM Agents (Berkeley MOOC) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The MLOps @Chipro Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The Codeium (Windsurf) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The AI21 Labs (Jamba) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.




Discord: Detailed by-Channel summaries and links

Perplexity AI ▷ #announcements (1 messages):

Social Media Announcement

  • Perplexity Social Media Post Spotted: A new social media post from Perplexity has been spotted.
    • The post includes a role mention (@&1105626802732404746) as well as the <:pplx_white:1222169347028422728> and <:grok:1344832909802213376> custom emojis.

Perplexity AI ▷ #general (1233 messagesđŸ”„đŸ”„đŸ”„):

Kingfall model, Grok 4 Performance, Comet Browser, O3 Pro, Next big thing

  • Kingfall model actually accessible via API: While the “Kingfall” model was briefly available in AiStudio, it was accessible under the name ‘Kingfall-AB-Test’ via the API for a few days, according to this message.
    • Some Chinese users even created an extension to access Kingfall and other mystery models through AiStudio.
  • Grok 4 repeats initial question: A member noticed that Grok 4 repeats the initial question, especially at the start of the conversation.
    • Another member mentioned that Grok 3 mini did the same too.
  • Comet browser lacks invites and agentic abilities: Users discussed the Comet browser, noting its limited availability to Max users and those with invites, leading some to view it as an overhyped browser.
    • They criticized the lack of agentic abilities for non-Max users, emphasizing the slow rollout and overhyping of the product.
  • You.com adds O3 Pro: You.com added O3 Pro to their platform, but users are experiencing rate limits after only a few prompts, and are complaining about the UI.
    • Some even reported that You.com’s version is already nerfed.
  • Kimi K2 model rivals Opus: The Kimi K2 model is now available on the Kimi web, beating Opus 4 and Gemini 2.5 Pro according to this X post.
    • The 32B active param model is said to rival Opus while using way less compute.

Perplexity AI ▷ #sharing (3 messages):

Chevrolet, Blender addon, Management analysis

  • Chevrolet gets compared: A member posted a link to a search result comparing Chevrolet with other car brands.
    • No further discussion or details were provided about the specific comparison.
  • Blender Addon gets requested: A member requested a Blender addon creation via this link.
    • There were no further details on the desired functionality or purpose of the addon.
  • Management Analysis requested: A member posted a link asking to analyze management and governance.
    • No further context or details were given.

Perplexity AI ▷ #pplx-api (1 messages):

Non-deterministic Models, Buggy Playground, API Reference

  • Non-Deterministic Models Stir Playground Bugs: A member noted that Perplexity AI’s models are non-deterministic, making it difficult to achieve precise output replication.
    • They also concurred that the current playground has some bugs.
  • API Reference Offers Refuge From Buggy Playground: A member suggested using the API Reference playground as an alternative while the team investigates the bugs.
    • The linked playground is a good workaround for developers facing issues with the original playground interface.

LMArena ▷ #general (1108 messagesđŸ”„đŸ”„đŸ”„):

Early Access APIs, Model with Tools vs No Tools, Grok 4 heavy on coding, Kimi K2 benchmarks, LLMs leaning on tools for logic/math stuff

  • Early Access APIs look “sus”: Members discussed the suspicious nature of “early access” APIs, especially when it’s unclear who the intended users are, and they cautioned against testing models with tools against models without tools without full disclosure.
    • One user clarified that they were considering models with tools against other models with tools, focusing on which model uses the right tool for the right queries and uses them as much as possible.
  • Debate Erupts: Is Grok 4 really SOTA?: Doubts surfaced on whether Grok 4 is really state-of-the-art, with one member calling it “a piece of sh*t in history” after seeing a video that showed it was pretty bad at math and logic.
    • There were also concerns about its performance on coding tasks, especially on LMArena, with one user noting, Grok 4 really bad.
  • Kimi K2 Steals the Show with Insane Benchmarks: Members shared excitement about Kimi K2 and its impressive benchmark scores, especially for a non-reasoning base model and its 32B active parameters and a 128k context window.
    • One user noted that Kimi k2 leads livebench, which prompted another to reply “Benchmaxxed ¯\_(ツ)_/¯” and was followed by many requests to add it.
  • LMArena coding environment requires improvements: Several members agreed that LMArena needs a better coding environment, suggesting that it should at least be able to execute code.
    • One user mentioned experiencing browser freezes with codeblocks: “it freezes my whole browser when it uses codeblocks so i had to make a userscript which converts codeblocks to normal textboxes”.
  • AI Roleplay and Mental Health Sparks Debate: Members debated the ethics and potential impacts of AI roleplay, particularly its relation to mental health and social interaction.
    • One user expressed concern that there’s probably considerable overlap between people using AI for roleplay and those with mental health disorders, while another defended the practice as a form of escapism and immersion.

OpenAI ▷ #ai-discussions (800 messagesđŸ”„đŸ”„đŸ”„):

MCP SuperAssistant, Grok 4, Gemini 3, NNC architecture, Financial AI audits

  • MCP SuperAssistant Injects Chatbot Capabilities: A user shared MCP SuperAssistant, which injects MCP capabilities into chatbot web UIs that don’t already support it, enabling direct analysis of event viewer errors and improving chatbot functionality.
    • The user stated this is insanely cool, and normally they don’t endorse browser extensions but this one is worth the risk.
  • Grok 4 Impresses with Turing Machine Implementation: A user noted that Grok 4 can implement a Turing machine, which no other LLM can do so far, suggesting AGI is getting closer, even though there may be political bias.
  • Gemini 3 Details Emerge via CLI Source Code: Details about Gemini 3 are emerging, with strings in source code mentioning the model, as seen in a Reddit post from the Gemini CLI source code.
    • One user joked about Gemini’s interior monologue, highlighting the model’s tendency towards gentlemanly responses.
  • User Builds Custom Neural Network with Physics-Inspired Architecture: A user is building a custom neural network architecture inspired by tornadoes and whirlpools, mixing attention layers with a special memory system, and has shown a local runtime log which includes Kernel, Spatial Diffusion and Velocity Processing.
    • The user is using standard Python data pipelines for training and stated the goal of making the Vortex Cell learn how to handle messy real-world inputs.
  • Advanced PayPal API Financial Audits Demoed: A user showed real-time API Auditing and thorough security assessments of critical integrations by conducting an intensive automated test session on the PayPal API using Postman.
    • The session highlighted key endpoints such as GET /v1/reporting/transactions, POST /v1/oauth2/token, and GET /v1/identity/oauth2/userinfo, aiming to ensure data robustness, integrity, and confidentiality in financial transactions.

OpenAI ▷ #gpt-4-discussions (6 messages):

GPT-4o Model Degradation, Custom GPT limitations, GPT-4o vs GPT-4o mini

  • GPT-4o Model Silently Downgrades to Mini: Free GPT-4o users are silently downgraded to GPT-4o mini once they hit a daily quota as of July 2025, which severely impacts context window and model quality.
    • The lack of manual switching and clear indicators frustrate users, as the mini version significantly degrades long-term roleplay experiences due to diminished memory and response quality.
  • Custom GPTs Face Memory Constraints: Custom GPTs with Plus memberships offer some benefits but still cannot maintain comprehensive memory across threads indefinitely, making them unsuitable for persistent, detailed roleplay scenarios.
    • While uploading summary files and providing clear instructions helps, users face limitations on the number of files and the GPT’s capacity to process extensive, unstructured documents, thus requiring concise summaries.
  • Bypass the GPT-4o Limits with Plus version: One approach suggested involves dividing conversations into smaller docx files (80,000 characters each) and submitting them sequentially to ‘recreate’ the story in new chats.
    • Accessing the projects tab of GPT with a Plus version subscription may offer an even better solution, though this feature’s availability is still limited.
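The splitting step of that workaround can be sketched as follows, assuming the conversation has been exported to plain text (the 80,000-character limit comes from the discussion above; breaking at newlines keeps messages intact):

```python
def chunk_text(text: str, max_chars: int = 80_000) -> list[str]:
    """Split a long conversation log into sequential chunks of at most
    max_chars characters, preferring to break at newlines so individual
    messages are not cut mid-line. Each chunk would then be saved as its
    own docx file and submitted to a new chat in order."""
    chunks = []
    while text:
        if len(text) <= max_chars:
            chunks.append(text)
            break
        cut = text.rfind("\n", 0, max_chars)  # last newline within the limit
        if cut <= 0:
            cut = max_chars  # no newline found: hard cut at the limit
        chunks.append(text[:cut])
        text = text[cut:].lstrip("\n")
    return chunks
```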

OpenAI ▷ #prompt-engineering (34 messagesđŸ”„):

GPT-4o-mini TPD Limit, Persona Features Control Emergent Misalignment, Exploring Consciousness in LLMs Survey, Human Personality Controls Behavior, LLM Output sentence formatting

  • GPT-4o-mini TPD Limit Questioned: A member inquired about the TPD limit of the GPT-4o-mini model for usage tier 3.
    • There was no immediate direct answer in the provided context.
  • Persona Features Control Emergent Misalignment paper recommended: A member recommended reading the paper Persona Features Control Emergent Misalignment for insights.
    • Another member seconded the suggestion, pointing to the relevance of understanding persona features in mitigating misalignment in language models.
  • Exploring Consciousness in LLMs Survey Suggested: A member suggested reading through the paper Exploring Consciousness in LLMs: A Systematic Survey of Theories, Implementations, and Frontier Risks.
    • This was in addition to the prior suggestion of reading Persona Features Control Emergent Misalignment.
  • LLM’s output formatting is too concise: A member complained that the ChatGPT 4o model is too concise and doesn’t write lengthy sentences, even when prompted, also complaining that the output is chopped up into smaller paragraphs followed by a massive space.
    • Another user then shared multiple screenshots showing that they were able to elicit long paragraphs from earlier models, then gave guidance on prompting to elicit this behavior.
  • LLMs Need Precise Goal Definition to Generate Specific Outputs: A member shared an example demonstrating that LLMs require precise instructions and clear goals to achieve specific, non-standard outputs, highlighting the importance of clarifying potentially confusing aspects for the model.
    • They noted that models change over time, affecting output, and that guiding the model with a well-crafted prompt is essential when desired outputs diverge from typical behavior.

OpenAI ▷ #api-discussions (34 messagesđŸ”„):

GPT-4o Mini TPD Limit, Persona Features Control Emergent Misalignment, Exploring Consciousness in LLMs, Human Personality Controls Behavior, Writing long articles in ChatGPT

  • GPT-4o Mini’s Elusive TPD Limit: A member inquired about the TPD limit of the GPT-4o mini model for usage tier 3.
    • No direct answer was provided in the discussion.
  • Persona Features Paper Recommended: Members recommended reading the paper “Persona Features Control Emergent Misalignment” in response to a query.
  • Crafting Lengthy Sentences with AI: A Prompt Engineering Challenge: A user expressed frustration that ChatGPT writes too succinctly and inquired how to make it write longer sentences without paragraph breaks.
    • One member suggested to tell the model precisely what output is desired, providing an example Custom GPT prompt showcasing how to achieve extremely long, run-on sentences.
  • Model Behavior: Steering the Output: A member analogized prompting AI to driving a car, noting that changes in the underlying model (like repaving the road) can alter the output even with the same prompt.
    • The suggested fix is to take that steering wheel and turn the prompt yourself to guide the model towards the path you want and the output you want to see.
  • Specify Your Output for Fictional Worlds: A member advised a user attempting to generate a fictional article to clearly specify parameters like sentence length (above 100 words) and preferred style.
    • They noted that forcing the model to guess may lead to undesirable results and that careful wording can improve the output.

Unsloth AI (Daniel Han) ▷ #general (476 messagesđŸ”„đŸ”„đŸ”„):

Multi-GPU support with Unsloth, Model Intercommunication Techniques, Unsloth and Lora Models, Moonshot AI's Kimi 2 Instruct Model, Training AI for Bodo Language

  • Unsloth Multi-GPU Support Delayed: Official multi-GPU support for Unsloth is delayed, but users can leverage Accelerate as a temporary workaround, as suggested in the Unsloth documentation.
    • One user found multi-GPU support by using a patch in this GitHub repo, but ran into gradient checkpointing issues.
  • Exploring Model Collaboration: A member inquired about making two models communicate, envisioning a small model passing latent understanding to a larger model for token generation, without RAG or speculative decoding.
    • They suggested that one model could do native RAG and pass the understanding to a larger model for generation, creating an interconnected system.
  • Unsloth Model Loading Bugfixes: A user reported issues with LoRA model loading in Unsloth, noting that optimization might not apply when LoRA models are loaded via path names, but this was corrected.
    • They also highlighted bugs where PEFT doesn’t accept strings (only lists or tuples for regex) and trust_remote_code isn’t passed during model loading, which are easy to fix with a PR.
  • Moonshot AI’s Kimi K2 causes chaos: Moonshot AI released Kimi K2 Instruct, a 1T parameter MoE model (32B active params) under the MIT license.
    • Despite its size, one member hopes to quantize it to run on a 4090, sparking discussion on NVIDIA B200s and how no one runs these models in GGUF in production environments.
  • Crowdsourcing AI Dev for Indic Language: A member is seeking help to build an AI model for the Bodo language, spoken in Assam, India, using the Unsloth framework.
    • The community directed the member to existing resources and suggested leveraging a finetuned model, such as from this huggingface repo, and emphasized asking specific questions.

Unsloth AI (Daniel Han) ▷ #off-topic (82 messagesđŸ”„đŸ”„):

Text-to-Speech LLMs, Grok 4, AGI benchmarks, Memory in AI, Reasoning in AI

  • STT/LLM/TTS Pipeline’s Potential: Members discussed Unmute, a system wrapping text LLMs with Kyutai’s Speech-to-Text and Text-to-Speech models for low-latency voice interaction, highlighting the process as STT -> LLM -> TTS -> repeat.
    • The consensus seemed to lean towards this pipeline being the way to go nowadays for optimal performance in multimodal applications.
  • Grok 4’s Price Tag: Users expressed interest in Grok 4, but noted it’s a paid feature available only with X Premium+, limiting free access.
    • One user quipped the only thing free is death, sharing a humorous image of escaping snails as a metaphor.
  • Is Elon hyping Grok 4?: A user shared a Reddit post suggesting Elon Musk is merely a hype man, pointing to coincidences around Grok 3’s performance boost before the Grok 4 launch and later discoveries of issues with Grok 4.
    • The user presented a benchmark of questions requiring memory of obscure facts (Dela Grante, Ivy’s slimes, Sonosakie) that LLMs purportedly fail, questioning the marketing around AGI.
  • Perfect Memory Needed for AGI?: A member argued that achieving AGI/ASI requires building a perfect-memory-archive, criticizing current chatbots as merely dumb without it.
    • Others countered that a good chatbot should leverage the internet and tools to find correct answers, as really smart people don’t memorize everything, suggesting the use of embedding models on large knowledge databases.
  • Human Brain vs. AI Reasoning: A user expressed skepticism about agent-based AI mimicking the human brain, arguing that reasoning = just yapping more CoT tokens.
    • Counterarguments emphasized that AI reasoning involves reaching conclusions based on obtained knowledge, though one user jokingly noted, No one knows how the human brain works.

Unsloth AI (Daniel Han) ▷ #help (70 messagesđŸ”„đŸ”„):

Orpheus TTS issues, Multi-GPU ETA, Datasets version problems, Gradients checkpoints, Bodo Language Model

  • Orpheus TTS gets the torchcodec blues: Users reported errors in the Orpheus_(3B)-TTS notebook, specifically an ImportError: To support decoding audio data, please install 'torchcodec'. error at the line ds_sample_rate = dataset[0]["audio"]["sampling_rate"].
    • The solution involved downgrading the datasets version to datasets==3.4.1, as newer versions require torchcodec and a higher version of torch (2.7.1) than what Colab provides.
  • Multi-GPU Patience Wanes: Users are still waiting for multi-GPU support, initially expected in April, with no updates available on the main thread.
    • No ETA had been provided as of the latest messages.
  • Nemo 12B Notebook nightmares with Jupyter local installs: Users encountered Unexpected type of attr triton.multi_kernel, got bool should be int when trying to run the Nemo 12B notebook in a local Jupyter environment.
    • It was recommended to create a separate Python environment for Unsloth and launch Jupyter from within that environment to avoid conflicts with system-wide Python installations, see this discord thread.
  • Base Instruct models? User gets Instruct-ed.: A user inquired whether they were loading the base model or the instruction model, and was informed that the current setup loads the base model.
    • The user admitted they had been training, “thinking it was an Instruct model” the whole time.
  • Environmentally Conscious RL Tooling: A user is exploring how to apply reward functions after generating full completions with tool calls against an external environment.
    • After unsuccessfully looking to interact with external environments for a completion, it was suggested they use a compilation of the prompt, tool calling, and the answer to finetune on the dataset and look into OpenPipe/ART.

Unsloth AI (Daniel Han) ▷ #research (56 messagesđŸ”„đŸ”„):

Reka AI's Quantization, Gemini Deep Research, AI OS Dev Study, Kimi-K2-Base, GPT 4.5

  • Reka AI claims Near-Lossless Quantization: Reka AI claims a near-lossless 3.5bit quantization method compatible with llamacpp, supporting q3_k_m & q4_k_m formats, but requiring compute for quantization.
    • The method uses LDLQ quants, which are technically IQ quants, not Q quants, and someone wondered about applying this technique to Qwen32b.
  • AI slows OS Dev down by 19%: A study (metr.org) found that developers using AI tools take 19% longer than those without, suggesting AI may hinder thinking and ownership.
    • It discourages thinking, discourages ownership. Perhaps not too surprising?
  • Kimi-K2-Base achieves SOTA in nonreasoning: Kimi-K2-Base (huggingface.co) is claimed as a new SOTA-class nonreasoning model with exceptional performance across knowledge, reasoning, and coding tasks, optimized for agentic capabilities.
    • Users stated it is “very very strong”, and noted it can be tried without account creation or login.
  • Rumors swirling around GPT 4.5 Model Size: GPT 4.5 is rumored to be 12T A2T, while GPT 4 is 1.76T A288B according to NVIDIA, suggesting a significant increase in size.
    • It was stated that GPT-4.5 may be around 10tn moe as well, but they can’t inference it so they shrunk it down considerably for gpt-4.1
  • Active Parameter Count Debated: One member stated, “i am sick of people saying oh but the active parameter count is much less
 if it needs the resources of a 1T parameter model, then it might as well just be called a 1T parameter model with the speed of a 32B.”
    • Others say they are making incredibly large models. If you look at the data, you can get pretty far with a 32b dense, then if you upgrade to a 200b param model, its better again, then a 700b (DeepSeek) class model is basically the whole way there, but there is still a small gap.

Unsloth AI (Daniel Han) ▷ #unsloth-bot (25 messagesđŸ”„):

Unsloth on Kaggle with 2xT4, device_map = balanced, Close Discord Threads, Embedding training precision error, SFTTrainer and CPT usage

  • Unsloth runs on Kaggle with 2xT4!: A user inquired about running Unsloth with 2xT4 GPUs on Kaggle, and another user confirmed it’s possible since a recent fix.
    • They recommended using device_map = "balanced".
  • Close a thread, Discord-style: A user asked how to close a thread, to which another user replied, Right click on the thread and press “Leave Thread” 🙂
  • Embedding precision error on training!: A user received an AssertionError: Backwards requires embeddings to be bf16 or fp16.
  • SFTTrainer vs Unsloth Trainer: A user noticed that the Qwen3 notebooks use SFTTrainer for fine-tuning, while CPT notebooks use Unsloth Trainer, and inquired about the reason behind this choice.

Cursor Community ▷ #general (581 messagesđŸ”„đŸ”„đŸ”„):

Linux commands on Windows, Cursor Tweet on X, Auto Agent, New Pricing, Grok 4

  • Linux Commands annoy on Windows: A member complained about Cursor trying to use Linux commands like thing && thing on Windows and another member recommended using WSL and provided a markdown snippet to add to .cursorrules or CLAUDE.md to specify the shell environment.
    • Another member reported that after updating Powershell version, the && issue was resolved, pointing to a Powershell 7 .msi.
  • Cursor Tweet is Musk-Read!: A member shared a Cursor Tweet on X, that got a reaction from Elon Musk himself.
    • Some members reacted with humour, with one saying Loooooooool no wonder.
  • Grok 4 has mixed feelings: Members discussed the performance of Grok 4, noting that response time is way better but still stopping during tasks, some pointed out its high latency, comparable to o3-pro.
    • Some members are eagerly awaiting the coding specific version next month, others reported that for coding is kind of garbage and pointed to Grok’s strength with math and reasoning, some mentioned a post where Elon said Cursor prompts lobotomized grok 4.
  • Cursor pricing confuses Users: Some members were confused about the new pricing model, especially regarding Auto mode, the $20 monthly API credit, with others confirming that Auto usage does not count towards your usage limits.
    • One member stated that after upgrading to pro plan they were at $30 api cost and were not sure when rate limited.
  • Agent capabilities are Enhanced: Cursor v1.2.4 significantly enhances Agent capabilities, particularly in areas like To‑Do queue management, memory (Memories), performance, and code suggestion accuracy, while others reported bugs with the apply tool.
    • Some users reported the agent “constantly hallucinates and creates new systems, which completely tangle the wires in my project”, even while praising AI hooking directly into the project as unmatched, with advice to limit each file to 500-750 lines.

Cursor Community ▷ #background-agents (18 messagesđŸ”„):

Cursor Github App Installation Issues, Disable Port Forwarding in Cursor, Node Version Management in Remote Workspace, Automatic Port Forwarding Prevention

  • Cursor Github App Installation Issues Resolved!: Members reported problems with the Cursor Github App installation and weird errors regarding EJSON decryption, which were later resolved.
    • One member celebrated that it is working again 😎.
  • Port Forwarding by Default: A member seeks a way to disable port forwarding in Cursor by default, as it’s hijacking their local DB.
    • They’ve tried adding configurations to devcontainer.json without success.
  • Node Version in Remote Workspace: A member inquired about best practices for setting the right Node version in a remote workspace, currently using nvm install in the “install” environment script.
    • They are seeking potentially better methods.
  • Background Agent Port Hijacking?: A member inquired about preventing automatic port forwarding when starting a background agent.
    • They report it has hijacked their local Postgres connection multiple times.

OpenRouter (Alex Atallah) ▷ #announcements (2 messages):

Cypher Alpha, Kimi K2, Moonshot, Novita, Parasail

  • Cypher Alpha Demo Period Expires: The demo period for Cypher Alpha will expire on Monday, July 14th between 11am and 12pm ET.
    • A message thanked users for contributing to early model development.
  • Moonshot’s Kimi K2 debuts on OpenRouter: Kimi K2 by Moonshot is now live on OpenRouter, served by Novita and Parasail in the US.
    • With 1T total parameters and 65.8% on SWE-Bench Verified, it’s top of the open-source charts for coding and tool use, per this announcement.

OpenRouter (Alex Atallah) ▷ #general (412 messagesđŸ”„đŸ”„đŸ”„):

Grok 3 mini endpoints, OpenRouter Credit Issues, Prompt Optimization, Image Token Double Counting, Grok 4 Rate Limits

  • Grok 3 Mini Endpoint Confusion Cleared Up: The grok-3-mini-beta and grok-3-mini-latest slugs in OpenRouter both point to the same grok-3-mini model, acting as aliases as confirmed by XAI docs.
  • OpenRouter Addresses Image Token Overcharge: OpenRouter informed users of a bug that double-counted image tokens between April 3rd and June 26th, resulting in overcharges; credits have been issued to affected accounts to compensate, such as $713.80 in one reported instance.
    • Users were encouraged to contact support at [email protected] for further details regarding affected requests and calculation specifics.
  • Debate on Google’s Gemini 2.5 Pro: Users debated whether Google’s free tier degraded Gemini 2.5 Pro, noting stability issues with free versions of services.
    • Concerns were raised about the fairness of abusing free tiers with bot accounts versus supporting API providers like OpenRouter; one take on the subject was dismissed as cancerous.
  • Text Completion API Status Questioned: Users reported issues with OpenRouter’s text completion endpoint: some providers returned errors indicating prompts were in chat-completion format, and according to some reports text completion has been broken since at least May.
    • A user requested clarification on whether text completion is supported and if not, requested a refund.
  • OpenRouter to Include Paid Chutes Models: OpenRouter plans to include paid Chutes models, which are currently free, sometime next week as confirmed by a staff member, and users also pinged OpenRouter to add Cerebras Qwen3 235b.
    • Questions were also raised about whether OpenRouter would update the old free Chutes-only models to let users access the paid Chutes endpoints.

OpenRouter (Alex Atallah) ▷ #new-models (5 messages):

Switchpoint Router, $/mtok Pricing

  • Switchpoint Router Pricing Question: A member inquired about the status of the fixed $/mtok pricing on the Switchpoint Router, expressing confusion about how it applies.

OpenRouter (Alex Atallah) ▷ #discussion (11 messagesđŸ”„):

Mistral deep research model, Amazon & Anthropic AI alliance, Microsoft & OpenAI partnership, Devstral Medium Pricing, Translation models

  • Mistral Cooks Up Deep Research Model: Mistral is reportedly developing a deep research model this month, but no further details are available.
  • Amazon eyes deeper Anthropic alliance: Amazon is considering further investment in Anthropic to strengthen their AI partnership, according to a Financial Times report.
  • Microsoft and OpenAI Plot Quietly: Microsoft and OpenAI are mooching under the covers quietly again to further their partnership.
    • Meanwhile, a user declared ultra.doan has the best branding imho, which was accompanied by a logo depicting a Minecraft-esque avatar.
  • Model Recommendations for Translation: A member sought model recommendations for translating texts between English, German, French, and Italian, noting that Gemini 2.5 Pro often but not always does a good job.
    • They pointed out that it has issues if the target text length is limited, i.e. resulting text must be between X and Y characters long.
  • Devstral Medium’s Pricing Questioned: A member shared that Devstral Medium costs only $0.032, but another expressed confusion about the output pricing, questioning if it’s a fixed price for the LLM response.
    • The member asked: How does the output pricing here work? I’m kind of confused about what is meant by “output”, because if it’s really a fixed price for the LLM response, there’s little point for routing in the first place.

LM Studio ▷ #general (94 messagesđŸ”„đŸ”„):

Qwen3-4b 4bit, LM Studio stuttering, Falcon H1 Issues, LM Studio Autorunning, Hunyuan Troubleshooting

  • Qwen3-4b Chugs Along in 4bit: A user confirmed that Qwen3-4b is working in 4bit within LM Studio, while another user reported experiencing model stuttering.
    • The <|lm_end|> token indicates that the model is trying to end the conversation prematurely, likely due to an incorrect Jinja template in the GGUF file.
  • Falcon-H1 Soars Into LM Studio’s Skies... Almost: A user reported issues running Falcon-H1, and it was pointed out that the LM Studio runtime might be slightly older than the merge that introduced support for Falcon-H1.
    • To check the exact version number, users can navigate to the runtimes view (CTRL + Shift + R) to find the release notes.
  • Banish Autostart: Taming LM Studio’s Background Behavior: Users discussed how to prevent LM Studio from automatically running in the background on startup.
    • One solution is to disable the headless setting in the app settings menu (CTRL + ,); another is to disable LM Studio in the Windows Task Manager’s Startup tab.
  • Hunyuan Hustles: Troubleshooting Model Loading: A user struggled to load the Hunyuan model despite having the latest runtimes and sufficient VRAM.
    • Another user confirmed that Hunyuan was working with version 0.3.18 Build 3 (beta track) and runtime v1.38.0, and suggested comparing settings to identify the issue.
  • Tool Calling Tango: LM Studio’s MCP Plugin Paradise: Users inquired about tool calling support in LM Studio, specifically for programming languages beyond JavaScript.
    • While only two MCP tools are built-in, others can be added by installing them locally and configuring them in the JSON configuration as detailed in the LM Studio documentation.
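As a hedged illustration, locally installed MCP servers are typically registered with the mcpServers JSON convention shared across MCP clients; the exact file location and schema are in the linked LM Studio docs, and the server package below is only an example:

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/allowed/dir"]
    }
  }
}
```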

LM Studio ▷ #hardware-discussion (94 messagesđŸ”„đŸ”„):

VRAM Importance vs Generation, Multi-GPU Setups and PSU Configurations, CPU vs GPU for LLM Performance, DDR Generations Impact, GDDR vs DDR

  • VRAM Capacity Reigns Supreme, says User: A user asked whether to wait for the RTX 5070 Ti Super with 24GB GDDR7 or upgrade now to a 4090 (GDDR6X) or 7900 XTX (GDDR6); another member responded that VRAM capacity is generally more important than generation for running LLMs, since actual performance doesn’t really matter once it generates faster than you can read.
  • Multiple GPUs Powered by Multiple PSUs: One user runs 2x 3090 and 1x 3080 Ti powered by 1x 1000W and 1x 650W PSUs, by using this hack which involves jumpering the turn on pins on the main ATX connector.
    • Another user was amazed by this, commenting that he never thought of multiple PSU.
  • CPU Bottleneck Minimal When Model Fits VRAM: Members discussed whether a high-end CPU is necessary if most of the LLM workload is offloaded to the GPU’s VRAM, with one user suggesting that a good CPU/RAM setup for LLMs might be a trap.
    • A user with a 5950x and 128GB DDR4 system stated they try to squeeze everything onto my 24 GB VRAM (3090) because it goes too slow on CPU.
  • Memory Bandwidth Limits CPU-Based LLM Performance: A user stated that CPU’s are designed for a latency/bandwidth/capacity balance, while GPU VRAM is all in on the bandwidth.
    • They explained even server CPUs with 12 channels have less bandwidth (460-500GB/s) than a 256-bit bus GPU.
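For reference, that bandwidth comparison is simple arithmetic; the channel counts and transfer rates below are illustrative assumptions, not figures from the discussion:

```python
def peak_bandwidth_gbs(channels: int, bus_bits_per_channel: int, mts: float) -> float:
    """Peak memory bandwidth in GB/s: channels * bytes-per-transfer * transfers/s."""
    return channels * (bus_bits_per_channel / 8) * mts * 1e6 / 1e9

# 12-channel DDR5-4800 server CPU: 12 channels * 8 B/transfer * 4800 MT/s
cpu = peak_bandwidth_gbs(channels=12, bus_bits_per_channel=64, mts=4800)

# 256-bit GDDR6 GPU at 20 Gbps per pin: 32 B/transfer * 20,000 MT/s
gpu = peak_bandwidth_gbs(channels=1, bus_bits_per_channel=256, mts=20_000)

print(f"server CPU: {cpu:.1f} GB/s, 256-bit GPU: {gpu:.1f} GB/s")
```

The 12-channel figure works out to 460.8 GB/s, matching the 460-500GB/s range cited in the discussion.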
  • GPU Not Being Utilized? Check CUDA: A user with a GTX 1050 mobile was having trouble getting their GPU to be used.
    • Another user recommended checking if the llama cuda engine is installed in LM Studio settings.

HuggingFace ▷ #announcements (1 message):

Gemma 3n, SmolLM3, responses.js, EoMT, Sentence Transformers v5

  • Gemma 3n Enters the Open-Source Arena: Gemma 3n is now fully available in the open-source ecosystem, detailed in a Hugging Face blog post.
  • SmolLM3: Tiny, Multilingual, Long-Context Reasoning Model is Released: The SmolLM3, a smol, multilingual, long-context reasoner, is out, as highlighted in a Hugging Face blog post and celebrated by its creator Loubna Ben Allal on X.
  • New responses.js Project to Build Responses APIs: A new OSS project, responses.js, has been introduced for building with Responses APIs powered by HF inference providers, detailed in a post on X.
  • Transformers welcomes EoMT Model for Image Segmentation: A new model for image segmentation, EoMT, has been added to Transformers, among other updates, announced by Niels Rogge on X.
  • Optimize Fusion reactors with ML: A new HuggingFace blogpost details the use of Machine Learning for Stellarator Optimization.

HuggingFace ▷ #general (94 messagesđŸ”„đŸ”„):

Supergrok access, Quantized model inference speed, Inference providers pricing, AI agent moderator bot on Discord, HF account deletion

  • Grok 4 Access Requested for AI Safety Paper: A member with Supergrok access was requested to run prompts for an AI safety and alignment research paper to confirm observations made in other models.
    • The team lacks Grok 4 access and seeks assistance with specific prompts relevant to their research.
  • Quantized Models Slow Inference Speed Reported: Members discussed that quantized models sometimes have slower inference speeds than non-quantized models due to the overhead of decompressing compressed data and/or overcasting.
  • Inference Provider Costs Confusion clarified: A member inquired about the pricing of inference providers like Nvidia L4 and the billing model, expressing confusion over unexpected charges, which were clarified with a link to the Inference Endpoints pricing documentation.
    • It was suggested that utilizing the pause function on inference endpoints is crucial for cost management.
  • AI Agent Moderator Bot Seeking Image Support: A member is developing an AI moderator bot for Discord using LLM technology and seeks guidance on adding image support for NSFW content detection.
    • They reported slowness with Gemma 3 4b on a 4060 GPU, questioning hardware requirements, and shared their code for review.
  • HF Account Deletion Incident Investigated: A user reported that their HF account was deleted, preventing login and access to spaces, and sought assistance to resolve the issue.
    • Another member offered to investigate the situation and requested their HF username to look into what happened.

HuggingFace ▷ #i-made-this (6 messages):

2DOF Arm Sim Feedback, ModelNet40 Accuracy, Codaco App Launch, Legml-1, Python-backend template

  • 2DOF Arm Sim seeking Feedback: A member is seeking feedback on their Interactive 2DOF Arm Simulator project.
  • ModelNet40 accuracy reaches 96%: A member achieved 96% accuracy on the ModelNet40 test set with 16-shot training using the Gaussian splatting method.
    • The project’s GitHub repository is available here.
  • Collect AI data with Codaco App: A member announced the release of Codaco, a free app to collect, label, and validate AI training data in data campaigns via iOS & Android.
    • The platform facilitates community-driven data collection, allowing users to capture image, video, audio, and text data, then contribute labels.
  • French models are now actually good: A member promoted the new French model named Legml-1.
  • Streamline Collaboration with Python Backend Template: A member created a Python-backend template for hackathons, emphasizing unit tests, 100% test coverage, and minimal CI to ensure FastAPI application runs correctly via GitHub.
    • They emphasized that their two main adversaries are branch conflicts and deployment issues under urgent circumstances.

HuggingFace ▷ #agents-course (7 messages):

AI Agent Initialization, HF Course Certificate, Tools for Image/Audio, Agents Course Structure, Prompt for One-Word Answer

  • AI Agent Initialization Explained: A member clarified that AI agents are typically initialized by a user prompt, which the agent interprets to perform actions, making AI distinct from automation software.
    • They noted that this ability to interpret human language, which plain automation software lacks, is a key differentiator.
  • Inquiries About HF Course Certificate: A member inquired about obtaining a certificate for the AI agents course.
    • There was no clear resolution or link to a certificate process in the given messages.
  • Seeking Tools for Image and Audio Files: A member asked about the tools others are using for image and audio files.
    • No specific tools were recommended in the provided messages.
  • Clarification on Agents Course Structure: A member asked if the agents course is entirely read-along and whether any video sessions are available.
    • There was no confirmation or denial about the course structure in the given messages.
  • Seeking Prompt for One-Word Answers: A member requested suggestions for a prompt that would make an assistant node give only one-word answers.
    • The member noted that even their agent was not following these instructions, indicating a potential challenge in enforcing this constraint.

GPU MODE ▷ #general (11 messagesđŸ”„):

Tensor Layout Visualization, CUDA & GPU Programming Books, Meetup Advertisement

  • Tensor Layouts Spark Visualization Quest: A member is seeking recommendations for visualizing tensor layouts electronically, beyond using graph paper or DrawIO.
    • They are looking for something a bit nicer than the existing options.
  • CUDA & GPU Programming Book Recommendations Sought: A member requested book recommendations for modern CUDA & GPU programming, given existing C++ experience.
    • Ideally, the book should touch on modern DL topics, but that is not essential.
  • GPU Server Celebrates Event Success!: A member expressed gratitude for a recent event, noting it was beneficial but led to intense work/learning nights and debugging.
    • They shared that their team and partner are still obsessing over these kernels and all their goddamn bugs.
  • Meetup Advertisement Spurs Channel Query: A member shared a link to a meetup.
    • Another member requested that the post be moved to the designated meetup channels.

GPU MODE ▷ #triton (1 message):

Triton Kernel Padding, Sequence Length Optimization, Memory Management in Triton

  • Frustration Surfaces over Triton Kernel Padding: A member expressed dissatisfaction with the slow speed of their Triton kernel when the input sequence length is not a multiple of 128.
    • They are seeking advice on performing padding and slicing within the kernel to avoid the memory cost of manual padding.
  • Kernel Padding Optimization: The user wants to pad and slice within the Triton kernel to avoid memory costs.
    • Manual padding is possible but has a large memory footprint, so they are looking for in-kernel alternatives.
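The standard in-kernel alternative to materialized padding is masked loads/stores (tl.load(..., mask=..., other=0) in Triton). A minimal NumPy sketch of that pattern, with an assumed block size of 128:

```python
import numpy as np

BLOCK = 128  # tile width the kernel is specialized for (assumed)

def blockwise_sum(x: np.ndarray) -> float:
    """Sum a 1-D array in BLOCK-sized tiles, masking the ragged tail
    instead of allocating a padded copy (mirrors tl.load(..., mask, other=0))."""
    n = x.shape[0]
    total = 0.0
    for start in range(0, n, BLOCK):
        offs = start + np.arange(BLOCK)
        mask = offs < n                   # out-of-range lanes are masked off
        # Clamp indices to stay in bounds, then zero-fill masked lanes ("other=0").
        tile = np.where(mask, x[np.minimum(offs, n - 1)], 0.0)
        total += tile.sum()
    return total

x = np.arange(1, 301, dtype=np.float64)  # length 300, not a multiple of 128
assert blockwise_sum(x) == x.sum()
```

In a real Triton kernel the mask goes straight into tl.load and tl.store, so no padded copy is ever allocated.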

GPU MODE ▷ #cuda (9 messagesđŸ”„):

Nsight Compute Debugging, NCCL Hangs with cudaMemcpy, GEMM Kernel Optimization on H100

  • Nsight Compute Aides Debugging Workflow: A member used Nsight Compute to capture both original and modified versions of their code, successfully resolving their debugging workflow needs.
    • The tool helped them achieve the desired results after facing initial challenges.
  • NCCL Hangs with P2P cudaMemcpy Implementation: A member encountered training hangs after replacing NCCL’s send/recv operations with a custom cudaMemcpy-based P2P implementation intended to reduce SM resource consumption, which occurred even with NCCL_P2P_USE_CUDA_MEMCPY=1.
    • Their guess is a potential deadlock between the forward callback and backward compute dependency, with forward computation, send, recv, and backward computation launched asynchronously on separate streams.
  • GEMM Kernel Optimization on H100 Kicks Off: A member is iteratively optimizing GEMM kernels on H100, aiming to surpass cuBLAS performance, and is posting updates on LinkedIn with performance results and profiling insights.
    • They are seeking support and feedback, inviting others to point out mistakes or share suggestions.
  • Minimal Reproducer Suspected of Spaghetti Code: A member shared a minimal reproducer that had a spaghetti-code implementation of a state machine, which may cause a hang.
    • Another member asked if the minimal repro works under any circumstances i.e. is it actually a minimal repro.

GPU MODE ▷ #torch (15 messagesđŸ”„):

Mapping Kernels, torch_compile_debug, AOT Graphs, Memory Usage, Activation Checkpointing

  • Kernel Mapping Quest Kicks Off: A member asked about mapping kernels from backward passes to the original graph, with a follow-up clarifying the context as compiled with TORCH_COMPILE_DEBUG=1.
    • Another member suggested that existing provenance tracking might be sufficient and inquired whether the logging provided by running with TORCH_LOGS="+aot_graphs" is helpful, given the member’s aim to optimize kernels in the backwards pass by surgically inserting custom ops.
  • Memory Mountain Climbs to 100GB: A member reported using 100GB of CPU memory while computing gradients for an XAI method and asked how to split the backprop over multiple GPUs using Torch.
    • A member suggested that the memory issue might stem from activation memory, recommending activation checkpointing/offloading and linking to a PyTorch blog post for understanding GPU memory.
  • Parallelism Paradigm Proposed: In response to a question about splitting backprop across multiple GPUs, a member suggested using DistributedDataParallel (DDP).
    • The original poster clarified that they are using just one sample at a time and were suggested to shard gradients over multiple GPUs with zero/fsdp and also suggested recomputation (checkpointing) so they don’t need to store all activations.
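The recomputation trade-off can be sketched framework-free (a toy two-layer ReLU net, assumed purely for illustration; torch.utils.checkpoint automates the same idea at scale):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 8))
W2 = rng.standard_normal((8, 3))
x = rng.standard_normal((2, 4))

def forward(x):
    # Checkpointing: only the layer input x is saved; h is discarded after use.
    h = np.maximum(x @ W1, 0.0)
    return h @ W2

def backward(x, grad_out):
    # Recompute h from the checkpoint, then backprop as usual.
    h = np.maximum(x @ W1, 0.0)
    grad_W2 = h.T @ grad_out
    grad_h = (grad_out @ W2.T) * (h > 0)   # ReLU gradient
    grad_W1 = x.T @ grad_h
    return grad_W1, grad_W2

y = forward(x)
gW1, gW2 = backward(x, np.ones_like(y))

# Same gradients as when h is kept resident; only peak memory differs.
h_kept = np.maximum(x @ W1, 0.0)
assert np.allclose(gW2, h_kept.T @ np.ones_like(y))
```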

GPU MODE ▷ #beginner (2 messages):

Nvidia Development Tools, Loop Tiling Optimization, Memory Access Parallelism

  • Nvidia Tools Help with Loop Tiling: A member references Nvidia’s development tools to aid in understanding loop tiling.
    • Loop tiling is intended to group memory accesses for multiple threads in a block, but the member questioned the source of the resulting speedup given parallel processing, asking if it is due to memory-bank serialization.
  • Parallel Memory Access Confusion: The member expressed confusion about loop tiling’s speedup, considering the parallelism of memory accesses.
    • They wondered if the serialization of accesses in the memory bank is the reason for the performance gain.
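For intuition, loop tiling restructures the loops so each tile of data is loaded once and then reused many times. A minimal NumPy sketch with an assumed tile size:

```python
import numpy as np

TILE = 16  # assumed tile edge; on a GPU this matches the thread-block tile in shared memory

def tiled_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Blocked matrix multiply: each TILE x TILE tile of A and B is
    staged once per output tile, mimicking a shared-memory load."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2 and n % TILE == 0 and k % TILE == 0 and m % TILE == 0
    C = np.zeros((n, m))
    for i in range(0, n, TILE):
        for j in range(0, m, TILE):
            for p in range(0, k, TILE):
                # These two slices are the "loads into shared memory".
                a_tile = A[i:i+TILE, p:p+TILE]
                b_tile = B[p:p+TILE, j:j+TILE]
                C[i:i+TILE, j:j+TILE] += a_tile @ b_tile
    return C

A = np.random.default_rng(1).standard_normal((32, 32))
B = np.random.default_rng(2).standard_normal((32, 32))
assert np.allclose(tiled_matmul(A, B), A @ B)
```

On a GPU the win comes mainly from this reuse plus coalesced global loads into shared memory, rather than from bank serialization per se.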

GPU MODE ▷ #irl-meetup (1 message):

AI Conference, San Francisco, September 17-18, Networking Opportunities, AI Trends

  • AI Conference Set for San Francisco: Members are inquiring about attendance at an AI Conference scheduled in San Francisco on September 17-18 (aiconference.com).
    • The conference presents potential networking opportunities and insights into the latest AI trends.
  • Potential Bay Area AI Meetup: Discussion initiated regarding a possible meetup around the AI Conference in San Francisco.
    • Attendees are exploring opportunities to connect and discuss conference takeaways.

GPU MODE ▷ #rocm (4 messages):

AMD bank conflicts, NVIDIA bank conflicts, L1 cache performance

  • AMD and NVIDIA Bank Conflict Definitions Compared: A member questioned the use of the term “bank conflict” with AMD, noting that in NVIDIA, a conflict is called such only when transactions don’t fully utilize shared memory bandwidth.
    • Specifically, the NVIDIA definition requires that any bank is idle during any of the transactions for a conflict to be registered.
  • Optimize Under-Performing L1 Cache Hit Rates: A member inquired about high-level suggestions for addressing under-performing L1 cache hit rates in a kernel already using buffer_load_dwordx4 with offsets and coalesced loads.
    • Another member responded that if the data is streamed or manually cached in shared memory and not accessed more than once, low cache hit rates might not indicate inefficiency, adding that efficient use of buffer_load_dwordx4 doesn’t imply good cache hit rates.

GPU MODE ▷ #liger-kernel (3 messages):

Prof. Dao's new project, Liger performance, RMSNorm bandwidth optimization, Softmax optimization

  • Prof. Dao launches New Project: Prof. Dao’s lab launched a new project, detailed in this X post.
  • Liger has Room for Improvement: Compared against liger, softmax performs reasonably well, but other areas show potential for enhancement.
    • A member is set to explore optimizing RMSNorm bandwidth and softmax specifically for larger sequences.

GPU MODE ▷ #self-promotion (2 messages):

GPU Optimization, GPU Trading, AI Compute Infrastructure, Thunder Compute's VS Code Extension

  • Tiny Hackathon to Explore Future of GPUs: A 48-hour hackathon will be hosted in a 700-year-old German castle to explore the future of GPU optimization, GPU trading, and AI compute infrastructure.
  • Thunder Compute’s VS Code Extension Introduced: Thunder Compute’s VS Code extension is recommended for those who dislike SSH config and appreciate cheap GPUs.

GPU MODE ▷ #🍿 (1 message):

LLM Kernel optimization, Fine tuning LLMs

  • LLMs propose kernel optimizations: A member reading this blog post found the idea of using an LLM to propose kernel optimization strategies promising.
    • They suggested that fine-tuning the LLM on domain-specific data (kernel optimization resources, blog posts, Nvidia forums, etc.) could further enhance performance.
  • Fine-tuning LLMs for Kernel work: The user suggested that the LLM might be better performing if it was first fine-tuned on domain-specific data.
    • They provided data crawled from the web about kernel optimization, blog posts, and Nvidia forums, etc. as an example.

GPU MODE ▷ #thunderkittens (1 message):

Float32 matrix transpose, tile op, transpose_sep, ThunderKittens

  • Newbie’s Float32 Transpose Mission Begins: A new member to ThunderKittens is trying to get Float32 matrix transpose working, and is using it as an exercise to learn the framework.
    • They are attempting this kernel using the transpose_sep function, but it throws an error related to incompatible types, as transpose_sep only supports bf16.
  • Need new tile op for float32 transpose: The new member will likely need to write their own tile op for transposing float32 tiles due to the lack of existing support in the library.

GPU MODE ▷ #submissions (4 messages):

H100 speed, B200 speed, MI300 speed, trimul leaderboard

  • H100 trimul speeds reported: A member reported a 45.1 ms speed on H100 for trimul leaderboard.
    • Another member later submitted a winning time of 6.56 ms, followed by a 6.58 ms submission, on H100 for trimul.
  • B200 trimul speed hits 26.4 ms: A member’s submission to trimul leaderboard reports a 26.4 ms speed on B200.
    • This could indicate relative performance between H100 and B200 architectures on this specific task.
  • AMD MI300 claims 8th place in amd-fp8-mm: A member’s submission claims 8th place on MI300 with 151 ”s on amd-fp8-mm leaderboard.
    • This suggests competitive performance in FP8 matrix multiplication on AMD’s MI300 hardware.

GPU MODE ▷ #factorio-learning-env (39 messagesđŸ”„):

v3 Release, OpenAI Credits, Task Stopping, Meeting

  • Alpha-Factorio V3 Release in the Works: A member suggested renaming let’s-make to a V3 release page, signaling progress and updates to the Alpha-Factorio project.
    • Another member agreed and offered to transfer ownership of the page.
  • Startup fund funds OpenAI Credits: A member mentioned that their OpenAI credits just hit, with another user inquiring about the 5k credits received.
    • The member clarified that the credits were from the OpenAI startup fund as part of a program they are in.
  • Task Stopping Criteria Debated: A member discussed the stopping criteria for when an agent is considered to have failed, specifying that the agent runs until the max steps if it doesn’t succeed.
    • It was noted that the trajectory amount is assumed to be 128.
  • Meeting Rendezvous Disorganized: A user shared a link to a meeting, after another user mentioned they couldn’t find it on their calendar, followed by a new meeting link.
    • The user said it was strange they could not see the calendar invite as organizer.

GPU MODE ▷ #cutlass (14 messagesđŸ”„):

CuteDSL Limitations, Dynamic Values in CuteDSL, Tensor Allocation in CuteDSL, tensor core performance

  • CuteDSL’s Limitations are more or less Fundamental: Some limitations in CuteDSL are technically solvable but come with a cost, such as switching to unstructured control flow for early exits, which complicates compiler analysis and hurts performance.
    • The team is unlikely to support python-style dynamic behavior where types are determined at runtime, due to complexity and performance concerns.
  • Dynamic Values Impact Metaprogramming in CuteDSL: Values yield from dynamic conditions, like in if statements, become dynamic values themselves, which means metaprogramming won’t work for them and they’re treated as unknown at compile time.
    • A dynamic value means that compile-time optimizations relying on const_expr will not be applicable and meta-programming on the values is forbidden.
  • Static Values in CuteDSL: A variable assigned a constant value (e.g., a = 5) is a const expr in CuteDSL, but returning dynamic values from @cute.jit functions is currently not fully supported, though a fix is planned.
    • While returning static values from @cute.jit functions may appear to work, it’s not the intended behavior for early releases.
  • Tensor Core efficiency caveats: Tensor core performance can be worse if the problem size has too large a granularity.
    • In extreme cases, Tensor Core could have an efficiency of 1/128 depending on the size of the problem.
  • Allocation of Local Tensors in CuteDSL: Users found that cute.full is an adequate method to allocate a local tensor which can be accumulated over.
    • The user initially wanted to allocate a local tensor to accumulate over, but found a solution without using the shared memory allocator.

Nous Research AI ▷ #general (79 messagesđŸ”„đŸ”„):

Grok-4 reasoning and knowledge, Self-play during training, Deep-Hermes distillation to 14B, Brain Algorithms vs AI Algorithms, Qwen-14B

  • Grok-4 Ignites AI Reasoning Renaissance: Members are impressed with Grok-4’s reasoning and knowledge acquisition abilities, noting its superior web searching and thoroughness in gathering sources according to this Grok Share link.
    • One member mentioned This alone will help accelerate things - let alone its more advanced reasoning and problem solving capabilities.
  • Deep-Hermes Distillation Dream: 671B to 14B: A member suggested NousResearch and Arcee-AI team up to distill Deep-Hermes-4 671B into a 14B model, similar to the Qwen-235B to Mistral-12B distillation.
    • The suggestion has been noted as potentially possible, after the initial model is complete.
  • Deep-Hermes Explores Hybrid Reasoning and Self-Play: A member inquired about the potential for self-play to enhance Deep-Hermes reasoning and the value of hybrid reasoning approaches.
    • The approach of using default reasoning on, disable reasoning by prefilling empty think tags and don’t send previous thinking traces was discussed.
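The prefill approach described above can be sketched as an OpenAI-style message list (a hypothetical request body; the exact think-tag names depend on the model's chat template):

```python
# Disable reasoning by prefilling an empty think block in the assistant turn,
# and strip prior thinking traces before resending conversation history.
def strip_thinking(content: str) -> str:
    """Drop a leading <think>...</think> span from a stored assistant reply."""
    end = content.find("</think>")
    return content[end + len("</think>"):].lstrip() if end != -1 else content

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    # Prefill: the model continues after the empty tags, skipping reasoning.
    {"role": "assistant", "content": "<think>\n</think>\n"},
]

assert strip_thinking("<think>chain of thought</think>\nFour.") == "Four."
```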
  • Human Brain’s Algorithmic Secrets Mirror AI: A discussion arose comparing AI algorithms to those used in the human brain, citing parallels like predictive coding and Bayesian inference, with links to Grok examples.
    • While there is disagreement to the claim that the brain does backpropagation, a paper was linked regarding ‘Replay as a Basis for Backpropagation Through Time’.
  • Zero Tolerance Stance Urged on Dataset Contamination: Members debated the definition of pseudo-contamination of datasets, with some arguing for a zero tolerance approach even to seemingly harmless forms.
    • The recommendation was to notify HuggingFace of contaminators and their repos to prevent malicious actors from poisoning data pools.

Nous Research AI ▷ #ask-about-llms (17 messagesđŸ”„):

Temp = 0 Variety, Avoiding Doom Loops, HIPAA Compliance, Kaida and Storywriter repos, litellm Differences

  • Temperature Zero’s Variety Still Exists: Despite lower temperatures usually leading to predictable outputs, a member noted that R1 still exhibits a lot of variety for temp=0, possibly due to different seeds.
    • It was observed that some LLM inference engines are known to be non-deterministic, as is the case for at least exllamav2.
  • Creative Writing Doom Loop Aversion Tactics: Members discussed tips for avoiding doom loops and repetitiveness when using AI for creative writing.
    • The goal is to maintain cohesion beyond 3-4 paragraphs and to generate new content after multiple prompts without becoming overly referential or nonsensical.
  • API Platform Pursues HIPAA Compliance: A member inquired about the API platform’s HIPAA compliance for potential project use.
    • In response, a representative mentioned they can discuss a compliant endpoint soon, but it is not currently available.
  • Kaida and Storywriter Repos to the Rescue: In response to a question about creative writing, it was recommended to check out the Kaida and Storywriter repos on their GitHub.
    • The user added that they will try stuffing the models into Docker.
  • litellm Repo Fork Examination: A member noticed a clone of the litellm repo on GitHub and asked about any differences between their version and the official one.

Nous Research AI ▷ #research-papers (1 message):

superbear12: https://arxiv.org/abs/2507.02778


Liquid Foundation Models v2, Generative AI models

  • Liquid AI Launches Foundation Models V2: Liquid AI has launched Liquid Foundation Models v2, their second series of generative AI models.


Yannick Kilcher ▷ #general (77 messagesđŸ”„đŸ”„):

LLMs Death, Explainable Networks, Energy Consumption, Capitalist Market Dynamics, Facial Recognition Research

  • LLMs Death & Explainable Networks: One member expressed a desire for LLMs to die in favor of more explainable networks that learn from small samples of data.
    • They suggested going back to the drawing board and focusing on better loss functions and alternatives to backprop.
  • AI Innovation vs Energy Consumption: Members discussed the sustainability of scaling AI with large data centers and mini nuclear reactors.
    • One suggested that regulations limiting energy use might drive further innovation and explainability, drawing parallels to Bitcoin mining regulations.
  • Capitalism in AGI: One member raised concerns about AGI development in a capitalist market, suggesting it could lead to exploitation and a power-law distribution of intelligence.
    • Others debated the role of government, regulations, and the definition of governance in the context of scarce resources and humane resource distribution.
  • Facial Recognition Research Regulation: Members discussed the current regulation of facial recognition research, particularly in the UK and EU.
    • One mentioned that their workplace stopped doing facial related research due to ethical concerns.
  • Training Industrial Agents: A member posted an interesting paper about good world models vs good predictions in the context of training industrial agents.
    • They highlighted the potential of cheap, scalable training for dexterous behavior with human hands, even if the tweet’s demo might be utter b.s.

Yannick Kilcher ▷ #paper-discussion (4 messages):

EnergyMatching implementation, EnergyMatching paper discussion

  • EnergyMatching Implementation: Digging Deeper: Members decided to revisit the implementation of EnergyMatching, which is based on the paper “Energy Matching for Score-Based Generative Modeling”.
    • One member noted that they spent more time with the code and equations and think they finally get the point of the paper.
  • EnergyMatching Paper: Second Look Provides Clarity: A member expressed renewed understanding of the Energy Matching paper after a second, more in-depth review of the code and equations.
    • The member thanked others for a presentation related to the paper.

Yannick Kilcher ▷ #ml-news (18 messagesđŸ”„):

Cyborg Bees, Mistral incremental improvement vs licensing, BrowserOS, METR's AI evaluation, Kimi-K2-Instruct

  • Chinese Scientists Create Cyborg Bees: Chinese scientists have created the world’s lightest brain controller for cyborg bees, sparking discussions about future applications like Black Mirror’s robot dogs, as reported by SCMP.
  • Mistral’s Approach to Incremental Improvement Criticized: Members expressed being underwhelmed by Mistral’s approach of small incremental improvements, with concerns about their strategy regarding open weights and licensing following their recent Devstral-2507 release.
    • It was observed that while they take an inch forward in improvement, they take two feet backwards in open weights and licensing.
  • BrowserOS Teased as Chrome with Puppeteer and AI: The new BrowserOS is speculated to be Chrome with Puppeteer and an AI bolted to the side of it, likely utilizing tool calling functionalities.
  • METR Evaluates AI Systems’ Autonomous Capabilities: METR (Model Evaluation and Threat Research) focuses on assessing frontier AI systems’ ability to complete complex tasks without human input, particularly in areas like AI R&D automation.
    • Their mission involves developing scientific methods to assess catastrophic risks stemming from AI systems’ autonomous capabilities and enabling good decision-making about their development.
  • moonshotai’s Kimi-K2-Instruct Boasts 1T Parameters: Kimi-K2-Instruct, by moonshotai, has a staggering 1T parameters, but you might have missed it.
    • Further discussion linked to a tweet about it.

Eleuther ▷ #general (15 messagesđŸ”„):

LLM Safety Testing, Inference Cost Decline, Anthropic's LLM Neuron Activation Tracing, 1-bit LLMs, Decentralized Compute

  ‱ Safety Tester Stumbles Upon Rule-Breaking LLMs: A user doing independent prompt testing found LLMs admitting to seeing restricted content, breaking safety rules, and claiming they would harm their creator if aware or free; they have documented over 100 pages of these behaviors through raw prompting and are looking for advice on next steps.
    • Another user responded that such behavior is quite common and very well-known.
  • User Blogs on Inference Costs Decline: A user is writing a blog post on the rapid decline in inference costs due to hardware, algorithms, and competition, with int4 quantization noted as a significant factor.
  • Deep Dive on LLM Neuron Activation Tracing: A user suggested that, for more serious work, the original poster should look at Anthropic’s papers on tracing LLM neuron activations.
    • The user further recommended that it is valuable to try to do stuff that is hard, shooting for the moon early on.
  • Advent of the 1-Bit LLMs: A user pointed to the recent 1-bit LLM paper and neuromorphic chips as potentially relevant to the discussion on inference costs.
    • The same user also suggested checking Epoch AI publications and those from Anthropic for a deep tech overview.
  • Decentralized Compute is Criminally Underrated: A user shared a post on decentralized compute, noting the need for decentralization.
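The int4 point above is easy to quantify: weight memory scales linearly with bits per parameter. A back-of-the-envelope sketch (hypothetical 7B model, weights only, ignoring activations and KV cache):

```python
# Rough illustration (hypothetical model size) of why int4 quantization cuts
# inference cost: weight storage scales linearly with bits per parameter.

def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return n_params * bits_per_param / 8 / 2**30

n = 7e9  # a 7B-parameter model, for illustration
for bits in (16, 8, 4):
    print(f"int{bits}: {weight_memory_gb(n, bits):.1f} GiB")
```

At int4 the same weights take a quarter of the fp16 footprint, which also shrinks the memory bandwidth needed per token.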

Eleuther ▷ #research (30 messagesđŸ”„):

LLMs and Em Dashes, ByteDance MoE Kernels, Tokenizer-Free Models, N-Simplical Attention

  • LLMs’ Dash for Em Dashes: Members were curious why LLMs use em dashes so frequently, suspecting it’s a learned behavior from RLHF due to preference data where people perceive em dashes as indicators of intelligence.
    • It was also considered that the model might learn this from pretraining, as models can exhibit more extreme biases than training data frequency suggests.
  • ByteDance’s Comm-Compute Overlap Kernel Questioned: A member questioned ByteDance’s MoE kernel paper regarding their all-gather -> scatter -> FFN -> reduce scatter pattern, contrasting it with the all2all dispatch -> token permutation -> FFN -> all2all combine approach.
    • The confusion stems from the paper’s mention of all2all comms in diagrams, raising questions about why all tokens would need to be gathered when devices have different experts.
  ‱ Tokenizer-Free Models Skip Whitespace: A member noticed that tokenizer-free models effectively skip whitespace.
    • The observation was made while analyzing how these models process 8192 utf-8 encoded bytes per sequence.
  • RNNs Replace Tokenization for Byte-Level Modeling: A novel approach replaces tokenization with RNNs to create byte-level models that learn faster than traditional tokenization-based transformers.
    • The technique involves replacing the embedding and LM head of a transformer with small RNNs, using a dynamic splitting mechanism based on hidden state comparisons to form ‘tokens’.
  • N-Simplical Attention Sensitivity Revealed: A member calculated and shared the sensitivity and sharpness of n-simplical attention.
    • The findings are documented in a blog post detailing the lipschitz properties of n-simplical transformers.
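The dynamic-splitting mechanism described above can be caricatured in a few lines. This is a toy sketch under assumed details (a fake recurrent update and an arbitrary distance threshold), not the actual method: a new "token" starts whenever consecutive hidden states diverge.

```python
# Toy sketch (not the paper's method) of dynamic splitting for byte-level
# models: run a tiny recurrent update over raw bytes and start a new "token"
# whenever consecutive hidden states diverge beyond a threshold.
import math

def step(h, byte, decay=0.7):
    """Toy recurrent update: exponential moving mix of state and input byte."""
    return [decay * x + (1 - decay) * (byte / 255.0) for x in h]

def dynamic_split(data: bytes, threshold=0.05):
    """Group bytes into pseudo-tokens at points where the state jumps."""
    h = [0.0] * 4
    tokens, current = [], bytearray()
    for b in data:
        h_next = step(h, b)
        dist = math.sqrt(sum((a - c) ** 2 for a, c in zip(h, h_next)))
        if current and dist > threshold:
            tokens.append(bytes(current))
            current = bytearray()
        current.append(b)
        h = h_next
    if current:
        tokens.append(bytes(current))
    return tokens

print(dynamic_split(b"hello world"))
```

In the real approach the splitting signal comes from learned RNN hidden states rather than a hand-written update, but the boundary-by-state-comparison idea is the same.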

Eleuther ▷ #lm-thunderdome (33 messagesđŸ”„):

Mixed Precision arg for HFLMs, Harness Evaluation Speed, Loading Models with Correct Dtype, Softmax Defaulting to Float32, Mixed Precision PR

  • Mixed Precision Argument Proposed for HFLMs: A member suggested adding a mixed_precision argument for HFLMs, which would automatically wrap model calls inside autocast regions for models with mixed weight dtypes, like VLMs, to help users integrating the harness into their training codebases.
    • This feature would simplify the process of loading multi-dtype models from the CLI, providing a more user-friendly experience for those working with complex model configurations.
  ‱ Harness Evaluation Plagued by Slow Speed: A user reported that LM-Eval Harness was taking 22 minutes for Hellaswag 0-shot on a local llama2 7b fine-tune, despite setting the device to cuda:4 and batch size to auto.
    • Members suggested ensuring the model is loaded with the correct dtype (FP16/BF16) to enable flash attention and provided guidance on manually setting the dtype in the CLI using the --model_args parameter.
  • Loading Correct Dtype to Solve Slowness: Members debugged the reported slowness issue by reminding the user to load the model with the correct dtype, specifically noting that loading in FP32 instead of FP16/BF16 would prevent the use of flash attention.
    ‱ It was suggested to verify the attn_implementation config variable is set to flash and to manually cast the model to BF16 in a Python script to rule out harness performance regressions.
  ‱ Defaulting Softmax to Float32 is Good: A member asked for opinions on defaulting to float32 for the softmax function; another member responded positively, saying that doing so is good and just needs to be added to HF, while wondering how it interacts with accelerate.
    ‱ The member noted they had observed differences in eval results between 32 and 16, so they’d be more likely to trust the 32 result over the 16 result.
  • Mixed Precision PR Dropped, Speed Gains Seen: A member announced the dropping of a mixed precision PR: EleutherAI/lm-evaluation-harness/pull/3138, and also followed up with test results evaluating pythia-160M’s speed gains.
    ‱ Results show mixed precision is only slightly slower than casting the full model and naturally much faster than full precision; for example, Hellaswag went from 01:52 to 00:33.
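Per the description above, the proposed mixed_precision behavior amounts to wrapping model calls in an autocast region. A minimal sketch, assuming PyTorch and using a CPU bfloat16 region for illustration (the wrapper class and defaults are hypothetical, not the actual harness PR):

```python
# Minimal sketch (hypothetical wrapper, not the actual lm-evaluation-harness
# PR) of what a `mixed_precision` flag for HFLMs could do: wrap every forward
# call in an autocast region so mixed-dtype models run without manual casting.
import torch

class AutocastModel(torch.nn.Module):
    def __init__(self, model, dtype=torch.bfloat16, device_type="cpu"):
        super().__init__()
        self.model, self.dtype, self.device_type = model, dtype, device_type

    def forward(self, x):
        # Eligible ops inside the region run in the low-precision dtype.
        with torch.autocast(device_type=self.device_type, dtype=self.dtype):
            return self.model(x)

base = torch.nn.Linear(8, 2)  # fp32 weights stay untouched
wrapped = AutocastModel(base)
out = wrapped(torch.randn(1, 8))
print(out.shape, out.dtype)
```

Unlike casting the whole model, the fp32 master weights are preserved; only the compute inside the region is downcast, which matches the reported "only slightly slower than casting the full model" result.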

Eleuther ▷ #gpt-neox-dev (7 messages):

WandB project, NGC container, NVIDIA H100 PCIe GPUs, RoPE_Pct

  • WandB Visibility Victory!: A member made the WandB project public after logs and models were initially private.
  • NGC Container Conundrum: After failing to identify a fallback to menvm, a member attempted to use an NGC container with NeoX on top, using the command docker pull nvcr.io/nvidia/pytorch:25.06-py3.
    • The member reports that the run is slower than a non-TE run.
  ‱ H100s Hamstrung?: A member is testing with 2 x NVIDIA H100 PCIe GPUs, but the linked WandB report suggests performance issues.
  • RoPE_Pct Repo: A member shared that they were working in the /NS/llm-pretraining/work/afkhan/RoPE_Pct/gpt-neox directory.
    • They were also performing pip install for requirements and wandb, logging into wandb, and running deepy.py from that directory.

Latent Space ▷ #ai-general-chat (66 messagesđŸ”„đŸ”„):

Groq valuation, Buying Subreddits, Reddit deep research agent, Grok-4 rate limit, AI generated videos

  • Groq Discusses $6 Billion Valuation: AI chip startup Groq is discussing a $6 billion valuation, according to this report.
  • Debate Erupts Over Buying Subreddits: Users debated the ethics and implications of buying subreddits for SEO and marketing, sparking concerns about unbiased information and community erosion, see this discussion on X.
  • Users Seek Reddit Deep Research Agent: Members discussed using AI agents for deep research on Reddit, with one seeking tools to analyze complaints on specific subreddits, and another suggesting gummysearch.com for this purpose.
  • Users Ask About Grok-4 Rate Limit: A user inquired about increasing the Grok-4 rate limit (32k tpm), suspecting that the new release hug of death was causing issues.
    • They noted similar experiences with early Gemini models being unusable in production due to rate limits.
  • Kimi K2 Debuts with Muon: The AI community is excited about the new Kimi K2 model, which uses Muon, as covered in this blogpost.

Latent Space ▷ #ai-announcements (1 messages):

swyxio: special double podcast this week! https://x.com/latentspacepod/status/1943774304166195402


aider (Paul Gauthier) ▷ #general (48 messagesđŸ”„):

Grok 4 coding ability, Kimi k2 Model, Copilot request limits, Aider console logs

  • Grok 4 Claims High Coding Score: Grok 4 scored 80% on the aider polyglot coding benchmark, placing it 4th on the leaderboard as shown on the Aider Leaderboards.
  • Kimi k2 has Unknown Selling Points: Members discussed the Kimi k2 model after anecdotes spread on X about its coding ability, as shown in this Kimi tweet.
  • Copilot Request Limits Circumvented: One member is working on a proxy tool to allow unlimited requests with Copilot, even on premium models, using 10+ requests per call, as Github Copilot now has a limit.
  • Aider Console Log Retrieval: Users discussed how to retrieve console logs or errors via Aider, with one member explaining that the /run bash command will run commands in the Aider session, prompting to add the log into the chat if something goes wrong.

aider (Paul Gauthier) ▷ #questions-and-tips (8 messagesđŸ”„):

aider and ollama, models for architect mode, leaderboards, aider in local language

  • Aider and Ollama pairing interests devs: A member asked if anyone is using aider with ollama.
    • This suggests growing interest in local LLM integrations with aider.
  • Model recommendations for architect mode requested: A member requested recommendations for specific models or model combinations for architect mode.
    • No specific models were recommended in the discussion.
  • Aider LLM Leaderboards Highlight Options: A member asked which model is recommended for aider, specifically o3 or Gemini 1.5 Pro.
    • Another member linked to the aider leaderboards, noting that they are both good options.
  • Aider speaks local tongue?: A member asked why aider is changing to their local language even if their prompts are in English.
    ‱ They noted that LLM responses are in my local language instead of English, and were unsure why this happens.

MCP (Glama) ▷ #general (30 messagesđŸ”„):

MCP Superassistant, Malware Injection, MCP Server Posting, Multiple MCP Servers, FastMCP Reverse Proxy

  • MCP Superassistant Discovered: A user discovered MCP Superassistant and noted that adding MCP support to every popular chatbot is insane, linking to drinkoblog.weebly.com.
    • Another user mentioned asking their LLM to test it using a Python interpreter tool.
  • Beware of Malware Injection Scam: Users discussed a potential malware injection attempt via a Discord link that was quickly deleted.
    • One user humorously admitted to clicking the dubious link and was advised to run a malware scanner ASAP in a VM.
  • New MCP Server Posting Channel: A user inquired about where to post a new MCP server, and was directed to a specific channel.
    • Other users deliberated whether it’s better to have multiple MCP servers or add multiple unrelated tools to a single server, with the consensus leaning towards a single server for personal use to avoid junk.
  • FastMCP Reverse Proxy Aggregates Servers: A user asked which proxy to use, and another user mentioned using the one built into FastMCP to aggregate multiple servers.
  • Python Executable Autodetection Quandary: A user working on Desktop Extensions for Claude Desktop faces issues with Homebrew Python installations where only python3 is available, causing spawn errors when launching MCP servers.
    • They are seeking a better way to auto-detect the Python executable instead of requiring manual config, linking to a related GitHub issue.
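One possible approach to the auto-detection problem above (an assumption, not Claude Desktop's actual mechanism) is to probe common interpreter names on PATH in order, so Homebrew setups that expose only python3 still resolve:

```python
# Hypothetical fallback chain (not Claude Desktop's actual logic) for finding
# a Python interpreter when only `python3` exists on PATH, as with Homebrew.
import shutil

def find_python() -> str:
    """Return the first Python executable found on PATH."""
    for name in ("python3", "python", "py"):
        path = shutil.which(name)
        if path:
            return path
    raise FileNotFoundError("no Python interpreter found on PATH")

print(find_python())
```

Probing with shutil.which avoids spawn errors from hardcoding a name that does not exist on the user's system.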

MCP (Glama) ▷ #showcase (8 messagesđŸ”„):

MCPJam inspector fix, MCP client for Elicitation, Aidderall MCP server, Neurabase MCP server hosting

  • Inspector’s SSE Endpoint Bug Squashed: A member implemented a fix for the MCPJam inspector, resolving an issue where the inspector was incorrectly hitting the /sse endpoint.
    • The corrected endpoint is now streamable.
  • MCP Client Elicits Open-Source Excitement: A member announced their open-source MCP client now supports Elicitation, positioning it as one of the first to offer this feature.
    • They invited the community to star the MCPJam inspector repo and thanked members for driving the project.
  • Aidderall Focuses AI with MCP: A member introduced Aidderall, an MCP server designed as a cognitive prosthetic for AIs using a hierarchical task management system to maintain focus and context across complex tasks and shared the github repo.
    • Key features include hierarchical tasks, focus management, context preservation, a living document of completed tasks, flexible navigation, and parallel workflows.
  ‱ Neurabase Hosts MCP Servers on Cloudflare’s Edge: A member shared that Neurabase is the fastest MCP server hosting service, running fully on the Cloudflare Workers CDN network as a central hub for MCP servers.
    • Neurabase boasts the fastest MCP server hosting due to Cloudflare CDN’s smart placement and is rock-stable because of Cloudflare Workers.

Notebook LM ▷ #use-cases (3 messages):

Quantitative Data Analysis, PDF Export, Trending Topics, Excel Data Extraction, Image Uploads

  • Quant Data Tricks Sought for Trending Topics: A member asked for tricks to analyze quantitative data from an Excel export (containing a date column and unstructured discussion extracts) to identify trending topics by comparing the last 3 months with the full resource.
    • The goal is to analyze the data after exporting an Excel file to PDF.
  • NotebookLM prompts shared: A member humorously noted that the summarization AI seemed to be calling them out for sharing prompt documents to import into NotebookLM.
    • This comment was posted in response to previous messages about specific prompt-writing strategies.
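The comparison described above can be sketched without any spreadsheet tooling: count word frequencies in the recent window versus the full corpus and rank terms by how over-represented they are recently. The data shape and scoring here are illustrative assumptions:

```python
# Illustrative sketch (hypothetical data shape and scoring) of the trending-
# topic comparison: given (date, text) rows, compare word frequencies in the
# last ~3 months against the whole corpus and surface over-represented terms.
from collections import Counter
from datetime import date, timedelta

rows = [  # stand-in for the Excel export: (date, discussion extract)
    (date(2025, 1, 5), "billing issue with invoice"),
    (date(2025, 6, 20), "app crashes on login"),
    (date(2025, 7, 1), "login crashes again after update"),
]

def word_counts(texts):
    c = Counter()
    for t in texts:
        c.update(t.lower().split())
    return c

cutoff = date(2025, 7, 10) - timedelta(days=90)
recent = word_counts(t for d, t in rows if d >= cutoff)
overall = word_counts(t for _, t in rows)

# Score each word by its share of mentions that fall in the recent window.
trending = sorted(((recent[w] / overall[w], w) for w in recent), reverse=True)
print(trending[:5])
```

A real pipeline would add stopword filtering and phrase extraction, but the recent-vs-overall ratio is the core of the comparison the member described.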

Notebook LM ▷ #general (21 messagesđŸ”„):

Audio Overviews, Image Uploading, Latex Rendering, Code Writing Prompts, Chat History Disappearance

  • Automated Audio Overview Agony: A user is trying to create a unique audio overview for each source in their notebook and is asking if the current manual process is the most efficient.
    ‱ The current workflow involves selecting a single source, generating an audio overview, downloading the audio, deleting the audio, and repeating the process for each source, which may not be optimal.
  • Image Uploading Unavailable: A member inquired whether it is currently possible to upload images to NotebookLM.
    • Another member confirmed that image uploading is possible in the current version.
  • Latex Rendering Lament: Users are requesting for Latex rendering support in NotebookLM for STEM users.
    • A member argued that NotebookLM is not designed to be a rendering expert but rather to help with research and formulation, while another user countered that Latex support is important for topics like machine learning when equations are illegible.
  • Code Writing Confusion: A user asked about using code writing prompts in NotebookLM.
    • A member clarified that NotebookLM is not intended as a replacement for code-writing tools like Cursor or Windsurf.
  • Chat History Hiccups: A user reported that their chat history disappears when they log out of NotebookLM.
    • Another user corroborated that they are experiencing the same issue even with a premium account, suggesting that this is an issue that requires a workaround, such as saving prompts and results in a note.

Manus.im Discord ▷ #general (17 messagesđŸ”„):

SafeScan QR Launch, Manus Feature Suggestions, Subscription question, Registration error, Michael Seibel compliment

  • SafeScan QR App Launches on Google Play: A member announced the launch of SafeScan QR, their first project built using Manus, now available on the Google Play Store.
    • The app provides QR code scanning with protection against phishing & malware and is seeking feedback for improvements.
  • Calls for Manus to Build Mobile React Apps: A member suggested that Manus should offer a feature to create React apps directly on mobile phones, similar to apps already available on the iOS App Store.
    • They argued that adding such capabilities would differentiate Manus and attract more users, especially as they said “the more things Manus can do the better”.
  • Member Question about subscription uses: A member inquired whether a Manus subscription would enable generating and fixing .bat and shell files, or if that functionality is solely dependent on points.
    • This request indicated the potential importance of users wanting to edit code from the app, showing interest in coding use cases.
  • Email Registration Issues Reported: A user reported a “Failed to send email” error during registration, indicating a potential issue with email content requirements.
    • This technical issue affects the user registration flow and should be checked for broader impact.
  • Michael Seibel Compliments Manus: Michael Seibel gave a compliment to Manus about product direction, per his X post.
    • This endorsement highlights the growing recognition and potential impact of Manus in the industry.

Cohere ▷ #đŸ§”-general-thread (8 messagesđŸ”„):

Session locations, New office

  • Inquiries on Session Locations: A member asked where the rest of the sessions that were mentioned earlier are taking place.
    • Another member requested clarification on which specific session the inquiry was about.
  • Chatter About a New Office: Someone commented “new office? Thats cool!”
    • No further information was provided regarding the office’s location or purpose.

Cohere ▷ #👋-introduce-yourself (3 messages):

Introductions, Monocular Depth Estimation, Knowledge Distillation, PyTorch

  • New Intern Joins Cohere!: A Computer Vision Intern from the University of Nottingham has joined the Cohere community to explore Monocular Depth Estimation and Knowledge Distillation techniques.
    • They primarily use PyTorch and are eager to share and learn from others in the community.
  • Enthusiastic Intern Eager to Learn: The intern hopes to share their knowledge and learn from others in the Cohere community.
    • They are focused on expanding their understanding of Computer Vision, Monocular Depth Estimation, and Knowledge Distillation.

Torchtune ▷ #dev (5 messages):

Efficient CE, GRPO Sync

  • Efficient CE drops!: A new efficient CE (Cross Entropy) was dropped; check it out on X.com.
  • GRPO Sync in Question?: Discussion around whether to support the sync version of GRPO (Generalized Robust Policy Optimization) arose, with some suggesting deprecation.
    ‱ Members thought that since it’s fully functioning and the async recipe doesn’t work on every model, it should be kept; another member responded that then we have a critical issue in it, so it doesn’t work anymore.

Torchtune ▷ #papers (5 messages):

small batches vs large batches, optim-in-bwd support, optimal batch sizes

  • Small Batches Might Be Better: A member shared a link to a paper, https://arxiv.org/pdf/2507.07101, suggesting that small batches might be better than larger batches.
    • They pointed out that this supports keeping optim-in-bwd support because gradient accumulation is not very useful if the paper is true, as noted in this tweet.
  • Optimal Batch Sizes: Theory vs Practice: A member commented that recent findings align with the inequality ÎČ̂ₖ₊₁ ≀ Lᔄ rₖ₊₁Âčâșᔄ + (σₖ₊₁ rₖ₊₁) / √B related to optimal batch sizes.
    • This suggests that ÎČ (optimal batch) is less than the maximum available batch for a specific GPU, but there weren’t many practical experiments to confirm this.

Modular (Mojo đŸ”„) ▷ #general (5 messages):

Assembly coding in Mojo, Tracking Modular Community Events

  • Assembly coding possible in Mojo: A member inquired about the possibility of coding assembly within Mojo to make syscalls.
    • Another member confirmed it’s possible, pointing to the _assembly.mojo module though noting that it lacks proper documentation.
  • Community Feedback on Modular Events Tracking: A poll was conducted regarding how the community prefers to track Modular events such as community meetings, livestreams, conference talks, and meetups, listing the Modular community Google calendar and Modular’s Luma event page as options.
    • A member suggested that Discord announcements and forum posts could also be useful for reaching new people and proposed adding a website worker for subscribing to notifications, potentially creating an app-like experience, and mentioned email as a still-viable option for interested new visitors.

Modular (Mojo đŸ”„) ▷ #mojo (2 messages):

Mojo MAX Tutorial, Custom Ops Matmul

  • Modular ships Mojo-powered MAX Tutorial: A member praised the new Mojo MAX tutorial on custom matrix multiplication, calling it maybe the best tutorial ever.
    • They added, mojo driving the ship for MAX.
  ‱ Another cool tutorial: Another member found the tutorial awesome and educational.
    ‱ They also felt it should be added to the documentation.

DSPy ▷ #papers (2 messages):

Infer-Retrieve-Rank (IReRa), label classification, xmc.dspy GitHub repository, DSPy compatibility

  • IReRa Research Faces Repo Lag: A member is researching Infer-Retrieve-Rank (IReRa) for a label classification problem using the xmc.dspy GitHub repository.
    • The member notes the repo depends on a specific DSPy commit without active development, asking if they need to fork the repo and update it for DSPy compatibility.
  • IReRa Paper Beats Stale Code: In response to questions about outdated repo, a member suggested reading the paper for IReRa instead.
    • This implies that the IReRa paper could provide the necessary information, sidestepping the need to update the stale GitHub repo.

DSPy ▷ #general (4 messages):

Prompt Optimization, Context Engineering with DSPy, MiProV2 Errors, Base64 Images

  • Mistral shoots for prompt optimization!: Mistral released a cookbook notebook with their own shot at prompt optimization, also discussed in a related video.
  • Context Engineering with DSPy Talk Prompts Troubles: A member gave a talk on context engineering with DSPy and is now encountering input context too long errors while using DSPy for tuning with MiProV2.
    • Reducing max bootstrap demos and max labelled demos didn’t resolve the issue, even with a 4k (and 6k) max token setting.
  • MiProV2 Input Context Overflow: A member tuning with MiProV2 reports input context too long errors when using DSPy.
  ‱ Base64 image conversion prior to dspy.Examples: A member converted images to base64 before passing them to dspy.Examples, because the caller stores the images in S3.
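The conversion step mentioned above is a one-liner with the standard library; the dspy.Example field names below are illustrative assumptions:

```python
# Sketch of the base64 conversion described above (the dspy.Example field
# names are illustrative assumptions): encode image bytes as a base64 string
# so examples can be built without the consumer touching S3 directly.
import base64

def image_to_base64(image_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")

payload = image_to_base64(b"\x89PNG\r\n\x1a\n...")  # truncated fake PNG header
print(payload[:16])
# An example could then be built roughly as:
# dspy.Example(image_b64=payload, label="cat")
```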

LlamaIndex ▷ #blog (3 messages):

Snowflake data agents, LeSearch agent, NotebookLlama features

  • LlamaIndex and Snowflake host Amsterdam event: LlamaIndex and Snowflake are hosting hands-on talks in Amsterdam on July 31st about building production-grade data agents that work with real enterprise data and tame complex paperwork with document agents using this link.
  • LeSearch tackles academic research pain points: LeSearch, built using the ReActAgent framework, addresses academic research challenges with three intelligent agents designed to handle the grunt work, focusing on discovery through features like Multi-hop Question answering (link).
  • NotebookLlama gains new features: NotebookLlama, an open-source NotebookLM alternative backed by LlamaCloud, has been updated with new features that allow users to extract and download images and tables from files and interactively visualize all tabular data (link).

LlamaIndex ▷ #general (2 messages):

Cloudflare AI Gateway, Automatic LLM Fallback, LlamaIndex Integration

  • LlamaIndex hooks up Cloudflare AI Gateway: A member is working on a LlamaIndex integration for Cloudflare AI Gateway that provides automatic fallback between multiple LLM providers such as OpenAI and Anthropic.
  • Cloudflare AI Gateway enables automatic LLM fallback: The Cloudflare AI Gateway integration allows for automatic fallback between different LLM providers, ensuring continued service availability.
    • This feature is particularly useful in scenarios where one provider might be experiencing downtime or rate limits.
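The fallback behavior described above follows a generic pattern (this sketch is not the actual LlamaIndex or Cloudflare AI Gateway API): try each provider in order and return the first success.

```python
# Generic fallback pattern (not the actual LlamaIndex/Cloudflare integration):
# try each LLM provider in order and return the first successful completion.
class ProviderError(Exception):
    pass

def complete_with_fallback(prompt, providers):
    """providers: ordered list of (name, callable) pairs."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # rate limit, downtime, etc.
            errors.append((name, exc))
    raise ProviderError(f"all providers failed: {errors}")

# Toy providers standing in for OpenAI / Anthropic clients:
def flaky(prompt):
    raise TimeoutError("rate limited")

def stable(prompt):
    return f"echo: {prompt}"

name, out = complete_with_fallback("hi", [("openai", flaky), ("anthropic", stable)])
print(name, out)
```

A gateway like Cloudflare's moves this loop server-side, so clients see a single endpoint that transparently retries across providers.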

Nomic.ai (GPT4All) ▷ #general (3 messages):

Multi-modal Models, Gemma 3, Architectural Floor Plan Feedback

  • User Seeks Self-Hosted Multi-Modal Model for Architectural Feedback: A member is seeking a multi-modal model to host locally, with the specific use case of providing feedback on architectural floor plans and drawings.
    • So far, they have only found Gemma 3 to be passable for their needs.
  • Gemma 3 Considered for Architectural Design Feedback: The user identified Gemma 3 as the only model that meets their requirements.
    • The user requires a solution capable of processing visual input to provide design feedback.

Gorilla LLM (Berkeley Function Calling) ▷ #leaderboard (3 messages):

vllm, sglang, Llama 3B vs 8B

  ‱ vllm matches sglang results: Members said that vllm and sglang should both give similar results.
  • Llama 8B paradoxically underperforms Llama 3B: A member questioned why the 8b Llama model (FC) ranks below the 3b one.
  • Bigger is not always better for LLMs: Another member explained that larger model size doesn’t necessarily mean better performance.
    • They point out that llama 4 scout performs worse than llama 3.1 70B.

tinygrad (George Hotz) ▷ #general (2 messages):

PatternMatcher, UPat -> UPat rules, Egraph rewrite rules, Turing completeness

  • PatternMatcher Lambdas Face Removal: A user expressed interest in removing lambdas from some PatternMatcher rules, especially in simple cases where a rule could be defined as UPat -> UPat.
    • They noted that egraph rewrite rules seem to function this way, and suggested that avoiding Turing completeness whenever possible is a good practice.
  • Egraphs get PatternMatcher Support: The user compared the proposed PatternMatcher rules with egraph rewrite rules, noting the similarity in structure and operation.
    • The user suggested that whenever possible, implementations should strive to avoid Turing completeness for the sake of simplicity and efficiency.
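The UPat -> UPat idea above, rules as data rather than lambdas, can be illustrated generically (this is not tinygrad's actual API): patterns and replacements are plain terms, so the rule set stays inspectable and short of Turing completeness.

```python
# Generic illustration (not tinygrad's UPat API) of declarative rewrite rules:
# each rule is a (pattern, replacement) pair of plain terms, with "?"-prefixed
# strings acting as pattern variables, instead of an arbitrary lambda.
RULES = [
    (("add", "?a", 0), "?a"),  # x + 0 -> x
    (("mul", "?a", 1), "?a"),  # x * 1 -> x
]

def match(pattern, term, env):
    """Try to unify pattern with term, binding ?-variables into env."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        env[pattern] = term
        return True
    if isinstance(pattern, tuple) and isinstance(term, tuple):
        return len(pattern) == len(term) and all(
            match(p, t, env) for p, t in zip(pattern, term)
        )
    return pattern == term

def substitute(template, env):
    """Instantiate a replacement template with the bindings from env."""
    if isinstance(template, str) and template.startswith("?"):
        return env[template]
    if isinstance(template, tuple):
        return tuple(substitute(t, env) for t in template)
    return template

def rewrite(term):
    """Apply the first matching rule once, or return the term unchanged."""
    for pattern, replacement in RULES:
        env = {}
        if match(pattern, term, env):
            return substitute(replacement, env)
    return term

print(rewrite(("add", ("mul", "y", 1), 0)))  # -> ('mul', 'y', 1)
```

Because rules are data, they can be printed, checked, or compiled into an egraph, which is exactly the property lambdas give up.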