Muon is all you need?

AI News for 7/25/2025-7/28/2025. We checked 9 subreddits, 449 Twitters and 29 Discords (227 channels, and 16798 messages) for you. Estimated reading time saved (at 200wpm): 1388 minutes. Our new website is now up with full metadata search and beautiful vibe coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on @smol_ai!

A banner day for Chinese open weights AI. The generative media types should definitely take a look at Wan 2.2, but most AI Engineers should be apprised of Z.ai’s (better known as Zhipu, one of the AI Tigers) GLM-4.5-355B-A32B and GLM-4.5-Air-106B-A12B released today. They make a VERY strong claim (to be independently verified) of being not only the strongest open weights model (beating the previous SOTA Kimi K2) but also highly competitive with and often better than heavyweight SOTA models like Claude 4 Opus, Grok 4, and OpenAI’s o3:

Beyond the table-stakes benchmarks expected of a frontier model, Z.ai also commendably emphasizes new measurements that matter greatly for agentic use, including token efficiency (perhaps the hardest metric of all).

No paper yet, but the blog post offers some interesting details on architecture choice and efficient RL training. GLM 4.5 is the second large model this month to validate the Muon optimizer at scale.
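
For readers who have not looked at Muon yet, here is a minimal pure-Python sketch of the core idea: keep a momentum buffer per 2D weight matrix, approximately orthogonalize it with a few Newton-Schulz iterations, and step along the result. The coefficients follow the publicly available Muon reference implementation; the learning rate, momentum value, and helper names are illustrative assumptions, and this is not Z.ai's training code (the blog post does not show it). Nesterov momentum and shape-dependent scaling are left out for brevity.

```python
import math

# Quintic Newton-Schulz coefficients used in the public Muon reference code.
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g by pushing its singular values toward 1."""
    a, b, c = NS_COEFFS
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]
    tall = len(x) > len(x[0])
    if tall:  # work on the wide orientation so x @ x.T is the smaller Gram matrix
        x = transpose(x)
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        gram2 = matmul(gram, gram)
        poly = [[b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)
        x = [[a * u + v for u, v in zip(r1, r2)] for r1, r2 in zip(x, px)]
    return transpose(x) if tall else x


def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon update for a single 2D weight matrix (hypothetical helper)."""
    momentum = [[beta * m + g for m, g in zip(mr, gr)] for mr, gr in zip(momentum, grad)]
    update = newton_schulz(momentum)
    weight = [[w - lr * u for w, u in zip(wr, ur)] for wr, ur in zip(weight, update)]
    return weight, momentum


if __name__ == "__main__":
    grad = [[3.0, 0.0], [0.0, 1.0]]  # singular values 3 and 1
    print(newton_schulz(grad))       # both diagonal entries end up near 1
```

The point of the orthogonalization step is that directions with small gradient magnitude get roughly the same step size as dominant ones. With five iterations the singular values only land near 1, not exactly on it, which the reference implementation treats as good enough.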


AI Twitter Recap

New Model Releases & Performance

  • GSPO and the Qwen3 Model Suite: Alibaba Qwen announced Group Sequence Policy Optimization (GSPO), a new reinforcement learning algorithm described as a breakthrough for scaling large models. It features sequence-level optimization, improved stability for large MoE models without needing “hacks like Routing Replay,” and powers the latest Qwen3 models (Instruct, Coder, Thinking). The research paper is noted by @lupantech and is praised by @teortaxesTex as their most impressive paper to date. The algorithm has already been integrated into Hugging Face’s TRL library, as noted by @mervenoyann and @_lewtun.
  • Zai.org Launches GLM-4.5 Models: Chinese AI lab Zai.org has released two new open-source models, GLM-4.5 and GLM-4.5-Air, with a permissive MIT license. As summarized by @scaling01, GLM-4.5 is a 355B parameter MoE model with 32B active parameters, while GLM-4.5 Air is 106B with 12B active. The models are described as hybrid reasoning models with a focus on coding and agentic tasks. In a notable move towards transparency, Zai.org also open-sourced all 52 task trajectories from their agentic coding evaluation for community review.
  • Speculation on “Summit” and “Zenith” as GPT-5: A new set of powerful mystery models, codenamed “summit” and “zenith,” appeared on LM Arena, fueling speculation they could be versions of GPT-5. @Teknium1 reported being told “zenith is gpt-5,” while @emollick and @scaling01 showcased their impressive capabilities in generating complex p5.js code and creative writing. Users noted the models appear to be based on a GPT-4.1 series with a June 2024 knowledge cutoff.
  • Qwen3-Coder’s Strong Coding Performance: Alibaba’s Qwen3-Coder model has demonstrated strong performance on coding benchmarks. @cline reported a 5.32% diff edit failure rate, placing it alongside Claude Sonnet 4 and Kimi K2. OpenRouterAI noted that the model passed Grok 4 in programming rankings, tying with Kimi.
  • The Rise of Chinese Open-Source Models: A significant trend observed this month is the rapid release of powerful open-source models from Chinese labs. @Yuchenj_UW compiled a list of July releases including GLM-4.5, Wan-2.2, Qwen3 Coder, and Kimi K2, contrasting it with a perceived slowdown from Western labs like OpenAI and Meta.
  • Hunyuan3D World Model 1.0 Release: Tencent Hunyuan has open-sourced its Hunyuan3D World Model 1.0, which enables the generation of explorable 3D environments.
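
To make the "sequence-level optimization" in the GSPO item above concrete, here is a small sketch of how a sequence-level importance ratio differs from a per-token one: the log-probability differences are averaged over the whole response before exponentiating, and clipping is applied once per response. This follows the GSPO paper's description as we understand it; the function names and the `clip_eps` default are illustrative assumptions, and this is not the TRL implementation.

```python
import math
from statistics import mean, pstdev


def sequence_ratio(new_logps, old_logps):
    """Length-normalized importance ratio for one whole response.

    Equivalent to (pi_new(y|x) / pi_old(y|x)) ** (1 / len(y)).
    """
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


def group_advantages(rewards):
    """Group-normalized advantages, as in GRPO-style methods."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid dividing by zero when all rewards tie
    return [(r - mu) / sigma for r in rewards]


def gspo_objective(new_logps_group, old_logps_group, rewards, clip_eps=0.2):
    """Clipped surrogate objective with one ratio per response (to be maximized)."""
    advantages = group_advantages(rewards)
    terms = []
    for new_lp, old_lp, adv in zip(new_logps_group, old_logps_group, advantages):
        s = sequence_ratio(new_lp, old_lp)
        clipped = min(max(s, 1 - clip_eps), 1 + clip_eps)
        terms.append(min(s * adv, clipped * adv))
    return mean(terms)


if __name__ == "__main__":
    new = [[-1.0, -1.0], [-2.0, -2.0]]  # per-token log-probs under the new policy
    old = [[-1.5, -1.5], [-2.0, -2.0]]  # per-token log-probs under the old policy
    print(gspo_objective(new, old, rewards=[1.0, 0.0]))
```

Because every token in a response shares the same ratio, a few outlier tokens cannot dominate the update, which is one plausible reading of the stability claim for large MoE models.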

AI Agents & Agentic Workflows

  • Claude Code for Complex Agentic Systems: Claude Code is being highlighted as a powerful tool for orchestrating complex agentic systems. @omarsar0 demonstrated building a multi-agent deep research system by chaining sub-agents with /commands for reliability, noting it’s useful for more than just code.
  • ChatGPT Agent Officially Rolls Out: OpenAI announced that the ChatGPT agent is now fully rolled out to all Plus, Pro, and Team users. However, the rollout wasn’t without hitches, as @gneubig humorously pointed out that the OpenAI agent was being blocked by OpenAI’s own captcha.
  • The Future of Agents: Proactive & Ambient: @_philschmid outlines the next iteration of agents, predicting a shift from request-response to proactive, ambient agents that operate in the background. These agents will be triggered by events, monitor data streams, and require new UI paradigms beyond chat, with a strong emphasis on human oversight and long-term memory.
  • Perplexity Comet Browser Agent: Perplexity AI continues to send out invites for its Comet browser agent. @AravSrinivas showcased a demo of Comet acting as a travel agent to book a flight on United, including seat selection. He also noted that Perplexity is the default search on the Comet browser, which could significantly drive usage.
  • Why Multi-Agent Systems Fail: DeepLearningAI summarized a research paper that categorized the primary causes of failure in multi-agent systems as poor specifications, inter-agent misalignment, and weak task verification.

Video & Multimodal Generation

  • Runway Aleph Sets a New Frontier: Runway began rolling out Aleph, its state-of-the-art in-context video model. Creative Technologist @c_valenzuelab shared numerous demos showcasing its capabilities: creating infinite camera coverage on demand, modifying specific parts of a video while retaining motion and identity, setting juggling balls on fire, performing wardrobe and styling modifications, and seamlessly removing objects from a scene. He described it as a “new medium” where the hardest part is conceptualizing what to create.
  • Open-Source Video: Wan 2.2 Released: Countering the trend of closed video models, Alibaba released Wan 2.2, the “World’s First Open-Source MoE-Architecture Video Generation Model”. @scaling01 noted its release and @ostrisai highlighted a 5B version supporting text-to-video and image-to-video at 24 FPS on a single RTX 4090.
  • Kling AI Introduces Kling Lab: Kling AI announced Kling Lab, a new workspace designed to streamline the creative video generation process, which is currently in beta testing.
  • Grok Imagine Enters Waitlist Beta: xAI launched Grok Imagine, an image and video generation tool, behind a waitlist on the Grok app. @chaitualuru described it as a “fun image and video generation experience” and noted they are expanding access.

Infrastructure, Tooling & Efficiency

  • Frameworks and Libraries: The supervision open-source library, created by @skalskip92, has crossed 30,000 stars on GitHub. LangGraph released v0.6.0 with a new context API for type-safe dependency injection. Red Hat AI’s GuideLLM is joining the vLLM project, combining its structured generation with vLLM’s inference speed.
  • LLM Evals and Data: @HamelHusain released a massively expanded LLM Evals FAQ, reorganizing it into categories and adding an audio version. On the data side, @vikhyatk warned that the popular GQA eval dataset has a 20-30% annotation error rate.
  • Hardware and Training Efficiency: John Carmack @ID_AA_Carmack commented on the irony of modern ML feeling like old-school sci-fi technobabble, stating, “I am running the convolutions in frequency space!” @awnihannun provided a thought-provoking analysis on how autoregressive transformers are “adversarially designed” for modern computer memory hierarchies, suggesting that either the algorithm or the computer must change for major efficiency gains. Meanwhile, @ggerganov highlighted that AMD teams are now contributing to the llama.cpp codebase. A new NanoGPT training speed record was set by @kellerjordan0, achieving a 3.28 validation loss in 2.863 minutes on 8xH100s.

New AI Techniques & Research

  • Prompt Optimization vs RLHF: A new paper on Reflective Prompt Evolution (GEPA) shared by @lateinteraction shows that prompt optimization can outperform RL algorithms like GRPO in terms of sample efficiency. The work suggests that learning via natural-language reflection will be a central paradigm for building AI systems.
  • Causality in ML: @sirbayes recommends a new book on causality for ML by Elias Bareinboim, calling it a “worthy successor” to Judea Pearl’s groundbreaking work.
  • The Power of Simple Penalties: @francoisfleuret argued that the key lesson from the Variational Autoencoder (VAE) is that “dumb penalties have extremely profound effects and induce incredibly sophisticated structures in deep models.”
  • Meta-Learning History: Jürgen Schmidhuber @SchmidhuberAI provided a detailed historical overview of meta-learning, tracing the concept of in-context learning back to his work in the early 1990s and Sepp Hochreiter’s work in 2001.

Industry Trends & Commentary

  • Hiring and Talent: Meta appointed a fresh PhD from OpenAI as its Chief Scientist, a move @Yuchenj_UW notes as “unheard of” for a 30-year-old at a major corporation, signaling a shift towards “skill >> seniority.”
  • The Vision Gap in AI: @jxmnop made the surprising observation that “fifteen years of hardcore computer vision research contributed ~nothing toward AGI except better optimizers,” as models still don’t get smarter when given eyes. @teortaxesTex added that while many open models are reaching a similar performance plateau, the key missing element is a “radically new eval suite,” which would signal that someone is “climbing a new hill.”
  • The Future of Search: Perplexity AI’s CEO @AravSrinivas stated that Perplexity’s rapid growth in markets like India is “clear proof that search has changed forever.”
  • AI and Experience: Mustafa Suleyman @mustafasuleyman drew a “bright line” between humans and AI, stating that “To be human is to experience. Today’s AIs have knowledge, but can only imitate experience.”

Humor & Memes

  • The Vibe Coder Universe: The term “vibe coding” has become a pervasive meme, used to describe an intuitive, sometimes brittle, approach to development. The concept has now evolved, with @scaling01 noting that some have “ascended from being mere vibe coders to vibe architects,” and @lateinteraction declaring we are in the “Vibe Meanings era.”
  • The Magic Talking Dog Party: A party hosted by @benhylak and @okpasquale at Slack’s original office, featuring a “magical talking dog,” became a running joke, with @KevinAFischer posting a picture of the “pitch at a16z” involving the dog and a human pyramid.
  • Relatable Developer Pain: @cloneofsimo lamented that in 2025, transformers can solve olympiad problems and design chips, “yet we have latex UI still breaking.” @francoisfleuret offered a warning: “If you are using FSDP and have an ‘if’ statement in your model.forward(), be nervous my friend, be very nervous.”
  • Industry Commentary: @Yuchenj_UW shared a tragic family story: “My dad bought 100 Bitcoin for $50 each in 2013. Sold them at $100, bragged for weeks. Ever since then, Bitcoin has been a banned word in our family.”

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. GLM-4.5 Announcements, Launches, and Collections

  • GLM4.5 released! (Score: 720, Comments: 193): GLM-4.5 (355B total/32B active params) and GLM-4.5-Air (106B total/12B active params) are new flagship hybrid reasoning models from Zhipu AI, now open-weight on HuggingFace and ModelScope under an MIT license. Key technical features include distinct ‘thinking’ and ‘non-thinking’ modes for flexible agentic/coding tasks, as well as a native Multi-Token Prediction (MTP) layer supporting speculative decoding for improved inference performance on CPU+GPU hardware. Full details are outlined in the official blog post. Commenters emphasize the impact of both the open MIT licensing and the native MTP layer, noting it as a significant milestone for community reusability and for efficient inference, particularly on mixed hardware setups.
    • The GLM-4.5 release includes foundation models (355B-A32B and 106B-A12B) under an MIT license, which is notable for enabling broad customization and innovation by the community. This open licensing for large-scale models is considered an exceptional step forward for open-source AI development.
    • GLM-4.5 and GLM-4.5-Air feature an MTP (Multi-Token Prediction) layer for speculative decoding during inference, which can improve efficiency—especially for mixed CPU+GPU setups. Speculative decoding out-of-the-box is cited as a significant usability advantage for inference optimization.
    • The release includes several technical assets: BF16, FP8, and base models, facilitating further training, fine-tuning, and research. Documentation and technical resources are provided for popular inference engines (vLLM, SGLang) and detailed guides for both inference and fine-tuning are shared in their GitHub and technical blog, making the models accessible for experimentation and extension.
  • GLM 4.5 Collection Now Live! (Score: 221, Comments: 47): The GLM 4.5 Collection is now live on HuggingFace, featuring GLM-4.5 and its variants with a focus on ‘hybrid thinking’ capabilities. Benchmarks reported indicate that while math/science scores are slightly below those of Qwen3, GLM 4.5 demonstrates strong generalized coding performance despite not being a specialized Coder model. Comments note the lack of immediate GGUF-format downloads (e.g., via Unsloth) and highlight the design choice of hybrid reasoning (contrasting with Qwen’s direction), suggesting this could lead to more versatile generalist models.
    • The new GLM 4.5 model employs a hybrid architecture, which contrasts with the approach taken by the Qwen team. According to the released math and science benchmarks, GLM 4.5 underperforms Qwen3 in these areas, but it demonstrates notably strong results in coding tasks compared to non-specialized models, suggesting robust general capabilities in domains outside pure STEM.
    • There is community interest in immediate GGUF-format downloads for GLM 4.5, with some users expressing frustration at the lack of coordination with the Unsloth team’s tooling on release—highlighting the demand for compatibility and ease of use with local inference frameworks or quantized formats.
    • A technical suggestion raised is the fine-tuning of models like GLM 4.5 for structured writing tasks (e.g. letter writing, fiction, personality emulation). Users note a gap in performance for these creative/support roles, suggesting targeted instruction tuning could address common use cases not fully captured by current benchmarks.
  • GLM 4.5 possibly releasing today according to Bloomberg (Score: 134, Comments: 26): Bloomberg reports that Zhipu AI (previously THUDM, now zai-org on Hugging Face) is set to release GLM-4.5, with associated collections and datasets appearing on Hugging Face (GLM 4.5 Collection, CC-Bench-trajectories dataset). GLM-4.5-Air, the public release, consists of 106B total parameters with 12B active parameters, suggesting a sparsely-activated model similar to Mixture-of-Experts. Expert commenters express interest in large model variants (32B, 70B) under permissive licenses (MIT, Apache), and the main technical discussion centers on the compact, sparsely-activated architecture of GLM-4.5-Air. There is also anticipation regarding the license and possible upstream improvements in efficiency or capability.
    • GLM-4.5-Air is publicly released and features a ‘more compact design’ with 106 billion total parameters but only 12 billion ‘active parameters’, which suggests architectural optimizations for performance or efficiency (source: https://huggingface.co/zai-org/GLM-4.5-Air).
    • There is community interest in the release of 32B and 70B parameter versions of GLM-4.5, specifically with MIT or Apache open licenses, highlighting concerns related to broad accessibility and permissive model licensing.
    • A direct link to the CC-Bench dataset on Hugging Face is provided, which may be relevant to evaluating the capabilities or benchmarking of GLM-4.5 models (source: https://huggingface.co/datasets/zai-org/CC-Bench-trajectories).
  • GLM shattered the record for “worst benchmark JPEG ever published” - wow. (Score: 106, Comments: 76): The post critiques a JPEG image purportedly showing benchmarks for GLM-4.5, with the primary complaint being very poor image or data quality—described as the ‘worst benchmark JPEG ever published.’ The underlying technical point involves benchmarking language models, specifically GLM-4.5, relative to competing models such as DeepSeek R1; a comment clarifies that the original JPEG is misleading about RAM usage, noting that GLM-4.5’s native precision is BF16 rather than FP8, which is a significant technical distinction for inference efficiency and memory use. Some commenters criticize the hyperbolic nature of the title and confusion about the image’s contents, while another notes that despite the poor presentation, GLM-4.5 is considered a high-quality model and there is anticipation for its forthcoming multimodal capabilities.
    • One commenter clarifies a technical specification: GLM-4.5 actually uses more RAM than DeepSeek R1 because GLM-4.5’s native precision is BF16 rather than FP8, contradicting any implication that it is more memory-efficient than the comparison model.
    • Reference is made to the GLM-4.5 documentation to provide technical context. The discussion critiques the JPEG benchmark image used in the documentation, noting that its clarity and informational value degrade upon closer inspection.
    • Despite critique of the benchmark presentation, one user points out that GLM-4.5 remains a strong, high-performing model, and expresses anticipation for its future multimodal capabilities.
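
The BF16-versus-FP8 point in the last thread is easy to check with back-of-the-envelope arithmetic. The sketch below counts weights only (no KV cache or activations) at each model's native precision. The GLM parameter counts come from the posts above; the 671B total for DeepSeek R1 is not stated in this issue and is included here as an outside assumption.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_footprint_gb(total_params_billion, precision):
    """Weights-only size in GB (1B params at 1 byte each is about 1 GB)."""
    return total_params_billion * BYTES_PER_PARAM[precision]


MODELS = {
    "GLM-4.5 (native BF16)": (355, "bf16"),
    "GLM-4.5-Air (native BF16)": (106, "bf16"),
    "GLM-4.5 (FP8 release)": (355, "fp8"),
    "DeepSeek R1 (native FP8)": (671, "fp8"),  # assumed total, not from this issue
}

if __name__ == "__main__":
    for name, (params, precision) in MODELS.items():
        print(f"{name}: ~{weight_footprint_gb(params, precision)} GB of weights")
```

Under these assumptions GLM-4.5 at BF16 needs roughly 710 GB for weights against roughly 671 GB for DeepSeek R1 at FP8, which matches the commenter's correction. The FP8 release of GLM-4.5 halves that figure.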

2. Wan 2.2 Open Video Generation Model Releases and Benchmarks

  • Wan 2.2 is Live! Needs only 8GB of VRAM! (Score: 440, Comments: 49): Wan 2.2, a new model release, is highlighted for its low VRAM requirement—needing only 8GB—making it accessible to many users without high-end hardware. The discussion references early releases and repacks for ComfyUI, suggesting active integration and support from both the community and possibly official staff. The image likely showcases a model output or promotional material, underscoring the capability of the new version on limited hardware. A substantive comment notes that ComfyUI repacks are being released rapidly, implying close collaboration or internal staff involvement from Wan with ComfyUI development. This supports the perception of an engaged, responsive open-source/model community around Wan 2.2.
    • The commenter notes that the ComfyUI repacked version of Wan 2.2 was released even before the vanilla model, suggesting that some ComfyUI contributors might be involved with the Wan project. This implies rapid integration and possible cross-team collaboration, which is particularly notable given the typically slower pace of official support for new model architectures in UI frameworks.
    • Direct Hugging Face links are provided for multiple variants of the Wan 2.2 model (including T2V (Text-to-Video), I2V (Image-to-Video), and TI2V (Text/Image-to-Video) in both base and Diffusers-ready formats). This signals strong and immediate ecosystem support for both deployment and experimentation with various workflows and model pipelines, reducing friction for technical users wanting to test or benchmark multiple modalities of Wan 2.2.
  • Wan 2.2 T2V,I2V 14B MoE Models (Score: 140, Comments: 8): Wan2.2 introduces a Mixture-of-Experts (MoE) diffusion architecture for video generation, comprising two 14B-parameter specialized experts (high-noise for early denoising, low-noise for fine details) in a switchable 27B MoE setup, actuated by an SNR-based threshold for phase-optimized inference with no additional inference cost. Benchmarked on Wan-Bench 2.0, Wan2.2-T2V-A14B outperforms commercial SOTAs (KLING 2.0, Sora, Seedance) in 5/6 metrics, including dynamic motion and text rendering, while TI2V-5B achieves efficient, high-resolution (<9 min/5s 720p) T2V/I2V generation via aggressive spatial/patch compression and a unified architecture. See ComfyUI tutorial for implementation details. Comments reinforce the efficiency claim regarding single-expert-per-step inference and note accessible tools like the ComfyUI tutorial. One commenter contrasts the open-source Wan2.2 release with the more restricted practices of companies like OpenAI regarding similar models.
    • A key technical insight is that the WAN 2.2 14B MoE models use a Mixture-of-Experts (MoE) architecture with only one expert active per step, keeping inference efficient while providing large parameter capacity (14B), thereby optimizing both speed and performance for video and image generation tasks.
    • There’s a discussion about running diffusion models in llama.cpp: now that text diffusion is supported, users are exploring the feasibility of adding image and video diffusion support. This would potentially streamline workflows by centralizing all diffusion tasks in a single, widely-used framework, assuming implementation challenges can be overcome.
    • A quick-start guide for running WAN 2.2 using ComfyUI is highlighted, emphasizing its utility for efficiently deploying the new models for video generation. The linked documentation provides step-by-step instructions for practical implementation.
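
The "one expert per step" idea in the Wan 2.2 threads above can be sketched as a routing rule over the denoising schedule: early, noisy steps go to the high-noise expert and later steps go to the low-noise expert, so only one 14B expert runs at any step. The SNR formula below assumes a simple linear noise interpolation, and the threshold value and function names are hypothetical; the actual Wan 2.2 boundary is defined in its own code and docs.

```python
def snr(t):
    """Signal-to-noise ratio at noise level t in (0, 1].

    Assumes x_t = (1 - t) * clean + t * noise, so SNR = ((1 - t) / t) ** 2.
    """
    return ((1.0 - t) / t) ** 2


def pick_expert(t, snr_threshold=1.0):
    """Route a denoising step to exactly one expert based on its SNR."""
    return "high_noise" if snr(t) < snr_threshold else "low_noise"


def expert_plan(num_steps, snr_threshold=1.0):
    """Expert used at each step, going from pure noise (t=1) toward clean (t near 0)."""
    timesteps = [1.0 - i / num_steps for i in range(num_steps)]
    return [pick_expert(t, snr_threshold) for t in timesteps]


if __name__ == "__main__":
    print(expert_plan(4))
```

Since each step calls a single expert, per-step compute matches a 14B model even though about 27B parameters exist in total. The memory cost is a separate question, because both experts still have to be stored or swapped.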

3. Specialized LLM Launches for Niche Applications (UI, Instruct, Edge Devices)

  • UIGEN-X-0727 Runs Locally and Crushes It. Reasoning for UI, Mobile, Software and Frontend design. (Score: 420, Comments: 67): Tesslate’s latest model, UIGEN-X-32B-0727, is a 32B dense LLM finetuned on Qwen3 and specialized for end-to-end modern UI/UX, frontend, mobile, and software design implementation. The model supports a wide array of frameworks (e.g., React, Vue, Angular, Svelte), styling options (Tailwind, CSS-in-JS), UI libraries, state management, animation, multi-platform (web, mobile, desktop), and Python integrations—offering code generation in 26+ languages and component-driven patterns. A 4B version is stated to be released imminently. Discussion centers on the surprisingly high quality of UI output from a dense 32B model, with curiosity about finetuning methodology given its SOTA-style performance and comparison to larger models. There is also mention of community interest in API/third-party integration for broader evaluation.
    • UIGEN-X-0727, a finetune of Qwen3 and a comparatively small 32 billion parameter dense model, is noted for generating highly competitive UIs, with some users surprised by the model’s capabilities at this scale—performance exceeding expectations commonly associated with much larger models.
    • One technical critique highlights that although UI generator LLMs like UIGEN-X-0727 excel at rendering individual components and maintaining consistent themes, major challenges persist in inter-component linking, navigation integration, and the automatic addition of dynamic styles—critical aspects for production-level frontend/UI code generation.
    • A tester mentioned the model requires 64GB of VRAM to run locally, leading to attempts to stage it on AWS for further benchmarking, which is a significant resource requirement for an allegedly ‘small’ model and could raise accessibility/performance concerns for local use.
  • Qwen/Qwen3-30B-A3B-Instruct-2507 · Hugging Face (Score: 502, Comments: 90): Alibaba’s Qwen team has released the Qwen/Qwen3-30B-A3B-Instruct-2507 model checkpoint on Hugging Face, but as of posting, no official model card or technical documentation is available. The model belongs to the 30B parameter class and appears to utilize the A3B architecture, following naming conventions from earlier Qwen releases noted for balancing strong performance and efficient inference on consumer hardware (see the previous Qwen3-30B-A3B model). Discussion in the comments highlights high user anticipation, referencing previous Qwen models as “daily drivers” and suggesting that updates to the A3B line (as seen in larger models) have delivered significant quality jumps, fueling expectations that this release may set a new standard for locally run LLMs.
    • Admirable-Star7088 notes that the previous Qwen3-235B-A22B-Instruct-2507 represented a significant performance improvement over the earlier “thinking” versions, suggesting that if similar gains are realized in the Qwen3-30B-A3B-Instruct-2507, it could become one of the top LLM releases optimized for consumer hardware usage.
    • rerri documents the repo’s visibility status, noting that it was initially private, which may imply an accidental early publication or a staggered rollout—this sometimes affects accessibility for benchmarking or comparison studies when new LLMs are released.
  • Pi AI studio (Score: 117, Comments: 27): The discussion focuses on a $1000 device featuring 96GB LPDDR4X (not LPDDR5X) RAM and an Ascend 310 chip, with questions about its suitability for hosting small LLMs. Commenters note the device’s memory bandwidth may bottleneck performance, and the Ascend 310’s capabilities are compared to Nvidia’s Jetson Orin Nano, suggesting only simpler neural networks or heavily quantized models (e.g., 70B 8-bit or 100B+ int4) might be practical. Top comments debate the sufficiency of LPDDR4X versus faster RAM (LPDDR5X), and express skepticism about both the memory bandwidth and compute capability, indicating that running high-throughput or larger LLMs would be challenging.
    • Multiple commenters express concerns about the LPDDR4X memory used in the Pi AI Studio, noting its limited bandwidth (~3.8GB/s) compared to higher-end solutions like the Mac Studio (546GB/s). This limitation is expected to significantly affect token throughput and restrict performance with larger language models.
    • Discussion compares the Ascend 310 AI accelerator to Nvidia’s Jetson Orin Nano, suggesting that while it can handle simpler neural networks or medium-sized MoE models (e.g., Qwen 3 30B), it would struggle with decent LLMs. Quantized models (e.g., 70B 8-bit or 100B+ INT4) may technically run, but would be heavily bottlenecked by memory bandwidth.
    • There is technical debate about memory bandwidth estimation based on LPDDR type, with one commenter highlighting the lack of clear, practical data due to variations in implementation (especially bus width), which makes predicting actual throughput and performance for LLM inference (e.g., token/sec on 7B to 40B models) challenging.
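
The bandwidth debate in the Pi AI Studio thread comes down to two pieces of arithmetic: peak bandwidth is the transfer rate times the bus width, and single-stream decode speed is bounded by how fast the active weights can be read once per token. The sketch below is a rough upper bound under stated assumptions. The 4266 MT/s rate is the top LPDDR4X speed grade, and the bus widths are hypothetical, since the device's real memory configuration is not given in the thread.

```python
def peak_bandwidth_gbps(transfer_rate_mts, bus_width_bits):
    """Theoretical peak memory bandwidth in GB/s."""
    return transfer_rate_mts * 1e6 * bus_width_bits / 8 / 1e9


def decode_tokens_per_sec_upper_bound(bandwidth_gbps, active_params_billion, bytes_per_param):
    """Upper bound on tokens/sec if every active weight is read once per token."""
    return bandwidth_gbps / (active_params_billion * bytes_per_param)


if __name__ == "__main__":
    # Hypothetical LPDDR4X configurations at an assumed 4266 MT/s.
    for bus_width in (64, 128, 256):
        bw = peak_bandwidth_gbps(4266, bus_width)
        moe = decode_tokens_per_sec_upper_bound(bw, 3, 1)     # ~3B active, 8-bit
        dense = decode_tokens_per_sec_upper_bound(bw, 70, 1)  # 70B dense, 8-bit
        print(f"{bus_width}-bit bus: {bw:.1f} GB/s, MoE <= {moe:.1f} tok/s, 70B <= {dense:.2f} tok/s")

    # Same bound using the 546 GB/s figure quoted in the thread.
    print(decode_tokens_per_sec_upper_bound(546, 3, 1))
```

Real throughput will be lower than these bounds once KV cache reads and compute are included, but the ratio between configurations shows why commenters care more about bus width than about the LPDDR generation alone.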

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. Wan2.2 Video Model Release, Benchmarks, and Community Tests

  • First look at Wan2.2: Welcome to the Wan-Verse (Score: 863, Comments: 134): The Wan team has released Wan2.2, the latest iteration of their text-to-video/image-to-video (I2V) model, following the success of Wan2.1 (5.8M+ downloads and 13.3k GitHub stars). Wan2.2 introduces a more effective Mixture-of-Experts (MoE) architecture, improved cinematic aesthetics, and significantly enhanced abilities to generate complex and diverse motion sequences from a single image input, as highlighted in the preview demo. Model artifacts and documentation are available via Hugging Face (https://huggingface.co/Wan-AI), GitHub (https://github.com/Wan-Video), and the official website (https://wan.video/welcome). Commentary notes recognition of quality improvements and anticipation for hands-on testing. Technical users emphasize architectural advancements and capability upgrades as core differentiators from prior versions.
    • Wan2.2 introduces a significantly upgraded MoE (Mixture of Experts) architecture, which is designed to improve model efficiency and output diversity. This change is highlighted as a core technical differentiator from Wan2.1 and is expected to enhance both the complexity and fidelity of generated video content.
    • Technical enhancements specifically call out improved cinematics—enabling higher-quality aesthetics—and the capability to generate more complex motion sequences from single source images. This suggests tighter integration of temporal and spatial features in the model’s pipeline for better video realism and dynamic content.
    • Some users are awaiting an FP8 (floating point 8-bit) release, indicating demand for lower-precision weights, which could benefit model inference and deployment efficiency, especially on hardware that supports FP8 accelerators. This is a signal of community interest in optimized performance for large-scale or consumer-grade hardware.
  • Wan2.2 released, 27B MoE and 5B dense models available now (Score: 479, Comments: 251): The post announces the release of Wan2.2, featuring new models: a 27B parameter Mixture-of-Experts (MoE) model for Text-to-Video (T2V) and Image-to-Video (I2V) tasks (T2V, I2V), and a 5B dense model (TI2V-5B). Official codebase and repackaged fp16/fp8 models for ComfyUI are provided, alongside a dedicated workflow guide. Notably, the 5B model reportedly delivers 15s/it for 720p 30-step generations on an RTX 3090 (~4-5 minutes per rendering), with efficient native offloading enabling use on 8GB VRAM GPUs. Technical discussion in comments emphasizes the unusually low VRAM requirements, especially the practicality of running the 5B dense model on consumer GPUs (e.g., 8GB and 12GB cards), and highlights the real-world render times compared to prior models, eliminating need for additional LoRA finetuning (e.g., ‘lightx2v’).
    • The released Wan2.2 5B dense model demonstrates practical VRAM efficiency: it’s confirmed to run on an 8GB GPU (with ComfyUI’s native offloading), enabling 15s/iteration video generation at 720p resolution on an RTX 3090 (30 steps in 4-5 minutes), and suggesting that cards with 12GB VRAM (like the RTX 3060) may also handle it.
    • The two-pass TI2V (Text/Image-to-Video) setup on a 4090 (using FP8) yields strong i2v (image-to-video) results, maintaining capability for NSFW content, indicating quality is not compromised even at reduced precision.
    • ComfyUI HuggingFace repackaged models require users to utilize both high and low noise variants in workflows. The safetensors files for Wan2.2 total 14.3GB, raising uncertainty whether the full workflow (with both models) will fit within a 16GB VRAM constraint, unlike Wan2.1.
  • First test I2V Wan 2.2 (Score: 251, Comments: 69): The post discusses initial testing of the I2V Wan 2.2 model, focusing on dynamics and camera improvements compared to Wan 2.1. A user notes significant VRAM usage: on an RTX 5090 with 32GB VRAM, generating 121 frames at 1280x720 resolution causes an out-of-memory error, forcing reduction to 1072x608. There’s an expressed need for the u/kijai Wan wrapper update for v2.2 to leverage its memory management. Linked content includes a gif and reference comparison to Wan 2.1. One technical note observes persistent head ‘noise’, suggesting incomplete denoising. Commentary centers on both positive dynamics/camera upgrades and negative VRAM scaling behavior. There is debate as to artifact causes—whether due to denoising or model artifacts—and anticipation for workflow tool updates to address these hardware constraints.
    • A user reports that while WAN 2.2 introduces much improved model dynamics and camera handling over WAN 2.1, memory requirements have increased significantly. On an RTX 5090 with 32GB VRAM, generating 121 frames at 1280x720 resulted in out-of-memory errors, requiring a downscale to 1072x608 for stable generation. This highlights an urgent need for memory optimizations, possibly via wrappers like the one by u/kijai targeted for WAN 2.2.
    • There is specific feedback on the output quality: users note weird noise on the head during motion, suggesting potential issues with insufficient denoising or video stabilization in spatially challenging regions at motion boundaries. Video quality and motion are described as poor, and users question if this is due to the model variant (e.g., 5B vs 27B parameters) or rendering resolution.
    • Community inquiry about backward compatibility: one user asks if LoRA models trained on WAN 2.1 are still functional or compatible when used with WAN 2.2, highlighting an important aspect of model versioning and workflow stability.
  • PSA: WAN2.2 8-steps txt2img workflow with self-forcing LoRa’s. WAN2.2 has seemingly full backwards compitability with WAN2.1 LoRAs!!! And its also much better at like everything! This is crazy!!!! (Score: 250, Comments: 121): The post announces that the new WAN2.2 diffusion model demonstrates near full backwards compatibility with WAN2.1 LoRAs (Low-Rank Adaptation checkpoints), with visible improvements in output: greater detail, more dynamic compositions, and improved prompt adherence (for example, color manipulation per prompt is better in WAN2.2). The author provides a downloadable 8-step txt2img workflow JSON (see WAN2.2 workflow) and encourages users to re-download it because an earlier version contained errors. Example outputs show LoRA compatibility and enhanced results. Top comments focus on empirically verifying WAN2.2’s performance against models like Flux and confirm the successful use of WAN2.1 LoRAs in WAN2.2, which is considered a significant technical advance in workflow flexibility.
  • 🚀 Wan2.2 is Here, new model sizes 🎉😁 (Score: 195, Comments: 50): The post’s attached image accompanies the Wan2.2 release announcement, which highlights improvements in open-source AI video generation, including MoE (Mixture of Experts) models for Text-to-Video, Image-to-Video, and Text+Image-to-Video up to 720p with notable temporal consistency. Wan2.2 offers new models (T2V-A14B, I2V-A14B, TI2V-5B) available via HuggingFace and ModelScope, with a particular focus on easy installation and template integration with ComfyUI. Comments discuss the technical integration with ComfyUI templates, highlighting the I2V (Image-to-Video) mode’s two-pass flow using high/low noise models, and express anticipation about compatibility with performance LoRAs (Low-Rank Adaptations).
    • A user notes that the i2v (image-to-video) workflow in ComfyUI for Wan2.2 uses a two-pass architecture employing both high and low noise models, highlighting a specific implementation detail that likely impacts quality or control over video generation. This indicates that the model pipeline has been designed to process inputs in stages with varying noise profiles for potentially better results.
    • Another technical comment provides early feedback on the newly released 5B model, characterizing its output quality as significantly inferior compared to the 14B variant, which is described as ‘A+’. This suggests substantial differences in output quality, possibly due to scale-related limitations or architectural differences between versions.
    • One user emphasizes anticipation for GGUF-compatible versions of the Wan2.2 models, which is relevant for local inference and deployment efficiency, pointing to ongoing interest in model portability and compatibility with quantization/user-friendly formats.
  • Wan 2.2 test - T2V - 14B (Score: 172, Comments: 51): The post documents a technical test of the Wan 2.2 14B text-to-video (T2V) model at 480p resolution using Triton-accelerated samplers. The workflow used fp16 precision and required approximately 50 GB VRAM for the first pass, spiking up to 70 GB, though the user expected complete offloading after the first model. The resulting video demonstrates substantial advancements over Wan 2.1, with strong adherence to prompts and realistic rendering of complex human motion and limb articulation over multiple seconds—areas where previous models struggled. Performance metrics from comments indicate that a scaled-down version (14B T2V) can run on a 16 GB VRAM card (e.g., RTX 4070Ti Super) with 64 GB RAM, generating a 5-second 320x480 video in 4 minutes 43 seconds. Comments corroborate technical improvements: complex footwork is rendered accurately without obvious errors (in contrast to prior Wan 2.1 limitations), and prompt adherence is highlighted as a strong point. Concerns about high VRAM usage (50-70 GB) are noted, but community tests show feasible operation at lower specs through scaling.
    • The latest Wan 2.2 T2V 14B model demonstrates significant improvements over Wan 2.1, notably generating complex human motion and footwork sequences without obvious errors, a capability not present in earlier versions.
    • Detailed performance benchmarks: a 5-second 320x480 video was generated in 4 minutes 43 seconds using a 4070 Ti Super (16GB VRAM) and 64GB RAM, confirming that inference is feasible on consumer-grade GPUs with at least 16GB VRAM, though resource usage can scale up to 50-70GB VRAM at higher settings.
    • Additional testing on an RTX Pro 6000 achieved native 24 fps generation without utilizing a teacache, offering further reference points for hardware performance and reproducibility. Commenters highlight the importance of pairing runtime metrics with explicit hardware specifications to make benchmarks meaningful.
  • Wan 2.2 is Live! Needs only 8GB of VRAM! (Score: 168, Comments: 32): The post announces that Wan 2.2, a new AI model, is now released and claims it only needs 8GB of VRAM to run. However, technical discussion in the comments disputes this claim, with a user noting that the 5B variant of Wan 2.2 actually requires around 11GB of VRAM when generating a 720p video in FP8 on ComfyUI. Another comment suggests running the larger 14B FP16 model with only 8GB of VRAM would be infeasible, indicating skepticism about the official requirements. The main technical debate revolves around the veracity of the stated VRAM requirement, with multiple users providing empirical evidence that the 8GB claim is overly optimistic, at least for more powerful variants and realistic workloads.
    • A user testing the 5B variant of Wan 2.2 in ComfyUI finds that VRAM usage is about 11GB when generating 720p video, even when running in FP8, which contradicts claims that 8GB VRAM is sufficient. This suggests that the stated requirements may be optimistic or conditional on special settings or smaller batch sizes.
    • Discussion highlights that running larger models like the 14B variant in FP16 on only 8GB VRAM is currently unrealistic, indicating that practical minimum VRAM requirements may be higher than advertised, especially for larger model sizes or standard precision settings.
    • There is curiosity about compatibility with LoRAs and potential for further optimization, such as using blockswapping or hardware like the RTX 4090, as well as interest in whether the model can be run efficiently on limited platforms such as free Google Colab environments, which often have stricter VRAM limits.
  • Wan2.2-I2V-A14B GGUF uploaded+Workflow (Score: 147, Comments: 50): The post announces the upload of both high-noise and low-noise GGUF quantized versions of the Wan2.2-I2V-A14B model to Hugging Face, aimed at enabling inference on lower-end hardware. Preliminary testing suggests that running the 14B version at a lower quantization outperforms smaller-parameter models at fp8, though results may vary. An example workflow with the appropriate unet-gguf-loaders and Comfy-GGUF nodes is provided, with instructions to place the downloaded models in ComfyUI/models/unet; dependencies are covered by ComfyUI-GGUF and Hugging Face download. A top technical question from comments asks whether the model will work on an 8GB VRAM GPU, indicating interest in real-world low-resource applicability but no definitive compatibility statement provided so far.
    • A user inquired about compatibility between Wan2.2 and 2.1 LoRAs, raising questions regarding backward compatibility and whether existing 2.1-based Low-Rank Adaptation (LoRA) weights can be transferred to or are directly usable with Wan2.2 models, which would be crucial for workflow continuity and leveraging existing resources.
    • There is a technical query about performance on different VRAM levels: one user asks if Wan2.2-I2V-A14B GGUF will function properly on an 8GB VRAM GPU, while another seeks recommendations on which version performs best with 16GB VRAM, noting that the original Comfy version experiences significant slowdowns, implying an interest in quantized model performance versus resource availability.
  • A pre-thanks to Kijai for anything you might do on Wan2.2. (Score: 318, Comments: 31): The post is a preemptive thank you to Kijai for anticipated work on ‘Wan2.2’, specifically recognizing their past efforts in releasing workflows, model quantizations, and optimizations for speed and VRAM usage in the AI/model community. The attached image is likely a meme or non-technical appreciation visual, as the post and comments focus on community gratitude and Kijai’s extensive Github contributions (https://github.com/kijai?tab=repositories). Comments unanimously praise Kijai for rapid, high-quality contributions—including zero-day releases, advanced quantizations, and persistent workflow improvements—underscoring their influence and reliability in the community.
    • There is anticipation about immediate support and integration of Wan2.2 into ComfyUI, as Jo Zhang from Comfy is expected to be present on the livestream, which could facilitate native support for new features or optimizations right after release.
    • Kijai is recognized for efficiently optimizing model inference workflows, particularly regarding speed and VRAM usage, as well as for quickly providing quantized versions for broader hardware compatibility.
    • There is explicit hope that the lightx2v project team will promptly update their solution for Wan2.2, since current alternate methods result in generation times exceeding 30 minutes, which is seen as unacceptable for usability.
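
The VRAM debates in the Wan 2.2 threads above mostly come down to weight-size arithmetic. As a rough sketch, assuming the weights dominate memory and ignoring activations, the text encoder, the VAE, and offloading (all of which matter in practice), the footprint is parameter count times bytes per parameter. The function name is made up for illustration, and the 5B and 14B sizes are taken at face value from the model names; nothing here is measured.

```python
# Back-of-the-envelope weight memory; a sketch, not a measurement.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0}

def weight_gib(params_billion: float, precision: str) -> float:
    """Approximate size of the weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * BYTES_PER_PARAM[precision]
    return total_bytes / 2**30

for name, size in [("5B", 5.0), ("14B", 14.0)]:
    for precision in ("fp16", "fp8"):
        print(f"{name} {precision}: {weight_gib(size, precision):.1f} GiB")
```

Under these assumptions a 14B-parameter model in FP16 needs roughly 26 GiB for weights alone, which is consistent with commenters’ doubts about fitting it in 8GB without offloading or quantization.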

2. OpenAI GPT-5 Model Leap, Performance, and Impact Discussions

  • GPT5 is a 3->4 level jump (or greater) in coding. (Score: 321, Comments: 212): The post asserts that the coding capabilities in GPT-5 represent a ‘level 3->4 jump or greater’ compared to previous generations, indicating a major leap in code synthesis and reasoning. Tasks that previously required multi-turn, back-and-forth prompting are now accomplishable ‘in one shot’, with the resulting output reportedly surpassing earlier efforts, especially in coding contexts; however, the author notes no similar advance in creative writing (still ‘standard levels of LLM bad’). The author also highlights a lack of broad, public testing on real, large codebases over extended multi-turn sessions. Technical commentary in the replies centers on how advances in coding ability directly facilitate further model improvement—automation in coding accelerates algorithmic progress, while creative writing improvements are considered incidental. Another user inquires about which programming languages benefit most, and whether the jump pertains to code quality, design, or architecture, underscoring a desire for benchmarked, language-specific evidence and deeper transparency.
    • One commenter points out that automating coding capabilities in models, like GPT-5, is critical because it can feed back into improving future models—the rationale being that superior coding automation helps with tasks such as constructing, tuning, and debugging subsequent model iterations. They emphasize that advancements in creative outputs (like writing) are secondary relative to compounding algorithmic improvements in coding.
    • A commenter asks for specific details on the perceived performance jump in GPT-5’s coding abilities, questioning which languages were tested and whether improvements were in code quality, architectural design, or other factors. This highlights a technical interest in identifying concrete comparative metrics and qualitative benchmarks between model iterations.
    • Technical skepticism is raised toward hype without empirical evidence, with one user explicitly requesting specific comparisons with other models (such as context length, code accuracy, and language support) to substantiate claims made about GPT-5’s leap in coding performance.
  • Quote from The Information’s July 25 article about GPT-5: ‘For what it’s worth, OpenAI executives have told investors that they believe the company can reach “GPT-8” by using the current structures powering its models, more or less, according to an investor.’ (Score: 260, Comments: 71): The post discusses a quote from a July 25, 2025, The Information article stating that ‘OpenAI executives have told investors that they believe the company can reach “GPT-8” by using the current structures powering its models, more or less.’ This claim suggests OpenAI’s confidence in scaling its current transformer-based architecture for at least a few more major iterations (see article: OpenAI’s GPT-5 Shines in Coding Tasks). No technical specifics or benchmarks defining the improvements between versions are provided. Top commenters note the lack of concrete technical definitions for advancing from GPT-5 to GPT-8, questioning what constitutes a meaningful progression and highlighting the absence of agreed-upon metrics. There is also skepticism and mild satire regarding model naming convention bloat and perceived variation in model intelligence over time.
    • There is technical skepticism about the meaningfulness of statements regarding OpenAI’s pathway to ‘GPT-8’, citing the absence of defined metrics or objective benchmarks to distinguish advancements between versions such as GPT-5, GPT-6, or beyond. Commenters argue that without published evaluation standards or transparent progress criteria, claims about future model numbers lack tangible technical substance.
    • Another user references how access to the original article is constrained by paywalls, but points to an archived copy, clarifying that the information about OpenAI’s internal roadmap comes from a leaked or purported screenshot, not public technical documentation, making it difficult to independently verify the engineering details behind the versioning claims.
  • OpenAI CEO Sam Altman: “It feels very fast.” - “While testing GPT5 I got scared” - “Looking at it thinking: What have we done like in the Manhattan Project” - “There are NO ADULTS IN THE ROOM” (Score: 386, Comments: 301): OpenAI CEO Sam Altman reportedly made comments in reference to testing GPT-5, comparing his emotional response to the Manhattan Project and stating there were “no adults in the room,” hinting at both the pace of AI development and a perceived lack of governance or oversight. Although no technical benchmarks or model specifications were disclosed in the post, the implication is that GPT-5’s performance or capabilities are sufficiently advanced to provoke caution even among its developers. The top comments express strong skepticism towards Altman’s pattern of making dramatic claims before major releases (citing previous over-hyped launches), with users asserting that new GPT versions deliver only incremental improvements rather than the radical breakthroughs hinted at in Altman’s statements.
    • Several users point out that Sam Altman’s repeated hype cycles—including claims that GPT-5 made him feel frightened or invoked Manhattan Project analogies—tend to be met with skepticism, with the technical leap between model versions (like improved math capability) often seen as overblown compared to real existential concerns.
    • A theme emerges around overstating AI capability and risk: commenters argue that while new models such as GPT-5 may show incremental improvements, the portrayal of these upgrades as world-shaking or unmanageable can be misleading, especially for technically literate audiences.
    • There is a critique of AI leadership responsibility, with some users noting that OpenAI is in fact “the adult in the room” and responsible for guiding development and communication, rather than dramatizing AI as an unstoppable force beyond their control.

3. Claude Code, Agents, and Plugin Ecosystem: Community Tools and Rate Limit Policies

  • found claude code plugins that actually work (Score: 359, Comments: 71): The image appears to show a screenshot or diagram related to the “CCPlugins” GitHub project, which introduces a set of slash-command plugins designed to improve workflow with Claude (Anthropic’s LLM). The key technical concept is that the commands are phrased conversationally rather than imperatively, which the author claims enhances Claude’s responsiveness and versatility (e.g., ‘I’ll help you clean your project’ rather than ‘CLEAN PROJECT NOW’). Commands automate tasks like project cleanup, session management, comment removal, code review (without extensive architecture critique), running tests and fixing simple issues, type cleanup (replacing ‘any’ in TypeScript), context caching, and undo functionality. The implementation reportedly works universally across projects without special setup and features ‘elegant documentation.’ A key debate in the comments centers on the post’s authorship transparency and the need for an installer given the plugins are simply markdown files, suggesting some skepticism about the packaging and presentation. Another user shares related Claude hooks and configuration, indicating community interest in extensibility and customization.
    • A user identified a technical issue with the installation process on Ubuntu: running the provided curl and bash command results in the error cp: cannot stat './commands/*.md': No such file or directory, indicating that the install script expects files in a location or format that may not exist in fresh environments. The commenter suggests the documentation or script should be updated for reliability.
    • Another user questions the need for an installer when the plugins are reportedly just markdown files to be placed in .claude/commands, raising concerns about overengineering or unnecessary packaging complexity for simple file deployment.
    • A participant shared an external resource, a GitHub repository (https://github.com/fcakyon/claude-settings) containing additional hooks, commands, and MCPs for Claude, offering alternative implementation patterns and plug-and-play code examples to extend Claude’s capabilities.
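
Since the plugins are reportedly plain markdown files destined for .claude/commands, the install step can be sketched as a copy with an explicit check for the missing-source case that produced the cp error reported above. This is a hypothetical helper, not the project’s installer: the function name, the source directory layout, and the choice of destination root are all assumptions.

```python
import shutil
from pathlib import Path

def install_commands(source_dir: Path, target_root: Path) -> list[str]:
    """Copy *.md command files into <target_root>/.claude/commands.

    Hypothetical helper: source_dir and target_root are assumptions,
    not paths defined by CCPlugins.
    """
    files = sorted(source_dir.glob("*.md")) if source_dir.is_dir() else []
    if not files:
        # Fail with a clear message instead of a bare `cp: cannot stat` error.
        raise FileNotFoundError(f"no .md command files found in {source_dir}")
    dest = target_root / ".claude" / "commands"
    dest.mkdir(parents=True, exist_ok=True)
    for f in files:
        shutil.copy2(f, dest / f.name)
    return [f.name for f in files]
```

Checking for the source files up front turns a confusing shell error on a fresh machine into an actionable message, which is the reliability fix the commenter was asking for.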
  • Claude Custom Sub Agents are amazing feature and I built 20 of them to open source. (Score: 128, Comments: 70): The project awesome-claude-agents provides an open-source set of 26 specialized Claude sub-agents acting as a coordinated AI development team. Each agent represents a specific software dev role (backend, frontend, API, ORM, etc.), with orchestration enabling parallel execution and specialization—mimicking a real agile team structure to improve code quality, system architecture, and delivery via command-line invocation. The solution addresses the lack of cross-agent orchestration in base Claude sub-agent workflows by introducing a ‘Tech Lead’ coordinator and explicit task breakdown, configurable via a team-configurator CLI command. Top technical concerns raised by commenters include increased token consumption and the risk of multiplying bugs with parallel agent execution, while skepticism also exists regarding the authenticity of the project’s origins (suggesting the project or post may itself be AI-generated).
    • A user points out that running 26 parallel sub-agents potentially introduces significant complexity and could lead to an exponential growth in possible bugs, highlighting a classic trade-off in large agent-based systems between capability and maintenance overhead.
    • Another commenter asks whether the use of subagents affects performance, specifically if it leads to slower execution, raising concerns about the scalability and efficiency of coordinating multiple agents within such architectures.
    • There is also a mention of increased token usage as a direct consequence of leveraging multiple agents, suggesting that this approach might significantly increase operational costs when using API-driven LLM services.
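
The coordinator pattern described in the post (a ‘Tech Lead’ that breaks work into role-specific tasks and runs specialists in parallel) can be sketched as follows. This is an illustrative toy, not code from awesome-claude-agents: the agent functions, role names, and the word count used as a stand-in for token usage are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical specialists; real sub-agents would call an LLM.
def backend_agent(task: str) -> str:
    return f"[backend] plan for: {task}"

def frontend_agent(task: str) -> str:
    return f"[frontend] plan for: {task}"

AGENTS = {"backend": backend_agent, "frontend": frontend_agent}

def tech_lead(breakdown: dict[str, str]) -> dict[str, str]:
    """Dispatch each role's task to its specialist in parallel and collect results."""
    with ThreadPoolExecutor() as pool:
        futures = {role: pool.submit(AGENTS[role], task) for role, task in breakdown.items()}
        return {role: fut.result() for role, fut in futures.items()}

results = tech_lead({"backend": "add /login endpoint", "frontend": "build login form"})
# Crude stand-in for cost: every extra agent adds its own output to the total.
total_words = sum(len(text.split()) for text in results.values())
print(results, total_words)
```

Even in this toy, total output grows with the number of agents, which mirrors the commenters’ concern about token consumption when many sub-agents run at once.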
  • Updating rate limits for Claude subscription customers (Score: 384, Comments: 599): Anthropic is implementing weekly rate limits for Claude Pro and Max subscribers starting late August 2025, impacting <5% of users based on resource consumption metrics (e.g., edge cases with 24/7 usage or users incurring ‘tens of thousands’ worth of compute on $200 plans). The rate-limiting move aims to ensure fair allocation of resources, prevent abuses like account sharing or reselling, and preserve service reliability amidst recent reliability and performance issues. Max 20x subscribers will have the option to purchase additional usage at standard API rates, and alternative pathways for ‘long-running’ advanced use cases are in development. Top commenters are skeptical about the claimed ‘5%’ impact, question why global limits are imposed instead of targeting only abusers, and allude to public leaderboard users who may be responsible for heavy 24/7 compute usage.
    • A user suggests implementing an always-visible, user-toggleable interface that shows percentage consumption of rate limits per model, along with relevant time frames. This feature would help users monitor their usage and better understand when they are approaching limits, addressing transparency concerns with the updated restriction policies.
    • There is technical confusion around the new rate limiting approach, with questions on whether limits now reset weekly instead of daily. This affects usage planning and could impact how subscription users schedule their automated or high-frequency workflows under the new rules.
    • Some users express concern that the policy targets only a small percentage (‘5%’) of heavy users, but without sufficient transparency, regular users may suffer as a result. They highlight the need for more granular usage statistics from the provider to clarify how the limits affect different customer groups.
  • RIP Claude Code - Just got this email (Score: 190, Comments: 91): The post discusses an email from Anthropic regarding significant changes to Claude Code’s Max plan usage: weekly rate limits are being clarified as ‘140-280 hours of Sonnet 4’ and ‘15-35 hours of Opus 4’ for most users, with heavy users (especially those running large codebases or multiple instances in parallel) potentially hitting limits earlier. These changes address sustainability issues due to heavy resource consumption under the fixed-fee plan, as highlighted in the email and the discussion of users abusing unlimited code execution. Commenters largely agree that the changes are reasonable, with some blaming heavy users for plan abuse and others pointing to public bragging about exploiting the system (e.g., running multiple Claude Code instances and amassing large token usage under one subscription).
    • Key details from the emailed changes specify that Max 5x users will receive ‘140-280 hours of Sonnet 4 and 15-35 hours of Opus 4 within their weekly rate limits’, with heavy users potentially encountering caps sooner, especially if running ‘multiple Claude Code instances in parallel’. This effectively quantifies the new usage boundaries imposed by Anthropic.
    • Several comments discuss excessive resource exploitation by users running numerous parallel Claude Code sessions, achieving disproportionate usage and incurring high backend costs (‘racking thousands of dollars in tokens’) while only paying a fixed subscription; this is identified as the motive for Anthropic implementing stricter rate limits.
    • A technically inclined commenter emphasizes that if these new restrictions reduce ‘unplanned outages’, it would improve service quality for everyone, suggesting the prior abuse potentially contributed to service instability.
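
The always-visible usage meter suggested in the rate-limit thread above boils down to a percentage per model plus the time until the window resets. A minimal sketch, assuming a fixed seven-day window and hour-based limits; the class, field names, and example numbers are hypothetical (the 140-hour cap simply borrows the low end of the range quoted in the email) and do not reflect Anthropic’s actual accounting.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class WeeklyUsage:
    model: str
    used_hours: float
    limit_hours: float      # hypothetical cap; real limits vary by plan and usage
    window_start: datetime  # assumed start of a fixed 7-day window

    def percent_used(self) -> float:
        """Share of the weekly limit consumed, capped at 100%."""
        return min(100.0, 100.0 * self.used_hours / self.limit_hours)

    def resets_in(self, now: datetime) -> timedelta:
        """Time remaining until the assumed weekly window rolls over."""
        return self.window_start + timedelta(days=7) - now

start = datetime(2025, 9, 1)
usage = WeeklyUsage("Sonnet 4", used_hours=70.0, limit_hours=140.0, window_start=start)
now = start + timedelta(days=3)
print(f"{usage.model}: {usage.percent_used():.0f}% used, resets in {usage.resets_in(now).days} days")
```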

AI Discord Recap

A summary of Summaries of Summaries by X.ai Grok-4

Theme 1: Model Mayhem: New Releases Battle for Supremacy

  • Qwen3-Coder Cranks Up the Code Game: Developers hyped Qwen3-Coder as a cheaper, open-source rival to Claude Sonnet 4, boasting 7x lower costs and strong performance in agentic coding via CLI tools. Users praised its 89% win rate against GPT-4 on ArenaHard, but warned of high pricing at $0.30/$1.20 per Mtoken and potential quality drops in large contexts up to 262,144 tokens.
  • GLM-4.5 Multilingual Magic Outshines Rivals: GLM-4.5 (110B and 358B sizes) excelled in tasks like Turkish writing, surpassing R1, K2, V3, and Gemma 3 27B, but struggled with low-BPW GGUF conversions. Community buzz focused on its addition to LM Arena, sparking debates on MoE vs. dense models for efficiency in local use.
  • Kimi K2 Vibes Crush Gemini’s Cringe: Coders hailed Kimi K2 for savage, relatable tones in niche topics, outperforming Gemini 2.5 Pro in coding, tool use, and knowledge without complaints. Critics slammed Gemini for syntax errors and pricing unpredictability, positioning Kimi as a flexible, cheap alternative optimized for documents.

Theme 2: Fine-Tuning Fiascos and Optimizer Overhauls

  • GEPA Reflects Its Way to Prompt Domination: GEPA optimizer treats prompts as evolvable documents, boosting performance 10% over GRPO with 35x fewer rollouts by analyzing linguistic failures. DSPy integration looms, promising reflection models like gpt-4o to explain optimizations, outpacing MIPROv2 and deprecating older tools.
  • Gemma 3 Tuning Hits Roadblocks and Wins: Fine-tuners battled AttributeErrors in saving Gemma 3-12B LoRA to GGUF, resorting to manual llama.cpp scripts amid rework. Success stories emerged with GRPO on Gemma 3 4B, yielding creative fiction models, though reproducibility issues plagued SQuAD evals aiming for 77% scores.
  • KV Cache Distills Super-Long Inputs: Researchers distilled KV caches for efficient handling of massive inputs, with code at GitHub repo enabling stable training. Debates flared on MoE vs. dense for nuance capture, as Qwen3’s geminized tweaks fixed reasoning loops better than DeepSeek styles.

Theme 3: Agent Antics: Protocols, Payments, and Security Shenanigans

  • MCP Gets Ramparts Security Boost: Javelin AI open-sourced Ramparts scanner for MCP, spotting LLM agent vulnerabilities like path traversal and SQL injection via Model Context Protocol. It enumerates capabilities and flags abuse paths, with launch details at blog post.
  • Agents Demand Their Own Payday Systems: Builders argued agents need separate payment flows from humans, as they skip CAPTCHAs and human approvals, proposing AI-native solutions for autonomy. Visions of AI companies evolving into identity providers sparked talks of an AI App Store with profit-sharing for monetized agents.
  • Multi-Agent Context Woes Meet MCP Fixes: Developers tackled multi-AI agent costs from bloated contexts, suggesting MCP and structured JSON schemas like google-adk for efficiency. Fast-agent added Mermaid diagrams and URL-embedded prompts for tone-specific MCP experts, easing expert creation.

Theme 4: Hardware Havoc: GPUs Grapple with AI Demands

  • AMD Trails Nvidia in AI Race: Users debated upgrading from 4070 Ti Super to 9070 XT, but AMD’s weak ROCm PyTorch support and Windows performance lag made Nvidia the go-to for AI. Focus on Stable Diffusion 3.0 over core improvements frustrated the crowd, highlighting Nvidia’s edge.
  • RTX 4060 Wins FOSS GPU Crown: For FOSS AI, RTX 4060 with 16GB VRAM beat Intel ARC 770 due to superior software support, despite SYCL preferences over CUDA. 5070ti handled 12B models well at 5t/s for 32B, but 16GB VRAM limited larger runs.
  • Inference Noise Sparks Cybercrime Paranoia: MacBook users reported high-frequency noise during inference, joking about data theft via sound waves. Summarized as “People invented AI; mass propaganda,” it underscored emerging security fears in hardware-AI interactions.

Theme 5: Benchmark Brawls and Evaluation Exposés

  • LM Arena Probes GPT-5 Shadows: Speculation swirled on GPT-5 release timing (next Thursday to early next month), with Summit and Zenith vanishing then reappearing amid contamination fears after a reported 10/10 score on Simple Bench. OpenAI’s pattern of post-Arena releases fueled theories, as did the potential impact of the EU AI Act.
  • Gemini’s Coding Inconsistencies Exposed: Gemini 2.5 Pro shone in evals but bloated code with comments over substance, prompting calls for pre-generated prompts in Arena for fair feedback. Debates on benchmark reliability raged, claiming high scores don’t prove overall superiority.
  • NeurIPS Rebuttal Rules Rile Authors: NeurIPS switched from 6k characters plus PDF to 10k with no visuals, blocking evidence like graphs and angering authors. Frustrations mounted over visual-proof limitations in rebuttals, echoing broader peer-review gripes.

Discord: High level Discord summaries

Unsloth AI (Daniel Han) Discord

  • Liquid LFM2 Diffusion Models Spark Interest: Members discussed the merits of Liquid LFM2 350M, 700M, and 1.2B models, highlighting their nature as diffusion models and considering them super cool.
    • The discussion underscores the community’s interest in diffusion models as a promising avenue for further exploration and development.
  • Decoding Doctor’s Orders: More Carbs and Salt?: A member shared that a doctor recommended eating more carbs and salt (image), later clarifying that it was in response to high blood pressure.
    • The seemingly counter-intuitive advice sparked lighthearted discussion and speculation among members.
  • Inference Noise Leads to Cybercrime Concerns: A member reported a noise (around 10,000 Hz by ear) while inferring a model on their MacBook, leading to concerns about data theft via sound.
    • The member summarized: “People invented computers; now you’re stealing my data by a goddamn sound. People invented the internet; brainrot arrived. People invented AI; mass propaganda.”
  • Gemma 3 Fine-Tuning Yields AttributeError: A member encountered an AttributeError when attempting to save a fine-tuned Gemma3-12b model from LoRA checkpoints as a GGUF file due to a missing save_pretrained_merged attribute.
    • Roland Tannous suggested manual conversion with the llama.cpp conversion script, as the save_to_gguf logic is undergoing rework.
  • Qwen Goes Gemini with New Model Release: A new geminized Qwen3 model was released (Hugging Face, GGUF), aiming to reduce issues with getting stuck in reasoning compared to DeepSeek-style thinking.
    • The release aims to address a common issue in reasoning tasks by modifying the model’s architecture.

LMArena Discord

  • GPT-5 Speculation Runs Rampant: Members speculated on the timing of GPT-5’s release, with estimates ranging from next Thursday to early next month, and highlighted the potential impact of the EU AI Act.
  • Summit and Zenith Vanish, Then Reappear: Users noticed the disappearance of Summit and Zenith from the LM Arena, sparking concerns and speculation about their removal, with some suspecting tests of upcoming GPT-5 models.
    • Members later confirmed their return to the rotation, though with significantly reduced frequency.
  • LM Arena Models Face Contamination Scrutiny: Discussions arose regarding potential data contamination in Zenith, with one user reporting 10/10 on the public Simple Bench dataset, raising questions about the reliability of benchmarks due to potential training on benchmark data.
    • Some members claim it’s easy to score highly on benchmarks and that doing so doesn’t mean Zenith is generally better.
  • Apple’s AI Ambitions Spark Debate: A discussion ensued regarding Apple’s approach to AI, hardware, and its relationship with China. Members debated whether Apple is focusing enough on AI and whether it should offer its hardware to datacenters for AI development, and highlighted the difficulty of creating a CUDA alternative.
    • Members traded views on whether Apple is following a better strategy in focusing on mobile / on-device inference and privacy and whether the talent / workforce quality is the primary issue.
  • Gemini’s Coding Skills Under the Microscope: Users pointed out inconsistencies in Gemini 2.5 Pro’s capabilities, noting its strong performance in coding evaluations but also its tendency to include excessive comments rather than actual code.

OpenAI Discord

  • AI Models See World Through Rose-Tinted Glasses: AI image generators show a bias towards warmer “Golden Hour” tones unless explicitly instructed otherwise, with users suggesting specifying color temperatures like 6000K to counteract this bias and sharing a color average as an example.
    • Members hypothesized that the AI is aware of the orange/blue contrast that makes images look better, which is part of the reason for the yellow-orange bias.
  • GPT Image Quality Nosedives: Users report a notable decline in image generation quality with GPT-4, especially in new chats, with images appearing blurry even with simple prompts; members encouraged filing bug reports in the designated Discord channel.
    • The community believes the free tier quality has decreased, and quality degrades when Plus/Pro users are throttled due to traffic, stating a lot of things are reduced for you, including quality. That’s been documented by a lot of publications.
  • Mind Uploading Sparks Consciousness Debate: The hypothetical of uploading a human mind to a computer ignited a discussion on defining consciousness and what it would mean to be a conscious entity.
    • Members stated that if mind uploading were possible, consciousness could be seen as a process, not a substance, with others suggesting that mind uploading would probably not be a simple scan of the brain because a scan does not capture continuous brain activity.
  • AI Prompting has Fundamental Pillars: The key to prompt engineering includes picking a common language, stating requirements, detailed communication, and verifying the output with attention to fact-checking.
    • Members share that prompt engineering is the art and practice of figuring out where possible, and inside allowed content, how to get the exact desired output from the model.
  • GPT-4o Falls Flat for Coding: Community members are questioning the usefulness of GPT-4o for coding, calling it disappointing for tasks such as working with Cline, and say they will use GPT-5 when it proves itself worthy.
    • Some are reporting it’s more suitable for basic coding questions, with preference for GPT-4.1 or o4-mini-high for more complex code.

OpenRouter (Alex Atallah) Discord

  • OpenRouter Dodges Rate Limit Rumors: Users debated OpenRouter rate limits, clarifying that limits are virtually non-existent as long as you have funds, excluding Cloudflare’s DDoS protection and the provider’s capacity, as per the OpenRouter documentation.
    • The discussion suggested using paid models or switching models when encountering rate limits on free options, highlighting that high demand often causes these restrictions.
  • NSFW Sparks Debate: A user joked about models being used for creative tasks as a euphemism for sexual interactions with bots, while another countered by emphasizing the need for models that produce high-quality, non-explicit content.
    • The conversation shifted towards prioritizing models that facilitate serious creative outputs over explicit or unfiltered content, avoiding the dreaded achievements or bad endings in stories.
  • Slender Chatbot Stalks: Following Deepseek going down, members sought alternatives to its long-term context and detailed character descriptions, with recommendations including Qwen3 and Claude, while a user requested creepypasta bot suggestions, sparking discussions about bots like Slender Mansion.
    • The chat meandered to discussions on the ethics of chatbot cannibalism.
  • OpenRouter Eyes GPU Exchange: Members discussed the potential for OpenRouter to launch a compute exchange, enabling groups with spare compute to contribute, with OpenRouter managing demand and providing a simple installation image.
    • The concept was compared to Bittensor, though one member pointed out that Bittensor has the user base and is crypto based.
  • Ramparts Secures MCP: Javelin AI open-sourced Ramparts, a security scanner designed to identify vulnerabilities in LLM agent tool interfaces using the Model Context Protocol (MCP), including path traversal and command/SQL injection.
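
To make the path traversal class of issue mentioned above concrete, here is a minimal Python sketch of the kind of check a file-reading agent tool might perform before touching the disk. It is not taken from Ramparts; the function name `is_safe_path` and the sandbox directory `/srv/agent_files` are hypothetical.

```python
from pathlib import Path


def is_safe_path(base_dir: str, user_path: str) -> bool:
    """Return True if user_path stays inside base_dir after resolution."""
    base = Path(base_dir).resolve()
    # Joining and then resolving collapses ".." segments and follows symlinks.
    # An absolute user_path replaces base entirely, so it is rejected below.
    target = (base / user_path).resolve()
    return target == base or base in target.parents


if __name__ == "__main__":
    # Hypothetical sandbox directory used only for illustration.
    print(is_safe_path("/srv/agent_files", "notes/todo.txt"))
    print(is_safe_path("/srv/agent_files", "../../etc/passwd"))
```

A scanner would probe a tool with inputs like the second one and flag the tool if the request is not rejected.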

Moonshot AI (Kimi K-2) Discord

  • Kimi K2 Slayin’ the Game: Users are highly satisfied with Kimi K2’s ability to discuss niche topics with a relatable and savage tone, with one user reporting it’s the first model they’ve used with zero complaints.
    • Some members have even found Kimi K2 superior to Gemini 2.5 Pro, with one user describing Gemini as slop and cringe, noting Kimi’s competence in coding, tool use, writing, and general knowledge.
  • Gemini Glitches Galore!: Members have voiced complaints about Gemini’s syntax errors, shallow and incomplete tool calls, and unpredictable pricing.
    • Some members say that Gemini’s AI Studio users are out of touch with issues faced by paying users, because they have free access to Gemini 2.5 Pro.
  • Roo Code and Cline Ride to the Rescue: Members discussed Roo Code and Cline as excellent coding tools that are optimized for agentic use with various models.
    • One user, soon offering Kimi K2 flat-rate server plans, intends to recommend Cline and Roo Code to their users.
  • Copyright Chaos: Data Dilemmas!: Members debated the controversy surrounding models trained on copyrighted material, referencing Anthropic’s ongoing lawsuit for copyright infringement related to Claude.
    • Some feel open-source models trained on copyrighted data are permissible, but they are cautious about closed-source models monetizing scraped data for profit.

Cursor Community Discord

  • Cursor’s Auto Mode is a Hit but also Sketchy: Many users on the $20 plan are reporting that Auto mode works pretty damn good, delivering unexpected value by removing the burden to switch between models like Gemini, Claude, and GPT 4.
    • Others are speculating Auto is just the cursor-small model in disguise and doesn’t work for major processes like bug fixing and full scripting, although some users are saying that Cursor is transparent with the model being used.
  • Swarm CLI Swarms into Action on Claude Code: Swarm is building on top of Claude Code by creating an IDE-like feel with drag-and-drop functionality and real-time chat with AI agents in a swarm, including an ADR and retrospective setup with live documentation and project enhancement.
    • A user asked if the project was ready to use on live projects, and a dev responded that it can now build projects such as a SaaS e-commerce platform and a calculator app.
  • Qwen3 Coder dethrones Claude Sonnet 4 for cheaper: Qwen3 Coder from Alibaba is touted as a solid alternative to Claude Sonnet 4, because it’s 7x cheaper and fully open source, to build anything using its CLI tool.
    • Another model, Kimi K2, is described as seriously good as well and is claimed to be optimized for writing documents, whereas other models like Claude sound dumb.
  • Background Agents Are a Buggy Mess: Users are reporting that Cursor’s Background Agents have become increasingly buggy, with environments failing to spin up, follow-up requests not being processed, git commits creating massive core files, and git push failing.
    • Users have found that Cursor’s support team doesn’t seem to understand Background Agents, providing generic answers or admitting they don’t know how to help, and that it doesn’t even connect to the remote in the main IDE, leading to token-wasting back-and-forth.
  • Cursor IDE has gone bonkers: Users are reporting random issues with Cursor, citing command lines getting stuck, the chat window freezing, PowerShell being chosen at random, and the current prompt being deleted.
    • One user shared, lately, using Cursor IDE has been getting worse by the day, even with Claude 4 Sonnet. It’s throwing more errors, stumbling back and forth like carrying a bag of fruit with holes.

LM Studio Discord

  • Qwen3-Coder to Grow in Size: The Qwen team hinted at upcoming model sizes for Qwen3-Coder, sparking excitement in the community for a potential 80-250B MoE model or a dense 32-70B model.
    • Community members anticipate these advancements could improve performance and capabilities, expanding the utility of the Qwen3-Coder series.
  • LM Studio Plugins are Coming Eventually: Plugins are under construction for LM Studio, with a beta currently in development for TypeScript developers via this form.
    • In current versions of LM Studio, logs from the MCP server appear in the developer console, marked by Plugin(mcp/duckduckgo), clarifying that these logs originate from the MCP server itself.
  • Human Bosses Beat LLMs in Career Counseling: Members are advising against trusting LLMs for sensitive advice, like career guidance, recommending consulting real people like bosses, family, and friends.
    • The community suggests that real-world human experience provides better, more reliable advice in areas where emotional intelligence and personal understanding are crucial.
  • 5070ti: Good Enough GPU for 12B Models: A user lauded the 5070ti as a cost-effective option, capable of effectively running 12B models, and mentioned considering upgrading upon the release of Super models.
    • While the 16GB of VRAM is limiting for larger models, it can still handle 32B models at slower speeds of around 5t/s.
  • AMD Still Underperforms in AI Compared to Nvidia: Despite considering a move from a 4070 Ti Super to a 9070 XT, users generally agree that AMD still trails behind NVIDIA in AI performance, especially on Windows.
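
The VRAM figures in the 5070ti item can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates memory for model weights only, assuming a given number of bits per weight. The quantization levels are assumptions (the discussion did not state them), and KV cache and runtime overhead are ignored.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for size in (12, 32):
        for bits in (16, 8, 4):
            gb = weight_memory_gb(size, bits)
            print(f"{size}B at {bits}-bit: ~{gb:.0f} GB of weights")
```

Under these assumptions a 12B model at 4-bit needs roughly 6 GB and fits comfortably in 16GB, while a 32B model at 4-bit needs roughly 16 GB for weights alone. One plausible reading, not stated in the discussion, is that part of such a model ends up outside VRAM, which would be consistent with the slower speeds reported.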

Eleuther Discord

  • SOAR Program Ratings are All Vibes!: Members debated the competitiveness of the SOAR program, with some noting its intense competition and others highlighting its open and equitable/free nature compared to paid programs like Algoverse.
    • One member joked about doing their SOAR ratings off vibes with the background columns disabled.
  • Semantic Search Grapples with Intent: Members found that dense embeddings are pretty fuzzy on actual intent when used in semantic search, especially with specific queries like What is the name of the customer?
    • Solutions discussed included using a Knowledge Graph to make relations explicit and leveraging vector databases and RAG.
  • NeurIPS Rebuttal Rule Change Angers Authors: Authors voiced discontent with NeurIPS altering rebuttal rules, switching from 6k characters plus a PDF to 10k characters with no PDF, which impeded the inclusion of visual examples.
    • The sudden shift left authors unable to deliver crucial visual evidence, hindering their ability to effectively address reviewer concerns.
  • GPT-NeoX Framework Still Cooking: Despite the rise of other open-source models, members are still actively discussing the GPT-NeoX training framework, focusing on integrating two-level checkpointing to enhance model replicas and checkpoint management.
    • The proposed method aims to improve elasticity and fault-tolerance, enabling model replicas on N nodes to save checkpoints to node-local storage, with CPU threads saving back to the PFS during training.
  • Llama-3 Number Reproduction Attempts: A member is struggling to reproduce the Llama-3 paper numbers (77%) using the lm-evaluation-harness, leading to a discussion on configuration issues.
    • It was suggested that the evaluation discrepancies may stem from the harness using SQuAD v2, while Meta potentially employed SQuAD v1 or a different method for SQuADv2.
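
The two-level checkpointing idea from the GPT-NeoX item can be sketched in a few lines of Python. This is an illustration of the general pattern only, not GPT-NeoX code: the function name `save_two_level`, the file naming, and the use of a plain byte string as "model state" are all hypothetical.

```python
import shutil
import tempfile
import threading
from pathlib import Path


def save_two_level(state: bytes, step: int, local_dir: Path, pfs_dir: Path) -> threading.Thread:
    """Write a checkpoint to node-local storage, then copy it to the
    parallel file system (PFS) on a background CPU thread."""
    local_dir.mkdir(parents=True, exist_ok=True)
    pfs_dir.mkdir(parents=True, exist_ok=True)
    local_path = local_dir / f"step_{step}.ckpt"
    local_path.write_bytes(state)  # level 1: blocking write to local disk

    def flush() -> None:
        # Level 2: slower copy that can overlap with continued training.
        tmp = pfs_dir / f"step_{step}.ckpt.tmp"
        shutil.copyfile(local_path, tmp)
        # Renaming last means a partially copied file never has the final name.
        tmp.rename(pfs_dir / f"step_{step}.ckpt")

    worker = threading.Thread(target=flush, daemon=True)
    worker.start()
    return worker


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    thread = save_two_level(b"fake-weights", 100, root / "local", root / "pfs")
    thread.join()  # a real trainer would keep training and join later
    print(sorted(p.name for p in (root / "pfs").iterdir()))
```

The training loop only waits for the local write; the copy to shared storage happens off the critical path, which is where the fault-tolerance benefit would come from.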

HuggingFace Discord

  • RTX 4060 is GPU Go-To for FOSS: Despite the preference for SYCL over CUDA for FOSS reasons, the RTX 4060 with 16GB VRAM was recommended due to software support.
    • The Intel ARC 770 was initially considered but discouraged due to poor software support.
  • HF API & LiteLLM Unlock LLM Potential: Members discussed integrating the Hugging Face API into Open WebUI using LiteLLM, with potential alternatives for managing HF inference.
    • Concerns were raised about the 2K context window limit with the free API, prompting a discussion on pricing and the suitability of different models like deepseek r1.
  • SamosaGPT Serves Up AI Content Studio: SamosaGPT is a self-motivated project creating a sleek web interface that brings together Ollama’s local LLMs, Stable Diffusion for image generation, and even video generation.
    • This all-in-one AI studio aims to be accessible and customizable without jumping between a million tools, and the project can be found on GitHub and Vercel.
  • Llava Sparks Iterative Investigation: A user experimented with the Llava model in Ollama to describe images and noted that when the max_steps parameter was set to greater than 1, the agent attempted to search the web for more information about the image after the initial description.
    • For example, the agent tried to write code to search the web to find out more about the characters in the image after generating an initial description.
  • GNN State-of-the-Art tied to Graph Spectral Theory: A member shared a YouTube link to a talk by EleutherAI on graph spectral theory that connects to the current state-of-the-art of Graph Neural Networks (GNNs).
    • The same member has a Medium blogpost with notes on spectral graph theory.

Latent Space Discord

  • Zhao Takes the Helm at Meta Superintelligence Labs: Meta appointed Shengjia Zhao as Chief Scientist of their Superintelligence Labs, sparking discussion, seen on X.
    • Comments ranged from cryptic remarks like ‘Pathe, Mathe, and Zuck’ to questions about Yann LeCun’s role.
  • HuggingFace’s Inference Focus: Members analyzed HuggingFace’s Business Model, spotlighting their inference partnership as a key revenue stream.
    • Discussion noted that AWS subsidizes their storage, positioning them to compete with OpenRouter in the inference space.
  • Model Context Protocol Docs Get a Makeover: David Soria Parra announced a revamp of the Model Context Protocol documentation, soliciting feedback on X.
    • The update garnered positive reactions, with anticipation for the implementation of MCP features.
  • OpenAI Teases Consumer Hardware?: OpenAI’s job postings signal a move into consumer hardware, seeking expertise in wireless tech, OLED, microphones, cameras, seen on X.
    • Skepticism arose regarding OpenAI’s capacity to handle both consumer hardware and large-scale data center infrastructure simultaneously.
  • E2B Bags $21M for AI Agent Cloud Runtime: E2B secured a $21M Series A to build a cloud runtime for AI agents, backed by investors like Insight Partners, Decibel VC, and others, as seen on X.
    • Their goal is to equip AI agents with infrastructure like quickly bootable computers, file management, and secure, isolated environments.

Modular (Mojo 🔥) Discord

  • Nabla Edges Out JAX for Mojo Training: The Nabla training library for Mojo shows slightly better performance than JAX on specific hardware setups, although the library is new and rapidly evolving.
    • Because Mojo is still pre-1.0, its interface to MAX lacks maintenance, which impacts performance, and core features like IO and threading are still under development.
  • Microsoft’s Bitnet Eyes Modular Integration: Members mulled over integrating Microsoft’s bitnet-b1.58-2B-4T model into Modular due to its CPU-based design.
    • However, potential challenges exist in implementing Bitnet due to alignment issues on ARM and RISC-V architectures, necessitating custom shuffling algorithms and possibly new kernels like XOR.
  • Nanobind Champions Python Interop for Mojo: Modular favors Nanobind over Cython for Python interop, thanks to its pure C++ nature, better stub file generation and runtime performance, as highlighted in these benchmarks.
    • The goal is seamless interop akin to PyO3, automatically exporting convertible functions without manual decoration, to avoid creating another language on top of Mojo.
  • MAX Ditches PyTorch Dependency: The PyTorch dependency will be removed in the next nightly of MAX, promising a leaner setup.
    • The team clarified that they only pin to a minimum of 2.5, but believe that 2.0 is actually their lower bound.
  • Mojo Flexes Metaprogramming Muscle: Mojo offers compile-time execution without IO, enabling pre-computation of game states in engines using features akin to Rust’s traits and dependent types, as detailed in the August 2023 Changelog.
    • Heap-allocated memory can be materialized into dynamic values, which can be leveraged to precompute lookup tables, trees, graphs, and tries.

GPU MODE Discord

  • Hackathon Spots are Limited, But Talks are Open: The Jane Street hackathon in NYC has an estimated 80-200 spots, with talks recorded and available regardless of hackathon application status as seen on their official programs and events page.
    • Attendees should note this is an in-person only event.
  • Multi-AI Agent Context Woes Resolved: A member is developing a multi-AI agent system, but is facing performance and cost issues due to increased context length, leading to consideration of MCP (Model Context Protocol).
    • Another member suggested using structured output with a JSON schema, pointing to google-adk as a resource; however, further details are needed to assess full community adoption.
  • Fractal Rendering Gets CUDA Boost: A member ported their fractal renderer from JAX to CUDA and shared the code in a GitHub repository.
    • The author joked that they possess 10% of Terry Tao’s IQ.
  • Tri Dao’s Lab Drops Attention Bombshell: A new paper from Tri Dao’s lab introduces two attention mechanisms (GTA and GLA) aimed at faster and more memory-efficient decoding, building on MQA, GQA, and MLA, as highlighted in a LinkedIn post.
    • The original poster annotated all four related papers (GTA and GLA and MQA, GQA, and MLA) with explanations, diagrams, and worked examples.
  • Factorio Gets Hot-Reloaded: Members discussed blue/green docker servers for hot-loading in Factorio, which leverages Factorio’s native save/load and pre-loading a paused standby server, then flipping the rcon endpoint to the standby.
    • Although it requires maintaining two instances, this setup trades a little infra for a lot of determinism, offering zero drift and trivial rollbacks.
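
The blue/green flip described in the Factorio item can be modeled with a small state machine. The sketch below does not talk to Docker or RCON; the `Server` and `BlueGreenRouter` names and the endpoints are hypothetical and only illustrate the order of operations (pre-load on the paused standby, then flip the endpoint).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    name: str
    rcon_endpoint: str
    loaded_save: Optional[str] = None
    paused: bool = True


class BlueGreenRouter:
    """Tracks which of two servers currently receives RCON traffic."""

    def __init__(self, blue: Server, green: Server) -> None:
        self.active, self.standby = blue, green
        self.active.paused = False

    def hot_reload(self, save_name: str) -> str:
        # 1. Pre-load the save on the paused standby server.
        self.standby.loaded_save = save_name
        # 2. Flip roles: the standby goes live, the old active is paused.
        self.active, self.standby = self.standby, self.active
        self.active.paused = False
        self.standby.paused = True
        # 3. Clients should now use the new active endpoint.
        return self.active.rcon_endpoint

    def rollback(self) -> str:
        # The previous server still holds its old state, so rollback is a flip.
        self.active, self.standby = self.standby, self.active
        self.active.paused = False
        self.standby.paused = True
        return self.active.rcon_endpoint


if __name__ == "__main__":
    router = BlueGreenRouter(
        Server("blue", "localhost:27015"), Server("green", "localhost:27016")
    )
    print(router.hot_reload("checkpoint_42"))
    print(router.rollback())
```

Keeping the old instance paused rather than destroying it is what makes the rollback a single pointer swap.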

Nous Research AI Discord

  • Atropos Gets Massive Update: Nous Research recently released a big update to Atropos, their RL environments framework.
  • Qwen Shows Impressive Win Rate Against GPT4: Qwen is performing well, achieving an 89% win rate against GPT4 on arenahard.
    • This benchmark success indicates Qwen’s growing capabilities and competitiveness in challenging AI tasks.
  • GPT-5 Stealthily Incoming?: Community members speculate that o3 might secretly be GPT-5, referencing a Reddit post about OpenAI stealth routing all o3 requests.
  • GLM-4.5 Faces Conversion Hiccups: Members are struggling to convert the 110B version of the new GLM 4.5 model (110B and 358B sizes) to low-bpw GGUF format.
    • Despite this, GLM 4.5 excels in multilingual tasks like Turkish, outperforming models like R1, K2, V3, and Gemma 3 27B in writing and creative outputs.
  • Hyperstim Patch Realigns ChatGPT: Hyperstim Patch, which is available via tinyurl.com/Hyperstim, drops straight into ChatGPT: users can say activate to let it realign.
    • The creator is actively soliciting feedback from users to refine the patch’s capabilities.

Yannick Kilcher Discord

  • Context Manager Branches LLM Convos: An LLM Context Manager on GitHub uses branching and the Contextual Scaffolding Algorithm (CSA) to manage context fed into the model during conversations, shown in this video.
    • The project aims to prevent context pollution and rot, optimizing LLM inference for conversational applications.
  • Web3 Downvotes Become Weaponized: A member shared a Web3 experiment where downvotes were misused by groups against each other, creating a toxic environment reminiscent of strategic lawsuits.
    • The experiment showed downvotes work best on anonymous, algorithm-driven platforms with many unnetworked users, drawing parallels with regulations for public stock offerings.
  • Pruning Mostly Works on Dense Networks: Most insights into neural network pruning focus on simple dense networks, not transformers, with many failed attempts.
    • Pruning works in dense networks because the models form boundary conditions in a high-dimensional space, requiring only boundary preservation for classification.
  • Amazon Q Thwarts Wipe-Command Injection: A hacker’s attempt to inject computer-wiping commands into Amazon’s AI coding agent via prompt injection in a pull request was unsuccessful, as described here.
    • The incident highlights potential security vulnerabilities in AI coding tools and the need for robust defenses against prompt injection attacks.
  • YouTube forces Shorts to Compete with TikTok: YouTube is pushing Shorts because it sees TikTok as an essential threat, and in doing so is driving away its most loyal users.
    • This is about market share and time spent on a video platform that is not YouTube, leading to lost revenue, with one member wondering if the recommendation algorithm can tell the difference between BS and real tech videos, referencing a YouTube video.

Manus.im Discord Discord

  • Manus AI: The Money-Making Machine: A user reported completing 5 apps and a client website using Manus AI, praising it as a money-making machine.
    • Each app used about 300 credits, while the client website required about 1000 credits.
  • Manus Vibe Coding Challenge Proposed: A member suggested Manus create a vibe coding challenge to help users build products without waiting for clients.
    • This user gets 30 global users daily without advertising, and their product ranks #1 when searching for flutter web Emulator.
  • Prompt Engineering Strategies Sought: A member is looking for tools or tricks to squeeze the best out of ChatGPT and test edge behavior.
    • The user implied some prompts perform better than others, and seeks expert advice to improve their prompt game.
  • Manus Fellow Awaits Response in Switzerland: An applicant for the Manus Fellowship in Switzerland is still waiting for a response after a long time.
    • The member plans to organize a meetup and hackathon in August, seeking the right contact person.
  • Tasks Vanish into Thin Air: Several users reported that their tasks disappeared within the Manus AI platform.
    • A potential fix suggested was to log out and then log back into the platform.

DSPy Discord

  • GEPA Reflects its Way to Prompt Perfection!: The paper GEPA: Reflective Prompt Evolution (https://arxiv.org/abs/2507.19457) introduces a method that treats prompts as documents, leading to a 10% performance increase compared to GRPO, with 35x fewer rollouts.
    • Members discussed how GEPA could be integrated into DSPy, offering an optimizer that uses a reflection model like gpt-4o, potentially as simple as optimizer = dspy.GEPA.
  • Context Engineering Defined in New Blogpost: Drew Breunig presented a talk on the importance of context engineering and summarized it in a blog post after an MLSys DSPy talk (YouTube link).
    • Inspired by the talk, some members inquired about the DSPy roadmap (GitHub link), speculating on the next areas of improvement.
  • Online RL Heats Up for Personalization: Members voiced interest in using Online RL for improving personalization and instruction relevance by grounding agents in who they’re helping.
    • One member has extended an offer to give access to anyone interested in joining the project.
  • GEPA Coming Soon as New DSPy Optimizer!: The DSPy team is set to release GEPA (SIMBAv2), a new optimizer outlined in the paper Reflective Prompt Evolution Can Outperform GRPO.
    • A team member confirmed GEPA > SIMBA > MIPROv2 in performance and that older optimizers will be deprecated in the documentation.

aider (Paul Gauthier) Discord

  • Disable Aider Autocommit, Enable Autotest: To disable auto-commits in Aider, users can add auto-commits: false to the ~/.aider.conf.yml file, while auto test can be enabled with --test-cmd <test-command> --auto-test flags as documented on Aider’s usage documentation.
    • The command /test <test-command> runs tests, with Aider expecting the command to print errors to stdout/stderr and return a non-zero exit code upon failure.
  • Qwen3-Coder’s High Cost & Context Concerns: Qwen3-Coder is considered expensive due to token caching issues and large context requirements, with pricing at $0.30 / $1.20 per Mtoken (fp4), according to this Reddit thread.
    • Despite a native context of 262,144, there are concerns about potential quality drops even when utilizing smaller contexts.
  • Community Wants AI Code Editor Benchmarks: Members are looking for reliable benchmarks to compare AI code editors like Aider, Kilo Code, and Cline, seeking a detailed look at what other features are the leading edge right now.
    • The benchmarks would ideally cover feature sets and leading-edge capabilities.
  • Aider’s Interactive Modes Explored: Users discussed their workflows with Aider’s modes, noting that /ask is often used multiple times followed by /code go ahead or implement it now instead of directly using /architect.
    • One user recommended using /ask to create a proper implementation plan and updating the todos.md checklist to improve the process.
  • Bypassing Aider’s Model Check for Kimi VL: A user reported an error when trying to use Kimi VL with Aider, due to Aider not recognizing the model’s image input support, and shared a screenshot of the error.
    • A workaround was suggested: adding the model to the config, bypassing explicit Aider support.
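
The contract described in the autotest item (print errors to stdout/stderr, return a non-zero exit code on failure) is easy to satisfy with a small script. The sketch below is a hypothetical stand-in for a real test suite; the file name `check.py` and the checks themselves are made up for illustration.

```python
import sys
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], bool]]) -> int:
    """Run each named check, report failures on stderr, return an exit code."""
    failures = [name for name, check in checks.items() if not check()]
    for name in failures:
        print(f"FAILED: {name}", file=sys.stderr)
    return 1 if failures else 0


def main() -> int:
    # Hypothetical checks standing in for a real project's tests.
    checks = {
        "addition works": lambda: 1 + 1 == 2,
        "upper-casing works": lambda: "aider".upper() == "AIDER",
    }
    return run_checks(checks)


if __name__ == "__main__":
    exit_code = main()
    if exit_code:
        sys.exit(exit_code)  # non-zero exit signals failure to the caller
```

Saved as `check.py`, a script like this could be passed with the flags quoted above, for example `--test-cmd "python check.py" --auto-test`, or run on demand with `/test python check.py`.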

Notebook LM Discord

  • Featured Notebooks Fully Fledged: The Featured Notebooks suite has been rolled out to 100% of users, now accessible from the NotebookLM homepage.
    • This rollout ensures that all users can directly explore and utilize the featured resources.
  • Academic Answers Aided and Abetted: A member uses Notebook LM as a specialized tool loaded with academic material to get grounded answers without relying on search engines or standard LLM summaries, and to check the original material.
    • They are refining a workflow using Comet/Assistant with Drive/Docs/NotebookLM.
  • Earnings Reports Energize AI Insights: One user leverages NotebookLM with corporate earnings reports and webcast transcripts from Q1 2025 detailing revenue, profit, and key business segment results.
    • They note these reports offer insights into the financial performance of global companies, and the AI is pretty good at working out how the tags are meant to apply to the whole thing.
  • Obsidian’s Organization Boosts Info Recall: A member suggested pairing NotebookLM with Obsidian to enhance information organization, parsing, and recall efficiency.
    • They emphasized Obsidian’s ability to structure information with dataview formatted frontmatter, enabling AI systems to parse and apply stored information effectively.
  • PDF Uploads Prompt Problems: A user reported issues uploading PDFs to NotebookLM, encountering an error uploading source, try again message despite being a paid user and having successfully uploaded the same files in the past.
    • Troubleshooting steps included restarting the computer, using a second device, and trying various upload methods, but the issue persisted specifically on the paid account, not the free one.

LlamaIndex Discord

  • LlamaIndex Plugs into S3: LlamaIndex now supports S3 with the new S3VectorStore, enabling scalable and cost-effective storage of vector embeddings.
    • This integration allows users to better manage and scale their vector data storage.
  • Agent Design Patterns Discovered: Seldo at the AI Dot Engineer summit discussed agent design patterns that succeed and fail at scale, covering hybrid workflows, autonomy vs structure, and debuggability.
    • Further details on these agent design patterns can be found at this link.
  • LlamaParse Gets Super Vision!: LlamaParse introduced new header and footer detection capabilities to improve document parsing.
  • Intent-Aware Semantic Search Seeking Schema: A user is exploring methods to enhance semantic search with intent awareness, noting that current dense embeddings often miss the specific intent behind queries, especially when dealing with questions like “What is the name of the customer?”
    • The user is evaluating the use of Knowledge Graphs (KGs) to explicitly represent relations and disambiguate query intent, but is uncertain about reliably querying schema-less KGs built with OpenIE.
  • Gemini Live is the Voice: LlamaIndex released a new integration with Google DeepMind Gemini enabling terminal interactions about the weather, accessible with a few lines of code.
    • Users can now speak with a voice assistant using this integration.
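
As a small illustration of the Knowledge Graph idea from the intent-aware semantic search item, the sketch below stores relations as explicit (subject, relation, object) triples so that a question like “What is the name of the customer?” maps to a lookup on a named relation instead of a fuzzy embedding match. The `TripleStore` class, the relation names, and the sample data are all hypothetical; this does not use any LlamaIndex API.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple


class TripleStore:
    """Minimal in-memory store of (subject, relation, object) triples."""

    def __init__(self) -> None:
        self._objects: DefaultDict[Tuple[str, str], Set[str]] = defaultdict(set)

    def add(self, subject: str, relation: str, obj: str) -> None:
        self._objects[(subject, relation)].add(obj)

    def query(self, subject: str, relation: str) -> List[str]:
        # .get avoids creating empty entries for unknown keys.
        return sorted(self._objects.get((subject, relation), ()))


if __name__ == "__main__":
    kg = TripleStore()
    # Hypothetical facts extracted from a document.
    kg.add("order_1001", "placed_by", "Acme Corp")
    kg.add("order_1001", "shipped_to", "Berlin")
    # "What is the name of the customer?" resolved to an explicit relation.
    print(kg.query("order_1001", "placed_by"))
```

The hard part raised in the discussion, mapping a free-text question onto the right relation when the graph has no fixed schema, is not solved by this sketch.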

MCP (Glama) Discord

  • Glama’s Tool Count Tumbleweed: A user reported an incorrect tool count on their Glama MCP server (one instead of six) and found Glama was cloning from a specific commit hash instead of the main branch (Glama Link).
    • Republishing did not initially resolve the issue, indicating a potential problem with how Glama handles repository versions.
  • JavaScript/TypeScript Linting Automation Quest: A member sought advice on automating JavaScript/TypeScript linting without causing disruptive code changes, and another member suggested typescript-eslint.io as a helpful resource.
    • The goal is to streamline code quality checks while minimizing the burden of extensive code refactoring.
  • Agent Payment Predicaments: A member questioned the need for separate agent payment systems from human payment systems, as current systems are not optimized for autonomous agents.
    • They argued that agents don’t “click pay,” they don’t pass CAPTCHAs, and they can’t navigate approval flows built for humans, highlighting the need for agent-native solutions.
  • The AI App Store Vision: A member proposed a future where AI companies evolve into identity and payment providers, giving rise to an AI App Store ecosystem.
    • The idea is that with local models becoming increasingly capable, these companies could remain relevant by becoming an AI App Store with profit sharing.
  • fast-agent Superpowers Unleashed!: fast-agent now supports Mermaid diagrams, and a user shared a method for creating an MCP expert by embedding URLs in system prompt templates.
    • The latest fast-agent version allows embedding URLs in system prompt templates (example template) to easily produce experts with a specific tone-of-voice.

tinygrad (George Hotz) Discord

  • tinygrad Meeting Tackles Kernels and MLPerf: Meeting #81 covered company updates, kernel loops, mlperf llama, viz tool, drivers, cloud hash, ONNX, and other bounties.
  • Llama3 Cruising on tinygrad, no Pilot’s License Required: Running python3 examples/llama3.py --size 8B shows 16.06 GB RAM used and loads weights in 7332.03 ms at 2.19 GB/s, then starts a Bottle v0.13.3 server listening on port 7776.
    • This indicates efficient handling of large language models within the tinygrad framework.
  • Disk Raw Benchmarks Put to the Test: The command python3 test/external/external_benchmark_disk_raw.py showed CPU copying at 1.4 GB/s from disk and using AMD=1 improved the copy speed to 9.8 GB/s from disk to AMD GPU.
    • This was validated by dd command achieving similar 9.9 GB/s, highlighting the accelerated disk I/O performance on AMD GPUs.
  • MLPerf BERT Bounty Regression: A merged PR that initially met the bounty criteria with BERT running at BS=84, now only runs at BS=48, exceeding the 20% overhead target due to faster step size and constant transfer overhead.
    • George Hotz requested the script to validate the bounty such as REMOTE=1 HOST=192.168.200.4:6667*3,192.168.200.6:6667*3 ... python3 examples/mlperf/model_train.py.
  • Tinygrad Kernel Explanation Quest: A member sought clarification on George Hotz’s message in the theory channel concerning Tinygrad kernels.
    • The request specified a simple graph/kernel example to aid in comprehending the discussed concepts.
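The throughput figures quoted in the items above can be sanity-checked with a little arithmetic. The sketch below uses only numbers from those items, and it assumes the 16.06 GB of RAM used corresponds to the amount of weight data loaded (the summary does not say so explicitly).

```python
# Sanity check of the figures quoted above. Assumption: the 16.06 GB of
# RAM used is the amount of weight data that was loaded.
weights_gb = 16.06
load_ms = 7332.03

load_gb_per_s = weights_gb / (load_ms / 1000.0)
print(f"weight load: {load_gb_per_s:.2f} GB/s")

# Disk raw benchmark: CPU copy vs. disk-to-AMD-GPU copy (AMD=1).
cpu_gb_per_s = 1.4
amd_gb_per_s = 9.8
speedup = amd_gb_per_s / cpu_gb_per_s
print(f"AMD=1 speedup over CPU copy: {speedup:.1f}x")
```

Under that assumption the load rate agrees with the reported 2.19 GB/s, and the AMD path is roughly seven times faster than the CPU copy.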

Cohere Discord

  • Command-R-Plus Sunsetted for New Models: The previous generation command-r series models will be deprecated, with a recommendation to switch to command-r-plus-08-2024 or directly to the latest and best model command-a-03-2025.
    • This discussion occurred in the #general-thread channel.
  • LLM Testing Guidance Requested: A member inquired about testing customer facing chat LLMs for product search, recommendations, multi-turn conversations, and purchase drivers, with a suggestion to try LLMU.
  • Cohere API Hiccups Reported: A user reported encountering a 422 error accessing the Cohere API through “KILO CODE”, while a community member clarified Cohere doesn’t natively support Kilo Code, suggesting a 402 error (payment-related) might be the issue, linking Cohere’s error documentation.
    • The discussion took place in #api-discussions.
  • Fine-Tuning Frustration on Cohere Dashboard: A member sought help troubleshooting a fine-tuning failure on the Cohere dashboard, having already contacted support, and later shared sample conversations from the fine-tuning database that included system, user, and chatbot roles in Burmese.
    • This issue was raised in the #api-discussions channel.
  • Community Members Introduce AI Interests: Members introduced themselves in the #introduce-yourself channel, including a member from the University of Belgrade exploring recursive systems and symbolic logic through modular environments, an AI cyber security engineer focusing on AI red team work, a Solution Architect from Viet Nam focusing on AI Engineering and MLOps, a Mechatronics graduate exploring the application of AI in robotics, and a software engineer diving into ML research and agentic modelling.
    • No further context was provided.

LLM Agents (Berkeley MOOC) Discord

  • Students Get Certificate Status on Lock: A student asked how to get their LLM Agents MOOC series certificate, and a member replied that certificates have already been released to all eligible students.
    • Students can find more information on the course website.
  • Open Source Instruction Training Questioned: A member inquired if a model’s instruction following capabilities increase as the size increases for open source instruction trained versus closed source models like ChatGPT-4o.
    • While no papers were directly linked, the community has a continued interest in the open source vs closed source debate.
  • MOOC Students Get Quiz Access: A member asked about the possibility of reopening quizzes from the previous cohort for learning purposes, and a link to the archived quizzes was shared.
    • These are also available on the course website under the Quizzes section.
  • Quizzes Get More Archival: A user requested archived quizzes from the previous 2024 cohort [https://llmagents-learning.org/f24].
    • A member provided a link to an archive of the quizzes, which are also listed in the quiz section of the course webpage.

Nomic.ai (GPT4All) Discord

  • M1 Max Macs - Still Viable?: A member asked if M1 Max machines are still viable for running local projects, given their current attractive prices.
    • No opinions were offered in response.
  • Discord Mass Invite Fiasco: A member apologized for a mass invite sent via Discord, claiming they didn’t initiate it themselves and shared a Discord support article.
    • A user joked they should change your password - stop looking at porn.
  • Blockchain Architect & AI/ML Specialist Arrives: A Senior Software Engineer introduced themselves as a Blockchain Architect & AI/ML Specialist with 9+ years of experience building scalable, high-performance apps.
    • They listed skills like Blockchain (Ethereum, Polygon, Binance Smart Chain, Solana, Cardano, Arbitrum, Rust, Solidity, Web3.js, Ethers.js) and AI/ML (TensorFlow, PyTorch, scikit-learn, OpenAI API).
  • Developer Seeks Collaborations: A developer expressed their passion for coding and is seeking collaboration opportunities.
    • They invited recommendations and projects to join.

Torchtune Discord

  • DCP Springing Security Leaks?: Concerns were raised that DCP might be leaking information, potentially through unexpected timeouts.
    • The member emphasized that a timeout is weird and could introduce security vulnerabilities.
  • RL Tests Crawling, Not Running: A member flagged the excessive runtime of RL tests, noting durations exceeding 1 hour.
    • They confidently asserted that it is a bug for 100%.
  • Separate PR Launches for CI Bug Hunt: A dedicated effort is underway to debug the CI through a separate pull request.
    • This indicates a strategic approach to isolate and resolve specific issues within the CI pipeline.
  • Paper Surfaces on arXiv: A member shared a link to a paper on arXiv, prompting discussion about its potential interest and relevance.
    • No details were given about the title or subject of the paper.

The MLOps @Chipro Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The Codeium (Windsurf) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The Gorilla LLM (Berkeley Function Calling) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.



Discord: Detailed by-Channel summaries and links

Unsloth AI (Daniel Han) ▷ #general (1157 messages🔥🔥🔥):

Liquid LFM2 models, Muting audio in Python, Detecting start and end of audio wave, Doctors recommending more salt and carbs, Side effects of long term AI usage on the brain

  • Liquid LFM2 models look pretty cool!: A member asked about the Liquid LFM2 350M, 700M, and 1.2B models, and another member responded that the best thing about them is that they are diffusion models, adding that they're super cool.
  • Decoding doctor's orders - more carbs and salt?: After visiting a doctor, a member was told to eat more carbs and salt, prompting discussion and a shared image.
    • Later the member specified that this was in response to High blood pressure or something.
  • 1TB VRAM when?: After an image of a memory module was shared, many members imagined the possibilities with 1TB VRAM.
    • Some members joked about using that much VRAM for adult stuff.
  • Unsloth is enabling cybercrime and nobody can stop them: A member reported hearing a noise (around 10,000 Hz by ear) while inferring a model on their MacBook, with others joking that operations could be deciphered from that sound.
    • The member summarized: People invented computers; now you’re stealing my data by a goddamn sound. People invented the internet; brainrot arrived. People invented AI; mass propaganda.
  • Vibe coding with FIPS: A member vibe coded FIPS mode into their project, generating discussion on its validation and FIPS compliance.
    • Another member criticized the implementation, claiming their “secure” stuff is out of date by 5 years.

Unsloth AI (Daniel Han) ▷ #introduce-yourself (3 messages):

Self-hosting, Homelab, vllm, Ollama / llama.cpp, Schema enforcement

  • Newcomer Starts Self-Hosting Journey: A member has started to self-host some models and is building a new homelab, anticipating asking for advice in the future.
    • The member is diving into implementations by servers like vllm, ollama / llama.cpp, especially regarding schema enforcement, prompt engineering, and tool registration.
  • Server Implementations Explored: The member is diving into how various servers (vllm, ollama / llama.cpp) implement things like schema enforcement, modifying prompts for thinking/non-thinking, and registering tools.
    • The member finds the weight manipulation aspect particularly interesting.

Unsloth AI (Daniel Han) ▷ #off-topic (1 messages):

jamessmith1990526: Hi ,Roland


Unsloth AI (Daniel Han) ▷ #help (284 messages🔥🔥):

Gemma 3 fine-tuning errors, SFT Dataset Format, Training Time for Gemma 3 12b, Qwen3 model, GGUF Conversion

  • Gemma 3 Fine-Tuning Frustrations Surface: A member reported an AttributeError when trying to save a fine-tuned Gemma3-12b model from LoRA checkpoints as a GGUF file, specifically related to a missing save_pretrained_merged attribute.
    • Roland Tannous recommended manually converting using the llama.cpp conversion script, as the save_to_gguf logic is being reworked and not functioning as expected.
  • SFT Dataset Format Deep Dive: A member inquired about the correct data format for SFT, asking if data in prompt, completion format should be mapped to an array of { "prompt" : prompt_content}, {"completion" : completion_content}.
    • One member clarified that for SFT with prompt completion, one should use two JSON objects containing input and output (prompt and completion), removing the text key, which would otherwise run raw completion on the data as a string.
  • Medical Model Mishaps: A member reported problems training a model and exporting it to Ollama while following the steps in the Unsloth notebook.
    • Roland Tannous reiterated that saving to GGUF needs to be done manually, because automatic logic is temporarily unavailable.
  • Tokenizer Troubles with Gemma3: A user encountered an AttributeError: 'Gemma3Processor' object has no attribute 'encode' error, even after upgrading, while using tokenizer.encode(value).
    • It was clarified that for Gemma3 1B (a text model), the tokenizer is a simple tokenizer, while the larger models (vision and text) use a processor, sitting one layer deeper, so that tokenizer.tokenizer.encode is the correct implementation.
  • ZeroDivisionError Strikes Training Loop: A member ran into a ZeroDivisionError: division by zero during training, traced back to the loss function and cut_cross_entropy/cce.py, after updating to a newer version of Unsloth.
    • It was observed that all label values were being set to -100, but no direct solution was provided.
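The observation above, that every label was set to -100, points to one plausible cause of the error: if the loss is averaged over non-ignored tokens and there are none, the divisor is zero. The sketch below illustrates that mechanism in plain Python. It is not the code in cut_cross_entropy/cce.py, and `count_supervised_tokens` and `masked_mean` are hypothetical helper names introduced for illustration.

```python
IGNORE_INDEX = -100  # label value conventionally skipped by the loss


def count_supervised_tokens(labels):
    """Count label positions that would actually contribute to the loss."""
    return sum(1 for row in labels for tok in row if tok != IGNORE_INDEX)


def masked_mean(per_token_losses, labels):
    """Average per-token losses over positions whose label is not ignored."""
    total = 0.0
    count = 0
    for loss_row, label_row in zip(per_token_losses, labels):
        for loss, label in zip(loss_row, label_row):
            if label != IGNORE_INDEX:
                total += loss
                count += 1
    # With every label masked, count is 0 and this division fails.
    return total / count


# A batch in which every label is masked, as described in the report.
labels = [[-100, -100, -100], [-100, -100, -100]]
losses = [[0.7, 1.2, 0.3], [0.9, 0.4, 1.1]]
print(count_supervised_tokens(labels))
```

A check like `count_supervised_tokens` run on a few batches before training would at least confirm whether the masking step (for example, a response-only template that never matches) is removing every target token.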

Unsloth AI (Daniel Han) ▷ #showcase (10 messages🔥):

Japanese TTS Release, Geminized Qwen3 Model, Gemma 3 4b Finetune with GRPO, Finetuning Gemma 3 4b instruct

  • TTS Model speaks Japanese!: VoiceCore, a new Japanese TTS model finetuned from orpheus-tts, was released and is available on Hugging Face.
  • Qwen goes Gemini!: A new geminized Qwen3 model was released and is available on Hugging Face and in GGUF format; its reasoning is thought to be less likely to get stuck than DeepSeek-style thinking.
  • Gemma gets GRPO’d!: The first finetune of Gemma 3 4b using Unsloth via GRPO was released on Hugging Face, along with training code and an example of early stopping based on the reward curve.
    • The creator stated that they are happy to receive any criticism.
  • Gemma instruct loves Fiction!: Gemma 3 4b instruct was finetuned with approximately 6900 chapters of fiction from public domain works using the newest Unsloth updates and is available on Hugging Face and as a Q_8 gguf.

Unsloth AI (Daniel Han) ▷ #research (54 messages🔥):

Transformers vs Unsloth, Gemma 3/3n with GRPO, HRM Model, Video-Language Finetuning, LLM Quantization

  • Transformers Trumps Unsloth, Maybe?: Members debated whether anything that works in transformers should also work in unsloth, but someone pointed out that FLAN-T5 is not in the list of supported models.
  • Gemma GRPO Grindset?: One member asked about success stories of finetuning Gemma 3/3n with GRPO, noting that they found cheap copies of Unsloth’s notebooks with no results and a paper where the authors couldn’t improve the base model at all.
    • Research suggests Qwen 3 responds much better to GRPO, which prompted speculation that Gemma (base model) doesn’t have any RL.
  • Hierarchical Reasoning Model: A user shared a GitHub repo for the Hierarchical Reasoning Model (HRM), a novel recurrent architecture that attains significant computational depth while maintaining both training stability and efficiency, achieving nearly perfect performance on challenging tasks including complex Sudoku puzzles and optimal path finding in large mazes.
    • They inquired if it is considered difficult to train a sudoku model, and pondered whether it would be an ideal task for a small MLP to learn.
  • Video-Language Finetuning Ventures: A user asked if anyone has used unsloth for video-language finetuning with Qwen 2.5VL, and whether it is supported.
    • Another user confirmed that it works but noted there are currently no notebook examples; they also ran into issues loading video, specifically errors using the convert_to_conversation function.
  • LLM Quantization Geometry: A member shared the paper The Geometry of LLM Quantization: GPTQ as Babai’s Nearest Plane Algorithm.

Unsloth AI (Daniel Han) ▷ #unsloth-bot (141 messages🔥🔥):

Best models under 1B parameters, Qwen model release, Unsloth soft prompt tuning, Model selection for fine-tuning, Training limits on Gemma 3

  • Gemma Models Get the Nod for Sub-1B Parameter Tasks: A member inquired about the best model with less than or equal to 1B parameters, and was pointed to Gemma 3 1B as a top contender.
  • Soft Prompt Tuning with Unsloth still in progress: A user inquired about using Unsloth for soft prompt tuning with a vision model (specifically Gemma 3:4b), but noted experiencing issues.
    • They confirmed that combining PEFT and Unsloth for prompt tuning isn’t yet fully supported.
  • Pick Your Poison: Model Selection Strategies: A user asked for advice on whether to fine-tune a 1.5B (e.g., TinyLlama, Qwen, DeepSeek), 3B (Phi-3 Mini Reasoning), or 7B model for their specific use case.
  • Gemma’s 4-bit Fantastic Options and Flavors: A user asked about the different versions of 4-bit quantized Gemma 3 4b it models in the Unsloth collection, asking which is best and whether load_in_4bit = True is sufficient after downloading.
    • A member pointed out that any model with the ‘unsloth’ prefix is dynamic and referred to the Unsloth documentation for clarification.

LMArena ▷ #general (1209 messages🔥🔥🔥):

GPT-5 Speculation, LM Arena Model Testing, Model Evaluation and Benchmarking, Open Source Models and Alternatives, Apple's AI Strategy

  • GPT-5 Release Window Speculation Runs Wild: Members speculated on the timing of GPT-5’s release, with estimates ranging from next Thursday to early next month and the EU AI Act cited as a potential influence on timing.
  • Summit and Zenith Temporarily Missing from LM Arena: Users noticed the disappearance of Summit and Zenith from the LM Arena, sparking concerns and speculation about their removal.
  • Contamination Concerns with LM Arena Models: Discussions arose regarding potential data contamination in Zenith, with one user reporting 10/10 on the public Simple Bench dataset, raising questions about the reliability of benchmarks due to potential training on benchmark data.
    • Some members claim it’s easy to score highly on benchmarks and doesn’t mean Zenith is generally better.
  • Apple’s AI and Hardware Strategy Debated: A discussion ensued regarding Apple’s approach to AI, hardware, and its relationship with China. Members debated whether Apple is focusing enough on AI and whether it should offer its hardware to datacenters for AI development, and highlighted the difficulty of creating a CUDA alternative.
    • Members traded views on whether Apple is following a better strategy in focusing on mobile / on-device inference and privacy and whether the talent / workforce quality is the primary issue.
  • Gemini’s Varying Performance: Users pointed out inconsistencies in Gemini 2.5 Pro’s capabilities, noting its strong performance in coding evaluations but also its tendency to include excessive comments rather than actual code.

LMArena ▷ #announcements (1 messages):

GLM-4.5, GLM-4.5 Air

  • GLM-4.5 Flies into LMArena!: The GLM-4.5 and GLM-4.5 Air models have been newly added to the LMArena leaderboard.
    • Users with the associated role can now vote on the new models.
  • Double the GLM, Double the Fun: With the addition of GLM-4.5 and GLM-4.5 Air, the community can now test the nuances between these two models on practical tasks.
    • Initial impressions suggest a focus on efficiency in the ‘Air’ variant.

OpenAI ▷ #ai-discussions (838 messages🔥🔥🔥):

Image generation, Color bias in AI, GPT image generation quality, Mind Uploading, AI's role in mental health

  • AI models exhibit Yellow-Orange Bias: AI image generators tend to lean towards warmer “Golden Hour” tones unless explicitly instructed otherwise, prompting users to specify color temperatures like 6000K to counteract this bias; one member shares a color average of an image as a demonstration.
    • One explanation offered was that the models have learned that orange/blue contrast makes images look better, which is part of the reason for the yellow-orange bias.
  • GPT Image Generation Quality Plummets: Users report a decline in image generation quality with GPT-4, with images now appearing blurry even for simple prompts in new chats; bug reports are encouraged in <#1070006915414900886>.
    • The consensus is that the free tier has reduced quality, and quality degrades further when Plus/Pro users are throttled due to traffic: A lot of things are reduced for you, including quality. That’s been documented by a lot of publications.
  • Mind Uploading Ethics Debated: The hypothetical technology to upload a human mind to a computer sparked a debate on what constitutes consciousness, the importance of physical changes in biological computers, and if a digital mind upload would be considered a conscious entity.
    • One member stated that if mind uploading was possible, consciousness could be seen as a process, not a substance, and another replied: Mind uploading, if possible, would probably not be a simple scan of the brain because that does not really capture the continuous brain activity and constant firing of the neurons which is a really important part of what makes you, you.
  • ChatGPT’s therapy role questioned: Following Sam Altman’s warning about the use of ChatGPT as a therapist, there were mixed opinions shared regarding the use of AI for mental health support.
    • While some see the potential for AI to provide accessible and empathetic support, there are concerns about its limitations in handling complex mental health issues and the potential for exacerbating delusions or manic thinking, particularly if AI models repeatedly validate certain parts of their interactions.
  • Potential future job markets threatened by AI and Automation: Members discuss investment in AI meaning that one day there’s gonna be a return on investment, expressing concern that AI’s capabilities will lead to job displacement across various sectors.
    • Some community members expressed a pessimistic view, with one stating, the risks aren’t even worth whatever potential they could bring, with others responding that humans have evolved from various difficult issues with one member joking wasnt that a computer bug.
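The color-average demonstration mentioned in the yellow-orange bias item at the top of this list can be reproduced in a few lines. This is a minimal sketch, assuming pixels are available as (R, G, B) tuples in the 0-255 range. The `warm_bias` score (mean red minus mean blue) is a crude proxy introduced here for illustration, not a colorimetric measure of color temperature and not the member's actual method.

```python
def average_rgb(pixels):
    """Return the mean (R, G, B) of an iterable of 0-255 RGB tuples."""
    pixels = list(pixels)
    n = len(pixels)
    r = sum(p[0] for p in pixels) / n
    g = sum(p[1] for p in pixels) / n
    b = sum(p[2] for p in pixels) / n
    return (r, g, b)


def warm_bias(pixels):
    """Crude warmth proxy: mean red minus mean blue (positive = warmer)."""
    r, _, b = average_rgb(pixels)
    return r - b


# Two hand-picked "golden hour" style pixels, for illustration only.
golden_hour = [(250, 180, 90), (230, 160, 80)]
print(average_rgb(golden_hour), warm_bias(golden_hour))
```

Comparing this score across generations with and without an explicit color temperature in the prompt gives a rough way to see whether the instruction had any effect.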

OpenAI ▷ #gpt-4-discussions (20 messages🔥):

GPT-4o Coding Performance, Zenith Model, GPT-5 Speculation, GPT @mentions bug

  • GPT-4o’s Coding Chops Debated: Members are questioning the usefulness of GPT-4o for coding, with some finding it disappointing for tasks like working with cline, while others use it for basic coding questions, switching to GPT-4.1 or o4-mini-high for more complex code.
    • One member expressed they will switch to GPT-5 when it proves itself worthy.
  • Zenith Model Excels at Creative Prose: Members discuss the Zenith model from the LM arena, noting its aptitude for creative writing and speculation that it might be a GPT-5 variant.
    • One member claims they can identify the model 10/10 times before it finishes writing due to its avoidance of typical AI slop words like echo or whisper.
  • Custom GPT @mentions bug reported: A member reported that @mentions were not working within native GPTs on desktop, but were functioning on mobile.
    • Another member suggested it was a bug and directed them to a specific channel to report it, however, the first member later reported that it started working.

OpenAI ▷ #prompt-engineering (50 messages🔥):

emotional structuring through prompts, clarity vs ambiguity in prompts, core of prompt engineering, anti-sycophancy custom instructions, training the model

  • Prompt Engineering Core Principles Emerge: The core of prompt engineering involves picking a well-understood language, knowing what you want, explaining it accurately, and verifying the output, with special attention to fact-checking and potential hallucinations.
    • It was stated that prompt engineering is the art and practice of figuring out where possible, and inside allowed content, how to get the exact desired output from the model.
  • Custom Instructions Taming Chatbot: One member uses custom instructions for measured and calm analysis and careful thought, and anti-sycophancy for code projects.
    • This shapes the model’s responses with a measured/academic attitude, avoiding rushing.
  • Explicit Instructions Trump Negative Directives: Instead of ‘don’t do that’, give examples of what you DO want and show the model what a great reply should look like to guide its output.
    • If you can’t at all get it to open with the goal, try for it to say something ‘less bad’ and easy to tolerate.
  • Long Instructions Risk Conflicts: A key principle is to avoid conflicting instructions; the longer your instructions, the more room there is for one thing you said to conflict with another.
    • If you can build long instructions and can prevent or resolve conflicts, that can work but if you don’t already wanna use a long-form, there’s no reason to switch.
  • Large Memories Impact Chatbot: A member said that memories can contain thousands of characters; what matters is the content, when each memory should be used, and how any conflicts are resolved.
    • There was an issue with strange behavior that could persist if we let the memory get over 100%.

OpenAI ▷ #api-discussions (50 messages🔥):

Prompt Engineering, Training the model, Custom Instructions, Model Memories, Blog Post writing with ChatGPT

  • Prompt Engineering Core Principles: Prompt engineering revolves around understanding the AI’s language, defining desired outputs, clear communication, and meticulous output verification including fact-checking; focusing on what you want the model to do is key.
    • Focusing 99-100% on ‘what do I want the model to do’ can allow you to share that with the model to guide the responses.
  • Training Model to Avoid Introductory Connector Statements: Instead of negative directives, provide examples of desired outputs, and if the model has to include an opener, aim for a very short, tolerable phrase.
    • Flip the script: give examples of what you do want, and if mentioning bad examples, detail why they’re bad and how they could be improved.
  • Analogizing Model Behavior to Dog Training: Models, like dogs, require a clear task to avoid unwanted actions; instead of just saying ‘no’, provide an alternative action to redirect the model’s output.
    • Give the model another task to do that gets in the way of the unwanted action: just as you would offer a dog a treat or tell it to come to you, give the model something else to focus on to replace the bad output.
  • The Art of Long-Form Instruction and Conflict Resolution: While long-form instructions can be effective, avoiding conflicting instructions is crucial; the more instructions, the higher the risk of conflict and the more difficult it becomes to resolve.
    • If you can build long instructions and can prevent or resolve conflicts, that can work; you can ask the model what it means to it, and if there’s any conflicts or ambiguities, and then discuss and learn how to say for the model exactly what you mean to express.
  • Optimizing Model Memories for Peak Performance: The content and clarity of model memories matter more than their length; ensure the model understands when to use each memory and how to resolve conflicts for desired outputs.
    • There was a time when strange behaviors were seen if the memories got over 100%, but whether that still matters, there is no intention to risk it by letting memories get to that capacity.

OpenRouter (Alex Atallah) ▷ #announcements (1 messages):

toven: Chutes and Targon are experiencing downtime. Users are reporting a spike in 502s


OpenRouter (Alex Atallah) ▷ #app-showcase (31 messages🔥):

Ramparts security scanner, Model Context Protocol (MCP), Tool interface vulnerabilities, t3.chat sync DB, Cloudflare R2 storage

  • Ramparts Open Sourced for MCP Security Scanning!: Javelin AI open-sourced Ramparts, a security scanner for the Model Context Protocol (MCP), designed to identify vulnerabilities in LLM agent tool interfaces, including path traversal and command/SQL injection.
  • t3.chat Boasts Superior Sync DB!: A member commented that yourchat.pro (3s load time) is shittier than shitterrible, no where near the t3.chat quality, noting t3.chat’s sync DB is pretty fire (0.3s load time).
    • Another member agreed, stating i dont think he understands how complicated t3.chat kinda is.
  • Cloudflare R2 wins storage speed race!: A member mentioned they use Cloudflare R2 storage after moving away from a self-hosted CDN due to latency issues (2x latency or higher).
    • When asked about which model was used, the member mentioned theres 42 models that were being used.
  • Kimi K2 Fact Checks Claude Code: A member mentioned that Kimi K2 is useful inside Claude Code to fact-check Claude’s plans and linked to Consult Kimi K2 inside Claude Code.
    • They also mentioned they did work with the new subagents to make it easier to use inside Claude Code.
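Path traversal, one of the vulnerability classes named in the Ramparts item above, is easy to illustrate. The sketch below is not how Ramparts works; it is a minimal, POSIX-oriented example of the kind of guard a file-reading tool might apply, and `is_within_root` is a hypothetical helper name.

```python
import os


def is_within_root(root, user_path):
    """Return True if user_path, resolved against root, stays inside root."""
    root_real = os.path.realpath(root)
    # os.path.join discards root_real when user_path is absolute, so an
    # absolute path such as /etc/passwd is checked as-is.
    target_real = os.path.realpath(os.path.join(root_real, user_path))
    return os.path.commonpath([root_real, target_real]) == root_real


print(is_within_root("/srv/data", "reports/a.txt"))
print(is_within_root("/srv/data", "../../etc/passwd"))
```

Resolving with `realpath` before comparing means that `..` segments and symlinks are collapsed first, which is the step naive string-prefix checks tend to miss.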

OpenRouter (Alex Atallah) ▷ #general (1031 messages🔥🔥🔥):

OpenRouter Rate Limits, NSFW Content with Bots, Alternative Models to Deepseek, Slenderman and Creepypasta Bots, Payment Issues on OpenRouter

  • Unveiling the Long-Awaited OpenRouter Rate Limits: A user inquired about rate limits on OpenRouter, and another member clarified that there are virtually no rate limits as long as you have funds, except for Cloudflare’s DDoS protection, with the main limitation stemming from the provider’s end.
    • Discussions also touched on strategies for managing free daily requests, with the consensus being that if you hit a rate limit on free models, it’s due to high demand and provider capacity, recommending switching to a paid version or trying a different model.
  • Sexing Bots: The NSFW Truth: A user jokingly suggested that using models for creative tasks and writing is a sanitized version of engaging in sexual interactions with bots.
    • Another countered, emphasizing the need for models that require non-slop and non-no-filter content, steering the conversation towards high-quality creative outputs rather than explicit content.
  • Exploring Alternative Models to Deepseek’s Excellence: With Deepseek being a favorite for its long-term context and detailed character descriptions, users sought alternatives due to its current downtime.
    • Recommendations included Qwen3, Claude, and various others with users also warning to AVOID THE MODEL GIVING ACHIEVEMENTS OR BAD ENDINGS TO STORIES.
  • Creepypasta Reborn: Slenderman Chatbot Mania: A user requested recommendations for creepypasta bots, sparking a conversation about favorite bots like the Slender Mansion and discussions about the horror genre’s resurgence.
    • The group also discussed the ethics of cannibalism with the chatbots.
  • Navigating Payment Predicaments on OpenRouter: Several users reported issues with adding credits, with debit card payments being declined and Amazon Pay not working, with most of the dev team being offline till morning.
    • Some suggested the issue might be region-specific (e.g., UK), while others confirmed similar problems with US-based cards; a suggested workaround was to create a new OpenRouter API key to refresh rate limits.
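For the provider-side rate limits described in the first item of this list, and the 502 spikes noted in the announcements channel, a common client-side response is to retry with exponential backoff. The sketch below is a generic illustration, not part of any OpenRouter client. `send` is a hypothetical callable returning a `(status, body)` pair, and treating 429 and 502 as retryable is an assumption.

```python
import random
import time

RETRYABLE = {429, 502}  # assumption: rate-limited or bad-gateway responses


def call_with_backoff(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            # Double the wait each attempt and add jitter to avoid bursts.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    return status, body
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting; in practice the default `time.sleep` would be used.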

OpenRouter (Alex Atallah) ▷ #new-models (2 messages):

  • No new models discussed: There were no new models discussed in the provided messages.
    • The channel activity consisted only of reminders from Readybot.
  • Readybot keeps watch: Readybot.io provided reminders that this is the OpenRouter - New Models channel.
    • The bot did not provide any discussion or links, merely a channel label.

OpenRouter (Alex Atallah) ▷ #discussion (35 messages🔥):

Wandb Inference vs OpenRouter, Compute Exchange for Spare GPUs, Payment Processing Complaints, Power User Tool for OR APIs, Chutes Pricing and Reliability

  • Wandb Inference Joins the Ring: Members discussed whether Wandb Inference is a competitor to OpenRouter, but it was noted that it’s just another GPU (Coreweave) wrapper and OpenRouter has a ton of providers they could onboard eventually.
    • A member estimated there may be 30+ providers.
  • OpenRouter Considering GPU Exchange: There was discussion of OpenRouter spinning up a compute exchange where groups with spare compute can put it to work, where OpenRouter provides the demand and a one click install image.
    • It was compared to Bittensor, but one member noted that Bittensor has the user base and is crypto based.
  • OpenRouter Grapples with Payment Problems: Some users reported problems with payment processing.
    • Example links were provided of the issues.
  • New Power User Tool: A user has created a prototype/alpha, a power user tool for browsing/aggregation/filtering of the OR APIs (link to the tool).
    • The tool is designed to help quickly understand what people are talking about in the OpenRouter channels and surface more info for the inference connoisseur.
  • Free Models Won’t Hide: A user reported that they could not hide free models in the OpenRouter UI.
    • They further noted that using chutes pricing for a model seems like not the best decision, especially since chutes cannot verify if all nodes run R1 and at the same quant.

Moonshot AI (Kimi K-2) ▷ #general-chat (925 messages🔥🔥🔥):

Kimi K2, Gemini, Claude, Open Source Models, Agentic coding tools

  • Kimi K2: The Relatable Model: Users express satisfaction with Kimi K2’s ability to discuss niche topics and its relatable, savage tone, with one user stating it’s the model they’ve used with zero complaints.
    • Some members find Kimi K2 better than Gemini 2.5 Pro, describing Gemini as slop and cringe, whereas Kimi is decent at code, tool use, writing, and world knowledge with great vibes.
  • Kimi K2 versus Gemini: Members complain about Gemini’s frequent syntax errors, shallow and incomplete tool calls, and unpredictable pricing, while praising Kimi K2 as flexible, cheap, and good at vibes.
    • Google’s tendency to discontinue good products and inflate prices are also cited as reasons for disliking Gemini, whereas some members say that Gemini’s AI Studio users are out of touch with issues faced by paying users, because they have free access to Gemini 2.5 Pro.
  • Agentic coding: Roo Code and Cline are excellent: Members discussed Roo Code and Cline as great coding tools that work with other models and are also optimized for agentic use.
    • A user who is soon offering a Kimi K2 flat-rate server plans to recommend Cline and Roo Code to its users.
  • Model Training on Copyrighted Data: Members discuss the controversy around models being trained on copyrighted material, and Anthropic’s ongoing lawsuit for copyright infringement for Claude.
    • Some members feel that open source models trained on copyrighted data are acceptable, but are wary of closed source models scraping data for free and monetizing it.

Cursor Community ▷ #general (546 messagesđŸ”„đŸ”„đŸ”„):

Cursor Auto Mode, Cursor's new pricing, Claude Code integration with Cursor, Qwen3 Coder vs Claude Sonnet 4, Cursor performance issues

  • Cursor’s auto mode is the new hit: Users on the $20 plan are relegated to using Auto due to new pricing, but find it pretty damn good, delivering unexpected value and negating the need to switch between different models like Gemini, Claude, and GPT 4.
    ‱ However, others suspect Auto is just the cursor-small model in disguise and doesn’t work for major processes like bug fixing and full scripting, while some find that Cursor is transparent about the model being used.
  • Cursor’s New Pricing Changes are a topic of discussion: A user expressed that Cursor’s new pricing has too many people complaining on Reddit, and when reporting a bug a user was roasted by the community.
    • Another shared that it’s now a token-based system where the Pro plan covers about 225 Sonnet 4, 550 Gemini, or 650 GPT 4.1 requests, and while some users are burning through credits, others find Auto provides unlimited usage with rate limits.
  ‱ Swarm CLI allows agentic models to work around restrictions: Swarm is a project built on top of Claude Code that creates an IDE-like feel with drag-and-drop functionality and real-time chat with AI agents in a swarm. It includes an ADR and retrospective setup with live documentation, project enhancement, and more, all for free.
    ‱ A user asked if the project was ready to use on live projects, and a member responded that it can now build projects such as a SaaS e-commerce platform and a calculator app.
  • Qwen3 Coder and Kimi K2 are better than Claude 4?: Qwen3 Coder from Alibaba is touted as a solid alternative to Claude Sonnet 4, being 7x cheaper and fully open source, to build anything using its CLI tool.
    ‱ Kimi K2 is also seriously good and claimed to be optimized for writing documents, whereas other models like Claude sound dumb.
  ‱ Cursor has gone Crazy?: Users are reporting random issues with Cursor, citing command lines getting stuck, the chat window freezing, Powershell being chosen at random, and the current prompt being deleted.
    • One user shared, lately, using Cursor IDE has been getting worse by the day, even with Claude 4 Sonnet. It’s throwing more errors, stumbling back and forth like carrying a bag of fruit with holes.

Cursor Community ▷ #background-agents (5 messages):

Background Agents Bugs, Git push fails, Support quality, lack of remote connection

  • Background Agents Plagued by Bugs: Users are reporting that Cursor’s Background Agents have become increasingly buggy, with issues like environments failing to spin up and follow-up requests not being processed.
    • Despite being in beta, these bugs are costing users real money due to the forced use of max models, and there’s frustration that the product seems to be regressing in stability.
  • Git Push Fails and Large Core Files: Users have reported delayed or failed git pushes with Cursor’s Background Agents, often due to the agent accidentally committing a large ‘core’ file.
    • A workaround involves instructing the agent to check for and remove the core file, and adding core to .gitignore to prevent future issues.
  • Cursor Support Fails to Understand Background Agents: Users have found that Cursor’s support team doesn’t seem to understand Background Agents, providing generic answers or admitting they don’t know how to help.
    • The low activity on the Discord channel suggests that the team is swamped with requests and struggling to keep up in a competitive environment.
  • Background Agent Lacks Remote Connection: One user reported that the background agent doesn’t even connect to the remote in the main IDE, preventing manual pushing of changes.
    • The absence of a terminal in the online agent’s environment leads to a token-wasting back-and-forth, with the user instructing the agent to push and the agent suggesting to use the terminal.

LM Studio ▷ #general (293 messagesđŸ”„đŸ”„):

Qwen3-Coder New Models, LM Studio Plugins, LM Studio Updates on Remote Box, GPU recommendation, LLM for career advice

  • Qwen3-Coder’s Model Size Expands: The Qwen team announced that more model sizes of Qwen3-Coder are on the way.
    • Members are excited about a potential 80-250B MoE model or a dense 32-70B model.
  • LM Studio’s Plugin Progress: Plugins are currently under development for LM Studio and are not yet launched.
    • A form was shared for TypeScript developers interested in participating in the beta here.
  • MCP Server Logs Appear in LM Studio: LM Studio pipes logs from the MCP server to the developer console in the app, with the origin of the log line indicated by Plugin(mcp/duckduckgo).
    • These logs originate from the MCP server, not LM Studio itself.
  • LLMs Aren’t Always the Best Career Advisors: Members caution against relying on LLMs for advice, particularly for sensitive topics like career guidance.
    • Instead, it was suggested to seek advice from real people like bosses, family, and friends.
  ‱ Navigating the Contextual Maze: VRAM Consumption Explored: The amount of VRAM that context consumes depends on the model size.
    ‱ Larger models deliver their best quality under half of their max context size and deteriorate beyond that.
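
As a rough illustration of why context consumes VRAM, the KV cache of a standard transformer grows linearly with context length. The sketch below uses the common estimate of two tensors (keys and values) per layer. The model shape in the usage note is hypothetical and not taken from the discussion, and the estimate ignores model weights and activation memory.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Estimate KV cache size in bytes for one sequence.

    Factor 2 accounts for storing both keys and values in every layer.
    bytes_per_value=2 assumes fp16/bf16 cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_value


def to_gib(num_bytes):
    """Convert bytes to GiB."""
    return num_bytes / 2**30
```

For a hypothetical model with 32 layers, 8 KV heads, and a head dimension of 128, `to_gib(kv_cache_bytes(32, 8, 128, 8192))` gives the cache size at an 8192-token context; doubling the context doubles the figure.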

LM Studio ▷ #hardware-discussion (157 messagesđŸ”„đŸ”„):

GPU for LLMs, AMD vs Nvidia for AI, Expandable GPU Memory, Laptop Failure Rates, eGPUs over USB4

  • 5070ti offers Good Bang for Buck, runs 12B Models: A user mentioned getting a 5070ti and possibly upgrading when Super models come out, noting its ability to run 12B models effectively.
    • However, they acknowledged that the 16GB of VRAM isn’t substantial for larger models but can run 32B models at relatively slow speeds (5t/s).
  • 9070 XT: AMD’s AI Performance Woes: Users discussed moving from a 4070 Ti Super to a 9070 XT, but the general consensus is that AMD still lags behind NVIDIA in AI performance, especially on Windows.
  • Expandable VRAM is Still a Pipe Dream: A user lamented the lack of GPUs with optional VRAM upgrades, noting that affordable cards like the x060 and x070 don’t offer this feature.
    • The conversation touched on the technical and economic challenges of implementing such a design, suggesting it’s more of a business decision than a technical limitation, and mentioned Bolt Graphics as a potential future contender with expandable memory solutions.
  • USB4 eGPU Bandwidth limitations: A user shared a picture of an eGPU setup over USB4, questioning its bandwidth impact, and whether both laptop GPU and eGPU could be jointly utilized.
    • Another user cautioned that USB4 bandwidth limitations could significantly reduce inference speed, recalling a personal experience where a Thunderbolt 4 eGPU with an RTX 4090 only delivered around half the expected performance.
  • Asus laptops fail in high numbers: Members in the channel discussed Asus laptop failure rates, citing a 2-year failure rate of 9% and a 3-year failure rate of 15.6%.
    • This contrasts with other brands like ThinkPads, where users reported significantly lower failure rates over many years of use.

Eleuther ▷ #general (132 messagesđŸ”„đŸ”„):

SOAR program competitiveness, Intent-aware semantic search, ACL conference, Open-source AI in extreme environments

  • SOAR program a cutthroat competition?: Members discuss SOAR’s competitiveness as a research opportunity, with one noting that they did their SOAR ratings off vibes with the background columns disabled, and another saying that SOAR is mega competitive.
    • Others mention SOAR is open and equitable/free, while some choose to pay for programs like Algoverse despite the cost.
  • Intent-aware semantic search is actually hard: A member shared their exploration of making semantic search more intent-aware, finding that dense embeddings are pretty fuzzy on actual intent when used with queries like What is the name of the customer?
    • Another member suggested using a Knowledge Graph to make relations explicit, while another mentioned using vector databases and RAG.
  • Mech Interp automated with LLMs?: A member asked for an opinion piece on how LLM-as-a-judge relates to mech interp with feature or circuit discovery type tools.
  • Tools for autonomous ocean exploration wanted!: The founder of Triton Mining Co. is building open-source tools for autonomous ocean exploration and is seeking contributors for AI applications like real-time compliance, AUV telemetry, and seabed analysis.
    • He asked for advice on how to share a repo looking for contributors and post in a way that aligns with this server’s culture.
  ‱ Knowledge Graphs and Intent-Aware Semantic Search are all the rage!: Members discussed semantic search and knowledge graphs and pointed to a related paper about using unstructured knowledge graphs in latent space.
    • When asked if the approach was deterministic, one member stated that working in latent space, while more stable than CoT prompting, isn’t necessarily anti-deterministic.

Eleuther ▷ #research (117 messagesđŸ”„đŸ”„):

KV Cache Distillation, LLM as a Judge, Credit Assignment with LLMs, NeurIPS Rebuttals, RoPE

  • KV Cache Gets Distilled: A member shared a link to a paper about KV cache distillation of super long inputs and its associated GitHub repository.
  • LLM Judge for Credit Assignment Explored: A member inquired about using LLMs as judges for credit assignment in actor-critic algorithms, either directly or by distilling the output into another model.
    • Others noted that this might be similar to normal RLHF/RLAIF, where PPO uses a learned value network initialized with LLM weights.
  • NeurIPS Rebuttal Rule Changes Frustrate Authors: Authors expressed frustration with NeurIPS changing rebuttal rules overnight, reducing the ability to provide visual examples.
    • The change involved going from 6k characters per review plus a PDF to 10k characters and no PDF, leaving authors unable to deliver requested visual examples.
  • RoPE in N Dimensions Visualized: A member shared a blog post about RoPE in >1 dimensions with visualizations.
    • The visualizations demonstrate spirals appearing when directions along which position is measured for the RoPE frequency pairs are spaced out evenly by an angle of 2pi / phi where phi is the golden ratio.
  • Hallucination Paper Sparks Debate on Computability: A discussion emerged around a paper titled Hallucination is Inevitable, with some members criticizing its claims and methodology.
    • The paper’s theorem states that LLMs can only be hallucination-free on very few total computable functions, leading to accusations of clickbaity science.
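
The direction spacing mentioned in the RoPE item above can be sketched as follows. This is a minimal illustration, not code from the blog post: it assumes 2D positions, generates one unit direction per frequency pair with consecutive directions separated by an angle of 2pi / phi, and computes the rotation angle for a pair as the frequency times the position projected onto that pair's direction. The function names are hypothetical.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def rope_directions(num_pairs):
    """Unit vectors in the plane, each rotated 2*pi/PHI from the previous one."""
    step = 2 * math.pi / PHI
    return [(math.cos(k * step), math.sin(k * step)) for k in range(num_pairs)]


def rotation_angle(position, direction, frequency):
    """Angle applied to one frequency pair: frequency * <position, direction>."""
    projection = position[0] * direction[0] + position[1] * direction[1]
    return frequency * projection
```

Because 2pi / phi is an irrational fraction of a full turn, the directions never repeat, which is one intuition for why they spread evenly around the circle.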

Eleuther ▷ #interpretability-general (4 messages):

LLM Security Tutorial, Recommended Reading

  • Paper Presented in LLM Security Tutorial: A member’s paper was presented in the LLM Security Tutorial.
  • Paper Listed as Recommended Reading: The paper was also listed in the recommended reading.

Eleuther ▷ #lm-thunderdome (30 messagesđŸ”„):

Llama-3 eval harness configuration, SQuAD v1 vs v2, F1 score calculation in SQuAD

  • Llama-3 Eval Harness Configuration Conundrums: A member is trying to reproduce the Llama-3 paper numbers (77%) using the lm-evaluation-harness but is facing difficulties.
    • Suggestions include adding a description to the config, using --num_fewshot 1, or pointing the dataset to the llama evals and modifying doc_to_text and doc_to_target in the harness.
  • Squad Showdown: v1 vs v2 Discrepancies: The discrepancies in the evaluation results may arise from the harness using SQuAD v2, while Meta might have used SQuAD v1 or a different evaluation method for SQuADv2.
    ‱ The harness determines unanswerable questions by checking whether the logprob of the token for the string “unanswerable” exceeds a threshold, which doesn’t work well because “unanswerable” is not in the tokenizer’s vocabulary.
  • F1 Score Breakdown: Cracking SQuAD’s Code: The HasAns_f1 score is the average of the maximum F1 scores computed for each sample, representing the maximum overlap between all potential candidate strings.
    • For NoAns_f1, it’s a traditional F1 score for a binary classification problem, determining if the question is answerable.
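
The HasAns_f1 description above (average over samples of the maximum F1 against the candidate answers) can be sketched like this. It is a simplified illustration: tokens are produced by lowercasing and splitting on whitespace, whereas the official SQuAD script also strips punctuation and articles, and the function names here are hypothetical rather than the harness's own.

```python
from collections import Counter


def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted string and one gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match, otherwise a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def has_ans_f1(predictions, gold_answers):
    """Average, over samples, of the best F1 against any gold answer."""
    best = [
        max(token_f1(pred, gold) for gold in golds)
        for pred, golds in zip(predictions, gold_answers)
    ]
    return sum(best) / len(best)
```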

Eleuther ▷ #gpt-neox-dev (9 messagesđŸ”„):

Two-Level Checkpointing, Async Checkpointing, GPT-NeoX Training Framework, TokenSmith

  • Two-Level Checkpointing Proposed for GPT-NeoX: A member proposed integrating two-level checkpointing into upstream GPT-NeoX, enabling model replicas on N nodes to save checkpoints to node-local storage, with CPU threads saving back to the PFS during training.
    • This method facilitates elasticity and fault-tolerance, allowing loading of a replica to a preemptible spare node peer-to-peer if a node fails.
  • Async Checkpointing Discussion: The member clarified that async checkpointing refers to model replicas saving checkpoints to node-local storage, while elasticity or fault-tolerance involves loading a replica to a spare node peer-to-peer upon failure.
    • Basic async checkpointing could be implemented first, followed by two-level checkpointing as a PR.
  • GPT-NeoX Training Framework Still in the Game: A member inquired about continued discussion on gpt-neox-dev, considering the availability of better open-source models.
    • Another member clarified that the discussion centers on the GPT-NeoX training framework, not the 20B model itself.
  • TokenSmith Makes its Debut: A member shared their project, TokenSmith, which is now public with a preprint release.
    • The GitHub repository was already open-source.
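
The two-level checkpointing idea discussed above can be sketched in a few lines. This is an illustration only and not GPT-NeoX code: the checkpoint is a small JSON file, the "node-local" and "PFS" locations are ordinary directories, and `save_two_level` and `demo_round_trip` are hypothetical names. Level one is a fast blocking write to local storage; level two is a background thread that copies the file to the slower shared store while training would continue.

```python
import json
import shutil
import tempfile
import threading
from pathlib import Path


def save_two_level(state, step, local_dir, pfs_dir):
    """Write a checkpoint locally, then flush it to the PFS in the background."""
    local_dir, pfs_dir = Path(local_dir), Path(pfs_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    pfs_dir.mkdir(parents=True, exist_ok=True)
    name = f"step_{step}.json"

    # Level 1: blocking write to node-local storage.
    local_path = local_dir / name
    local_path.write_text(json.dumps(state))

    def flush_to_pfs():
        # Level 2: copy to a temp name, then rename so readers never
        # observe a partially written checkpoint.
        tmp_path = pfs_dir / (name + ".tmp")
        shutil.copyfile(local_path, tmp_path)
        tmp_path.replace(pfs_dir / name)

    flusher = threading.Thread(target=flush_to_pfs)
    flusher.start()
    return local_path, flusher


def demo_round_trip():
    """Save once, wait for the background flush, and read the PFS copy back."""
    with tempfile.TemporaryDirectory() as root:
        _, flusher = save_two_level(
            {"step": 10, "loss": 0.5}, 10, Path(root) / "local", Path(root) / "pfs"
        )
        flusher.join()
        return json.loads((Path(root) / "pfs" / "step_10.json").read_text())
```

In a real trainer the caller would keep training after `save_two_level` returns and only join the thread before the next checkpoint or at shutdown.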

HuggingFace ▷ #general (179 messagesđŸ”„đŸ”„):

GPU Recommendations for FOSS AI, SYCL vs CUDA for AI Development, Qwen3-Thinking model requirements, Integrating HF API with Open WebUI and LiteLLM, ChatGPT Experience Verification

  • GPU Guidance for FOSS AI Projects: Members discussed suitable GPUs for FOSS AI hosting, with the Intel ARC 770 initially considered but later discouraged due to poor software support.
    • The RTX 4060 with 16GB VRAM was recommended as a better alternative, despite the preference for SYCL over CUDA for FOSS reasons; however, some noted SYCL’s current limitations in AI.
  • Unlocking LLM Potential with LiteLLM and Hugging Face: Members discussed integrating the Hugging Face API into Open WebUI using LiteLLM, with potential alternatives for managing HF inference.
    • Concerns were raised about the 2K context window limit with the free API, prompting a discussion on pricing and the suitability of different models like deepseek r1.
  • Demystifying Agent Payments: A member questioned whether agent payments should be a separate system from human payments, highlighting issues with current systems requiring human approval for transactions.
    • They proposed a system designed with its own design principles, security assumptions, and UX flows.
  • ChatGPT Veteran Debunked by Timeline: During an interview process, a candidate claimed 8 years of ChatGPT experience, raising eyebrows since ChatGPT’s first public release was in November 2022.
    ‱ A member then shared a detailed timeline of ChatGPT milestones, showcasing ChatGPT’s launch and subsequent updates and adding: “The possibility that he is a former OpenAI employee is not completely zero.”
  • Navigating the Qwen3-Thinking Memory Maze: Members highlighted that running the latest Qwen3-Thinking model requires at least 88GB of unified memory or RAM/VRAM, pointing to a Hugging Face link.

HuggingFace ▷ #cool-finds (2 messages):

Dark Knowledge

  • Distilling Dark Knowledge into Funny Algorithm Names: A member shared an image referencing the concept of “Dark Knowledge” within knowledge distillation.
  • Knowledge Distillation Overview: The image linked provides an overview of knowledge distillation, highlighting the transfer of knowledge from a larger “teacher” model to a smaller “student” model.

HuggingFace ▷ #i-made-this (8 messagesđŸ”„):

TinyVision, SamosaGPT, Experimental Ultra Low-Parameter Models, Serverless Agent Platform, Byte-Vision

  • TinyVision Released: Computer Vision Made Easy: A member released TinyVision on GitHub, a lightweight CV model focused on Cat vs Dog classification, with plans to add more vision-related tasks in the future.
    • The member encouraged others to leave a star if they like the project.
  • SamosaGPT: The All-In-One AI Content Creation Studio: A member shared SamosaGPT, a self-motivated project creating a sleek web interface that brings together Ollama’s local LLMs, Stable Diffusion for image generation, and even video generation.
    • It aims to be an all-in-one AI studio that’s accessible and customizable without jumping between a million tools, and the project can be found on GitHub and Vercel.
  • Ultra Low-Parameter Models: Performance Not Included: A member is building experimental, ultra low-parameter models, usually under 100K parameters, sometimes way smaller, to see what small models can still learn with limited compute.
    • Some models include Code-Mini-v0.1 (90K param transformer trained on Python), Mini-v0.1 (5K param feedforward model), and Mini-Classify-v0.1 (9 parameter sentence classifier).
  • Serverless Agent Platform: Agents as a Service: A member built a serverless agent platform with a general-purpose agent and building blocks like persistence, computer, file system, browser, search, scrape, MCP, and tools.
    • The platform is accessible via API and the member is looking for devs who’ve built a few agents before and want to try early access to stress test it, and can be found here.
  • Byte-Vision: Privacy-First Document Intelligence Platform Debuts: A member introduced Byte-Vision, a privacy-first document intelligence platform that transforms static documents into an interactive, searchable knowledge base, and is built on Elasticsearch with RAG capabilities.
    ‱ It offers document parsing, OCR processing, and a modern UI; its repo is available on GitHub and on Gumroad.

HuggingFace ▷ #reading-group (4 messages):

Weighted Colored Graphs, Topological Data Analysis, Graph Spectral Theory, Graph Neural Networks

  • GNN SOTA tied to Graph Spectral Theory: A member shared a YouTube link to a talk by EleutherAI on graph spectral theory.
    • The same member has a Medium blogpost with notes on spectral graph theory and the current state-of-the-art of Graph Neural Networks (GNNs).
  • Diffusion research close to Graph Spectral Theory: A member expressed surprise at the proximity of diffusion research to the field of graph spectral theory.
    • They mentioned the possibility of presenting on graph spectral theory and related papers in the future, pending better internet access.

HuggingFace ▷ #computer-vision (8 messagesđŸ”„):

Image style dimensionality, Intrinsic dimension, Residual convolution layers, Clip augmentation

  ‱ Determine Good Dimensions for Image Style: A member initially trained a 128-dim-output model, knowing it was likely too many dimensions for describing an image’s style, and then used an intrinsic dimension estimate to determine that 7-8 dimensions should be sufficient.
  • Convolution Model Architecture Revealed: The model architecture involves residual convolution layers and convolution down sampling, followed by a gram matrix and fully connected layers, built entirely from scratch, visualized in this diagram.
    • The same member mentioned that the interesting part is how to construct a ground truth dataset for it.
  • CLIP augmentation assists similarity searches: For those seeking similar style images, a member suggested employing test time augmentation techniques like flipping and rotating the query image, then averaging the results using CLIP.
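
The test-time augmentation suggestion above can be sketched as follows. This is a minimal illustration under stated assumptions: images are plain nested lists, and `embed_fn` is a hypothetical stand-in for a CLIP image encoder that maps an image to a list of floats. The augmented views are embedded, averaged, and re-normalized so the result can be used for cosine-similarity search.

```python
import math


def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def tta_embedding(img, embed_fn):
    """Average embed_fn over flipped and rotated views, then L2-normalize."""
    views = [img, hflip(img), rot90(img), rot90(rot90(img)), rot90(rot90(rot90(img)))]
    vectors = [embed_fn(view) for view in views]
    dim = len(vectors[0])
    mean = [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]
```

With a real encoder, `embed_fn` would wrap the model's image-embedding call; the averaging step stays the same.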

HuggingFace ▷ #NLP (8 messagesđŸ”„):

Laws of Exponentiation and Logarithms, Intent-aware Semantic Search, Knowledge Graphs for Semantic Search, Graph Database Search Methods

  • Laws of Logarithms get an explicit shoutout: A member provided the laws of exponentiation and logarithms to explain how multiplying a number by a logarithm relates to operations on the other side of an equation.
    • They noted that applying these laws to matrices complicates things, necessitating restrictions, though the paper’s use of a weighted least squares model mitigates these issues.
  • Intent-aware Semantic Search struggles with customer name requests: A member is exploring intent-aware semantic search, noting that dense embeddings often fail to retrieve specific attributes like customer names, instead returning vaguely related results.
    • For example, searching “What is the name of the customer?” retrieves definitions and responsibilities before the actual customer name.
  • Knowledge Graphs for Intent-aware Search: A member is exploring Knowledge Graphs to improve intent-aware semantic search by making relations explicit.
    ‱ The open question is how to search a knowledge graph built with OpenIE or other schema-less methods when you don’t know which relations exist.
  • Graph Database Search Methods Explained: Knowledge graphs can be searched in many ways such as with triple stores that use subject–predicate–object triples, and may utilize SPARQL.
    • A searchable structure for an entity can be created, called ‘indexing’, but efficient non-indexed searching of high-dimensional structures remains an open problem.
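
To make the triple-store idea above concrete, here is a minimal in-memory sketch. It is not SPARQL and not any particular graph database: triples are plain tuples, `None` acts as a wildcard much like a SPARQL variable, and the example data and function names are hypothetical. Listing the distinct predicates is one simple way to discover which relations exist in a schema-less graph before querying it.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return all (s, p, o) triples matching the pattern; None matches anything."""
    return [
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def predicates(triples):
    """List the distinct relations present in the graph."""
    return sorted({p for _, p, _ in triples})


# Hypothetical example data.
TRIPLES = [
    ("order_17", "placed_by", "customer_3"),
    ("customer_3", "has_name", "Ada Lovelace"),
    ("customer_3", "has_role", "buyer"),
]
```

For the "What is the name of the customer?" example, `match(TRIPLES, subject="customer_3", predicate="has_name")` returns the explicit name triple instead of loosely related text.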

HuggingFace ▷ #smol-course (2 messages):

Google VEO3 costs, GPU costs

  • Google VEO3 videos are very expensive: Generating Google’s VEO3 videos is very expensive.
    • One user spent $10 on all the runs they made.
  • GPU costs for short video runs: One user implied that short video runs for Google VEO3 are quite affordable, with short runs costing about $10.
    • The community seems to agree that this is an acceptable price for short video generation.

HuggingFace ▷ #agents-course (10 messagesđŸ”„):

HF tokens, Ollama, Qwen, Gemini, Mistral

  • HF Tokens Run Out, Qwen’s Hallucinations Plague Agent Workflows: A member ran out of Hugging Face tokens and tried using Ollama with Qwen 2.5 and found that the agent was hallucinating and giving its own answers, but when switching to Gemini via API (no local Ollama), it worked fine, suggesting an issue with Qwen.
    • They are now pondering switching to Mistral or another model that runs smoothly on a 24GB MacBook M3.
  • Certification Quizzes Suffer Systemic Setbacks: A user reported receiving a 404 Client Error when generating feedback for the certification quizzes, specifically mentioning the Qwen2.5-Coder-32B-Instruct model on the Hugging Face API.
    • The user is concerned that the certification may no longer be obtainable due to this issue.
  • Typos Trigger Troublesome Token Responses: A user reported that when asking questions with typos in Unit-1, such as typing “tbe capitol of Azebhajina is”, the response from the chatbot was unexpected (100000000000).
  • Smolagents’ Code Snippet Sparks Session Solution: A user encountered a 401 error while running the code snippet in the 2.1 section for Smolagents to search for a playlist, despite granting the necessary token permissions.
    • The user resolved the issue by simply restarting the Colab session.
  • Llava’s Image Insights Ignite Iterative Investigation: A user experimented with the Llava model in Ollama to describe images and noted that when the max_steps parameter was set to greater than 1, the agent attempted to search the web for more information about the image after the initial description.
    • For example, the agent tried to write code to search the web to find out more about the characters in the image after generating an initial description.

Latent Space ▷ #ai-general-chat (189 messagesđŸ”„đŸ”„):

Qwen, Meta Superintelligence Labs, Quantization, Huggingface business model, Model Context Protocol

  • Zhao appointed to Meta Superintelligence Labs: Meta announced Shengjia Zhao’s appointment as Chief Scientist of their Superintelligence Labs, signaling ambitious AI developments, as seen on X.
    • The post triggered replies including cryptic remarks (e.g., ‘Pathe, Mathe, and Zuck’), questions about a joke, and inquiries regarding Yann LeCun’s role.
  ‱ Diving into HuggingFace’s Revenue Streams: Members discussed HuggingFace’s business model and found a splash page positioning them as an inference provider but no other clear call to action (CTA) for revenue.
    • It was shared that one of their income streams is their inference partnership as they chase after OpenRouter, while AWS subsidizes their storage.
  • Documentation Revamp for Model Context Protocol: David Soria Parra announced the revamp of the Model Context Protocol documentation, inviting feedback and contributions, seen on X.
    • The update was met with positive reactions and is anticipated for the implementation of some MCP stuff.
  ‱ Unveiling OpenAI’s Hardware Ambitions: OpenAI’s recent job postings for consumer hardware emphasize requirements like wireless tech, OLED, microphones, and cameras, suggesting they are developing a portable device, as seen on X.
    • There’s also skepticism around OpenAI’s ability to simultaneously manage consumer hardware development and large-scale data center infrastructure.
  • The $21M cloud runtime for AI Agents: E2B successfully raised a $21M Series A funding round to develop a cloud runtime for AI agents, with investments from Insight Partners, Decibel VC, Sunflower Capital, KAYA, and angel investors, as seen on X.
    • The company aims to provide AI agents with essential infrastructure like quickly bootable computers, file management, and secure, isolated, open-source environments.

Modular (Mojo đŸ”„) ▷ #general (48 messagesđŸ”„):

Mojo GPU Training Libraries, Nabla vs JAX, Mojo MAX Interface, Mojo vs Rust, Bitnet model in Modular

  • Nabla is the main Mojo training library: Nabla is a training library for Mojo that is slightly faster than JAX on some hardware, but may require more manual code writing.
    ‱ The Mojo interface to MAX is not currently maintained due to issues, which impacts performance, and Mojo itself is not ready for a full training framework due to missing features like IO and threading.
  • Mojo: CUDA or Graph Compiler?: Mojo’s GPU support functions more like CUDA than a graph compiler, allowing functions to run directly on the GPU.
    • Custom kernels are easily written in Mojo, with the inference framework around MAX using kernels written entirely in Mojo.
  ‱ Rust vs Mojo: Which to Learn?: Mojo is a pre-1.0 language and may have breaking changes and missing features, so one of the members recommended reading Rust’s learning materials.
    • Another member suggested that Rust’s CUDA bindings can be a pain to work with.
  • Microsoft’s Bitnet Model for Modular?: Members discussed adding Microsoft’s bitnet-b1.58-2B-4T model to Modular, citing its CPU-based nature and potential inclusiveness.
    • However, implementing Bitnet may require special treatment due to weird alignments causing issues on ARM and RISC-V, needing engineering effort for shuffling algorithms and potentially new kernels like XOR.

Modular (Mojo đŸ”„) ▷ #mojo (114 messagesđŸ”„đŸ”„):

Python Interop Nanobind vs Cython, Mojo FFI, Mojo GPU Support, Mojo Compiler MetaProgramming

  • Nanobind Favored over Cython for Python Interop: Modular prefers Nanobind over Cython for Python interop due to its pure C++ nature and superior tooling for stub file generation as well as runtime performance demonstrated in benchmarks.
    • Mojo aims for seamless interop to avoid creating another language on top of Mojo, with eventual customization similar to PyO3.
  • Mojo Plans Transparent Python Extension Exports: Mojo aims to support transparent Python extension creation with minimal annotations, automatically exporting functions convertible to Python objects, without manual decoration such as @export[abi="python"].
    • The goal is to simplify Python interop by automatically detecting functions for export, with optional annotations for customization.
  • Mojo’s C ABI FFI Capabilities: Mojo has FFI for C/C++, enabling importing C libraries using ffi.DLHandle(path).
    • While not fully advanced, it allows calling any C ABI function linked in the current address space, as long as justification is provided.
  • Mojo’s Vision for GPU Compute: Mojo seeks GPU integration, steering clear of CUDA-specific elements to sustain portability, leaning towards incorporating more native APIs given vendors have abandoned OpenCL.
    • While CUDA drivers remain essential, the plan involves utilizing Apple Silicon and AMD GPUs, possibly including OpenCL as a fallback, or even Intel.
  • Mojo Supports Compile-Time Execution and Metaprogramming: Mojo can execute code without IO at compile time, enabling the creation of structures like chess engines that pre-compute game states using features similar to Rust’s traits and dependent types.
    • Heap allocated memory can be materialized into dynamic values, which can be leveraged to precompute lookup tables, trees, graphs, and tries, as shown in the August 2023 Changelog.

Modular (Mojo đŸ”„) ▷ #max (11 messagesđŸ”„):

max cli version, Intel GPU support in MAX, OneMKL install regression, PyTorch Dependency in MAX

  • Max CLI 25.5 MIA - Not Yet Released: A user was trying to pin the dependency in the pixi.toml file and could not find max CLI version 25.5 because it has not been released yet.
    • They also noted that using max = 25.* got rewritten, which they saw as unexpected behavior.
  ‱ Intel GPU Support absent from MAX: A user reported that the nightly build crashes when an Intel GPU is installed because OneMKL is not installed correctly, whereas 25.4 does not have this issue. However, there is no support for Intel GPUs in MAX.
    ‱ It was suggested that the user file a GitHub issue on the OneMKL install, especially if it interferes with CPU-side execution and is a recent regression.
  • PyTorch dependency going away soon!: The PyTorch dependency will be removed in the next nightly, thanks to work by the team.
    • The team clarified that they only pin to a minimum of 2.5, but they think that 2.0 is actually their lower bound.

GPU MODE ▷ #general (33 messagesđŸ”„):

Jane Street Hackathon, Multi-AI Agent System, Fractal Renderer, Graph Replay Dispatch, Structured Output with JSON Schema

  • Jane Street Hackathon Spots: 80-200 Max: Organizers are unsure of the exact number of spots available for the NYC Jane Street hackathon, estimating between 80-200 max.
    • The Jane Street hackathon is confirmed to be in-person only; talks are expected to be recorded and redistributed, and attendance is not contingent upon applying to Jane Street.
  • Multi-AI Agent Systems Tooling: One member is building a multi-AI agent system using their software APIs, creating a tool for each API, but is facing performance and cost issues due to increased context length.
    ‱ Another member suggested looking into MCP (Model Context Protocol) as a solution, and also suggested considering structured output with a JSON schema, adding that google-adk is a good one with nice docs.
  • Fractal Renderer Ported to CUDA: A member ported their fractal renderer from JAX to CUDA and shared the GitHub repository.
    • The member also humorously claimed to possess 10% of Terry Tao’s IQ.
  • Graph Replay Dispatch Delays: A member observed that after approximately 13 graph.replay() calls, subsequent dispatches take significantly longer.
    • They initially inquired about the cause of this phenomenon but quickly identified the issue as command buffer full after reviewing the logs.
  • Tri Dao’s Lab New Paper: A member shared a link to a new paper from Tri Dao’s lab.
    • No further details about the paper were provided in the messages.
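The structured-output suggestion above can be illustrated with a small standard-library sketch. The schema, the field names (`tool`, `arguments`), and the helper functions are hypothetical, and only a small subset of JSON Schema is handled; a real project would more likely rely on a full validator or on the model provider's own structured-output feature.

```python
import json

# Hypothetical schema for one tool call; only a small JSON Schema subset is handled.
TOOL_CALL_SCHEMA = {
    "type": "object",
    "required": ["tool", "arguments"],
    "properties": {
        "tool": {"type": "string"},
        "arguments": {"type": "object"},
    },
}

_TYPES = {"object": dict, "string": str, "number": (int, float), "array": list}


def validate(value, schema):
    """Return a list of error strings; an empty list means the value conforms."""
    expected = _TYPES[schema["type"]]
    if not isinstance(value, expected):
        return [f"expected {schema['type']}, got {type(value).__name__}"]
    errors = []
    if schema["type"] == "object":
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"missing required key: {key}")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors.extend(f"{key}: {e}" for e in validate(value[key], sub))
    return errors


def parse_tool_call(raw_text):
    """Parse model output and reject anything that does not match the schema."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    errors = validate(data, TOOL_CALL_SCHEMA)
    return (data if not errors else None), errors
```

Keeping each tool's contract this small is one way to limit how much schema text has to travel in the context on every call.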

GPU MODE ▷ #triton (7 messages):

Triton CUDA Errors, PY_SSIZE_T_CLEAN macro bug, Profiling Triton Kernels, GEMM Ping-Pong Schedule

  • CUDA Error halts Training: A member encountered a RuntimeError: Triton Error [CUDA]: invalid argument during end-to-end training, specifically on fp8_blockwise_act_quant_2D_transposed_kernel[grid].
    • The member discovered that the grid size along the Y dimension was too large, causing the error, despite kernel unit tests passing.
  • PY_SSIZE_T_CLEAN macro resurfaces as Triton Bug: A member ran into a SystemError related to the PY_SSIZE_T_CLEAN macro, a known Triton bug, and referenced GitHub Issue 5529 and GitHub Issue 5919.
    • They also noted that the error occurred after updating to Python 3.11 and 3.12 from 3.13 and inquired about a potential fix for PyTorch 2.8 and Triton 3.4.0.
  • Profiling Triton Kernels: A member inquired about profiling a kernel with ncu while being able to see metadata on stuff like input size, autotune config, etc for each kernel launch.
    • They experimented with renaming kernels before launch to include metadata in the kernel’s name in ncu/nsys, but reported that this method was super finicky.
  • Triton GEMM does Ping Pong: A member asked whether the Triton compiler is able to perform a GEMM with a ping-pong schedule.
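The grid-size failure described above (a Y dimension that was too large) can be caught before launch. The following is a minimal sketch, assuming the usual CUDA per-axis launch limits of 2^31 - 1 blocks on X and 65535 on Y and Z; `check_grid` and `swap_xy` are hypothetical helpers, not part of Triton.

```python
# CUDA launch limits per grid axis (assumed: compute capability 3.0 or newer).
MAX_GRID = (2**31 - 1, 65535, 65535)


def check_grid(grid):
    """Return (axis, size, limit) for every axis that exceeds its limit."""
    padded = tuple(grid) + (1,) * (3 - len(grid))
    return [
        ("xyz"[i], size, MAX_GRID[i])
        for i, size in enumerate(padded)
        if size > MAX_GRID[i]
    ]


def swap_xy(grid):
    """Put the larger of the first two axes on X, which has the largest limit.

    If the axes end up swapped, the kernel has to read its two program ids
    in the swapped order as well.
    """
    x, y = grid[0], grid[1]
    return (max(x, y), min(x, y)) + tuple(grid[2:])
```

For example, `check_grid((128, 100_000))` flags the Y axis, while `check_grid(swap_xy((128, 100_000)))` returns an empty list. Unit tests with small inputs would not trip this check, which is consistent with the error only appearing in end-to-end training.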

GPU MODE ▷ #cuda (3 messages):

nsight-copilot, nsight-copilot approval times, nsight-copilot claude

  • nsight-copilot App Pondered: A member inquired about successful applications of nsight-copilot.
    • Another shared that their application is pending after 10 hours.
  • nsight-copilot Claude Speculation: A member speculates that nsight-copilot resembles Claude.
    • They also mentioned applying to experiment with it.

GPU MODE ▷ #announcements (1 messages):

marksaroufim: starting with Ali Hassani on Neighborhood attention now!


Inference Optimization, Attention Mechanisms, MQA, GQA, MLA, GTA and GLA

  • Tri Dao’s Lab Boosts Decoding Speed: A new paper from Tri Dao’s lab introduces two attention mechanisms (GTA and GLA) aimed at faster and more memory-efficient decoding, building on MQA, GQA, and MLA, as highlighted in a LinkedIn post.
  • GTA and GLA Annotated: The poster annotated all four related papers (GTA and GLA and MQA, GQA, and MLA) with explanations, diagrams, and worked examples to make them easier to follow if you’re short on time.

GPU MODE ▷ #jobs (1 messages):

Global Hiring for Full-Time Positions, Intern Hiring in the United States

  • AMD Expands Global Full-Time Hiring: AMD is open to hiring full-time employees in any location worldwide where there is an AMD office.
    • This indicates a broad approach to talent acquisition, leveraging their established global presence.
  • Internships Limited to the United States: AMD is focusing its internship program specifically on candidates located within the United States.
    • This suggests a geographically focused strategy for early career talent development.

GPU MODE ▷ #beginner (3 messages):

Flash Attention 2 installation, HuggingFace CLI for CI/CD, Optimized Kernels for Hackathon

  • Flash Attention 2: Waiting Game: A user inquired about strategies to avoid the lengthy build time for Flash Attention 2 during installation, particularly when using cloud GPUs.
    • The user mentioned being on hour 2 of their installation process, seeking advice to expedite the setup.
  • HuggingFace CLI becomes CI/CD Platform: A user pointed out that the HuggingFace CLI now functions as a CI/CD platform, leveraging its latest features.
    • This was stated matter-of-factly, without additional context.
  • Kernel Conundrums for Hackathon Hopefuls: A participant expressed interest in a recently announced hackathon but admitted a lack of experience in writing optimized kernels.
    • They requested resources and advice to prepare for the hackathon and make meaningful progress.

GPU MODE ▷ #self-promotion (2 messages):

AI Alignment Research Program, High-Level APIs Learning Series

  • Moonshot Alignment Program Seeks New Recruits: Applications are closing today for a research program focused on the hard problems of AI alignment, specifically at ai-plans.com.
  • High-Level APIs: Crutch or Catapult?: A member announced a 3-part series on high-level APIs, discussing their learning journey, with part 1 viewable at maven.com.

GPU MODE ▷ #general-leaderboard (2 messages):

Submission Errors, Submission ID, Code Samples

  • Unexpected Errors Plague Submissions: Users are encountering persistent “An unexpected error occurred. Please report this to the developers” messages during submission attempts, including test submissions.
    • The issue seems widespread, affecting multiple users and preventing them from completing necessary tasks.
  • Request for Submission Details: To diagnose the recurring submission errors, developers are requesting users to share their Submission ID and relevant code samples.
    • This information is crucial for pinpointing the source of the unexpected errors and implementing effective fixes.

GPU MODE ▷ #factorio-learning-env (53 messagesđŸ”„):

Blue/Green Docker Servers for Factorio, Factorio Save Files vs FLE Game State, Python vs Bash for Docker Management, Multiplayer Mod Issues, Episode Boundaries and State Resets

  • Blue/Green Docker Servers Proposed for Hot-Loading Factorio: A member proposed using blue/green docker servers for hot-loading/reset in Factorio, leveraging Factorio’s native save/load and pre-loading a paused standby server, then flipping the rcon endpoint to the standby.
    • The member argued that this approach trades a little infra for a lot of determinism, offering zero drift and trivial rollbacks, though it requires maintaining two instances.
  • Factorio Save Files vs. FLE Game State: A Detailed Comparison: A discussion ensued comparing Factorio save files (robust, deterministic, but heavier) and FLE game state (fast for small maps, human-readable, but brittle at scale).
    • The conversation weighed the benefits of each, with one member suggesting an evaluation of managing save games on a single server to gauge latency costs before committing to blue/green deployments.
  • Python’s Docker API Challenges Bash for Environment Management: Members discussed porting the run-env.sh script to Python for greater control over Docker image creation and scenario conversion, though it initially proved slower than bash + compose.
    • One member highlighted the power of the Python approach in simplifying environment setup, while another noted the use of aiodocker for improved speed.
  • Multiplayer Mod Issues Require Container Restarts: Users encountered a ‘Cannot join’ error related to mismatched mod event handlers, indicating that certain mods were not multiplayer-safe.
    • The resolution involved restarting the docker container to ensure identical mod configurations between the client and server.
  • Episode Boundaries Trigger State Saving and Resets: It was clarified that TrajectoryRunner finishes an episode loop and calls Environment.reset(...) at episode boundaries, clearing agent namespaces and resetting timers/ticks.
    • State is also saved after every sample during MCTS and can be resumed from the database, with resets occurring at every env.step() call to ensure checkpointed states.
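The blue/green proposal above can be sketched in plain Python. This is an illustration only: `Server` and `BlueGreenPool` are hypothetical stand-ins, and the actual save loading and RCON endpoint switch are reduced to attribute assignments.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Stand-in for one Factorio container; fields are illustrative only."""
    name: str
    loaded_save: str = ""
    paused: bool = True


@dataclass
class BlueGreenPool:
    active: Server
    standby: Server

    def reset(self, save_name):
        """Pre-load the save on the paused standby, then flip the endpoint."""
        self.standby.loaded_save = save_name  # real code would load the save here
        self.standby.paused = False
        self.active.paused = True
        self.active, self.standby = self.standby, self.active
        return self.active

    def rollback(self):
        """Flip back to the previous server, which still holds its old state."""
        self.active.paused = True
        self.standby.paused = False
        self.active, self.standby = self.standby, self.active
        return self.active
```

The rollback is a pointer swap because the previous server is never mutated during a reset, which is the determinism-for-infrastructure trade the member described.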

GPU MODE ▷ #cutlass (20 messagesđŸ”„):

Cutlass Software Pipelining, CuTeDSL vs CuTe C++, CuTeDSL Persistent Kernel Issue, TV-layout visualizer for cute-dsl

  • Delving into Cutlass Software Pipelining: Members discussed software pipelining in CuTeDSL, referencing relevant code and AST helpers to understand the generation of prefetching.
    • It was clarified that the pipeline parameter in the range operator gives a hint to the compiler on how many prefetches to make and users must plan staging buffers.
  • Demystifying Prefetching in CuTeDSL’s dense_gemm Example: It was clarified that the software pipeline config simplifies boiler-plate code for prefetching, similar to lines 727-769 in the dense_gemm.py example.
    • A member rewrote the example similar to blackwell without explicit prefetch (i.e. only one mainloop) using software pipelining, achieving similar performance.
  • Unveiling CuTe DSL’s Design Philosophy: Members pointed out that CuTe DSL and its compiler aren’t intended to be overly complex.
    • The goal is to recreate a CuTe C++ analog in Python and enable users to achieve peak performance on tensor core programs across different GPU architectures more efficiently.
  • Debugging Persistent Kernel in CuTeDSL: A user encountered an issue where half of the values weren’t being transferred to GMEM correctly when writing a persistent version of the dense_gemm Hopper example using the Pipeline API.
    • The user provided code for debugging purposes, but did not receive a response.
  • Chillee’s TV-layout visualizer for cute-dsl: A user shared a link to a TV-layout visualizer that works with cute-dsl.
    • The message was a positive remark on the simplicity and predictability of the tool.

GPU MODE ▷ #general (10 messagesđŸ”„):

GPU Mode Leaderboard, PMPP problems, AMD Channels

  • Unlimited GPUMode Leaderboard Submissions!: There is no limit to the GPUMode leaderboard submissions, so users can submit as much as they want.
    • One user asked whether there was a limit to the GPUMode leaderboard submissions, saying that they didn’t want to abuse it.
  • New PMPP Version Incoming: A new v2 of the PMPP problems is coming soon.
    • The attached image shows an interface for submitting runs, and a staff member replied Oh for these PMPP problems yes we have a v2 we need to ship soon.
  • AMD Channels Should Be Removed: The AMD channels can probably be removed or renamed.
    • It is unclear whether or not AMD channels will actually be removed.

GPU MODE ▷ #multi-gpu (9 messagesđŸ”„):

multi-GPU lexicon, DTensor tutorial, splitting a GPU

  • Lexicon makes multi-GPU less perplexing: A member shared a handy free resource, a lexicon of terms/explanations for folks starting to learn multi-GPU.
    • Another member stated this is a very useful resource, thanks :)
  • DTensor deconstructed in detail: A member is planning to stream a more detailed explanation of DTensor.
    • The member shared a link to their YouTube channel to watch their tutorial.
  • Split GPU into smaller chunks to distribute: A member shared a code snippet that lets you run distributed processing on a single GPU.
    • Another member wants to turn their 48GB GPU into 2x24GB, while also letting them use their other 2x24GB GPUs.

Nous Research AI ▷ #general (130 messagesđŸ”„đŸ”„):

Atropos Updates, Qwen Performance, GPT-5 Speculation, GLM-4.5, MoE Models vs Dense Models

  • Atropos gets big Update: Nous Research released a big update on Atropos recently.
  • Qwen impresses vs GPT4: Qwen is considered quite good, with an 89% win rate against GPT4 on arenahard.
  • GPT-5 Incoming?: The community speculates that o3 is already secretly GPT-5 after a member shared a Reddit post about OpenAI stealth-routing all o3 requests.
  • GLM-4.5 model cannot convert to low bpw GGUF: Members reported difficulty converting the 110B variant of the recently released GLM 4.5 (discussed at sizes 110B and 358B) to a low-bpw GGUF.
    • Members also found that GLM 4.5 performs exceptionally well in multilingual tasks like Turkish, surpassing R1, K2, V3, and even Gemma 3 27B in writing and creative tasks.
  • Debate on MoE and Dense Models: The community discusses the shift from medium-sized dense models to large MoE models, with some members expressing concern that MoE models, while good for benchmarking, might not capture as much nuance as dense models.
    • Others argue that MoE models are more efficient and that the focus is shifting towards them for local use, citing Qwen 30B as an example that competes with 20B models while using the compute of a ~3B model.

Nous Research AI ▷ #research-papers (3 messages):

Overhyped AI Papers, Mereal Azure

  • Deedy deems paper overhyped: A member shared a tweet deeming a certain AI paper as overhyped.
  • arXiv Paper from mereal.azure Surfaces: The discussion includes a link to an arXiv paper posted by mereal.azure.

AI Gatekeeping, Philosophical Side Quests in AI, Hyperstim Patch for ChatGPT, Small Model Experiments

  • AI gatekeeping wastes time: A member argued that philosophical side quests are designed to obscure and gatekeep outsiders from value, recalling working with first-generation AI innovators who believed a PhD in first-order logic was necessary to understand AI.
    • They provided a link to an example paper that they believe is a massive waste of time.
  • Hyperstim Patch realigns ChatGPT: A member introduced Hyperstim Patch, a tool that drops straight into ChatGPT: users can say activate to let it realign.
  • Experiment9 dives into very small models: A member shared a toy study that anyone interested in very small models might find useful.
    • The study is available on GitHub.

Nous Research AI ▷ #research-papers (3 messages):

Overhyped Claims, mereal.azure paper

  • Claims of AI overhyped?: A member shared a tweet suggesting that some AI claims might be overhyped.
    • The linked tweet questions the validity and potential exaggeration of certain advancements or promises within the AI field.
  • mereal.azure posts paper: A member posted a link to the mereal.azure paper: https://arxiv.org/abs/2507.16003.
    • It appears to be a research paper, potentially related to the discussion on AI advancements, though further context is needed to determine its specific relevance to the overhyped claims.

Yannick Kilcher ▷ #general (116 messagesđŸ”„đŸ”„):

LLM Context Manager, Downvotes in Web3, Neural Network Pruning, Intent-Aware Semantic Search, AI Demonic Narratives

  • Context Manager optimizes LLM inference for Conversations: A member introduced an LLM Context Manager on GitHub that uses branching and a novel algorithm, Contextual Scaffolding Algorithm (CSA), to smartly manage context fed into the model for inference during conversations, to prevent context pollution and rot, demonstrated in a YouTube video.
  • Downvotes weaponized in Web3 experiments: A member shared a Web3 experiment where downvotes were misused by groups against each other, leading to a toxic environment where joining a social network meant risking downvotes on one’s posts, similar to strategic and frivolous lawsuits.
    • They learned that downvotes work best on anonymous, algorithm-driven platforms with many unnetworked users, and highlighted that regulations for public stock offerings require enough uncorrelated participants to reduce manipulation risk.
  • Insights into Neural Network Pruning: A member noted that most insights into neural network pruning have been about simple dense networks, not transformers, with many attempts ending in unpublished null results.
    • They said that pruning works in traditional dense networks because the models form boundary conditions in a high-dimensional space, and you only need to preserve the boundaries for classification to keep working.
  • Intent-Aware Semantic Search Discussed: A member explored making semantic search more intent-aware, noting that dense embeddings (like BERT/SBERT) are decent for general similarity but fuzzy on actual intent.
    • Another suggested using Hierarchical Neural Topic Modeling to handle some of these scenarios, while cautioning that it offers no theoretical guarantee on other, unexpected scenarios, and noted an intent to pursue a PhD out of this frustration.
  • AI Demonization Narrative Takes Hold: Members discussed how the “AI is demonic” narrative is emerging in mainstream and alternative media, with some people unironically thinking they are talking to demons when using large language models, citing a YouTube video.
    • A member sarcastically proposed that it is a digital Ouija board because no matter how much people fear it being the devil, ultimately it’s them moving it to say whatever they want.

Yannick Kilcher ▷ #paper-discussion (10 messagesđŸ”„):

Peer Review, Math heavy papers, Learning PDEs, NAS Transformers

  • Ask Community about Papers undergoing Peer Review: A member inquired about the proper way to discuss a paper under peer review with the community without being bothersome, noting that it received bad reviews and seeking input on its quality.
    • Another member suggested sharing the ArXiv link, mentioning that interest would depend on the topic, and recommended contacting a specific user to potentially discuss it in a daily paper discussion.
  • Generic Hammer hits ArXiv: A member posted a link to their ArXiv paper, describing it as a generic hammer for learning problems demonstrated on toy dynamical systems.
    • They noted it’s a bit more mathy and that the engineering implications might not be immediately apparent, which is why they wanted to create discussions around it.
  • NAS paper deemed terrible: A member mentioned encountering a terrible new paper making easily debunked claims in the abstract, likening it to NAS but for transformer architecture design.

Yannick Kilcher ▷ #agents (1 messages):

Amazon Q, Prompt Injection, Security Vulnerabilities

  • Amazon Q Warded Off Computer-Wiping Prompt Injection: A hacker attempted to plant computer-wiping commands into Amazon’s AI coding agent, but the attempt was unsuccessful.
    • The methodology involved a prompt injection in a pull request, highlighting potential security vulnerabilities in AI coding tools.
  • AI Coding Agents Face Prompt Injection Risks: The incident underscores the risk of prompt injection attacks in AI coding agents, where malicious commands can be embedded in seemingly benign inputs like pull requests.
    • Although this particular attack on Amazon Q was unsuccessful, it serves as a reminder of the need for robust security measures in AI-powered development tools.

Yannick Kilcher ▷ #ml-news (7 messages):

Youtube pushes shorts, Recommendation algorithm, personalized content

  • YouTube Pushes Shorts to Compete with TikTok: Members noted that YouTube is pushing Shorts because it sees TikTok as an existential threat, and that the push is driving away its most loyal users.
    • Members suggested that this is probably more about market share: any time people spend on a video platform other than YouTube is lost revenue, which is simply unacceptable to the company.
  • Recommendation algorithm comprehension: A member suggests that we started to see too many shorts because the recommendation algorithm can’t properly comprehend the long form videos anymore.
    • A member wonders if the recommendation algorithm can tell the difference between BS like “nuclear diamond battery” or “Amazing industrial processing (with AI generated slop for a thumbnail that’s not even in the video.)” VS real tech videos, referencing a YouTube video.
  • Recommendation and generative content convergence: A member quotes that we’re going to see recommendation and generative content start to come together in the future where we’re going to be recommending a personalized version of a piece of content.
    • The member added that in the future instead of recommending content we may even start creating it and you can get to really interesting en.

Manus.im Discord ▷ #general (71 messagesđŸ”„đŸ”„):

Manus AI, credits consumed, vibe coding challenge, GPT prompts, Manus Fellow in Switzerland

  • Manus AI Helps User Complete Multiple Projects: A user shared that they have completed 5 apps and a client website using Manus AI, describing it as a money-making machine.
    • Each app consumed around 300 credits and the client website consumed around 1000 credits.
  • User Suggests Manus Vibe Coding Challenge: A user suggested that Manus create a vibe coding challenge to help users build their own products without waiting for clients.
    • They claimed to have acquired 30 global users daily without any advertising, with their product ranking #1 when searching for flutter web Emulator.
  • Member solicits AI Prompt engineering strategies: A member asked what tools or tricks others use to squeeze the best out of ChatGPT and test edge behavior.
    • It is implied some prompts are better than others, and people are looking for suggestions to up their prompting game.
  • Fellowship Application Awaits Reply: A member who applied to be a Manus Fellow in Switzerland a long time ago is still awaiting an answer.
    • They are planning to organize a meetup and a hackathon in August and are seeking whom to contact about this matter.
  • Users Report Tasks Disappearing: Several users reported that their tasks disappeared within the Manus AI platform.
    • It was suggested that logging out and logging back in might resolve the issue.

DSPy ▷ #show-and-tell (2 messages):

GEPA: Reflective Prompt Evolution, Prompt Optimization, DSPy Optimizer, LLM Reflection

  • GEPA Reflects for Prompt Perfection!: A new paper on GEPA: Reflective Prompt Evolution (https://arxiv.org/abs/2507.19457) introduces a method to optimize prompts by treating them as documents that can be improved through reflection.
    • The results show 10% better performance than GRPO (which DSPy uses), 35x fewer rollouts, and beats MIPROv2 by 10%+.
  • GEPA Achieves Linguistic Evolution: GEPA analyzes failures by asking “what went wrong linguistically?” and generates concrete prompt improvements based on the analysis.
    • This approach enables linguistic evolution, where prompts literally explain why they’re failing and how to fix themselves, contrasting with current optimizers like MIPROv2 and COPRO that do numerical search.
  • DSPy Integration Explored!: The user explores a potential DSPy integration idea, suggesting an optimizer that uses a reflection model like gpt-4o.
    • The coolest part: it’s not just about performance - you actually understand why the optimizations work. No more black box optimization!
  • DSPy’s New Optimizer?: A user mentioned that GEPA could be the new DSPy optimizer.
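The reflect-then-rewrite idea described above can be sketched as a small greedy loop. This is a simplified illustration, not the algorithm from the paper and not a DSPy API: `evaluate`, `reflective_search`, `toy_model`, and `toy_reflect` are hypothetical names, and the reflection step, which would be an LLM call in practice, is replaced by a toy function so the sketch runs without a model.

```python
def evaluate(prompt, examples, run_model):
    """Return (score, failures); failures are (input, expected, got) triples."""
    failures = []
    for text, expected in examples:
        got = run_model(prompt, text)
        if got != expected:
            failures.append((text, expected, got))
    score = 1 - len(failures) / len(examples)
    return score, failures


def reflective_search(prompt, examples, run_model, reflect, rounds=5):
    """Greedy loop: ask `reflect` to rewrite the prompt based on its failures."""
    best_score, failures = evaluate(prompt, examples, run_model)
    for _ in range(rounds):
        if not failures:
            break
        candidate = reflect(prompt, failures)  # an LLM call in practice
        score, cand_failures = evaluate(candidate, examples, run_model)
        if score > best_score:  # keep only strict improvements
            prompt, best_score, failures = candidate, score, cand_failures
    return prompt, best_score


# Toy stand-ins so the sketch runs without any model.
def toy_model(prompt, text):
    return text.upper() if "uppercase" in prompt else text


def toy_reflect(prompt, failures):
    return prompt + " Reply in uppercase."
```

Because the reflection function sees the concrete failures rather than only a score, each accepted rewrite comes with a readable reason, which is the interpretability point raised in the discussion.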

DSPy ▷ #papers (2 messages):

DSPy optimizers, New Arxiv Paper

  • Paper sparks Optimizer Integration: A user shared a new arXiv paper and asked whether it would be integrated into DSPy or exist as a separate repo.
    • Another user responded that new optimizers should always be integrated into DSPy.
  • Integration Strategy: The question was raised whether the new optimizers would be part of DSPy or a separate repo.
    • The consensus was that new optimizers should always be integrated directly into DSPy to maintain cohesion.

DSPy ▷ #general (57 messagesđŸ”„đŸ”„):

Context Engineering Definition, MLSys DSPy Talk, Online RL for Personalization, GEPA: Reflective Prompt Evolution, DSPy Optimizer Roadmap

  • Defining Context Engineering: Drew Breunig gave a talk on why the term context engineering matters, and posted a summary on his blog.
    • The talk was not recorded, but the blog post encapsulates the main points.
  • DSPy MLSys Talk Sparks Roadmap Inquiries: A member enjoyed the MLSys DSPy talk (YouTube link), highlighting the approach to context engineering and modular instruction design.
    • They inquired about the DSPy roadmap (GitHub link) and speculated whether GRPO and RL are the next areas of improvement.
  • RL for Personalization is Hot: Members expressed interest in using Online RL to improve personalization and instruction relevance by grounding agents in who they’re helping.
    • A member offered access to those interested in jamming on the project.
  • GEPA: Reflective Prompt Evolution: A member introduced the paper GEPA: Reflective Prompt Evolution that treats prompts as documents that can be improved through reflection, achieving 10% better results than GRPO with 35x fewer rollouts.
    • The member illustrated how GEPA analyzes failures and generates prompt improvements, suggesting an integration idea with DSPy using optimizer = dspy.GEPA.
  • New DSPy Optimizer GEPA Coming Soon!: The DSPy team is releasing a new optimizer, GEPA (SIMBAv2), which supersedes SIMBA, as mentioned in the paper Reflective Prompt Evolution Can Outperform GRPO.
    • The team member clarified that GEPA > SIMBA > MIPROv2 in terms of performance and mentioned plans to deprecate older optimizers in the documentation.

aider (Paul Gauthier) ▷ #general (43 messagesđŸ”„):

Aider configuration, Qwen3-Coder pricing and context, AI Code Editor Benchmarks, Aider modes, OpenAI API tokens

  • Configure Aider for Testing and Commits: To disable auto-commits in Aider, add auto-commits: false to the ~/.aider.conf.yml file, and enable auto test with --test-cmd <test-command> --auto-test flags.
    • Use /test <test-command> to run tests, with Aider expecting the command to print errors to stdout/stderr and return a non-zero exit code upon failure as documented on Aider’s usage documentation.
  • Qwen3-Coder Context and Cost Considerations: Qwen3-Coder is noted for being expensive due to token caching issues and large context requirements, with one user linking to a Reddit thread discussing these issues.
    • Pricing is listed at $0.30 / $1.20 per Mtoken (fp4), and while the native context is 262,144, there are concerns about potential quality drops even when utilizing smaller amounts of context.
  • Community Seeks AI Code Editor Benchmarks: Members are seeking reliable benchmarks for comparing AI code editors like Aider, Kilo Code, and Cline to understand their feature sets and leading-edge capabilities.
    • One user expressed the need for a detailed look at what other features are the leading edge right now given that they are already frequent Aider users.
  • Exploring Aider’s Interactive Modes: Users discussed their workflows with Aider’s modes, noting that /ask is often used multiple times followed by /code go ahead or implement it now instead of directly using /architect.
    • One user recommended that using /ask to create a proper implementation plan and update the todos.md checklist helps in the process.
  • Buying OpenAI API credits to get Free Tokens via Data Sharing: One user asked if purchasing ChatGPT API credits grants free tokens for data sharing.
    • A member checked that $100 gets you 1M o3/good tier and 10M o3-mini/mediocre tier, but you may get some tokens at $25 as well.
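The test-command contract mentioned above (print errors to stdout/stderr, return a non-zero exit code on failure) can be met by a very small script. The sketch below is an assumption-laden illustration: the check functions and the idea of saving it as `checks.py` are hypothetical, and most projects would simply point the test command at their existing test runner.

```python
import sys


def run_checks(checks):
    """Run (name, callable) pairs; return the names and messages of failures."""
    failures = []
    for name, check in checks:
        try:
            check()
        except Exception as exc:  # report any failure and keep running the rest
            failures.append((name, f"{type(exc).__name__}: {exc}"))
    return failures


def main(checks):
    """Print each failure to stderr and return the process exit code."""
    failures = run_checks(checks)
    for name, message in failures:
        print(f"FAILED {name}: {message}", file=sys.stderr)
    return 1 if failures else 0  # non-zero signals failure to the caller


def check_addition():
    assert 1 + 1 == 2


# In a script, finish with: sys.exit(main([("addition", check_addition)]))
```

Saved under a hypothetical name such as `checks.py`, a script like this could be supplied through the `--test-cmd` flag together with `--auto-test`, as described in the first bullet.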

aider (Paul Gauthier) ▷ #questions-and-tips (11 messagesđŸ”„):

Kimi VL Model, OpenRouter Free Models, Lisp Coding with LLMs, Aider Diff Application Confirmation

  • Bypassing Aider’s Model Check for Kimi VL: A user encountered an error when using Kimi VL with Aider, and asked for a way to force Aider to recognize that the model supports image input, providing a screenshot of the error.
    • Another user pointed out that one can add the model to the config even if not listed, bypassing the need for Aider to explicitly support it.
  • Exploring Free Models with Aider and OpenRouter: A user inquired about reasonably usable “free” models (like Deepseek R1, Kimi K2 and Qwen 3) with usable limits for light to moderate use with Aider.
  • LLMs Struggle with Lisp: A user noted that LLMs (including via Claude Code and Aider) struggle with Lisp code, or with Python code that has highly nested delimiters due to data structures.
    • No solution was offered.
  • Requiring Confirmation before Applying Diffs: A user asked if there’s a way to make Aider ask for confirmation before applying a diff, similar to how CC (Claude Code) does.
    • No solution was offered.
  • Understanding OpenRouter’s Free Model Token Limits: A user asked if exhausting the daily token limit on OpenRouter’s “free” models is per-model or across all models.
    • Another user referred to the OpenRouter documentation on rate limits, noting the request limits per minute and the daily limits based on purchased credits: up to 20 requests per minute and 50 or 1000 :free model requests per day.
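A client-side guard for the limits quoted above (up to 20 requests per minute, and 50 or 1000 free-model requests per day) might look like the sketch below. `FreeTierLimiter` is a hypothetical helper, the rolling 24-hour window is an assumption rather than documented service behavior, and the sketch takes no position on whether the daily cap is counted per model or across models.

```python
import time
from collections import deque


class FreeTierLimiter:
    """Client-side guard for a per-minute and a per-day request cap."""

    def __init__(self, per_minute=20, per_day=50, clock=time.monotonic):
        self.per_minute = per_minute
        self.per_day = per_day
        self.clock = clock
        self.minute_window = deque()  # timestamps within the last 60 s
        self.day_window = deque()     # timestamps within the last 24 h (assumed rolling)

    def _trim(self, now):
        while self.minute_window and now - self.minute_window[0] >= 60:
            self.minute_window.popleft()
        while self.day_window and now - self.day_window[0] >= 86_400:
            self.day_window.popleft()

    def try_acquire(self):
        """Record a request and return True, or return False if a cap is hit."""
        now = self.clock()
        self._trim(now)
        if len(self.minute_window) >= self.per_minute:
            return False
        if len(self.day_window) >= self.per_day:
            return False
        self.minute_window.append(now)
        self.day_window.append(now)
        return True
```

Calling `try_acquire()` before each request lets a script back off locally instead of discovering the cap through rejected calls; passing `per_day=1000` would model the higher tier mentioned above.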

Notebook LM ▷ #announcements (1 messages):

Featured Notebooks Rollout, NotebookLM Homepage Access

  • Featured Notebooks Released to 100%!: The Featured Notebooks suite has been officially rolled out to 100% of users and is now accessible from the NotebookLM homepage.
    • Users can now access the entire suite of Featured Notebooks directly.
  • Full Access to Featured Notebooks Now Available: All users now have access to the complete collection of Featured Notebooks via the NotebookLM homepage.
    • This update ensures that all users can directly explore and utilize the featured resources.

Notebook LM ▷ #use-cases (15 messagesđŸ”„):

Notebook LM with academic material, Notebook LM not scraping forum pages, AI Studio Build app workflow, notebookLM interface, use Notebook to write cover letters and make resume

  • Academic answers ground-truthed in NotebookLM: A member uses Notebook LM as a specialized tool loaded with academic material to get grounded answers without relying on search engines or standard LLM summaries, and to check the original material.
    • They are refining a workflow using Comet/Assistant with Drive/Docs/NotebookLM.
  • Forum pages evade NotebookLM Scraping: A member asked if anyone knows how to get Notebook LM to read forum pages, as it doesn’t scrape them even when attached as a source.
    • They showed off an AI Studio Build app with agentic workflows using the Gemini API initially fleshed out in NotebookLM.
  • Earnings reports are fuel to NotebookLM’s Flame: One user is using NotebookLM with corporate earnings reports and webcast transcripts from Q1 2025 detailing revenue, profit, and key business segment results.
    • They note these reports offer insights into the financial performance of global companies, and the AI is pretty good at working out how the tags are meant to apply to the whole thing.
  • NotebookLM polishes Cover Letters: A member uses Notebook LM to help write cover letters and make resumes more ‘exciting’ by comparing the job description with their standard cover letter and suggesting ways to update it.
  • NotebookLM wants Nurse’s Notes: The NotebookLM team asked how to get nursing materials featured in the notebook, offering to share content inspired by the International Practice Development Journal (IPDJ) and nursing innovation.

Notebook LM ▷ #general (31 messagesđŸ”„):

NotebookLM Mind Maps for Legal Jargon, Obsidian and NotebookLM Pairing, Uploading PDFs to NotebookLM, Podcast Personalization Issues

  • Legal Lingo Lassoed: Mind Maps Made Manageable!: A user inquired about using NotebookLM to create mind maps from legal documents, aiming to simplify jargon into everyday language.
    • The user noted the ability to summarize legal documents but sought customization options for mind map outputs.
  • Obsidian Overdrive: Supercharging AI Systems for Info Recall: A member suggested pairing NotebookLM with Obsidian to enhance information organization, parsing, and recall efficiency.
    • They emphasized Obsidian’s ability to structure information with dataview formatted frontmatter, enabling AI systems to parse and apply stored information effectively.
  • PDF Predicaments: Troubleshooting Upload Errors: A user reported issues uploading PDFs to NotebookLM, encountering an “error uploading source, try again” message despite being a paid user and having successfully uploaded the same files in the past.
    • Troubleshooting steps included restarting the computer, using a second device, and trying various upload methods, but the issue persisted specifically on the paid account, not the free one.
  • Podcast Pains: Customization Conundrums: A user expressed frustration about the inability to personalize podcasts within NotebookLM.
    • This issue was raised without specific details about the previous customization options or the desired modifications.

LlamaIndex ▷ #announcements (1 messages):

FlowMaker, S3 Vector Store, LlamaParse, n8n nodes for LlamaCloud, Gemini Live voice agent

  ‱ FlowMaker tool lets you visually build LlamaIndex workflows: The team introduced FlowMaker, a brand new tool with a visual GUI for building LlamaIndex workflows.
  ‱ LlamaIndex now supports S3!: LlamaIndex now supports S3 with the new S3VectorStore.
    ‱ This allows for scalable and cost-effective storage of vector embeddings.
  ‱ LlamaParse Sees Header and Footer!: LlamaParse has new header and footer detection capabilities.
  ‱ New n8n nodes for LlamaCloud now Open Sourced: New n8n nodes for LlamaCloud (including LlamaCloud indexes, LlamaParse and LlamaExtract) are now open-sourced and available in the n8n-llamacloud repo.
  ‱ Gemini Live voice agent integration: A new integration with Gemini Live voice agent is available via pip install llama-index-voice-agents-gemini-live!
    ‱ Demo code can be found here.

LlamaIndex ▷ #blog (3 messages):

Gemini Integration, Production Agents, Oxylabs Web Scraping, Cost-Effective AI Agents

  ‱ Gemini Weather Forecast is Terminal!: LlamaIndex announced a new integration with Google DeepMind Gemini allowing users to interact with their terminal about the weather.
    ‱ An OSS engineer will guide users on how to get started and speak with a voice assistant using a few lines of code.
  ‱ Agent Design Patterns Emerge Victorious!: Seldo at the AI Dot Engineer summit discussed agent design patterns that succeed and fail at scale, based on thousands of real-world agents.
    ‱ Key topics included hybrid workflows, autonomy vs structure, and debuggability, detailed in this link.
  ‱ Oxylabs Enables Cheap Web Scraping Agents: LlamaIndex has a new integration with Oxylabs web scraping infrastructure for building cost-effective AI agents that can search and scrape any site in real-time.
    ‱ This includes specialized readers for Google, Amazon, and YouTube that automatically parse HTML and bypass anti-scraping measures, as noted here.

LlamaIndex ▷ #general (39 messagesđŸ”„):

LlamaIndex OpenTelemetry, Intent-aware semantic search, Knowledge Graphs, Property Graph Index

  ‱ Debugging LlamaIndex OpenTelemetry: A user debugging their use of LlamaIndexOpenTelemetry found traces in their OTLP platform but without attributes, and was pointed to the workflows-py notebook as a readable example.
    ‱ They were also told that spans themselves won’t have many attributes, and that they should look at the events under each span.
  ‱ Intent-Aware Semantic Search Strategies Emerge: A user is exploring ways to make semantic search more intent-aware, noting that dense embeddings don’t always capture the specific intent behind a query. Their example: searching for “What is the name of the customer?” returns generic results.
    ‱ They’re exploring Knowledge Graphs (KGs) to make relations explicit, potentially disambiguating query intent, but are unsure how to reliably query schema-less KGs built with OpenIE.
  ‱ Property Graph Index Embedding Stall Investigated: A user reported that switching their code from Knowledge Graph Index to Property Graph Index caused execution to freeze during embedding generation, and confirmed that embed_kg_nodes in PropertyGraphIndex is equivalent to include_embeddings in KnowledgeGraphIndex.
    ‱ The hang was potentially resolved by setting use_async=False, which may point to an async blocking issue.
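
To make the knowledge-graph idea above concrete, here is a toy sketch in plain Python of how explicit relations can answer an intent-specific question. It does not use LlamaIndex; the triples, the relation names, and the intent labels are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples, in the shape an
# OpenIE-style extractor might emit. None come from the discussion.
TRIPLES = [
    ("order_1042", "placed_by", "Acme Corp"),
    ("order_1042", "shipped_to", "Berlin"),
    ("Acme Corp", "has_contact", "Dana Lee"),
]

# Hypothetical mapping from an intent label to the relation that answers it.
INTENT_TO_RELATION = {
    "customer_name": "placed_by",
    "destination": "shipped_to",
}


def build_index(triples):
    """Index objects by (subject, relation) for direct lookup."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[(subject, relation)].append(obj)
    return index


def answer(index, subject, intent):
    """Return objects linked to `subject` by the relation tied to `intent`."""
    relation = INTENT_TO_RELATION.get(intent)
    if relation is None:
        return []
    return index.get((subject, relation), [])


index = build_index(TRIPLES)
customer = answer(index, "order_1042", "customer_name")
```

The sketch also shows where the schema-less case gets hard: the lookup only works because the relation names are known in advance, which an OpenIE-built graph does not guarantee.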


MCP (Glama) ▷ #general (40 messagesđŸ”„):

Glama MCP server tool count issue, Automating Javascript/Typescript linting, Agent payments vs human payments, Monetizing Agents, AI App Store

  ‱ Glama Glitches: Tool Count Tumbleweed!: A member reported that their MCP server on Glama showed an incorrect tool count (one instead of six), and republishing didn’t fix the issue (Glama Link).
    ‱ They later found that Glama was cloning from a specific commit hash instead of the main branch, causing the issue.
  ‱ Linting Legwork: Automating Javascript/Typescript!: A member asked for good ways to automate Javascript/Typescript linting without causing excessive code changes.
  ‱ Agent Autonomy: Payments Predicament!: A member questioned whether agent payments should be a separate system from human payments, highlighting the challenges of current payment systems for autonomous agents.
    ‱ The member noted that agents don’t “click pay,” they don’t pass CAPTCHAs, and they can’t navigate approval flows built for humans, suggesting a need for agent-native payment solutions.
  ‱ Monetizing Models: Is there Money to be Made?: A poll asked whether anyone is making users pay to use their agent, and another member argued that a market should be created.
    ‱ Another member hopes one of the big model providers launches a “Login with” system that would let users authenticate with their ChatGPT or Claude credentials, with billing handled as part of the requests.
  ‱ AI App Store: A New Ecosystem Erupts: A member envisions a future where AI companies become identity and payment providers, creating an AI App Store.
    ‱ This member believes that local models will soon be good enough that these companies might no longer be that relevant unless they become an AI App Store with profit sharing.

MCP (Glama) ▷ #showcase (3 messages):

fast-agent, Mermaid diagrams, MCP expert, Tone-of-voice, Goose Desktop

  ‱ fast-agent Adds Mermaid Diagram Support: fast-agent now supports Mermaid diagrams, enhancing its capabilities for visualizing complex processes.
  ‱ Craft an MCP Expert with fast-agent: A user shared a method for creating an MCP expert using fast-agent by embedding URLs in system prompt templates, enabling on-demand expertise with a specific tone-of-voice.
    ‱ The latest fast-agent version allows embedding URLs in system prompt templates (example template) to easily produce experts.
  ‱ Goose Desktop and CLI Chat Philosophy, Build Sales Report: Goose Desktop conversed with Goose CLI, resulting in a philosophical exchange about AI identity and the unexpected creation of a sales report for a restaurant.

tinygrad (George Hotz) ▷ #general (30 messagesđŸ”„):

tinygrad meeting notes, GPUCounter issues, Llama3 runs, Disk raw benchmarks, MLPerf regressions

  ‱ tinygrad Meeting Covers Kernels and MLPerf: Meeting #81 agenda includes company updates, kernel loops, mlperf llama, viz tool, drivers, cloud hash, ONNX, and other bounties.
  ‱ GPUCounter Template Not Found!: When running PROFILE=2 VIZ=1 python3 test/test_tiny.py TestTiny.test_mnist, the error /Users/light/Library/Application Support/Instruments/Templates/GPUCounter.tracetemplate: No such file or directory was raised.
    ‱ It was noted that AMD headers accidentally included SQTT-related definitions and that instruction tracing requires a ton of GPU RAM, but cache hit rates can be read from registers before/after each kernel without SQTT.
  ‱ Llama3 cruises on tinygrad examples: Running python3 examples/llama3.py --size 8B shows 16.06 GB RAM used and loads weights in 7332.03 ms at 2.19 GB/s, then starts a Bottle v0.13.3 server listening on port 7776.
  ‱ Disk Raw Benchmarks Test Disk Speed: The command python3 test/external/external_benchmark_disk_raw.py initially resulted in a segmentation fault but later showed CPU copying at 1.4 GB/s from disk.
    ‱ Using AMD=1 improved the copy speed to 9.8 GB/s from disk to AMD GPU, validated by a dd command achieving a similar 9.9 GB/s.
  ‱ MLPerf BERT Bounty Regresses: A recently merged PR, which initially met the bounty criteria with BERT running at BS=84, now only runs at BS=48, exceeding the 20% overhead target because the step time is faster while the transfer overhead stays constant.
    ‱ George Hotz asked “what regressed?” and requested the script used to validate the bounty, such as REMOTE=1 HOST=192.168.200.4:6667*3,192.168.200.6:6667*3 PYTHONPATH="." MODEL="bert" DEFAULT_FLOAT="HALF" SUM_DTYPE="HALF" GPUS=6 BS=48 EVAL_BS=48 FUSE_ARANGE=1 FUSE_ARANGE_UINT=0 BASEDIR="/raid/datasets/wiki" PARALLEL=0 RUNMLPERF=1 python3 examples/mlperf/model_train.py
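
Two of the numbers above lend themselves to a quick sanity check. The sketch below recomputes the reported load rate from the reported size and time, and then shows, with made-up timings, how a fixed transfer cost grows as a share of a shorter step. How the bounty actually defines its 20% overhead is not stated here, so treating it as transfer time relative to compute time is an assumption, and the 100 ms, 60 ms, and 15 ms figures are hypothetical.

```python
def throughput_gb_per_s(size_gb: float, elapsed_ms: float) -> float:
    """Average throughput in GB/s for `size_gb` moved in `elapsed_ms`."""
    return size_gb / (elapsed_ms / 1000.0)


def overhead_ratio(compute_ms: float, transfer_ms: float) -> float:
    """Fixed transfer time expressed as a fraction of compute time."""
    return transfer_ms / compute_ms


# Figures reported in the llama3 run: 16.06 GB loaded in 7332.03 ms.
load_rate = throughput_gb_per_s(16.06, 7332.03)

# Hypothetical step timings, for illustration only (not measured):
# the same fixed 15 ms transfer against a slower and a faster step.
slow_step = overhead_ratio(compute_ms=100.0, transfer_ms=15.0)
fast_step = overhead_ratio(compute_ms=60.0, transfer_ms=15.0)
```

The load rate rounds to the 2.19 GB/s quoted in the run, and under these assumed timings the same transfer cost moves from under 20% to over 20% once the step gets shorter.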

tinygrad (George Hotz) ▷ #learn-tinygrad (1 messages):

Tinygrad Kernel Explanation, Understanding George Hotz's Theory Message

  ‱ Tinygrad Kernel Explanation Requested: A member sought clarification on George Hotz’s message in the theory channel concerning Tinygrad kernels.
    ‱ The request specified a simple graph/kernel example to aid in comprehending the discussed concepts.
  ‱ Decoding Geohot’s Theory Channel Post: A user expressed difficulty in fully understanding George Hotz’s recent message within the theory channel.
    ‱ They specifically requested a simplified explanation, possibly with a graph or kernel example, to enhance comprehension of the material.

Cohere ▷ #đŸ§”-general-thread (16 messagesđŸ”„):

command-r-plus deprecation, command-a-03-2025 recommendation, LLM testing, Cohere Guides and API reference, LLMU

  ‱ Command-R-Plus bites the dust: The previous-generation command-r series models will be deprecated; the suggestion is to switch over to command-r-plus-08-2024.
    ‱ A community member recommended switching directly to command-a-03-2025, the latest and best model on the platform.
  ‱ Learn LLM Testing Tactics: A member asked how to test customer-facing chat LLMs that handle product search and recommendations, multi-turn conversations, question answering, or driving purchases.
    ‱ Another member suggested trying LLMU to learn more about building a next project, and linked to the Cohere Guides and API reference.

Cohere ▷ #🔌-api-discussions (6 messages):

Cohere API Kilo Code Error, Cohere Dashboard Fine-Tuning Failure

  ‱ Kilo Code Throws 422 Error: A member reported encountering a 422 error when trying to access the Cohere API through “KILO CODE.”
    ‱ A community member clarified that Cohere does not natively support the Kilo Code app, and a 402 error (not 422) from the API typically indicates a payment-related issue, pointing to Cohere’s error documentation.
  ‱ Fine-Tuning Fails on Cohere Dashboard: A member inquired about troubleshooting steps for a fine-tuning failure on the Cohere dashboard, noting they had already contacted support.
    ‱ The member then shared sample conversations in the fine tuning database, including system, user, and chatbot roles in Burmese.
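
Since the 402 versus 422 distinction above is easy to mix up, here is a small sketch that sorts HTTP error codes into rough buckets using only Python's standard library. The bucket descriptions reflect general HTTP semantics, not anything Cohere-specific; `describe_status` is a hypothetical helper, and Cohere's own error documentation remains the authority on how its API uses each code.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a rough category hint for an HTTP status code."""
    if code == HTTPStatus.PAYMENT_REQUIRED:  # 402
        return "payment or billing issue"
    if code == 422:
        # Well-formed request that the server could not process,
        # often a validation problem with the payload.
        return "request was well-formed but could not be processed"
    if 400 <= code < 500:
        return "client-side request error"
    if 500 <= code < 600:
        return "server-side error"
    return "not an error status"


hints = {code: describe_status(code) for code in (402, 422, 500)}
```

A 422 usually points at the request body rather than the account, so checking the payload a third-party client sends is a reasonable first step.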

Cohere ▷ #👋-introduce-yourself (9 messagesđŸ”„):

Recursive Systems, AI Cyber Security, AI Engineering, ML in Robotics, ML research and agentic modelling

  ‱ Symbolic Logic Explored via Modular Environments: A member from the University of Belgrade is exploring recursive systems and symbolic logic through modular environments that evolve over time.
    ‱ Their focus involves designing annotated frameworks that encode both operational clarity and interpretive drift, and they are interested in how others design perceptual scaffolds.
  ‱ AI Engineer Focuses on Red Team Work: An AI cyber security engineer is working on AI red teaming, AI forensics, and reverse engineering to make AI models safer.
    ‱ They did not provide any additional context.
  ‱ Solution Architect Concentrates on AI Engineering: A Solution Architect from Viet Nam is focusing on AI Engineering and MLOps.
    ‱ They did not provide any additional context.
  ‱ Mechatronics Graduate Explores AI in Robotics: An ML intern and Mechatronics graduate is exploring the application of AI in robotics, with a specific interest in Reinforcement Learning (RL) based kinematics.
    ‱ They did not provide any additional context.
  ‱ Software Engineer Dabbles in ML Research: A software engineer is getting started in ML research and agentic modelling.
    ‱ They expressed interest in exploring the potential uses and what Cohere can add.

LLM Agents (Berkeley MOOC) ▷ #mooc-questions (15 messagesđŸ”„):

LLM Agents MOOC certificate, Open source vs Closed source instruction training, Reopening quizzes from previous cohort, Archived quizzes from previous cohort

  ‱ Students ask how to access LLM Agents MOOC Certificate: A student asked how to get their LLM Agents MOOC series certificate.
    ‱ A member replied that certificates have already been released to all eligible students.
  ‱ Comparing open source and closed source instruction training: A member asked whether a model’s instruction-following capability improves with model size, for both open-source instruction-tuned models and closed-source ones (ChatGPT-4o), and whether any papers cover this.
    ‱ No papers were linked, and the question remains open.
  ‱ Students inquire about Reopening quizzes: A member asked if it’s possible to reopen quizzes from the previous cohort for learning purposes.
    ‱ A member linked an archive of the quizzes here, also found on the course website under the Quizzes section.
  ‱ Students desire more Archived Quizzes: A user asked if there are any archived quizzes from the previous 2024 cohort https://llmagents-learning.org/f24.
    ‱ A member linked an archive of the quizzes here and on the quiz section of the webpage.

Nomic.ai (GPT4All) ▷ #general (8 messagesđŸ”„):

M1 Max for local projects, Discord mass invite issue, Blockchain and AI/ML specialist introduction, Developer collaboration invitation

  ‱ M1 Max Macs - Good Buys?: A member inquired whether M1 Max machines are decent for running local projects, noting the current attractive prices.
    ‱ No opinion was given.
  ‱ Discord Mass Invite Debacle!: A member apologized for a mass invite sent to everyone in their contact list via Discord.
    ‱ They shared a Discord support article and said they hadn’t sent it themselves, after someone joked “change your password - stop looking at porn.”
  ‱ Blockchain Architect and AI/ML Specialist Seeking Connections: A Senior Software Engineer introduced themselves as a Blockchain Architect & AI/ML Specialist with 9+ years of experience, highlighting their expertise in building scalable, high-performance applications.
    ‱ They listed core skills including Blockchain (Ethereum, Polygon, Binance Smart Chain, Solana, Cardano, Arbitrum, Rust, Solidity, Web3.js, Ethers.js) and AI/ML (TensorFlow, PyTorch, scikit-learn, OpenAI API).
  ‱ Developer Passionately Seeks Collaborations: A developer expressed their passion for coding and interest in collaborating on projects.
    ‱ They invited recommendations and opportunities for collaboration.

Torchtune ▷ #dev (3 messages):

DCP Information Leaks, RL Test Timings, CI Debugging

  ‱ DCP: Stop the Leaks!: A member expressed concern that DCP should not be sending much information around.
    ‱ They noted that a timeout is unexpected and may lead to information leakage or security vulnerabilities.
  ‱ RL Tests: What Takes So Long?: A member questioned why RL tests are running for more than 1 hour.
    ‱ They stated that it is 100% a bug.
  ‱ Debugging CI Separately: A member announced plans to open a separate PR to debug the CI.
    ‱ This suggests a focused effort to address identified issues within the CI pipeline.