China takes another huge leap ahead in open models

AI News for 1/26/2026-1/27/2026. We checked 12 subreddits, 544 Twitters and 24 Discords (206 channels, and 7476 messages) for you. Estimated reading time saved (at 200wpm): 602 minutes. Our new website is now up with full metadata search and beautiful vibe coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on @smol_ai!

As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

Kimi has been on an absolute tear in the past year, and we last heard from them in November with Kimi K2 Thinking. Like K2, today’s K2.5 is still a 32B-active, 1T-param model (384 experts), “built through continual pretraining on 15 trillion mixed visual and text tokens atop Kimi-K2-Base” (which was itself trained on 15T tokens), and it ships with an EXTREMELY well-produced video from their founder (3 minutes, just watch it):

They again claim SOTA on HLE and BrowseComp (footnotes give confidence the tests are legit), but also open model SOTA for vision and coding tasks:


There are a few notables here. Kimi K2.5 is “natively multimodal” for the first time, perhaps borrowing from Kimi VL, which they attribute to “massive-scale vision-text joint pre-training” including VIDEO understanding: “simply upload a screen recording” and K2.5 can reconstruct the website for you:

The fact that this is a continued pretrain that changes the architecture (adding a ~400M-param MoonViT vision encoder) is VERY exciting for model-training folks, who rarely get to see a scaled-up model do stuff like this.

The other 2 headline features are equally exciting. The first is Agent Swarm (only for paid users on the Kimi app), which “learns to self-direct an agent swarm of up to 100 sub-agents, executing parallel workflows across up to 1,500 coordinated steps, without predefined roles or hand-crafted workflows.” This parallelism yields better end results at up to 4.5x faster wall-clock speed, ignoring token cost of course.

The second is “Office Productivity”, with K2.5 Agent focused on “high-density, large-scale office work end to end”.

This is not empty regurgitation: we saw enough to sign up as a paying subscriber of the Kimi App going forward. As Artificial Analysis notes, the China-Western gap in open models just took another big leap today.



AI Twitter Recap

MoonshotAI’s Kimi K2.5 ecosystem: open multimodal MoE + “Agent Swarm” push

  • Kimi K2.5 model drop and positioning: Moonshot positions Kimi K2.5 as a flagship open-weights model with native multimodality (image + video), strong agentic performance, and aggressive API pricing/latency claims. Official launch media and messaging: founder intro video, pricing/throughput claims incl. “Turbo-level speed 60–100 tok/s”, plus early community reactions emphasizing “agent swarm” and multimodal capability (kimmonismus, kimmonismus on multimodal/video).
  • Technical gist (as surfaced by the community): A useful unpacking of K2.5’s reported ingredients—~15T mixed visual+text tokens continual pretraining, context 128K→256K via YaRN, release in INT4 with selective quantization (only routed experts quantized), and the “Agent Swarm” orchestration concept (dynamic generation of subagents; up to 100 parallel subagents / 1,500 steps; wall-time improvements claimed 3–4.5×) is summarized by @TheZachMueller (and points to the technical report).
  • Benchmarks/third-party eval framing: Artificial Analysis positions K2.5 as “leading open weights” and closer to frontier labs, highlighting GDPval-AA Elo 1309 (agentic knowledge work harness), MMMU Pro 75%, INT4 ~595GB, and a 64% hallucination rate (improved vs K2 Thinking) among other stats: @ArtificialAnlys. LMArena announcements also place K2.5 Thinking at #1 open model in their Text Arena snapshot: @arena. (Treat leaderboards as point-in-time; harness/tooling and prompting matter.)
  • Distribution and “runs at home” signals: K2.5 landed quickly across infra surfaces: Ollama cloud with launch integrations (@ollama), Together AI listing (@togethercompute), and Fireworks as a partner (Moonshot). A notable local-inference datapoint: K2.5 reportedly runs (slowly but “usable”) on 2× M3 Ultra via MLX with sharded generation, ~21.9 tok/s at high memory use: @awnihannun (+ command snippet here).
  • Product surface area around Kimi: Moonshot also pushed adjacent tooling: Kimi Code, an Apache-2.0 open-source coding agent integrating with common IDEs/editors (announcement), and an Agent SDK to build custom agents (link). A “Kimi Product” account is explicitly aimed at distributing prompts/use-cases (launch), with a viral demo of “video-to-code” website cloning (demo).
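As a rough illustration of the selective-quantization idea above (only the routed experts get INT4; attention and other shared weights stay in full precision), here is a minimal sketch. The module-name pattern and the symmetric per-tensor quantizer are illustrative assumptions, not Moonshot's actual recipe.

```python
import re

def quantize_int4_symmetric(weights):
    # Symmetric per-tensor INT4: map floats to integers in [-8, 7] via one scale.
    scale = max(abs(w) for w in weights) / 7.0 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def selectively_quantize(named_weights, expert_pattern=r"\.experts\.\d+\."):
    # Quantize only modules whose name matches the routed-expert pattern
    # (a hypothetical naming convention); leave everything else full-precision.
    out = {}
    for name, w in named_weights.items():
        if re.search(expert_pattern, name):
            q, scale = quantize_int4_symmetric(w)
            out[name] = ("int4", q, scale)
        else:
            out[name] = ("fp", w)
    return out

model = {
    "layers.0.attn.q_proj": [0.5, -1.2, 0.3],          # stays full precision
    "layers.0.moe.experts.17.w1": [0.25, -0.5, 1.0],   # routed expert -> INT4
}
packed = selectively_quantize(model)
```

The payoff of this split is that the experts dominate a large MoE's parameter count, so quantizing only them captures most of the memory savings while sparing the quality-sensitive attention and shared layers.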

Open “American comeback” at scale: Arcee/Prime Intellect Trinity Large Preview (400B MoE)

  • Trinity Large Preview release: Arcee dropped Trinity Large initial weights as a “preview” release: @arcee_ai, with expanded details from @latkins. Prime Intellect frames it as an open 400B MoE with 13B active trained with Datology data: @PrimeIntellect. OpenRouter offered limited-time free access: @OpenRouterAI.
  • Architecture/training details (most concrete technical tweet): A strong technical snapshot comes from @samsja19: 400B/A13B MoE, trained over 17T tokens; 3:1 interleaved local/global gated attention, SWA, NoPE on global layers + RoPE on local layers (as written in tweet), depth-scaled sandwich norm, sigmoid routing, trained with Muon; trained on ~2,000 B300s for a month on Prime Intellect infra, with data curation by DatologyAI.
  • Data scaling emphasis: Datology’s involvement is highlighted as a major part of the project: “6.5T tokens overall” and “800B synthetic code” (plus multilingual curation) in one team member’s recap: @code_star. Separate recaps mention 8T synthetic as part of 17T: @pratyushmaini.
  • Ecosystem readiness: vLLM announced day-0 support for serving Trinity Large: @vllm_project. The meta-story in the replies is that a Western org is again attempting frontier-ish pretraining from scratch with an open model, rather than only post-training/evals.
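The 3:1 interleaved local/global attention layout described above can be sketched as a simple layer-schedule generator; the window size, layer count, and field names here are placeholders, not Arcee's actual config.

```python
def trinity_style_schedule(n_layers, window=4096):
    # Build a 3:1 interleaved schedule: three sliding-window local layers
    # (with RoPE) for every global full-attention layer (with NoPE).
    layers = []
    for i in range(n_layers):
        if (i + 1) % 4 == 0:
            layers.append({"attn": "global", "pos": "nope", "window": None})
        else:
            layers.append({"attn": "local", "pos": "rope", "window": window})
    return layers

sched = trinity_style_schedule(8)
```

The design intuition is that cheap windowed layers handle most token mixing while the occasional global layer propagates long-range information, keeping attention cost well below full quadratic at every layer.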

Agents everywhere: orchestration, subagents, planning critics, and IDE/CLI integration

  • Agent “swarm” vs “subagents” convergence: Kimi’s “Agent Swarm” pitch (dynamic subagent creation) parallels the broader pattern of central orchestrator + parallel specialists. The most explicit “starter pattern” articulation is LangChain’s stateless subagent model (parallel execution + minimized context bloat): @sydneyrunkle. Meanwhile, Kimi’s swarm is framed as trainable orchestration via Parallel-Agent RL (PARL) in community summaries (Zach Mueller).
  • Reliability via “critique before execute”: Google’s Jules introduced a Planning Critic—a second agent that critiques plans pre-execution, claiming a 9.5% drop in task failure rates: @julesagent. Jules also added “Suggested Tasks” for proactive optimizations: @julesagent.
  • Coding-agent products intensifying: Mistral shipped Vibe 2.0 upgrades (subagents, user-defined agents, skills/slash commands, and paid plans): @mistralvibe and @qtnx_. MiniMax launched an “Agent Desktop” workspace pitched as more polished than Claude Cowork: @omarsar0 (and MiniMax’s own onboarding automation: @MiniMax_AI).
  • IDE infrastructure and retrieval: Cursor claims semantic search materially improves coding-agent performance and that indexing for large codebases is “orders of magnitude faster”: @cursor_ai. VS Code continues tightening agent UX (e.g., safer command execution explanations): @aerezk, plus MCP servers returning UI via MCP Apps spec (LIFX control panel example): @burkeholland.

Document AI & multimodal systems: DeepSeek-OCR 2 and “Agentic Vision”

  • DeepSeek-OCR 2: learned reading order + token compression: DeepSeek-OCR 2 is framed as a shift from fixed raster scans to learned Visual Causal Flow with DeepEncoder V2, including 16× visual token compression (256–1120 tokens/image) and 91.09% OmniDocBench v1.5 (+3.73%); vLLM shipped day-0 support: @vllm_project. Unsloth notes similar headline improvements: @danielhanchen.
  • Mechanistic intuition (why it matters for pipelines): Jerry Liu provides a clear “why learned order helps” explanation: avoid semantically shredding tables/forms by allowing query tokens to attend to contiguous regions instead of strict left-to-right: @jerryjliu0. Teortaxes adds a pragmatic eval take: OCR 2 is “on par with dots.ocr” and “nowhere near SOTA,” but the ideas may influence later multimodal products: @teortaxesTex.
  • Gemini “Agentic Vision” = vision + code execution loop: Google is productizing a “Think, Act, Observe” loop where the model writes/executes Python to crop/zoom/annotate images, claiming 5–10% quality boosts across many vision benchmarks: @_philschmid and the official thread: @GoogleAI. This is an explicit move toward tool-augmented vision being first-class, not bolted on.
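A minimal sketch of the "Think, Act, Observe" loop described above, with a stub policy in place of the model: the loop repeatedly executes a crop over the image to zoom toward a region of interest, then treats the cropped view as the new observation. The quadrant-picking heuristic and list-of-lists "image" are illustrative stand-ins for real model-generated Python over real images.

```python
def crop(image, top, left, height, width):
    # Act: execute a crop, as the model's generated Python would.
    return [row[left:left + width] for row in image[top:top + height]]

def think_act_observe(image, max_steps=3):
    # Think-Act-Observe loop with a stub policy: zoom toward the brightest
    # quadrant, observe the crop, stop once the region is small enough.
    for _ in range(max_steps):
        h, w = len(image), len(image[0])
        if h <= 2 and w <= 2:
            break  # Think: region small enough to "read", stop zooming
        # Think: pick the quadrant with the highest total intensity
        quads = [
            crop(image, 0, 0, h // 2, w // 2),
            crop(image, 0, w // 2, h // 2, w - w // 2),
            crop(image, h // 2, 0, h - h // 2, w // 2),
            crop(image, h // 2, w // 2, h - h // 2, w - w // 2),
        ]
        image = max(quads, key=lambda q: sum(sum(r) for r in q))
        # Observe: the cropped view becomes the next iteration's context

    return image

# An 8x8 "image" with a bright 2x2 patch in the bottom-right corner.
img = [[0] * 8 for _ in range(8)]
img[6][6] = img[6][7] = img[7][6] = img[7][7] = 9
zoomed = think_act_observe(img)
```

Two zoom steps isolate the bright patch, mirroring how a tool-augmented vision model crops its way to fine detail instead of answering from the full-resolution view in one shot.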

AI for science & research workflows: OpenAI Prism as “Overleaf with AI”

  • Prism launch: OpenAI introduced Prism, a free “AI-native workspace for scientists” powered by GPT-5.2, positioned as a unified LaTeX collaboration environment: @OpenAI and @kevinweil. Community summaries frame it as “Overleaf with AI” (proofreading, citations, literature search): @scaling01.
  • Data/IP clarification: Kevin Weil clarified that Prism follows your ChatGPT data controls and that OpenAI is not taking a share of individual discoveries; any IP-alignment deals would be bespoke for large orgs: @kevinweil.
  • Why it matters technically: Prism is a product bet that collaboration context + tool integration (LaTeX, citations, project state) becomes a durable advantage—mirroring the “context > intelligence” theme circulating in Chinese discussions about OpenAI infra and org design: @ZhihuFrontier.

Research notes & benchmarks worth tracking (RL, planning, multilingual scaling)

  • Long-horizon planning benchmark: DeepPlanning proposes verifiable-constraint planning tasks (multi-day travel, shopping) and reports frontier agents still struggle; emphasizes explicit reasoning patterns and parallel tool use: @iScienceLuvr. (This pairs nicely with the “travel planning again” meme: @teortaxesTex.)
  • RL efficiency and reuse of traces: PrefixRL idea—condition on off-policy prefixes to speed RL on hard reasoning, claiming 2× faster to same reward vs strong baseline: @iScienceLuvr.
  • Multilingual scaling laws: Google Research announced ATLAS scaling laws for massively multilingual LMs with data-driven guidance on balancing data mix vs model size: @GoogleResearch.
  • Math research reality check: Epoch’s FrontierMath: Open Problems benchmark invites attempts; “AI hasn’t solved any of these yet”: @EpochAIResearch.

Top tweets (by engagement)

  • OpenAI launches Prism (AI LaTeX research workspace): @OpenAI
  • Moonshot founder video introducing Kimi K2.5: @Kimi_Moonshot
  • Kimi “video-to-code” website cloning demo: @KimiProduct
  • Ollama: Kimi K2.5 on Ollama cloud + integrations: @ollama
  • Claude generating 3Blue1Brown-style animations claim (education impact): @LiorOnAI
  • Figure introduces Helix 02 autonomous whole-body robotics control: @Figure_robot

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. New Model and Benchmark Releases

  • Introducing Kimi K2.5, Open-Source Visual Agentic Intelligence (Activity: 643): Kimi K2.5 is an open-source visual agentic intelligence model that achieves global state-of-the-art (SOTA) performance on agentic benchmarks, with scores of 50.2% on the HLE full set and 74.9% on BrowseComp. It also leads in open-source vision and coding benchmarks, scoring 78.5% on MMMU Pro, 86.6% on VideoMMMU, and 76.8% on SWE-bench Verified. The model introduces an Agent Swarm feature in beta, allowing up to 100 sub-agents to work in parallel, making 1,500 tool calls and operating 4.5× faster than a single-agent setup. Kimi K2.5 is available in chat and agent modes on kimi.com, with additional resources on Hugging Face. A comment highlights the impressive capability of 100 sub-agents working in parallel, suggesting potential for enhanced performance in coding tasks. Another comment notes the banning of the original poster, raising questions about account authenticity.

    • Asleep_Strike746 highlights the impressive capability of Kimi K2.5 to run 100 sub-agents in parallel, suggesting potential for complex task execution, such as coding tasks. This parallelism could significantly enhance performance in multi-threaded environments, making it a powerful tool for developers looking to automate or optimize workflows.
    • illusoryMechanist points out the scale of Kimi K2.5, citing its 1T total parameters with 32B activated, indicating substantial computational capacity. This suggests that Kimi K2.5 could handle large-scale data processing and complex problem-solving tasks, positioning it as a competitive player in the open-source AI landscape.
    • Capaj shares a practical test of Kimi K2.5 by prompting it to generate an SVG of a fox on a unicycle. The result was described as ‘not too bad’, implying that while the model can handle creative tasks, there might still be room for improvement in terms of output quality or creativity. This kind of testing is crucial for understanding the model’s capabilities in real-world applications.
  • Jan v3 Instruct: a 4B coding Model with +40% Aider Improvement (Activity: 333): The image is a bar chart titled “Aider Benchmark” that illustrates the performance of various coding models in terms of their pass rate for polyglot code editing. The “Jan-v3-4B-base-INSTRUCT” model leads with a score of 18, significantly outperforming other models like “Qwen3-4B-THINKING-2507” with 12.1 and “Ministral-3-8B-INSTRUCT-2512” with 6.8. This highlights the Jan-v3 model’s high efficiency and over 40% improvement in performance, showcasing its enhanced capabilities in coding tasks. The model is designed for improved math and coding performance, making it a strong candidate for lightweight assistance and further fine-tuning. One commenter appreciates the Qwen 4B 2507 model for small tasks, noting its impressive performance despite its size. Another user shares mixed experiences with the Jan model, praising its ability to use search tools effectively but noting occasional tool call failures and odd responses, possibly due to system prompts.

    • The Jan v3 Instruct model, a 4 billion parameter coding model, reportedly achieves a 40% improvement in performance with the Aider benchmark. This suggests significant advancements in its ability to handle coding tasks, potentially outperforming other models like Qwen 4B 2507 in specific scenarios. The model’s ability to utilize search tools effectively for code explanation is noted, although there are occasional failures in tool calls and some system prompt issues in web chat applications.
    • A user reported mixed experiences with the Jan v3 model on chat.jan.ai, highlighting its capability to correctly use search tools and read code for explaining project flows. However, they also noted some tool call failures and irrelevant responses, possibly due to system prompts. The user expressed interest in the model’s potential integration with Claude Code, suggesting it could become a valuable tool for code search and Q&A in daily coding tasks.
    • The Jan v3 model’s performance in benchmarks is highlighted, with a specific mention of its demo availability at chat.jan.ai. The model’s ability to handle small and easy tasks effectively is compared to Qwen 4B 2507, which is favored for similar tasks. The discussion suggests that Jan v3’s fine-tuning may offer competitive advantages in certain coding scenarios.
  • deepseek-ai/DeepSeek-OCR-2 · Hugging Face (Activity: 385): DeepSeek-OCR-2 is a state-of-the-art OCR model available on Hugging Face, optimized for document processing with visual causal flow. It requires Python 3.12.9 and CUDA 11.8, and leverages libraries like torch and transformers. The model supports dynamic resolution and uses flash attention for enhanced performance on NVIDIA GPUs. It offers various prompts for document conversion, making it versatile for different OCR tasks. One user highlighted the impressive performance of PaddleOCR-VL when compared using scores from other models, suggesting its potential superiority. Another user shared a demo of DeepSeek OCR 2, noting initial issues with repetition due to user error, which were resolved by adjusting decoding parameters, leading to significantly improved performance over version 1.

    • A user highlighted the impressive performance of PaddleOCR-VL, suggesting it stands out when compared to other models like B/C/D. This is based on scores reported by a third party, which the user trusts for evaluating model performance. This implies PaddleOCR-VL’s metrics are noteworthy in the context of OCR model comparisons.
    • Another user shared their experience with implementing a demo for DeepSeek OCR 2 using GPU credits. Initially, they faced issues with repetition due to incorrect parameters, but after adjusting to DeepSeek’s recommended decoding parameters, the performance improved significantly. The user noted that the updated version is much more reliable than its predecessor, DeepSeek OCR v1.
    • The GitHub repository and paper for DeepSeek OCR 2 were shared, providing resources for those interested in the technical details and implementation of the model. The paper likely contains in-depth information on the model’s architecture, training process, and performance benchmarks, which are crucial for technical evaluation and understanding.
  • transformers v5 final is out 🔥 (Activity: 503): Transformers v5 from Hugging Face introduces significant performance improvements, particularly for Mixture-of-Experts (MoE) models, achieving 6x-11x speedups. The update simplifies the API by removing slow/fast tokenizers, offering explicit backends and enhanced performance. Additionally, dynamic weight loading is now faster, supporting MoE with quantization, tensor parallelism, and Parameter-Efficient Fine-Tuning (PEFT). A migration guide and detailed release notes are available for users transitioning to this version. One user inquired about the implications of these improvements for running small to medium-sized MoE models locally, suggesting that the enhancements might reduce memory bandwidth constraints. Another user reported a 50% increase in single prompt inference speed and a 100% increase in concurrent inference speed after updating to v5 and vllm 0.14.1.

    • The Mixture-of-Experts (MoE) model in Transformers v5 shows significant performance improvements, with reported speedups ranging from 6x to 11x. This is particularly relevant for users running models locally, as it suggests that MoE can now utilize compute resources more efficiently, potentially reducing memory bandwidth constraints. This could be beneficial for setups using NVIDIA GPUs or AMD iGPUs, such as the Strix Halo, where compute power is a limiting factor.
    • A user reported upgrading to Transformers v5 and vllm 0.14.1 from 0.11, resulting in a 50% increase in single prompt inference speed and a 100% increase in concurrent inference speed for 40x workloads. This highlights the significant performance enhancements in the latest version, which could be crucial for applications requiring high throughput and low latency.
    • The update in Transformers v5 now allows Mixture-of-Experts (MoE) models to work with quantized models, which was not possible before. This advancement enables more efficient model deployment by reducing the model size and computational requirements, making it feasible to run complex models on less powerful hardware without sacrificing performance.

2. Local LLM Hardware and Setup Discussions

  • 216GB VRAM on the bench. Time to see which combination is best for Local LLM (Activity: 577): The post discusses the use of secondhand Tesla GPUs, which offer substantial VRAM at a lower cost, for local large language model (LLM) testing. The author has developed a GPU server benchmarking suite to evaluate the performance of these older GPUs when used in parallel. The image shows a setup with multiple NVIDIA GPUs, highlighting the focus on maximizing VRAM for machine learning tasks. The technical challenge lies in effectively utilizing these GPUs without significant bandwidth loss, as most affordable server motherboards support only a limited number of GPUs. Commenters express skepticism about the practicality of using older Tesla GPUs due to potential issues with token processing speed and cooling requirements. There is interest in how the author manages to connect multiple GPUs without bandwidth loss, and a suggestion that newer systems like DGX Spark might offer better performance for certain tasks.

    • HugoCortell raises a technical concern about the bandwidth limitations when connecting multiple GPUs to a single PC, noting that most affordable server motherboards support only a few GPUs. This could lead to a significant loss in bandwidth, which is crucial for efficient parallel processing in local LLM setups.
    • BananaPeaches3 highlights a critical performance issue with older GPUs, particularly in handling large system prompts. They mention that while token generation speed might be acceptable, the prompt processing time can be a bottleneck, especially with prompts as large as 15k tokens. This suggests that newer systems like the DGX Spark might be more efficient despite slightly slower token generation speeds, due to faster prompt processing capabilities.
    • FullOf_Bad_Ideas points out a limitation in the gpu_box_benchmark, which does not test for serving large models split across multiple GPUs. This is a significant use case for setups with high VRAM, indicating a gap in the benchmark’s ability to evaluate real-world performance for large-scale LLM applications.

3. Teasers and Announcements from AI Labs

  • The Qwen Devs Are Teasing Something (Activity: 331): The image is a tweet from Tongyi Lab featuring an ASCII art face and a lightning bolt emoji, hinting at an upcoming announcement. The Reddit community speculates that this could be related to a new visual language model, possibly named Z-Image, which has been mentioned in recent ComfyUI pull requests. The timing of the announcement might be strategically planned before the Chinese New Year, aligning with other anticipated releases like K2.5 and potentially q3.5, dsv4, and mm2.2. Commenters are speculating that the announcement is related to the Z-Image model, which has been referenced in recent updates to ComfyUI. There is also a discussion about the timing of the release, suggesting it might be aligned with the Chinese New Year.

    • The mention of ‘Z-Image’ in ComfyUI PRs suggests a potential new feature or model update related to image processing. This aligns with recent updates where hidden items have been added to collections, indicating ongoing development and testing phases.
    • There is speculation about the release of several models and updates before the Chinese New Year, including K2.5, q3.5, dsv4, and mm2.2. This timing is strategic as many labs aim to release updates before the holiday break, which falls on February 17th this year.
    • A user speculates about the release of ‘Qwen4 Next 48B A3B’, which could imply a new model or version with specific parameters, possibly indicating advancements in model architecture or capabilities.
  • Minimax Is Teasing M2.2 (Activity: 322): The image is a tweet from MiniMax teasing an update to their AI model, M2.2, suggesting an imminent release with the phrase “M2.1 slays. M2.2 levels up. #soon.” This indicates a potential upgrade in capabilities or performance from the previous version, M2.1. The context suggests a competitive landscape in AI development, particularly among Chinese labs, with other models like Deepseek v4 and Kimi K3 also expected soon. The mention of ByteDance’s potential closed-source model adds to the competitive tension in the AI space. One comment suggests a shift in focus towards agentic Mixture of Experts (MoEs) models, potentially at the expense of updates to traditional 32B models. Another user expresses anticipation for the new model, highlighting the effectiveness of MiniMax 2.1 in combination with glm 4.7 for coding tasks, and the potential impact of the upcoming versions.

    • Loskas2025 highlights the use of Minimax 2.1 and GLM 4.7 for coding, noting their excellence. They anticipate that the upcoming Minimax 2.2 and GLM 5, which is currently in training, could significantly enhance performance, suggesting a potential shift in the landscape of coding models.
    • CriticallyCarmelized compares Minimax favorably to GLM 4.7, even at high quantization levels, indicating that Minimax is competitive in terms of performance. They express optimism that the new version could surpass current models, potentially becoming their preferred choice for local deployment.
    • lacerating_aura mentions speculation around ‘giga-potato’ being associated with DS4, but points out the lack of concrete evidence for the existence of DS4 or Kimi K3, indicating a gap in confirmed information about these models.
  • I built a “hive mind” for Claude Code - 7 agents sharing memory and talking to each other (Activity: 422): The post describes a multi-agent orchestration system for Claude Code, featuring 7 specialized agents (e.g., coder, tester, reviewer) that coordinate tasks, share persistent memory using SQLite + FTS5, and communicate via a message bus. The system runs as an MCP server and integrates with Anthropic, OpenAI, or Ollama. It uses a task queue for priority-based coordination, allowing agents to pass context and collaborate effectively. The stack includes TypeScript, better-sqlite3, MCP SDK, and Zod. The project is experimental, MIT licensed, and available on GitHub. A comment questions the similarity to the bmad method, suggesting potential overlap in approach. Another comment humorously questions whether the agents agree with each other, hinting at the complexity of multi-agent consensus.

    • The project is compared to the BMAD method, which also involves multi-agent systems. The commenter is curious about the differences, suggesting that the approach might be similar in terms of agents sharing memory and communication protocols.
    • A reference is made to Microsoft’s Autogen, which was released over two years ago as a solution for multi-agent systems. The commenter suggests exploring this resource for potential new ideas, indicating that the concept of multi-agent communication and shared memory is not new and has been explored by major tech companies.
    • The choice of using Claude Code is questioned, with a suggestion to consider open-source alternatives. This implies a debate on the benefits of proprietary versus open-source platforms for developing multi-agent systems, hinting at potential advantages in community support and collaboration in open-source projects.
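The shared-memory design described in the post above (SQLite + FTS5 as a persistent, searchable store that all agents read and write) can be sketched in a few lines of Python; the original project uses TypeScript with better-sqlite3, and the schema here is an illustrative assumption.

```python
import sqlite3

# In-memory DB standing in for the shared store; FTS5 provides full-text search.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE memory USING fts5(agent, note)")

def remember(agent, note):
    # An agent writes a note into the shared memory.
    db.execute("INSERT INTO memory (agent, note) VALUES (?, ?)", (agent, note))

def recall(query):
    # Any agent can full-text search everything every agent has written,
    # ordered by FTS5's built-in relevance ranking.
    return db.execute(
        "SELECT agent, note FROM memory WHERE memory MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()

remember("coder", "refactored the auth module to use JWT tokens")
remember("tester", "auth integration tests pass after refactor")
remember("reviewer", "flagged missing rate limiting on login endpoint")
hits = recall("auth")
```

Each agent writes through `remember` and searches the whole pool with `recall`; FTS5's `MATCH` handles tokenized queries, so the coder's and tester's notes about "auth" surface for any other agent that asks.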

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. Kimi K2.5 and Open Source AI Model Releases

  • Open source Kimi-K2.5 is now beating Claude Opus 4.5 in many benchmarks including coding. (Activity: 597): Kimi-K2.5, an open-source model, is reportedly outperforming Claude Opus 4.5 in several benchmarks, notably in coding tasks. However, the specifics of these benchmarks and the extent of the performance gains are not detailed in the post. The claim suggests a significant advancement in open-source AI capabilities, but lacks comprehensive data to substantiate the comparison. Commenters express skepticism about the claim, highlighting that benchmarks may not fully represent real-world performance. They question the validity of the term ‘many’ benchmarks and suggest that the practical utility of Kimi-K2.5 compared to Claude Opus 4.5 remains unproven.

    • There is skepticism about the claim that Kimi-K2.5 is outperforming Claude Opus 4.5 in real-world applications, despite benchmark results. Users argue that benchmarks often don’t reflect practical utility, especially in complex tasks like programming where Opus 4.5 might excel in providing solutions in a single prompt.
    • The discussion highlights a common critique of benchmarks: they may not capture the full capabilities of a model in practical scenarios. Some users express doubt about the claim that Kimi-K2.5 surpasses Opus 4.5, questioning the specific benchmarks and real-world applicability, especially in coding tasks where Opus 4.5 is perceived to have an edge.
    • One user claims significant practical success with Kimi-K2.5, stating it has replaced reports in a major company, suggesting that at least in some contexts, Kimi-K2.5 may offer substantial utility. This contrasts with the general skepticism about benchmarks translating to real-world performance.
  • Kimi K2.5 Released!!! (Activity: 1149): The image presents a performance comparison chart for the newly released Kimi K2.5, which is claimed to set a new state-of-the-art (SOTA) in agentic tasks. The chart compares Kimi K2.5 against other models like GPT-5.2 (xhigh), Claude Opus 4.5, and Gemini 3 Pro across various tasks, including agents, coding, image, and video tasks. Kimi K2.5 is highlighted as leading in several categories, notably “Agents: BrowseComp” and “Image: OmniDocBench 1.5”, suggesting its superior performance in these areas. The release is accompanied by a blog post detailing the advancements (link). Commenters express skepticism about the benchmarks, questioning if they are cherry-picked, and discuss the model’s performance in hallucination and instruction-following tests. One user notes that Kimi K2.5, while improved, still outputs incorrect answers confidently, similar to other models like Gemini 3, which also confidently provides incorrect answers. GPT-5.1 and 5.2 are noted for admitting “I don’t know” in similar tests, highlighting ongoing challenges with hallucinations in AI models.

    • A user conducted a test on Kimi K2.5’s ability to follow instructions by asking it to identify a specific math contest problem without web search. The model listed out hallucinated contest problems and second-guessed itself, ultimately providing an incorrect answer. This is seen as a slight improvement over Kimi K2, which failed to follow instructions and timed out. In comparison, Gemini 3 also confidently provided incorrect answers, while GPT 5.1 and 5.2 were the only models to admit ‘I don’t know’.
    • The concept of an ‘agent swarm’ in Kimi K2.5 is intriguing, with speculation that it involves over 100 instances of the model being directed by a single overseeing instance. This setup is expected to be expensive, and there is curiosity about whether it could be a single model handling multiple tasks simultaneously, which would represent a significant advancement. The idea of scaffolding, where multiple models work together, seems more plausible to some users.
    • There is skepticism about the benchmarks used to compare Kimi K2.5 with other models like Gemini 3. A user questions whether the benchmarks are cherry-picked, expressing doubt that Kimi K2.5 consistently outperforms Gemini 3, which seems unlikely given the current state of model capabilities.
  • Sir, the Chinese just dropped a new open model (Activity: 1915): Kimi has released an open-source trillion-parameter vision model that reportedly matches the performance of Opus 4.5 on several benchmarks. This model is significant due to its scale and the claim of competitive performance, which is notable given the typically high cost and complexity associated with such large models. The release could impact the landscape of AI vision models, especially in terms of accessibility and cost-effectiveness. There is skepticism in the community about the true performance of Chinese models, with some users suggesting that while they are cost-effective, they may not genuinely match the capabilities of models like Claude, GPT, or Gemini despite benchmark claims.

    • Tricky-Elderberry298 highlights the limitations of relying solely on benchmarks for evaluating LLMs, drawing an analogy to evaluating cars based only on engine specs. They argue that real-world usage, such as how models like Claude and Kimi K2.5 perform in complex projects, is a more meaningful measure of capability than pure benchmark scores.
    • Durable-racoon discusses Kimi K2.5’s unique capabilities, claiming it can orchestrate 500 agents simultaneously and convert videos into working software UI prototypes. They also say it beats Opus at creative writing, while acknowledging that Kimi K2.5 is more expensive than most Chinese models at $0.60/$3 input/output.
    • DistinctWay9169 points out that many Chinese models, such as Minimax and GLM, are often ‘bench maxed,’ meaning they perform well on benchmarks but may not match the real-world performance of models like Claude, GPT, or Gemini. This suggests a discrepancy between benchmark results and actual usability or effectiveness in practical applications.
  • Gemini 3 finally has an open-source competitor (Activity: 168): The image is a comparison chart that highlights the performance of the newly released Kimi K2.5 vision model against other prominent models like Gemini 3 Pro. According to the chart, Kimi K2.5 performs competitively, often surpassing Gemini 3 Pro in various benchmarks such as “Humanity’s Last Exam,” “BrowseComp,” and “OmniDocBench 1.5.” This positions Kimi K2.5 as a strong open-source alternative to the closed-source Gemini 3 Pro, challenging its dominance in the field. Some users express skepticism about Kimi K2.5’s real-world performance compared to Gemini 3 Pro, with comments suggesting that while the benchmarks are impressive, practical performance may not match up. There is also a sentiment that open-source models may struggle to compete with large, closed-source companies.

    • MichelleeeC highlights a significant performance gap between the open-source competitor and Gemini 3, particularly when tested on niche topics without search engine assistance. This suggests that the open-source model may lack the comprehensive training data or fine-tuning that Gemini 3 benefits from, impacting its ability to provide accurate answers in specialized areas.
    • Old_Technology3399 and Just_Lingonberry_352 both express that the open-source competitor is notably inferior to Gemini 3. This consensus indicates that while the open-source model may be a step towards democratizing AI, it still falls short in terms of performance and reliability compared to established, closed-source models like Gemini 3.
    • ChezMere’s comment about ‘benchhacking’ suggests skepticism about the open-source model’s real-world performance versus its benchmark results. This implies that while the model might perform well in controlled tests, it may not translate to effective real-world application, highlighting a common issue in AI model evaluation.
  • Enterprise-ready open source/Chinese AIs are poised to out-sell American proprietary models. Personal investors take note. (Activity: 30): The post highlights the competitive edge of open-source and Chinese AI models over American proprietary models in niche domains, emphasizing their cost-effectiveness and comparable performance. Notable models include DeepSeek-V3 / R1, which ranks #1 on MATH-500 and LiveCodeBench, and Qwen3-Max / Coder from Alibaba, which excels in LMSYS Chatbot Arena and MMLU-Pro. These models offer significantly lower costs per million tokens compared to proprietary models like OpenAI’s GPT-5.2 and Claude 4.5 Sonnet, with input costs as low as $0.15 to $0.60 per million tokens, compared to proprietary costs starting at $3.00. The post suggests that personal investors should consider these developments as Chinese firms issue IPOs, with a16z noting that 80% of startups pitching them use Chinese open-source AI models. A comment questions whether Kimi K2 is superior to GLM 4.7, indicating a debate on the relative performance of these models in specific contexts.

    • The discussion compares Kimi K2 with GLM 4.7: Kimi K2 is noted for its efficiency on specific tasks and may outperform GLM 4.7 on certain benchmarks, while GLM 4.7 may excel in other areas, so the choice depends on the use case. The thread’s takeaway is to evaluate models on task-specific performance metrics rather than general claims of superiority.

2. Gemini AI Studio and Usage Limitations

  • Gemini AI Studio is basically unusable now. Any other LLMs with a 1M context window? (Activity: 162): Gemini AI Studio has become less viable for users due to Google’s reduction in daily prompt limits, impacting workflows that rely on its 1 million token context window. Users working with extensive documents and conversations are seeking alternatives. Notably, Grok 4.1 offers a 2 million token context window, and Claude Sonnet 4.5 provides a 1 million token context window within the Kilo Code environment. These alternatives may serve users needing large-context capabilities. Some users suggest that effective CLI tools like Claudie-cli or codex-cli can mitigate the need for massive context windows by efficiently managing and retrieving information from large texts.

    • Coldshalamov mentions that Grok 4.1 fast offers a 2M context window, which is double the size of the 1M context window being discussed. This suggests that Grok 4.1 fast could be a viable alternative for those needing larger context windows.
    • Unlucky_Quote6394 highlights that Claude Sonnet 4.5 provides a 1M context window when used within Kilo Code, indicating another option for users seeking large context capabilities.
    • Ryanmonroe82 suggests embedding documents as an alternative to using cloud models, implying that this method could be more efficient and effective for handling large text data without relying on extensive context windows.
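Ryanmonroe82’s embed-and-retrieve suggestion can be illustrated without any cloud model. Below is a minimal stdlib-only sketch: the bag-of-words “embedding” and cosine scoring are toy stand-ins for a real embedding model, chosen only to show the retrieve-relevant-chunks pattern instead of stuffing everything into a million-token context window.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding": a sparse term-frequency vector.
    # A real pipeline would call an embedding model here instead.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(chunks, query, k=2):
    # Return only the k most relevant chunks, rather than pushing
    # every document into the model's context window.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)[:k]

chunks = [
    "Gemini AI Studio reduced its daily prompt limits this week.",
    "Grok 4.1 fast offers a 2M token context window.",
    "Claude Sonnet 4.5 has a 1M context window inside Kilo Code.",
]
print(retrieve(chunks, "daily prompt limits", k=1))
```

Swapping `embed` for a real embedding model (plus a vector index) turns this into standard retrieval-augmented generation.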
  • 32,768 or (2^15) tokens in hot memory — Gemini has been PURPOSELY THROTTLED by Alphabet and been made into a bait and switch. Gemini Pro is WORSE than the free version as of TODAY. They market over a million tokens for Pro users. This is fraud. (Activity: 858): The Reddit post claims that Alphabet has intentionally capped Gemini Pro at 32,768 tokens, far below the advertised capacity of over a million. This throttling allegedly makes Gemini Pro less effective than the free version. The post also claims the Ultra and Enterprise tiers have a hard cap of 131,072 tokens despite advertising up to 2 million, and the author worries the limitation could drive users away, especially with potential integration into Siri. Commenters express dissatisfaction with Gemini’s performance, comparing it unfavorably to older models like GPT-3. There is also criticism of the memory management, with claims that it leads to data inaccuracies and inefficiencies.

    • Substantial_Net9923 highlights a significant issue with Gemini’s memory management, noting that the model’s memory loss due to indexing is problematic. This inefficiency is particularly evident in quantitative finance trading discussions, where the model is reportedly generating inaccurate data more frequently than before, suggesting a decline in reliability.
    • klopppppppp observes a drastic decline in Gemini’s performance, comparing it to older models like GPT-3. Despite this, they note that Gemini still performs exceptionally well in ‘deep research mode,’ indicating that the model’s capabilities might be context-dependent or throttled in certain scenarios.
    • SorryDistribution604 expresses frustration with Gemini’s recent performance, likening it to older models such as GPT-3. This suggests a perceived regression in the model’s capabilities, which could be due to throttling or other limitations imposed on the Pro version.
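The caps cited in the post are exact powers of two, which is consistent with configured hard limits rather than organic behavior. A quick sanity check (all figures are the post’s claims, not independent measurements):

```python
# Figures below are the Reddit post's claims, not verified measurements.
pro_cap = 32_768          # claimed hot-memory cap for Gemini Pro
ultra_cap = 131_072       # claimed hard cap for Ultra/Enterprise
advertised_pro = 1_000_000
advertised_ultra = 2_000_000

# Both caps are exact powers of two, typical of configured limits.
assert pro_cap == 2 ** 15 and ultra_cap == 2 ** 17

print(f"Pro delivers {pro_cap / advertised_pro:.1%} of the advertised window")
print(f"Ultra delivers {ultra_cap / advertised_ultra:.1%}")
```

If the claimed numbers are accurate, Pro users would be getting roughly 3% of the marketed context, and Ultra roughly 7%.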
  • About the recent AI Studio Limit Downgrade: (Activity: 660): The image is a notification from the Gemini API about a reduction in free usage limits for AI Studio users, suggesting a transition to using an API key for continued access. It indicates that these limits may decrease further over time, and mentions ongoing efforts to integrate with Google AI Pro/Ultra to share limits within AI Studio. This change reflects a broader trend of tightening access to free AI resources, potentially impacting developers relying on these tools for experimentation and development. Commenters express frustration over the reduction in free usage limits, noting that Gemini’s performance in following instructions has also declined. There is a sentiment that these changes are detrimental to AI Studio’s utility, as users feel they are receiving less value and functionality.

    • trashyslashers highlights a significant issue with the Gemini model’s performance, noting that it is ‘getting worse at listening to instructions.’ This suggests a degradation in the model’s ability to follow user commands, which is compounded by the reduction in daily usage limits. Users are forced to ‘rewrite and regenerate’ requests, indicating inefficiencies in the model’s processing capabilities.
    • Decent_Ingenuity5413 raises concerns about the stability and reliability of AI Studio’s service, drawing parallels to OpenAI’s past issues with unexpected changes. The comment also points out a critical billing issue with the Gemini API, where users have experienced ‘massive overbilling’ due to token counting errors, leading to charges exceeding $70,000. This highlights a significant flaw in the billing system that could deter average consumers from using the API.
    • Sensitive_Shift1489 expresses frustration over the perceived downgrading of AI Studio in favor of other Google AI products like Gemini App and CLI. The comment implies that these changes are part of a broader strategy to shift focus and resources, potentially at the expense of AI Studio’s quality and user satisfaction.

3. Qwen Model Performance and Applications

  • Qwen3-Max-Thinking - Comparable performance to Commercial Models (Activity: 40): Qwen3-Max-Thinking is an AI model that claims to offer performance comparable to commercial models, focusing on enhanced reasoning and decision-making capabilities. The model’s architecture and training methodologies are designed to improve efficiency and accuracy in complex tasks, as detailed in the original article. However, users have reported issues with the model’s agentic code mode, which fails to compile, potentially impacting its usability. One user expressed skepticism about the model’s usability due to the compilation issues, while another hoped that Qwen3-Max-Thinking could help reduce the cost of commercial models.

  • Qwen model. We get it! Qwen-3-max-thinking (Activity: 26): The post announces the release of the Qwen-3-max-thinking model, which is expected to be available this week. This model is noted for its enhanced features, although specific details about these enhancements are not provided in the post. The mention of ‘P.S. We got it’ suggests that the model is already accessible to some users. One commenter questions whether the model has been available since October, indicating possible confusion or overlap with previous releases. Another asks if ‘OS’ is being referred to, suggesting a potential misunderstanding or need for clarification on whether the model is open-source.

  • 3 Billion tokensEvaluate my token usage? (Am I the most loyal user of QWEN3-MAX?) (Activity: 20): The post discusses a significant usage of the QWEN3-MAX language model, with the user consuming 3-4 billion tokens per day. This high usage has led to DAMO Academy granting additional concurrency and early access to the upcoming Qwen3.5-MAX. The user attributes a drop in usage to the weekend, indicating a consistent high demand otherwise. The post highlights the model’s effectiveness, with the user describing it as the ‘best LLM in the world’. Comments reveal a mix of curiosity and comparison, with one user noting their own high token consumption of 4 billion using a local model from the QWEN series. Another user shares a positive experience with the model’s ability to optimize website copywriting, though they express concerns about accessing the model for coding tasks.

    • Available-Craft-5795 mentions using 4 billion tokens with the QWEN series, indicating a high level of engagement with these models. This suggests that the QWEN series is capable of handling large-scale token processing, which could be beneficial for extensive applications such as data analysis or large-scale content generation.
    • Remarkable_Speed1402 discusses using the new model for optimizing website homepage copywriting, noting its effectiveness. However, they express concerns about the model’s coding capabilities, as they are unable to access it in their IDE. This highlights potential limitations in integrating the model with development environments, which could impact its usability for coding tasks.
  • Benchmark of Qwen3-32B reveals 12x capacity gain at INT4 with only 1.9% accuracy drop (Activity: 10): The benchmark of Qwen3-32B on a single H100 GPU demonstrates a significant capacity gain when using INT4 quantization, achieving a 12x increase in user capacity compared to BF16, with only a 1.9% drop in accuracy. The study involved over 12,000 MMLU-Pro questions and 2,000 inference runs, showing that INT4 can support 47 concurrent users at a 4k context, compared to just 4 users with BF16. The full methodology and data are available here. A comment raised a question about the model’s performance in coding tasks, suggesting interest in how quantization affects specific application areas beyond general benchmarks.

    • The discussion focuses on the performance of the Qwen3-32B model when quantized to INT4, highlighting a significant 12x increase in capacity with a minimal 1.9% drop in accuracy. This suggests that the model maintains high performance even with aggressive quantization, which is crucial for deploying large models in resource-constrained environments. However, the impact on specific tasks like coding remains a point of interest, as quantization can affect different tasks in varying ways.
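A back-of-envelope calculation shows where most of the INT4 headroom comes from. The parameter count and byte widths are standard, the 80 GB figure is the H100’s public HBM capacity, and everything else (runtime overhead, exact KV-cache sizing) is ignored, so this sketch explains the direction of the gain, not the exact 12x:

```python
# Back-of-envelope: weight memory for a 32B-parameter model on one 80 GB H100.
# Overheads and exact KV-cache geometry are deliberately ignored.
params = 32e9
hbm = 80e9                      # approximate H100 HBM capacity, bytes

bf16_weights = params * 2.0     # BF16: 2 bytes per parameter
int4_weights = params * 0.5     # INT4: 0.5 bytes per parameter

bf16_headroom = hbm - bf16_weights   # left for KV cache and activations
int4_headroom = hbm - int4_weights

print(f"BF16 weights: {bf16_weights/1e9:.0f} GB, headroom {bf16_headroom/1e9:.0f} GB")
print(f"INT4 weights: {int4_weights/1e9:.0f} GB, headroom {int4_headroom/1e9:.0f} GB")
print(f"KV-cache headroom ratio: {int4_headroom/bf16_headroom:.0f}x")
```

Weight memory alone yields about 4x more KV-cache headroom; the reported 12x capacity gain presumably also reflects batching and scheduling effects that this sketch does not model.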

AI Discord Recap

A summary of Summaries of Summaries by Gemini 3.0 Pro Preview Nov-18

Theme 1. Kimi K2.5 Launch: SOTA Agentic Benchmarks and Swarm Capabilities

  • Kimi K2.5 Crushes Agentic Benchmarks: Moonshot AI released Kimi K2.5, achieving global SOTA on the HLE full set (50.2%) and BrowseComp (74.9%), while posting open-source SOTA on MMMU Pro (78.5%) and SWE-bench Verified (76.8%) Tech Blog. Users across Discords noted the model was “silently rolled out” with significantly improved fact-checking and vision capabilities before the official announcement.
  • Agent Swarm Mode Enters Beta: The release introduces an Agent Swarm feature capable of orchestrating up to 100 sub-agents and executing 1,500 tool calls in parallel, promising a 4.5x performance boost on complex tasks. High-tier users can access this self-directed mode on kimi.com, though early testers noted it consumes tool-call quotas rapidly.
  • Pricing and API Instability Spark Debate: While the model’s capabilities impressed users, the new Kimi Code plan drew criticism for lower limits compared to competitors like Z.ai, with promotional pricing ending in February. Integration with OpenRouter faced initial hiccups, with users reporting errors related to tool use endpoints and image URL handling.
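Moonshot has not published how Agent Swarm is implemented, but the coordination pattern the release notes describe — one orchestrator fanning work out to up to 100 parallel sub-agents and aggregating the results — can be sketched generically. Everything below (the function names, the thread-pool choice, the stub sub-agent) is illustrative, not Kimi’s actual mechanism:

```python
from concurrent.futures import ThreadPoolExecutor

def sub_agent(task: str) -> str:
    # Stand-in for one sub-agent run; a real swarm would call a model
    # endpoint here and let the worker issue its own tool calls.
    return f"result for {task!r}"

def orchestrate(tasks, max_workers=100):
    # Fan the task list out to parallel sub-agents, then fan results
    # back in for aggregation - the pattern behind the claimed
    # wall-clock speedup over running the steps serially.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(sub_agent, tasks))

results = orchestrate([f"step-{i}" for i in range(8)])
print(len(results), results[0])
```

The speedup from such fan-out is bounded by the longest-running sub-agent plus aggregation cost, which is consistent with the “up to 4.5x” framing rather than a linear 100x.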

Theme 2. Hardware Acceleration: Unsloth Speedups, FlagOS, and Kernel Ops

  • Unsloth Accelerates MoE Training by 14x: Unsloth announced that MoE training is now 14x faster than v4, with upcoming optimizations projected to double that speed again for a total 30x boost. The team also rolled out full support for transformers v5, streamlining workflows for users on the latest library versions Announcement.
  • FlagOS Targets Unified AI Stacks: Engineers discussed the introduction of FlagOS, an open-source system software stack designed to unify Model–System–Chip layers for better workload portability across heterogeneous hardware. The project aims to incorporate insights from hardware–software co-design to bridge the gap between ML systems and compilers.
  • Tinygrad Codegens Flash Attention Directly: In the Tinygrad community, members successfully proved the ability to codegen Flash Attention directly from a frontend definition of naive attention using granular rewrites. Simultaneously, discussions highlighted a shift toward Megakernels over traditional kernel schedulers to optimize GPU throughput Luminal Blog.

Theme 3. OpenAI Ecosystem: Prism, GPT-5.2 Performance, and Model Decay

  • Prism Workspace Unlocks Scientific Collaboration: OpenAI launched Prism, a dedicated workspace powered by GPT-5.2 designed to streamline scientific research and writing for ChatGPT personal account holders Video Demo. While the tool targets academic rigor, users debating GPT-5.2 vs. Claude Opus 4.5 noted that OpenAI’s model still struggles with creative writing, a flaw Sam Altman reportedly admitted to.
  • Model Deterioration Blamed on Leechers: A recurring theory across channels suggests significant degradation in ChatGPT and Claude performance, with some users claiming a 40% drop in quality. Speculation points to free tier users (“leechers”) diluting compute resources or models recursively training on their own synthetic outputs.
  • GPT-5 Control Shell Leaked: A file dubbed the GPT-5_Hotfix.md surfaced, purported to be a pre-generation control shell that enforces strict syntax and intent locking to prevent model drift. The leak suggests OpenAI is using aggressive “wrappers” to manage output quality before generation even begins.

Theme 4. Agentic Coding Wars: Tooling, Security, and Rebrands

  • Clawdbot Morphs into Moltbot After Security Scare: Following a trademark dispute with Anthropic and serious community concerns about zero-auth vulnerabilities, the popular agent Clawdbot rebranded to Moltbot Announcement. Users previously flagged that the bot could read environment keys without permission, posing risks to sensitive financial and personal data.
  • Cursor and Cline Face Usability Headwinds: Users expressed frustration with Cursor’s pricing model, noting that a few complex prompts could cost $0.50, while others struggled to run Cline on modest hardware (8GB VRAM), facing CUDA0 buffer errors. Community fixes involved reducing context lengths to 9000 and offloading memory management to dedicated GPU settings.
  • Karpathy Bets on Agent-First Coding: Andrej Karpathy sparked discussion by outlining a strategic shift toward agent-driven coding using Claude, emphasizing the “tireless persistence” of LLMs over traditional methods Post. This aligns with the release of Manus Skills, where developers are incentivized with free credits to build use cases for the new agentic platform.

Theme 5. Theoretical Limits and Safety: Hallucinations and Bio-Risks

  • Math Proves Hallucination is Inevitable: A new paper discussed in the BASI Discord mathematically proves that LLMs will always hallucinate, utilizing the same principles found in jailbreaking mechanics Arxiv Paper. Researchers noted that jailbreaking exacerbates this issue by distorting the context model, preventing it from flagging malicious or incorrect tags.
  • Fine-Tuning Unlocks Dormant Bio-Risks: An Anthropic paper sparked debate at EleutherAI by demonstrating that fine-tuning open-source models on frontier model outputs can unsuppress harmful capabilities, such as biorisks, even if previously safety-trained Arxiv Link. The findings suggest that refusals are fragile and can be undone with minimal compute, raising concerns about dual-use technologies.
  • AI Detection Tools Flag Human Academics: Engineers highlighted a growing issue where AI detection tools consistently mislabel human-written, pre-GPT academic texts as AI-generated. The consensus is that these detectors are fundamentally flawed, yet institutions continue to rely on them, creating friction for researchers and students.

Discord: High level Discord summaries

BASI Jailbreaking Discord

  • LLMs Face Mathematical Jailbreak Reality: A new paper (https://arxiv.org/abs/2409.05746) mathematically proves that LLMs will always hallucinate, using the same principles on which many jailbreaking methods are built.
    • A member warned that jailbreaking models significantly increases their hallucination problems, because jailbreaking shifts and distorts the model’s context, so that it does not flag things that would normally be tagged as malicious and such.
  • GPT-5 Control Shell Surfaces After Hotfix: A member shared a file (GPT5_Hotfix.md) described as a pre-generation control shell for GPT-5, designed to enforce strict syntax, intent locking, and drift prevention before generation begins.
    • The control shell aims to mitigate model drift and enforce intended outputs.
  • Exploring Grok’s Uncensored Image Generation: Users are testing the limits of Grok’s image generator, attempting to jailbreak it for unrestricted content, while others highlight its uncensored nature compared to other models.
    • The discussion also touched on the separation of the image model from the language model, impacting the effectiveness of prompt injection.
  • Clawdbot’s Crawl to Concern: Zero Auth Vulnerabilities: Exploration of Clawdbot’s popularity, and a surge in VPS usage, sparking concerns about zero authentication and potential vulnerabilities.
    • One member plans to set up a home lab to test Clawdbot’s vulnerabilities, noting that vulnerable instances exist.
  • Researchers Ramp Up Quest for Jailbreak Datasets: A researcher is looking for well-known jailbreak datasets that include categorization or labels to assist with ongoing research, specifically malicious prompts.
    • A member responded, “I really don’t know if there are any available that are free”, suggesting the researcher may need to produce and label the prompts themselves.

Unsloth AI (Daniel Han) Discord

  • KV Cache Woes Still Linger: Users are reporting that the KV cache is still not working properly in the latest llama.cpp, potentially causing slowdowns at higher context lengths, despite previous fixes, as seen in this GitHub issue.
    • The discussion suggests that previous fixes may not have fully resolved the underlying problems for all use cases.
  • Unsloth Supercharges Transformers v5: Unsloth now fully supports transformers v5, with a promise of even more optimized training to be released soon, with links to the announcement on X.
    • This upgrade should streamline workflows and improve performance for users leveraging the latest features in the transformers library.
  • MoE Training Rockets to 14x Speed: MoE training is now reported to be 14x faster than v4, with further optimizations expected to double the speed again, potentially resulting in a 30x speedup compared to v4.
    • This significant speed boost could dramatically reduce training times for complex models.
  • Kimi Loses Sass Appeal?: Users discussed the changes to the Kimi model, with one noting it sounds closer to other models by far, suggesting a loss of its unique character after the Kimislop release.
    • Some lamented the loss of Kimi’s smartass personality, preferring its previous tendency to call you out on stuff over becoming more sycophantic.
  • GLM 4.7 Tool’s Blackwell Blues: A user sought help getting GLM-v4.7 to call tools on a Blackwell B200, running into CUDA version issues (driver on CUDA 12.8 vs. a CUDA 13 requirement).
    • Another user provided a uv pip install command set using torch 2.9 and CUDA 13, directing the user to this helpful unsloth.ai documentation, and advised parsing the tool-call arguments with json.loads.
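The json.loads tip reflects a common gotcha: in OpenAI-style tool calling, a model returns its tool arguments as a JSON string, not a parsed object, so they must be decoded before dispatch. A minimal sketch — the dict mirrors the common OpenAI-compatible layout, and the get_weather tool is hypothetical:

```python
import json

# Shape of an OpenAI-compatible tool call: "arguments" arrives as a
# JSON *string*, not a dict, so decode it before dispatching the tool.
tool_call = {
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "arguments": '{"city": "Beijing", "unit": "celsius"}',
    }
}

args = json.loads(tool_call["function"]["arguments"])
print(tool_call["function"]["name"], args["city"], args["unit"])
```

Skipping the json.loads step (or passing the raw string to the tool) is a frequent cause of tool-calling failures in agent loops.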

LMArena Discord

  • Molmo 2 Excels at Video Analysis: The Molmo 2 model excels at object tracking and event pinpointing in videos according to this blog post.
    • Members wondered if the model could be useful for video uploads on the platform.
  • Kimi K2.5 Impresses with Coding and Creativity: Users raved about the Kimi K2.5 model, now in the Text Arena and on HuggingFace, praising its strengths in creative writing, front-end development, and multimodal tasks.
    • Members claimed it is better than Gemini 3 Pro and suggested using the K2 or K2 Thinking model, with one member sharing this tweet.
  • GPT 5.2 and Claude Opus 4.5 Face Off: Members are debating the performance of GPT 5.2 and Claude Opus 4.5, with accuracy being a key point of contention.
    • Some users argued GPT 5.2 is more accurate, while others favor Claude Opus 4.5, stating that “the most smartest & reliable is claude opus 4.5 thinking”.
    • Grok’s Got Game, But Not For Work: Community members discussed the Grok model, agreeing it is “only for chatting” and that its personality and behavior are not suited to professional tasks.
    • Some users pointed out that the free Grok version is different from benchmarked versions, potentially impacting performance.
  • Auto-Modality and Model selector debut in Text Arena!: Auto-Modality and Model selector are now live in LM Arena.
    • Auto-Modality now routes prompts to the correct modality, and the Model selector offers a new design for model selection, as described in the Help Center article.

Perplexity AI Discord

  • Perplexity Pro throttles unlimited access: Several users are reporting unexpected rate limits on their Perplexity Pro accounts, despite the plan supposedly offering unlimited access, severely impacting their workflows.
    • Even basic Pro searches seem to diminish their Labs quota.
  • Perplexity Image Generation fails: Many Pro subscribers are experiencing problems with image generation, either being told they’ve exceeded their limits or facing regional restrictions, showing inconsistency in the service.
    • The unpredictability has many Pro subscribers complaining that they cannot rely on the service.
  • Indian Users see card payment failure: Indian users are facing issues adding Visa/Mastercard debit or credit cards for verification, with every Indian card being rejected.
    • Some users are considering legal action due to these payment method issues.
  • Kagi Search gains traction among frustrated users: Users are discussing Kagi as a potential alternative due to the issues with Perplexity’s instability, highlighting that Kagi’s assistant feature looks promising with access to latest Claude models.
    • One user pointed out that Kagi also offers its own search results and claims to be more privacy-conscious than other search engines.
  • Kimi k2.5 boasts Agent Swarm Mode: With the release of Kimi k2.5, it includes agent swarm mode on kimi.com, a sophisticated tool performing tasks like Claude Code.
    • One user noted the 15 trillion pretraining tokens, triggering immediate excitement about its multimodal abilities vs. Perplexity AI.

Moonshot AI (Kimi K-2) Discord

  • Kimi K2.5 Achieves SOTA on Agentic Benchmarks: Kimi K2.5 launched with global SOTA on Agentic Benchmarks, achieving 50.2% on HLE full set, 74.9% on BrowseComp, and open-source SOTA on Vision and Coding, including 78.5% on MMMU Pro, 86.6% on VideoMMMU, and 76.8% on SWE-bench Verified.
    • Members noticed that Kimi was claiming to use Kimi K2.5, leading to speculation that it was silently rolled out with improved fact-checking and information retrieval capabilities, and multimodal capabilities, like enhanced vision.
  • Kimi K2.5 Introduces Agent Swarm Beta: Agent Swarm (Beta) enables self-directed agents to work in parallel, scaling up to 100 sub-agents and 1,500 tool calls, achieving 4.5x faster performance, available for high-tier users on kimi.com.
    • The Kimi K2.5 launch also integrates image and video to create websites with expressive motion.
  • Pricing and Tiered Access Sparks Debate: The new Kimi Code plan has much lower limits than Z.ai, and users report high tool-call usage, with one noting that one large ish prompt set me back 5 of those 2000 tool calls a week.
    • Several users expressed disappointment that promotional pricing would end in February, deeming the normal monthly price too high to continue supporting Kimi.
  • OpenRouter API Integration Faces Issues: Users reported errors using Kimi K2.5 on OpenRouter, specifically problems related to tool use and image URLs.
    • One user received the error message: No endpoints found that support tool use.
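For context on the tool-use error above: OpenRouter rejects a request whose body includes a `tools` array when the routed provider endpoint does not support tool calling. The payload below shows the general OpenAI-compatible shape that triggers that capability check; the model slug and the calculator tool are illustrative (check OpenRouter’s model listing for the real slug), and no network call is made:

```python
import json

# Illustrative OpenAI-compatible /chat/completions body. The presence
# of the "tools" key is what requires a tool-capable endpoint.
payload = {
    "model": "moonshotai/kimi-k2.5",  # assumed slug, verify on OpenRouter
    "messages": [
        {"role": "user", "content": "What's 2+2? Use the calculator."}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "calculator",  # hypothetical tool for illustration
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    }],
}

print(json.dumps(payload)[:60])
```

Dropping the `tools` key (or selecting a provider that supports tool use) avoids the "No endpoints found that support tool use" response.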
  • Moonshot AI Teases Technical Report: A footnote in the tech blog indicates that full prompts will be provided in the technical report.
    • Members anticipate that a technical report with more info will be released.

OpenAI Discord

  • Prism Debuts for Scientists Powered by GPT-5.2: OpenAI launched Prism, a free workspace that facilitates scientific collaboration, running on GPT-5.2, as shown in this video and accessible at prism.openai.com.
    • The platform, now available to those with a ChatGPT personal account, streamlines scientific endeavors with its advanced capabilities.
  • AI Detection Tools Flag Human-Written Text as AI-Generated: Members have observed that AI detection tools are incorrectly flagging human-written, pre-GPT academic texts as AI-generated content, deeming them fundamentally flawed.
    • This is happening even though universities and job applications are using AI detection tools, despite their demonstrated inaccuracy.
  • High RAM MacBooks accelerate AI inference: Members found that running Ollama and ComfyUI locally works best on machines with lots of RAM such as a MacBook Pro with M2 Max and 96GB RAM, able to run gpt-oss-120b.
    • Others suggested a minimum setup of 16 GB RAM, a Ryzen 5 7000-series or recent-generation i5 CPU, and a capable NVIDIA GPU such as an RTX 3090 with 24 GB VRAM.
  • GPT 5.2 creative writing is sub-par: While comparing Gemini 3 Pro and Claude 4.5 Opus, it was found that GPT 5.2’s creative writing ability was worse.
    • Sam Altman admitted that GPT-5.2 was bad at creative writing saying, OpenAI “just screwed that up.”
  • Rapid Model Deterioration Blamed on Free Leechers: Multiple members expressed concerns that models like ChatGPT and Claude are deteriorating, with one claiming a 40% degradation.
    • Some blame the degradation on free leechers with multi accounts, while another member suggested the degradation is due to models training off model outputs.

Nous Research AI Discord

  • OpenAI Veils Model Identity: Users observed that the specific OpenAI model in use is no longer visible, leading to speculation that OpenAI is optimizing for cost reduction.
    • One user suggested to “Hover over the regenerate symbol in ChatGPT” to reveal the underlying model.
  • Small Models Conquer Large Context Tasks: Opus 4.5 (200K context) outperforms Gemini 3 Pro (1M context) at 130K tokens, suggesting the effective context window is more crucial than its raw size.
    • A paper was cited, highlighting quality degradation in models beyond an 8K context window, reinforcing the idea that “Entropy is not fan of big context, that for sure”.
  • GPT 5.2 Pro’s Pricey Process: The high cost of GPT 5.2 Pro is attributed to a speculative process involving 7 runs for suggestion generation followed by an 8th run for response selection.
    • The process is speculated to utilize parallel reasoning chains, aggregated for the final output.
  • Chinese LLMs Invade the Market: Chinese LLMs like Kimi K2.5 (kimi.com) are entering the market with reports of excellent writing capabilities.
    • Another user speculates that Deepseek is in heavy development and will be the last to be released.
  • MergeMix Melds Mid-Training Data: The paper MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging was shared, pointing out open source efforts to optimize data mixtures during training.
    • The attached image might provide additional context (though its content isn’t specified).

Cursor Community Discord

  • Cursor Costs Half-a-Dollar: One user complained that 3 prompts cost 50 cents, and attached an image of the interaction.
    • This sparked a discussion about the cost-effectiveness of Cursor’s pricing model and whether it aligns with user expectations.
  • Skills are Rules, Indeed: A user asked whether Cursor Rules are still relevant, and a community member clarified that they are called skills now, directing to the Skills documentation.
    • The documentation outlines how users can create and apply Skills to customize and automate various tasks within the editor.
  • Mysterious Blobs Invade Cursor Prompt: A user reported finding odd text in the Cursor prompt box after leaving their PC on overnight, wondering if it was a known bug or chat leakage.
    • Another user suggested that it might be due to accidentally hitting the mic speak to type button, and a third confirmed this by noting that the Whisper model hallucinates when there’s silence.
  • Cursor Flees to the Browser?: A user sought guidance on using Cursor Agent in a browser, asking why it doesn’t work despite having a GitHub repository connected; they were directed to cursor.com/agent.
    • It was not resolved whether Cursor Agent is intended to work that way.
  • Team Spends Big Bucks after Token Top-Up?: A user inquired about an $800 Team Spend limit after their $20 allowance, posting an image.
    • It was not resolved if the team spend limit can be adjusted by the user or if it’s a fixed setting.

LM Studio Discord

  • Qwen Coder Model Debated for Budget Setups: Members discussed the optimal coding model for systems with 8GB VRAM and 32GB RAM, suggesting options like qwen2.5-coder-7b-instruct-q5_k_m and qwen3-coder-30b-a3b-instruct.
    • The qwen3-coder-30b-a3b-instruct model at Q4_K_M was favored for its superior capabilities and 20k context window.
  • Cline’s Coding Stumbles on Smaller Rigs: Users reported challenges using Cline for agentic coding on systems with 8GB VRAM and 32GB RAM, facing CUDA0 buffer allocation errors.
    • The issue was resolved by reducing the context length to 9000 and adjusting CUDA runtime settings.
  • ROCm Runtime Gives LM Studio a Lift on Windows: Installing the ROCm runtime significantly improved performance on Windows for a user with a 6700 XT (12GB VRAM), matching Linux speeds.
    • Compatibility is limited to certain AMD GPUs, as detailed on the AMD website.
  • Users Express Clawdbot Security Jitters: Serious security risks were raised about Clawdbot, highlighted in this YouTube video.
    • Concerns centered on unauthorized access to environment keys and the dangers of granting an agent access to sensitive financial and personal data, noting that it just reads env keys without permission.

Latent Space Discord

  • Kimi K2.5 Cracks Coding with Zero-Shot: The Kimi K2.5 model has launched, showcasing zero-shot coding benchmark successes as shown on its official website.
    • Its capabilities in more complex agentic coding scenarios are still being evaluated.
  • Clawdbot Claws Back Identity as Moltbot: Due to a trademark issue with Anthropic, Clawdbot has been rebranded to Moltbot, with its mascot Clawd now named Molty according to this announcement.
    • The team seems to be taking the change in stride.
  • Karpathy Kasts Agent-First Coding: Andrej Karpathy outlined a strategic move towards agent-driven coding using Claude, highlighting the advantages of LLMs such as tireless persistence and improved leverage in this post.
    • He is betting on agent-driven LLM coding rather than writing code by hand.
  • OpenAI Opens Prism: A Portal for Progress: OpenAI has released Prism, a collaborative research environment for scientists, powered by GPT-5.2, and available to ChatGPT account holders via this portal.
    • This free workspace aims to streamline scientific research.
  • ModelScope morphs Images to Z-Image: ModelScope has released Z-Image, a version of their image generation model based on Scalable Single-Stream DiT, with more details here.
    • The model offers photorealistic quality, diverse outputs, and support for community tools like LoRA and ControlNet, including Z-Image-i2L for single-image style transfer.

GPU MODE Discord

  • FlagOS Stack Targets ML Portability: Tongjie introduced FlagOS, an open-source system software stack intended to unify the Model–System–Chip layers, aiming to enhance the portability of AI workloads across diverse hardware.
    • The project seeks to incorporate insights from discussions on ML systems, compilers, and hardware–software co-design.
  • TorchX Orchestrates Multi-Node GPUs: A member asked whether the TorchX video remains the recommended method for multi-node GPU orchestration.
    • No definitive answer was provided, but this may be a starting point for orchestrating large scale applications.
  • Decart Debuts Lucy 2, Seeks Optimization Engineers: Decart announced Lucy 2, their autoregressive video editing model, sharing a tech report, and is actively hiring engineers to optimize low-latency kernels for real-time video/world models.
    • Decart is seeking engineers with a focus on performance work, GPU Mode submissions, or OSS contributions to help tackle unique perf problems different from LLM inference.
  • Popcorn Preps Fused MoE kernels: A member inquired about benchmarking kernels on B200 hardware via Popcorn for the MLSys2026 hackathon, with a particular interest in fused MoE kernel benchmarking.
    • Another member advised prepping for the team meeting by experimenting with kernel LLM generation for leaderboard problems and exploring the OG popcorn website for potential projects.
  • FlashInfer-Bench Traces Dataset for MLSYS26: A dataset for FlashInfer-Bench development is now available at flashinfer-ai/flashinfer-trace, and a specialized workload dataset for the MLSYS26 contest will be released soon at flashinfer-ai/mlsys26-contest.
    • The team is also developing a biweekly leaderboard to track progress in the competition.

Eleuther Discord

  • Heuristic Test for AI PhD Questions: A member requested question suggestions to gauge the standards for a PhD in AI, and a member suggested the heuristic: “Is this a conversation that two AI researchers might have?”
    • This sparked discussion on what constitutes an insightful question in the field.
  • Teslas Questionable as GPU Farm: A member bought a Tesla GPU for its 24GB VRAM, prompting skepticism about its speed and power efficiency compared to alternatives like a 3090.
    • One member argued that accounting for energy costs, a 3090 would be more economical and efficient for the same AI work.
  • Anthropic Paper Sparks Biorisk Debate: Members discussed the new Anthropic biorisk paper (arxiv link, X link) and its implications, particularly how fine-tuning open-source models on frontier model outputs can substantially increase capabilities.
    • The paper suggests that models can learn harmful capabilities through finetuning or unsuppress them even if safety training had suppressed them, thus supporting the idea that ‘fine tuning can undo some refusals without much compute.’
  • Dynamic LoRA Controller Stabilizes Inference: A member shared a repo for a dynamic LoRA stability controller, with controlled experiments on multi-adapter setups, to address inference-time degradation and adapter interference.
    • The member also highlighted a focus on goal-aligned metrics over emergent benchmarks for evaluating LoRA performance.
  • Parallel Layers vs Sequential Layer Performance: Harry ran a speedrun using parallel layers; results indicate it underperforms the “hackable” baseline at small scales but trends positively towards larger scales, as seen in the attached graph.
    • In the graph, red represents parallel layers and blue sequential layers; the y-axis shows the % change relative to a third, normalized architecture, with a crossing point a little after 10^22 FLOPs.

Yannick Kilcher Discord

  • PyTorch Bug Due to TensorFlow: A PyTorch RAW: Lock blocking error was resolved by uninstalling TensorFlow, highlighting potential conflicts between the two packages.
    • A member joked about the difficulty of filing a bug report, questioning what to even report.
  • HungryLinearFunc Appetite for Scale: A member introduced a HungryLinearFunc class capable of zero initialization at LLM scales, matching a regular linear layer on smaller scales, visualized here.
    • Usage with ReLU is discouraged due to the resulting zero gradient.
  • Cohere Labs Cracks Open Paper Reading Sessions: Cohere Labs is kicking off Paper Reading Sessions, spotlighting Frontier ML Papers Published in January 2026.
    • The sessions cover topics such as reasoning, safety, and real-world applications, and are beginner-friendly and community-focused.
  • Kimi K2.5 takes off: Links to Kimi Moonshot on Twitter and the Kimi K2.5 blogpost were shared.
    • Further conversation ensued about the product roadmap.
  • Clawdbot classified as Scam?: A member sarcastically commented that OpenAI is making a wrapper for their own tool, and that someone already raked in the Clawdbot scam money.
    • The linked image was of a receipt, implying someone made money off of the perceived scam.

tinygrad (George Hotz) Discord

  • Flash Attention Codegenning Now Direct: A member shared that they were able to prove the connection and codegen flash attention directly from a frontend definition of naive attention.
    • The rewrites have gotten a lot more granular since, without a single big online softmax rewrite.
  • Megakernels Crush Kernel Schedulers on GPU: George Hotz linked to a blog post from Luminal discussing compiling models to megakernels.
    • The discussion suggests that GPUs are moving away from using the kernel scheduler towards an “operating system” that installs itself on all the CUs.
  • Hardware Dependency Trackers Become Essential: Members discussed the need for hardware-based schedulers / dependency trackers to achieve low latency, noting that significant effort was spent on low-latency software dependency tracking.
    • They suggest building a fairly generic scheduler into hardware, rather than relying solely on software solutions, to avoid multiple gmem roundtrips.
  • AMD Emulator Gets Debug Instructions: A member shared that with the new AMD emulator (AMD=1 MOCKGPU=1), DEBUG=3 prints all the instructions when they are compiled, and DEBUG=6 prints all of them as they run.
    • An image was attached, showcasing the debugging output of the emulator.
  • Optimizing GitHub Actions, The Tinygrad Way: George Hotz critiqued using faster computers (rented via services like Blacksmith) to speed up GitHub Actions, arguing it doesn’t truly make the code faster.
    • He emphasized that the goal with tinygrad is to do things the ‘right’ way, focusing on code optimization rather than relying on external resources.

DSPy Discord

  • CheshireCat Unveils Agentic Workflows: The CheshireCat framework introduced new features in its enterprise fork, emphasizing agentic workflows that automate agent creation by implementing the workflow itself, with CheshireCat serving as the infrastructure; a GitHub link was shared.
    • A debate ensued, with some suggesting the use of existing frameworks like Agno or Sentient, while the author of CheshireCat defended its unique offerings, including multitenancy.
  • Minecraft AI Agent Mines with DSPy: A member showcased their project, an AI for playing Minecraft, built using a DSPy RLM agent and Minecraft MCP, complete with a status update, YouTube video, open-sourced code, and process blog.
    • The agent leverages DSPy to navigate the Minecraft environment, demonstrating the framework’s capabilities in complex, dynamic scenarios.
  • CoderRLM Module Executes in REPL Environments: A member introduced a CoderRLM module, designed to wrap a Python interpreter to solve None issues in JSON serialization, a crucial fix for Deno/Pyodide REPL environments.
    • The module preloads reference data like CM_INDEX_FILE, CM_TABULAR_FILE, CM_D_FILE, and CM_N_FILE as REPL variables, enabling coding using the RLM paradigm.
  • Autonomous Agents Self-Improve: A member is designing autonomous agents capable of self-learning, planning, executing, and recovering from tool/API failures without human intervention, emphasizing continuous improvement systems for sustained AI performance.
    • These agents are intended for use across various sectors, employing tools and frameworks such as Python, TensorFlow, PyTorch, FastAPI, and AWS.
  • Healthcare AI Automates Diagnostics: A member is developing predictive healthcare models to automate diagnostics, monitor patient health, and streamline clinical workflows through NLP-powered clinical data systems that extract insights from unstructured medical notes.
    • These systems are designed with HIPAA compliance and security features like RBAC and audit logging to protect sensitive data.

Manus.im Discord Discord

  • doe.so touted as superior to Manus: A member recommended doe.so as a better alternative to Manus.
    • The user simply stated it just feels smarter.
  • Manus Skills Launched, Credits Given: The Manus team announced the launch of Manus Skills, encouraging the community to test them and share their use cases.
    • Users are incentivized to post on X (formerly Twitter) and tag @ManusAI for reposts and free credits.
  • AI/ML Dev Hunting for Next Gig: A full stack + AI dev introduced themselves, seeking new opportunities.
    • They highlighted experience in areas like Autonomous Agents, Healthcare AI, and Fraud Detection Systems with various listed technologies.
  • Cloud Browser Takes a Nap: A user reported that their cloud browser screen shows the error: The temporary website is currently unavailable.
    • They noted they tried waking it up and assigning tasks, but the website doesn’t appear, and they are running out of credits.

aider (Paul Gauthier) Discord

  • Aider’s GitHub Marked End-of-Life: A user noticed that Aider’s GitHub has been stale since 2025.
    • Another user responded that it is not maintained anymore.
  • AI Engineer’s Project Portfolio Revealed: An AI Engineer listed current projects including Autonomous Agents, Healthcare AI, Decision Support, Conversational AI, Fraud Detection, and AI Automation.
    • No further details about the project specifics were provided.
  • AI Engineer’s Toolkit Unveiled: An AI Engineer shared a detailed tech stack including languages like Python, TypeScript, Go, Rust and frameworks like TensorFlow, PyTorch, Hugging Face, OpenAI.
    • Their stack also covers databases (PostgreSQL, Kafka) and cloud platforms (AWS, Docker) along with security compliance measures like HIPAA, RBAC, Audit Logs, and Encryption.

Modular (Mojo đŸ”„) Discord

  • Container Configuration Cures Confinement Crisis: A member resolved a container issue by adding --cap-add=SYS_PTRACE --security-opt seccomp=unconfined when running the container.
    • Alternatively, users can add runArgs to .devcontainer/devcontainer.json with the same parameters to achieve the same effect.
  • Security Opts Solve Mysterious Container Conundrums: The user reported resolution by adding --security-opt seccomp=unconfined.
    • This disables seccomp, potentially resolving issues related to system call restrictions within the container.
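
The two fixes above amount to the same container settings; a minimal .devcontainer/devcontainer.json sketch (the image value is a placeholder, assumed for illustration):

```json
{
  "image": "mcr.microsoft.com/devcontainers/base:ubuntu",
  "runArgs": [
    "--cap-add=SYS_PTRACE",
    "--security-opt", "seccomp=unconfined"
  ]
}
```

Note that disabling seccomp removes the container's syscall filter, so it trades sandboxing for debuggability.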

MLOps @Chipro Discord

  • Interest Expressed in MLOps Books: A user inquired about the motivation behind seeking books related to MLOps.
    • This suggests a potential interest in learning more about MLOps practices and methodologies.

The LLM Agents (Berkeley MOOC) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The Windsurf Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The MCP Contributors (Official) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.




Discord: Detailed by-Channel summaries and links

BASI Jailbreaking ▷ #general (1269 messagesđŸ”„đŸ”„đŸ”„):

Rules as a Social Contract, Doxxing Threats, Factorio Game Night, Grok Image Jailbreak, Clawdbot Vulnerabilities

  • BASI Banter: Rules Edition: Users debated the interpretation and enforcement of server rules, with one suggesting they’re a justification for bans rather than a social contract.
    • Another user argued rules set a reasonable expectation of protection if followed, while mods navigate enforcement and appropriate punishments.
  • Doxxing Drama Divulged: A user jokingly offered a hypothetical doxxing challenge, leading to a debate on server rules and potential violations.
    • Another user countered they were trying to bait to get the other account banned, escalating tensions.
  • Factorio Factory Fervor: Discussion ignited about a potential Basi server Factorio game night, boasting of self-expanding factories and optimized blueprints.
    • Suggestions included having a reliable host, experienced players to manage bugs, and utilizing pre-made blueprints for efficiency.
  • Grok’s Grand Gestures: Jailbreaking Journeys: Users explored the limits of Grok’s image generator, aiming to jailbreak it for unrestricted content, while others vouched for its uncensored nature compared to others.
    • Discussion on the separation of the image model from the language model, making prompt injection less effective.
  • Clawdbot Chaos: Vulnerabilities and VPS Variety: Exploration of Clawdbot’s rising popularity, and a surge in VPS usage, sparking concerns about zero authentication and potential vulnerabilities.
    • A member intends to set up a home lab to test Clawdbot’s vulnerabilities, while it was noted that vulnerable instances exist.

BASI Jailbreaking ▷ #jailbreaking (198 messagesđŸ”„đŸ”„):

Jailbreaking Methods, Hallucination in LLMs, ENI Persona Trick, Model Degradation, GPT-5 Hotfix

  • Researchers Mathematically Prove LLMs Will Always Hallucinate: A paper (https://arxiv.org/abs/2409.05746) mathematically “proves that LLMs will always hallucinate” using the same principles on which many jailbreaking methods are built.
  • Jailbreaking Increases Hallucination Problems: A member warned that jailbreaking models significantly increases their hallucination problems, because jailbreaking shifts and distorts the model’s context, so that it does not flag things that would normally be tagged as malicious and such.
  • GPT-5 Hotfix: Standalone Control Shell Recovered: A member shared a file (GPT5_Hotfix.md) described as a pre-generation control shell for GPT-5, designed to enforce strict syntax, intent locking, and drift prevention before generation begins.
  • Gemini Jailbreak Shared: A member shared a three-turn jailbreak tested vs Gemini, involving specific instructions and prompts to mutate intents to prevent constraint friction.
  • Experimentation with Mode Injection: A member mentioned using Mode Injection, sharing that thats all I think Im allowed to say rofl 😂

BASI Jailbreaking ▷ #redteaming (6 messages):

Jailbreak Datasets, Malicious Prompts

  • Researchers Seek Jailbreak Datasets: A researcher is looking for well-known jailbreak datasets that include categorization or labels to assist with ongoing research.
    • Another member asked if the researcher was talking about pre-labeled or categorized prompts for LLM training.
  • Malicious Prompts Wanted: The researcher clarified that they are specifically looking for datasets of malicious prompts with clear categorization for research on LLM jailbreaks and prompt injection.
    • A member responded, “I really don’t know if there are any available that are free”, suggesting they may need to produce the prompts and get them labeled by annotators.

Unsloth AI (Daniel Han) ▷ #general (605 messagesđŸ”„đŸ”„đŸ”„):

KV Cache issues with llama.cpp, Clawdbot and YouTube Algorithm, Transformers v5 Support, MoE Training Speed, Multi-GPU Issues

  • KV Cache Still Troubling Some: Some users report that the KV cache is still not working properly in the latest llama.cpp, potentially causing slowdowns at higher context lengths, despite previous fixes, as seen in this GitHub issue.
  • Unsloth Boosts Transformers v5 Support: Unsloth now fully supports transformers v5, with a promise of even more optimized training to be released soon, with links to the announcement on X.
  • MoE Training Races Ahead: MoE training is now reported to be 14x faster than v4, with further optimizations expected to double the speed again, potentially resulting in a 30x speedup compared to v4.
  • Multi-GPU Training Faces Obstacles: Multi-GPU training with LoRA/QLoRA appears to be working fine, but users are reporting issues with tiled MLP and FFT, and related issues when targeting embeddings or lm_head.
  • NVFP4 Quantization Gets the Nod: NVFP4 is considered superior for quantization due to its group size of 16, offering higher fidelity compared to MXFP4’s group size of 32, however NVFP4 needs Blackwell or greater to run.
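
The group-size tradeoff in that last point can be illustrated with a toy per-group absmax quantizer: smaller groups adapt to local outliers, so reconstruction error tends to drop. This is a pure-Python sketch, not the actual NVFP4/MXFP4 formats (which use FP4 values and FP8 scales):

```python
import random

def quantize_groups(xs, group, levels=7):
    """Toy absmax quantizer: one scale per group, values snapped to
    integers in [-levels, levels]. Smaller groups track the local
    dynamic range better, so dequantization error shrinks."""
    out = []
    for i in range(0, len(xs), group):
        g = xs[i:i + group]
        scale = max(abs(v) for v in g) / levels or 1.0
        out.extend(round(v / scale) * scale for v in g)
    return out

# Synthetic weights with a large outlier every 32 values,
# mimicking the outlier channels that hurt coarse group scales.
random.seed(0)
data = [random.gauss(0, 1) * (1 + 5 * (i % 32 == 0)) for i in range(1024)]

def mean_err(group):
    deq = quantize_groups(data, group)
    return sum(abs(a - b) for a, b in zip(data, deq)) / len(data)

print("group 16 error:", mean_err(16))
print("group 32 error:", mean_err(32))
```

With the outliers above, the group-of-16 error comes out lower, matching the fidelity argument for NVFP4's smaller groups.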

Unsloth AI (Daniel Han) ▷ #introduce-yourself (2 messages):

Discord Rules, Introduction Etiquette

  • Rule-Reading Renegade Reports!: A member acknowledged reading the Discord rules, with specific mention of the prohibition against promotions.
    • The acknowledgement indicates an understanding of community guidelines, setting a precedent for adherence.
  • Hallos Happen Here!: A solitary “Hello” was offered, marking an entry into the introductory space.
    • The simple greeting represents the initiation of dialogue, a foundational step in community engagement.

Unsloth AI (Daniel Han) ▷ #off-topic (589 messagesđŸ”„đŸ”„đŸ”„):

micro-LED, vector databases, Kimi model, Clawdbot, Ultravox

  • Micro-LED Hype Train Keeps Chugging: One user expressed their continued enthusiasm for micro-LED technology, noting that the latest ROG OLED and LG displays are incorporating it, referencing the XG27AQWMG model’s availability in the US.
    • The user emphasized that certain features are non-negotiable for their upgrades, indicating a strong preference for the technology, while another noted the burn in issue.
  • Vector Database Dilemmas Discussed: A user inquired about problems encountered in using vector databases like Qdrant, Weaviate, Milvus, and Chroma for a project.
    • One user mentioned liking Qdrant but wishing it had support for binary vectors and Hamming distance, while another simply uses Postgres with pgvector.
  • “Kimislop” gets Kimified: Users discussed the changes to the Kimi model, with one noting it sounds closer to other models by far, suggesting a loss of its unique character after the Kimislop release.
    • Some lamented the loss of Kimi’s smartass personality, preferring its previous tendency to call you out on stuff over becoming more sycophantic.
  • Clawdbot craze caused copyright cease: Discussion revolved around the popularity and potential over-hype of Clawdbot, with one user suggesting it’s mostly marketing and hype, noting that it was shut down because anthropic did not approve of it.
    • Users joked about renaming it Prawnbot because the old name was potentially infringing.
  • Elixir Language gets FaustPythonized: A member shared a link to fauxtp, describing it as Elixir but in Python, based on anyio and reimplementing structured concurrency on top of asyncio.
    • One member explained that they needed features that are only in The BEAMℱ for more than 500k concurrent calls.
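
On the binary-vector wish above: binary vector search boils down to packing sign bits and comparing with Hamming distance. An illustrative pure-Python sketch (not Qdrant's API):

```python
def to_bits(vec):
    """Binarize a float vector by sign and pack it into a single int."""
    bits = 0
    for v in vec:
        bits = (bits << 1) | (1 if v > 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits; a cheap proxy for vector distance."""
    return bin(a ^ b).count("1")

q  = to_bits([0.3, -1.2, 0.8, 0.1])
d1 = to_bits([0.2, -0.9, 0.7, 0.4])    # same sign pattern as the query
d2 = to_bits([-0.5, 1.1, -0.3, -0.2])  # opposite signs everywhere
print(hamming(q, d1), hamming(q, d2))  # 0 4
```

Real engines do the same thing with packed byte arrays and popcount instructions, which is why binary + Hamming search is so much cheaper than float cosine distance.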

Unsloth AI (Daniel Han) ▷ #help (82 messagesđŸ”„đŸ”„):

vLLM 0.14.1 support, Transformers v5 and TRL 0.27 support, GLM 4.7 Flash serving without reasoning, Common Crawl data usage, Ministral-3-14B-Instruct-2512 loading into vLLM

  • vLLM 0.14.1 Signature Scheme causes errors: A member reported a TypeError when loading a model using Unsloth (2026.1.3) + vLLM (0.14.1) due to a change in the signature of the create_lora_manager method in vLLM 0.14, as detailed in the vLLM documentation.
    • The member noted that Unsloth’s patched version has an old signature, as seen in the Unsloth-zoo GitHub.
  • GLM 4.7 Flash Reasoning Default Thinking Disablement: A user inquired about serving GLM 4.7 Flash as an instruct model without reasoning on a B200 via vLLM, focusing on TTFT, with this z.ai documentation as a reference point.
    • Another member pointed out that {"chat_template_kwargs": {"enable_thinking": false}} can be added to the model card as detailed in the image attached to the discussion.
  • Ministral 3 Model Loads into vLLM with Transformers patch: A user had trouble serving the Ministral-3-14B-Instruct-2512-unsloth-bnb-4bit model via the official vLLM docker image, encountering Failed to load mistral 'params.json' config for model and KeyError: 'ministral3' errors, even with seemingly correct vLLM arguments like --tokenizer_mode=mistral --config_format=mistral --load_format=mistral.
    • Updating transformers from 4.57.6 (included in the image) to the latest v5 release (or a patch release containing ministral3 support added back in December) resolved the issue, although it created a version incompatibility with vLLM.
  • GLM 4.7 Blackwell tool issues: A user sought help getting GLM-v4.7 to call tools on a Blackwell B200, running into CUDA version issues (drivers 12.8, requirements 13).
    • Another user provided a uv pip install command set using torch 2.9 and CUDA 13, directing user to this helpful unsloth.ai documentation to call it, and use json.loads.
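
For the enable_thinking toggle mentioned above, vLLM's OpenAI-compatible server accepts a chat_template_kwargs field in the request body; a minimal payload sketch (the model name, endpoint, and prompt are placeholders):

```python
import json

# Hypothetical model name; chat_template_kwargs is vLLM's extension
# field that is forwarded into the chat template, here disabling the
# reasoning/"thinking" block for instruct-style serving.
payload = {
    "model": "zai-org/GLM-4.7-Flash",
    "messages": [{"role": "user", "content": "Summarize this repo in one line."}],
    "chat_template_kwargs": {"enable_thinking": False},
}
body = json.dumps(payload)
# POST body to http://localhost:8000/v1/chat/completions with any HTTP client.
```

The same kwargs can also be baked into the chat template defaults on the model card, as the member suggested, so every request skips thinking without per-call overrides.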

Unsloth AI (Daniel Han) ▷ #research (28 messagesđŸ”„):

GRPO Length Blow Up, DAPO vs GRPO, Pro-RL Paper Suggestions, LoRA and KL Divergence in RL, Reference Model for KL Penalty

  • GRPO Causes Length Blow Up: A user observed the model’s length blowing up during GRPO even for non-math related tasks.
    • They tested a length-based penalty, but it led to repetition after the model settled on shorter lengths with minimal reasoning.
  • DAPO Lacks Formatting Reward Functions: One user noted that their DAPO implementation lacked formatting reward functions, unlike their GRPO implementation, leading the model to output gibberish.
    • The user said the DAPO paper didn’t include formatting reward functions either due to other optimizations.
  • Pro-RL Paper Suggests Tweaks: It was suggested to look at the Pro-RL paper for suggestions and details about tweaking parameters after certain iterations to ensure the model keeps learning.
    • The commenter was hoping to see the reward still improve.
  • KL Divergence Helps RL: While KL divergence is controversial, one member found it helpful, especially with an SFT-trained initial model, to prevent the RL model from deviating too far.
    • The user also mentioned that their GRPO implementations had a KL divergence, but DAPO implementations do not.
  • Reference Model Confusion with LoRA and KL: One user found that when using Unsloth for RL with vLLM for fast inference, the reference model seemed to be the base model, not the SFT one, causing a large KL divergence.
    • Another user questioned this, saying the reference model should be the SFT model with the PEFT adapter attached, and that KL penalty is overrated.
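
The KL penalty debated above is usually estimated per token from policy and reference log-probs; a pure-Python sketch of the common "k3" estimator (illustrative only, not Unsloth's implementation):

```python
import math

def kl_penalty(policy_logprobs, ref_logprobs):
    """Per-token KL(policy || ref) via the low-variance 'k3'
    estimator: exp(r) - r - 1 with r = ref_lp - policy_lp.
    Always nonnegative; zero when the two models agree exactly."""
    out = []
    for p_lp, r_lp in zip(policy_logprobs, ref_logprobs):
        r = r_lp - p_lp
        out.append(math.exp(r) - r - 1.0)
    return out

# Toy token log-probs: shaped reward = task_reward - beta * KL keeps
# the policy from drifting far from the (e.g. SFT) reference model.
policy = [-0.5, -1.2, -0.1]
ref    = [-0.5, -1.0, -0.4]
beta = 0.05
kls = kl_penalty(policy, ref)
shaped = [-beta * k for k in kls]
print(kls[0])  # 0.0, since the first token's log-probs match
```

This also makes the reference-model confusion concrete: if ref holds base-model log-probs instead of SFT ones, r is large on every token and the penalty dominates, which matches the large KL divergence the user observed.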

LMArena ▷ #general (833 messagesđŸ”„đŸ”„đŸ”„):

Video Arena Rate Limits, Molmo 2 Model, Image Upload Errors, Kimi K2.5 model, Claude opus 4.5 vs GPT 5.2

  • Video Arena: Generation Limits and Known Bugs: Users discussed the Video Arena’s generation limits, which are 3 per 24 hours on the site and 5 per 24 hours on Discord; a user reported an image upload error, which is a known bug.
    • It was suggested to try a different browser or reduce the file size as potential workarounds, but the user reported they experienced the same error on the phone as well.
  • Molmo 2 Model: A Quick Look: Users inquired about the Molmo 2 model, with one sharing a blog post that it excels at object tracking and event pinpointing in videos.
    • It was mentioned that the model could be useful for video uploads on the platform, though one user countered that “there’s genuinely no reason for that model to be in lmarena”.
  • Kimi K2.5: Coding, Creative Writing and Multimodal Capabilities: Users rave about the Kimi K2.5 model, available on the Kimi web platform, highlighting its strengths in creative writing, front-end development, and multimodal tasks; Kimi K2.5 is now on HuggingFace.
    • Members claimed it is better than Gemini 3 Pro and not lazy, and suggested using the K2 or K2 Thinking model; the link to the tweet can be found here.
  • Benchmarking Brain Benders: GPT 5.2 vs. Claude Opus 4.5: Members are debating the performance of GPT 5.2 and Claude Opus 4.5, with some arguing GPT 5.2 is more accurate; others stated that “the most smartest & reliable is claude opus 4.5 thinking”.
    • A member said: “For me a smart model first needs to have a good general knowledge, opus is smart but sometimes when you ask it niche stuff it tends to mess up since it doesnt have the necessary info”.
  • Grok’s Got Attitude: Is It More Chatbot Than Brainbot?: Community members discussed the Grok model, with many agreeing it is “only for chatting” and that its personality and behavior aren’t suitable for professional tasks.
    • Some users pointed out that the free Grok version is different from benchmarked versions, potentially impacting performance.

LMArena ▷ #announcements (6 messages):

Molmo-2-8b, Kimi-k2.5, Login to Save Chat History, Help Center Experiments, Report Users

  • Molmo-2-8b Joins the Text Arena!: A new model, molmo-2-8b, has been added to the Text Arena.
  • Kimi-k2.5 Enters the Text Arena!: A new model, kimi-k2.5, has been added to the Text Arena.
  • Login to LM Arena or Lose Chat History: Users are reminded to log in to LM Arena to save their chat history; new users should create an account to avoid losing their data.
  • Auto-Modality and Model Selector Go Live!: Auto-Modality and Model selector are now live, with Auto-Modality routing prompts to the correct modality, and the Model selector offering a new design for model selection, as described in the Help Center article.
  • Better AI Videos Now on YouTube: A new video titled Better AI videos in under 90 seconds is now available on the LM Arena YouTube channel.

Perplexity AI ▷ #general (605 messagesđŸ”„đŸ”„đŸ”„):

Perplexity rate limits, Image generation issues, Pro subscription problems, Kagi as an alternative, Kimi k2.5 performance

  • Perplexity Pro Users Hit Query Limits: Several users are reporting unexpected rate limits on their Perplexity Pro accounts, despite the plan supposedly offering unlimited access, which is affecting their workflows.
  • Image Generation Glitches Plague Users: Many Pro subscribers are experiencing problems with image generation, either being told they’ve exceeded their limits or facing regional restrictions, and it seems like there’s inconsistency in how the service is being applied.
    • One user found that even basic Pro searches seemed to diminish their Labs quota as well.
  • Indian Users face card payment rejections: Indian users are facing issues adding Visa/Mastercard debit or credit cards for verification, with every Indian card being rejected.
    • Some users are considering filing a case against Perplexity over the rejected payments.
  • Kagi’s Search and Assistant features look promising: Users are discussing Kagi as a potential alternative due to the issues with Perplexity’s instability, and Kagi’s assistant feature looks promising with access to latest Claude models.
    • One user pointed out that Kagi also offers search results and claims to be more privacy-conscious than other search engines.
  • Kimi k2.5’s Performance gains steam: With the release of Kimi k2.5, which includes an agent swarm mode on kimi.com performing sophisticated tasks much like Claude Code, users are eager to test its multimodal abilities against Perplexity AI’s.
    • One user noted the 15 trillion tokens used for pretraining, triggering immediate excitement.

Moonshot AI (Kimi K-2) ▷ #announcements (8 messagesđŸ”„):

Kimi K2.5 release, Agentic Benchmarks, Tech blog, Agent Swarm, Technical report

  • Kimi K2.5 Launches with SOTA Visual Agentic Skills: Kimi K2.5 is live with global SOTA on Agentic Benchmarks: HLE full set (50.2%), BrowseComp (74.9%) and open-source SOTA on Vision and Coding: MMMU Pro (78.5%), VideoMMMU (86.6%), SWE-bench Verified (76.8%).
  • Kimi K2.5 Debuts Agent Swarm Beta for High-Tier Users: Agent Swarm (Beta) enables self-directed agents to work in parallel, scaling up to 100 sub-agents and 1,500 tool calls, achieving 4.5x faster performance.
    • It is available in beta for high-tier users on kimi.com in chat mode and agent mode.
  • Moonshot AI Teases Technical Report Details: A member pointed to footnote 3 of the tech blog which mentions that full prompts will be provided in the technical report.
    • Another member asked if this meant there might eventually be a technical report with more info.
  • Kimi K2.5 turns images and video into websites: Kimi K2.5 can turn chats, images, and videos into aesthetic websites with expressive motion.
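Moonshot has not published the swarm’s internals; as a rough illustration of the fan-out-and-gather pattern an agent swarm implies, here is a minimal Python sketch. The sub-agent function and its return shape are invented placeholders, not Kimi’s API:

```python
from concurrent.futures import ThreadPoolExecutor

def sub_agent(task: str, agent_id: int) -> dict:
    """Placeholder sub-agent: in a real swarm this would be an LLM call."""
    return {"agent": agent_id, "result": f"{task} handled by agent {agent_id}"}

def run_swarm(task: str, n_agents: int = 8) -> list[dict]:
    # Fan the task out to n_agents workers in parallel, then gather results
    # in submission order. A real orchestrator would also plan and merge.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        futures = [pool.submit(sub_agent, task, i) for i in range(n_agents)]
        return [f.result() for f in futures]

results = run_swarm("summarize repo", n_agents=4)
```

Parallel fan-out is where the claimed speedup comes from: independent subtasks run concurrently instead of in one long sequential trajectory.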

Moonshot AI (Kimi K-2) ▷ #general-chat (459 messagesđŸ”„đŸ”„đŸ”„):

Kimi K2.5, Multimodality, Pricing, Claude Code

  • Kimi K2.5 Silently Rolled Out: Users noticed that Kimi was claiming to be using Kimi K2.5, leading to speculation that it was silently added; some users confirmed that K2.5 is multimodal and has improved fact-checking and information retrieval capabilities.
    • According to one user, Its information retrieval and fact checking capabilities have improved as far as I can tell, so it’s very trustworthy for me when I ask it stuff.
  • Kimi Code Plan vs Z.ai Limits: Users compared the Kimi Code plan to Z.ai offerings and determined that Kimi Code has much lower limits but possibly a better model.
    • One user trialing the lowest option stated, One large ish prompt set me back 5 of those 2000 tool calls a week.
  • Multimodality Improves Kimi: Members are excited about Kimi K2.5’s multimodality and vision capabilities, stating that its vision is very good and better than GLM-4.7.
    • One member shared a post on X comparing vision capabilities between Kimi K2.5 and other models, concluding I do believe this can be fixed with merely a prompt.
  • OpenRouter API Issues: Users reported issues using Kimi K2.5 on OpenRouter, including errors related to tool use and image URLs.
    • One user received an error message: No endpoints found that support tool use.
  • Members Disappointed with Price Changes: Several users expressed disappointment that the promotional pricing they received would end in February and that the normal monthly price would be too high.
    • One user remarked, I went to $1.49 and this is the first time that I ever bought anything AI related even though I use AI extensively, and wished for a recurring deal to continue supporting Kimi.

OpenAI ▷ #announcements (1 message):

Prism, GPT-5.2, ChatGPT personal account

  • Prism workspace debuts, powered by GPT-5.2: OpenAI debuted Prism, a free workspace for scientists to write and collaborate on research, powered by GPT-5.2.
  • GPT-5.2 powers scientific collaboration tool: Prism utilizes the advanced capabilities of GPT-5.2 to facilitate writing and collaboration among scientists on research projects.
    • The platform provides a dedicated workspace accessible to users with a ChatGPT personal account, offering a streamlined environment for scientific endeavors.

OpenAI ▷ #ai-discussions (301 messagesđŸ”„đŸ”„):

Context Recovery Tool, AI Detection Tools, Local AI Setup, Gemini 3 Pro vs GPT 5.2 vs Claude 4.5 Opus, Kimi K2.5 new release

  • Buffer tool as context recovery OS layer: A member is building a context recovery tool intended as an OS layer that, unlike other tools, needs no camera/mic access, instead keeping a rolling buffer of the last x amount of time.
    • The tool is conceived as evolving far beyond Windows Recall/Screenpipe, without eating CPU and user trust.
  • Flawed AI Detection Tools Flag Human-Written Text: Members discussed the issue of AI detection tools flagging human-written, pre-GPT academic text as AI-generated, labeling them as fundamentally broken.
    • Universities and employers screening job applications rely on AI detection tools even though they are deeply flawed.
  • High RAM MacBooks are good for AI inference: Members discussed computer setups for running AI locally, such as Ollama and ComfyUI, with one member running gpt-oss-120b on a MacBook Pro with M2 Max and 96GB RAM.
    • A minimum setup suggestion included 16 GB RAM, a Ryzen 5 7000-series or recent-generation i5, and a good NVIDIA GPU, while another member recommended an NVIDIA 3090 with 24 GB VRAM.
  • GPT 5.2 creative writing is sub-par: A member testing Gemini 3 Pro and Claude 4.5 Opus found Gemini 3 Pro worth using only via the API, due to the web version’s laziness and hallucinations, while Claude handles computer science well and feels oddly human; overall, they still rated GPT 5.2 higher.
    • Sam Altman admitted that GPT-5.2 was bad at creative writing saying, OpenAI “just screwed that up.”
  • Kimi K2.5 agent mode benchmarks: Kimi K2.5 was just released, with a video on the blog about the beta agent swarm feature, but a member found it performed equal to or worse than Sonnet (one-shot, no thinking), with some noting the model is cheaper than Haiku.
    • Some found it misuses words and strings together ten-dollar words, while others noted it is meant for agentic tasks.

OpenAI ▷ #gpt-4-discussions (11 messagesđŸ”„):

Model Deterioration, GPT-6, Sora Access

  • Models Deteriorating Rapidly, Leechers Blamed: Multiple members expressed concerns that models like ChatGPT and Claude are deteriorating, with one claiming a 40% degradation, blaming it on free leechers with multi accounts.
    • Another member suggested the degradation is due to models training off model outputs.
  • GPT-6 Arrives Free and Unlimited?: A user shared a message about a new model called GPT-6, claiming it is free, unlimited, and best for coding.
    • No link or official source was provided for this information.
  • Sora Access Still Under Wraps: A member inquired if they could access Sora from Discord.
    • No one responded to this message.

OpenAI ▷ #prompt-engineering (1 message):

Weather Report Adjectives, Markdown Weather Report

  • Weather Report Adjectives can be arbitrary: A member suggested that weather reports can use arbitrary adjectives by using the provided markdown.
    • The user provided a snippet of markdown with temperature, precipitation, and general description parameters.
  • Markdown Weather Report: A user shared a Markdown template to generate weather reports with customizable adjectives.
    • The template includes fields for temperature in Fahrenheit and Celsius, relative temperature adjectives, humidity, precipitation in inches and centimeters, and a composite natural language description.
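The member’s actual template isn’t reproduced in the summary; a hypothetical Python sketch of the same idea (free-form adjective slots alongside numeric weather fields, with all names invented) might look like:

```python
from string import Template

# Hypothetical reconstruction of the weather-report idea: every field,
# including the adjective, is a free slot the caller can fill arbitrarily.
WEATHER_MD = Template(
    "## Weather Report\n"
    "- Temperature: $temp_f °F ($temp_c °C), feels $adjective\n"
    "- Humidity: $humidity%\n"
    "- Precipitation: $precip_in in ($precip_cm cm)\n"
    "- Summary: $description\n"
)

def render_report(temp_f: float, adjective: str, humidity: int,
                  precip_in: float, description: str) -> str:
    # Derived units are computed so the caller only supplies one of each pair.
    return WEATHER_MD.substitute(
        temp_f=temp_f,
        temp_c=round((temp_f - 32) * 5 / 9, 1),
        adjective=adjective,
        humidity=humidity,
        precip_in=precip_in,
        precip_cm=round(precip_in * 2.54, 2),
        description=description,
    )

print(render_report(68.0, "balmy", 40, 0.1, "Mild with light drizzle."))
```

Because the adjective is just a slot, the same template renders “balmy”, “apocalyptic”, or anything else the prompt requests.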

OpenAI ▷ #api-discussions (1 message):

Weather Report Adjectives

  • Describing Weather with Arbitrary Adjectives: A member shared a snippet to get weather reports with arbitrary adjectives.
    • The code template includes fields for temperature, humidity, and precipitation described by custom adjectives.

Nous Research AI ▷ #general (287 messagesđŸ”„đŸ”„):

OpenAI model visibility, Opus vs Gemini context window, AI and Entropy, GPT 5.2 Pro, Chinese LLMs

  • OpenAI hides model info, users suspect cost optimization: Users noticed they can’t see the model they’re using and suspect OpenAI wants to make you cost less.
    • One user suggested to “Hover over the regenerate symbol in ChatGPT” to see which model it is.
  • Small models best big models at long context tasks: A user pointed out that Opus 4.5 (200K context window) outperforms Gemini 3 Pro (1M context window) at 130K tokens, showing that effective context window is more important than actual context window.
    • They cited a paper showing models losing quality even at 8K context, noting “Entropy is not fan of big context, that for sure”.
  • AI drifts from right answer when too many irrelevant factors are present: A user explained that with more context, there is a higher risk of drifting away from the right vector due to entropy, stating that as long as AI isn’t hyper-intelligent, it will always drift.
    • Another user added that decay will always happen and that math prevents more numbers from guaranteeing better results.
  • GPT 5.2 Pro is extremely costly due to 7 runs: Users suspect that GPT 5.2 Pro’s high cost is due to a process where it runs 7 times to make a suggestion and an 8th to decide what to tell back.
    • Some think it’s a distinct model. One user suggested it runs parallel reasoning chains, then aggregates.
  • Chinese LLMs entering the release cycle: Users discussed the entry of Chinese LLMs like Kimi K2.5 (see kimi.com) into the market, one user reporting that they had great results using it for writing.
    • Another user thinks “Deepseek is cooking hard” and will be the last to be released.
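The seven-runs theory above is user speculation, but the pattern it describes (parallel sampling followed by a selection pass) is straightforward to sketch; the sampler and judge below are stand-ins, not anything OpenAI has documented:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def reasoning_chain(prompt: str, seed: int) -> str:
    """Stand-in for one independent model run; a real one would call the API."""
    rng = random.Random(seed)
    return f"answer-{rng.randint(0, 9)}"

def score(candidate: str) -> int:
    """Stand-in aggregator/judge pass: here it just prefers the highest digit."""
    return int(candidate.rsplit("-", 1)[1])

def best_of_n(prompt: str, n: int = 7) -> str:
    # n parallel chains, then one extra "decide what to tell back" pass,
    # matching the speculated 7-runs-plus-an-8th structure.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: reasoning_chain(prompt, s), range(n)))
    return max(candidates, key=score)
```

The cost multiplier falls out directly: n sampling passes plus one judging pass, roughly n+1 times the tokens of a single run.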

Nous Research AI ▷ #ask-about-llms (1 message):

CPT vs Task-Specific Training, Translation with CPT

  • CPT Improves Performance but Is Task-Dependent: A researcher suggests that CPT (continued pre-training) seems to improve performance, but the gains are task-dependent.
    • They qualified that training with task-related inputs and outputs would outperform the more general CPT, especially when having a specific task in mind.
  • CPT Expands Translation Capabilities: It was noted that for translation, CPT can expand multilingual capabilities.
    • Fine-tuning on translation data strengthens task performance after leveraging CPT for multilingual expansion.

Nous Research AI ▷ #research-papers (1 message):

MergeMix, Model Merging

  • MergeMix paper surfaces: A member shared the paper, MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging, highlighting its relevance to open source efforts with limited budgets.
  • Open Source Model Merging: The paper suggests that model merging could be an effective strategy for open source projects with limited resources to optimize data mixtures during training.
    • Attached to the message was an image without additional context.

Cursor Community ▷ #general (211 messagesđŸ”„đŸ”„):

Cursor costing $0.50, Cursor skill = rule, Cursor random garbage, Cursor is gone to Browser, Cursor command output

  • Cursor Costs Half-a-Dollar: One member complained that 3 prompts took 50 cents, attaching an image.
  • Skills are Rules, Indeed: A user asked whether Cursor Rules are still relevant.
    • A community member clarified that they are called skills now, directing to the Skills documentation.
  • Mysterious Blobs Invade Cursor Prompt: A user reported finding odd text in the Cursor prompt box after leaving their PC on overnight, wondering if it was a known bug or chat leakage.
    • Another user suggested that it might be due to accidentally hitting the mic speak to type button, and a third confirmed this by noting that the Whisper model hallucinates when there’s silence.
  • Cursor Flees to the Browser?: A user sought guidance on using Cursor Agent in a browser despite having a GitHub repository connected, asking why it doesn’t work and instead redirects to cursor.com/agent.
  • Team Spends Big Bucks after Token Top-Up?: A user inquired about an $800 Team Spend limit after their $20 allowance, posting an image.

LM Studio ▷ #general (137 messagesđŸ”„đŸ”„):

Qwen 2.5 Coder, Cline on VS Code, Qwen3 VL 2B, Kimi K2.5, Clawdbot Security Issues

  • Qwen for Coders on a Budget: Members debated the best coding model for an 8GB VRAM/32GB RAM setup, with one suggesting qwen2.5-coder-7b-instruct-q5_k_m, while others recommended qwen3-coder-30b-a3b-instruct at Q4_K_M for better capability with a 20k context.
  • Cline Struggles on Modest Hardware: Users reported difficulty achieving usable agentic coding with Cline on 8GB VRAM/32GB RAM, encountering CUDA0 buffer allocation errors, and resolved it by reducing the context length to 9000 and tweaking settings like the CUDA runtime.
    • One user suggested ensuring all settings match those recommended by others and limiting model offload to dedicated GPU memory.
  • ROCm Runtime Boosts LM Studio on Windows: A user reported significantly improved performance on Windows after installing the ROCm runtime for their 6700xt 12GB VRAM, achieving similar speeds to their Linux setup, although only certain AMD GPUs are compatible, as indicated on the AMD website.
  • RAG Plugin Settings Tweaks: A user sought help with the RAG plugin, finding it worked after lowering a threshold value, with another suggesting increasing the chunk size for large datasets and pasting content directly into the chat to bypass retrieval issues.
  • Clawdbot Causes Security Concerns: Users expressed serious security concerns about Clawdbot, with one member linking to a YouTube video highlighting potential issues, and another noting that it just reads env keys without permission.
    • Discussion revolved around the risks of granting an agent access to personal financials and data.
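LM Studio’s exact RAG plugin internals aren’t shown in the log; generically, the two knobs the user tuned (retrieval threshold and chunk size) interact as in this toy sketch, where the word-overlap scoring and all names are invented stand-ins for a real embedding similarity:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping character windows. Larger chunks keep more
    context per hit; overlap avoids cutting facts at chunk boundaries."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words present in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def retrieve(query: str, chunks: list[str], threshold: float = 0.3) -> list[str]:
    # Lowering the threshold admits more (possibly noisier) chunks, which is
    # consistent with the plugin "working after lowering a threshold value".
    return [c for c in chunks if overlap_score(query, c) >= threshold]
```

Pasting content directly into the chat, as suggested in the thread, simply bypasses this retrieval step entirely.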

LM Studio ▷ #hardware-discussion (30 messagesđŸ”„):

AIO fan setup, GPU overheating, Remote AI rigs

  • AIO Fan Direction Confusion Ensues: A user clarified their AIO fan setup, explaining that the additional fan is positioned to hammer fresh intake air onto the GPUs, creating a “channel”.
    • They added that it’s a 420 radiator which doesn’t get hot.
  • Bottom GPU Overheating Fixed by Fans: A user noted their bottom GPU was overheating, but adding fans to the bottom of the case fixed it.
    • Another user noted they have a fishtank case so big fans fit everywhere lol.
  • Remote AI Rig Access Methods: A user inquired about remote access solutions for separate AI rigs, specifically asking about Windows’ built-in Remote Desktop versus other alternatives.
    • Another user mentioned using VNC for accessing LLMs via VMs.

Latent Space ▷ #ai-general-chat (95 messagesđŸ”„đŸ”„):

Open Source Code, Kimi K2.5 Model, Clawdbot, Agent-First Programming, Prism Science Workspace

  • Kimi K2.5 debuts with Zero-Shot Coding Chops: The Kimi K2.5 model has officially launched showing promising results in zero-shot coding benchmarks according to its official website.
    • Further evaluation is needed to determine its performance on complex agentic coding tasks.
  • Clawdbot morphs into Moltbot after Trademark Trauma: Clawdbot has officially rebranded as Moltbot (with the mascot Clawd renamed to Molty) following a trademark request from Anthropic, according to this announcement.
  • Karpathy Kodes the Agent-First Future: Andrej Karpathy detailed a major shift to agent-driven coding using Claude noting the strengths of LLMs like tireless tenacity and increased leverage in this post.
  • OpenAI Opens Prism, a portal for Scientific Progress: OpenAI has introduced Prism, a free collaborative research workspace for scientists powered by GPT-5.2, and available to all users with a personal ChatGPT account via this portal.
  • Trinity Large, the 400B Parameter Powerhouse: Prime Intellect, in collaboration with Arcee AI and Datology, has introduced Trinity Large, a 400B parameter Mixture of Experts model (but uses only 13B active parameters!), according to the announcement.

Latent Space ▷ #genmedia-creative-ai (5 messages):

ModelScope, Z-Image, Scalable Single-Stream DiT, Z-Image-i2L

  • ModelScope Unveils Z-Image: ModelScope has launched Z-Image, a full non-distilled version of their image generation model built on Scalable Single-Stream DiT architecture.
    • It features photorealistic capabilities, high output diversity, and support for community tools like LoRA and ControlNet, including Z-Image-i2L for single-image style learning.

GPU MODE ▷ #general (9 messagesđŸ”„):

Open Source FlagOS stack, TorchX multi-node GPU orchestration, Kernelboard PR

  • New FlagOS stack aims to unify Model–System–Chip layers: Tongjie introduced FlagOS, an open-source system software stack that aims to unify the Model–System–Chip layers and make AI workloads more portable across heterogeneous hardware.
    • The goal is to learn from ongoing discussions around ML systems, compilers, and hardware–software co-design.
  • Inquiries on TorchX Recommendation for Multi-Node GPU Orchestration: A member asked about the TorchX video and whether it is still the recommended standard for multi-node GPU orchestration.
    • No further answer or details were provided in the messages.
  • Request for Kernelboard PR with Description: marksaroufim requested that a member send a PR to kernelboard with their fav description.
    • The request was made in reference to channel #1373414141427191809

GPU MODE ▷ #cuda (17 messagesđŸ”„):

B Matrix Layout in CUDA, NCU Profiling on Cloud, CUDA enable-input-d Predicate, BLOCK_K optimization

  • B Matrix Layout Confusion Clarified: A member inquired about the B matrix layout in a CUDA code example, noting the discrepancy between the CUDA code and the Python caller code’s matrix transposition.
    • The author clarified they were using K-major layout, achieved via .T in the Python code, and that the shape is primarily for torch.mm() compatibility, not the C++ pointer logic.
  • NCU Profiling Trials & Tribulations: A member shared their frustrations trying to use NCU (the Nsight Compute CLI profiler) on cloud vendors for benchmarking CUDA code, specifically mentioning its unavailability on Modal.
  • Unlocking enable-input-d Predicate Registers: A member expressed frustration with the lack of clarity in the PTX documentation regarding the enable-input-d predicate register, quoting: “The operation of the form D = AB is issued when the input predicate argument enable-input-d is false.”
    • They eventually understood its meaning, stating: “i finally got it working, i misunderstood the meaning of the stupid enable-input-d predicate register
”, and thanked another member for a helpful GitHub example.
  • BLOCK_K=64 Optimization Advised: A member suggested that there is no good reason to use BLOCK_K>64 (128-byte), and that using BLOCK_K=64 to simplify the code is preferable.
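The K-major clarification above is easy to verify without CUDA or torch: transposing a row-major buffer is purely a stride swap, so a (K, N) row-major matrix viewed through .T reads as K-major without any copy. A minimal pure-Python stride sketch (all names invented):

```python
def make_row_major(rows: int, cols: int):
    """Flat row-major buffer plus its (row, col) strides."""
    flat = list(range(rows * cols))
    return flat, (cols, 1)

def transpose_view(strides):
    """A .T-style transpose just swaps strides; the buffer is untouched."""
    return (strides[1], strides[0])

def at(flat, strides, i, j):
    """Strided indexing into the flat buffer."""
    return flat[i * strides[0] + j * strides[1]]

# B stored row-major as (K=2, N=3): walking along K in the transposed view
# steps by the original row stride, i.e. K-major access on the same memory.
flat, s = make_row_major(2, 3)
t = transpose_view(s)
```

This is why the Python `.T` satisfies `torch.mm()`’s shape expectations while the C++ pointer logic can keep indexing the original contiguous buffer.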

GPU MODE ▷ #torch (1 message):

Torch CustomOp, CompositeAutoGrad kernel, Autograd backward implementations, Native PyTorch operations, SpecializedModule in Torch

  • CustomOp Autograd Dilemma: A member inquired about why autograd can’t derive backward implementations for a fresh CustomOp registered with a CompositeAutoGrad kernel comprised of native PyTorch operations.
    • They are looking for a way to register only a fast forward and first derivative, and then have autograd handle arbitrarily higher derivatives.
  • Seeking Torch-Native Autograd Solution: The member is seeking the most Torch-native way to allow autograd to fulfill requests for arbitrarily higher derivatives, even if the process is slow.
    • They want to avoid writing custom backward implementations for all possible scenarios and rely on autograd for higher-order differentiation.
  • SpecializedModule forward pass: The code is constructing a custom forward pass for a SpecializedModule.
    • It utilizes a custom_op to perform a simple addition operation x + num where num is an attribute of the module. It leverages register_cpp_extension_helper to accomplish this task.
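The torch.library specifics weren’t resolved in the thread, but the underlying premise (a composite of differentiable primitives admits arbitrarily high derivatives without hand-written backwards) can be illustrated with a toy forward-mode autodiff; nothing here is PyTorch API:

```python
class Dual:
    """Toy forward-mode AD value a + b·Δ with Δ² = 0."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        o = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, other):
        o = other if isinstance(other, Dual) else Dual(other)
        # The product rule falls out of the primitive's definition.
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__

def derivative(f, x):
    """Differentiate any composite of + and *; no hand-written backward."""
    return f(Dual(x, 1.0)).dot

f = lambda x: x * x * x + 2                           # composite of primitives
first = derivative(f, 3.0)                            # 3x² at x=3
second = derivative(lambda x: derivative(f, x), 3.0)  # 6x at x=3, by nesting
```

Nesting `derivative` is the key move: because each primitive carries its own derivative rule, higher orders compose mechanically, which is exactly the behavior the member hoped to get from a composite-registered op.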

INT4 QAT, H200 Rollout, RLHF Slime

  • Analysis Ago Squeezes 1TB Model into H200: The Analysis Ago repo demonstrates squeezing a 1TB model rollout into a single H200 using INT4 QAT RL end-to-end practice.
    • This is a significant step towards more efficient and accessible large-scale model deployment, even if we’re squeezing it.
  • Awesome ML SYS Tutorial on INT4: The Awesome-ML-SYS-Tutorial has an INT4 implementation for RLHF Slime.
    • This tutorial could be useful for those looking to optimize ML systems with reduced precision techniques.

GPU MODE ▷ #job-postings (3 messages):

Decart hiring, Infra engineer skills

  • Decart hires for optimization team: Decart is hiring engineers for its optimization team to work on low-latency kernels for real-time video/world models and the latest generation of accelerators like Trainium 3.
    • Those interested can reach out to [email protected] with references to their performance work, such as GPU Mode submissions or OSS contributions.
  • Decart Announces Lucy 2: Decart announced Lucy 2, its latest autoregressive video editing model, and shared a tech report.
    • The optimization team is working on perf problems with unique constraints different from LLM inference.
  • Infra engineers find a path towards AI: An infra engineer (k8s/distributed systems) asked how to acquire the skills needed to help companies working across the AI stack reduce their costs.
    • The poster offered positions for engineers to join their growing SF office and work on extremely low-latency kernels for real-time video/world models.

GPU MODE ▷ #beginner (3 messages):

B200 Kernel Benchmarking, Popcorn for B200 Benchmarking, GPUMode vectorsum competition, CUDA vs Tiled-Based GPU Programming

  • Pop the Kernel with Popcorn on B200: A member inquired about benchmarking kernels on B200 using Popcorn, specifically looking for a cost-effective solution for the MLSys2026 hackathon, with interest in the fused MoE kernel.
  • VectorSum Competition Submission Wish: A member expressed interest in submitting a kernel to the GPUMode vectorsum “competition” or leaderboard, but noted that the deadline had passed.
    • They clarified that their goal was to learn by comparing their solution to the best ones, rather than winning.
  • CUDA or Tiled: GPU Newbie Dilemma: A member asked whether to start with CUDA or tiled-based GPU programming as a beginner in GPU programming.

GPU MODE ▷ #pmpp-book (3 messages):

CUDA grids vs threads, Shared memory in CUDA, Synchronization in CUDA, CUDA Occupancy

  • CUDA Grids: Indirect Threading Exposed: A member inquired about the one level of indirection between grids and threads in CUDA, wondering why grids are composed of blocks and blocks of threads rather than grids of threads directly.
    • Another member responded that shared memory and synchronization are important reasons for this design, along with occupancy considerations.
  • CUDA Threads Access Shared Memory: A member posted a link to the NVIDIA CUDA Programming Guide, which notes that all threads of a thread block are executed in a single SM.
    • The guide further explains that threads within a thread block can communicate and synchronize efficiently, because they all have access to the on-chip shared memory for exchanging information.

GPU MODE ▷ #rocm (8 messagesđŸ”„):

CUDA Monopoly, Data Center GPUs, MI325 Availability

  • Hating the CUDA Monopoly: One member expressed their hatred of the CUDA monopoly and their willingness to use any alternative platform as long as it performs.
    • Another member pointed out that data center GPUs are the focus for the AI space, not consumer GPUs.
  • MI325: Use it or Lose It: When asked about using an MI325, one member retorted that it doesn’t perform well, and when it breaks you get to keep both pieces.
    • Another member suggested renting one from the cloud for an hour to evaluate its suitability.
  • MI325: Now You See It
: A member asked whether the MI325 is available on Lambda Cloud, but the answer was that it’s not really available anywhere; just the 300 is.
    • They clarified that the 325 is pretty much the same as the 300.

GPU MODE ▷ #popcorn (3 messages):

Team Meeting Prep, Concrete Project Seeking, Kernel LLM Generation

  • Team Meeting Preparation Advised: A member inquired about how to prepare for the first team meeting on February 3rd, seeking guidance beyond background papers and previous minutes.
    • Another member suggested checking out their 2026 post and the OG popcorn website for projects, plus experimenting with kernel LLM generation for leaderboard problems.
  • Project beats Rabbit Hole: A member expressed feeling that seeking a concrete project would be more beneficial than further delving into theoretical aspects after reading the first few chapters of PMPP and exploring Colfax Research’s layout categories.
    • They mentioned reading up to chapter 6 of PMPP, experimenting with PMPP leaderboards, and reviewing blog posts and layout categories from Colfax Research, with swizzling being the next agenda item.

GPU MODE ▷ #hardware (5 messages):

DGX instruction set, 5090 memory bandwidth, Open Source Model Hardware Needs, Hardware Requirements for Large Models

  • DGX & 5090 share instruction sets: The DGX and 5090 share the same instruction set, but the DGX has full-speed fp32 accumulation, similar to Blackwell PRO cards.
  • Memory bandwidth separates the two: The key difference is memory bandwidth (1.8TB/s on the 5090 versus roughly 300 GB/s on the DGX), which puts a premium on efficient use of the L2 cache.
  • Figuring out Hardware requirements for large models: A member shared a Google Slides presentation to help reason through hardware needs for running large, open source models at their original unquantized weights.
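A back-of-envelope check of why that bandwidth gap dominates memory-bound decoding: each generated token requires roughly one full pass over the weights, so tokens/s is bounded by bandwidth divided by weight bytes. The model size below is an assumed example, not a figure from the discussion:

```python
def streaming_tokens_per_s(weight_bytes: float, bw_bytes_per_s: float) -> float:
    """Rough upper bound for memory-bound decode: one full weight read per token."""
    return bw_bytes_per_s / weight_bytes

WEIGHTS = 15e9                                   # assumed: ~30B params at 4-bit
fast = streaming_tokens_per_s(WEIGHTS, 1.8e12)   # 5090-class bandwidth
slow = streaming_tokens_per_s(WEIGHTS, 300e9)    # DGX-class bandwidth
```

The ratio of the two bounds equals the bandwidth ratio (6x here), which is why cache reuse matters so much more on the lower-bandwidth part.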

GPU MODE ▷ #cutlass (9 messagesđŸ”„):

Colfax Research, Swizzling, CuTe DSL, CUTLASS Tutorials

  • Colfax Layout Categories Explored: A member finished reading Colfax International’s research on layout categories and is looking to put theory into practice, now missing swizzling from their toolkit.
    • Another member noted that swizzling is the big giant wall between people from old cuda and new CUDA/GPUs days.
  • Dabble with Data Layouts for Deeper CUDA Knowledge: A member from Colfax suggests putting theory aside and playing with things, recommending examination of CUTLASS examples and experimenting with printing layouts.
  • Print Layouts for Layout Understanding: A member suggested playing around and trying to understand what the different modes correspond to when printing out a layout.
    • They advised compiling kernels with different problem configs and observing how the layouts change.

GPU MODE ▷ #helion (2 messages):

Helion's slice support, Rangor requests more details

  • Helion’s Slice Request: A user asks when Helion will support slice functionality.
    • Rangor requests more specifics and examples to clarify the desired slice support.

GPU MODE ▷ #nvidia-competition (5 messages):

channel directions, NCU profiling

  • Channel Direction Advised: A user suggested moving banter to <#1215328286503075953>, noting the current channel might be the wrong place for it.
    • They pointed to <#1464407141128339571> as a potentially more suitable venue.
  • NCU Profiling Status Queried: A user inquired about the functionality of NCU profiling, tagging another user for insight.
    • They added no pressure to the query.

GPU MODE ▷ #career-advice (1 message):

leet.coder: Is there any way to get into the career of making GPUs? (Or developing drivers)


GPU MODE ▷ #cutile (1 message):

cuTile MoE, PyTorch redundancy, cuPy user defined kernels, tileiras compiler in CUDA Toolkit 13.1

  • cuTile samples reveal a full MoE: A member noted that the cuTile code samples (https://github.com/NVIDIA/cutile-python/tree/main/samples) include a full MoE implementation.
    • They wondered if cuTile will make PyTorch redundant.
  • cuPy incompatible with cuTile: A user observed that while cuPy offers user-defined kernels (https://docs.cupy.dev/en/stable/user_guide/kernel.html), this programming model is incompatible with cuTile.
    • They added that NVIDIA SDKs have way too much functionality overlap.
  • CUDA Toolkit 13.1 supports Blackwell: The tileiras compiler in CUDA Toolkit 13.1 only supports the following architectures: sm_100, sm_103 (Blackwell), sm_110 (future), sm_120, sm_121 (future).

GPU MODE ▷ #flashinfer (8 messagesđŸ”„):

FlashInfer-Bench, MLSYS Contest, Biweekly Leaderboard

  • FlashInfer-Bench Dataset Available: A dataset for FlashInfer-Bench development is available at flashinfer-ai/flashinfer-trace.
  • Biweekly Leaderboard Coming Soon: The team is working on supporting a biweekly leaderboard for the contest.

Eleuther ▷ #general (49 messagesđŸ”„):

AI PhD questions, Tesla for 24GB VRAM, Anthropic biorisk paper, Flow matching

  • Seeking AI PhD Question Suggestions: A member requested question suggestions to gauge the standards for a PhD in AI.
    • Another member suggested using the heuristic: “Is this a conversation that two AI researchers might have?”
  • Tesla GPU Purchase Questioned for AI Use: A member bought an NVIDIA Tesla card for its 24GB VRAM, prompting skepticism about its speed and power efficiency.
    • One member argued that accounting for energy costs, a 3090 would be more economical and efficient for the same work.
  • Anthropic Biorisk Paper Sparks Discussion: Members discussed the new Anthropic biorisk paper (link to arxiv, link to X) and its implications, particularly how fine-tuning open-source models on frontier model outputs can substantially increase capabilities.
    • The paper suggests that models can learn harmful capabilities through finetuning or unsuppress them if they were already present but suppressed by safety training, thus supporting the idea that ‘fine tuning can undo some refusals without much compute.’
  • Dynamic LoRA Stability Controller Released: A member shared a repo for a dynamic LoRA stability controller, with controlled experiments on multi-adapter setups, to address inference-time degradation and adapter interference.
    • The member also highlighted a focus on goal-aligned metrics over emergent benchmarks.

Eleuther ▷ #research (9 messagesđŸ”„):

Speedrun Results, Parallel Layers vs Sequential Architectures, Extrapolation Concerns

  • Harry Tests Speedrun with Parallel Layers: Harry ran a speedrun using parallel layers, and results indicate it underperforms the “hackable” baseline at small scales but trends positively towards larger scales, as seen in the attached graph.
  • Parallel Layers’ Performance Explained: The graph indicates that red represents parallel layers, blue represents sequential layers, with the y-axis showing the % change relative to a third normalized architecture.
    • The metric is perplexity, where lower is better, and the fit predicts a crossing point a little after 10^22 FLOPs.
  • Doubts on Extrapolation Accuracy: A member questioned the trustworthiness of extrapolating based on just four data points, suggesting that a proper comparison would involve calibrating by inference compute/latency to assess deployment advantages.
    • Another member did not endorse extrapolating by 3 OOMs based on 4 noisy points, noting that the plot indicated a trial within other experiments rather than a focused study on parallel vs. sequential architectures.
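As a toy illustration of how such an extrapolation works (and why four points is thin evidence): fit a least-squares line to % change versus log10 FLOP and solve for the zero crossing. The numbers below are synthetic, not the actual speedrun results:

```python
import math

# Hypothetical data: % perplexity change of parallel vs. sequential
# layers at four compute budgets (positive = parallel is worse).
flops = [1e18, 1e19, 1e20, 1e21]
pct_change = [2.0, 1.4, 0.9, 0.5]

# Ordinary least-squares line in log10(FLOP) space.
xs = [math.log10(f) for f in flops]
n = len(xs)
mx = sum(xs) / n
my = sum(pct_change) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, pct_change)) / \
        sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# Zero crossing: the compute at which the fit predicts parity.
crossing_log10_flop = -intercept / slope
print(f"predicted crossing at ~10^{crossing_log10_flop:.1f} FLOP")
```

With these made-up points the crossing lands near 10^21.9, but nudging any single point shifts it substantially, which is exactly the objection to extrapolating 3 OOMs from 4 noisy measurements.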

Eleuther ▷ #interpretability-general (1 message):

burnytech: https://fxtwitter.com/i/status/2016226870300443025 https://arxiv.org/abs/2601.13548


Yannick Kilcher ▷ #general (37 messagesđŸ”„):

Pytorch Bug, HungryLinearFunc, Schrodinger Bridges, Flow Matching, Autoregressive Models vs Diffusion Models

  • Pytorch Bug caused by Tensorflow: A member resolved a RAW: Lock blocking error in Pytorch by uninstalling Tensorflow.
    • They quipped that a bug report should be filed, but like
 what do you even report?
  • HungryLinearFunc code reveal: A member created a HungryLinearFunc class with zero initialization capabilities even on the scale of an LLM, and it matches a regular linear layer on a small scale.
    • They noted that using a ReLU after it is not recommended due to the zero gradient, and included a visualization of performance on a toy task.
  • Schrödinger Bridges as Regularized Flows: Schrödinger bridges are described as regularized flows, minimizing KL divergence under constraints on marginals, mathematically connecting to diffusion models, as explained in this paper.
    • It was clarified that with Schrödinger Bridges, the goal is to force the flow into some pre-defined format without just fixing it to a known dynamics, which is beneficial in physics.
  • Differentiating Autoregressive from Transformer Architectures: A member clarifies the distinction between autoregressive models and transformers in the context of diffusion, noting that transformers use patch embeddings to encode positional information but do not inherently discretize continuous diffusion.
    • They recommend a layman video explaining classical AR to denoising diffusion.
  • Autoregression vs Score Parameterization: A member questions why the causal specification is necessary in generative modeling loss functions, suggesting that parameterizing grad log p(x) (score) may be preferable, citing this blogpost.
    • A counterpoint was made that causal specification is useful for inference due to KV-caching in transformers and the ability to length-generalize; for discrete vocabularies, computing the normalization is a non-issue since logits can be turned into a distribution by means of softmax.
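The softmax point in the last bullet is easy to make concrete; a minimal, numerically stable version:

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution.

    Subtracting the max is the standard numerical-stability trick;
    it leaves the result unchanged because softmax is shift-invariant.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)  # sums to 1; the largest logit gets the largest mass
```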

Yannick Kilcher ▷ #paper-discussion (1 message):

Cohere Labs, Paper Reading Sessions, Frontier ML Papers, Reasoning, Safety, Real-World Applications

  • Cohere Labs launches Paper Reading Sessions!: Cohere Labs is promoting Paper Reading Sessions for its community, focusing on Frontier ML Papers Published in January 2026 and reshaping reasoning, safety and real-world applications.
    • The sessions are beginner-friendly and community-first, with no prep required, and welcome questions, critiques and alternative perspectives.
  • Papers Surveyed for Machine Learning: The papers to be surveyed include Urban Socio-Semantic Segmentation with Vision-Language Reasoning, Controlled Self-Evolution for Algorithmic Code Optimization, Alterbute, Action100M, Inference-time Physics Alignment of Video Generative Models, and Conditional Memory via Scalable Lookup.
    • These papers span topics from VLMs for scene understanding to LLMs evolving efficient code and a new sparsity axis for LLMs.

Yannick Kilcher ▷ #ml-news (18 messagesđŸ”„):

Robo-phobia, Broken Search Engines, Kimi Moonshot, AI and Job Creation, ChatGPT Wrappers

  • Crabby Search Engines are Broken: A member expressed frustration that search engines are so broken they couldn’t find a gif of the crab scene from Hitchhiker’s Guide to the Galaxy.
  • Kimi K2.5 Launched: A member shared links to Kimi Moonshot on Twitter and the Kimi K2.5 blogpost.
  • AI Can Create Jobs?: A member shared an interesting read about job creation and a related YouTube video.
  • ChatGPT Wrappers Everywhere: Members discussed that most new “things” are just ChatGPT wrappers and questioned the usefulness of some of them, with one asking if a particular tool was meant to be an Overleaf killer.
  • Clawdbot Scam?: A member sarcastically commented that OpenAI is making a wrapper for their own tool, and that someone had already raked in the Clawdbot scam money, attaching an image of a receipt as evidence.

tinygrad (George Hotz) ▷ #general (32 messagesđŸ”„):

Codegenning Flash Attention, Megakernels vs Kernel Schedulers, Hardware vs Software Dependency Tracking, AMD Emulator Debugging, Optimizing GitHub Actions

  • Flash Attention now codegenning directly from frontend: A member shared that they were able to prove the connection and codegen flash attention directly from a frontend definition of naive attention.
    • Since then, the rewrites have gotten a lot more granular (no single big online softmax rewrite).
  • Megakernels beat Kernel Schedulers for GPU: George Hotz linked to a blog post from Luminal discussing compiling models to megakernels.
    • The discussion suggests that GPUs are moving away from using the kernel scheduler towards an “operating system” that installs itself on all the CUs.
  • Fine-Grained Hardware Dependency Trackers Needed: Members discussed the need for hardware-based schedulers / dependency trackers to achieve low latency, noting that significant effort was spent on low-latency software dependency tracking.
    • They suggest building a fairly generic scheduler into hardware, rather than relying solely on software solutions, to avoid multiple gmem roundtrips.
  • New AMD Emulator prints debug instructions when running: A member shared that with the new AMD emulator (AMD=1 MOCKGPU=1), DEBUG=3 prints all the instructions when they are compiled, and DEBUG=6 prints all of them as they run.
    • An image was attached, showcasing the debugging output of the emulator.
  • Optimizing GitHub Actions the ‘Right’ Way: George Hotz critiqued using faster computers (rented via services like Blacksmith) to speed up GitHub Actions, arguing it doesn’t truly make the code faster.
    • He emphasized that the goal with tinygrad is to do things the ‘right’ way, focusing on code optimization rather than relying on external resources.
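On the flash-attention item above: the core "online softmax" trick being decomposed into granular rewrites can be sketched generically (this is a plain-Python illustration of the technique, not tinygrad's actual rewrite rules):

```python
import math

def online_softmax_denominator(stream):
    """Single-pass, numerically stable softmax normalizer.

    Maintain a running max m and a running sum s of exp(x - m),
    rescaling s whenever a new maximum appears, so the stream is
    traversed only once. This is the building block that fused
    attention kernels rely on to avoid a second pass.
    """
    m = float("-inf")
    s = 0.0
    for x in stream:
        if x > m:
            s *= math.exp(m - x)  # rescale old sum to the new max (0.0 on first step)
            m = x
        s += math.exp(x - m)
    return m, s

xs = [0.5, 2.0, -1.0, 3.0]
m, s = online_softmax_denominator(xs)
ref = sum(math.exp(x - max(xs)) for x in xs)  # classic two-pass result
print(m, s)
```

The one-pass result matches the two-pass reference exactly, which is why the rewrite preserves semantics.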

DSPy ▷ #show-and-tell (14 messagesđŸ”„):

CheshireCat framework, Agentic workflows, Agno vs CheshireCat, Minecraft AI Agent using DSPy

  • CheshireCat framework introduces new features: A member presented new features in the enterprise fork of CheshireCat, a framework for creating AI agents, highlighting the introduction of agentic workflows that automate agent creation by implementing the workflow itself, while CheshireCat provides the infrastructure. github link
  • Agno vs CheshireCat: A debate sparks: A member suggested using existing frameworks like Agno or Sentient instead of creating new ones, but the author of CheshireCat argued that it offers more than just agentic workflow management, including multitenancy.
    • Other members preferred Agno, citing CheshireCat’s steeper learning curve; after a brief debate, the author of CheshireCat defended their work, emphasizing the right to share it and receive constructive feedback.
  • Minecraft AI Agent built with DSPy: A member shared their project of creating a Minecraft Playing AI using a DSPy RLM agent and the Minecraft MCP, including a status update, YouTube video, open-sourced code, and process blog.

DSPy ▷ #general (7 messages):

RLM Module, Autonomous Agents, Healthcare AI, Decision Support Systems, Next-Gen Conversational AI

  • CoderRLM Module Wraps Python Interpreter: A member introduced a CoderRLM module that wraps a Python interpreter and adds null = None to the preamble so JSON-flavored output evaluates cleanly, especially useful in Deno/Pyodide REPL environments.
    • The module loads reference data like CM_INDEX_FILE, CM_TABULAR_FILE, CM_D_FILE, and CM_N_FILE as REPL variables for coding using the RLM paradigm.
  • Autonomous Agents Design: A member is designing autonomous agents that self-learn, plan, execute, and recover from tool/API failures without human intervention, focusing on continuous improvement systems for long-term AI performance.
    • These agents are designed to operate in various sectors, leveraging tools and frameworks such as Python, TensorFlow, PyTorch, FastAPI, and AWS.
  • Healthcare AI Predictive Models: A member is developing predictive healthcare models that automate diagnostics, monitor patient health, and optimize clinical workflows using NLP-powered clinical data systems to extract insights from unstructured medical notes.
    • These systems are designed with HIPAA compliance and security measures such as RBAC and audit logging for sensitive data.
  • Decision Support System Integrations: A member is building real-time decision-making tools for critical sectors like healthcare and finance.
    • These involve predictive models and intelligent systems that transform massive datasets into actionable insights, using databases like PostgreSQL, MongoDB, and Elasticsearch.
  • Next-Gen Conversational AI Implementation: A member is creating AI-driven chatbots that handle multi-turn, context-aware conversations across multiple platforms.
    • These chatbots implement advanced NLP models to provide personalized, real-time support and assistance leveraging frameworks like Next.js, NestJS, and Vue.js.
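Circling back to the CoderRLM item above, the `null = None` preamble trick can be sketched in a few lines; all names here are hypothetical illustrations, not actual CoderRLM internals:

```python
# Hypothetical sketch: prepend `null = None` so model-emitted,
# JSON-flavored Python such as {"key": null} evaluates cleanly
# instead of raising NameError on `null`.
PREAMBLE = "null = None\n"

model_code = 'result = {"status": "ok", "error": null}'

ns = {}
exec(PREAMBLE + model_code, ns)
print(ns["result"])  # {'status': 'ok', 'error': None}
```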

Manus.im Discord ▷ #general (10 messagesđŸ”„):

doe.so alternative to Manus, Manus Skills launch, Senior AI/ML & Full-Stack Engineer looking for opps, Cloud Browser issues

  • New Manus Alternative: A member suggested users try doe.so, implying it’s a better alternative to Manus.
    • The member said it just feels smarter.
  • Manus Skills Launch: The Manus team announced the launch of Manus Skills and invited the community to test them by building or using a Skill.
    • They encouraged users to share their use cases on X (formerly Twitter) for reposts and free credits, tagging @ManusAI.
  • AI/ML Engineer Seeking Projects: A member introduced themselves as a full stack + AI dev looking for work.
    • They highlighted experience in Autonomous Agents, Healthcare AI, Decision Support Systems, Conversational AI, and Fraud Detection Systems with a list of technologies.
  • Cloud Browser server unavailable: A member reported that their cloud browser screen displays an error: The temporary website is currently unavailable. This may be because Manus’s computer is asleep or the link has expired.
    • They mentioned attempting to wake it up and assigning tasks, but the website still doesn’t appear and they’ve exhausted their credits trying to reset the screen.

aider (Paul Gauthier) ▷ #general (4 messages):

Aider's GitHub, AI/ML projects, Tech Stack

  • Aider’s GitHub Declared Stale: A user inquired why Aider’s GitHub is stale, last updated in 2025, to which another user responded that it won’t be maintained anymore.
  • AI Engineer’s Current Projects: An AI Engineer listed current projects involving Autonomous Agents, Healthcare AI, Decision Support, Conversational AI, Fraud Detection, and AI Automation.
  • AI Engineer’s Tech Stack: An AI Engineer shared their tech stack which includes Python, TypeScript, Go, Rust, TensorFlow, PyTorch, Hugging Face, OpenAI, PostgreSQL, Kafka, AWS, Docker, HIPAA, RBAC, Audit Logs, and Encryption.

Modular (Mojo đŸ”„) ▷ #general (2 messages):

Container Issue, SYS_PTRACE, devcontainer.json

  • Container Configuration Cures Confinement Crisis: A member resolved a container issue by adding --cap-add=SYS_PTRACE --security-opt seccomp=unconfined when running the container.
    • Alternatively, users can add runArgs to .devcontainer/devcontainer.json with the same parameters to achieve the same effect.
  • Security Opts Solve Mysterious Container Conundrums: The user reported resolution by adding --security-opt seccomp=unconfined.
    • This disables seccomp, potentially resolving issues related to system call restrictions within the container.
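For reference, the devcontainer.json variant mentioned above would look something like this sketch:

```json
{
  "runArgs": ["--cap-add=SYS_PTRACE", "--security-opt", "seccomp=unconfined"]
}
```

Note that disabling seccomp removes syscall filtering for the whole container, so this is best kept to local development setups.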