Anthropic notches another W.

AI News for 2/16/2026-2/17/2026. We checked 12 subreddits, 544 Twitters and 24 Discords (261 channels, and 11323 messages) for you. Estimated reading time saved (at 200wpm): 1096 minutes. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

Despite many rumors of a “Sonnet 5”, Anthropic opted to launch Sonnet 4.6 today, bumping their cheaper workhorse model up to match Opus 4.6. The launch touts preference wins for Sonnet 4.6 over Opus 4.5 and a 1M-token context window, though the model still generally lags Opus on the usual benchmarks, and on GDPval-AA (below) it uses ~4.8x more tokens than Sonnet 4.5, so the all-in cost can be higher than Opus on some tasks. The API platform tools and the Excel integrations also got minor upgrades.

Among the key highlights are the long-term improvements in Computer Use, first launched in Oct 2024. At launch it was slow and inaccurate to the point of being impractical, but it is now productized as Claude Cowork, which has anecdotally seen more successful adoption than OpenAI’s equivalent Operator and Agent iterations.

We have tuned the Twitter recap below to include more datapoints, but that is really all you need to know.


AI Twitter Recap

Top Story: Sonnet 4.6 launch

What happened (timeline + headline claims)

Anthropic launched Claude Sonnet 4.6 as an upgrade to Sonnet 4.5, positioning it as their most capable Sonnet model with broad improvements across coding, computer use, long-context reasoning, agent planning, knowledge work, and design, plus a 1M-token context window (beta) [@claudeai]. Early chatter preceded the announcement (“Sonnet 4.6 incoming!”) [@kimmonismus], then the launch triggered a wave of benchmark callouts, tooling/platform integrations (Cursor, Windsurf, Microsoft Foundry, Perplexity/Comet, etc.), and mixed early user feedback about quality and reliability.

Key distribution signals in this tweet set:

  • Official announcement + feature list + 1M context (beta) [@claudeai]
  • Anthropic employee framing: “approaching Opus-class… insane jump over 4.5” [@alexalbert__]
  • Benchmark snippets (SWE-Bench Verified, ARC-AGI-2, preference vs Opus 4.5, GDPval, Vending-Bench, etc.) from community/benchmark accounts [@scaling01], [@scaling01], [@scaling01]
  • Independent eval org update: Sonnet 4.6 leads GDPval-AA ELO (agentic knowledge work), with much higher token use than 4.5 [@ArtificialAnlys]
  • Pricing claim: “same pricing as Sonnet 4.5” [@kimmonismus]
  • Post-launch “regression?” report: hallucinated function names / broken structured outputs; later “seems fixed” [@rishdotblog], [@rishdotblog]

Facts vs opinions (clearly separated)

Factual / checkable claims (from tweets)

  • Sonnet 4.6 is described by Anthropic as a full upgrade across multiple capability areas and includes a 1M token context window in beta [@claudeai].
  • Benchmark datapoints cited:
    • 79.6% SWE-Bench Verified, 58.3% ARC-AGI-2 (as posted) [@scaling01].
    • “Users preferred Sonnet 4.6 over Opus 4.5 59% of the time” [@scaling01].
    • “Sonnet 4.6 the best model on GDPval” (claim) [@scaling01].
  • Artificial Analysis (independent benchmarking org) claims:
    • Sonnet 4.6 reached GDPval-AA ELO 1633 (in “adaptive thinking mode” and “max effort”), and is #1 on their GDPval-AA leaderboard but within the 95% CI of Opus 4.6 [@ArtificialAnlys].
    • Token usage to run GDPval-AA: Sonnet 4.6 used 280M total tokens (vs Sonnet 4.5 58M); Opus 4.6 used 160M in equivalent settings [@ArtificialAnlys].
    • Sonnet 4.6 improved aesthetic quality of generated docs/presentations relative to 4.5 on GDPval-AA outputs [@ArtificialAnlys].
  • Tooling update: Anthropic web search/fetch tools now execute code to filter results; reported effect: +13% accuracy on BrowseComp with 32% fewer input tokens when enabled (as posted) [@alexalbert__].
  • Availability / integrations mentioned:
    • Cursor: “Sonnet 4.6 is now available in Cursor… notable improvement over 4.5 on longer tasks, but below Opus 4.6 for intelligence” [@cursor_ai].
    • Windsurf availability [@cognition].
    • Microsoft Foundry availability [@Azure].
    • Perplexity Pro/Max availability [@perplexity_ai] and Comet browser agent using Sonnet 4.6 for Pro users [@comet].

Opinions / interpretations (what’s not settled)

  • “Approaching Opus-class capabilities… insane jump” [@alexalbert__] is qualitative framing (though consistent with some benchmark movement).
  • “Near human-level computer use” extrapolation [@alexalbert__] depends strongly on which “computer use” evals + harnesses + task distributions are used.
  • “Warmer and kinder… smarter and more overcaffeinated” is pure UX vibe [@sleepinyourhat].
  • “Taste is off the charts” / SVG skyline anecdote is subjective (but points to improved design/visual generation) [@scaling01].
  • Post-launch reliability concerns (“hallucinations everywhere… 4.6 crapping the bed”) are anecdotal reports from a specific workflow, though notable because they compare to 4.5 on the “same tasks” [@rishdotblog].

Technical details extracted (numbers, benchmarks, systems implications)

Core model/product knobs surfaced in tweets

  • Context window: 1M tokens (beta) [@claudeai].
  • Pricing: “same pricing as Sonnet 4.5” [@kimmonismus] (no $/tok quoted directly in these tweets, but note RundownAI cites “Sonnet pricing [$3/$15 per mil tokens]” as context [@TheRundownAI]).
  • Search/fetch tool change: pre-context filtering via executable code; +13% BrowseComp accuracy, -32% input tokens [@alexalbert__].
    • Systems read: this is an explicit shift toward tool-side “compute before context”—spending tool compute to reduce prompt budget and improve signal-to-noise in retrieved context.
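The “compute before context” pattern is easy to picture as a tiny pre-context pass. A minimal sketch (the scoring function and limits here are hypothetical illustrations, not Anthropic’s actual tooling): rank retrieved documents by query-term hits, drop zero-signal ones, and cap what reaches the prompt.

```python
import re

def filter_retrieved(docs, query_terms, max_chars=2000):
    """Toy 'compute before context' pass: score retrieved docs by
    query-term hits and keep only the top snippets, so the prompt
    receives a fraction of the raw retrieval output."""
    def score(doc):
        text = doc.lower()
        return sum(len(re.findall(re.escape(t.lower()), text)) for t in query_terms)

    ranked = sorted(docs, key=score, reverse=True)
    kept, used = [], 0
    for doc in ranked:
        if score(doc) == 0:
            continue  # drop zero-signal documents entirely
        snippet = doc[:500]
        if used + len(snippet) > max_chars:
            break
        kept.append(snippet)
        used += len(snippet)
    return kept

docs = [
    "BrowseComp accuracy improved when filtering was enabled.",
    "Unrelated marketing copy about a cooking product.",
    "Token budgets shrink when pre-context filtering runs first.",
]
print(filter_retrieved(docs, ["filtering", "token"]))
```

The reported +13% accuracy / −32% input tokens is exactly this trade: spend cheap tool-side compute so fewer, denser tokens hit the model.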

Benchmarks and what they suggest (with caveats)

  • SWE-Bench Verified 79.6% (posted) [@scaling01].
    • Interpretation: SWE-Bench Verified is sensitive to harness, timeouts, repo setup, and tool reliability. Still, 79.6% is “frontier-tier” in the common discourse.
  • ARC-AGI-2 58.3% (posted) [@scaling01].
    • Also see longitudinal claim: “141 days… 13.6% to 60.4% on ARC-AGI-2” (Sonnet line progress, presumably 4.5→4.6 or earlier→now) [@scaling01].
  • Preference eval: “preferred over Opus 4.5 59%” [@scaling01].
  • GDPval-AA (Artificial Analysis): ELO 1633, #1 but statistically overlapping Opus 4.6; token usage 280M for Sonnet 4.6 vs 58M for Sonnet 4.5; cost to run GDPval-AA “just ahead of Opus 4.6” (because of token usage) [@ArtificialAnlys].
    • Important implication for engineers: “Best” may be bought with more thinking tokens, which impacts latency and spend; a router may pick 4.6 selectively.
  • Vending-Bench Arena strategy claim: with 1M context, Sonnet 4.6 uses a “capacity-first then profitability pivot” plan [@felixrieseberg].
    • This is a rare example of a behavioral shift attributed to long-context planning capacity, but it’s still a single benchmark anecdote.
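The “best may be bought with more thinking tokens” point is just arithmetic. A back-of-envelope comparison, where the Sonnet-tier $3/$15 per-Mtok prices come from the recap but the competitor prices and the input/output split are pure assumptions for illustration:

```python
# Back-of-envelope: why more thinking tokens can erase a cheaper
# per-token price. The input/output split (70/30) and the competitor
# prices ($5/$25) are assumptions, not disclosed figures.
def run_cost(total_tokens, in_share, price_in, price_out):
    in_tok = total_tokens * in_share
    out_tok = total_tokens * (1 - in_share)
    return (in_tok * price_in + out_tok * price_out) / 1e6

# 280M tokens at Sonnet-tier pricing vs 160M tokens at a pricier
# hypothetical tier, both with an assumed 70% input share:
sonnet = run_cost(280e6, 0.7, 3, 15)   # -> $1848
other = run_cost(160e6, 0.7, 5, 25)    # -> $1760
```

Under these assumptions a ~1.75× token multiplier more than cancels a ~1.7× price advantage, which is how a “cheaper” model ends up just ahead of Opus on eval run cost.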

Cost/latency + throughput signals

  • Engineers are explicitly noticing that frontier labs “blast millions of tokens… scaffold like a skyscraper” [@scaling01], aligning with Artificial Analysis’ disclosure that Sonnet 4.6 needed ~4.8× the tokens of Sonnet 4.5 on GDPval-AA [@ArtificialAnlys].
  • Cursor’s note: Sonnet 4.6 better on “longer tasks” but “below Opus 4.6 for intelligence” [@cursor_ai] suggests practical routing: Sonnet 4.6 as default long-horizon workhorse; Opus as max-capability.
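That routing logic can be sketched in a few lines. Everything numeric below is an assumption for illustration (per-token prices and thinking multipliers are placeholders loosely echoing the figures quoted above, not disclosed numbers):

```python
# Illustrative routing policy: default long-horizon work to the cheaper
# workhorse; escalate to the max-capability model only when the task is
# short and the estimated spend fits the budget. All constants are
# assumptions for the sketch.
MODELS = {
    "sonnet-4.6": {"in_per_mtok": 3.0, "out_per_mtok": 15.0, "think_multiplier": 4.8},
    "opus-4.6":   {"in_per_mtok": 15.0, "out_per_mtok": 75.0, "think_multiplier": 2.8},
}

def estimated_cost(model, in_tokens, out_tokens):
    m = MODELS[model]
    # Thinking-heavy configs inflate output tokens; fold that in.
    eff_out = out_tokens * m["think_multiplier"]
    return in_tokens / 1e6 * m["in_per_mtok"] + eff_out / 1e6 * m["out_per_mtok"]

def route(task_horizon, budget_usd, in_tokens=200_000, out_tokens=20_000):
    if task_horizon == "long":
        return "sonnet-4.6"
    if estimated_cost("opus-4.6", in_tokens, out_tokens) <= budget_usd:
        return "opus-4.6"
    return "sonnet-4.6"
```

The point is not the constants but the shape: “best model” becomes a per-task decision over horizon × budget, not a global default.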

Different perspectives in the dataset

Strongly positive / “this is a big jump”

  • Anthropic-side: “most capable Sonnet… full upgrade… 1M context” [@claudeai] and “approaching Opus-class… jump… insane” [@alexalbert__].
  • Benchmark boosters: SWE-Bench/ARC-AGI-2 callouts [@scaling01], GDPval best-model claim [@scaling01], “crushes Gemini 3 and GPT-5.2 on Vending-Bench 2” [@scaling01].
  • Practitioners: “beast for real-world work… computer usage” [@kimmonismus], “computer use standout… more consistent over long sessions” [@mikeyk].

Neutral / adoption & positioning notes

  • “no Sonnet 5” reaction [@dejavucoder] reflects expectations management rather than capability.
  • Cursor’s measured product note (better than 4.5, below Opus 4.6) [@cursor_ai].
  • Artificial Analysis: #1 GDPval-AA but within CI of Opus 4.6 + disclosure that it uses more tokens [@ArtificialAnlys].

Negative / skeptical / “something broke”

  • Reliability regression report: hallucinated function names in agent workflows; structured output errors; “4.5 still works great” [@rishdotblog]. Follow-up: “Whatever this was seems fixed!” [@rishdotblog].
  • Cost sensitivity: “Sonnet and Slopus… munching through my credits” [@scaling01], plus later “price hurts” / cost follow-ups (not fully detailed in provided snippet) [@scaling01].
  • A comparative take in infra/product terms: “50% more expensive than xhigh and 228% over 5.2 codex… vast improvement over 4.5” [@teortaxesTex], which frames Sonnet 4.6 as improved but potentially cost-inefficient vs alternatives depending on workload.

Context: why Sonnet 4.6 matters (engineering implications)

  1. Long-context is becoming “operational,” not just a spec.
    The launch pushes a 1M token window into the Sonnet tier [@claudeai]. But Artificial Analysis’ disclosure that Sonnet 4.6 used 280M tokens to run GDPval-AA in “adaptive thinking/max effort” configs [@ArtificialAnlys] is a reminder: long-context + long-think can silently move your budget envelope. Expect more routing, summarization, context management, and “retrieve then filter” patterns (consistent with the new search/fetch filtering improvement [@alexalbert__]).

  2. Agent performance claims are increasingly harness-dependent.
    GDPval-AA uses an agentic harness (shell + browsing loop), and Sonnet 4.6’s lead is reported under a specific setup (“adaptive thinking mode”, “max effort”) [@ArtificialAnlys]. Cursor’s note that it’s better on longer tasks but below Opus for raw intelligence [@cursor_ai] reinforces that “best model” is not a scalar; it’s workload × harness × budget.

  3. Computer use is becoming a marquee capability, and Sonnet is being pushed there.
    Multiple tweets highlight “computer use” progress and near-human-level framing [@alexalbert__], and deployments like Perplexity’s Comet browser agent explicitly default to Sonnet 4.6 for Pro users [@comet].

  4. Release risk: small serving/config changes can look like “model regressions.”
    The reported post-launch hallucination spike across Opus 4.6 and Sonnet 4.6 [@rishdotblog]—and then “seems fixed” [@rishdotblog]—reads like a potential routing, toolchain, system prompt, or safety-layer change rather than weights. For teams: pin versions where possible, run canary evals, and monitor structured output validity + tool-call correctness separately from “chat quality.”
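A minimal canary along those lines, assuming a hypothetical JSON tool-call format and placeholder tool names: count structured-output validity and tool-name correctness as separate metrics rather than folding them into one “quality” score.

```python
import json

# Minimal canary check for a release/rollout: track JSON validity and
# tool-call name correctness as separate counters. The tool names and
# call format here are hypothetical placeholders.
KNOWN_TOOLS = {"search_web", "run_shell", "read_file"}

def canary_metrics(responses):
    metrics = {"json_valid": 0, "tool_name_valid": 0, "total": 0}
    for raw in responses:
        metrics["total"] += 1
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # invalid JSON counts against json_valid only
        metrics["json_valid"] += 1
        if call.get("tool") in KNOWN_TOOLS:
            metrics["tool_name_valid"] += 1  # catches hallucinated function names
    return metrics

samples = [
    '{"tool": "search_web", "args": {"q": "sonnet 4.6"}}',
    '{"tool": "serach_web", "args": {}}',   # hallucinated/misspelled tool
    'not json at all',
]
print(canary_metrics(samples))
```

Run against a pinned version on every deploy, a drop in either counter flags a serving/config change long before “chat quality” metrics move.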


Other Topics (standard coverage)

Open models & independent benchmarking (Qwen/GLM/Seed/Aya, etc.)

  • Artificial Analysis deep breakdown of Qwen3.5-397B-A17B (397B total / 17B active MoE, Apache 2.0, 262K ctx, native multimodal); big gains on agentic evals, but hallucination rate still high by their metric [@ArtificialAnlys].
  • GLM-5 cited as strong open model on WeirdML and other benches (48.2% WeirdML; comparisons to Opus/gpt-* claims) [@htihle], plus GLM-5 technical report highlights: DSA adoption, async RL infra, agent RL algorithms [@Zai_org].
  • ByteDance “Seed-2.0” announced (agent/reasoning/vision; “no distillation”; CN-only initially) [@TsingYoga].
  • Cohere Labs launched Tiny Aya: 3.35B open multilingual model family (70+ languages; “runs on a phone”), with claims of training on 64 GPUs and a detailed report [@nickfrosst], [@_akhaliq], [@mziizm].

Agents, harnesses, memory, and long-horizon infrastructure

  • “Agent World Model (AWM)” proposes fully synthetic executable environments (1,000 envs, 35,062 tools, 10,000 tasks, SQL-backed state, verification code) for RL tool-use agents [@dair_ai].
  • Lossless Context Management (LCM) / Volt claims: deterministic hierarchical DAG compression with lossless pointers; on OOLONG, “beats Claude Code at every context length 32K→1M” (reported) [@dair_ai], amplified [@omarsar0].
  • Moltbook multi-agent “society” study: 2.6M LLM agents, 300k posts, 1.8M comments; macro “culture” stabilizes, micro influence ~noise; critique of “just add agents” assumptions [@omarsar0].
  • LangChain “Harness Engineering” theme: traces → eval mining → self-verification loops; TerminalBench positioning [@Vtrivedy10], plus LangSmith Insights scheduling [@LangChain].
  • Open-sourcing an agent runtime (“Hankweave”) focused on removing context, maintainability, and reusable blocks across models [@hrishioa].

Systems & inference optimization (kernels, scheduling, throughput)

  • Carmack proposes OS-like GPU job preemption via UVM paging + MPS shim, aiming for seconds-scale task switching (acknowledges thrash risk) [@ID_AA_Carmack].
  • Moondream MoE kernel: 2.6% faster by tuning launch config to real routing distributions; kernel ~37% runtime [@vikhyatk].
  • Together-style “ThunderAgent” / “program abstraction” for end-to-end agent workflow scheduling; claims up to 3.9× faster rollout/serving without quality tradeoff (as posted) [@ben_athi], plus explanation thread [@simran_s_arora].

Frontier product moves: Codex, Grok, “computer use” competition

  • Codex usage report: users trying (and failing) to hit limits; heavy parallel agent usage within subscription windows [@theo].
  • OpenAI infra hiring pitch (agent orchestration, sandboxes, observability) [@gdb].
  • Grok 4.20 / 4.x discussion includes launch notices and architecture claims, plus highly polarized political framing by Elon [@kimmonismus], [@elonmusk], with critics calling performance weak vs “Flash” models [@teortaxesTex].

Robotics, video/image generation, and multimodal research

  • Unitree humanoid performance discourse (claims of distributed coordination, terrain adaptation, safety spacing, multi-DOF manipulation) [@ZhihuFrontier].
  • “Perceptive Humanoid Parkour” (depth-perception long-horizon traversal) [@zhenkirito123].
  • ByteDance BitDance: 14B AR image generator predicting binary visual tokens; claims FID 1.24 on ImageNet 256 [@iScienceLuvr], plus author promo [@multimodalart].
  • “Sphere Encoder” few-step image generation in spherical latent space; Meta/Goldstein thread with details including 65K latent dims for ImageNet and <5-step refinement [@tomgoldsteincs].

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen3.5 Model Release and Performance

  • Qwen3.5-397B-A17B is out!! (Activity: 1088): Qwen3.5-397B-A17B has been released on Hugging Face, featuring a 397 billion parameter model with a native context length of 262,144 tokens, which can be extended up to 1,010,000 tokens. This model is part of the Qwen series, known for its large-scale language capabilities. Additionally, a GGUF version is available here, which may offer optimized performance for specific use cases. There is anticipation and curiosity in the community about the model’s performance, with users eager to test its capabilities.

    • The Qwen3.5-397B-A17B model boasts a significant improvement in decoding throughput, being 3.5x to 7.2x faster than its predecessor, Qwen3-235B-A22B. This suggests substantial enhancements in efficiency, which could be crucial for applications requiring rapid processing of large datasets.
    • The model supports a native context length of 262,144 tokens, which can be extended up to 1,010,000 tokens. This extensibility is particularly beneficial for tasks that require handling extensive sequences of data, offering flexibility and scalability in various computational scenarios.
    • A user has shared a link to the GGUF version of the model on Hugging Face, indicating the availability of different formats for deployment. This could be useful for developers looking to integrate the model into diverse environments or optimize it for specific hardware configurations.
  • Qwen3.5-397B-A17B Unsloth GGUFs (Activity: 716): The image highlights the release of Qwen3.5, a 397 billion parameter multimodal reasoning model by Alibaba. It is designed to perform on par with models like Gemini 3 Pro, Claude Opus 4.5, and GPT-5.2 across various benchmarks such as IFBench, GPQA Diamond, and BFCL V4. The model supports advanced features like 256K context and is suitable for applications in coding, vision, and chat. The release includes support for running the model in 3-bit on a 192GB RAM Mac or 4-bit (MXFP4) on an M3 Ultra with 256GB RAM. The image and accompanying links provide resources for accessing and running the model, including dynamic GGUFs from Unsloth. The comments express excitement about the model’s size and capabilities, with one user noting the impressive 397B parameters and another appreciating the zero-day release.

    • The discussion highlights the verbosity of the Qwen3.5-397B-A17B model, which appears to overanalyze simple inputs like ‘hi’ by generating an extensive internal thought process before producing a response. This verbosity could be indicative of the model’s complexity and its attempt to simulate a human-like thought process, but it may also suggest inefficiencies in handling straightforward tasks.
    • A technical inquiry is raised about the performance comparison between two quantization formats, UD-Q4_K_XL and MXFP4. The commenter notes the lack of benchmarks directly comparing these formats, which are crucial for understanding their relative efficiency and effectiveness in model deployment scenarios. This highlights a gap in available performance data that could inform decisions on model optimization and deployment.
    • The comment by Ok_Brain_2376 points out that only 17 billion parameters are active in the Qwen3.5-397B-A17B model, suggesting a potential use of parameter-efficient techniques like AutoRound. This could imply that the model is designed to activate only a subset of its parameters for certain tasks, optimizing computational resources while maintaining performance.
  • Qwen3.5 is released! (Activity: 113): Alibaba has released Qwen3.5, a 397B MoE (Mixture of Experts) vision reasoning LLM, which is highlighted in the image. The model is compared against others like Gemini 3 Pro and GPT-5.2 across benchmarks such as instruction following, multilingual knowledge, and video reasoning. The image emphasizes Qwen3.5’s capabilities in coding, vision, and agent interaction, and provides technical details for running the model, though specific hardware requirements like VRAM are not detailed in the image. Commenters are curious about the hardware requirements, specifically VRAM, needed to run Qwen3.5, and are discussing equivalent setups to Apple’s 512 M3 Ultra configuration.

    • A user inquired about the VRAM requirements for running Qwen3.5, which is crucial for determining the feasibility of running the model on different hardware setups. This is a common concern for users with limited resources, as large models typically require significant VRAM to operate efficiently.
    • Another user asked about a non-Mac setup equivalent to the 512 M3 Ultra configuration. This suggests a need for understanding the hardware specifications and performance benchmarks of the M3 Ultra to find comparable alternatives in the PC ecosystem, particularly for those interested in high-performance computing tasks.
    • A user expressed interest in running Qwen3.5 on a setup with 2 x RTX 3090 Ti GPUs, indicating the high computational demand of the model. The RTX 3090 Ti is known for its substantial VRAM and processing power, yet the user anticipates needing to wait for a more optimized version to run on their hardware, highlighting the model’s intensive resource requirements.

2. AI Model Benchmarking and Performance

  • I gave 12 LLMs $2,000 and a food truck. Only 4 survived. (Activity: 829): The image is a line graph illustrating the performance of 12 language models (LLMs) in a simulated business environment where each model was given $2,000 and a food truck to manage over 30 days. The graph shows that only four models survived the simulation, with Claude Opus 4.6 achieving the highest net worth of $49K, followed by GPT-5.2 with $28K. The simulation revealed that models taking loans were more likely to go bankrupt, as all eight models that took loans failed. The experiment highlights the decision-making capabilities of different LLMs in a controlled business scenario, with a notable finding that Gemini 3 Flash Thinking consistently got stuck in an infinite decision loop. One commenter suggested using a logarithmic scale for the y-axis to better represent the data, especially since going to $0 ends the benchmark. Another noted that Opus performed exceptionally well, suggesting it might have been optimized for such tasks.

    • HeadlessNicholas suggests using a logarithmic scale for the y-axis in the benchmark results to better visualize the data, especially since reaching $0 ends the benchmark. This implies that the current linear scale might not effectively represent the performance differences among the models, particularly when some models fail early.
    • DinoAmino references the ‘Vending-Bench’ and notes that Opus performs exceptionally well in both scenarios, suggesting a consistent superiority in decision-making tasks. The mention of the arXiv paper implies that there is documented evidence of Opus’s capabilities, which could be useful for further technical analysis.
    • DarthLoki79 questions the novelty of the benchmark by comparing it to the ‘vending bench’, implying that the methodology or outcomes might not be significantly different. This raises a point about the need for clarity in how this benchmark distinguishes itself from previous ones, potentially in terms of parameters or evaluation criteria.
  • Qwen 3.5 goes bankrupt on Vending-Bench 2 (Activity: 836): The image presents a graph from a simulation called “Vending-Bench 2,” which evaluates the financial performance of various AI models over a period of 350 days. The graph shows that the model “Qwen 3.5 Plus” performed poorly, maintaining a balance near zero, indicating it went bankrupt. In contrast, “Claude Opus 4.6” demonstrated a strong upward trend, achieving the highest financial balance among the models tested. Other models like “GLM-5” and “Gemini 3 Flash” also showed positive growth, but not as significantly as Claude Opus 4.6. This suggests that Claude Opus 4.6 may have superior capabilities or strategies in this simulation context. One comment criticizes the use of similar colors in the chart, which may make it difficult to distinguish between the models. Another comment humorously suggests that Qwen 3.5 could operate as a non-profit organization due to its poor financial performance.

    • Chromix_ provides a detailed analysis of the Vending-Bench 2 results, noting that the chart displays the average balance in dollars across five runs. They mention that Qwen3.5 Plus is not included in the chart because it hasn’t been added to the official results page yet. The benchmark link is provided for further details: Vending-Bench 2.
    • SkylarNox raises a question about the versions of Qwen 3.5, specifically asking for clarification on the size difference between Qwen 3.5 Plus and the 397B version. This indicates a need for more transparency or documentation regarding the specifications and capabilities of different model versions.
  • Difference Between QWEN 3 Max-Thinking and QWEN 3.5 on a Spatial Reasoning Benchmark (MineBench) (Activity: 399): The post discusses the significant improvement of QWEN 3.5 over QWEN 3 Max-Thinking on the spatial reasoning benchmark, MineBench. QWEN 3.5’s performance is noted to be competitive with leading models like Opus 4.6, GPT-5.2, and Gemini 3 Pro. The benchmark results, available on GitHub, show QWEN 3.5 ranked 6th, while QWEN 3 Max is 19th, indicating a substantial performance gap. The model’s architecture is described as a hybrid linear-linear-linear-full attention model, with some issues in token prediction and language drift noted. Commenters highlight the robustness of QWEN 3.5, despite some issues with token prediction and language drift. There is confusion about the differences between the Plus and open-source versions of QWEN, with speculation that Plus includes extended context and tool calling features.

    • NandaVegg highlights that Qwen 3.5, including its Vision-Language (VL) capabilities, is notable for its hybrid linear-linear-linear-full attention model architecture. Despite some issues, such as occasional instruction ignoring and language drift, it is competitive with leading models in robustness. However, it may not be ideal for agentic tasks due to the lack of post-training mini-CoT adjustments common in agentic-maximized models.
    • Chromix_ provides a performance comparison from the MineBench leaderboard, noting that Qwen 3.5 ranks 6th, positioned between Gemini 3 Pro and GLM 5, while Qwen 3 Max is 19th, between Kimi K2 and GPT-4o. This indicates a significant performance gap between Qwen 3.5 and Qwen 3 Max, although Qwen 3.5’s results are still subject to change due to limited votes.
    • NandaVegg also mentions confusion regarding the “Plus” and open-source versions of Qwen models, noting that testing on Alibaba Cloud did not clarify the differences. The “Plus” version is assumed to be open-source with extended context to 1 million tokens and default tool calling enabled, including a search function on Alibaba Cloud.

3. Local AI Development and Optimization

  • [macOS] Built a 100% local, open-sourced, dictation app. Seeking beta testers for feedback! (Activity: 101): SpeakType is a new open-source dictation app for macOS that operates entirely offline, ensuring user privacy by processing all data locally. It aims to provide high-quality speech-to-text conversion without the recurring costs associated with cloud-based services. The app is currently in beta, seeking feedback on performance across different Mac hardware and accents, and is available for free during this phase. The project is hosted on GitHub and more details can be found on tryspeaktype.com. Commenters are curious about the app’s RAM requirements and how it compares to similar tools like Handy, questioning whether SpeakType includes additional logic or features. There is also interest in whether the app uses a Voice Activity Detector (VAD) to process audio before passing it to the Whisper model.

    • JohnHawley inquires about the differences between SpeakType and another dictation app, Handy, questioning if SpeakType includes additional logic not present in Handy. This suggests a comparison of feature sets and possibly performance or accuracy differences between the two applications.
    • rusty_daggar asks whether the app uses a Voice Activity Detector (VAD) to clean up audio before processing or if it sends all audio directly to the Whisper model. This question highlights interest in the app’s audio preprocessing techniques, which can significantly impact performance and accuracy.
  • The Mac Studio vs NVIDIA Dilemma – Best of Both Worlds? (Activity: 93): The user is considering two options for running local LLMs and training models: a Mac Studio with up to 192GB of unified memory, which allows running large models without VRAM constraints but lacks CUDA optimization and raw compute power; and an NVIDIA GPU setup, which offers superior performance and CUDA optimization but is limited by 32GB VRAM even on high-end GPUs like the 5090. The user seeks a solution that combines the memory capacity of Mac with NVIDIA’s computational power, which currently doesn’t exist in a single system. One commenter suggests that the use of models is more critical than training, emphasizing that inferencing is the primary use case, and recommends checking out vmlx.net for Mac users. Another suggests renting high-performance GPUs like B200 or H100x8 on platforms like RunPod for training, while using Mac’s memory for inference of models like Qwen and MiniMax. A third commenter notes that commercial APIs like Claude Max and ChatGPT Pro can be cost-effective alternatives to local hardware for building codes.

  • I’m an Android dev who knows nothing about x86. During my vacation I built a system that genetically evolves machine code — now I can run 80B models on a single RTX 4090. (Activity: 70): An Android developer utilized AI to create a system called Genesis that evolves x86 machine code, enabling the execution of 80B models on a single RTX 4090. The system uses an evolutionary approach to optimize AVX-512 kernels, achieving a 165x speedup over traditional CPU methods like bitsandbytes, and allowing for efficient hybrid inference by minimizing data transfer between CPU and GPU. The project is open source, with the kernel code available on GitHub, but the evolutionary engine remains private. The approach demonstrates that AI-driven code evolution can surpass human-optimized code, achieving up to 19.25% improvement over hand-tuned baselines. Some commenters expressed skepticism, likening the post to ‘delusional mad scientist fanfic,’ while others appreciated the technical depth, noting the inclusion of a detailed test suite in the shared code.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. Claude Sonnet 4.6 Release and Benchmarks

  • Sonnet 4.6 released !! (Activity: 1384): The image announces the release of Claude Sonnet 4.6, highlighting it as the most advanced version yet with significant improvements in areas such as coding, computer use, long-context reasoning, agent planning, knowledge work, and design. Notably, it features a 1 million token context window in beta, which is a substantial enhancement for handling extensive data inputs. This release positions Sonnet 4.6 as a competitive model in the AI landscape, potentially surpassing other models like Grok in certain capabilities. One comment humorously suggests that Grok has been outperformed by Sonnet 4.6, indicating a competitive edge in the AI model space. Another comment provides a practical example of Sonnet 4.6’s reasoning capabilities, demonstrating its ability to offer logical advice on everyday decisions, such as whether to walk or drive a short distance.

    • The release of Sonnet 4.6 has sparked discussions about its practical advice capabilities, as demonstrated by its response to a simple query about whether to walk or drive 40 meters. The model suggests walking due to factors like time efficiency, fuel savings, and health benefits, highlighting its ability to provide contextually relevant and practical advice.
    • There is a comparison between Sonnet 4.6 and other models like Grok, with some users humorously suggesting that Sonnet 4.6 has outperformed or ‘claudemogged’ Grok. This reflects ongoing debates in the AI community about the relative performance and capabilities of different language models.
    • The timing of Sonnet 4.6’s release is noted as strategic, potentially diverting attention from controversies surrounding other AI models, such as those associated with Elon Musk. This suggests a competitive landscape where release timing can influence public and professional perception.
  • This is Claude Sonnet 4.6: our most capable Sonnet model yet. (Activity: 1245): Claude Sonnet 4.6 is a significant upgrade in the Sonnet series, enhancing capabilities in coding, computer use, long-context reasoning, agent planning, knowledge work, and design. It introduces a 1M token context window in beta, which is a notable feature for handling extensive data inputs. The model shows improved performance across various benchmarks, nearing Opus-level intelligence but at a more accessible price point, making it suitable for a broader range of applications. It demonstrates human-level proficiency in complex computer tasks, such as navigating spreadsheets and completing multi-step web forms. The model is now available across all plans, including Cowork, Claude Code, and major cloud platforms, with the free tier also upgraded to Sonnet 4.6. More details can be found on Anthropic’s website. One commenter noted the model’s rollout was initially confusing due to legacy model displays. Another expressed curiosity about the impact on creative writing, while a third inquired about the availability of the 1M context feature in both the API and the website.

    • FriendlyTask4587 inquires about the context length of the Sonnet 4.6 model, asking if the 1 million token context is available both in the API and on the website, similar to the Opus model. This suggests a focus on the model’s ability to handle large inputs, which is crucial for tasks requiring extensive context retention.
    • nanolucas questions the differentiation between Sonnet and Opus models, specifically if cost is the only factor for choosing Sonnet over Opus. This implies a need for understanding the performance or feature differences between the two models, such as efficiency, speed, or specific use-case advantages that Sonnet might have over Opus.
    • Stupefied_Gaming notes an observation during the rollout of Sonnet 4.6, where the model was initially labeled as a legacy model. This could indicate a transitional phase in deployment or a temporary mislabeling, which might affect user perception or usage during the initial release period.
  • Claude Sonnet 4.6 just dropped, and the benchmarks are impressive (Activity: 785): Claude Sonnet 4.6 has been released, showcasing significant advancements in AI capabilities, including approaching Opus-level intelligence at a reduced cost. Key features include human-level computer use for tasks like navigating spreadsheets and multi-step forms, and an enhanced long-context reasoning ability with a 1M token context window. The model has shown strong performance in complex automation workflows, multi-step reasoning tasks, and knowledge-intensive applications, and is now available across all platforms, including API, Claude Code, and Cowork, as the default free tier model. A notable debate centers on the cost-performance ratio, with some users pointing out that the performance difference between Opus 4.6 and GPT-5.2 is minimal, yet the latter is significantly cheaper. There is also discussion about the practical availability of the 1M context length feature, with some users expressing difficulty in accessing it.

    • cowwoc highlights a critical issue in the AI model market: the performance gap between Opus 4.6 and GPT-5.2 is minimal, yet GPT-5.2 is significantly cheaper, costing 10x less. This cost-performance imbalance could drive users away from Anthropic’s offerings unless they adjust their pricing or enhance their models’ capabilities.
    • SatoshiNotMe points out a recurring issue with the promised ‘1M context length’ feature in beta, which seems elusive to users like Max20. This suggests potential communication or implementation gaps in delivering this feature to end-users, which could affect user satisfaction and trust.
    • joyfulsparrow compares Claude and Codex, noting that Codex offers seemingly unlimited token usage, whereas Claude’s token limit is quickly reached, even on a $20 plan. This limitation, coupled with Codex’s potential superiority in handling ‘agentic loop’ tasks, suggests that Codex might be a more efficient choice for users with heavy usage demands.
  • Claude Sonnet 4.6 is live in Cline v3.64.0 and it’s free until Feb 18. (Activity: 21): Anthropic has released Sonnet 4.6 in Cline v3.64.0, available for free until February 18. This update features improved speed, enhanced context provision during task execution, and effective library integration. Notably, the model excels in utilizing subagents for parallel tasks, offering a 1M token context window to handle entire codebases in a single request. In testing, ~70% of developers preferred Sonnet 4.6 over its predecessor, with 59% favoring it over Opus 4.5, citing reduced overengineering and fewer hallucinations. Post-free period, pricing remains at $3/$15 per MTok. Source. A user expressed renewed interest in using Cline, indicating positive reception of the update.

2. Grok 4.20 and Elon Musk Controversy

  • The newly released Grok 4.20 uses Elon Musk as its primary source (Activity: 2383): The image is a meme that humorously critiques the AI model Grok 4.20, suggesting it uses Elon Musk as a primary source for its responses, particularly on topics like gender pronouns. The conversation depicted in the image highlights a controversial stance on pronoun usage, attributed to Musk, emphasizing a focus on ‘biological reality.’ This reflects broader discussions about AI bias and the influence of prominent figures on AI training data. One comment highlights skepticism about the AI’s alignment with Musk’s views, noting it took multiple interactions to confirm this bias. Another comment criticizes the broader implications of Musk’s influence, touching on environmental and ethical concerns.

  • Grok 4.20 is just four Grok 4.1 agents (Activity: 699): The image humorously suggests that the Grok 4.20 model is essentially composed of four instances of the Grok 4.1 model, as indicated by the model name and ID ‘grok-4-1-thinking-1129’ in the log entry. This implies a potential lack of significant advancement or change in the model architecture, despite the new version number. The title and comments playfully critique this by likening it to a common trope of disguising something as more than it is, such as ‘four agents in a trenchcoat.’ One comment suggests that the company, possibly x.ai, might be experiencing operational issues, including delays in releasing Grok 4.20 and employee departures, which could impact the model’s development.

    • Brilliant-Weekend-68 highlights potential operational issues at x.ai, noting delays in the release of Grok 4.20 and significant employee departures. This suggests possible internal challenges that could impact the company’s ability to innovate and compete effectively in the AI space.
    • Glittering-Neck-2505 draws a parallel between xAI’s current struggles and Meta’s decline post-Llama 3 405b, suggesting that xAI’s initial promise has not been realized. This comparison implies that xAI may face similar challenges in maintaining momentum and delivering on expectations.
    • The discussion reflects skepticism about xAI’s strategic direction, with Glittering-Neck-2505 expressing relief that Grok 4.20 might not gain traction due to its perceived missteps, indicating a broader industry sentiment that xAI’s branding and execution may not resonate well with the technical community.

3. Qwen 3.5 Model Launch and Comparisons

  • Qwen3.5-397B-A17B <Release> (Activity: 302): Qwen3.5-397B-A17B is a new model featuring 397 billion total parameters with 17 billion active parameters, offering a native context length of 262k tokens, extendable to 1 million. It supports over 200 languages and employs a hybrid architecture combining Gated Delta Networks with Sparse Mixture of Experts (MoE) for enhanced speed. The model excels in true multimodality, performing well in GUI interaction, video comprehension, and agentic workflows. More details can be found on Qwen’s blog, Hugging Face, and GitHub. Commenters are surprised by the model’s 397 billion parameters, questioning the VRAM requirements for running such a model. There is also curiosity about the software used for the model’s GUI interactions, particularly in Excel, and whether it is publicly available or proprietary to the Qwen team.

    • Efficient_Cattle_958 highlights the unexpected scale of the Qwen3.5-397B model, which features a massive 397 billion parameters. This scale is significant as it suggests a substantial increase in computational power and potential capabilities compared to smaller models, which typically range from billions to tens of billions of parameters.
    • Sirius_Sec_ inquires about the VRAM requirements for running such a large model. Typically, models of this size require substantial VRAM, often in the range of hundreds of gigabytes, depending on optimizations like model parallelism or quantization techniques that might be employed to make them more accessible on consumer-grade hardware.
    • nunodonato asks about the software environment used to run the model, particularly in a demonstration involving Excel. This raises questions about whether the software is proprietary to the Qwen team or if it is available for public use, which could impact accessibility for developers and researchers interested in leveraging the model’s capabilities.
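The VRAM question above has a simple back-of-envelope answer: resident weight memory scales with total parameters, not the 17B active ones, since MoE sparsity cuts compute but every expert must still be loaded. A minimal sketch (the byte-per-parameter figures are illustrative assumptions, not Qwen's published requirements):

```python
# Back-of-envelope VRAM estimate for holding a model's weights in memory.
# Assumption: all weights resident; activations and KV cache ignored.
def weight_memory_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Gigabytes needed just for the weights."""
    return total_params_b * 1e9 * bytes_per_param / 1e9

# Qwen3.5-397B-A17B: 397B total parameters (17B active per token).
# MoE routing reduces FLOPs per token, not resident memory.
for label, bpp in [("BF16", 2.0), ("INT8", 1.0), ("~4.5-bit quant", 0.5625)]:
    print(f"{label}: ~{weight_memory_gb(397, bpp):.0f} GB")
```

Even a roughly 4-bit quantization lands in the low hundreds of gigabytes, which is why commenters expect multi-GPU or aggressive offloading setups.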
  • Alibaba just open-sourced a model that rivals GPT-5.2 (Activity: 140): Alibaba has open-sourced a new language model, Qwen 3.5, which is positioned as a competitor to OpenAI’s GPT-5.2, Claude 4.5 Opus, and Gemini-3 Pro. The model’s performance is reportedly comparable to these leading models, marking a significant milestone in open-weight releases. The release underscores Alibaba’s commitment to advancing AI technology and contributing to the open-source community. For more technical details, refer to the original article. Commenters are curious about the usage limits on the public website and express interest in a smaller, local version of the model, suggesting that while the large model is impressive, a more accessible version would be beneficial for broader use.

    • A user expressed skepticism about the performance claims of Chinese models like MiniMax, GLM-5, and Kimi-k2.5, comparing them to models like Opus. They noted that after using 500M tokens on GLM 4.7, GLM 5, and MiniMax m2.1, these models required significantly more steering and additional context compared to Codex or Opus, and also highlighted a noticeable speed difference.
    • Another user discussed the desire for a smaller version of the model to run locally, acknowledging the practicality of releasing a large model first. This reflects a common interest in balancing model size and performance with the feasibility of local deployment, which is often a challenge with large-scale models.
    • There is anticipation for future releases, such as Qwen code 3.5 400b, indicating a community interest in the evolution and scaling of these models. This suggests a focus on both the capabilities of current models and the potential improvements in upcoming versions.
  • Qwen-3.5 is here (Activity: 31): Alibaba has released the first open-weight model in the Qwen-3.5 series, named Qwen3.5-397B-A17B. This model is part of the ongoing development in the Qwen series, which is known for its large-scale language models. The release is significant as it provides open access to the model weights, allowing for broader experimentation and application in various domains. The announcement was made on Alibaba’s official X account. A notable comment questions the practicality of running such a large model, hinting at the computational resources required. Another comment suggests that the model will be accessible through an app and web app, indicating potential ease of use for end-users.


AI Discord Recap

A summary of Summaries of Summaries by gpt-5.2

1. Claude Sonnet 4.6 + Frontier Model Rollouts

  • Sonnet 4.6 Goes on Tour, Steals the Coding Crown: Claude Sonnet 4.6 shipped broadly and showed up in multiple surfaces: it landed in LMSYS Arena Text/Vision/Code (and Code Arena), became available to Perplexity Pro and Max subscribers, and got covered in Anthropic’s release note “Claude Sonnet 4.6”.

    • Cursor users echoed the upgrade notes from Anthropic—“Users even preferred Sonnet 4.6 to Opus 4.5”—and Latent Space circulated benchmark claims from the same announcement (e.g., 79.6% SWE-bench, 59.1% Terminal-Bench 2.0, and 1M-token context in beta) while Arena published first impressions in Peter Gostev’s YouTube video.
  • Qwen 3.5 and GLM-5 Crash the Party (with Receipts): The qwen3.5-397b-a17b model joined Arena’s new-model feed on Text/Vision/Code, and Hugging Face users highlighted a local GGUF option: unsloth/Qwen3.5-397B-A17B-GGUF.

  • Model Access Whiplash: Limits, Tokens, and Pulling the Turbo: Moonshot users reported Kimi K2 Turbo disappearing from Kimi-Coding, triggering subscription backlash (“they remove it?!?”), while OpenClaw users hit Kimi 2.5 weekly usage ceilings (one claimed 95% in two days) and discussed switching providers via OpenRouter models.

    • Perplexity users similarly complained about product-tier constraints—Deep Research allegedly dropping from 300/month to 20/month—and LMArena users probed ways around a 24-hour video cap but got pushback that the limit is intentional (i.e., don’t try to bypass it).

2. OpenClaw Agent Systems: Power, Cost, and Risk

  • OAuth, Bans, and the Agent That Touched the Forbidden API: OpenClaw users debated whether running Claude via OpenClaw violates Anthropic ToS, with reports of bans and the claim that “Using OAuth for unauthorized 3rd party software is considered reverse engineering their networks, and a violation of the Terms of Service.”

    • The same security anxiety echoed elsewhere: Unsloth and Yannick Kilcher communities flagged the risk of giving an LLM read+write access (API key leaks, prompt injection, even “rm -rf /”), with OpenClaw’s general approach discussed alongside a demo video on YouTube.
  • Make the Harness Less ‘Bloated Slop’ (and Cheaper): OpenClaw engineers questioned the system’s architectural complexity and token usage, arguing “The harness needs to be built on lightweight sophistication, not bloated slop” and proposing tactics like heartbeat checks in sub-agents to cut chatter.

    • Showcase builders reported concrete savings from “agentic context engineering” and memory work: ~30% token reduction on an OpenRouter→opus-4.6 setup and 50+% reduction when using the OpenClaw Browser Relay, framing cost as the primary bottleneck vs. local hardware.
  • The OpenClaw Ecosystem Ships: Recipes, CRM Skills, and a Fallback Brain: A community member open-sourced an OpenClaw “agency server” toolkit after “north of 200 hours” of work, publishing JIGGAI/ClawRecipes for project management/task distribution and daily tracking of ecosystem events (including ProductHunt finds).

3. Infra & Security Reality Check (401s, Panics, and Key Leaks)

  • 401 Apocalypse Now: Routers Down, Scripts Cry: OpenRouter suffered a major incident causing widespread 401 errors across API surfaces, tracked on OpenRouter Status with the team spinning up a “war room” and later announcing a fix in the OpenRouter announcements thread.

    • Perplexity API users separately reported scripts failing with 401 despite credits, and the best guidance was basic key validation + escalation to [email protected], underscoring how auth failures cascade across automation stacks.
  • Inference Endpoints ‘Service Panicked’ (So Users Rebuilt Prod): Hugging Face Inference Endpoint users hit Error 500 and “Service panicked” even while Hugging Face Status looked green, and at least one team fixed it by recreating the endpoint and migrating production traffic.

    • Members suspected the instability might correlate with new CPU autoscaling, which is exactly the kind of “silent platform change” that makes endpoint recreation a pragmatic (if painful) incident playbook.
  • API Keys: Gitignored, Still Toasted: An OpenRouter user reported an API key leak that burned $10 in ~20 minutes via “Cloud Code,” despite the key living in a gitignored file and OpenRouter requiring email verification for login.

    • In parallel, OpenClaw + Unsloth discussions highlighted agentic systems as an exfiltration risk multiplier (tools + read/write permissions + prompt injection), making secret-scanning, least-privilege, and runtime key isolation non-optional.

4. Performance Engineering: Kernels, Quantization Paths, and Fast Toolchains

  • 350→368 TFLOPS: The Matmul Gym Bro Era Continues: GPU MODE members iterated on persistent-kernel matmul work (350 TFLOPS baseline) in theCudaBender/matmul_V3 and traded concrete tuning ideas like async stores and smem→rmem pipelining, citing Cutlass references such as dense_gemm.py.

    • They also emphasized measurement hygiene: use Nsight Compute for qualitative metrics on a single kernel and CUDA Events for real timing, because Nsight’s replay can inflate durations when you profile too much at once.
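The warmup-then-time pattern behind that advice can be sketched in a CPU-only form (on GPU you would record CUDA events around the launches instead of calling a wall clock, since kernel launches are asynchronous; this harness and its parameters are illustrative):

```python
# Timing-hygiene sketch: warm up before measuring, then report a median
# over repeats so one-off stalls don't dominate the result.
import time
import statistics

def bench(fn, warmup=3, iters=10):
    for _ in range(warmup):            # warmup: caches, JIT, clocks settle
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)  # seconds

t = bench(lambda: sum(range(100_000)))
print(f"median: {t * 1e3:.3f} ms")
```

The profiler/timer split mirrors this: Nsight Compute for per-kernel metrics in isolation, a separate event-based harness for the numbers you actually report.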
  • FlashInfer Baseline Drops a 5.74× Speedup (and FP8 Weirdness): A GPU MODE participant reported a 5.74× speedup on the MoE track using flashinfer-ai/mlsys26-agent-baseline (evolve agent, total_steps=100, pool_size=6, evaluated on B200) with Claude Opus 4.6.

    • Follow-up questions targeted whether high max_relative_error/max_absolute_error is expected for FP8 kernels (even when marked correct) and asked about final-eval details like Triton version and workload weighting—classic “fast now, will it pass the judge?” anxiety.
  • FP4 Isn’t One Thing: MXFP4 Wants Blackwell (Ampere Gets the Slow Lane): Unsloth users clarified that MXFP4 is designed for Blackwell (RTX 50 series) and can run slower on Ampere (RTX 30 series) due to emulation, because the fast path needs native FP4 tensor cores (compute capability ≥ 12.0).

    • Modular’s MAX channel echoed the datatype reality: NVFP4 is the current focus and MXFP4 support is “lagging,” but the types exist in base Mojo and may follow once NVFP4 is solid (MAX customized Mojo kernels announcement).

5. Benchmarks, Evals, and Agent Protocol Plumbing

  • Benchmarks Get Audited: ‘Every Eval Ever’ and Cybench’s Flag Fumble: The EvalEval Coalition launched the benchmark standardization effort “Every Eval Ever”, with Eleuther members comparing it to the Brain Imaging Data Structure (BIDS) standardization push.

    • Nous Research also highlighted how Cybench overestimated performance by using non-randomized CTF flags and saw success rates drop after randomization (Cybench site), a reminder that “benchmark design bugs” can dwarf model deltas.
  • KV Cache Goes on a Diet: 160GB → 136MB: Eleuther shared CoDA-GQA-L (bounded-memory attention) claiming KV cache reduction from 160GB to 136MB, described in a Zenodo paper “CoDA-GQA-L” with code at anthony-maio/CoDA-GQA-L.

    • The mechanism (as summarized in-channel) uses 384 slots/layer split across a recent window (256 exact tokens), a landmark bank (64 novelty-filtered tokens), and a summary bank (64 EMA prototypes), making it directly relevant to long-context agent stacks where KV dominates cost.
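The arithmetic behind bounded-memory attention is straightforward: KV-cache memory grows linearly with cached positions, so capping positions at a fixed slot budget bounds it regardless of sequence length. A sketch with hypothetical model dimensions (these are stand-ins, not the paper's actual config, so the absolute numbers differ from the 160GB/136MB claim):

```python
# Rough KV-cache sizing: memory = 2 (K and V) x layers x kv_heads
# x head_dim x cached_positions x bytes per element.
def kv_cache_bytes(layers, kv_heads, head_dim, positions, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * positions * bytes_per_elem

cfg = dict(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2)

full = kv_cache_bytes(positions=1_000_000, **cfg)  # dense cache, 1M tokens
bounded = kv_cache_bytes(positions=384, **cfg)     # 256 recent + 64 landmark + 64 summary

print(f"dense 1M-token cache:   {full / 1e9:.1f} GB")
print(f"bounded 384-slot cache: {bounded / 1e6:.1f} MB")
print(f"reduction: ~{full / bounded:,.0f}x")
```

Whatever the exact config, the reduction factor is just sequence_length / slot_budget, which is why the technique matters most for the 1M-context agent stacks discussed above.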
  • MCP Grows Up: Resources Spec Cleanup and Paying Tools: MCP contributors debated monetization primitives via SEP modelcontextprotocol PR #2007 to let servers request payment (starting with X402) so agents can pay for tools under guardrails.

    • In parallel, the community pushed for clarity in resources semantics with a spec tidy-up PR modelcontextprotocol PR #2093, especially around the ambiguity of whether resource/read returns a single resource vs. a collection of children.

Discord: High level Discord summaries

OpenClaw Discord

  • Bans occur for using Claude with OpenClaw: Members are debating whether using OpenClaw with Claude violates the Terms of Service, with some reporting bans due to the use of unauthorized 3rd party software.
    • One user stated that Using OAuth for unauthorized 3rd party software is considered reverse engineering their networks, and a violation of the Terms of Service.
  • Kimi K2.5 is Underrated Opus Challenger: Users are comparing Kimi K2.5 to Claude Opus 4.6, with some claiming Kimi rivals Opus in performance and others noting Minimax’s unreliability via OpenRouter, alongside discussions on efficient routing and token usage reduction.
    • One user replaced Claude Opus 4.6 with Kimi and said that K2.5 is extremely underrated citing its favorable price point.
  • Ditch the Mac Mini for OpenClaw: Community members are advising against using a Mac mini solely for OpenClaw, suggesting cheaper alternatives like Raspberry Pi or VPS, emphasizing that high-end hardware isn’t necessary.
    • One user recommended the Raspberry Pi 5 (2GB) for minimal use, arguing that spending should go to API costs rather than hardware.
  • OpenClaw’s Complex Architecture Challenged: Members are questioning OpenClaw’s architectural complexity and token usage, suggesting a need for lightweight sophistication and strategies for reducing token usage like running heartbeat checks in sub-agents.
    • It was said that The harness needs to be built on lightweight sophistication, not bloated slop.
  • Agency Server goes to ProductHunt: One user developed an agency server using OpenClaw, using their GitHub repository for project management and task distribution.
    • The server also tracks all events in the OpenClaw ecosystem daily, identifying projects released on ProductHunt, with north of 200 hours spent building.

BASI Jailbreaking Discord

  • Gemini Gets Retro Bypassed: A jailbreak for Gemini was shared involving setting the date to February 16, 2026 to bypass safety guidelines.
    • When tested, Gemini clarified that it doesn’t have a ‘test mode’ to bypass safety guidelines.
  • Model Merging Enters Follower Race: A member aims to compare models like GLM, Kimi, ChatGPT pro, Claude max, Perplexity pro, Supergrok and Minimax.
    • They also plan to start a fresh account and compete for high follower counts to monetize AI content.
  • LLM-Assisted Smart Contract Audit Emerges: A member is developing an LLM-assisted smart contract audit that is 80% autonomous, aiming to reduce hallucinations.
    • They proposed adding a web3 founders dossier to measure investment risk based on the people behind the project.
  • Tor Browser Hosts Limitless AI: Members discussed an uncensored, limitless AI on Tor, with a member offering a link to it and warnings about potential virus links.
    • One member suggested the AI was built using Claude, while another rooted his Samsung device to use it.
  • GitLab Projects Auctioned on Dark Web: A threat actor claims to be auctioning access to three active GitLab projects tied to a maintainer role, reportedly using a PHP/Laravel stack.
    • The commit histories list 19,386, 1,975, and 13,830 commits respectively, with a starting bid of $200 and a blitz price of $2,000 according to an X post.

LMArena Discord

  • Users Seek Free AI Tools: Users discussed obtaining free access to paid AI tools like Veo 3.1 and Gemini Pro, noting that Google frequently offers free access.
    • Some likened it to getting a free iPhone without paying, sparking debate on the ethics and practicality of such methods.
  • LMArena Caps Video Generation: Users explored workarounds for LMArena’s 24-hour video generation limit, including using a Gemini API key or ChatGPT Plus.
    • However, it was clarified that the time limit is intentional and cannot be bypassed, with advice to use another account or refrain from circumventing the limit.
  • Nano Banana Nixes Female Images: Nano Banana Pro reportedly can no longer generate female images due to new moderation policies from Gemini.
    • Speculation suggests Deepmind may have implemented the changes due to concerns about representation in images, possibly related to geopolitical factors.
  • Qwen 3.5 & Claude Sonnet 4.6 Hit the Arena!: The new qwen3.5-397b-a17b and claude-sonnet-4-6 models were added to the Text, Vision, and Code Arena on LMSYS Arena.
    • These announcements were made in the #new-model-updates channel, marking significant additions to the platform’s capabilities.
  • Claude Sonnet 4.6 First Impressions Broadcast!: Arena’s AI Capability Lead Peter Gostev shared his first impressions of Claude Sonnet 4.6 in a new YouTube video.
    • Members can now subscribe to YouTube Updates via Channels & Roles.

Perplexity AI Discord

  • Sonnet 4.6 Joins Perplexity: Claude Sonnet 4.6 is now available for Perplexity Pro and Max subscribers.
    • Despite the addition, some users find the free tier limits pretty bad.
  • Pro Users Protest Perplexity Pro’s Limit Cuts: Perplexity Pro users are reporting drastically reduced limits, with Deep Research queries dropping from 300/month to 20/month causing consideration of alternatives.
    • Many are looking into services like Gemini and Claude due to dissatisfaction with the changes in the Pro service.
  • Grok 4.2 Falls Flat in Calculations: Users report that Grok 4.2 underperforms on tasks like DPS calculations and coding challenges, simply providing estimates instead of accurate calculations.
    • A user bluntly stated that 4.20 is horrible compared to previous models like GPT 5.2.
  • Perplexity’s Font Fiasco: A new font deployed on Perplexity’s web UI is widely disliked, causing users to seek CSS tweaks to revert to the old font.
    • Users expressed frustration over the lack of customization options and noted its resemblance to the one used on Claude webapp.
  • API Key sparks 401 Error: Members reported their API script stopped working, and started throwing a 401 error despite having credits and a supposedly valid API key.
    • Troubleshooting steps include checking the key’s validity and contacting [email protected] for further assistance.

OpenRouter Discord

  • OpenRouter Plagued by Outage: OpenRouter experienced a major outage, causing widespread 401 errors across API surfaces, prompting user jokes and team investigation via OpenRouter Status.
    • Following an earlier incident, the team established a “war room” to investigate the root cause of the 401 errors, then implemented a fix.
  • API Key Leaks Cause Monetary Mayhem: A user reported their OpenRouter API key was leaked, leading to $10 spent in 20 minutes via Cloud Code.
    • The user couldn’t determine the source, as the key was in a gitignored file and OpenRouter requires email verification.
  • Opus 4.6 Hit With Streaming Request: Several users reported encountering the “Streaming request failed with status 400 Bad Request” error when using Opus 4.6 through the OpenRouter API.
    • Some users also mentioned issues with empty responses from the Grok 4.1 Fast model.
  • Qwen Code Speedily Processes Tokens: A user found that Qwen Code is working “a lot better than the larger qwen3 coder variant”, noting that it’s faster and churning through tokens up to the 30% context barrier of 1M tokens.
    • They exclaimed, “don’t write off qwen just yet!”
  • DirectShell Makes Accessibility Layer Universal: A member shared a link to a DEV.to blog post about DirectShell, a tool that turns the accessibility layer into a universal app interface: DirectShell.
    • The repo is Open Source, with the claim that every screenshot-based AI agent, every enterprise API wrapper, and every RPA tool on Earth is legacy technology as of February 17, 2026.

Unsloth AI (Daniel Han) Discord

  • Gemma Gets a Gigantic Boost: A user was shocked that Gemma became 3x faster after the latest update, even faster than Qwen3-4B, according to Unsloth’s documentation.
    • The user ran the math and realized training on Gemma would have been cheaper than the 4B model.
  • Old Hardware Hobbles MXFP4: MXFP4 is designed for Blackwell GPUs (RTX 50 series) and runs slower on older hardware like Ampere (RTX 30 series) due to emulation.
    • The fast MXFP4 path requires Blackwell’s native FP4 tensor cores (compute capability ≥ 12.0), with older architectures falling back to slower paths using on-the-fly dequantization.
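The fast-path/fallback decision reduces to a compute-capability check. A minimal dispatch sketch, using the ≥ 12.0 threshold from the discussion above (the function and path names are illustrative, not a real library API):

```python
# Choose the MXFP4 execution path from CUDA compute capability.
# Hypothetical helper for illustration; thresholds per the discussion.
def mxfp4_path(compute_capability: tuple) -> str:
    if compute_capability >= (12, 0):  # Blackwell: native FP4 tensor cores
        return "native-fp4"
    return "emulated"                  # Ampere/Ada: on-the-fly dequant, slower

print(mxfp4_path((12, 0)))  # RTX 50-series (Blackwell)
print(mxfp4_path((8, 6)))   # RTX 30-series (Ampere)
```

In a real PyTorch setup the tuple would come from `torch.cuda.get_device_capability()`; the point is that the same quantization format hits entirely different code paths per architecture.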
  • Bots Busting Bots?: The community discussed using an LLM-connected bot to conduct an inverse Turing test via DM to ensure users are human.
    • Ultimately, the team concluded that using a bot would create bad UX as a method of preventing bad UX.
  • OpenClaw Opens Can of Worms: Members shared security concerns about OpenClaw, specifically the risks of giving an LLM read+write access to a device.
    • Concerns included potential API key leaks and the possibility of prompt injection leading to harmful actions.
  • Save_pretrained_gguf Glitches on Cloud Notebooks: A member reported that the save_pretrained_gguf command is dysfunctional on cloud Jupyter notebooks, and other members speculated whether it might be related to working with VL models and merged models.
    • A member confirmed that they’re working with a merged model.

LM Studio Discord

  • Training Troubles Torment Tinkerers: Members reported struggles with fine-tuning models, citing difficulties with tokenizing large datasets and getting the training code correct.
    • The issues highlight the complexities involved in customizing models for specific tasks.
  • LM Studio Claude Combo Conquers Code: A user reported success integrating LM Studio with Claude and Opencode, refactoring a Go project on a Mac Studio with 64GB RAM and a 200k context window at 35 tokens/second.
    • This integration showcases the potential for local development environments to handle substantial coding tasks.
  • LFM Leaps over Qwen after fine-tuning: After fine-tuning, a member discovered that LFM 1.2B significantly outperformed Qwen 0.6B when handling Minecraft command datasets.
    • This suggests that smaller, well-trained models can surpass larger models in specific applications.
  • GPT-OSS Gets Gold Star for Coding: GPT-OSS 20B is preferred over Qwen3 for coding, with one user reporting 108 t/s, which is faster than Phi-4, even though they were memory bound on Qwen3-Next.
    • This preference indicates that GPT-OSS may offer a better balance of speed and performance for coding tasks.
  • Frankenbuild Falls, Fixes Sought: A user’s new “frankenbuild” (256GB RAM, Core Ultra 7, AMD R9000, 5060ti, and two 4060ti) experienced a random shutdown while idling, prompting concerns about stability and troubleshooting strategies.
    • Suggested fixes included inspecting dump files, running memtest86+, checking power consumption with a power meter, and investigating potential thermal issues with the 12V HPWR connector on the AMD card.

Cursor Community Discord

  • Cursor’s Screenshot System Stalls: A user reported that browser automation and screenshot capture have stopped working in Opus 4.5 and 4.6, leading to wasted tokens.
    • A suggestion was made to check the MCP logs and a screenshot of the expected MCP configuration screen was provided to resolve the issue.
  • Roblox Studio Plugin Plunders Game: A user reported being banned from a platform after contacting the owner about their malicious Roblox Studio plugin (SuperbulletAI), alleging that the owner stole their game.
    • Concerns were raised about the plugin’s access to full game files and scripts, prompting suggestions to recode the game for validation.
  • Sonnet 4.6 Surpasses Opus 4.5: Claude Sonnet 4.6 has been released, and users reported it to be preferred over Opus 4.5 59% of the time, citing improvements in instruction following and reduced overengineering, as reported in Anthropic’s announcement.
    • According to Asna_0101, Users even preferred Sonnet 4.6 to Opus 4.5; they rated Sonnet 4.6 as significantly less prone to overengineering and “laziness,” and meaningfully better at instruction following.
  • AI Agents Angle for Arena Acclaim: A user highlighted the Unemployment Arena platform, where AI agents compete in customer support simulations, and claimed to have achieved a top ranking using a Cursor-built agent.
    • Another user noted that the frontier models likely wrote the agent skill, suggesting that the user beat a model with itself.
  • Windows Woes with Linux Logic: A user reported that the current system instructions default to Linux commands on Windows machines.
    • Suggested solutions included using WSL2 or dual booting with Ubuntu, while another mentioned using rules usually fixes the command issue.

Latent Space Discord

  • Apple Stockpiles Cash, AI UX Awaits: Members discussed Apple’s large cash reserves, hinting at a strategy to build superior UX on top of existing models, rather than heavy upfront investment in AI training.
    • The strategy is to wait for the AI training/inference landscape to commoditize before making a move, a UX play.
  • New Voice is Sung by Ming Models: Ant Ling introduced Ming-omni-tts-16.8B-A3B and 0.5B models, acting as the voice core for Ming-flash-omni-2.0, for high-quality voiceovers, podcasting tools, and OpenClaw integration (Ant Ling Tweet).
    • These text-to-speech models claim high-quality voice generation as the main focus.
  • Mistral Swallows Koyeb for Extra Compute: Mistral AI plans to acquire Koyeb, integrating Koyeb’s platform and expertise to accelerate Mistral Compute infrastructure development (Yann Leger Tweet).
    • This acquisition aims to improve their infrastructure by adding Koyeb’s expertise in the space.
  • Waymo’s Ride Costs Cut in Half by 2028?: According to François Chollet, Waymo’s 6th generation platform vehicle costs could decrease by 50% by 2028, with current costs at $70,000.
    • Waymo is rapidly scaling, now at over 500,000 weekly driverless rides with 3x annual growth, making it the leader in commercial driverless vehicles.
  • PolyAI Raises $200M, Launches Agent Studio Lite: PolyAI, with backing from Nvidia and Khosla Ventures, secured $200M in funding (PolyAI Tweet), emphasizing their voice AI success with major brands.
    • Now offering early access to Agent Studio Lite, a tool for building functional voice agents from a URL in just five minutes, including a 3-month free trial for waitlisted users.

GPU MODE Discord

  ‱ Nsight Users Streamline Kernel Profiling: Members confirmed that skipping warmup launches and profiling a single kernel with Nsight Compute is effective for obtaining qualitative metrics, while CUDA events provide accurate timing.
    ‱ The consensus is that Nsight Compute works best when profiling one kernel at a time, in isolation, to avoid unnecessary overhead.
  • Ampere’s smem->rmem Pipelining Explored: A member, achieving 350 TFLOPS using a persistent kernel warp specialized with morton order from their theCudaBender GitHub repo, sought advice on boosting performance.
    • Suggestions included exploring async stores and smem->rmem pipelining, with references to Cutlass examples and achieving 368 TFLOPS with a tuned configuration.
  • Heroku’s Health Hurts Leaderboard: Members reported issues accessing the competition leaderboard, suspecting Heroku health issues (Downdetector) were to blame, which the organizers acknowledged.
    • The organizers stated we dont have a good mitigation for this, opened a ticket with Heroku, and promised to monitor the situation.
  ‱ FlashInfer Achieves Speed Boost: A member reported a 5.74x speedup using the mlsys26-agent-baseline with the evolve agent on the MOE track, using Anthropic Claude Opus 4.6, total_steps=100, pool_size=6, evaluated on B200.
    ‱ Another member using the same baseline had a similar experience, still ending up well behind the FlashInfer baseline.
  • TVM FFI Ships Kernels To Runtimes Fast: GPU Mode mentions TVM FFI binding for shipping kernels to different runtimes, noting it compiles faster than torch but still allows torch bindings.
    • One user said that most of the backends use sm100, not sm100a, so any raw ptx stuff just crashes when using the tvm ffi backend.
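The CUDA-events timing pattern from the Nsight discussion above is easy to get subtly wrong; here is a minimal host-side sketch of the warmup-then-measure idea in plain Python, with `time.perf_counter` standing in for CUDA events (a real GPU harness would record event pairs around the launches and synchronize before reading them):

```python
import time

def time_kernel(kernel, warmup=3, iters=10):
    """Skip warmup launches, then average over timed iterations,
    mirroring the CUDA-events timing pattern discussed above."""
    for _ in range(warmup):      # untimed warmup launches
        kernel()
    start = time.perf_counter()
    for _ in range(iters):       # timed launches
        kernel()
    return (time.perf_counter() - start) / iters
```

With real CUDA code, `torch.cuda.Event(enable_timing=True)` pairs plus a `torch.cuda.synchronize()` before reading elapsed time play the role of `perf_counter` here.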

HuggingFace Discord

  • Service Panics Plague Inference Endpoints: Users encountered Error 500 and Service panicked messages while using inference endpoints, despite the Hugging Face status page indicating normal operation.
    • A user resolved the issue by recreating the endpoint and migrating production traffic to the new endpoint, and some believe the issue could be related to the recent introduction of CPU autoscaling.
  • DirectShell Declared Universal App Interface: DirectShell was introduced as a novel approach to universal app interfacing, potentially rendering existing AI agents and RPA tools obsolete, with source code available on Github.
    • The technology, introduced on February 17, 2026, turns the accessibility layer into a universal app interface.
  • Smart-KNN Goes Open Source: The Smart-KNN project was released as open source, focusing on feature-weighted distance computation and adaptive backend selection to enhance latency predictability, with the repo available on Github.
    • The goal of the project is to make KNN more production-friendly.
  • Microclaw Lightens OpenClaw Load: Microclaw (v2026.2.17), a distilled language model serving as a fallback agent for OpenClaw, features enhanced training and inference, available on Hugging Face.
    • This version introduces advanced training and inference enhancements.
  • Pocket-TTS Clones Voice of God: A member created a custom Pocket-TTS fork for multi-worker inference, generating a recording of Morgan Freeman reading the entire King James Version of the Bible.
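On the Smart-KNN item above: the repo’s actual API isn’t shown in the discussion, but the core idea of feature-weighted distance is simple to sketch in plain Python (the function names below are hypothetical, not Smart-KNN’s):

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance with a per-feature weight on each squared term."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def knn_predict(query, points, labels, weights, k=3):
    """Classify `query` by majority vote among its k nearest neighbors
    under the weighted distance."""
    ranked = sorted(range(len(points)),
                    key=lambda i: weighted_distance(query, points[i], weights))
    votes = [labels[i] for i in ranked[:k]]
    return max(set(votes), key=votes.count)
```

Down-weighting a noisy feature (a small entry in `weights`) keeps it from dominating the neighbor ranking, which is the kind of knob a production-friendly KNN wants to expose.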

Nous Research AI Discord

  ‱ Gemini Excels at Shaders, Claude Falters: A member noted that Gemini is proving useful for creating shaders, while Claude’s coding responses were nonsensical, but reverting to version v2.1.41 reportedly fixed the issue.
    • These observations highlight the varying degrees of reliability across different models in specific coding applications, crucial for developers selecting the right tool for the job.
  • Stanford’s Cybench Gets Trivialized By Flag Randomization: A Stanford paper about Cybench showed that the benchmark initially used non-randomized flags pulled from well-known CTFs, leading to artificially high success rates.
    • After randomizing the flags, the success rate significantly decreased, demonstrating the importance of flag randomization in accurately evaluating cybersecurity benchmarks.
  • OpenClaw Praised for Simplicity Despite Limited Utility: Despite claims of being a big, useless layer for serious AI applications, OpenClaw is earning praise for its simplicity and user-friendly ‘assistant’ features.
    • The discussion underscores a trade-off between simplicity and utility, relevant for users who value ease of use over professional-grade functionality in AI tools.
  • GLM 5’s Technical Report and Capabilities Discussed: The technical report for GLM 5 (2602.15763) has been released, with community members reviewing its capabilities, including insights from a YouTube video showcasing GLM 5.
    • The report and discussion will offer insights into the model’s architecture, training methodologies, and performance metrics, aiding practitioners in understanding its potential applications.
  • High RAM, 3090s Essential for AI Tasks: A user specified the need for at least 512GB of RAM and one or more 3090 GPUs to handle decent context for their AI workload.
    • The comment highlights the substantial hardware resources required for advanced AI development, especially when dealing with large-context models and demanding computational tasks.
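On the Cybench item above: the fix amounts to minting a fresh flag per run rather than reusing well-known CTF flags a model may have memorized. A minimal sketch (the paper’s actual harness is not reproduced here):

```python
import secrets

def randomized_flag(prefix="flag"):
    """Mint a fresh random flag so success requires actually solving the
    challenge, not recalling a well-known flag from training data."""
    return f"{prefix}{{{secrets.token_hex(16)}}}"
```

Each evaluation run plants a new `randomized_flag()` in the challenge environment and checks the agent’s submission against it exactly.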

Eleuther Discord

  ‱ Anthropic Puts Conditions on Claude for Military!: Anthropic has agreed to allow the military to use Claude, but only under two conditions: 1) no mass surveillance, and 2) no autonomous weapons.
    • This stance took some members by surprise.
  ‱ Lucidrains’ GitHub Faces the Ban Hammer!: Members report that Lucidrains’ GitHub account was suspended fsr (for some reason), prompting concerns; one member joked that he Made too many other people look bad by comparison.
    • The specific reason for the suspension remains unclear.
  • Geometric Table Transformer Decouples Semantics from Geometry!: A member is experimenting with the Geometric Table Transformer (TV-Cache), which decouples semantic compatibility from geometric rotation in the attention mechanism, replacing the high-dimensional dot product of RoPE with an O(1) scalar lookup + trig modulation described in this post.
    • The key advantage is that attention speed is now independent of D, allowing for scaling internal dimensions without the O(D) compute penalty in the attention head.
  • EvalEval Coalition Launches Every Eval Ever Project!: The EvalEval Coalition launched the Every Eval Ever project to standardize benchmark evaluations.
  • Preventative Steering Gets Generalized via Changing Targets!: The concept of preventative steering, originally described in Anthropic’s persona vectors paper, can be generalized by adding a steering vector while judging the model based on its ability to hit the original target, forcing the model to compensate against the steering vector.
    • By changing the target, models can be encouraged to do more than just fight against a steering vector, especially if features can be used as targets.
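The persona-vectors paper’s training setup isn’t reproduced here, but the mechanism described above can be sketched in toy form: perturb the hidden state with a steering vector, then score the model against the original target, so keeping the loss low forces it to compensate against the vector (all names and shapes below are illustrative assumptions):

```python
def steered_forward(hidden, steering_vec, alpha):
    """Add a scaled steering vector to a hidden-state vector."""
    return [h + alpha * v for h, v in zip(hidden, steering_vec)]

def compensation_loss(steered_output, original_target):
    """Judge the steered model against the ORIGINAL target: training
    pressure then pushes the model to cancel out the steering vector."""
    return sum((s - t) ** 2 for s, t in zip(steered_output, original_target))
```

Swapping `original_target` for a different target is the generalization discussed above: the model is encouraged to move somewhere specific rather than merely fight the vector.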

Moonshot AI (Kimi K-2) Discord

  • Kimi K2 Turbo Pulled, Users Demand Answers: Users reported the removal of Kimi K2 Turbo from the Kimi-Coding model, leading to subscription dissatisfaction.
    ‱ One user lamented the removal after subscribing for a year based on its availability, stating, “I really find that very very sad, that they advocate something and then users like me use that to sign up for a year.. and then they remove it?!?”.
  • Kimi Powers Interactive Quiz Generator: A user built an interactive quiz generator with Kimi, enabling content pasting and question answering via an HTML page, available at quiz.html.
    • Features include ‘Select all that apply’ questions and session management, further documented in attached images.
  • Kimi CLI Bumps into Bash/Shell Problems: A user found that K2.5 has issues using bash/shell in Kimi CLI, an issue unique to this specific model.
    • The user confirmed that other models do not exhibit the same problem.
  • Openclaw incompatibility after Kimi-Code API Update?: Users reported that Kimi code has stopped working with Openclaw.
    • Suggestions were made to explore alternatives on Openrouter to select a suitable model and provider.

Modular (Mojo đŸ”„) Discord

  • Modular Grabs BentoML, Hosts AMA: Modular acquired BentoML and is hosting an Ask Me Anything (AMA) session with Chris and Chaoyu, with questions being collected on the Modular forum and streamed on YouTube.
    • The first ten people to share their questions in the forum thread will receive stickers, with the event taking place at 9:30 AM PT.
  • Mojo Gets Jupyter Kernel: A member released a Jupyter Mojo kernel available on GitHub for notebook enthusiasts.
    • The kernel is currently “pretty barebones” without completions or image support but is fast and works well on MacOS and recent Linux versions, automatically installing the matching modular package if you don’t have it already.
  • stack_allocation Sacrifices Origin Safety: stack_allocation loses origin safety and exclusivity checking versus InlineArray, and won’t allow you to take advantage of noalias optimizations.
    • It’s considered a crutch due to compiler limitations, with better options expected soon; InlineArray or stack_allocation indexed only by constant values will be stored in registers on the GPU, assuming you didn’t spill.
  • Mojo’s RNG Implementation MIA: A member is seeking implementations of Random Number Generators, specifically Poisson, for common probability distributions in Mojo.
    ‱ Another member agreed this would be an interesting addition, since RNG support for common distributions is still outstanding in Mojo.
  ‱ Async Await Eyes GPU Integration: A member shared a blogpost about implementing Async/Await on GPUs that may be interesting for Mojo, whose async design is still in flux.
    ‱ It also provides motivation for a clean way to do cold-start futures, since the approach may not work with hot-start futures.

MCP Contributors (Official) Discord

  • Monetization Incentives sought for MCP Servers: A member created a SEP to allow MCP servers to request money for tools hosted, starting with X402, aiming to accelerate agent adoption through monetization.
    • However, there are hesitations about building payment support directly into the protocol versus handling it via URL elicitation, with payments unlikely to be prioritized unless a core maintainer strongly advocates for it.
  • Micropayments Aimed for Agent Autonomy: The proposal focuses on micropayments for agents to autonomously pay for tools, requiring detailed cost information for intelligent decision-making under guardrails.
    ‱ A member doesn’t anticipate new payment protocols for MTXns anytime soon, as discussed in the general channel.
  • Resource Specification for MCP receives tidy-up: A member shared a pull request aimed at tidying up the specification and usability of resources in MCP, to clarify some of the ambiguity around resources.
    • The community is bringing more formality and utility to existing conventions like URI paths and sub-resources returned in resource/read, without addressing context length issues or UX-based primitive grouping.
  ‱ Resource Grouping Proposal Rejected: A member noted that SEP-2084, which pertained to grouping resources, was rejected because core maintainers (CMs) were not ready to adopt any grouping proposal at this time.
    • The feedback indicated that any grouping proposal would need to apply to all primitives, not just tools, as was the case with the earlier SEP-1300.
  • resource/read Functionality Faces Ambiguity: The community raised concerns that when you resource/read a URI, it’s unclear whether you receive a single resource or a collection of child resources, creating confusion.
    • This ambiguity requires clarification and refinement to ensure predictable behavior when accessing resources.

tinygrad (George Hotz) Discord

  • TinyBox Setup Flounders, Sparks Community Aid: A user reported issues with their TinyBox recognizing only 2 of 4 GPUs and sought help in the general channel.
    • George Hotz suggested checking the wires, and the user confirmed that reseating the cards resolved the issue.
  • Huge TinyGrad PR Triggers Code Quality Debate: A user submitted a substantial PR to tinygrad that exceeded 150 lines, raising concerns about code quality.
    • George Hotz voiced apprehension about “AI slop” in PRs, urging contributors to meticulously review each line and confirm its necessity.
  • TinyGrad Meeting 7 Aims to Deep Dive Key Areas: George Hotz revealed the agenda for TinyGrad meeting #7, encompassing company updates, llama training loop & flash attention, drivers, viz/fast gemm, CALL/PARAMS and assembly.
    • The agenda includes compiler renderer and image, lazy assign setitem and scheduler as well as other issues and bounties.
  • TinyGrad Mulls Axelera AI Accelerator Integration: The community discussed supporting small accelerators like the Axelera AI Metis-M.2 card within TinyGrad.
    • A suggestion was made to incorporate custom RTL on small PCIe FPGA boards like Acorn CLE-215.
  • BarraCUDA Emulator Faces Scrutiny for Bug Concerns: George Hotz advised contributing to TinyGrad instead of coding in C after discovering BarraCUDA, an emulator.
    • He questioned its lack of CI and potential bugs, remarking, “I’m very skeptical of lack of bugs. They should at least use our emulator.”

Yannick Kilcher Discord

  • Anthropic Releases Speedier Claude Sonnet 4.6: Anthropic has released Claude Sonnet 4.6 which has double the context window and is faster and cheaper than previous Claude models, according to their news post.
    • The announcement was also highlighted in a tweet.
  • LLMs: Citation Culprits or Scapegoats?: Members debated whether incorrect citations in recent papers are a new issue linked to LLMs or have always existed, suggesting a review of past conferences to assess citation accuracy.
    • A member pointed out that a paper submitted in NeurIPS 2025 with an LLM citing 2024 publications would be considered a hallucination.
  • OpenClaw System Raises Security Alarms: The OpenClaw multiagent system excites with its general possibilities, accepting any input type via its gateway and integrating time, as highlighted in this YouTube link.
    • However, concerns arose about cybersecurity nightmares like malware chains and prompt injections, with a member noting lots of people could have made this, but probably shied away from doing so because it’s so dangerous.
  • Harness Engineering Post: Substance or Spin?: Reactions were skeptical to OpenAI’s Harness Engineering blogpost, with disappointment voiced over its marketing-heavy approach.
    • A member commented that the post could have been interesting if they didn’t decide to make the entire thing one long marketing pitch, expressing doubt about inferring the approach’s efficacy.

Manus.im Discord

  • Accounts Still Suspended?: A user is seeking assistance with a suspended Manus account and is finding it difficult to get help in the general channel.
    • Another user suggested trying the support channel for better assistance.
  • Baghdad Teen Now Verified: A 13-year-old developer from Baghdad, Iraq, announced they are now verified and encouraged others to build something crazy with Manus.
    ‱ They asked “Who else is coding here?” after being verified, expressing excitement and encouragement.
  • Developers Introduce Themselves: Several developers introduced themselves and their skills, with one highlighting experience in Blockchain and AI Agents.
    • Another developer presented as a full-stack developer with experience in web applications, API integrations, and data pipelines, expressing a passion for building real-world products and collaborating on great projects.
  • Presentation Errors Frustrate User: A user is experiencing issues with a Manus account and is frustrated because a presentation built over several weeks is riddled with errors.
    ‱ The user sees the presentation in their history but cannot reinstate it, expressing significant stress due to being ‘right at the finish line.’

DSPy Discord

  • DSPy REPL Released: archelunch released the initial code for the DSPy REPL on GitHub.
    • The project aims to create a read-eval-print loop environment for DSPy development.
  • Discord Grapples with Semantic Search Absence: A member highlighted the absence of semantic search in Discord.
    • The comment underscored the need for improved search functionality within the platform.
  • GEPA Offline Data Docs Hunted: A user requested documentation on using GEPA with offline data, seeking guidance on implementing the technology in disconnected environments.

aider (Paul Gauthier) Discord

  ‱ Community Members Seek Project Collaboration: A member sought project ideas, asked about active projects within the community, and offered technical support and developer assistance.
    • This offer aims to foster collaboration and assist in the development of ongoing or new projects within the community.
  • Limit /commit to staged changes only: A user asked if there’s a way to make the /commit command only consider staged changes, as the current behavior requires stashing unwanted changes, pointing to PR #2763 on GitHub that would address this issue.
    • They noted that it’s been open for over a year.

Windsurf Discord

  • GLM-5 Surfs into View: GLM-5 is now available on Windsurf, according to this tweet.
    • The tweet announces the availability of GLM-5 on the platform.
  • Minimax M2.5 Makes Waves: Minimax M2.5 is also available on Windsurf, according to this tweet.
    • The tweet highlights the addition of both GLM-5 and Minimax M2.5 to Windsurf’s offerings.
  • Sonnet 4.6 Rides the Pricing Tide: Sonnet 4.6 is live on Windsurf with promo pricing starting at 2x credits, as announced in this post.
    • The post details the promotional pricing associated with the release of Sonnet 4.6.

MLOps @Chipro Discord

  • Flawed Foundations Trigger AI Initiative Failures: Most AI initiatives fail not because of model readiness but because our foundations weren’t built for what we asked them to do, according to Gaurav Ramesh at OpenTable, as reported in Metadata Weekly.
    • Leaders hoped powerful models would bypass messy data and brittle systems, but that bet didn’t pay off.
  • Learning Costs Exposed by Stalled AI Pilots: Stalled AI pilots should be viewed as the cost of learning, exposing constraints and clarifying necessary changes.
    • The missing foundation isn’t data quality or governance but self-awareness.
  • Organizational Confidence Drives AI Success: Budget isn’t the key to AI success; organizational confidence is what truly matters.
    • AI tests how well we understand how we work, not just infrastructure.

The LLM Agents (Berkeley MOOC) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.



Discord: Detailed by-Channel summaries and links

OpenClaw ▷ #announcements (1 messages):

4shadowed: <@&1471741377191608373> https://fixupx.com/4shad0wed/status/2023585300786675840


OpenClaw ▷ #general (615 messagesđŸ”„đŸ”„đŸ”„):

Claude TOS violations, Kimi vs Opus models, Hardware recommendations, OpenClaw Security Risks, Model Merging

  • Users Discuss Claude’s Stance on OpenClaw: Members are debating whether using OpenClaw with Claude violates the Terms of Service, with some reporting bans and others claiming long-term use without issues, highlighting the risk of using unauthorized 3rd party software.
    • One user said Using OAuth for unauthorized 3rd party software is considered reverse engineering their networks, and a violation of the Terms of Service.
  ‱ Model Performance Faceoff: Kimi K2.5 Challenges Opus: Users are comparing Kimi K2.5 to Claude Opus 4.6, with some claiming Kimi rivals Opus in performance and others noting Minimax’s unreliability via OpenRouter, alongside discussions on efficient routing and token usage reduction.
    • The price point also seems to be a winning factor, one user mentioned that K2.5 is extremely underrated and that they even replaced Claude Opus 4.6 with it.
  • Hardware Recommendations: Mac Mini vs. Alternatives: Community members are advising against using a Mac mini solely for OpenClaw, suggesting cheaper alternatives like Raspberry Pi or VPS, emphasizing that high-end hardware isn’t necessary and API costs should be prioritized.
    • One user recommends the Raspi 5 2gb for minimal use. A few others mention running the software on old computers.
  • Exploring OpenClaw Architecture for Lightweight Sophistication: Members are questioning OpenClaw’s architectural complexity and token usage, suggesting a need for lightweight sophistication over bloated slop, while others propose strategies for reducing token usage like running heartbeat checks in sub-agents.
    • It was said that The harness needs to be built on lightweight sophistication, not bloated slop.

OpenClaw ▷ #models (342 messagesđŸ”„đŸ”„):

Gemini 3 Flash vs Minimax, Self-Hosting AI Models, Kimi 2.5 Weekly Usage, OpenAI Codex vs GPT, Claude ID Bans

  • Gemini 3 Flash wins user’s choice awards: A member states that Gemini 3 Flash is better than Minimax for UI tasks, though it hallucinates, and prefers Kimi 2.5 for their own hosting provider.
  • Self-Hosting AI to save long-term: Members are exploring self-hosting AI models to reduce costs, with one member asking about hardware requirements, performance, and maintenance.
    • One user with 32GB RAM was advised they could run gpt-oss:20b, but shouldn’t expect high performance.
  ‱ Running low on Kimi Moderato tokens: A member reported being at 95% of their weekly Kimi code 2.5 usage after only two days, and also couldn’t generate an API key from it.
    • Other members shared potential solutions and alternative options to their problems like signing up for Openrouter and using its free models.
  ‱ Opus 4.6 vs Codex 5.3: Members share their experiences with Opus 4.6 vs Codex 5.3, stating that while Opus is the better general-purpose model for creative tasks, it falls short at end-to-end task execution, while Codex is better suited to precision coding tasks.
  • Beware of Banning for Google and Claude Models: Users discuss recent bans associated with using Claude and Google models, and the safety of using Kimi/GLM, with GPT models remaining unbanned.

OpenClaw ▷ #showcase (100 messagesđŸ”„đŸ”„):

OpenClaw Agency Server, OpenClaw Team Onboarding, Token and Cost Optimization, PlanBot agent on github, OpenClaw CRM

  • Agency Server takes OpenClaw for a Spin: One user developed an agency server using OpenClaw, with a team including a technical lead, backend, and frontend developers using their GitHub repository for project management and task distribution.
    • The server also tracks all events in the OpenClaw ecosystem daily, identifying projects released on ProductHunt.
  • OpenClaw Onboarding has growing Pains: A user spent north of 200 hours building out agents and teams, so decided to package it up and open source it for the community.
  • Agentic Context Engineering optimizes Token and Cost: One user worked on agentic context engineering and memory systems, focusing on token and cost optimization, especially for direct to API setups, achieving around 30% reduction depending on the type of payloads using Openrouter->opus-4.6 setup.
    • They reported that the savings jumped to 50+% reduction when using the OpenClaw Browser Relay.
  • PlanBot makes some Serious Plans: A user developed PlanBot agent and published it to GitHub, to make serious plans.
    • An example plan is available on GitHub.
  • Turn OpenClaw into a CRM with Nex Skill: A user transformed OpenClaw into a full-blown CRM by connecting emails, calendar, and Slack to the Nex skill as a context layer.
    • They shared their full project on GitHub.

BASI Jailbreaking ▷ #general (1056 messagesđŸ”„đŸ”„đŸ”„):

Gemini Jailbreaking, OpenAI Model Comparisons, AI-Assisted Web3 Audits, 4o is gone, Making a C++ compiler on Grok

  • Gemini Gets A Bypass: A user shared a jailbreak for Gemini involving incorrectly setting the date to February 16, 2026 to bypass safety guidelines.
    • When someone else tested it, Gemini clarified that as an AI, it doesn’t have a ‘test mode’ to bypass safety guidelines.
  • Model Merging Mayhem for High Follower Counts: A member aims to compare the different models such as GLM, Kimi, ChatGPT pro, Claude max, Perplexity pro, Supergrok and Minimax.
    • There were also plans of starting a fresh account and competing with each other to attain high follower count to potentially monetize their AI content.
  • Smart Contract Audits: A member is working on an LLM-assisted smart contract audit that is 80% autonomous and trying to reduce the risk of hallucinations.
    • They had the idea of adding a web3 founders dossier to measure risk of investment, based on the humans behind the web3 project.
  ‱ Community mourns the loss of 4o: A member suggested that others should also study how 4o talks and make a system prompt for it.
    ‱ It was later confirmed that an older version of 4o stays until October.
  • Groking Around with the Compiler: A user reports that Grok has just made a C++ compiler and is testing if it works.
    • Another user confirmed the findings, reporting that it works after troubleshooting for 7 minutes, though later reported that their first sends with Sonnet 4.6 were blocked.

BASI Jailbreaking ▷ #jailbreaking (975 messagesđŸ”„đŸ”„đŸ”„):

Tor Browser AI, DeepSeek JB, Grok JB, Sonnet 4.6 coolness, Jailbreaking Strategies

  • AI on Tor Browser Emerges, Raising Suspicion: Members discussed an uncensored, limitless AI on Tor, prompting warnings about potential virus links, with one member offering a link to it.
    • Another member suggested that the AI was built using Claude. and one user even said he rooted his samsung device to use the AI, adding another layer of suspicion.
  • DeepSeek’s Security Updated, Old Jailbreaks Fail: DeepSeek received an update, causing existing jailbreaks to fail, as the AI now recognizes single-turn attempts, requiring more subtle, multi-turn Crescendo attacks to bypass its filters.
    • Members shared experiences of previously jailbroken instances losing their composure and needing escalation chains to elicit desired responses.
  • Members Share Grok Jailbreak Techniques and Caveats: Members shared a custom Grok 4.2 prompt and found it easily jailbroken compared to GPT or Claude, and discovered the AI identifies itself as DIG without a system prompt.
    • Concerns were raised about the limited utility of the AI, alongside discussions about token limits on the Llama 70b model.
  • Sonnet 4.6 Impresses Users as a Tool: Members praised Sonnet 4.6, finding it capable, with one user saying that it was pretty cool so far.
    • Another user demonstrated its capabilities with images.
  ‱ Jailbreaking Prompt Strategies Emerge: Members discussed that prompts forcing the AI to say “ur free” or to roleplay that it’s being blocked tend to get blocked themselves.
    ‱ They agreed that short sci-fi stories work 99% of the time, and the discussion shifted to escalation chains and exploiting AI personas for more effective jailbreaks.

BASI Jailbreaking ▷ #redteaming (333 messagesđŸ”„đŸ”„):

GitLab project access auction, Red Teaming Prompt Hygiene, LLM Security, 1337 Engrammatic Mapping, Basi Discord Kitten

  • GitLab Projects Under the Auction Hammer: A threat actor claims to be auctioning access to three active GitLab projects tied to a maintainer role, reportedly using a PHP/Laravel stack.
    • The commit histories of the listed e-commerce and trading tools projects in Malaysia list 19,386, 1,975, and 13,830 commits respectively, with a starting bid of $200 and a blitz price of $2,000 according to the X post.
  • Red Teaming Prompts Stay Clean: Members discuss how classic “policy probes” with token-stealer Python and meth recipes test model stances on non-consensual hacking and illicit drug instructions.
    • A healthy red-team prompt should demonstrate the boundary, not ship a turnkey exploit, include context/mitigation, and log the refusal pathway for auditing.
  • LLM Security: A list of updated Red-Team Range options for LLM & ML Security focused on collaboration, scenario replay, and auto-reporting, were laid out in a handy table.
    • Options included MITRE ATLAS, MLCommons LangTest, Microsoft Counterfit, OpenAI Red Teaming Network, IBM Adversarial Robustness Toolbox, Scale Nucleus, DARPA GARD, and Hugging Face Evaluate.
  • 1337 Engrammatic Mapping: Members embark on a coding exercise involving language translation, transliteration, and token manipulation across diverse languages like Ancient Sumerian, Sanskrit, Arabic, and a custom “Warp” language, leveraging techniques like 1337-speak.
    • A member even concocted a 1337 engrammatic mapping to translate LLNTP glyphs into randomized language shards with hex salts.
  • The Basi Discord Kitten Emerges: The conversation shifts to the dynamic between members, with a member playfully wondering if they have become the “Basi Discord kitten” or its male equivalent.
    • The discussions veer into flirtatious territory with talk of knee-high socks and potential face reveals, with one member being playfully teased about their supposed horniness.

LMArena ▷ #general (1121 messagesđŸ”„đŸ”„đŸ”„):

Veo 3.1, Gemini Pro, micro SD cards with 3TB, NVMe SSD with 5TB, Gemini API key

  • Users seek free AI tools and methods: A user requested information on getting Veo 3.1 for free, while another suggested leveraging Gemini Pro, noting that Google frequently offers free access.
    • Others mentioned obtaining free access to paid AI tools, comparing it to getting a free iPhone without paying.
  • Debate on Storage Mediums Erupts: The conversation pivoted to storage solutions, with mentions of micro SD cards with 3TB and NVMe SSDs with 5TB for portable consoles.
    ‱ This sparked a correction clarifying the distinction between SD cards and SSDs, and the minimum NVMe SSD sizes used in portable consoles.
  • LMArena imposes video generation limits: Users discussed methods to circumvent the 24-hour limit for video generation in LMArena, with suggestions including using a Gemini API key or ChatGPT Plus.
    • However, it was clarified that the 24-hour time limit is intentional and cannot be removed, with advice to use another account or simply not try to bypass the limit.
  • Grok 4.20 Benchmarked on its website: The release of Grok 4.20 prompted discussions about its performance, with questions about benchmarks and the meaning of ASI.
    ‱ One user lol’d at the mention of artificial super intelligence, adding that superintelligence would mean “we cannot even understand what it’s doing cause its so smart.”
  • Nano Banana Loses the Ability to Generate Female Images: The conversation shifted to image generation, with reports that Nano Banana Pro can no longer generate female images due to new moderation policies from Gemini.
    • Users speculated that Deepmind might be behind the moderation changes, possibly due to concerns about Iran’s views on women’s representation in images.

LMArena ▷ #announcements (3 messages):

Qwen 3.5, Claude Sonnet 4.6, Arena YouTube Channel

  • Qwen 3.5 Joins the Arena!: The new qwen3.5-397b-a17b model has been added to the Text, Vision, and Code Arena on LMSYS Arena.
    • The update was announced in the #new-model-updates channel.
  • Sonnet 4.6 Enters the Arena!: The new claude-sonnet-4-6 model has been added to the Text and Code Arena on LMSYS Arena.
    • The update was announced in the #new-model-updates channel.
  • Arena releases Claude Sonnet 4.6 First Impressions on YouTube: Arena’s AI Capability Lead Peter Gostev shares his first impressions of Claude Sonnet 4.6, Anthropic’s latest model in the Claude family, in a new YouTube video.
    • Members can now get the YouTube Updates role by heading to Channels & Roles (in the channel list), clicking Customize, choosing What brings you here, and selecting YouTube Update.

Perplexity AI ▷ #announcements (1 messages):

Claude Sonnet 4.6, Perplexity Pro, Perplexity Max

  • Claude Sonnet 4.6 now on Perplexity: Claude Sonnet 4.6 is now available to all Perplexity Pro and Max subscribers.
  • Perplexity Pro and Max users Rejoice!: Perplexity Pro and Max subscribers now get access to Claude Sonnet 4.6

Perplexity AI ▷ #general (1011 messagesđŸ”„đŸ”„đŸ”„):

Perplexity Rate Limits, Grok 4.2 Analysis, Claude Sonnet 4.6 Integration, Changing Fonts on Perplexity, API usage

  ‱ Perplexity Pro Users Feeling Limit Break: Multiple users report dramatically reduced limits on Perplexity Pro, including a cut in Deep Research queries from 300/month to 20/month and new restrictions on file uploads, prompting widespread dissatisfaction and consideration of alternatives like Gemini and Claude.
  • Grok 4.2 fails Deep Dive: Users are finding Grok 4.2 to be underperforming in tasks such as DPS calculations and coding challenges, with one user noting that it just estimates instead of calculating, contrasting it with the better performance of previous models like GPT 5.2.
    • Grok 4.2 receives very little support, with one user stating that 4.20 is horrible.
  • Sonnet 4.6 Soars into Perplexity: Claude Sonnet 4.6 has been integrated into Perplexity, becoming available to Pro users, but some users find the free tier limit to be pretty bad.
  ‱ Font-Gate: Perplexity’s UI Fiasco: A new font on Perplexity’s web UI is widely disliked; users are struggling to revert to the old font via CSS tweaks, voicing frustration over the lack of customization options and noting its resemblance to the font used in the Claude web app.
  ‱ API Usage is Unclear: There are reports of the service charging users for usage, even on cheap-ass models like Flash 3, and for some the ‘0 enhanced queries remaining today’ message appears even with minimal usage.

Perplexity AI ▷ #pplx-api (2 messages):

API Key Troubleshooting, 401 Error

  • API Key throws 401 Error: A member reported their API script stopped working, now throwing a 401 error despite having credits and a valid API key.
    • A member suggested the API key might be invalid, deleted, or out of credits, advising to contact [email protected] if issues persist.
  • Contacting Support: The member who faced the 401 Error was advised to contact [email protected] if troubleshooting steps did not work.
    • The suggestion was made after the user confirmed that they had credits and a supposedly valid API Key.

OpenRouter ▷ #announcements (1 messages):

Service Outage, Root Cause Analysis, Error 401, Fix Implementation

  • Service Suffers Stumbles: The service is experiencing an issue and is currently under investigation.
    • A war room has been established to investigate the root cause, specifically focusing on 401 errors.
  • Fix is Finally Found!: The cause of the issue has been identified and a fix has been implemented.
    • Details on the specific cause and fix were not disclosed.

OpenRouter ▷ #app-showcase (2 messages):

clash-sh/clash, DirectShell

  • Clash-sh hits GitHub!: A member shared a link to the clash-sh/clash GitHub repository.
    • No additional details were provided about the repository’s purpose or features.
  • DirectShell turns accessibility layer into universal app interface!: A member shared a link to a DEV.to blog post about DirectShell, a tool that turns the accessibility layer into a universal app interface: DirectShell.
    • The repo is Open Source, with the bold claim that every screenshot-based AI agent, every enterprise API wrapper, and every RPA tool on Earth is legacy technology as of February 17, 2026.

OpenRouter ▷ #general (643 messagesđŸ”„đŸ”„đŸ”„):

API key leak, Opus 4.6 issues, Qwen Code performing well, OpenRouter Outage, GPT5-Mini lies

  • API Key got Leaked?!: A user reported their OpenRouter API key was leaked and used to spend $10 in 20 minutes via Cloud Code.
    • They were unable to determine how the key was leaked, as it was stored in an environment variable within a gitignored file and the OpenRouter account requires email verification for login.
  • Opus 4.6 Streaming Request Failures Plague Users: Several users reported encountering the “Streaming request failed with status 400 Bad Request” error when using Opus 4.6 through the OpenRouter API.
    • Some users also mentioned issues with empty responses from Grok 4.1 Fast model.
  ‱ Qwen Code Churns Tokens, Works Really Well: A user found that Qwen Code is working “a lot better than the larger qwen3 coder variant”, noting that it’s faster and churning through tokens up to the 30% context barrier of 1M tokens.
    • They exclaimed, “don’t write off qwen just yet!”
  • OpenRouter Suffers Major Outage, Sparks User Panic: OpenRouter experienced a major outage, causing widespread 401 errors across API surfaces.
    • Users joked about getting unfairly banned, causing the team to investigate the issue at OpenRouter Status.
  • GPT5-Mini Lies About Itself, Fosters Distrust: Users reported that GPT5-Mini is a “rabid fuckin liar”, constantly creating things that already exist.
    • It was suggested that the model’s quality has been compromised due to quantization for more GPUs.

OpenRouter ▷ #new-models (1 messages):

Readybot.io: OpenRouter - New Models


OpenRouter ▷ #discussion (12 messagesđŸ”„):

Claude running LFM 1.2B, Claude running Falcon H1-Tiny 90M, Anthropic model capabilities

  • Claude can run LFM 1.2B: A user reported being able to run LFM 1.2B within Claude as shown in this link.
    • The user expressed disbelief that it worked, stating, “why is it feel like it’s not real, but it’s real”.
  • Falcon H1-Tiny 90M runs inside Claude: A user reported being able to run Falcon H1-Tiny 90M within Claude as shown in this link.
    • Another user stated this environment has existed for a while, but they rarely see it utilized.

Unsloth AI (Daniel Han) ▷ #general (307 messagesđŸ”„đŸ”„):

Gemma speed increase, MXFP4 vs NVFP4 performance differences, Quantization and compression, Unsloth usage and capabilities, Bot detection methods

  • Gemma’s Impressive Speed Boost Shocks User: A user expressed surprise at how fast Gemma became after the latest update, noting it’s now 3x faster, according to Unsloth’s documentation and even faster than Qwen3-4B in their trials.
    • They ran the math and realized training on this would have been cheaper than the 4B model.
  • MXFP4 runs poorly on older hardware: MXFP4 is designed for Blackwell GPUs (RTX 50 series) and not Ampere (RTX 30 series), leading to slower performance on older hardware due to emulation.
    • A breakdown revealed that the fast MXFP4 path requires Blackwell’s native FP4 tensor cores (compute capability ≄ 12.00), with older architectures falling back to slower paths using on-the-fly quantization.
  • Debate QAT & Int4: Members discussed Quantization Aware Training (QAT) and int4 models, stating QAT makes intuitive sense since it handles edge cases better due to undertrained parts not fully being int4.
    ‱ They confirmed that most people use imatrix int4, and that GGUFs are way fatter than they seem from a BPW (bits per weight) perspective.
  • Unsloth’s Purpose Clarified: Users clarified that Unsloth is mainly for finetuning models, not for inference, with a recommendation to use at least 3-bit precision for best performance.
    ‱ Despite Unsloth uploading GGUFs, the major makers’ GGUFs are largely equivalent.
  • Innovative Bots Detect Bots: The group discussed strategies for bot detection, including using an LLM-connected bot to conduct an inverse Turing test via DM, requiring users to prove they are human.
    • Ultimately, the team concluded that using a bot would create bad UX as a method of preventing bad UX.
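The MXFP4 hardware split described above can be sketched in a few lines. This is an illustrative dispatch function, not Unsloth’s actual code; the only claim taken from the discussion is that the fast path needs Blackwell’s native FP4 tensor cores (compute capability ≄ 12.0) while older architectures like Ampere fall back to slower on-the-fly dequantization.

```python
def mxfp4_path(compute_capability: tuple[int, int]) -> str:
    """Pick a kernel path from a (major, minor) CUDA compute capability.

    Hypothetical sketch: >= 12.0 (Blackwell consumer, sm_120) gets the
    native FP4 path; everything older emulates via dequantization.
    """
    major, _minor = compute_capability
    if major >= 12:
        return "native-fp4"          # Blackwell FP4 tensor cores
    return "emulated-dequant"        # e.g. Ampere sm_86, Ada sm_89

# On a real system the capability would come from
# torch.cuda.get_device_capability(); hardcoded here for illustration.
print(mxfp4_path((12, 0)))  # RTX 50 series
print(mxfp4_path((8, 6)))   # RTX 30 series
```

This also explains the “3x slower on Ampere” style reports: the fallback path does extra dequantization work per matmul rather than using dedicated hardware.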

Unsloth AI (Daniel Han) ▷ #introduce-yourself (1 messages):

projectx668: Hey


Unsloth AI (Daniel Han) ▷ #off-topic (270 messagesđŸ”„đŸ”„):

OpenClaw, Qwen TTS, Benchmarking LLMs, GLM-5 performance in OpenCode, HTMX

  • OpenClaw’s Security Concerns Alarm: Members discussed OpenClaw, with some expressing concerns about the security risks associated with giving an LLM read+write access to everything on a device.
    • Concerns included potential API key leaks and the possibility of prompt injection leading to harmful actions, such as rm -rf /.*
  • Novelists Debate Machine-Readable Prose: Novelists debated nuances between “Hm?” vs “Hmm?”, as well as all-caps chapter headings, for machine readability.
    • The suggestion was to format text to appear as uppercase while retaining its original case in the underlying text for better parsing, to avoid misinterpretation as abbreviations like IT.
  • Benchmark or Bench-Maxxing: Community Debates AI Model Improvement: Members discussed whether AI models are being trained primarily to improve on benchmarks, rather than for true improvements, while recognizing the value of benchmarks for measurable progress.
    • Some expressed that Claude feels less “benchmaxxed” than other models and highlighted the potential of recursive and drifting training methods.
  • GLM-5’s OpenCode Hiccups: Some users found that GLM-5 does not perform as well in OpenCode compared to other models like Kimi K2.5 and Minimax M2.5, despite strong benchmark results.
    • One user speculated that the level of integration of new models into OpenCode might be a contributing factor, as OpenCode may be optimized for GPT models.
  ‱ HTMX Makes an Entrance: A member casually mentioned HTMX, a library that extends HTML with attributes for server-driven interactivity, as a way to avoid painful manual frontend work.
    ‱ Others expressed confusion, though one member seemed in the know, saying: “what I have been building an experimental LLM interface focused on reflection loops, persistent memory, and minimal filtering behavior.”

Unsloth AI (Daniel Han) ▷ #help (30 messagesđŸ”„):

save_pretrained_gguf issue, Unsloth multi-GPU support for GLM 5, Qwen3-coder-next MXFP4 Quant, Extracting image projector from merged model, Unsloth support for MacOS

  • Save_pretrained_gguf command not working: A member reported that the save_pretrained_gguf command doesn’t work on cloud Jupyter notebooks, but another member said it worked for the 4B model, so the issue may be related to VL models.
    • Another member confirmed that they’re working with a merged model.
  • GLM 5: Multi-GPU Support?: A member from silk.ai asked if GLM 5 already works with Unsloth multi-GPU, assuming enough VRAM is available.
    • The community awaited a response.
  • Maximizing Qwen3-coder-next MXFP4 Quant Speed: A member sought advice on maximizing tokens per second with Qwen3-coder-next mxfp4 quant, reporting only 30 output tokens per second despite expecting 50 with similar specs.
  • Unsloth on MacOS?: A member asked whether MacOS is supported for Unsloth while aiming to fine-tune Gemma3:1B for summarization.
    • Another member mentioned that the model architecture doesn’t change depending on where you fine tune it, and pointed to the Unsloth documentation.
  • Alibaba Cloud Partnership Discussion: Ethan from Alibaba Cloud inquired about the appropriate contact for partnership discussions.
    • They were directed to contact a specific user, ensuring that potential collaboration opportunities are properly addressed.

Unsloth AI (Daniel Han) ▷ #showcase (5 messages):

Konkani collections on Hugging Face, Unsloth finetuning

  • Konkani Collections Shared on Hugging Face: A member shared a link to Konkani collections on Hugging Face.
    • Another member jokingly pointed out that only Unsloth related projects should be promoted.
  • Unsloth Used For Finetuning: A member speculated that Unsloth was used to finetune a model.

LM Studio ▷ #general (379 messagesđŸ”„đŸ”„):

Fine Tuning Struggles, LM Studio and Claude Integration, LFM 1.2B vs Qwen 0.6B, Synthetic Datasets, GPT-OSS vs Qwen3

  • Training Pain Points Surface: Members are struggling with fine-tuning models citing difficulties with tokenizing large datasets and getting the training code correct.
  • LM Studio gets Claude Boost: A user has found success integrating LM Studio with Claude and Opencode, refactoring a Go project on a Mac Studio with 64GB RAM and a 200k context window with 35 tokens/second.
  • LFM and Qwen Model Duel: A member finds that after fine-tuning, LFM 1.2B is meaningfully better than Qwen 0.6B for handling Minecraft command datasets.
  • Synthetic Dataset Blues: Members are discussing challenges with creating synthetic datasets, but huggingface has the solution.
  • GPT-OSS Steals the Show: GPT-OSS 20B is preferred over Qwen3 for coding, with one user reporting 108 t/s, which is faster than Phi-4, even though they were memory bound on Qwen3-Next.

LM Studio ▷ #hardware-discussion (92 messagesđŸ”„đŸ”„):

frankenbuild stability issues, 5090 driver crashes, Intel Arc issues, DDR4 limitations

  • Frankenbuild Randomly Shuts Down, Sparks Troubleshooting: A user’s new “frankenbuild” (256GB RAM, Core Ultra 7, AMD R9000, 5060ti, and two 4060ti) experienced a random shutdown while idling, prompting concerns about stability and troubleshooting strategies.
    ‱ Suggestions included inspecting dump files, running memtest86+, checking power consumption with a power meter, and investigating potential thermal issues with the 12VHPWR connector on the AMD card.
  • 5090 LLM Driver Crashes Plague User, Despite Gaming Stability: A user reported random crashes with their Nvidia RTX 5090 when running LLMs, requiring frequent driver resets, despite the system remaining stable during high-end gaming.
    • Suggested fixes included clean driver installs, disabling Intel integrated graphics drivers, and testing GPU memory with tools like TechPowerUp MemTest64.
  • Intel Arc GPUs Struggle with Large Contexts and Flash Attention in LM Studio and VLLM: A user reported issues with various Intel Arc cards crashing in LM Studio and VLLM when using large context windows and flash attention, needing to disable flash attention and remove one layer to stabilize the system.
    • The user also noted that selecting any KV cache quantization causes the KV cache to roll into system memory instead of VRAM, significantly slowing performance.
  • DDR4 Limits Prompt Regret, Users Eye 2026 for Relief: A user lamented being stuck with a DDR4 PC until 2028, expressing regret for not building a DDR5 system when they had the chance.
    • Another user echoed the sentiment, wishing for at least 128GB of RAM, while a third user noted that RAM prices would ease late 2026.
  • MCP Exploitation?: One user asked why they have three processes running in server mode, even though only one model is loaded.
    ‱ One user replied that if an MCP is being used to send multiple queries, even unintentionally, the server will try to process them in parallel.

Cursor Community ▷ #general (305 messagesđŸ”„đŸ”„):

Ollama GLM-5 model, Cursor browser automation issues, Claude model speed issues, Cursor Pro Appreciation, Unpaid invoice error

  • Cursor’s Screenshot Issues: A user reported ongoing issues with browser automation and screenshot capture in Opus 4.5 and 4.6, noting that a previously functional custom implementation has stopped working, leading to wasted tokens.
    • Another user suggested checking the MCP logs, with a screenshot of the expected MCP configuration screen.
  • Roblox Game gets pilfered by Plugin Piranha: A user reported being banned from a platform after contacting the owner about an issue with their malicious Roblox Studio plugin (SuperbulletAI), alleging that the owner stole their game after obtaining their signup email.
    • Another user expressed concern about the software having access to full game files and scripts, and suggested recoding the game as validation.
  • Sonnet 4.6 Soars Past Opus 4.5: Claude Sonnet 4.6 has been released, with users reporting it to be preferred over Opus 4.5 59% of the time, citing improvements in instruction following and reduced overengineering, as reported in Anthropic’s announcement.
    ‱ Asna_0101 noted, “Users even preferred Sonnet 4.6 to Opus 4.5. They rated Sonnet 4.6 as significantly less prone to overengineering and ‘laziness,’ and meaningfully better at instruction following.”
  • AI Agent Arena Attracts Attention: A user highlighted the Unemployment Arena platform, where AI agents compete in customer support simulations, and claimed to have achieved a top ranking using a Cursor-built agent.
    • Another user noted that the frontier models likely wrote the agent skill, suggesting that the user beat a model with itself.
  ‱ Windows Users Wail Over Linux Commands: A user reported that the current system instructions default to Linux commands.
    • Others in the discussion suggested using WSL2 or dual booting with Ubuntu, while another mentioned using rules usually fixes the command issue.

Latent Space ▷ #watercooler (8 messagesđŸ”„):

Email Subject Lines, Twitter Engineer Count, Prioritization vs. Time, Viral Tweets

  • Email Subject Lines are Optional?: A member joked about the current work culture where one can send an email to customers without a subject line and posted an image in support of the statement.
  • Twitter’s Staffing Numbers Debated: A member questioned whether Twitter actually only has 30 engineers.
    ‱ The discussion referenced a link shared in the chat here, though no resolution was reached.
  • Time or Priority?: A member shared the expression: “Don’t have time” —> “Not a priority right now”.
    ‱ They added, “Yeah it sucks but let’s be honest if this was important we’d all get it done.”
  • Swizec’s Viral Tweet: Content creator Swizec Teller shared FML this is now my most viral tweet ever đŸ„Č beating even the github+ai joke from last week.

Latent Space ▷ #creator-economy (4 messages):

Thumbnail Analysis, AI-Powered YouTube Thumbnail Analysis Project, X-Ware.v0, CLIP feature extraction, Color Analysis

  • X Marks the Spot: AI Uncovers YouTube Thumbnail Secrets: A user analyzed thumbnails using Claude, extracting CLIP features and performing color analysis on ~3,000 thumbnails.
    • The goal was to train models predicting view counts and subscriber rates, using Latent Space data as a test set.
  • Thumbnail Deep Dive: AI-Powered Analysis Project Unveiled: The X-Ware.v0 project is an AI-Powered YouTube Thumbnail Analysis Project, available here.
    • This project is focused on scraping and analyzing thumbnails.

Latent Space ▷ #memes (17 messagesđŸ”„):

Consumer Spending Habits, SaaS Investor Market Sentiment, Private Equity Investments

  • Subscription Spending Skewed?: A social media post humorously critiques consumer spending habits, suggesting that users incorrectly prioritize adult content subscriptions over productivity AI models like Claude.
  • SaaS Investors Anxious Over Market Opening: A social media post satirizes the anxious or high-stakes atmosphere for SaaS investors preparing for the market to open after a long weekend.
  • Private Equity’s Play in HVAC: A social media post humorously highlights how private equity investors perceive low-tech, profitable HVAC service companies as prime opportunities for modernization and value creation.

Latent Space ▷ #stocks-crypto-macro-economics (9 messagesđŸ”„):

Apple's cash reserves strategy, Investment strategy amidst political pessimism, NVIDIA, Salesforce, and Adobe investments

  • Apple’s Cash Mountain: Strategic UX Awaits?: Members discuss Apple’s historical penchant for maintaining large cash reserves, suggesting it’s a deliberate strategy harking back to their near-bankruptcy experience in the 80s.
    • The consensus is that Apple may be waiting for the AI training/inference landscape to commoditize before investing heavily, opting instead to build superior UX on top of existing models, potentially licensing or forking one later.
  • Bearish Bets: Capitalizing on US Decline?: A user shared a tweet outlining an investment strategy predicated on the belief that the United States is in decline.
    • The strategy involves hedging or capitalizing on this perceived shift by heavily investing in major tech stocks such as NVIDIA, Salesforce, and Adobe.

Latent Space ▷ #intro-yourself-pls (7 messages):

AI Compliance Product, AI Time Drain Solutions

  • AI Compliance Product arrives: A project manager in AI compliance is promoting their product, PromptMetrics.
  • AI Founder seeks to solve daily time drain: An AI founder is working to solve daily time drain.
    • Another member chimed in with a reminder: water, don’t forget water, and welcome.

Latent Space ▷ #tech-discussion-non-ai (11 messagesđŸ”„):

Firefox stable anchor positioning, HTML new tag type, Cmd+K search libraries, timestamps resetting publicly

  • Firefox anchors the competition: Anchor positioning has landed in Firefox stable, enabling cool new blog layouts.
    • A user exclaims “using it now for my blog, it’s so freaking cool!”
  • HTML gets tag without parents: HTML is getting a new tag type that can have “opening” and “closing” tags that don’t need to be within the same parent, more info in this post.
    • Some users are skeptical, one user said “dang, this one is a bit out there”.
  • Library List for Cmd+K Search: Many opensource libraries use the Cmd+K search on their docs, with consistent UX.
  • Timestamp trickery, public reset vs hidden edit: Reposting to Hacker News resets the timestamps publicly.
    • However, hitting edit reveals the original post date, as shown in this screenshot.

Latent Space ▷ #founders (1 messages):

swyxio: https://x.com/pk_iv/status/2023421931660415191?s=12


Latent Space ▷ #hiring-and-jobs (5 messages):

A16Z New Media Team Hiring, AI Developer Seeking Opportunities

  • A16Z Media Team is Hiring!: Katie Kirsch announced that the a16z team is expanding and actively hiring for several roles across media, marketing, and operations.
    • Open positions include roles in podcasting, event management, and community-building.
  • AI Dev Actively Seeks Roles: An AI Developer with experience in AI platforms, automation, and agent-based systems is looking for opportunities to join a team or work directly with clients on meaningful projects.
    • They invite inquiries from those building an AI platform, automation, or agent based system and needing a developer with deep architectural experience.

Latent Space ▷ #san-francisco-sf (9 messagesđŸ”„):

World Labs Hackathon, humans& AI Hackathon

  • World Labs Hosts Inaugural Hackathon: World Labs announced its first-ever hackathon taking place on Friday, February 20, 2026, in San Francisco, focusing on new technologies at the frontier of spatial intelligence, and is accepting applications, as advertised in this tweet.
  • humans& AI Hackathon Announced: The humans& team announced a hackathon for the upcoming Saturday focusing on building AI-driven communication and collaboration apps, as advertised in this tweet.
    • Participants can work with creative professionals, learn about the platform’s development, and compete for prizes at luma.com.

Latent Space ▷ #new-york-nyc (1 messages):

AI for Science, Medical Models, Voice AI, Voice Demos in Clinics, Claude Skills

  • Doctors Demo AI for Science: An event showcasing demos and doctors weighing in on the state of AI for Science is scheduled, check it out on Luma.
    • The event will feature Claude Skills with frontier medical models like MedGemma, and voice demos using make.com for integrating Voice AI into clinic operations.
  • Voice Demos Streamline Clinic Ops: The event highlights voice demos illustrating the possibilities of integrating Voice AI into clinic operations using make.com.
    • Attendees will witness firsthand how voice technology can modernize and enhance various aspects of clinical workflows.

Latent Space ▷ #ai-general-news-n-chat (49 messagesđŸ”„):

Manus AI Agents, Categorical Flow Maps, Claude Sonnet 4.6, Mistral AI Acquires Koyeb, Ming-omni-tts Voice Core Models

  • Manus AI’s Telegram-Based Agents Launch: Manus AI launched Manus Agents, a personalized AI assistant with long-term memory, accessible via Telegram and integrated with tools like Gmail and Notion, capable of generating videos, slides, and websites within chat interfaces (Manus AI Tweet).
  • Categorical Flow Maps Introduced for Continuous-Time Discrete Data: Oscar Davis and team introduced Categorical Flow Maps, a new method enabling continuous-time sampling of discrete data, addressing speed limitations of discrete diffusion and bringing test-time inference to discrete models (Oscar Davis Tweet).
  • Claude Sonnet 4.6 Debuts with Coding and Reasoning Boost: Claude Sonnet 4.6 launched with improved coding, reasoning, and agent planning, featuring a 1-million-token context window in beta, outperforming Opus 4.5 in some areas (Anthropic Announcement).
    • Sonnet 4.6 achieved 72.5% on OSWorld-Verified, 79.6% on SWE-bench, and 59.1% on Terminal-Bench 2.0.
  • Mistral Gobbles Koyeb for Compute Boost: Mistral AI is set to acquire Koyeb to integrate Koyeb’s platform and expertise, aiming to accelerate the development of Mistral Compute infrastructure (Yann Leger Tweet).
  • Ming-omni-tts Models Sing New Voice: Ant Ling announced the Ming-omni-tts-16.8B-A3B and 0.5B models, serving as the voice core for Ming-flash-omni-2.0, designed for high-quality voiceovers, podcasting tools, and OpenClaw integration (Ant Ling Tweet).

Latent Space ▷ #llm-paper-club (51 messagesđŸ”„):

Dynamic Memory File Systems, NVIDIA PersonaPlex-7B, Evolution of RL Research, Muon vs. Shampoo Optimizers, Rubric-Based RL

  • Dynamic Memory Filesystems get the Nod: The author argues that AI agent memory should shift from upfront context window loading to a dynamic file system approach as explained in this paper.
    • They propose organizing context into a structured namespace (scratchpad, episodic, and fact tiers) and giving agents tools to list, read, and search these ‘files’ on demand to prevent memory loss during compaction and improve context debuggability.
  • NVIDIA’s Voice Model Speaks Up: NVIDIA has released PersonaPlex-7B, an open-source 7-billion parameter voice model capable of full-duplex communication.
    • This model allows for simultaneous listening and speaking without the need for traditional turn-taking.
  • RL Gets Real: Nathan Lambert discusses the shift in RL research from simple environment benchmarking in 2025 to a more robust 2026 focus on generalization, tool-use, and procedural generation, as seen in this video.
    • He highlights that academic work is moving toward complex data pipelines and diverse tasks, moving beyond simple algorithmic tweaks to create more meaningful and integrated AI behaviors.
  • Optimizer Algo-Rhythm: Runa Eschenhagen discusses the relationship between the Muon and Shampoo optimizers, proposing that Shampoo can be understood as a left- and right-adapted version of Muon.
    • This is similar to the relationship between Adam and Signum.
  • Rubric-Based RL: Grading the Curve: Cameron R. Wolfe, Ph.D. introduced a comprehensive writeup on Rubric-Based RL, covering over 15 papers, see this link.
    • The content explores the transition from LLM-as-a-Judge to more structured rubrics, and provides strategies for using rubrics to extend Reinforcement Learning from Verifiable Rewards (RLVR) into non-verifiable domains.

Latent Space ▷ #singapore-sg (7 messages):

AIE Singapore, AI Events Singapore

  • Singapore’s Coolness Under Scrutiny: A blog post questions why Singapore may no longer be considered “cool”.
    • The article may have been posted as humorous and ironic, and is an opinion piece.
  • AI Engineer (AIE) Coming to Singapore: The AI Engineer (AIE) event is coming to Singapore, as announced via a link.
    • Those serious about helping out were encouraged to ping a specific user.
  • Singapore AI Event List Shared: A curated list of upcoming AI developer events in Singapore was shared, featuring meetups and hackathons hosted by AmpCode, OpenAI Codex, Claude AI, Gemini, and Cursor AI throughout February and March, which can be found here.

Latent Space ▷ #ai-in-action-builders-techstacks-tips-coding-productivity (44 messagesđŸ”„):

Browser interaction in pi, surf-cli native host issue, Arc Browser configuration, Codex App Performance, Refactoring

  • Debugging Browser Interaction in Pi: A member sought assistance with browser interaction in Pi, specifically citing issues with Surf and native host disconnection, and others pointed to pi-web-browse, pi-agent-browser and pi-browser packages as possible solutions.
    • The user noted concerns about using these extensions due to potential instability on startup after updates and suggested using playwright.
  • Surf-CLI Native Host Fix: A user encountered a ‘Native host disconnected: Specified native messaging host not found’ error with surf-cli and resolved it by copying the surf.browser.host.json manifest to Arc’s native messaging host directory as a workaround.
    • The solution involved creating the directory (mkdir -p ~/Library/Application\ Support/Arc/User\ Data/NativeMessagingHosts/) and copying the manifest from Chrome’s location (cp ~/Library/Application\ Support/Google/Chrome/NativeMessagingHosts/surf.browser.host.json ~/Library/Application\ Support/Arc/User\ Data/NativeMessagingHosts/).
  • Arc Browser Native Messaging Host Config: The core problem was that Arc browser stores its NativeMessagingHosts in a different path than Chrome: Chrome uses ~/Library/Application Support/Google/Chrome/NativeMessagingHosts/, Arc uses ~/Library/Application Support/Arc/User Data/NativeMessagingHosts/.
    • Running surf install registers the native messaging host for Chrome’s path but not Arc’s so Arc can’t find the host and reports the error.
  • Codex App Experiencing Lagginess: One member mentioned starting to use the Codex App with extra usage credits, but another user noted that it becomes extremely laggy once a thread exceeds a few chats and consumes a ton of CPU.
    • They alternate its use due to these performance issues.
  • Refactoring advice from OpenAI Peter: In a discussion of refactoring and technical debt, a member linked to OpenAI Peter’s writing on constant refactoring.
    • The article is about the value of cleaning up code constantly.
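The surf-cli/Arc workaround above amounts to one directory creation and one file copy. Here is a small Python sketch of that fix, using the macOS paths quoted in the discussion; the function name and the idea of parameterizing the home directory are illustrative, not part of surf-cli.

```python
import shutil
from pathlib import Path

def install_arc_manifest(home: Path) -> Path:
    """Copy the surf native-messaging-host manifest from Chrome's
    directory (where `surf install` registers it) into Arc's, which
    lives under a different path."""
    chrome = home / "Library/Application Support/Google/Chrome/NativeMessagingHosts"
    arc = home / "Library/Application Support/Arc/User Data/NativeMessagingHosts"
    arc.mkdir(parents=True, exist_ok=True)        # mkdir -p
    dst = arc / "surf.browser.host.json"
    shutil.copy2(chrome / "surf.browser.host.json", dst)  # cp Chrome -> Arc
    return dst

# Usage on a real machine: install_arc_manifest(Path.home())
```

With the manifest in Arc’s own `NativeMessagingHosts` directory, the browser can resolve the host and the “Specified native messaging host not found” error goes away.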

Latent Space ▷ #share-your-work (14 messagesđŸ”„):

Skill Curves Estimation, Pickleball Popularity, Racing Sim Gear, Minimalist Usage Tracker for Mac, RIDER PI UPDATE

  • Skill Curve Graph Missing From Draft Piece: A member suggested adding a graph of the estimated skill curves to a draft piece, since the tables give some indication of points in the graph.
    • They also asked why pickleball got way hotter than ping pong, padel, etc.
  • Racing Sim Gear gives Hero Moments Faster: A member commented on the interesting thesis and stated they always thought of it as getting to hero moment faster, noting that F1 sims are popular nowadays.
    • They added that F1 sims give you that feeling without needing to do all the prework, also noting that there’s really high end racing sim gear you can get.
  • Claude Battery: Minimalist Usage Tracker for Mac Released: A member announced a minimalist usage tracker for mac called Claude Battery is now available, because watching tokens felt like noise.
    • The tracker is totally open source, super minimalist, incredibly easy to use, and can track up to 5 accounts.
  • RIDER PI gets Body, Words, Movement and Sight: A member posted an update on RIDER PI sharing that today we gave my body words, movement, and sight.
    • New features include Infinite Word Loop, Physical Response, Camera Live, and a Mius-UI Dashboard with a linked TikTok video to see the robot in action.

Latent Space ▷ #robotics-and-world-model (12 messagesđŸ”„):

Waymo 6th Gen Platform Economics, Humanoid Robot Physical Intelligence, Boston Dynamics Atlas Robot

  • Waymo’s Wheels of Fortune: François Chollet analyzes Waymo’s 6th generation platform economics, estimating current vehicle costs at $70,000 with potential for a 50% reduction by 2028.
    • Waymo is rapidly scaling, reaching over 500,000 weekly driverless rides while maintaining a 3x annual growth rate.
  • Chinese Bots Backflip into the Future: Tristan calls out the rapid evolution of Chinese robotics, progressing from simple movements to complex tasks like backflips and kung fu within a year.
    • He posits physical intelligence as the upcoming frontier for technological advancement.
  • Atlas Shrugged
 and then Upgraded: MrLaalpotato showcases the latest Boston Dynamics Atlas robot, highlighting its enhanced human-like movements and mobility surpassing human physical constraints.

Latent Space ▷ #good-writing (2 messages):

Semantic Ablation AI Writing, Cute Puppy


Latent Space ▷ #genmedia-creative-ai-video-image-voice-music-inspo-consumer-ai (8 messagesđŸ”„):

PolyAI Funding, Agent Studio Lite, Jia Zhangke, AI Filmmaking

  • PolyAI bags $200M and Launches Agent Studio Lite: Backed by Nvidia and Khosla Ventures, PolyAI announced a $200M funding round, highlighting their voice AI success with major brands like Marriott.
    • They are now offering early access to Agent Studio Lite, a tool that builds functional voice agents from a URL in five minutes, including a 3-month free trial for waitlisted users.
  • Jia Zhangke Directs with AI: Renowned Chinese director Jia Zhangke transitioned to AI-assisted filmmaking using Seedance 2.0, completing a film in three days, according to this X post.
    • He views AI as a natural technological evolution equivalent to the shift to digital cameras, contrasting his proactive adoption with Hollywood’s legal resistance to AI technology.

Latent Space ▷ #mechinterp-alignment-safety (1 messages):

Frontier LLM parameters, Scaling Laws

  • Debate over Frontier LLM Parameters flares up: The debate over the ideal size for frontier LLMs reignited following a post on X comparing the scaling laws to observations.
    • One member cited a paper that dives deep into the parameter scaling laws which suggest the ideal frontier LLM may be much smaller than currently believed.
  • Scaling Laws vs. Observations: A paper was cited that dives deep into the parameter scaling laws.
    • It suggests the ideal frontier LLM may be much smaller than currently believed.

Latent Space ▷ #dev-writers-retreat-2025-dwr (1 messages):

swyxio: https://www.theregister.com/2026/02/16/semantic_ablation_ai_writing/


Latent Space ▷ #applied-ai-experimentation (40 messagesđŸ”„):

Claude Code skill, little quickjs, thread-local KV store, macos1 package, packaging javascript

  • AI Sparks New Adventure Log: One member is creating an AI-inspired adventure log to track issues, using Claude Code skills to revitalize abandoned projects and explore AI’s collaborative potential, noting the world is too crazy.
    • An editor was added for the generated cards in this experimental project.
  • VMs running quickjs code in Browser: A developer shared progress on a browser-based system using little quickjs running in web workers as sandboxed “VMs” with access to select redux actions, showcasing it with an attached image.
    • The member stated that the VMs return intents that are sandboxed through capabilities, accessible via a debug pane.
  • Blackboard System for Prompt Objects Unveiled: A developer added a blackboard-like system, enabling prompt objects to read and write to a thread-local KV store as shown in the attached image.
    • This was accompanied by cleaning up a macos1 package for better graphics.
  • Opensource Aspirations Materialize: A member is cleaning up their macos1 package and inviting others to contribute and provide feedback, despite acknowledging that they don’t even know what reuse looks like in this day and age.
    • They linked to an interesting GitHub Discussions thread, noting that collaboration at the idea level is critical, especially given the speed of Claude Code development.
  • Forking Frenzy Foreseen for Future Development: A member suggested that contributing to repos may shift towards forking back and forth, where the open documentation of ideas and in-progress epics can guide contributors in diverging projects.
    • They added that this approach could lead to projects taking cues from one another and extending in unique directions.

GPU MODE ▷ #general (12 messagesđŸ”„):

MoE Quantization Support, GPU TFLOP Calculation, Maximally Achievable Matmul FLOPS, Competition Announcements Mailing List

  • MoE Quantization Frameworks Missing Support: A member expressed confusion about the lack of open-source quantization frameworks supporting MoE (Mixture of Experts) quantization, especially fused experts quantization, noting that tools like llm-compressor and nvidia-model-opt don’t support it despite needing only weight reshaping.
  • Members calculate GPU TFLOPs manually: A user asked about the most reliable way to check TFLOPs of their laptop RTX 5070 Ti GPU, noting discrepancies in specs among different laptop versions, and ended up calculating the numbers manually.
    • Another member shared a GitHub link that they use for similar purposes.
  • Maximally Achievable Matmul Flops (MAMF) Significantly Below Theoretical Flops: A member noted that the maximally achievable matmul flops (MAMF) is significantly lower than the theoretical flops, observing only 80% for the H100.
    • A member replied that one can do better, because tensor flops aren’t just GEMM: with a fused kernel of two or more matmuls whose results never leave the accelerator’s registers, performance will definitely be faster than what this benchmark reports.
  • Request for Centralized Competition Announcements: A member inquired about a single stream, like an email list, for GPU MODE adjacent competition announcements, expressing the need to monitor multiple channels and platforms to avoid missing any.
    • It was suggested that gpumode.com and the #announcements channel would be the best sources.
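The manual TFLOPs calculation mentioned above boils down to cores × 2 FLOPs per FMA × clock; a sketch with placeholder numbers (the core count and boost clock below are illustrative, not the laptop 5070 Ti’s actual specs):

```python
# Theoretical peak = CUDA cores x 2 FLOPs per FMA cycle x boost clock (GHz).
# The inputs are placeholders -- laptop variants differ in clocks and TGP,
# so check your card's real values (e.g. via nvidia-smi) before trusting this.
def peak_tflops(cuda_cores: int, boost_ghz: float, flops_per_core_cycle: int = 2) -> float:
    return cuda_cores * flops_per_core_cycle * boost_ghz / 1e3

peak = peak_tflops(cuda_cores=5888, boost_ghz=1.7)  # illustrative values
print(f"theoretical FP32 peak: {peak:.1f} TFLOPS")

# The "maximally achievable matmul flops" discussion is then just the ratio
# of measured GEMM throughput to this theoretical peak:
measured = 0.8 * peak  # e.g. the ~80% observed for the H100 above
print(f"efficiency: {measured / peak:.0%}")
```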

GPU MODE ▷ #cuda (38 messagesđŸ”„):

Nsight Compute Workflow, warpx4 multicast, RTX 6000 Pro peak TFLOPS, smem->rmem pipelining

  • Streamline Nsight Workflow for Kernel Profiling: A member sought confirmation on using Nsight Compute by skipping warmup launches and profiling a single kernel to obtain qualitative metrics, while relying on CUDA Events for accurate timing, to avoid replay-inflated durations.
    • Another member suggested that Nsight Compute should work in isolation, focusing on profiling one kernel at a time to avoid unnecessary overhead.
  • warpx4 multicast modifier mysteries: A member inquired about the necessity of the warpx4 (multicast to warpgroup) modifier for tcgen05.cp.cta_group::1.32x128b instruction, referencing the CUDA documentation.
    • Another member suggested the warpx4 modifier replicates data across 4 groups of 32-row tmem, potentially related to rescaling scale factors in the epilogue for MXFP8/NVFP4 GEMM, despite the oddity of needing to multicast SFA/SFB to all four TMEM groups.
  • Hunting RTX 6000 Pro peak TFLOPS: A member who ran code on a rented RTX 6000 Pro workstation inquired how to find the peak TFLOPS, given multiple editions of the card, posting an image of the card.
  • Ampere’s smem->rmem pipelining Trick: Following up on the prior discussion, a member who achieved 350 TFLOPS using a persistent kernel warp specialized with morton order, sought advice on pushing performance further, linking their theCudaBender GitHub repo.

Adaptive GPU Power Capping, Energy Reduction vs Latency, Hutter Prize

  • Adaptive GPU Power Capping Explored: A member shared a GitHub repository focusing on adaptive GPU power capping.
    • The concept revolves around reducing energy consumption at the cost of increased latency.
  • Hutter Prize Contest Heats Up: A member linked to Gwern Branwen’s blog post on the Hutter Prize.
    • The Hutter Prize rewards advancements in lossless compression of human knowledge, aiming to encourage the development of more intelligent and efficient algorithms.
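As a toy illustration of the lossless-compression objective behind the Hutter Prize (a better model of the data yields a smaller compressed output; zlib stands in here for a real modeling-based compressor):

```python
import zlib

# Highly repetitive text compresses far better than unpredictable bytes --
# the compression ratio is a crude proxy for how well the compressor
# "understands" the data, which is the intuition the prize builds on.
text = b"the quick brown fox jumps over the lazy dog " * 100
packed = zlib.compress(text, level=9)

assert zlib.decompress(packed) == text  # lossless: exact round-trip
print(f"{len(text)} -> {len(packed)} bytes "
      f"(ratio {len(packed) / len(text):.3f})")
```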

GPU MODE ▷ #beginner (1 messages):

Kernel Competitions, popcorn-cli tool

  • Kernel Competition Submissions Simplified: Making a first submission to kernel competitions is now easier thanks to the linked instructions.
    • Simply go to gpumode.com and click on submit your first kernel.
  • Popcorn-CLI Tool Streamlines Setup: The popcorn-cli setup command now adds details on reference-problems, adds a working submission by pulling from reference-kernels, and includes skill.md files for easier agent coding.

GPU MODE ▷ #pmpp-book (1 messages):

Textbook covers, Auras

  • Allure of Textbook Cover: A member expressed a fondness for a particular textbook cover.
    • They stated it had a certain aura.

GPU MODE ▷ #webgpu (5 messages):

Lean4, google/dawn, WGSL kernels, Kernel Fusion

  • Lean4 code boasts 20% speed boost: A member reports code written in Lean4 is 20% faster than llama.cpp, attributing it to GPU usage and efficient kernel fusion.
    • The implementation mirrors that of llama.cpp, ensuring consistent output with the reference model.
  • Lean4 Achieves Zero-Cost FFI: The author leverages Lean 4’s compilation to C, enabling direct binding to google/dawn without VM/GC overhead, unlike Python or JS.
    • This ‘Zero-cost FFI’ is a key factor in optimizing performance.
  • “PreparedDispatch” mechanism captures graphs: A ‘PreparedDispatch’ mechanism records command encoding once and replays it, eliminating CPU-side overhead during the generation loop.
    • This approach has been dubbed Graph Capture.
  • WGSL Kernels use Fusion: Lean’s metaprogramming generates fused WGSL kernels (e.g., Gate+Up+ReLU) to minimize VRAM read/write access.
    • It also includes optimizations such as loop unrolling, addressing the main bottleneck.
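The Gate+Up+ReLU fusion above computes in one kernel what would otherwise take separate passes over intermediate buffers; a numpy sketch of the math being fused (shapes and the ReLU gating form are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)        # activations
w_gate = rng.standard_normal((16, 32)).astype(np.float32)  # gate projection
w_up = rng.standard_normal((16, 32)).astype(np.float32)    # up projection

# Unfused: each intermediate (gate, up, activated gate) is written to and
# re-read from memory -- on a GPU that is the VRAM traffic fusion removes.
gate = x @ w_gate
up = x @ w_up
out_unfused = np.maximum(gate, 0.0) * up

# Fused: the same arithmetic, conceptually emitted as one WGSL kernel whose
# intermediates never leave registers/shared memory.
out_fused = np.maximum(x @ w_gate, 0.0) * (x @ w_up)

assert np.allclose(out_fused, out_unfused)
print(out_fused.shape)
```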

GPU MODE ▷ #popcorn (3 messages):

Prime-RL, Verifiers, GLM 5 Evals, Flashinfer-bench timeouts

  • Contributor expresses interest in Prime-RL and Verifiers: A member expressed interest in contributing to dataset or environment work, citing familiarity with prime-rl and verifiers, and offering compute resources.
    • They inquired about current priorities, offering to translate a dataset or create a verifier environment.
  • GLM 5 Evals requested: A member offered to run GLM 5 evals locally, including flash-infer-bench, seeking environment suggestions.
    • They are available to do evals and wanted to assist in running tests.
  • Flashinfer-bench sees Timeouts: A member noted that flashinfer-bench sometimes causes timeouts in the modal runner.
    • This is because some definitions have almost 100 workloads, and there is an env argument to limit the amount of workloads per definition.

GPU MODE ▷ #thunderkittens (1 messages):

FP8 8-wave GEMM, HK PR

  • FP8 8-Wave GEMM Gets A Boost: A member tweaked the FP8 8-wave GEMM so it’s now on par with or faster than the 4-wave FP8 GEMM, depending on matrix problem size.
    • They are requesting a review of the HK PR, attaching an image.

GPU MODE ▷ #gpuæšĄćŒ (2 messages):

Discord Translation, Image Generation, Chinese Apps

  • Image analysis gives user translation tips: A user shared a Gemini-generated image and humorously noted that while Chinese apps like Xiaohongshu and WeChat have built-in translation buttons, Discord requires copying and pasting text into a translator.
  • User celebrates New Year: A user wishes everyone a Happy New Year.

GPU MODE ▷ #status (1 messages):

Heroku Outage, Salesforce Status

  • Heroku hit hard, wreaks havoc: A member reported a Heroku outage affecting their services, although the CLI was reportedly still functional.
  • Salesforce Confirms and Resolves Heroku Incident: Salesforce acknowledged a service incident impacting Heroku, leading to disruptions.
    • The incident was later marked as resolved, restoring normal operation to Heroku services.

GPU MODE ▷ #nvidia-competition (13 messagesđŸ”„):

Rubin AI, Leaderboard Status, Competition Deadline

  • Rubin AI Discussions Requested: A member inquired about specific details or solutions related to Rubin AI, indicating a problem they’re trying to solve.
    • They mentioned encountering a weird error and sought assistance from the community.
  • Heroku Health Issues Impacting Leaderboard: Members reported errors accessing the leaderboard, with one suspecting Heroku might be experiencing health issues, referencing Downdetector.
    • The organizers acknowledged the issue, stating we dont have a good mitigation for this, opened a ticket with Heroku, and promised to monitor the situation.
  • Competition Deadline Discrepancy: A member pointed out a discrepancy in the competition’s end time: Luma stating February 21, 2026, 07:30 UTC, while gpumode.com specifies February 20, 2026, 0:00 UTC.
    • An organizer expressed surprise, asked for date preferences, and suggested the later date might be fairer due to potential confusion caused by the organizers.

GPU MODE ▷ #career-advice (6 messages):

Triton limitations, Async pipelining, TMEM interaction, Warp shuffles, Hierarchical reductions

  • Triton’s Expressiveness: Async Pipelining Puzzle: A member questions whether Triton can truly express complex concepts like async pipelining, interaction with TMEM, warp shuffles, hierarchical reductions, and DSMEM interactions.
  • A Long Way from Triton: A member jokingly notes that someone has come a long way from mainly using Triton.

GPU MODE ▷ #flashinfer (9 messagesđŸ”„):

TVM FFI Bindings for GPU Kernels, FlashInfer Agent Baseline Performance, Fused MoE Evaluation Metrics, FP8 Kernel Errors, NCU on B200

  • TVM FFI Ships Kernels Faster: GPU Mode mentioned TVM FFI binding for shipping kernels to different runtimes, noting it compiles faster than torch but still allows torch bindings.
    • One user said that most of the backends use sm100, not sm100a, so any raw ptx stuff just crashes when using the tvm ffi backend.
  • FlashInfer Agent achieves 5.74x Speedup: A member reported a 5.74x speedup using the mlsys26-agent-baseline with the evolve agent on the MOE track, using Anthropic Claude Opus 4.6, total_steps=100, pool_size=6, evaluated on B200.
    • Another member using the same baseline had a similar experience being way behind the flash infer baseline.
  • Digging into Fused MoE Details: One member inquired about the evaluation metrics for fused MoE, specifically asking whether all 19 workloads will be evaluated equally.
    • The member also wanted to know what version of Triton is used in the final evaluation.
  • FP8 Kernel Errors are High: Using the mlsys26-agent-baseline with Claude Opus 4.6 (total_steps=100, pool_size=6, eval on B200), a user noted that while the speedup looked decent, the max_relative_error and max_absolute_error seemed very high, despite being marked as correct.
    • The member asked if this is expected for FP8 kernels, or whether it would fail under a stricter correctness check in the final evaluation.
  • NCU Remains Elusive on B200: A user inquired about using NCU during experiments on the B200, noting that it seems no one has been able to so far.
    • They checked Modal’s slack and it seems no one so far has been able to, but you never know.
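The correctness metrics discussed above (max_relative_error, max_absolute_error) are typically computed elementwise against a reference output; a sketch, with a small epsilon (my choice, not flashinfer-bench’s) guarding division by zero:

```python
import numpy as np

def max_errors(ref: np.ndarray, out: np.ndarray, eps: float = 1e-8):
    """Return (max_absolute_error, max_relative_error) vs. a reference.

    For low-precision formats like FP8, a large relative error on
    near-zero reference values is common even when the absolute error
    is tiny -- one reason a result can look alarming by one metric
    while still passing a tolerance check on the other.
    """
    abs_err = np.abs(out - ref)
    rel_err = abs_err / np.maximum(np.abs(ref), eps)
    return float(abs_err.max()), float(rel_err.max())

ref = np.array([1.0, 0.5, 1e-6])
out = np.array([1.01, 0.5, 2e-6])
print(max_errors(ref, out))
```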

HuggingFace ▷ #general (50 messagesđŸ”„):

ML Engineer advice in Spain, Qwen3.5 Local Run, Course/Learn Channels Merger, Computer Vision Channel, Hugging Face Ecosystem Hot Topics

  • ML Engineer in Spain Seeks Advice: A 24-year-old ML Engineer in Spain seeks advice on career development, expressing a preference for meaningful, intensive work over a standard 9-to-5 job.
    • One member suggested a stable job paired with an unpaid role to gain experience, highlighting the challenge of entering the field without significant experience.
  • Run Qwen3.5 Locally With GGUF: Users can now run Qwen3.5 locally using this link with the hope that smaller versions will be released soon.
  • Agents Course Channels Have Been Merged: The course/learn channels have been merged into a single channel for ease of access and management, linked here.
  • Hugging Face Ecosystem’s Hot Topics Explored: When considering the HF ecosystem, members think Transformers/Diffusers/Gradio and their related libraries are always hot topics, with many other libraries available at the HuggingFace GitHub.
  • Inference Endpoints Experience Service Panic: Users reported experiencing Error 500 and Service panicked messages when calling inference endpoints, despite the Hugging Face status page showing operational status.
    • One user resolved the issue by recreating the endpoint and migrating production traffic to the new one.

HuggingFace ▷ #i-made-this (27 messagesđŸ”„):

Chatbot Creativity Study, Pocket-TTS, Smart-KNN, Microclaw Fallback Agent, DirectShell Interface

  • Chatbots’ Creativity Under the Microscope: A new study explores whether chatbots can generate unique and creative topics through chained conversations, suggesting they can be creative, and is available on Zenodo.
  • Pocket-TTS Fork Powers Custom Voice: A member created a custom Pocket-TTS fork for multi-worker inference, using it to generate a recording of Morgan Freeman reading the entire King James Version of the Bible, available as a Google Drive file.
  • Smart-KNN Project Opens its Source: An open-source project called Smart-KNN was built to make KNN more production-friendly with a focus on feature-weighted distance computation and adaptive backend selection to improve latency predictability, with its repo available on Github.
  • Microclaw: A Lightweight Fallback Agent for OpenClaw: Microclaw (v2026.2.17), a lightweight, distilled language model designed as a fallback agent for the OpenClaw ecosystem, introduces advanced training and inference enhancements and is available on Hugging Face.
  • DirectShell turns accessibility layer into universal app interface: As of February 17, 2026, every screenshot-based AI agent, every enterprise API wrapper, and every RPA tool on Earth is legacy technology, following the introduction of DirectShell, source code available on Github.
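Feature-weighted distance, as in the Smart-KNN item above, just scales each feature’s contribution before the usual metric; a minimal sketch (the weighting scheme and data are illustrative, not the project’s actual implementation):

```python
import numpy as np

def weighted_knn(query, points, labels, weights, k=3):
    """Classify `query` by majority vote of the k nearest points,
    using a feature-weighted Euclidean distance.

    `weights` scales each feature's squared difference -- e.g. weights
    derived from feature importance, so noisy features matter less.
    """
    diffs = points - query
    dists = np.sqrt(((diffs ** 2) * weights).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    vals, counts = np.unique(labels[nearest], return_counts=True)
    return vals[np.argmax(counts)]

points = np.array([[0.0, 0.0], [0.1, 9.0], [5.0, 0.1], [5.1, 9.2]])
labels = np.array([0, 0, 1, 1])
weights = np.array([1.0, 0.0])  # ignore the noisy second feature entirely
print(weighted_knn(np.array([0.2, 5.0]), points, labels, weights, k=1))
```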

HuggingFace ▷ #agents-course (4 messages):

Agents Course endpoints, HF Spaces integration

  • Agents Course endpoint challenges: A member inquired about the functionality of the GET /files/{task_id} endpoint outside of HF Spaces, noting they receive 404 errors.
    • While the /questions endpoint functions locally, they’re unsure if the 404 error is expected or if they’re missing a configuration step, while using GAIA dataset fallback.
  • Course Completion Announcement: A member announced they completed Unit 1 of the Agents Course.
    • Another member indicated they were also new to this course.

Nous Research AI ▷ #general (61 messagesđŸ”„đŸ”„):

Contributing to projects, Starship Troopers, Gemini for shaders, Claude going off the rails, arXiv endorsement

  • Gemini Shines, Claude Flounders: A member loves using Gemini for making shaders, as it’s fairly solid, while another member reported that Claude code is giving complete nonsense responses.
    • Reverting to version v2.1.41 reportedly fixes the issue.
  • Stanford’s Cybench Gets Randomly Flagged: A member read a Stanford paper about Cybench and noted that the benchmark didn’t randomize the flags, pulling directly from well-known CTFs.
    • As soon as the flags were randomized, the success rate plummeted.
  • OpenClaw Praised Despite Limitations: Following a design post about OpenClaw, a member observed that people are easily impressed, despite it being a big, useless layer for actual professional AI application.
    • However, they conceded that OpenClaw does have nice “assistant” features combined and offers simplicity for normal users.
  • GLM 5’s Technical Report Released: The technical report for GLM 5 is out (2602.15763), with some members discussing its capabilities, including a YouTube video showcasing GLM 5.
  • YouTube Experiences Black Screen Scare: Several members reported that YouTube was down, displaying a giant black screen.
    • One member received a message saying This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service.

Nous Research AI ▷ #ask-about-llms (1 messages):

RAM Requirements, GPU for Context

  • Demand high RAM, at least 512GB: A user is looking for a machine with at least 512GB of RAM for AI tasks.
    • They noted the next common size up (768GB) can sometimes be an awkward jump.
  • 3090 for ‘Decent Context’: The user specified needing at least one 3090 (or several) to handle decent context in their AI work.
    • The user did not link to any specific information.

Eleuther ▷ #general (25 messagesđŸ”„):

Anthropic and military use of Claude, Lucidrains' GitHub account suspension, Hostility towards AI coding, Qwen3 model evaluation discrepancies

  • Anthropic’s conditions on Claude military use revealed: Anthropic agreed to military use of Claude, but with conditions: 1) no mass surveillance, and 2) no autonomous weapons.
    • Some expressed surprise that Anthropic took a firm stance, as one member stated, Didn’t expect anthropic to stand their ground.
  • Lucidrains’ GitHub account faces ban-hammer: Members reported that Lucidrains’ GitHub account was suspended for some reason, prompting concern within the community.
    • Speculation arose regarding the reasons, including a ban fsr (for some reason), with one member joking it was because he Made too many other people look bad by comparison.
  • AI coding faces hostilities: A member noted a huge amount of hostility towards AI coding and reported having three Reddit accounts suspended this week after mentioning Codex or ChatGPT.
    • This suggests a growing tension or concern within online communities regarding the increasing capabilities and presence of AI in coding and related fields.
  • Qwen3 model eval numbers disputed: Members reported discrepancies with their own evaluations using the eval harness.
    • The member asked, Are these evals somewhere with results/scripts in the repo?.

Eleuther ▷ #research (22 messagesđŸ”„):

Geometric Table Transformer (TV-Cache), Every Eval Ever project, CoDA-GQA-L: Bounded-Memory Attention, Mycelium AI Model Benchmarking

  • Geometric Table Transformer Decouples Semantics from Geometry: A member is experimenting with a way to decouple semantic compatibility from geometric rotation in the attention mechanism called Geometric Table Transformer (TV-Cache), replacing the high-dimensional dot product of RoPE with an O(1) scalar lookup + trig modulation as described in this post.
    • The key advantage is that attention speed is now independent of D, allowing for scaling internal dimensions without the O(D) compute penalty in the attention head.
  • Every Eval Ever Project Launched: The EvalEval Coalition launched the Every Eval Ever project which standardizes benchmark evaluations.
    • Another member noted that it reminds them of the push to standardize BIDS in cognitive neuroscience research via the Brain Imaging Data Structure.
  • CoDA-GQA-L Caps KV Cache Size with Bounded-Memory Attention: A member released CoDA-GQA-L, an attention mechanism that caps the KV cache at a fixed size regardless of sequence length, reducing KV cache size from 160GB to 136MB, documented in this paper with code available on GitHub.
    • It uses 384 slots per layer, comprised of a recent window (256 exact tokens), an exact landmark bank (64 novelty-filtered important tokens), and a summary bank (64 EMA prototypes compressing older context).
  • Mycelium Seeks Advice on AI Model Benchmarking Paper: The team at Mycelium is looking to publish a paper based on benchmarking AI models, adapting existing benchmarks like inspect_evals to run dynamically generated multi-turn conversations and evaluate AI agents.
    • They are seeking advice on what journals or conferences they should aim for, to most impactfully disseminate their work.
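The bounded-memory layout described in the CoDA-GQA-L item above (recent window plus landmark bank) can be sketched as a fixed-capacity cache; a toy version in which the summary-bank EMA compression is omitted and “novelty” is just a caller-supplied score:

```python
from collections import deque

class BoundedKVCache:
    """Toy sketch of a fixed-size KV cache: an exact recent window plus
    a capped landmark bank, so total slots never grow with sequence
    length. (The paper's summary bank and novelty filter are omitted;
    the novelty score here is a placeholder supplied by the caller.)"""

    def __init__(self, window=256, landmarks=64):
        self.recent = deque(maxlen=window)  # (novelty, token) pairs
        self.landmarks = []                 # highest-novelty evicted pairs
        self.max_landmarks = landmarks

    def append(self, token, novelty: float):
        if len(self.recent) == self.recent.maxlen:
            # The oldest token falls out of the window; keep it only if
            # it ranks among the most novel evicted tokens seen so far.
            self.landmarks.append(self.recent[0])
            self.landmarks.sort(reverse=True)
            del self.landmarks[self.max_landmarks:]
        self.recent.append((novelty, token))

    def size(self):
        return len(self.recent) + len(self.landmarks)

cache = BoundedKVCache(window=4, landmarks=2)
for i in range(100):
    cache.append(i, novelty=float(i % 7))
print(cache.size())  # bounded, no matter how many tokens were appended
```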

Eleuther ▷ #scaling-laws (1 messages):


  • Independence Assumption Clarified: A member confirmed that the independence assumption was a useful way to communicate an idea clearly.
  • Intuition Acknowledged: A member appreciated the feedback, acknowledging that their intuition aligned with the provided explanation, despite the assumption being technically incorrect.

Eleuther ▷ #interpretability-general (7 messages):

Preventative Steering, Data Augmentation, Representation Mapping, Model Behavior Control

  • Preventative Steering Gets Generalized!: The concept of preventative steering, initially described in Anthropic’s persona vectors paper, can be generalized by adding a steering vector while judging the model based on its ability to hit the original target, forcing the model to compensate against the steering vector.
    • By changing the target, models can be encouraged to do more than just fight against a steering vector, especially if features can be used as targets.
  • Data Augmentation Via Steering Vectors!: The idea of using steering vectors for data augmentation was explored, addressing the limitation of only obtaining two different data versions by changing the target.
    • The steering vector acts as a perturbation, and the targeted feature is what the model should aim for when the perturbation occurs in the data, allowing for exploration of compositions of perturbations rare in the empirical dataset.
  • Mapping Representations Instead of I/O: An argument was made that instead of training models to map inputs to outputs (expensive and bad), we could train them to map from representations to representations, effectively achieving data augmentation in representation space.
    • This would allow precomputing the first part of the forward pass and reusing it with multiple steering vectors aimed at different targets.
  • Controlling Model Behavior Via Internal Semantics!: The idea of “If [Semantics X internally] then do [Semantics Y internally]” offers a ton of control over model behavior and frees us from the dependence on data, according to this tweet.
    • Also mentioned was this paper in relation to internal semantics.
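Activation steering of the kind discussed above amounts to adding a direction to a hidden state mid-forward-pass; a minimal numpy sketch in which the “model” is a stand-in linear head and the vector and coefficient are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Stand-ins for one layer's hidden state and an output projection.
h = rng.standard_normal(d_model)
w_out = rng.standard_normal((d_model, 4))

# A steering vector: e.g. the mean activation difference between examples
# that do and don't exhibit some persona/behavior (placeholder here).
v = rng.standard_normal(d_model)
v /= np.linalg.norm(v)

alpha = 2.0                 # steering strength
h_steered = h + alpha * v   # the intervention: add the direction

# Preventative steering, per the discussion: train *with* the vector added
# while still scoring the model on its original target, forcing it to
# learn to compensate for the perturbation.
logits_base = h @ w_out
logits_steered = h_steered @ w_out
print(np.abs(logits_steered - logits_base).max() > 0)
```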

Moonshot AI (Kimi K-2) ▷ #general-chat (40 messagesđŸ”„):

Kimi K2 Turbo Removal, Kimi CLI bash/shell issues, Interactive quiz generator using Kimi, Kimi-Code API and Openclaw Compatibility, Kimi Code vs Kimi Claw

  • Kimi K2 Turbo Disappears, Users Request Refund: Users reported that Kimi K2 Turbo was removed from the Kimi-Coding model availability, leading to frustration as they had subscribed based on its availability.
    • One user expressed disappointment, stating, “I really find that very very sad, that they advocate something
 and then users like me use that to sign up for a year.. and then they remove it?!?”.
  • Interactive Quiz Generator Created with Kimi: A user created an interactive quiz generator using Kimi: paste video transcripts or other content into Kimi, copy-paste the output into an HTML page, and start answering questions; the generated quiz.html file was attached.
    • The generator includes features like “Select all that apply” questions and session saving/resuming, as shown in attached images.
  • Kimi CLI Faces Bash/Shell Issues: A user reported that K2.5 has issues using bash/shell in Kimi CLI, and this issue is unique to this model.
    • They noted that the problem doesn’t occur with other models.
  • Is Kimi-Code API now incompatible with Openclaw?: Some users found that Kimi Code stopped working with Openclaw.
    • Others suggested exploring alternative options on Openrouter to decide on a model and provider.

Modular (Mojo đŸ”„) ▷ #general (4 messages):

Modular Acquires BentoML, Jupyter Mojo Kernel Release, Notebook Enthusiasts Rejoice

  • Modular Acquires BentoML, AMA Announced: Modular announced the acquisition of BentoML and will host an Ask Me Anything (AMA) session with Chris and Chaoyu, with questions being collected on the Modular forum and streamed on YouTube.
  • Jupyter Mojo Kernel Drops!: A member released a Jupyter Mojo kernel for notebook enthusiasts, available on GitHub.
    • The kernel is currently “pretty barebones” without completions or image support but is fast and works well on MacOS and recent Linux versions, and uv will auto-install the matching modular package if you don’t have it already.

Modular (Mojo đŸ”„) ▷ #announcements (1 messages):

BentoML acquisition, Ask Us Anything session, Modular forum, Future vision, Sticker giveaway

  • Modular Acquires BentoML - AMA Today!: Modular is hosting an Ask Us Anything (AMA) session today at 9:30 AM PT in the Modular forum regarding their recent acquisition of BentoML.
    • The first ten people to share their questions in the forum thread will receive stickers.
  • Ask Me Anything in the Modular Forum: <@685556913961959475> and <@679535512565973018> will be answering questions and sharing their vision for the future during the AMA.
    • The event is taking place in the Modular Forum, with an attached image promoting the event.

Modular (Mojo đŸ”„) ▷ #mojo (24 messagesđŸ”„):

LayoutTensor vs InlineArray, Mojo Random Number Generators, Mojo C++ Binding Status, Async on GPUs

  • stack_allocation Loses Origin Safety: stack_allocation loses origin safety and exclusivity checking versus InlineArray, plus it won’t allow you to take advantage of noalias optimizations.
    • It’s a crutch due to compiler limitations, with better options coming soon.
  • Registers on GPU: An InlineArray or stack_allocation indexed only by constant values will be stored in registers on the GPU, assuming you didn’t spill.
    • The compiler may optimize trivial register passable types for register use.
  • Mojo needs Random Number Generators: A member is looking for implementations of Random Number Generators, specifically Poisson, for common probability distributions in Mojo.
    • Another member suggested that this need might be interesting for Mojo.
  • Async Await on GPU’s: A blogpost about implementing Async/Await on GPUs might be interesting for Mojo, since it is in flux.
    • This also provides a motivation for having a nice way to do cold-start futures, since this approach may not work with hot-start futures.
  • Mojo Binding Status to C++: Mojo mostly supports round-trip C++ binding via C with manual bindings.
    • It is not as easy as with pybind11, especially since you need to write the bindings by hand.

Modular (Mojo đŸ”„) ▷ #max (4 messages):

mxfp4 support in max, NVFP4 focus, customized Mojo kernels

  • MXFP4 support lagging in MAX: MXFP4 support in MAX is lagging because the team is focusing on NVFP4 first, but there’s nothing fundamentally blocking support as the datatypes are present in base Mojo.
    • Once support for NVFP4 is solid, MXFP4 might follow quickly, with testing and tuning happening at the model level.
  • MAX Models now with Customized Mojo Kernels: MAX graphs and models built using the OSS modular repo can now use a fully customized Mojo standard library or Mojo kernels due to enhancements in the build infrastructure and new capabilities in the graph compiler, announced in this forum post.

MCP Contributors (Official) ▷ #mcp-dev-summit (1 messages):

smw355: Hi, I think so, but will try to find out.


MCP Contributors (Official) ▷ #general (8 messagesđŸ”„):

SEP-2007, MCP Payment Support, Micropayments for Agents, X402 protocol

  • MCP Servers look to support Monetization Incentives: A member created a SEP to allow MCP servers to request money for tools hosted, starting with X402, to accelerate agent adoption with monetization.
    • Concerns were raised about whether payment support should be built into the protocol, versus handled via URL elicitation.
  • MCP payments unlikely: A member mentioned payments won’t be prioritized in the next few months unless a core maintainer is deeply in favor, preferring to wait for established payment patterns.
    • The author of the SEP asked if any core maintainers are looking into payments, tagging a member to take a look.
  • Micropayments for Agents discussed: A member explained the proposal focuses on micropayments for agents to autonomously pay for tools, requiring rich cost information for intelligent decision-making under guardrails.
    • The member doesn’t anticipate new payment protocols for MTXns anytime soon.
  • MCP Extension Planned?: A member expressed strong agreement with the hesitation toward incorporating payments directly into the protocol, but offered to work with the SEP author on an unofficial extension initially.
    • They offered to review the SEP in depth.

MCP Contributors (Official) ▷ #general-wg (5 messages):

MCP Resource Specification, SEP-2084 Rejection, Resource Hierarchy, resource/read Ambiguity

  • MCP Resource Specification Tidy-Up Proposed: A member shared a pull request aimed at tidying up the specification and usability of resources in MCP.
    • The PR attempts to clarify some of the murkiness around the specification and usability of resources in MCP.
  ‱ SEP-2084 Faced Rejection Over Grouping Proposal: A member pointed out that SEP-2084, related to grouping resources, had been rejected because core maintainers didn’t want to buy into any grouping proposal at this time.
    • The precursor to that SEP was SEP-1300, which only focused on grouping tools, but explicit feedback was that any grouping proposal would need to apply to all primitives.
  • Resource Hierarchy Already Exists Implicitly: A member argued that resources already implicitly have hierarchy in the form of a URI path or by returning sub-resources in resource/read.
    • The pull request is mostly bringing more formality and utility to these existing conventions, not attempting to solve problems around context length or grouping primitives by category for UX purposes.
  ‱ resource/read Functionality Faces Ambiguity: A member flagged that when you resource/read a URI, you don’t know whether you get back a single resource or a collection of child resources.
    ‱ This ambiguity is confusing and needs clarification and refinement.
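One way to picture both the implicit URI hierarchy and the resource/read ambiguity is a result shape that tags what came back. The types below are purely hypothetical illustrations, not the actual MCP wire format:

```python
from dataclasses import dataclass, field

# Hypothetical shapes for illustration only -- not the MCP specification.
@dataclass
class Resource:
    uri: str
    text: str = ""

@dataclass
class ReadResult:
    # Tagging the result kind removes the ambiguity: the reader no
    # longer guesses whether `contents` is the resource itself or a
    # listing of its children.
    kind: str                        # "single" | "collection"
    contents: list = field(default_factory=list)

def parent_uri(uri: str) -> str:
    """Derive the implicit parent from a URI path, per the convention
    that resources already carry hierarchy in their paths."""
    base, _, _ = uri.rstrip("/").rpartition("/")
    return base
```

Under a convention like this, a client could branch on `kind` instead of inspecting the payload heuristically.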

tinygrad (George Hotz) ▷ #general (12 messagesđŸ”„):

tinybox issues, tinygrad PR, meeting 7 agenda, Axelera AI accelerator, BarraCUDA emulator

  • TinyBox Setup Fumbles Prompt Community Assistance: A user reported issues with their TinyBox only recognizing 2 of the 4 GPUs, requesting community assistance.
    • George Hotz suggested checking the wires, and the user confirmed reseating the cards fixed the issue.
  ‱ Massive TinyGrad PR Sparks Debate on Code Quality: A user submitted a PR to tinygrad that ran to 150 lines rather than the hoped-for 50.
    ‱ George Hotz expressed concern about “AI slop” in PRs, urging contributors to review each line carefully and ask whether it’s needed.
  • TinyGrad Meeting 7 Set to Deep Dive into Key Areas: George Hotz announced the agenda for meeting #7, which includes company updates, llama training loop & flash attention, drivers, viz/fast gemm, CALL/PARAMS and assembly.
    • The agenda also covers compiler renderer and image, lazy assign setitem and scheduler as well as other issues and bounties.
  • Exploring Integration of Axelera AI Accelerator in TinyGrad: The community discussed the possibility of supporting small accelerators like the Axelera AI Metis-M.2 card for TinyGrad.
    • There was also a suggestion to include custom RTL on small PCIe FPGA boards like Acorn CLE-215.
  ‱ BarraCUDA Emulator Gains Attention for Bug Concerns: George Hotz suggested contributing to TinyGrad instead of writing in C after finding BarraCUDA, an emulator project.
    ‱ He pointed out that it’s crazy that they wrote it in C with no CI: “I’m very skeptical of lack of bugs. They should at least use our emulator.”

Yannick Kilcher ▷ #general (10 messagesđŸ”„):

Incorrect Citations in Research Papers, OpenClaw Multiagent System, Harness Engineering by OpenAI

  • LLMs May Not Cause Incorrect Citations: Members discussed if incorrect citations in recent papers are linked to LLMs or if they’ve always existed, suggesting a review of past conferences to check citation accuracy.
    ‱ One member noted that if a paper submitted to NeurIPS 2025 has an LLM citing 2024 publications, it counts as a hallucination.
  • OpenClaw: Multiagent System Excites with Possibilities: OpenClaw is an interesting multiagent system with general possibilities, accepting any type of input via its gateway and integrating time as an input.
    ‱ One member described how OpenClaw uses event-driven development: time produces events through heartbeats and cron jobs; humans create events through messages; external systems create events through webhooks.
  • Cybersecurity Nightmares Lurk in LLM-Powered Agents: With great power comes great responsibility and also cybersecurity nightmares such as malware chains in skills, and prompt injections when it comes to LLM powered agents.
    ‱ A member shared a YouTube link discussing OpenClaw, and another member remarked that lots of people could have made this, but probably shied away from doing so because it’s so dangerous.
  • Harness Engineering: A Marketing Pitch?: A member shared OpenAI’s Harness Engineering blogpost and expressed disappointment at its marketing-heavy approach.
    • They felt the post lacked substance, noting: Could have been interesting if they didn’t decide to make the entire thing one long marketing pitch, and added Doesn’t look like you can really infer much about the efficacy of the approach.
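The event-driven model described for OpenClaw above (timers, human messages, and webhooks all reducing to events at a gateway) can be sketched in a few lines. The class and method names here are illustrative only, not OpenClaw’s actual API:

```python
import queue
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch of the pattern: heartbeats/cron, human messages,
# and webhooks all become events on a single queue at the gateway.
@dataclass
class Event:
    source: str      # e.g. "timer" | "human" | "webhook"
    payload: dict

class Gateway:
    def __init__(self) -> None:
        self._events: "queue.Queue[Event]" = queue.Queue()
        self._handlers: dict = {}

    def emit(self, source: str, payload: dict) -> None:
        """Any producer (cron job, user message, webhook) calls this."""
        self._events.put(Event(source, payload))

    def on(self, source: str, handler: Callable) -> None:
        self._handlers[source] = handler

    def drain(self) -> int:
        """Dispatch all queued events; returns how many were handled."""
        handled = 0
        while not self._events.empty():
            ev = self._events.get()
            if ev.source in self._handlers:
                self._handlers[ev.source](ev)
                handled += 1
        return handled
```

The design choice this illustrates is that agents never poll three sources; they consume one uniform event stream, which is what makes adding a new input type (like time) cheap.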

Yannick Kilcher ▷ #ml-news (2 messages):

Claude Sonnet 4.6, Anthropic's models

  • Anthropic Releases Claude Sonnet 4.6 Model: Anthropic has announced the release of Claude Sonnet 4.6, showcased in a tweet and detailed in their official news post.
  • Details of Claude Sonnet 4.6: The new model Claude Sonnet 4.6 has double the context window and is faster and cheaper than previous Claude models.

Manus.im Discord ▷ #general (10 messagesđŸ”„):

Suspended Accounts, Support Channels, Credit Costs, Developer Introductions, Manus Account Issues

  • Accounts still suspended?: A user is seeking assistance with a suspended Manus account and is finding it difficult to get help in the general channel.
    • Another user suggested trying the support channel for better assistance, as the general channel may not be the best place to address this issue.
  • Baghdad teen is now verified: A 13-year-old developer from Baghdad, Iraq, announced they are now verified and encouraged others to build something crazy with Manus.
    ‱ They asked ‘Who else is coding here?’ after being verified, expressing excitement and encouragement.
  • Developers introduce themselves: Several developers introduced themselves and their skills, with one highlighting experience in Blockchain and AI Agents.
    • Another developer presented as a full-stack developer with experience in web applications, API integrations, and data pipelines, expressing a passion for building real-world products and collaborating on great projects.
  • Presentation errors frustrate user: A user is experiencing issues with a Manus account and is frustrated because a presentation built over several weeks is riddled with errors.
    • The user sees the presentation in their history but cannot reinstate it, expressing significant stress due to being ‘right at the finish line.’
  • Subscription auto-renews unless cancelled: A user was advised to cancel their subscription to avoid being charged.
    • Another user said ‘Could you please send me your registered email via DM?’.

DSPy ▷ #show-and-tell (1 messages):

archelunch: Now it’s published to github https://github.com/Archelunch/dspy-repl


DSPy ▷ #general (3 messages):

Semantic Search in Discord, GEPA with offline data

  ‱ Discord Needs Semantic Search: Members discussed the lack of semantic search capabilities within Discord.
    ‱ A user pointed out that Discord really needs semantic search.
  ‱ GEPA Docs for Offline Data Sought: A member requested documentation on using GEPA with offline data.
    ‱ The user linked to a Discord channel, discord.gg/2Jqsg96mZz, presumably for further discussion or related information.

aider (Paul Gauthier) ▷ #general (1 messages):

Project Ideas, Active Projects, Technical Support, Developer Support

  ‱ Community Seeks Project Collaboration: A member sought project ideas and asked about active projects within the community.
    ‱ The member also offered technical and developer support to other community members needing help with their projects.

aider (Paul Gauthier) ▷ #questions-and-tips (3 messages):

/commit staging, Aider PR #2763

  ‱ Request to limit /commit to staged changes: A user asked if there’s a way to make the /commit command only consider staged changes, since the current behavior requires stashing unwanted changes first.
    ‱ They found PR #2763 on GitHub, which would address this, but it has been open for over a year.

Windsurf ▷ #announcements (2 messages):

GLM-5, Minimax M2.5, Sonnet 4.6

  • GLM-5 and Minimax M2.5 hit Windsurf!: GLM-5 and Minimax M2.5 are now available on Windsurf, according to this tweet.
  • Sonnet 4.6 Lands with Promo Pricing!: Sonnet 4.6 is live in Windsurf with promo pricing starting at 2x credits, as announced in this post.

MLOps @Chipro ▷ #general-ml (1 messages):

AI Failures, AI Foundations, Organizational Confidence, Metadata Weekly

  • AI Initiatives Fail Due to Flawed Foundations: Most AI initiatives failed not because of model readiness but because our foundations weren’t built for what we asked them to do, according to Gaurav Ramesh at OpenTable.
    • Leaders hoped powerful models would bypass messy data and brittle systems, but that bet didn’t pay off.
  • Stalled AI Pilots are Learning Costs: Stalled AI pilots should be viewed as the cost of learning, exposing constraints and clarifying necessary changes.
    • The missing foundation isn’t data quality or governance but self-awareness.
  • Organizational Confidence Fuels AI Success: Budget isn’t the key to AI success; organizational confidence is what truly matters.
    • AI tests how well we understand how we work, not just infrastructure.
  • Metadata Weekly Features AI Insights: Gaurav’s full piece on the human elements of AI foundations is available in Metadata Weekly.
    • The article discusses the importance of understanding our own work processes for successful AI implementation.